Red Hat Developers Announce Work On New "Composefs" File-System
Red Hat is working on Composefs as a way to construct and use read-only images that are verifiable. Its immediate use cases are sharing Podman container layers and providing runtime verification for OSTree. Other projects such as LXC and Snap may also be interested in this verified, opportunistic sharing as an alternative to their existing loopback mounts for image handling.
Alexander Larsson of Red Hat announced Composefs on the kernel mailing list. Some of the main highlights:
Giuseppe Scrivano and I have recently been working on a new project we call composefs. This is the first time we propose this publicly and we would like some feedback on it.
At its core, composefs is a way to construct and use read-only images that are used similarly to how you would use e.g. loop-back mounted squashfs images. On top of this composefs has two new fundamental features. First it allows sharing of file data (both on disk and in page cache) between images, and secondly it has dm-verity-like validation on read.
So, given a trusted set of mount options (say unlocked from TPM), we have a fully verified filesystem tree mounted, with opportunistic fine-grained sharing of identical files.
So, why do we want this? There are two initial use cases. First of all we want to use the opportunistic sharing for podman container layers. The idea is to use a composefs mount as the lower directory in an overlay mount, with the upper directory being the container work dir. This will allow automatic file-level disk and page-cache sharing between any two images, independent of details like the permissions and timestamps of the files and the origin of the images.
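To illustrate the file-level sharing described above, here is a minimal Python sketch (not part of the announcement). It assumes a toy in-memory object store keyed by SHA-256 and a per-image manifest that keeps metadata separate from content. The names `store_object`, `build_image`, and `read_file` are hypothetical and do not reflect the composefs image format or kernel driver.

```python
import hashlib


def store_object(store: dict, data: bytes) -> str:
    """Store file content under its SHA-256 digest; identical content is kept once."""
    digest = hashlib.sha256(data).hexdigest()
    store.setdefault(digest, data)
    return digest


def build_image(store: dict, files: dict) -> dict:
    """Build a manifest mapping each path to its metadata and a content digest.

    `files` maps path -> (mode, mtime, data). Only the data goes into the
    shared store; mode and mtime stay in the per-image manifest.
    """
    manifest = {}
    for path, (mode, mtime, data) in files.items():
        manifest[path] = {
            "mode": mode,
            "mtime": mtime,
            "digest": store_object(store, data),
        }
    return manifest


def read_file(store: dict, manifest: dict, path: str) -> bytes:
    """Resolve a path through the manifest into the shared object store."""
    return store[manifest[path]["digest"]]


# Two unrelated images that happen to ship the same library bytes,
# with different paths, permissions, and timestamps.
store = {}
image_a = build_image(store, {
    "/usr/lib/libfoo.so": (0o755, 1000, b"shared library bytes"),
    "/etc/a.conf": (0o644, 1000, b"config for a"),
})
image_b = build_image(store, {
    "/opt/lib/libfoo.so": (0o700, 2000, b"shared library bytes"),
    "/etc/b.conf": (0o644, 2000, b"config for b"),
})
```

Four files are described across the two images, but the store holds only three objects, because the library content is identical even though its path, mode, and timestamp differ.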
Secondly we are interested in using the verification aspects of composefs in the ostree project. Ostree already uses a content-addressed object store, but it is currently referenced by hardlink farms. The object store and the trees that reference it are signed and verified at download time, but there is no runtime verification. If we replace the hardlink farm with a composefs image that points into the existing object store we can use the verification to implement runtime verification.
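The difference between download-time and runtime verification can be sketched in Python as follows (an illustration, not part of the announcement). It assumes a plain SHA-256 over the whole object as a stand-in for fs-verity, which in reality uses a Merkle tree so that data can be checked block by block. `VerificationError` and `verified_read` are hypothetical names.

```python
import hashlib


class VerificationError(Exception):
    """Raised when stored bytes no longer match their expected digest."""


def verified_read(store: dict, expected_digest: str) -> bytes:
    """Return an object's content only if it still hashes to the expected digest.

    The check runs on every read, so corruption or tampering that happens
    after the object was stored is still caught.
    """
    data = store[expected_digest]
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_digest:
        raise VerificationError(
            f"expected {expected_digest[:12]}, got {actual[:12]}"
        )
    return data


# Store an object under its digest, as a content-addressed store would.
store = {}
original = b"#!/bin/sh\necho hello\n"
digest = hashlib.sha256(original).hexdigest()
store[digest] = original
clean_read = verified_read(store, digest)

# Simulate on-disk tampering after the download-time check has passed.
store[digest] = b"#!/bin/sh\necho tampered\n"
try:
    verified_read(store, digest)
    tamper_detected = False
except VerificationError:
    tamper_detected = True
```

A check done only at download time would have accepted the object and never looked again; checking on read is what turns the stored digest into a runtime guarantee.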
In fact, the tooling to create composefs images is fully reproducible, so all we need is to add the fs-verity digest of the composefs image into the ostree commit metadata. Then the image can be reconstructed from the ostree commit, generating a composefs image with the same fs-verity digest.
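The reproducibility argument can be sketched in Python like this (an illustration, not part of the announcement). It assumes canonical JSON as a stand-in for the real image format and plain SHA-256 as a stand-in for the fs-verity digest. `build_image_bytes`, `image_digest`, and `verify_rebuilt_image` are hypothetical names, not composefs or ostree tooling.

```python
import hashlib
import json


def build_image_bytes(tree: dict) -> bytes:
    """Serialize a file tree deterministically: sorted keys, fixed separators."""
    return json.dumps(tree, sort_keys=True, separators=(",", ":")).encode("utf-8")


def image_digest(image: bytes) -> str:
    """Digest of the serialized image (SHA-256 here, standing in for fs-verity)."""
    return hashlib.sha256(image).hexdigest()


def verify_rebuilt_image(tree: dict, expected_digest: str) -> bool:
    """Rebuild the image from the tree and compare against the recorded digest."""
    return image_digest(build_image_bytes(tree)) == expected_digest


# The "commit" side builds the image once and records its digest as metadata.
tree_a = {
    "/etc/hostname": {"mode": 0o644, "digest": "aa11"},
    "/usr/bin/tool": {"mode": 0o755, "digest": "bb22"},
}
expected = image_digest(build_image_bytes(tree_a))

# The "client" side reconstructs the same tree, here in a different insertion order.
tree_b = {
    "/usr/bin/tool": {"digest": "bb22", "mode": 0o755},
    "/etc/hostname": {"digest": "aa11", "mode": 0o644},
}
rebuilt_ok = verify_rebuilt_image(tree_b, expected)

# A tree that points at different content produces a different image digest.
tampered = dict(tree_b)
tampered["/usr/bin/tool"] = {"mode": 0o755, "digest": "ffff"}
tampered_ok = verify_rebuilt_image(tampered, expected)
```

Because the serialization does not depend on build order or environment, only the digest needs to travel with the commit; the image itself can be regenerated locally and checked against it.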
These are the use cases we're currently interested in, but there seems to be a wealth of other possible uses. For example, many systems use loopback mounts for images (like lxc or snap), and these could take advantage of the opportunistic sharing. We've also talked about using fuse to implement a local cache for the backing files. I.e. you would have a second basedir be a fuse filesystem, and on lookup failure in the first basedir the fuse one triggers a download which is also saved in the first dir for later lookups. There are many interesting possibilities here.
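The fallback idea in the last example can be sketched in Python as follows (an illustration, not part of the announcement). It assumes the FUSE-backed second basedir can be modeled as a plain callable that downloads an object by digest. `lookup`, `fake_fetch`, and the `REMOTE` dictionary are hypothetical and only simulate a remote source.

```python
import os
import tempfile


def lookup(digest: str, local_dir: str, fetch) -> tuple:
    """Return (data, source) for an object, filling the local dir on a miss.

    `fetch` stands in for the second, FUSE-backed basedir: it is only
    consulted when the object is missing from the first basedir.
    """
    path = os.path.join(local_dir, digest)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read(), "local"
    data = fetch(digest)
    # Write to a temporary name and rename, so readers never see a partial file.
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.replace(tmp_path, path)
    return data, "fetched"


# Simulated remote object source and a counter for how often it is hit.
REMOTE = {"abc123": b"backing file content"}
fetch_calls = []


def fake_fetch(digest: str) -> bytes:
    fetch_calls.append(digest)
    return REMOTE[digest]


_tmp = tempfile.TemporaryDirectory()
local_dir = _tmp.name

first = lookup("abc123", local_dir, fake_fetch)   # miss: triggers the download
second = lookup("abc123", local_dir, fake_fetch)  # hit: served from the first basedir
```

The second lookup never reaches the fetch path, which is the point of saving the downloaded object into the first directory.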
On the kernel side, there are currently six RFC patches implementing the Composefs kernel driver. The user-space tools are being developed in the containers/composefs repository on GitHub. Initial patches for the proposed OSTree integration are also available for review.