Wasmer 3.0 Released As The Latest "Universal WebAssembly Runtime"

  • NobodyXu
    replied
    Originally posted by linuxgeex View Post

    I agree it's doable... I've just been concerned with performance and features at install/update/remove time.

    Your "IMHO" objective... that's the very point I've been trying to make. How would you realise it with BTRFS or zfs and the tools they provide, in real time, without a Union FS as the arbiter of segregation?
    I just checked the snapd documentation and it seems the image is built in a way similar to docker: you have base images providing different distros with different pre-installed libraries, and the actual app image is built on top of one of them.

    Then merging them is going to be the same as how docker does it, and it only needs to be done once, when importing the image.
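
    Roughly, importing could then look like this on btrfs (all paths and image names here are made up, just to illustrate the one-time merge):

    Code:
    # base/distro layer lives in its own read-only subvolume, unpacked once
    btrfs subvolume create /var/lib/pkgs/base-22.04
    tar -xf base-22.04.tar -C /var/lib/pkgs/base-22.04
    btrfs property set /var/lib/pkgs/base-22.04 ro true

    # importing an app: snapshot the base (cheap, shares all blocks)...
    btrfs subvolume snapshot /var/lib/pkgs/base-22.04 /var/lib/pkgs/myapp-rootfs
    # ...then unpack the app layer on top of it - this is the one-time "merge"
    tar -xf myapp-layer.tar -C /var/lib/pkgs/myapp-rootfs
    btrfs property set /var/lib/pkgs/myapp-rootfs ro true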

    Originally posted by linuxgeex View Post
    ie given 2 source trees /core and /app, and a RW overlay for the app /state, how would you combine them with BTRFS to achieve a single system image "/" to run the app in a chroot/container, whereby all the state changes end up in the /state folder so that for example the /state can be rsynced to another host for live migration without ever touching a single file from /core and /app? Would that need to rely on btrfs-send? Zfs has a way to sync subvolume deltas but it's a binary representation of FS blocks and very dependent upon the parent(s) at both ends of the link to be in sync, so it has limited usability... similar to taking an LVM snapshot and syncing that between hosts.

    PS I don't mean to move the goalposts - I'm just following your Docker context because you seem more versed with that. The ability to move the state easily between hosts also exists in Snapd, which I mentioned before in the context of a developer using that capability for debugging / support.
    I assume /core and /app refer to parts of the app image, where /core is the base and /app is the layer created by the application.
    Those, I think, can be merged in the same way docker does it.

    For the app data in /state, you can just use a bind mount, which is also how docker does it.
    In docker, you typically store this data in a volume, which is likewise bind-mounted into the container.
    The volume can then live on a filesystem other than btrfs/zfs, since those might not give the best performance for write-heavy workloads.
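
    As a rough sketch of the whole flow (paths are invented, and a real runtime would use mount namespaces rather than a bare chroot):

    Code:
    # build a per-app root from read-only /core and /app, with all mutable state kept in /state
    btrfs subvolume snapshot /images/core /run/app/rootfs          # writable snapshot of the base
    cp -a --reflink=always /images/app/. /run/app/rootfs/          # add the app layer, sharing blocks
    mkdir -p /run/app/rootfs/state
    mount --bind /state/myapp /run/app/rootfs/state                # all writes land under /state
    chroot /run/app/rootfs /usr/bin/myapp
    # /state/myapp can later be rsync'ed to another host without touching /core or /app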

    Originally posted by linuxgeex View Post
    And BTW thanks this is turning out to be one of the most interesting off-topic convos I've had on Phoronix lol.
    You are welcome.



  • linuxgeex
    replied
    Originally posted by NobodyXu View Post

    I'm not so familiar with snap/flatpak, but for docker/overlayfs, it's certainly doable since in that model, images are layered in a tree model: Every layer has a parent (except for the root) and only contains modification to the parent.
    ...
    IMHO container should be disposable and any data needs to persist should be put into a volume that mounted into the container.
    ...
    While it defaults to overlayfs (union fs), it supports btrfs as an alternative.
    That's why I say this is definitely doable.
    I agree it's doable... I've just been concerned with performance and features at install/update/remove time.

    Your "IMHO" objective... that's the very point I've been trying to make. How would you realise it with BTRFS or zfs and the tools they provide, in real time, without a Union FS as the arbiter of segregation?

    ie given 2 source trees /core and /app, and a RW overlay for the app /state, how would you combine them with BTRFS to achieve a single system image "/" to run the app in a chroot/container, whereby all the state changes end up in the /state folder so that for example the /state can be rsynced to another host for live migration without ever touching a single file from /core and /app? Would that need to rely on btrfs-send? Zfs has a way to sync subvolume deltas but it's a binary representation of FS blocks and very dependent upon the parent(s) at both ends of the link to be in sync, so it has limited usability... similar to taking an LVM snapshot and syncing that between hosts.

    PS I don't mean to move the goalposts - I'm just following your Docker context because you seem more versed with that. The ability to move the state easily between hosts also exists in Snapd, which I mentioned before in the context of a developer using that capability for debugging / support.

    And BTW thanks this is turning out to be one of the most interesting off-topic convos I've had on Phoronix lol.
    Last edited by linuxgeex; 29 December 2022, 03:12 AM.



  • NobodyXu
    replied
    Originally posted by linuxgeex View Post
    You missed the bit about merging multiple RO fs images. Unless again you mean to extract all of them as your way of merging them. So then you'd be keeping protected copies of the extracted FS image contents, using snapshots or subvolumes or just putting them in their own restricted-access folders. And then to avoid using OverlayFS/UnionFS/Aufs, in order to present a single unified RW filesystem to the app in its container, you would copy them all into a single combined writeable tree, using reflinks on xfs or F2FS, or hardlinks plus ACL and LD_PRELOAD trickery on ext4, and native dedupe on FS that support it.
    I'm not so familiar with snap/flatpak, but for docker/overlayfs it's certainly doable, since images there are layered in a tree: every layer has a parent (except the root) and only contains modifications to its parent.

    Originally posted by linuxgeex View Post
    I wonder what the core usage would be like on compressed deduped BTRFS when 6 apps that use the 1GB GNOME overlay get that updated... will it decompress it, recompress it, find the duplicate compressed blocks, avoid the writes, so the only penalty vs status quo is decompressing 1TB and compressing 7TB for the copy operations at install time?
    Well, Btrfs only supports offline deduplication, so dedupe has to be run periodically by an external tool, e.g. when the system is idle.
    Zfs supports online deduplication, but that consumes a lot of memory.

    Compression can indeed help reduce I/O, as it is done before writing to disk.
    I remember reading somewhere on Phoronix that the maximum size for one compressed block is 128K, so depending on the compression algorithm it can hold quite some data.

    Using force compression with zstd (zstd internally checks whether data is worth compressing, and its check is better than btrfs's own heuristic) can work quite effectively. And since zstd's decompression speed is mostly independent of the compression level, you can set the level as high as btrfs allows (it caps zstd at level 15) for a filesystem that reads far more often than it writes.
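
    For example (mount point and hashfile path are made up):

    Code:
    # force zstd compression on a read-mostly volume; the kernel caps btrfs zstd at level 15
    mount -o compress-force=zstd:15 /dev/sdb1 /srv/images
    # offline dedupe pass, run periodically (cron/systemd timer) while the system is idle
    duperemove -dr --hashfile=/var/cache/dedupe.hash /srv/images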

    Originally posted by linuxgeex View Post
    Anyhoo... that's doable. The tradeoff at runtime is between the extra disk and RAM usage for the metadata of the per-app combined trees, vs the performance/complexity cost for a Union filesystem.
    In fact, a union filesystem like overlayfs or unionfs can end up occupying more space than btrfs/zfs.

    When a file from a lower layer is modified, the whole file is first copied up to the writeable upper layer and then modified, whereas in btrfs/zfs only the modified blocks are copied.
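
    A quick way to see the difference (paths and sizes invented):

    Code:
    # overlayfs: touching 1 byte of a large lower-layer file copies the whole file up
    mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged
    printf X | dd of=/merged/big.img bs=1 count=1 conv=notrunc
    du -sh /upper                  # roughly the full size of big.img

    # btrfs: a reflinked copy shares every block, and the same write only CoWs one block
    cp --reflink=always /vol/big.img /vol/big-copy.img
    printf X | dd of=/vol/big-copy.img bs=1 count=1 conv=notrunc
    btrfs filesystem du /vol       # only a few KiB of exclusive data on the copy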

    Originally posted by linuxgeex View Post
    It's a bit of a mess when you go to update the app or one of the dependencies though, as you'll need to extract only the RW tree changes, set them aside, tear down the combined tree, build it back up, then re-apply the previously set aside changes. Or you could keep a database of where each file came from, and manage them individually within the writeable tree.
    Using btrfs send, that's doable without having to keep a database, but it does affect performance since the data is basically rewritten.

    IMHO containers should be disposable, and any data that needs to persist should be put into a volume that is mounted into the container.
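
    With docker that pattern looks roughly like this (names are made up):

    Code:
    docker volume create appdata
    docker run -d --name myapp -v appdata:/var/lib/myapp myimage:latest
    docker rm -f myapp                                                     # the container is disposable...
    docker run -d --name myapp -v appdata:/var/lib/myapp myimage:latest   # ...the data in the volume survives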

    Originally posted by linuxgeex View Post
    There's various optimisations that would be possible with each FS, ie with ext4 hardlinks you could update the extracted inodes, and all copies would be updated in place like magic... which is normally a problem but in this case it would be awesome lol. Reflinks would end up becoming COW copies if you tried that, but once you replaced all the child reflinks the old COW would get orphaned and garbage collected, and of course btrfs/zfs with native dedupe.

    So... just write a tool to manage the app runtime/chroot/container trees in place. Make sure to take advantage of the benefits of each supported filesystem, and I'm sure it would get adopted. There would be meaningful runtime performance benefits for file metadata-intensive applications so long as RAM wasn't in short supply.

    Or just keep using a Union filesystem to do that heavy lifting. Literally job done.
    That's what I am trying to say: docker already supports btrfs.
    While it defaults to overlayfs (union fs), it supports btrfs as an alternative.
    That's why I say this is definitely doable.
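
    Switching the driver is just a daemon.json setting (assuming /var/lib/docker already sits on a btrfs filesystem):

    Code:
    # /etc/docker/daemon.json
    {
      "storage-driver": "btrfs"
    }

    # then restart and check
    systemctl restart docker
    docker info | grep 'Storage Driver'     # should report: btrfs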



  • linuxgeex
    replied
    Originally posted by NobodyXu View Post

    Turns out that docker is capable of pulling this off

    They simply take the read-only image, create a subvolume of it that is writable, then mount that into the container.
    ...
    So if we want to merge this in a btrfs/zfs subvolume, I guess we only need to handle the whiteouts and opaque directories specially and anything else can be simply copied.
    You missed the bit about merging multiple RO fs images. Unless again you mean to extract all of them as your way of merging them. So then you'd be keeping protected copies of the extracted FS image contents, using snapshots or subvolumes or just putting them in their own restricted-access folders. And then to avoid using OverlayFS/UnionFS/Aufs, in order to present a single unified RW filesystem to the app in its container, you would copy them all into a single combined writeable tree, using reflinks on xfs or F2FS, or hardlinks plus ACL and LD_PRELOAD trickery on ext4, and native dedupe on FS that support it.

    I wonder what the core usage would be like on compressed deduped BTRFS when 6 apps that use the 1GB GNOME overlay get that updated... will it decompress it, recompress it, find the duplicate compressed blocks, avoid the writes, so the only penalty vs status quo is decompressing 1TB and compressing 7TB for the copy operations at install time?

    Anyhoo... that's doable. The tradeoff at runtime is between the extra disk and RAM usage for the metadata of the per-app combined trees, vs the performance/complexity cost for a Union filesystem.

    It's a bit of a mess when you go to update the app or one of the dependencies though, as you'll need to extract only the RW tree changes, set them aside, tear down the combined tree, build it back up, then re-apply the previously set aside changes. Or you could keep a database of where each file came from, and manage them individually within the writeable tree.

    There's various optimisations that would be possible with each FS, ie with ext4 hardlinks you could update the extracted inodes, and all copies would be updated in place like magic... which is normally a problem but in this case it would be awesome lol. Reflinks would end up becoming COW copies if you tried that, but once you replaced all the child reflinks the old COW would get orphaned and garbage collected, and of course btrfs/zfs with native dedupe.

    So... just write a tool to manage the app runtime/chroot/container trees in place. Make sure to take advantage of the benefits of each supported filesystem, and I'm sure it would get adopted. There would be meaningful runtime performance benefits for file metadata-intensive applications so long as RAM wasn't in short supply.

    Or just keep using a Union filesystem to do that heavy lifting. Literally job done.
    Last edited by linuxgeex; 28 December 2022, 09:46 AM.



  • NobodyXu
    replied
    Originally posted by linuxgeex View Post
    Yes. It's achieved with UnionFS. They mount the app, then the dependency overlays, and finally a RW folder on top of it.

    You're suggesting to extract the compressed filesystem images into native BTRFS/ZFS folders, and then take per-app-context writeable snapshots of those folders so they don't affect each other ... two problems...

    First, those RW snapshots still need to be merged into a single standard POSIX filesystem heirarchy. Neither BTRFS nor ZFS provide a way to do that, at least not that I'm aware of. So you'd still end up using UnionFS to merge them.
    Turns out that docker is capable of pulling this off.

    They simply take the read-only image, create a writable snapshot of it, then mount that into the container.

    Regarding unionfs, from what I know it is similar to overlayfs: you use one or more read-only base images (lower layers) and one writeable upper layer to form a single merged filesystem that is mounted into the container.

    If you delete a file from the lower layers, a special file called a whiteout is written to the upper layer (opaque directories are used for removed or replaced directories).
    When a file from a lower layer is changed, it is first copied from the lower layer to the upper layer and then modified.
    Anything else acts just like a regular fs operation.

    So if we want to merge this into a btrfs/zfs subvolume, I guess we only need to handle the whiteouts and opaque directories specially; anything else can simply be copied.

    This will be more complicated if CONFIG_OVERLAY_FS_REDIRECT_DIR or CONFIG_OVERLAY_FS_METACOPY is enabled, since then only part of the metadata is copied up, but that can still be dealt with.
    I don't know how widely used they are though; I suppose docker might not use them, for portability.
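
    A rough sketch of that flattening, ignoring the redirect/metacopy cases (paths are made up):

    Code:
    # flatten one overlayfs layer ("diff/") onto a btrfs snapshot of its parent
    btrfs subvolume snapshot /layers/parent /layers/merged
    cd /layers/diff

    # whiteouts are 0:0 character devices: remove the corresponding path instead of copying them
    find . -type c | while read -r w; do rm -rf "/layers/merged/$w"; done

    # opaque dirs carry the trusted.overlay.opaque=y xattr: wipe the old contents first
    find . -type d | while read -r d; do
      [ "$(getfattr --only-values -n trusted.overlay.opaque "$d" 2>/dev/null)" = "y" ] \
        && rm -rf "/layers/merged/$d" && mkdir -p "/layers/merged/$d"
    done

    # everything else is an ordinary copy; -rlptgo is -a without -D, so the whiteout devices are skipped
    rsync -rlptgo ./ /layers/merged/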

    Originally posted by linuxgeex View Post
    Second, when you want to make an archive of only the RW state changes (done numerously ie on app/overlay update/removal so there's versioned restore points) that state would be mixed into those multiple RW snapshots. Ultimately you'd use UnionFS to provide a single clean RW overlay as well, just like is done on every other parent FS, and sadly lose the seemingly helpful RW snapshots.

    While I recognise that snapshotting the RW snapshots recursively appears to provide similar capability, there's drawbacks. ie one of the benefits of Snapd for developers is that if a user has a problem with an app, the user can save state and send it to the developer. How do you extract only the changed files from the app's merged FS image? Not impossible granted, but neither is it quick and easy. Snapshots also aren't free, and it's easy to ignore that. There are performance and storage costs, and in this context they don't really compare that favourably to a tarball.


    That's indeed harder; you need to use `sudo btrfs send -p /path/to/base_image /path/to/container/image`, which requires root.
    I wouldn't say it's hard if snapd had built-in support for this, like docker does.

    With btrfs send stream v2 ( https://www.phoronix.com/news/Btrfs-...-v2-Linux-5.20 ), it can include compressed data in the send stream to reduce the stream size and speed up processing on both send and receive.

    For a container that only modifies part of a file instead of the whole file, this can be very beneficial, since the btrfs send stream would include only the parts that changed rather than the whole file, unlike overlayfs/unionfs. Though, since compression is applied per block instead of to the whole file, the stream might not be as small as a tar stream.
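
    Something like this, roughly (paths invented; the base image subvolume must already exist on the receiving host):

    Code:
    btrfs subvolume snapshot -r /containers/myapp /containers/myapp@migrate   # send needs a read-only snapshot
    btrfs send --compressed-data -p /images/base /containers/myapp@migrate \
      | ssh otherhost btrfs receive /containers/
    # --compressed-data needs send stream v2, i.e. btrfs-progs >= 6.0 on a 5.20+ kernel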



  • linuxgeex
    replied
    Originally posted by NobodyXu View Post
    Isn't the images read-only while the actual running container writeable so that you can re-create the container multiple times?
    Yes. It's achieved with UnionFS. They mount the app, then the dependency overlays, and finally a RW folder on top of it.

    You're suggesting to extract the compressed filesystem images into native BTRFS/ZFS folders, and then take per-app-context writeable snapshots of those folders so they don't affect each other ... two problems...

    First, those RW snapshots still need to be merged into a single standard POSIX filesystem hierarchy. Neither BTRFS nor ZFS provides a way to do that, at least not that I'm aware of. So you'd still end up using UnionFS to merge them.

    Second, when you want to make an archive of only the RW state changes (done numerously ie on app/overlay update/removal so there's versioned restore points) that state would be mixed into those multiple RW snapshots. Ultimately you'd use UnionFS to provide a single clean RW overlay as well, just like is done on every other parent FS, and sadly lose the seemingly helpful RW snapshots.

    While I recognise that snapshotting the RW snapshots recursively appears to provide similar capability, there's drawbacks. ie one of the benefits of Snapd for developers is that if a user has a problem with an app, the user can save state and send it to the developer. How do you extract only the changed files from the app's merged FS image? Not impossible granted, but neither is it quick and easy. Snapshots also aren't free, and it's easy to ignore that. There are performance and storage costs, and in this context they don't really compare that favourably to a tarball.
    Last edited by linuxgeex; 27 December 2022, 10:22 AM.



  • NobodyXu
    replied
    Originally posted by linuxgeex View Post
    No need for subvol - bind mounts are the compatible, KISS, solution.
    Aren't the images read-only while the actual running container is writeable, so that you can re-create the container multiple times?

    Originally posted by linuxgeex View Post
    I suppose one could manually decompress the images. There still wouldn't be many shared blocks thanks to block alignment issues. One would need to rebuild the overlay filesystems with block size matching BTRFS's dedupe block size to avoid the alignment problems. SquashFS can be forced to disable tail packing (normally the compressed blocks are written back to back instead of block aligned), use a 16KB block size (BTRFS dedupe default.)
    That's indeed a problem, though I am thinking of supporting btrfs/zfs natively instead of using a loop device.

    Originally posted by linuxgeex View Post
    Compression... different Zlibs could break binary stream identity. On the plus side, you could also get both compression (SquashFS) and dedupe (reflink) with xfs:

    Code:
    fdupes -r . | duperemove --fdupes
    duperemove -hdr --hashfile=/tmp/test.hash --dedupe-options=same,block
    I forgot that xfs also supports reflink, which can be used for dedupe; that's also a possible solution.

    Originally posted by linuxgeex View Post
    I have a hard time seeing Canonical/Snapd adopt this since EXT4 is popular. Anyone aware of a scheduled arrival of reflink for EXT4?

    Flathub is more community-driven so maybe more luck with the concept there.
    I think they can simply re-use runc/crun, the runtime for docker/kubernetes, which already supports btrfs/zfs...



  • linuxgeex
    replied
    Originally posted by NobodyXu View Post
    linuxgeex Thanks for the explanation.

    I wonder does snapd/flatpak has special support for Btrfs where they can simply store the overlays as a Btrfs subvolume and then use Btrfs subvolume to create a new writeable overlay when launching a new application?

    I know that docker supports this for Btrfs and Zfs, essentially replacing their overlayfs2 driver with btrfs/zfs driver to take advantage of the cow filesystem features.
    No need for subvol - bind mounts are the compatible, KISS, solution.

    I suppose one could manually decompress the images. There still wouldn't be many shared blocks, thanks to block alignment issues. One would need to rebuild the overlay filesystems with a block size matching BTRFS's dedupe block size to avoid the alignment problems. SquashFS can be forced to disable tail packing (normally the compressed blocks are written back to back instead of block aligned) and to use a 16KB block size (the BTRFS dedupe default).
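
    Something along these lines, if squashfs-tools cooperates (sizes per the dedupe default above):

    Code:
    # rebuild a snap's squashfs with 16K blocks and no tail packing, so compressed blocks stay aligned
    unsquashfs -d /tmp/rootfs app.snap
    mksquashfs /tmp/rootfs app-aligned.squashfs -b 16K -no-fragments -comp zstd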

    Compression... different Zlibs could break binary stream identity. On the plus side, you could also get both compression (SquashFS) and dedupe (reflink) with xfs:

    Code:
    # find duplicate files with fdupes and feed the list to duperemove for reflink-based dedupe
    fdupes -r . | duperemove --fdupes
    # or let duperemove hash and dedupe everything itself, keeping a hashfile for faster re-runs
    duperemove -hdr --hashfile=/tmp/test.hash --dedupe-options=same,block .
    I have a hard time seeing Canonical/Snapd adopt this since EXT4 is popular. Anyone aware of a scheduled arrival of reflink for EXT4?

    Flathub is more community-driven so maybe more luck with the concept there.



  • NobodyXu
    replied
    linuxgeex Thanks for the explanation.

    I wonder whether snapd/flatpak have special support for Btrfs, where they could simply store the overlays as Btrfs subvolumes and then use a Btrfs snapshot to create a new writeable overlay when launching an application?

    I know that docker supports this for Btrfs and Zfs, essentially replacing the overlay2 driver with a btrfs/zfs driver to take advantage of the CoW filesystem features.



  • linuxgeex
    replied
    Originally posted by NobodyXu View Post

    I remember that they can do this since it basically compares decompressed blocks.

    I think Btrfs dedup will help here.

    Once the blocks are deduped, the block cache for the fs will also be deduped.

    So instead of loading the same data twice in different blocks, it gets loaded only once.
    You're sadly misunderstanding the layer at which compression is happening, which is why I clarified that the data is obfuscated from the filesystem. It's a lot more productive if you make a good faith effort to see the other side of a conversation.

    Here's a thought exercise for you. Imagine a folder of 100 images. Make a zip of that folder. Now remove the first image from the folder and make a second zip file. There's 99% overlap of the data within the 2 zips. However there's not a single block of the data on disk that's the same. So BTRFS would not de-dupe it in any way.

    The Snap overlays are compressed filesystems. They are like the zip files. Their compressed data as represented on disk is not identical, so it cannot be de-duplicated. Not unless BTRFS is going to attempt to decompress and match blocks of files within those compressed filesystems. That's not impossible - WinRAR for example does this if you attempt to compress multiple ISOs, zips, and a variety of other compressed file formats. But BTRFS today doesn't support such snooping. The CPU and disk bandwidth costs would be prohibitive. Maybe BTRFS will do that in the year 2222...
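
    Easy enough to check with a toy version of that experiment:

    Code:
    # two archives with ~99% common content end up with no identical, aligned blocks to dedupe
    mkdir photos; for i in $(seq 100); do seq $i $((i+200000)) > "photos/img$i.txt"; done
    zip -qr all.zip photos
    rm photos/img1.txt
    zip -qr rest.zip photos
    duperemove all.zip rest.zip     # finds nothing to share between the two archives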
    Last edited by linuxgeex; 19 December 2022, 03:00 PM.

