OpenZFS 2.0-RC1 Released With Unified Linux/BSD Support, Zstd Compression & Much More

  • #11
    Originally posted by pkese:
    I'm trying to understand the 'performant RAM caching' part...

    Is ZFS's ARC cache really superior in any way compared to classical Linux block cache used by regular Linux filesystems?
    In my testing, yes, very much so, at least for btrfs vs. ZFS; ZFS is still slightly worse than mdadm RAID with ext4/xfs.
    I don't know if this is because btrfs can't use the block cache properly or because ZFS's own caching is better. I wouldn't be surprised if this is something only CoW filesystems need.

    The same "server" (It's the cheapest first-gen threadripper on a gaming asrock mobo and 128GB of ECC RAM, not a true server) with the same array of 20 SAS drives in the same system, arranged as RAID10 for both ZFS and btrfs, the same Windows VMs run like absolute lagfest with btrfs (5-10 seconds to register a click on screen), while with ZFS it's nearly as good as the same VM running on a mdadm raid with normal filesystem (ext4/xfs), even BEFORE I start adding SSDs as read/write cache for it.
    Ah it also handles multiple VMs without any change, while running multiple VMs on a btrfs array is ridicolously worse.

    And I'm limiting ZFS's cache to 32GB of RAM, while the Linux page cache has no such limit and can use all the free RAM available, which is 100+GB (the server has 128GB, a single VM uses 16GB and only one is up while testing, and the host is a headless openSUSE Tumbleweed system using less than 512MB of RAM).
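
    For reference, a minimal sketch of how that cap is typically set on ZFS on Linux, via the standard `zfs_arc_max` module parameter (the 32GiB value is just what I picked):

    ```sh
    # Cap the ARC at 32 GiB at runtime
    echo $((32 * 1024 * 1024 * 1024)) | sudo tee /sys/module/zfs/parameters/zfs_arc_max

    # Make the cap persistent across reboots / module reloads
    echo "options zfs zfs_arc_max=$((32 * 1024 * 1024 * 1024))" | sudo tee /etc/modprobe.d/zfs.conf
    ```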

    Nevertheless, ZFS is known to perform very well on machines with lots of memory.
    Yeah, but what counted as "lots of memory" years ago (the environment ZFS was developed for) is something like 8GB.

    An HP MicroServer Gen7 that is absolute garbage as far as CPU goes (an embedded AMD pre-Ryzen APU with ECC support), with 4GB of RAM, can still sustain 50-70 MB/s of sequential writes to the array over the network (it's a NAS with someone writing VM disk images into its Samba shared folder) for hours on end, on a RAID5 with compression enabled; btrfs can't match that even on RAID10.
    Last edited by starshipeleven; 26 August 2020, 08:05 AM.



    • #12
      Originally posted by starshipeleven:
      or because ZFS's own caching is better. I wouldn't be surprised if this is a thing needed only by CoW filesystems.
      Interesting.
      Btrfs never overwrites existing disk blocks; it always writes copies of modified blocks elsewhere on disk while preserving the originals (the CoW mechanism).
      It could be that the filesystem doesn't inform the block cache that the old copies are safe to discard, so that RAM could be reused for the new ones.

      Originally posted by starshipeleven:
      the same Windows VMs run like absolute lagfest with btrfs (5-10 seconds to register a click on screen)
      I wonder if you tried that with CoW disabled for those VM image files?
      Btrfs does CoW at the file level, whereas CoW granularity on ZFS is more at the subvolume/snapshot level. That's why you can `cp --reflink` a single file on btrfs but not on ZFS.
      The problem with that is that large files with lots of I/O get a new CoW copy on every write and become extremely fragmented. The intended workaround is to set No-CoW on such files, e.g. VM images. There's not much point in doing CoW on VM images anyway, since they run their own filesystems inside the image file.

      Btrfs files with CoW disabled should then do the same thing ZFS does, namely overwrite the file in place and only do CoW for snapshots.
      And possibly provide similar performance (at least that's the theory).

      Thanks for sharing.

      [Technically, for VM images, one should create a 0-byte file, mark it as No-CoW using chattr +C, and then copy the existing image into that file in order to get a consecutively allocated chunk of disk space. But yeah, ZFS is a safer bet and works fine out of the box without any of these extra steps.]
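
      A rough sketch of those steps, with hypothetical paths (on btrfs, chattr +C only takes effect on empty files):

      ```sh
      # Create an empty file and mark it No-CoW before any data lands in it
      touch /vmstore/win10.raw
      chattr +C /vmstore/win10.raw

      # Copy the existing image in without reflinking, so data is written into the No-CoW extents
      cp --reflink=never /backup/win10.raw /vmstore/win10.raw

      # Confirm the 'C' attribute stuck
      lsattr /vmstore/win10.raw
      ```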
      Last edited by pkese; 26 August 2020, 09:06 AM. Reason: added VM image allocation text



      • #13
        Originally posted by pkese:
        Btrfs never overwrites existing disk blocks; it always writes copies of modified blocks elsewhere on disk while preserving the originals (the CoW mechanism).
        It could be that the filesystem doesn't inform the block cache that the old copies are safe to discard, so that RAM could be reused for the new ones.
        ZFS is CoW too, so it's doing the same thing when writing to disk. But IMHO the biggest difference I've seen between btrfs and ZFS in my testing is in random reads. The VM isn't writing much; it's just responding to input (open window, close window, move window, open Start, and such), so it's mostly random reads.

        The Linux block cache works for reads too (i.e. everything that is read is also cached in the hope it will be requested again, and when it's time to evict things from the cache the least-requested go first), but here ZFS is clearly better.

        I wonder if you tried that while disabling the CoW for those VM images?
        No, because that turns off checksumming and compression as well (see below), and at that point I'm better off with mdadm and ext4/xfs.

        Btrfs does CoW at the file level, whereas CoW granularity on ZFS is more at the subvolume/snapshot level. That's why you can `cp --reflink` a single file on btrfs but not on ZFS.
        That's just the granularity at which you can set things to be CoW or not; the actual CoW functionality does not change. CoW always happens at the block level in both filesystems.

        The decision not to allow that much granularity in choosing what is CoW and what isn't (which would allow reflinking) isn't because ZFS can't do it, but for performance reasons, according to the ZFS developers: https://github.com/openzfs/zfs/issue...mment-26165469

        Since this is a VM disk storage array, I would really like reflinking, as it would drastically decrease the time it takes to clone a VM, but heh, there's no interest in doing that upstream.
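
        For what it's worth, a sketch (with hypothetical pool/dataset/file names) of the two ways to clone an image quickly: a per-file reflink on btrfs versus snapshot-and-clone on ZFS, which only works at dataset/zvol granularity:

        ```sh
        # btrfs: near-instant per-file clone that shares extents with the original
        cp --reflink=always /vmstore/template.raw /vmstore/new-vm.raw

        # ZFS: no per-file reflink; clone a whole dataset (or zvol) from a snapshot instead
        zfs snapshot tank/vms/template@base
        zfs clone tank/vms/template@base tank/vms/new-vm
        ```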

        The problem with that is that large files with lots of I/O will get a new CoW copy on every write and become extremely fragmented.
        I'll deal with that when it becomes an issue; so far I'm not noticing much of a fragmentation increase in the 3-4 months this system has been up, but then I'm not running a true "production system" either.

        If it's just something like copying stuff over to the backup, deleting the main file and copying it back again every year or so, I'll manage.

        There's not much point in doing CoW on VM images anyway, since they run their own filesystems inside the image file.
        Easy there, CoWboy. From https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs(5) (the btrfs manpage):
        Nodatacow implies nodatasum, and disables compression.

        Now,
        • yes, I can live without CoW for this purpose,
        • I'm annoyed to lose checksumming, because on a large-ish array like this it's extremely useful for detecting whether some drive or controller is bullshitting the OS about what's going on (it's amazing for troubleshooting, like when I had a loose cable, or connected power wrong so that 20 drives were overloading a single cable of the 650W Seasonic PSU that should theoretically hold them fine, and actually does hold them fine when I split them over two cables; both true stories),
        • I REALLY don't like losing transparent compression on a disk that stores VMs, as with compression I save So. Much. Space.

        Meanwhile, ZFS is running with CoW, checksumming and lz4 compression (default settings AFAIK) on the array I mentioned, and it's nearly as fast as mdadm RAID with ext4/xfs.
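
        A quick sketch of setting and checking that, with a hypothetical pool/dataset name:

        ```sh
        # lz4 is cheap enough to leave on for a VM store; compression is set per dataset
        zfs set compression=lz4 tank/vms

        # See the setting and what it's actually saving
        zfs get compression,compressratio tank/vms
        ```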

        Really, it's a no-brainer, and the people making Proxmox (a turnkey distro for building a KVM host) also use ZFS as the default filesystem for VM arrays.

        Btrfs with CoW disabled should then do the same thing as ZFS does, namely it will CoW only for snapshots.
        ZFS is CoW all the time, not just for snapshots.

        Mind you, if nocow didn't remove other features it would be fine, but it does.
        Last edited by starshipeleven; 26 August 2020, 09:54 AM.



        • #14
          Originally posted by starshipeleven:
          ZFS is CoW too so it's doing the same thing ...
          Wonderful argumentation. Thanks.



          • #15
            Originally posted by pranav:
            What's the status of the OpenZFS license? Is it the same as ZFS's?
            No need to ask about the status when no one is working on changing it. It will remain CDDL forever, AFAICT.



            • #16
              Originally posted by pkese:

              I'm trying to understand the 'performant RAM caching' part...

              Is ZFS's ARC cache really superior in any way compared to classical Linux block cache used by regular Linux filesystems?

              My assumption about ZFS's ARC cache was that it got integrated into ZFS because that made sense on Solaris.
              Then, when porting the code to Linux, that caching layer was just too hard to refactor out of ZFS (in order to switch to the regular Linux block cache), so they kept the ARC in the codebase.
              In some cases the ARC leads to worse resource utilization: memory-mapped files end up being cached in both caches (thus wasting 2x the RAM for no extra benefit).

              Nevertheless, ZFS is known to perform very well on machines with lots of memory.

              Can somebody with more technical knowledge correct me if I'm wrong?
              What used to be "machines with lots of memory" is actually >4GB of RAM, which is not much nowadays. I use it with 6GB on an old FreeBSD machine and it works fine.



              • #17
                Originally posted by Alexmitter:
                I am an ext4 dude and will probably stay one for the rest of my life, but is there any valid reason for ZFS on Linux when we have btrfs? Or is it just fanboyism?
                ZFS is cross-platform: macOS, Windows, Linux, FreeBSD and Solaris. So it might be pretty cool to use on a dual-boot system.
                You wouldn't think at first that a NAS filesystem would be very good for USB drives, but it actually is when you add in the cross-platform ability, plus the fact that those drives are usually quite flaky and ZFS checksums everything. Format one disk, read it with 5 OSes. Not bad. I personally use it for removable cold-storage drives that move between macOS, FreeBSD and Linux.

                It also has a good block emulation layer (zvols), which makes terrific back-end storage for VMs.
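
                A minimal sketch of carving one out (hypothetical pool name and size):

                ```sh
                # Create a sparse (thin-provisioned) 100G zvol for a VM disk
                zfs create -s -V 100G tank/vms/win10-disk

                # The block device appears under /dev/zvol/ and can be handed straight to the hypervisor
                ls -l /dev/zvol/tank/vms/win10-disk
                ```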

                Beyond that, as mentioned, it has native encryption. It works across all 5 OSes and composes with zfs send/receive, the cache and compression (compression is usually defeated by encryption; the ZFS cache is also compressed).

                Overall it's just kind of refined; all of its features and parts work together.

                Originally posted by pkese:
                Is ZFS's ARC cache really superior in any way compared to classical Linux block cache used by regular Linux filesystems?
                The ARC is actually very good. If you want a deep dive into how it was developed and how it works, you can look here:
                Bryan Cantrill on "ARC: A Self-Tuning, Low Overhead Replacement Cache" by Nimrod Megiddo and Dharmendra Modha (https://www.usenix.org/legacy/event/fast03/te...)


                TL;DW: it works like this:
                [MRU Ghost List] [MRU Cache] |----------------------|p|----------------------| [MFU Cache] [MFU Ghost List]

                So it has its space divided between a most-recently-used and a most-frequently-used area, and it decides how large each is based on p, which is adjusted by cache misses in the ghost lists. It does all of this in one algorithm: basically, if it's missing a lot, it adjusts itself.
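
                You can actually watch that adaptation happen on ZFS on Linux, which exposes the ARC counters in procfs; a quick sketch (field names as they appear in arcstats):

                ```sh
                # Target size (c), the adaptation point (p), and the hit/miss counters that drive it
                grep -wE 'p|c|size|mru_hits|mfu_hits|mru_ghost_hits|mfu_ghost_hits' /proc/spl/kstat/zfs/arcstats

                # Or use the bundled arc_summary / arcstat tools for a friendlier view
                arc_summary | head -n 40
                ```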

                From my understanding Linux uses a modified LRU, so the ARC should outperform it. (It's open source; Linux could just take it. I'd love to see a native ARC in Linux.)

                It leads to ZFS performing much better in the real world than in benchmarks, as benchmarks are designed to defeat the cache; in real-world use the ZFS ARC makes up a lot of the difference. I'd like to see benchmarks with the caches enabled.


                And a note about the licence: it really doesn't matter, as the licence is de facto compatible; you can comply with the GPL and the CDDL at the same time without causing any harm to either licence, which makes it impossible to bring a case to court without showing harm. Not being in the Linux kernel isn't really much of an issue, because it's not in the macOS or Windows kernel either; it's just code you use to get features, nothing special. Not very different from your NVIDIA driver.
                Last edited by k1e0x; 26 August 2020, 03:05 PM.



                • #18
                  Originally posted by k1e0x:
                  From my understanding Linux uses a modified LRU, so the ARC should outperform it. (It's open source; Linux could just take it. I'd love to see a native ARC in Linux.)
                  Some insight [1]:

                  Per node, two clock lists are maintained for file pages: the inactive and the active list. Freshly faulted pages start out at the head of the inactive list and page reclaim scans pages from the tail. Pages that are accessed multiple times on the inactive list are promoted to the active list, to protect them from reclaim, whereas active pages are demoted to the inactive list when the active list grows too big.
                  But:

                  A workload is thrashing when its pages are frequently used but they are evicted from the inactive list every time before another access would have promoted them to the active list.
                  So:

                  For each node's LRU lists, a counter for inactive evictions and activations is maintained (node->nonresident_age). On eviction, a snapshot of this counter (along with some bits to identify the node) is stored in the now empty page cache slot of the evicted page. This is called a shadow entry. On cache misses for which there are shadow entries, an eligible refault distance will immediately activate the refaulting page.
                  This resembles the ghost lists used in the ARC implementation in ZFS. I'm just too lazy; I still need to thoroughly test both approaches, just for the sake of it (I use btrfs and ZFS for different use cases, so...).

                  [1] https://github.com/torvalds/linux/bl...m/workingset.c



                  • #19
                    Originally posted by k1e0x:
                    The ARC is actually very good. If you want a deep dive into how it was developed and works you can look here.
                    This is amazing 👍👍👍



                    • #20
                      Originally posted by useless:

                      Some insight [1]:


                      This resembles the ghost lists used in the ARC implementation in ZFS. I'm just too lazy; I still need to thoroughly test both approaches, just for the sake of it (I use btrfs and ZFS for different use cases, so...).

                      [1] https://github.com/torvalds/linux/bl...m/workingset.c
                      It'd be interesting to see. I don't have very good info on exactly how Linux's LRU works. Another new cache algorithm to look at would be the one AMD is now using in Ryzen.

