FreeBSD ZFS vs. Linux EXT4/Btrfs RAID With Twenty SSDs


  • #11
    Michael, forgive me if I didn't see it anywhere, but could you run a straight dd read of every drive in parallel? That would give us a baseline for where raw read IO tops out due to SATA, controller, or NVMe lane limits. That's always interesting to compare to the filesystem numbers, I think.
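
    Something along these lines would do it (bash, with placeholder device names; substitute whatever the twenty SSDs actually enumerate as):

      # read 16 GiB raw from every drive at once, then watch iostat/gstat
      # to see where the aggregate read rate flattens out
      for dev in /dev/ada{0..19}; do
          dd if="$dev" of=/dev/null bs=1M count=16384 &
      done
      wait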



    • #12
      “With the basic SQLite embedded database benchmark, ZFS on FreeBSD 12 was faster than Linux with either EXT4 or Btrfs. Btrfs with its default copy-on-write behavior led to noticeably slower performance.”
      Michael, are you trying to say that ZFS is not copy-on-write? ZFS is only copy-on-write; there is no option to turn it off.
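
      For contrast, the opt-out the article is hinting at only exists on the Btrfs side; a rough illustration with placeholder paths (and note that chattr +C only takes effect for newly created files):

        chattr +C /mnt/btrfs/dbdata              # Btrfs: new files here are written in place (NOCOW)
        mount -o nodatacow /dev/sdb /mnt/btrfs   # Btrfs: or disable data CoW for the whole mount
        # ZFS has no equivalent property; every write goes through copy-on-write.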



      • #13
        Originally posted by chilinux View Post
        According to the ZFS rule of thumb of providing 1GB of RAM for every 1TB of disk, that petabyte array should be used with a system that has 1,000 gigabytes of RAM?!?! The majority of server motherboards I have worked with top out at less than a fifth of that!
        Every word of this is false. There is no such rule. I could have 1EB of storage on ZFS on a RPi and it would work just about as well as you can imagine it would with any filesystem. There is no penalty for having less memory for caching (and ZFS does release memory as the kernel requests it). Also, many enterprise CPUs are limited to 256GB of RAM, which is 1/4 of 1TB.

        By the way, for PostgreSQL, set primarycache=metadata if you want ZFS' cache to get out of your way. Also, set the recordsize to 8KB to avoid read-modify-write, put the PostgreSQL transaction log on its own dataset, and add a small SLOG device. That will make it perform really well.
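
        In command form, roughly (assuming a hypothetical pool called tank with the database on tank/pgdata; substitute your own dataset names and SLOG device):

          zfs set recordsize=8K tank/pgdata          # match PostgreSQL's 8KB pages, avoid read-modify-write
          zfs set primarycache=metadata tank/pgdata  # stop ARC from double-caching data shared_buffers already holds
          zfs create tank/pgwal                      # give the transaction log (pg_wal) its own dataset
          zpool add tank log gpt/slog0               # small, fast dedicated SLOG device (placeholder label)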



        • #14
          Originally posted by some_canuck View Post
          20 disks in a single vdev is suboptimal
          I have given up on expecting Michael to benchmark meaningful configurations.

          Also, he is still using compilebench, which is an utterly useless benchmark because it does not tell us which filesystem would actually be faster. Compilation takes about the same time on any filesystem because it is CPU bound, not IO bound.
          Last edited by ryao; 14 December 2018, 10:29 PM.



          • #15
            Originally posted by edenist View Post

            You mentioned tuning Postgresql to optimize for memory usage. Likewise with ZFS. There are many parameters which can help optimize if you're operating in a memory-constrained [or memory contended] system, notably arc_max, which will limit how much memory ZFS can use for its caching. I don't think you can talk about tuning postgres, then complain when you haven't done the same for ZFS.

            That memory 'rule-of-thumb' with ZFS applies when using deduplication, which isn't something a lot of people need or use. If you're wanting to use de-dup on a petabyte worth of storage, on a system with 84 hard drives in a single vdev, I'd say 1TB of memory isn't exactly crazy.

            Using edge-cases to argue against mainstream use of something seems like grasping at straws to me. If you just don't like ZFS, then that's fine I suppose. Just state it as it is.
            Unfortunately, that rule of thumb was always wrong even for deduplication. The correct way of calculating it is a mathematical formula that varies based on your data's deduplicability, your recordsize and some ARC parameters, not a constant number. I have posted it enough times that I am not going to post it again unless asked. I do not keep it on hand and it is moderately annoying to derive.

            By the way, my standard advice is to use primarycache=metadata on the application's dataset if you are doing caching in userspace, but I imagine that adjusting the maximum ARC size could potentially give better results when the machine is dedicated to a single application such as PostgreSQL. You would potentially see less direct reclaim, and the ARC would be able to act as a second-level cache for anything on the dataset that the PostgreSQL cache would not cache. Nice tip.
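
            For anyone wanting to try that, capping the ARC is a one-liner (the 8 GiB value is only an example; pick whatever leaves room for shared_buffers):

              # ZFS on Linux: cap the ARC at runtime via the zfs_arc_max module parameter
              echo $((8 * 1024 * 1024 * 1024)) | sudo tee /sys/module/zfs/parameters/zfs_arc_max
              # make it persistent across reboots
              echo "options zfs zfs_arc_max=8589934592" | sudo tee /etc/modprobe.d/zfs.conf
              # on FreeBSD the equivalent is the vfs.zfs.arc_max tunable in /boot/loader.conf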
            Last edited by ryao; 14 December 2018, 10:41 PM.



            • #16
              Originally posted by ryao View Post
              Also, many enterprise CPUs are limited to 256GB of RAM, which is 1/4 of 1TB.
              If you bought a lesser chip that maxes out at 256 GB and your requirement was for 1 TB, you bought the wrong server. Better enterprise CPUs can do 1 TB per socket. Heck, our old Dell R815 (4-socket Opteron) has 512 GB in it.



              • #17
                With all the computing power available these days, it's a shame we still have to manually tune something as basic as a filesystem.

                Now if I could just keep my Linux servers from freezing the whole damn machine every once in a while when doing very heavy disk write activity... Makes me miss Solaris.



                • #18
                  To try to explain why BTRFS is slow here: BTRFS's implementation of "RAID" 1 and 10 is NOT yet optimized for parallel workloads. It simply uses a scheme where it selects the storage device to read from based on the PID of the process. Processes with even or odd PIDs may all hug the same disk, and there is not (yet) any optimization to balance the workload based on each storage device's queue length.
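
                  To see why parallel readers can pile onto one disk: the selection boils down to roughly "copy = PID modulo number of copies" (my paraphrase of the scheme, not the actual kernel code), so a few workers that all happen to get even PIDs will hammer the same copy:

                    num_copies=2
                    for pid in 4210 4212 4214 4216; do      # e.g. four workers that all got even PIDs
                        echo "pid $pid -> reads from copy $(( pid % num_copies ))"
                    done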

                  Note that patches have been posted on the mailing list to address this multiple times (by someone called Timofey Titovets), but for some reason they have not been merged, as far as I can tell by looking at the source code (Ref: https://git.kernel.org/pub/scm/linux...4.20-rc6#n5187 )

                  http://www.dirtcellar.net



                  • #19
                    I take it hardware RAID is a thing of the past?



                    • #20
                      Originally posted by chilinux View Post
                      I'm disappointed by how many ZFS comparison benchmarks get published without discussing the filesystem implementation's use of RAM. Phoronix is not the only one that has done this, but I expected Phoronix to know better.

                      Try setting up a server dedicated to PostgreSQL and optimize the database's RAM usage (upping max_connections, shared_buffers, effective_cache_size, etc.) on a system running ext4 or xfs. Once you get that tuning to take full advantage of the RAM for the database application, move the same configuration over to a ZFS setup. The result I get is a system that thrashes, because ZFS takes a great deal of the RAM for itself and PostgreSQL's attempts to use the same RAM push the system into swapping. If you reduce that impact by lowering the PostgreSQL optimization parameters, you end up with a system that doesn't provide the same performance as the ext4 or xfs configuration. ZFS's demand that memory be used for filesystem caching instead of application caching ultimately results in a poorly tuned database server configuration.

                      Even worse is if you need a large amount of storage for the database server. ZFS stands for "Zettabyte File System", which is ironic given how poorly it actually scales in real-world terms. With 12TB hard drives available, it is not hard to build a petabyte array. According to the ZFS rule of thumb of providing 1GB of RAM for every 1TB of disk, that petabyte array should be used with a system that has 1,000 gigabytes of RAM?!?! The majority of server motherboards I have worked with top out at less than a fifth of that!

                      Lastly, it seems from the RHEL 8 release candidates that Red Hat is strongly pushing XFS with a Btrfs-like configuration interface provided by Stratis Storage. When doing FS comparisons, it would be nice if XFS were also included in the benchmarking. And again, it would be nice to see how much RAM is left available for application services to take advantage of and how much is monopolized by the FS kernel module.
                      OpenZFS on Linux and FreeBSD. Contribute to openzfs/zfs development by creating an account on GitHub.

