A Quick Look At EXT4 vs. ZFS Performance On Ubuntu 19.10 With An NVMe SSD


  • #41
    Originally posted by kcmichaelm View Post

    I'm going to support these benchmarks because I think they provide an excellent service that is needed right now. The word "default" is under-appreciated throughout your response.

    This is ZFS going into a shipping, standard, desktop distribution. By the numbers, FAR more people are going to be using this default "hilariously wrong setup" than a properly tuned version, and I think that needs exploration.

    It is perfectly true that no one should pick single-disk-ZFS for a performance benefit over Ext4 - and therefore it's very important to measure that difference so people are informed.

    There could be many examples (let's pick photographers as one, with the example in this thread) who run Linux due to their appreciation of the open-source tools, and now they see there's a supported filesystem option which we all tend to agree is pretty darn good against bitrot, and they want to try it. However, if Canonical isn't providing them tools to tune it like you say it must be done, then their installation won't be tuned unless they really feel like digging into it.
    That is a fair point. Maybe benchmarks touching ZFS should carry a very visible caveat as a middle-ground solution, so those who pick ZFS on Ubuntu at least notice that ZFS will need intervention (via the CLI as of today, or some future tool Canonical develops) for optimal performance/security.

    I guess the bigger issue is Ubuntu bringing ZFS to the desktop version (for some reason I thought it was only for the server images), because ZFS was never designed for desktop usage and basically has zero sane defaults for that scenario. Honestly, I'm not even sure how they are going to handle this beyond an "it's just an option; if you pick it, you should know what you are doing!!!" kind of attitude.

    Comment


    • #42
      Originally posted by jrch2k8 View Post
      Don't get me wrong, I do believe for desktop/small servers BTRFS is sufficient and stable enough, especially considering it is a lot simpler than ZFS and gives similar features, but for Enterprise stuff ...
      Meh...
      If you're on Facebook, you're on BTRFS! Last I checked, they're one of the largest enterprises in the world and their business is data.... If you've been with them for years, then your data has been on BTRFS for years!

      Fanboys will be fanboys and judge from afar.

      Comment


      • #43
        Originally posted by DrYak View Post

        Yes and no.
        Yes, periodic scrubbing on a new-gen filesystem (BTRFS, ZFS, and BcacheFS once it hits mainstream) is "good enough".
        And no, it's not on most RAID subsystems.
        Really? Shoot, even el-cheapo subsystems like Nexsan support scrubbing. I just figured that if they did it, most did. Good to know, though.

        Scrubbing is designed to be "rot" prevention (refresh the bits so that they aren't so musty and dusty)

        Not a checksumming thing like with ZFS and Btrfs... do those scrub? I guess they must. How else do you prevent rot?


        Comment


        • #44
          ZFS never comes across well in benchmarks. It's one of the slowest filesystems I've used, and yet it's also one of the best. ZFS prioritizes data integrity above all else. It's possible to corrupt an Ext4 filesystem and lose data. Features like journaling and fsck mitigate this to a huge degree, so ext4 is very safe. But not as safe as ZFS.

          The three main features that make ZFS so safe are block checksumming, copy-on-write, and plugging the RAID write hole. Block checksums protect against bit rot and hard drive errors, but corrupted data is only repairable on pools with redundancy: a single-disk ZFS pool can detect bit rot, but not repair it.
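
          As a quick illustration, verifying those checksums is just a scrub away; a minimal sketch, assuming a hypothetical pool named "tank":

            # walk every block in the pool and verify its checksum
            zpool scrub tank

            # CKSUM counts show detected errors; on a redundant pool ZFS
            # self-heals them, on a single-disk pool it can only report them
            zpool status -v tank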

          Copy-on-write means ZFS never modifies data in place. It writes a new block with the modified data and then updates the block pointers to reference the new block. This is bad for performance for frequently modified files, most notably databases. But it means that if you lose power or experience a hardware failure in the middle of the write, the old data is still there on disk.

          The RAID write hole is a failure mode that can cause data loss if a write operation is interrupted partway through. The classic solution is to use a RAID controller with battery backup, so in the event of a power failure it can finish writing data to disk. However, these cards are expensive and don't protect against other events that can cause data loss, such as a hardware failure in the card itself.

          The only other filesystem with the ingredients to compete with ZFS is Btrfs. It supports block checksumming and is also a copy-on-write filesystem. Unfortunately it was plagued with bugs in earlier versions, most recently in the parity (RAID5/6) modes in 2016. Since data integrity is supposed to be the main feature of Btrfs, it's not very appealing if you don't trust it.

          So although ZFS is slow, it's the most reliable filesystem available. It's possible to mitigate the speed issues by throwing money at the problem - use flash storage, or add more disks. It's not so simple to mitigate reliability issues.
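
          For what it's worth, "throwing money at the problem" usually means adding redundancy so errors can self-heal and adding flash to hide the CoW penalties; a rough sketch, with device names that are placeholders only:

            # a mirrored pool, so checksum errors can actually be repaired
            zpool create tank mirror /dev/sda /dev/sdb

            # add an NVMe read cache (L2ARC) and a separate log device (SLOG)
            # to soften random-read and sync-write penalties
            zpool add tank cache /dev/nvme0n1
            zpool add tank log /dev/nvme1n1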

          Comment


          • #45
            Originally posted by DanglingPointer View Post

            Meh...
            If you're on Facebook, you're on BTRFS! Last I checked, they're one of the largest enterprises in the world and their business is data.... If you've been with them for years, then your data has been on BTRFS for years!

            Fanboys will be fanboys and judge from afar.
            The irony of that phrase, coming right after that statement, is amazing.

            Your statement makes no sense, especially the dumb "if Facebook uses it, it means it's cool, duh!!!" analogy.

            1.) Facebook uses a myriad of filesystems, same as any enterprise, because they have millions of dollars to keep focused engineering teams on each use case and even more teams on integration. So yeah, nobody is doubting that Facebook or any other enterprise uses BTRFS; the issue is where, for what, and under which conditions.

            For example, nobody in their sane mind would use BTRFS (and the likes) for a cluster's hot backend (put here whatever service makes you happy), but BTRFS (and the likes) can be awesome for massive RAID 1/10 (or 5+0/6+0 in the case of ZFS) arrays of spinning disks plus cache drives for a nice cold backend. Why? Simply because no CoW filesystem can match the read-burst latency of a journaled filesystem, but on the other hand no journaled filesystem can do data integrity checks (properly).

            2.) Facebook probably uses BTRFS because it is already included in the kernel (and I am taking a huge leap of faith in believing they don't also use ZFS for other stuff, because the two are not mutually exclusive), and for their use case I'm pretty sure they debugged and optimized the living crap out of it. I can almost bet money they have their own implementation in-house, or at least hyper-specific patches for their workloads, since this is common practice at those huge enterprises and the main reason they go FOSS in the first place.

            So, in summary, everything in my post is accurate, including the part you cherry-picked, because I never said BTRFS is worse than ZFS in every scenario and condition (I gave a very specific set of problems that are even acknowledged by the upstream developers, btw), and in fact there are scenarios where BTRFS could be better than ZFS.

            My point was this simple: as a generally available, battle-tested-in-most-scenarios, fully featured (as in all claimed features work properly) CoW filesystem, ZFS is superior to BTRFS in most scenarios, and this is factually true. But again, this does not mean BTRFS is 100% bad for you, as it has its good points as well (it's not black and white).

            Also, please take into account that when you mention enterprises as huge as Facebook, Microsoft, Google, etc., normal common sense gets thrown out the window, since they have enough engineering power to field massive teams that can make anything work with anything, because their control over their data is ultra fine-grained and ultra focused. The fact that they can make VFAT work with PostgreSQL (for example) doesn't mean the same will hold true for people at regular run-of-the-mill enterprises/SMBs/etc., or that "this FS is better than that FS because X mega-enterprise uses it" will hold true, because you don't know the exact conditions of use in either case. For example, VFAT/ext4/NILFS/etc. could be better than anything under the sun for one very specific table, with very controlled data types, for a certain operation that will probably only ever be used at Facebook; that still qualifies as "Facebook uses VFAT/ext4/NILFS/etc., duh!!", and it in no way means your databases won't run like crap on VFAT/ext4/NILFS/etc. when you don't meet those specific conditions.

            Comment


            • #46
              Originally posted by jrch2k8 View Post

              The irony of that phrase, coming right after that statement, is amazing.
              {...}
              Meh...

              I expected the diatribe from the fanboy...

              Comment


              • #47
                Hopefully some ZFS devs and experts can agree on what precisely qualifies as a properly configured and optimized ZFS setup and let Phoronix know before their next benchmarks. I really want to see ZFS over Optane and other top SSDs with encryption enabled! Definitely needs to be tested in raidz2 also.

                Comment


                • #48
                  Originally posted by make_adobe_on_Linux! View Post
                  Hopefully some ZFS devs and experts can agree on what precisely qualifies as a properly configured and optimized ZFS setup and let Phoronix know before their next benchmarks. I really want to see ZFS over Optane and other top SSDs with encryption enabled! Definitely needs to be tested in raidz2 also.
                  I can give you some base rules (and I have a bunch more in older posts in other threads as well); a rough example of applying a few of them follows the list:

                  Hardware:
                  • If possible, always prefer ECC over fast RAM; all technicalities aside, it protects you from getting trash written from RAM to your pools.
                  • ZFS uses a lot of RAM by default (unless fine-tuned), so if you don't wanna spend hours with arcstat, just set your minimum RAM to 32GB for regular home usage.
                  • If you wanna encrypt, get a CPU modern enough to have at least 4 cores with AES-NI; Zen-based CPUs are great choices.
                  • If you plan to use NVMe as data disks, not just as ZIL/SLOG drives, set your minimum requirement to at least Threadripper (or X299 if you love to waste money). ZFS can handle RAID on NVMe on any system and won't require any sort of BIOS or dongle extras, but regular desktop boards lack PCIe bandwidth, so you will always be limited to one drive's speed or worse depending on the mobo.
                  • Never use ZFS as RAID 0 or with a single drive, because basically you will get all the downsides of CoW with literally zero benefit.
                  General:
                  • Know your data; depending on your data, the performance can be great or unbearable trash.
                  • Not all properties on a pool are for you; just use what you need.
                  • Never, ever, ever use ZFS on a bare pool; datasets/volumes exist for a reason, and if you don't use them you should reconsider why you are using ZFS to start with.
                  Basic Optimization:
                  • Compression is a great tool but a deceptive one. If you have a volume with lots of office files or other highly compressible files (say, your documents folder), compression will work great and boost your transfer rate because you save lots of bandwidth. But if you have lots of incompressible files, like videos or already-compressed files, enabling compression on that volume will skyrocket your latency with zero bandwidth savings, i.e. you are wasting CPU cycles for no reason.
                  • Deduplication is a nice feature, but it requires huge amounts of RAM/CPU and can make latency spike harshly if misused. Keep in mind it should only be used on volumes with lots of small files that you know are redundant and compressible (like a Samba share where people save office files and the like), or on a volume with big binaries that you know have a lot in common, like ISOs, virtual machine images (in case you host several instances of the same OS), etc.
                  • Large dnodes should always be set to auto unless you have a very specific reason not to (like Solaris compatibility).
                  • atime=off, relatime=on; this one doesn't need much explanation.
                  • recordsize is a tricky one. My rule is 16K for certain databases, 128K for a general bunch of small compressible files, and 1M for volumes where most of your files are bigger than 1M (like videos/ISOs, etc.). If you don't get this right you will get low performance and/or very high fragmentation.
                  • sync always stays at standard unless you really know what you are doing.
                  • xattr=sa, acltype=posixacl, and aclinherit=(up to you) have worked great for me through the years, but as always, check the documentation first.
                  • Encryption requires testing, because depending on your data/compression/recordsize it could be great or a slow dog, so do some tests before you blindly go and start encrypting.

                  FAT WARNING:
                  • Never use ZFS as RAID 0 or with a single drive, because basically you will get all the downsides of CoW with literally zero benefit.
                  • Most changes made to a volume affect only new/modified files, so be careful to make most of your changes before adding data to volumes/pools/etc.
                  • A lot of fine-tuning can be done through kernel module parameters as well, but that is way more complex than a simple post can handle.
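
                  As promised above, a minimal sketch of what a few of those properties look like in practice; the pool/dataset names are made up for illustration, so check the docs before copying anything blindly:

                    # per-dataset properties from the "Basic Optimization" section
                    zfs create tank/documents
                    zfs set compression=lz4 tank/documents    # compressible data only
                    zfs set atime=off tank/documents
                    zfs set xattr=sa tank/documents
                    zfs set acltype=posixacl tank/documents
                    zfs set dnodesize=auto tank/documents

                    # recordsize tuned per workload
                    zfs create -o recordsize=16K tank/postgres   # database dataset
                    zfs create -o recordsize=1M tank/media       # mostly large files

                    # example kernel module tuning: cap the ARC at 8 GiB
                    # (line for /etc/modprobe.d/zfs.conf, applied on reload/reboot)
                    # options zfs zfs_arc_max=8589934592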

                  Comment


                  • #49
                    Originally posted by jrch2k8 View Post
                    RAID 5/50/6/60 is still very nuclear on BTRFS; it may work or it may eat your data and kill your kitten
                    Anything using BTRFS' RAID5 or 6 is still considered officially NOT STABLE and SHOULD NOT be used in anything but a testing environment. (I don't dare to use them.)
                    For now you're limited to stacking it atop an MD RAID5/6 (that's what I use instead when I want erasure coding).
                    Though in that setup BTRFS can detect bitrot (checksum mismatch) but not attempt to fix it (it doesn't get access to the redundant stripes).
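
                    A minimal sketch of that stacking, with device names and mount points that are placeholders only:

                      # build the parity array with MD, then put BTRFS on top of it
                      mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[b-e]
                      mkfs.btrfs /dev/md0
                      mkdir -p /mnt/data && mount /dev/md0 /mnt/data

                      # BTRFS will checksum and report corruption here, but it cannot
                      # self-heal it, since the redundancy lives below it in the MD layer
                      btrfs scrub start /mnt/data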


                    Originally posted by jrch2k8 View Post
                    It is very slow on big LUNs, especially with PostgreSQL (last tried with PG10/Linux 5.1), at least compared to ZFS, but it may be related to the first issue since I didn't test with RAID1 (which I think is the strongest one on BTRFS atm).
                    {...}
                    BTRFS and virtualization are not very good friends.
                    Both boil down to BTRFS being CoW and thus not very good at large numbers of random writes in large files. (You can add large torrent files as a third example.)
                    The work-around consists of tagging these large files as no-CoW ("chattr +C" on them before starting to write into the files, or alternatively touch/chattr a new file and then pipe the data over from the old file - a sketch follows below), which makes BTRFS perform the write operations in place and delegates integrity to whatever that system already uses (the database's log journal, the VM guest OS's own mechanisms, the torrent's tree of hashes).
                    Autodefrag can also help alleviate things a tiny bit, by coalescing multiple neighbouring writes into a single one.
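
                    As a sketch of that touch/chattr dance for an existing VM image (paths are illustrative only, and the VM should be shut down first):

                      # +C only takes effect on new/empty files, so create a fresh one,
                      # mark it NOCOW, then copy the data across and swap it in
                      touch /var/lib/libvirt/images/vm.raw.nocow
                      chattr +C /var/lib/libvirt/images/vm.raw.nocow
                      dd if=/var/lib/libvirt/images/vm.raw of=/var/lib/libvirt/images/vm.raw.nocow bs=1M
                      mv /var/lib/libvirt/images/vm.raw.nocow /var/lib/libvirt/images/vm.raw

                      # verify the attribute stuck
                      lsattr /var/lib/libvirt/images/vm.raw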

                    Also, RAID5/6 leads to expensive read-modify-write cycles when only a fraction of a whole stripe is updated (<- my experience in write-intensive situations).

                    BcacheFS is interesting because - by leveraging its tiered storage - it can (according to Kent) condense those writes into *new stripes* (thus append-only write cycles, not the dreaded read-modify-write).

                    Originally posted by jrch2k8 View Post
                    In general I believe BTRFS lacks flexibility in the volume/snapshot department as well, but this may be subjective depending on what you do.
                    I would find it interesting if you could go into details (e.g. about which scenarios you can't accomplish with BTRFS).

                    Originally posted by jrch2k8 View Post
                    I don't think it works OK on NVMe (as in, with several drives). I can't prove it since the logs say nothing, but on NVMe I just noticed some services getting random huge latency spikes, and the only difference was ZFS vs BTRFS. Well, I may be wrong (I also didn't bother to go the extra mile; I just nuked the server and went back to ZFS to test other stuff <-- it was a test server, of course).
                    Haven't got much experience with NVMe. Though there isn't a fundamental reason why BTRFS shouldn't work with them, except maybe the usual low-performance stuff of BTRFS getting in the way (random writes, eventually a maze of pointers to traverse until finding the correct extent in a heavily fragmented file, read-modify-write cycles if you dare to RAID5, the need to checksum absolutely everything meaning you're not operating at maximum bandwidth, etc.).

                    Don't get me wrong, I do believe for desktop/small servers BTRFS is sufficient and stable enough, especially considering it is a lot simpler than ZFS and gives similar features, but for Enterprise stuff or really important stuff I do believe ZFS is without peers.
                    Just as an example of this way of thinking: I run BTRFS on my workstation; the admins deploy ZFS on the HPC here.

                    or a workload I couldn't find a way to optimize the living bejesus out of it
                    This is probably one of the main advantages of ZFS: it's extremely tunable.

                    BTRFS is quite limited from that point of view (locally turning off CoW on certain files, autodefrag, and that's about it).

                    Damn, even today I have a client with an old server that has lost 23 of its original 24 hard drives in the last 10 years (it's implied that I've been replacing the damaged drives with new ones and resilvering, of course) but has never lost a bit of data.
                    Well, on the other hand, *any* erasure coding could handle such a scenario, down to even your garden-variety MD RAID5/6.
                    (Caveat aside about BTRFS's own RAID5/6 not being stable.)

                    Comment


                    • #50
                      Originally posted by jrch2k8 View Post

                      I can give you some base rules (and I have a bunch more in older posts in other threads as well):
                      {...}
                      Very useful! Is there some way to CC the phoronix.com authors so that for the next ZFS tests we can have everyone in the comments saying, "ya, totally fair config, thanks!"? Looking forward to the next ZFS raidz2 benchmarks on the top SSDs vs ext4 & XFS.

                      Comment
