Announcement

**DeFTeR** · 16 January 2016, 12:22 PM

Very interesting results, especially for PC-BSD. I'd like to see full set of benchmarks used for Linux distributions. Some network performance benchmarks would as well be interesting. Also adding NetBSD and some BTRFS linuxes (OpenSUSE maybe) would be great.

**ryao** · 16 January 2016, 12:41 PM

Originally posted by SystemCrasher View Post

I think Michael does right thing: testing how it performs out of the box. That's what most users would actually see. Sorry, but demanding everyone to detect block sizes and somesuch just not going to work. So either defaults perform adequately or you get crappy results and users are being unhappy - for a reason. Getting some extras after proper tuning is good, but things should work reasonably by default, out of the box. Else its crappy code or crappy defaults, crappy (sub)systems and so on. That's how users would see it.

"testing how it performs out of the box" for readers is impossible when hardware quirks between models introduce a confounding variable that gives one group of readers' systems a handicap and another group of readers' systems no handicap. It would be a different matter if things were guaranteed to be incorrect *everywhere*. If that were the case, it would be a clear-cut filesystem bug, but it is not the case.

Consequently, Michael's benchmarks have an inherent bias that can either help a filesystem's performance in a comparison by handicapping others or hurt a filesystem's performance in a comparison by handicapping it, depending solely on whether the filesystem has something to detect and correct the quirk.

The correct way to handle this is to make certain each filesystem is configured correctly and explain this potential issue to people reading the benchmark methodology People making filesystem choices based on those numbers would then be able to make them based on some degree of actual merit rather than a distorted view of reality that arbitrarily favors one or the other.

Considering only performance itself is also a form of bias that does not consider long term reliability in terms of system runtime, data integrity, maintenance, etcetera, although considering only performance correctly is far better than considering only performance incorrectly.

That said, it turns out that ZFS was not affected by this issue in Michael's benchmarks. The FreeBSD block IO layer has a quirk for Michael's test hardware that allows it to overcome the problem that I outlined:

https://github.com/freebsd/freebsd/b.../ata_da.c#L477

Linux lacks such quirks and the in-tree Linux filesystems are therefore at a disadvantage. ZFSOnLinux used to be similarly disadvantaged, but I hardcoded aquirks list into the ZFSOnLinux code so that it would override the autodetection when quirky drives were detected because I could not go back in time to submit patches to Linus for a quirks list in 2.6.32 onward, although it looks like Michael's current hardware is not in the quirks list. That will be fixed for the next release:

https://github.com/zfsonlinux/zfs/bl...ol_vdev.c#L109

I am unable to discern whether your remarks are the result of bias or the result of ignorance. If they are the result of bias, then your defense of bad benchmark methodology achieves the opposite of what you would expect and your bias can lead you to make poor decisions. If they are the result of ignorance, then if you rely on these numbers to make decisions, you are more than likely to find yourself obtaining worse performance than you would have obtained had they properly accounted for it, although you might never realize. Either way, you lose.

Originally posted by SystemCrasher View Post

Whatever, but we have real world full of ignorant users and imperfect hardware full of odds, bugs and quirks. It would be better if HW is perfect, etc. But it not going to happen. Ignoring this fact is .. ahem, naive.

P.S. hmm, PC-BSD wasn't worst of the bunch and even won in some cases. Not bad for BSD

. Though I think it is fair to test Linux distros on more assorted filesystems to get idea what various designs can do and how it compares. I think it is reasonable to compare btrfs, zfs, xfs and ext4 on HDD & SSD and f2fs on SSDs, etc? Sure, test matrix happens to be quite large. Also, FreeBSD or clones are good to run on ZFS and UFS2. Let's see if claims of BSD fans about UFS2 are at least half-true.

You will not see "if claims of BSD fans about UFS2 are at least half-true" by any benchmarks done on Phoronix as long as Michael's benchmark methodology is incapable of providing a fair comparison.

For starters, the Linux mainline filesystems are all handicapped by his refusal to adjust sector size at filesystem creation to compensate for the quirks of his test hardware. OpenBSD and DragonflyBSD might suffer similar issues, although I have not checked. There are also severe issues in the rationale of some of the benchmarks he runs and also other aspects of configuration, such as the imposition of read-modify-write overhead on heavy database workloads (e.g. pgbench) when no database administrator would configure a production database that way. The often cited justification "we only care about desktop performance" makes no sense when things like pgbench (used in other articles) measure workloads that will have never in the past, do not now and will never matter on a desktop.

The compilebench benchmark that was used this time is even worse comparing filesystems by their performance in an accelerated IO pattern of a CPU bound task makes no sense under any circumstance. The acceleration of a a compute bound IO pattern has no relevance to anyone in Michael's target audience and mostly, no relevance to anyone anywhere.

In addition, Michael should not only be running benchmarks that model real world workloads, but also synthetic benchmarks designed to determine whether certain code paths are bottlenecked internally in the filesystem driver. A good example of this is the serialization bottleneck imposed by ext4's grabbing the inode lock in its DirectIO, AIO and Vector IO paths or ZFS's per dataset synchronous IO serialization bottleneck imposed by the per-dataset ZIL (this will be fixed later this year). Some discussion of the workloads where each path actually is exercised would serve to illuminate the factors contributing to the numbers in benchmarks simulating real workloads, provided he ever does such benchmarks. So far, he has not.

**darkcoder** · 16 January 2016, 03:41 PM

Originally posted by Xaero_Vincent View Post

Michael, have you done any 3D / graphics performance benchmarks between BSD and Linux before?

E.g., Supertuxkart and Open Arena with the Nvidia proprietary driver on BSD and Linux? A FOSS graphics stack benchmark would be interesting as well.

We already know FOSS graphics stack is still under-performing compared to binary blobs. But a quick NVidia binary test, which is available for all of them if I'm right, would be nice.

**pjezek** · 17 January 2016, 11:18 AM

Originally posted by dimko View Post

Where is my favourite beast of all times, Gentoo? It constantly beats everything thrown at it in most Michaels benchmarks. Including BSD

Is Gentoo comparable by design on Michael's testbed with another linux distros? Think!

**pjezek** · 17 January 2016, 11:22 AM

Inspiring, ryao! But I guess Michal seeks additional testers or a testing team, one man show is limited by productivity.

**SystemCrasher** · 18 January 2016, 09:35 AM

Originally posted by ryao View Post

"testing how it performs out of the box" for readers is impossible when hardware quirks between models introduce a confounding variable that gives one group of readers' systems a handicap and another group of readers' systems no handicap.

That's where we going to call handicapped combos "not recommended for use". Devs are free to fix bugs and/or add workarounds if they're not happy with it. Perfectly valid outcome of benchmark.

It would be a different matter if things were guaranteed to be incorrect *everywhere*. If that were the case, it would be a clear-cut filesystem bug, but it is not the case.

We live in real world with all bugs, quirks and lazy users it has got. If something performs poorly under these conditions - it could be wise to install it using /dev/null as target.

Consequently, Michael's benchmarks have an inherent bias that can either help a filesystem's performance in a comparison by handicapping others or hurt a filesystem's performance in a comparison by handicapping it, depending solely on whether the filesystem has something to detect and correct the quirk.

These are software faults if some SW fails to perform properly on existing HW, not Michael faults. Most users would see what Michael sees, so it is very realistic real-world evaluation.

The correct way to handle this is to make certain each filesystem is configured correctly

Running things in default state is most correct configuration one can get. It matches 90+% of installs and hence brings most valuable result one can get.

and explain this potential issue to people reading the benchmark methodology People making filesystem choices based on those numbers would then be able to make them based on some degree of actual merit rather than a distorted view of reality that arbitrarily favors one or the other.

People usually want to take a look on how various things perform and grab what suits them. That's why defaults MUST be sane. It may be not ZFS fault itself, but underlying layers. That where ZFS devs may want to interfact with Linux devs to sort it out if it hurts 'em, etc. At the end of day it is really fair to evaluate full stack performance on some config and take a look how it performs overall. I see nothing wrong in it.

Considering only performance itself is also a form of bias that does not consider long term reliability in terms of system runtime, data integrity, maintenance, etcetera, although considering only performance correctly is far better than considering only performance incorrectly.

Long term reliability is hard to evaluate within reasonable time. Same for stability, to some degree. And in this regard, ZFS could put nasty surprise. This large monster is quite fragile and unexpected kinds of errors like bad sectors with no redundant data to recover from can bring it down really easy. And then recovery can be not the easiest thing in the world. Speaking of this, btrfs has got quite handy tool which would try to read-out most data from storage in read-only, non-destructive way, using alternate destination to store whatever it reads from damaged storage. When everything has failed, it sounds like a backup plan. IIRC, there is no comparable things for ZFS. Having something similar to Tiramisu recovery suite (does the same on NTFS and FAT32 in windows) but free of charge - counts as major advantage of filesystem tools for me. Uhm, yeah, I have some experience in data recovery from damaged storages, so I value such tooling. I can also admit EXT2-4 fsck, which usually manages to fix most images to degree I can actually mount them and read most data. Which is much better than chopping things in hex editor.

That said, it turns out that ZFS was not affected by this issue in Michael's benchmarks. The FreeBSD block IO layer has a quirk for Michael's test hardware that allows it to overcome the problem that I outlined:

Ok, and if it is not a case in Linux, it could be good idea to ping Linux devs about it and sort it out together, etc, insn't it? If it brings mentioned benefits, it could be worth of it, no?

Linux lacks such quirks and the in-tree Linux filesystems are therefore at a disadvantage.

Sounds interesting. Can you show some benchmarks or other sensible proofs, actually showcasing these disadvantages?

could not go back in time to submit patches to Linus for a quirks list in 2.6.32 onward, although it looks like Michael's current hardware is not in the quirks list. That will be fixed for the next release:

Then Michael would get better benchmarks in next release. And it will be something to talk about. But in next release. Right now he has got these results. That's how it works in real world.

I am unable to discern whether your remarks are the result of bias or the result of ignorance.

It can even be both. Or neither. Who knows?

If they are the result of bias, then your defense of bad benchmark methodology achieves the opposite of what you would expect and your bias can lead you to make poor decisions.

I'm taking a look on Michael benchmarks and refuse to consider his methodology bad. He uses simplest method: just install it and give a shot. Most users are doing the very same, so they would see the very same result, so it is quite valuable result. Actually, it makes sense to select combos which are performing reasonably by default. It can help users/admins to make decisions, choosing least troublesome configurations which take minimum attention/efforts to run. Perfectly fair.

If they are the result of ignorance, then if you rely on these numbers to make decisions, you are more than likely to find yourself obtaining worse performance than you would have obtained had they properly accounted for it, although you might never realize. Either way, you lose.

I would not lose, because I'm actually not that bad in understanding how systems are working. But I can't tune each and every fucking homepage web server. This going to be both waste of time for expert and damn expensive/suboptimal for most uses. So it would be better if things work reasonably. Else there is room to tell about crappy performance and I would consider there is valid point. In no way we can bring high-profile gurus to each and every server running some crappy blog or homepage. So Phoronix benchmark is perfectly valid and shows what most users would actually face.

You will not see "if claims of BSD fans about UFS2 are at least half-true" by any benchmarks done on Phoronix as long as Michael's benchmark methodology is incapable of providing a fair comparison.

If default install of FreeBSD on UFS2 can't unleash it good properties - okay, why someone supposed to dig in this pile of smouldering wreck at all? When you install Linux on ext4, it gives you quite a good performance out of the box. Maybe it could be better, etc. But usually it speaks for itself.

For starters, the Linux mainline filesystems are all handicapped by his refusal to adjust sector size at filesystem creation to compensate for the quirks of his test hardware.

Most filesystems are using 4K-blocks anyway as minimal unit, it also matches page size on most platforms, and modern tools are learned to align paritions properly. If its not a case - ok, we're going to talk about crappy defaults.

OpenBSD and DragonflyBSD might suffer similar issues, although I have not checked.

Either way, it do not have to turn into user's problems. This is fundamentally wrong. No, I understand ppl can be unhappy about drives lying about sector size, or SSDs pretending they can do 512 byte sectors. On other hand.... ok, let assume I know e.g. SSD layout. I.e. NAND page size, erase block/group size, etc. How do I give it to filesystems? E.g. in case of ZFS? Most designs weren't meant to deal with physical layer coming with such a strange properties.

There are also severe issues in the rationale of some of the benchmarks he runs and also other aspects of configuration, such as the imposition of read-modify-write overhead on heavy database workloads (e.g. pgbench) when no database administrator would configure a production database that way.

You're overlyoptimistic about admin skills. Sure, some large companies with heavy loads or just experts in the area would avoid these problems. But some crappy blog or home pages aren't going to enjoy by availability of gurus. And if we are about tuning... okay, on btrfs one can say "nodatacow" and CoW goes away, turning filesystem into thin layer so it no longer interferes with database's own journallling. That's where Michael underestimates Btrfs. But it is kinda specific tuning, coming with some tradeoffs. Btw. am I correct ZFS still lacks similar options to the date? I can imagine this way btrfs can secure quite a major win on DB loads. But at some cost - snapshots and somesuch would not apply to DB. Though snapshotting DB by filesystem's CoW facilities isn't exactly brightest idea ever.

The often cited justification "we only care about desktop performance" makes no sense when things like pgbench (used in other articles) measure workloads that will have never in the past, do not now and will never matter on a desktop.

I guess it depends. I.e. some ppl may want to deal with Postgres on desktop to mess with OSM data & GIS. Why not? Though it isn't most common use for sure.

The compilebench benchmark that was used this time is even worse comparing filesystems by their performance in an accelerated IO pattern of a CPU bound task makes no sense under any circumstance. The acceleration of a a compute bound IO pattern has no relevance to anyone in Michael's target audience and mostly, no relevance to anyone anywhere.

It mimics typical workload on any dev workstatoin. As simple as that. And if I would try to e.g. recompile kernel, I would see similar behavior generally. So it makes some sense. I would agree Michael isn't hardcore guru in benchmarks... that what makes 'em interesting. He manages to pick up some cases probably being something close to his own use cases, etc. I do not see why it should be fundamentally wrong. Sure, there is room for other benchmarks and better tuning. But Michael behaves ... like typical user or web dev. Which makes his results fairly valuable. Can you imagine webdevs doing hardcore tuning of filesystems on their workstations?

In addition, Michael should not only be running benchmarks that model real world workloads, but also synthetic benchmarks designed to determine whether certain code paths are bottlenecked internally in the filesystem driver.

Looks way too synthetic and hardly matches typical real-world loads. So these are mostly interesting for filesystem devs themselves, such benchmarks are of little use for general public. Because it does not anyhow correlates with what ppl would see in their typical workloads. Its okay to hammer designs in attempt to improve them, sure. But TBH I do care about "average" performance in some tasks more or less making some sensible use cases. Even pgbench makes more sense than that - postgres at least sometimes used by those fiddling with OSM maps, which is far more popular case than bottleneck analisys in some corner case.

imposed by ext4's grabbing the inode lock in its DirectIO, AIO and Vector IO paths[/URL] or ZFS's per dataset synchronous IO serialization bottleneck imposed by the per-dataset ZIL (this will be fixed later this year).

Its cool, etc. And its nice people are trying it to learn weak places in their designs. But I doubt it would drastically change how filesystems behave in typical use cases, be it desktop, some web server, etc. Maybe it would affect some selected workloads, sure.

Some discussion of the workloads where each path actually is exercised would serve to illuminate the factors contributing to the numbers in benchmarks simulating real workloads, provided he ever does such benchmarks. So far, he has not.

He do not have to. He behaves mostly like user & web dev. Perfectly valid use case. I have two gazillions of fellow web devs running Linux on their workstations. They are surely interested in benches like this. And I doubt they would give a fuck about mentioned issues, unless it being a problem in some practical, real-world use case. And sorry, but benches aren't supposed to get best of what design can do. Devs may want to see it to "advertise" their design. But users would generally face far more grim picture and would not be happy about exagerrated/synthetic benches.

**kebabbert** · 18 January 2016, 12:45 PM

Originally posted by SystemCrasher View Post

Long term reliability is hard to evaluate within reasonable time. Same for stability, to some degree. And in this regard, ZFS could put nasty surprise. This large monster is quite fragile and unexpected kinds of errors like bad sectors with no redundant data to recover from can bring it down really easy. And then recovery can be not the easiest thing in the world. Speaking of this, btrfs has got quite handy tool which would try to read-out most data from storage in read-only, non-destructive way, using alternate destination to store whatever it reads from damaged storage. When everything has failed, it sounds like a backup plan. IIRC, there is no comparable things for ZFS.

You got it backwards. ZFS is not fragile, it is the most robust filesystem there is out there. ZFS is built with a focus on data integrity, to be robust and protect your data. ZFS goes to great length to do that. It is targeted to the Enterprise sector, where requirements are the highest, where data loss can cost millions of dollars. The Sun team had vast experience of the problems large enterprise storage face, that is why they added checksums to detect data corruption in a unique way. Sure, there are lot of checksums everywhere in the system, but it is done wrong. For isntance, hard disks have lot of checksums on the surface to detect data corruption - and still you get data corruption on hard disks. To add checksums to correctly detect data corruption is not easy to do, but ZFS does it as proven by several researchers, who examined ZFS data integrity capabilities and ZFS is the safest, beating XFS, NTFS, ext3, etc - they are all unsafe except ZFS. ZFS detects and corrects all problems, the other filesystems do not. When you say that ZFS is fragile, you are totally ignorant. It's main purpose is to be robust. That is why it was built. To be safe and robust.

But hey, come to think of it now, I remember one person complaining that ZFS was too brittle and fragile, maybe it was you? That person said that when he used... ext3 (?) he got no reports of data corruption on his PC. Then he switched to ZFS, and the data corruption reports poured in. Well, ZFS detected problems in his hardware that no other filesystem could not do. That is why he got reports of data problems. He on the other hand, said that "ZFS is fragile, it looses my data". Well, several research papers shows that ZFS can detect data corruption problems, whereas other filesystems can not. So, he did not understand that his hardware was unsafe, not ZFS. If you ever visit large ZFS forums, you will see threads occaniously where people reports data corruption problems by ZFS, and at the end, they discover the problem was due to a slightly loose SATA cable, or RAM dimm was faulty, or a flaky PSU, etc. And as soon they fix the hardware problem everything is fine and dandy. ZFS is the only filesystem able to detect such subtle hardware problems, no other filesystem can do it - according to research papers. Here are some research papers:

ZFS - Wikipedia

https://en.wikipedia.org/wiki/ZFS#Data_integrity

Regarding your talk about BTRFS having a repair tool, which ZFS does not have. Well, ZFS is targeted to Enterprise and has been in production for over 10 years now. And until now ZFS fsck or some other tool has not been needed. If your data ZFS zpool ever gets corrupt, you can back in time, to a last known valid state and import the zpool back then. So, you dont need fsck. ZFS data is always valid, and never out of synch. "ZFS does not need no stinking fsck":

404 Not Found

http://c0t0d0s0.org/archives/6071-No,-ZFS-really-doesnt-need-a-fsck.html

http://c0t0d0s0.org/index.php?serendipity[action]=search&serendipity[fullentry]=1&serendipity[searchTerm]=fsck&serendipity[searchButton]=%3E

**kgardas** · 19 January 2016, 04:34 AM

Interesting how well ZFS performs in a form of PC-BSD. I'm Solaris user since old SXDE days and ZFS is one of the reason for using this OS as a developer workstation even the software stack is not there. So well, I do have quite a lot of experience with ZFS from the user point of view on this deployment size. Honestly ZFS simply rocks as long as you are not hit by its fragmentation. If you are hit, then well, it sucks performance wise -- just performance, all the feature is still here. My data pool is already quite old and seen 2 upgrades already 500 GB -> 750 GB -> 1000 GB -- this was done by using automatic growing functionality so I just added two new drives to the mirror (of two drives), waited for resilvering to finish and then unplugged old smaller drives. This way I have it running for several years and now I'm afraid this beast is slower than even OpenBSD ffs (running on top of software RAID 1 with checksumming support to add data integrity.) :-)
Looking forward how btrfs/hammer2 are going to solve fragmentation issue on COW fss. Nice engineering challenge indeed.

**SystemCrasher** · 19 January 2016, 06:18 AM

Originally posted by kebabbert View Post

You got it backwards. ZFS is not fragile, it is the most robust filesystem there is out there. ZFS is built with a focus on data integrity, to be robust and protect your data.

Sounds like Sun's marketing bullshit.

ZFS goes to great length to do that. It is targeted to the Enterprise sector, where requirements are the highest, where data loss can cost millions of dollars.

Yet, googling a bit can yeld some funny "success stories". When someone tells "Enterprise" you should expect catch, and marketing promises would wildly exceed practical performance.

The Sun team had vast experience of the problems large enterprise storage face, that is why they added checksums to detect data corruption in a unique way. Sure, there are lot of checksums everywhere in the system, but it is done wrong.

Checksumming isn't worst part of ZFS, sure. They used quite good tradeoffs in terms of checksumming strength vs speed. On other hand, many other engineering solutions are much less sound. No defragger? While CoW inherently means fragments on write? And only lame mumblings about adding new drives? Skipping "good luck to remove them later" catch? Staying silent about bad sectors would bring thing down and no tools would be able to fix that? That's what called MARKETING BULLSHIT. Sure, sun was good at it. That's what I dislike about Sun and their products.

For isntance, hard disks have lot of checksums on the surface to detect data corruption - and still you get data corruption on hard disks [..sun marketing..].

XFS can't do full journalling, to begin with. It would journal metadata only, so you can get half-old, half-new data in file if crash happens. There is no data to complete or rollback DATA part. Needless to say, file can get unusable. So speaking about reliability sounds funny for such designs. In classic designs, journalling takes one write to journal and one into storage. Hence writing data twice. It kills write speed by at least 2x. Nobody in sane mind uses e.g. EXT4 in this mode. So XFS, JFS and some few others do not even implement full journalling. OTOH CoW based things avoid it in some smartass way, where whole storage area is basically a "journal". That's why most futuristic designs are centered around CoW these days. Full journalling at no-journalling speed. But there're some other catches

.

But hey, come to think of it now, I remember one person complaining that ZFS was too brittle and fragile, maybe it was you? That person said that when he used... ext3 (?) he got no reports of data corruption on his PC.

I guess it wasn't me. I can't remember when I've used EXT3. I've migrated EXT4 & XFS quite some years ago. Now I'm moving more and more storages to btrfs.

Regarding your talk about BTRFS having a repair tool, which ZFS does not have. Well, ZFS is targeted to Enterprise and has been in production for over 10 years now.

I do not give a fuck about "since 1929" stuff, but I do care about technical excellence, solving difficult challenges, good performance, interesting designs, good algos, inner working of things, etc.

And until now ZFS fsck or some other tool has not been needed.

Come on, keep quoting Sun's marketing BS. Yet, if we google a bit, we'll see some ppl around who had ZFS fucked up and .. no good means to recover. It do not have to be fsck, etc, but if filesystem's metadata are damaged to degree driver can't mount it properly, it could be good idea to have some "plan B", either to parse all metadata, check them carefully and recover like fsck tools do, or at least being able to read data back with relaxed demands on metadata integrity. Just failing and leaving me with hex editor does not sounds great. And uhm, yeah, enterprises can pay some few thousand dollars to guys with hexeditors, if in dire need, any day. But even then, it may or may not work.

If your data ZFS zpool ever gets corrupt, you can back in time, to a last known valid state and import the zpool back then.

This makes vague assumption all involved data are valid and it works reasonably. But it may or may not be a case. CoW can have rather specific failure modes and snapshots are anything but backup replacement. E.g. if someone had 5 snapshots, they shared blocks and bad sectors appeared here, all 5 snapshots are going to be ruined at once. Going back in time may or may not succeed. Depends on how badly metadata are damaged, and actually, if one got severe corruption, trying tricks like this can easily make things worse, because CoW can get stuck half way while trying to rewind back, hanging in the middle of nowhere. Read-only recovery to another destination allows to avoid this problem - it does not touches original storage, so failure at least can't develop any further. And one can try to select various "entry points" to try to parse them.

So, you dont need fsck. ZFS data is always valid, and never out of synch.

Unless there is some pesky bad sector appears and it happens there was no redundancy. At this point ZFS easily fucked up and really hard to repair. Because Sun architects preferred to mumble "always valid" and had no plan B for scenarios this assumption has failed. And I've seen a few, where it has been the case. So ppl had a lot of very specific fun. I think it can be better than that and not really fond being fed by marketing BS instead of technical solutions.

**dimko** · 02 February 2016, 10:16 AM

Originally posted by pjezek View Post

Is Gentoo comparable by design on Michael's testbed with another linux distros? Think!

I can dream, right?

Announcement

How Three BSD Operating Systems Compare To Ten Linux Distributions

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment