Another Look At The Bcachefs Performance on Linux 6.7
Originally posted by waxhead:
I understand what you mean, but I still think it would be valuable to see how each filesystem copes when running on "ideal hardware", without any latency introduced by the hardware itself (SSD/HDD). I agree that your workload is the only thing that matters, but why exactly do you think it is useless? If filesystem X takes 10 seconds doing operation A and filesystem Y only takes 5 seconds, that tells you something about the efficiency of the algorithms being used. Yes, the results may be completely reversed on real hardware depending on access patterns, but since we are moving more and more towards storage devices that mostly behave like "regular RAM", I still think it would be a good test. I would love to hear more about why you think such a test is useless?
I do think tests in isolation are useless, because they will never be indicative of what a filesystem will actually be put through; there are thousands upon thousands of potential variables.
A few questions / points:
- Why was a 512-byte block size used for bcachefs but 4k for all the other filesystems? SSDs virtually always perform much better with a 4k block size (though I haven't personally tested the difference on bcachefs).
- This article doesn't mention whether the bcachefs `discard` option was enabled. For this to be a fair test, that option should probably be disabled. (With that option enabled, which it is by default, bcachefs essentially does a "trim" on its own as blocks are freed. Trim operations are usually pretty slow, especially on consumer SSDs. The other filesystems don't automatically trim blocks; the administrator needs to manually run something like `fstrim`, so the trim penalty wouldn't be included in their results.)
- This article doesn't mention what was done in between filesystems to reset the SSD back to its "factory" state. Without this, the filesystems tested first would have a huge advantage over the ones tested later, and you can't fix that just by waiting an hour or a day between tests. Something like `nvme format /dev/nvme0n1 -s 2` might work (but double-check for the specific drive). For SATA drives a security erase usually does the trick, and that `nvme format` command does something similar for NVMe drives. Or simply use a different, identical, brand-new SSD for each filesystem's tests. If you're not sure how to do a complete reset of the drive's state, something like `blkdiscard /dev/nvme0n1` would be much better than nothing. (But in any case, see the caveat below.)
- Because SSDs do garbage collection in the background, to get the most accurate results the tests should be run in the same order for all the filesystems, with no filesystem operations besides the tests themselves. The amount of time before and between each test should also be the same. I'd recommend using a script that includes the reset and every test so there's zero idle time between tests. Was that done here? Manually starting individual tests whenever you get around to it is a good way to get non-meaningful results (not that I have any reason to believe that happened here).
- The Flexible IO Tester (fio) results (up to 1.5M IOPS) suggest a test setup I wouldn't consider useful, because it only measures how quickly the SSD's controller can talk to its DRAM cache for very short bursts, not what the SSD can actually sustain over time. No SSD can really do 1.5M random 4k IOPS while actually writing to the flash. (SSD manufacturers advertise the same largely useless kind of number for their consumer SSDs, I guess because they assume regular consumers and reviewers don't know better. Enterprise SSD spec sheets usually show much more useful sustained IOPS, which is why enterprise SSD numbers can look a lot slower at first glance even though they're often much faster in reality.) IMHO, if you're going to go through the trouble of running fio tests, they should be set up with a random dataset large enough to completely overwhelm the SSD's DRAM and SLC caches, so they actually measure what the SSD can sustain over long periods of time.
- To make sure the benchmark numbers for each filesystem are as accurate as possible, the entire test run for all filesystems should be repeated multiple times, for example three times on three successive days. If your results for a particular filesystem aren't reasonably close on each day, then you know your test environment and controls need more work.
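Taken together, the reset-then-fixed-order advice above could be sketched as a single driver script. Everything concrete here is an assumption to adapt: the device path (`/dev/nvme0n1`), the mount point, the filesystem list, and the fio parameters (`--size=200G` is just meant to be large enough to blow past the drive's DRAM and SLC caches). By default it only echoes the commands (dry run), since they are destructive:

```shell
#!/bin/sh
# Hypothetical benchmark driver: same reset and same test order for every
# filesystem, with no idle gaps between steps.
DEV=/dev/nvme0n1            # assumption -- adjust for your hardware
RUN=${RUN:-echo}            # dry run by default; set RUN= (empty) to execute

for FS in ext4 xfs btrfs bcachefs; do
    # 1. Reset the SSD to a known state (secure format; discard as fallback).
    $RUN nvme format "$DEV" -s 2 || $RUN blkdiscard "$DEV"

    # 2. Create the filesystem, with a 4k block size across the board.
    case "$FS" in
        ext4)     $RUN mkfs.ext4 -b 4096 "$DEV" ;;
        xfs)      $RUN mkfs.xfs -b size=4096 "$DEV" ;;
        btrfs)    $RUN mkfs.btrfs "$DEV" ;;
        bcachefs) $RUN bcachefs format --block_size=4096 "$DEV" ;;
    esac

    # 3. Mount and run an identical, sustained test sequence every time.
    $RUN mount "$DEV" /mnt
    $RUN fio --directory=/mnt --name=sustained --rw=randwrite --bs=4k \
             --size=200G --time_based --runtime=1800 \
             --ioengine=libaio --iodepth=32 --direct=1
    $RUN umount /mnt
done
```

Running the whole thing from one script also makes the repeat-on-successive-days comparison meaningful, since the timing between steps stays constant.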
Originally posted by waxhead:
Personally I think filesystem tests should be performed on a block device in RAM.
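For reference, a RAM-backed block device like the one suggested above can be set up with the kernel's `brd` module; the size, filesystem, and mount point below are assumptions (note that `rd_size` is in KiB). The script echoes the commands by default since they need root:

```shell
#!/bin/sh
# Hypothetical setup for benchmarking a filesystem on a RAM-backed block
# device, so no SSD/HDD latency enters the picture.
RUN=${RUN:-echo}      # dry run by default; set RUN= (empty) to execute as root
SIZE_KIB=8388608      # brd's rd_size is in KiB, so this is an 8 GiB RAM disk

$RUN modprobe brd rd_nr=1 rd_size="$SIZE_KIB"   # creates /dev/ram0
$RUN mkfs.ext4 -b 4096 /dev/ram0
$RUN mount /dev/ram0 /mnt
# ... run the benchmark suite against /mnt here ...
$RUN umount /mnt
$RUN rmmod brd        # frees the RAM again
```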
Originally posted by Berniyh:
Really? I can think of many things that made Linux popular, but this isn't one of them.
I wouldn't even agree that it's really part of the core development technique for most tools.
I mean … one of the core components in the Linux ecosystem was/is X11 and that fails hugely in terms of KISS.
Last edited by cj.wijtmans; 30 November 2023, 03:35 PM.
Originally posted by hyperchaotic:
And by a monolithic beast you mean a highly modular init and runtime management/admin system comprising many optional components, each with their own purpose.
I'm a bit older than probably most of you here so I actually know and remember from my own professional experience what made Unix (and now Linux) such a force to be reckoned with over the last 40+ years. It really was in large part due to the KISS (Keep It Simple, Sir) philosophy. For Unix and Linux that boiled down to the idea that each program does exactly one thing and does it well. In practice this allowed vendors of integrated solutions to customize their systems to their specific needs in a way that's not as easy with more monolithic, tightly integrated software stacks. Need a `login` with different features? Just replace it with a different one without having to worry too much about it breaking half the system. Same thing for `syslogd`, the initrd system and lots of other things that have been and are being consumed by Systemd.
Personally I haven't figured out what problem(s) Systemd is supposed to solve that no other system can in the more traditional way. Which is part of the reason why I run Alpine Linux (with OpenRC) on my personal servers and Artix Linux (with OpenRC) on my personal workstations. Obviously professionally I do have to deal with Systemd, though.
It's a matter of personal preference and people can run whatever they want. But it does annoy me a little that pro-Systemd people often appear to go out of their way to make it much more difficult for useful software to be used without Systemd seemingly for no real benefit to the users. Obviously, that could simply be a matter of erroneous perception but it does sometimes seem that way.
Originally posted by Berniyh:
Well, it failed me 3 times in the last 10-15 years. btrfs, so far, has not yet failed me in roughly the last 10 years. So now what?
Of course, nothing can actually replace backups other than more backups.
But, as an ext4 user, how do you even know that you have faulty data and that you need your backup?
tbh, (data) checksumming support should be a standard thing in filesystems today.
(Edit: and yes, that's what happened to me. Faulty data I noticed by pure coincidence.)
Though to be fair, both losses were like 10 years ago. I haven't had any problems with btrfs in 10 years.
Last edited by carewolf; 30 November 2023, 04:47 PM.
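As a rough sketch of what the "how would an ext4 user even notice faulty data" point asks for: on filesystems without data checksums, a userspace manifest of SHA-256 hashes can be built after known-good writes and re-checked later to catch silent corruption. The directory and manifest paths here are placeholders:

```shell
#!/bin/sh
# Hypothetical userspace stopgap for filesystems without data checksumming
# (e.g. ext4): record hashes once, re-verify later to detect bit rot.
DATA=${DATA:-/mnt/data}             # directory to protect -- an assumption
MANIFEST=${MANIFEST:-manifest.sha256}

# Build (or rebuild) the manifest after writes you trust:
build() { (cd "$DATA" && find . -type f -exec sha256sum {} +) > "$MANIFEST"; }

# Verify: exits non-zero and names any file whose contents have changed.
check() { (cd "$DATA" && sha256sum --quiet -c) < "$MANIFEST"; }
```

This only detects corruption rather than preventing it, and only for data at rest since the last `build`; a checksumming filesystem like btrfs or bcachefs does the equivalent on every read.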