
Large HDD/SSD Linux 2.6.38 File-System Comparison


  • energyman
    replied
    Originally posted by tytso View Post
    Sure but you're begging the question of what is "normal conditions". Are you always going to fill a file system to 10% of capacity, and then reformat it, and then fill it to 10% again? That's what many benchmarkers actually end up testing. And so a file system that depends on the garbage collector for correct long-term operation, but which never has to garbage collect, will look really good. But does that correspond to how you will use the file system?

    What is "basic conditions", anyway? That's fundamentally what I'm pointing out here. And is performance really all people should care about? Where does safety factor into all of this? And to be completely fair to btrfs, it has cool features --- which is cool, if you end up using those features. If you don't then you might be paying for something that you don't need. And can you turn off the features you don't need, and do you get the performance back?

    For example, at $WORK we run ext4 with journalling disabled and barriers disabled. That's because we keep replicated copies of everything at the cluster file system level. If I were to pull a Hans Reiser, and shipped ext4 with its defaults set to have the journal and barriers disabled, it would be faster than ext2 and ext3, and most of the other file systems in the Phoronix file system comparison. But that would be bad for the desktop users of ext4, and that to me is more important than winning a benchmark demolition derby.

    -- Ted
    You mean reiser4? Which yells loudly if barriers are not supported and goes into sync mode?
    Why not use your own creation as an example of dumb defaults - ext3?


  • mtippett
    replied
    Originally posted by tytso View Post
    The right answer would be to use something like the Impressions tool to "age" the file system before doing the timed benchmark part of the test (see: http://www.usenix.org/events/fast09/...es/agrawal.pdf).
    Has this been done previously? The Impressions tool presentation only talks about making a file system look similar to an aged one; the paper didn't actually attempt to benchmark it in that state. I understand the value of it intellectually, but I would also assume that some file systems would behave very differently between the fresh and aged states.

    The fundamental question is what are you trying to measure? What is more important? The experience the user gets when the file system is first installed, or what they get a month later, and moving forward after that?
    100% agree. There are thousands of measures and thousands of conditions that can be applied. What Michael and I try to listen for is the scenario and the potential measure that can be used. OpenBenchmarking and PTS provide the visibility and the repeatability, respectively. The harder part is determining the Configuration Under Test and preparing the System Under Test to suit.

    We know that for each scenario presented, a vocal minority will see it as pointless...


  • tytso
    replied
    One more thought... the fact that TRIM requests can hurt in the short term, while preserving SSD performance in the long term, is something that disadvantages btrfs (which has an SSD option that I believe does use TRIM) and might be an advantage for ext4. So it's another example of how not doing apples-to-apples comparisons can lead to misleading results --- and since this is one that ext4 benefits from, hopefully I won't be accused of complaining out of sour grapes just because I think ext4 should have done better in the benchmark comparisons.

    Yes, I understand the argument that most people don't mess with the defaults, and so the defaults should matter --- but at the same time, when some file systems are unsafe out of the box, it seems misleading not to call that out. And if a file system happens to have great performance when it is freshly formatted, but might degrade badly once the file system is aged, that is to me an indication that the benchmark isn't doing a good job.

    Quite frankly, the primary way I think benchmarks are useful is as a tool for improving a particular file system. I might compare against another file system just to understand what might be possible, but then I'll want to understand exactly why it was faster than my file system, in that particular workload and configuration --- and then I may decide to try to improve things, or might decide that on balance, disabling barriers by default isn't a fair thing to do to my user base.

    Competitive benchmarking is always subject to gaming, for people who are into doing that. And that's primarily driven by marketing folks who spend millions of dollars doing that in the commercial world for enterprise databases, for example. Very often those results are completely unrelated to how most people use their databases, but it's important for marketing purposes.

    A frequent complaint about the Phoronix benchmarks is that they are only useful for driving advertising revenue by driving page hits. I don't think that's entirely fair, but I do think they aren't as useful as they could be, and at least today they certainly aren't useful for helping users decide which file system to use. The main way I use them is to look at the long-term trends and see if there are any performance improvements or regressions. (And one shortcoming for this purpose is that it would be ideal if there were multiple hardware configurations, including some high-end configurations with 4, 8, and 16 CPUs, as well as high-end RAID storage. But I understand Phoronix is budget constrained and high-end hardware is expensive.)

    Note though that I'm comparing across kernel versions, not between file systems --- and sometimes there is a good reason for a performance drop, such as improving data safety in the face of a power failure. At least for me, that's going to be a higher priority than performance. (Or at least, it should be, as the default option. Maybe I'll have an unsafe option for people with specialized needs who know what they are doing; but the default should optimize for safety, assuming non-buggy application programs.)


  • tytso
    replied
    [QUOTE]The current thinking is that it's better to batch discards, and every few hours, issue a FITRIM ioctl request which will cause the disk to send discards on blocks which it knows to be free. [/QUOTE]

    Oops, that should be, "... issue a FITRIM ioctl request which will cause the file system to send discards..."

    Sorry for the typo, but this bboard doesn't let you edit posts a minute after they've been saved, and I didn't notice this until now.


  • tytso
    replied
    Originally posted by mtippett View Post
    Note that as per Documentation/filesystems/ext4.txt in the kernel

    Code:
    	discard		Controls whether ext4 should issue discard/TRIM
    	nodiscard(*)	commands to the underlying block device when
    				blocks are freed.  This is useful for SSD devices
    				and sparse/thinly-provisioned LUNs, but it is off
    				by default until sufficient testing has been done.
    The option is potentially putting data at risk.
    Actually, there's only been one report where using trim caused a disk drive (vendor withheld to protect the guilty, and because I don't know; the distribution which reported this to me had signed an NDA with the vendor) to brick itself. That was probably a case of a firmware bug --- but the problem is that regardless of whether the bug is with the disk drive or not, if anything goes wrong when previously things had been working O.K., they blame the kernel developers.

    The bigger problem is that for some SSD's, issuing a large number of TRIM requests actually trashes performance. That's because you have to flush the NCQ queue before you can issue a discard request, thanks to a brain-dead design decision by the good folks in the T10 standards committee. Hence, a discard costs almost as much as a barrier request, and for some SSD's, could actually be more expensive (because they take a long time to process a TRIM request), and so could cause a localized decrease in performance if you happen to have an operation mix that includes file deletes alongside other read/write operations.

    The current thinking is that it's better to batch discards, and every few hours, issue a FITRIM ioctl request which will cause the disk to send discards on blocks which it knows to be free. This should have less impact than issuing a discard after every single file delete, which is what currently happens if you enable the discard mount option in ext4. The FITRIM ioctl is in the latest kernels, and the userspace daemon will be coming soon. (It's posted on LKML, but I doubt any distros have packaged it yet.)
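
    For illustration, a minimal sketch of issuing such a batched trim from userspace, assuming a kernel and headers that expose FITRIM and struct fstrim_range (Linux 2.6.37+); the mount point path is only an example:

    Code:
    	/* Minimal sketch: one batched FITRIM over a mounted file system,
    	   instead of a discard after every single file delete. */
    	#include <fcntl.h>
    	#include <stdint.h>
    	#include <stdio.h>
    	#include <sys/ioctl.h>
    	#include <unistd.h>
    	#include <linux/fs.h>
    	
    	int main(int argc, char **argv)
    	{
    	    const char *mntpoint = (argc > 1) ? argv[1] : "/";  /* example path */
    	    struct fstrim_range range = {
    	        .start  = 0,
    	        .len    = UINT64_MAX,   /* cover the whole file system */
    	        .minlen = 0,            /* trim any free extent, however small */
    	    };
    	
    	    int fd = open(mntpoint, O_RDONLY);
    	    if (fd < 0) { perror("open"); return 1; }
    	
    	    /* The file system walks its free-space map and sends the discards;
    	       on return, range.len holds the number of bytes trimmed. */
    	    if (ioctl(fd, FITRIM, &range) < 0) { perror("FITRIM"); close(fd); return 1; }
    	
    	    printf("trimmed %llu bytes on %s\n", (unsigned long long)range.len, mntpoint);
    	    close(fd);
    	    return 0;
    	}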

    In all likelihood, enabling discard for a file system won't help the benchmark a whole lot, since the performance advantage of using TRIM is a long-term advantage; and if the file system has been fully TRIM'ed at mkfs time, it's unlikely that the benchmark will have done enough writes that the SSD performance will degrade during the benchmark run. In fact, if the SSD takes time to process TRIM requests, you might actually get better performance by disabling the TRIM requests, just as you will get better short-term performance if you disable nilfs2's log cleaner. (Long-term it will hurt you badly, but often benchmarks don't test long-term results; that's my concern about benchmarks that don't pre-age the file system before beginning the benchmark run.)

    Originally posted by mtippett View Post
    I've argued similar points previously on these forums, as well as in QEMU/KVM. A blazingly fast SQLite result will usually imply that sync operations are being ignored, which puts the data at risk when used for other loads. In the QEMU/KVM issue I chased down, it was true that barriers were being dropped in the QEMU block layer. (That was 3 weeks of finger-pointing between projects that I don't want to relive.)

    So until the maintainers of the filesystem want to enable a performance optimization by default, you need to be _really_ careful with it. If they even suggest it might be risky, then caveat emptor.
    Very true. It's worse because we don't have technical writers at our disposal, so we don't always have time to write detailed memos describing how best to optimize your workload. I wish we did, and that's largely on us. But if people are willing to help out on http://ext4.wiki.kernel.org, please let me know. It needs a lot of love.

    BTW, one time when it might be OK to disable barriers is if you have a UPS that you absolutely trust, and the system is configured to shut itself down cleanly when the UPS reports that its battery is low. Oh, and it might be a good idea to put a big piece of tape (or an ungrounded wire) over the power switch....

    (In case it wasn't obvious, the ungrounded wire was a BOFH-style joke; please don't do it in real life. :-)
    Last edited by tytso; 11 March 2011, 07:30 PM. Reason: typo


  • mtippett
    replied
    Originally posted by skeetre View Post
    Here are my results with noatime and discard on OCZ Vertex 2:
    http://openbenchmarking.org/result/1...SKEE-110309125
    Note that as per Documentation/filesystems/ext4.txt in the kernel

    Code:
    	discard		Controls whether ext4 should issue discard/TRIM
    	nodiscard(*)	commands to the underlying block device when
    				blocks are freed.  This is useful for SSD devices
    				and sparse/thinly-provisioned LUNs, but it is off
    				by default until sufficient testing has been done.
    The option is potentially putting data at risk, similar to tytso's comment earlier in the thread:

    Where does safety factor into all of this?
    I've argued similar points previously on these forums, as well as in QEMU/KVM. A blazingly fast SQLite result will usually imply that sync operations are being ignored, which puts the data at risk when used for other loads. In the QEMU/KVM issue I chased down, it was true that barriers were being dropped in the QEMU block layer. (That was 3 weeks of finger-pointing between projects that I don't want to relive.)

    So until the maintainers of the filesystem want to enable a performance optimization by default, you need to be _really_ careful with it. If they even suggest it might be risky, then caveat emptor.
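
    As a concrete illustration of what "sync operations being ignored" means, here is a minimal sketch (purely illustrative, not taken from any of the tests above) of the write-then-fsync pattern whose real cost depends on flushes actually reaching the device; if a benchmark reports this as nearly free, durability is being traded away somewhere in the stack:

    Code:
    	/* Illustrative only: a tight write+fsync loop.  Each fsync() should
    	   force the data (and, with barriers enabled, the drive cache) to
    	   stable storage.  If this runs implausibly fast, the flushes are
    	   being dropped, and so is the durability of the data. */
    	#include <fcntl.h>
    	#include <stdio.h>
    	#include <unistd.h>
    	
    	int main(void)
    	{
    	    int fd = open("durability-probe.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    	    if (fd < 0) { perror("open"); return 1; }
    	
    	    const char record[] = "commit\n";
    	    for (int i = 0; i < 1000; i++) {
    	        if (write(fd, record, sizeof record - 1) < 0) { perror("write"); return 1; }
    	        if (fsync(fd) < 0) { perror("fsync"); return 1; }
    	    }
    	
    	    close(fd);
    	    return 0;
    	}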


  • jbrown96
    replied
    Originally posted by locovaca View Post
    Well, since the distribution is the end-user version of Ubuntu, which is marketed to a more casual user, I would expect the file system to receive a modest load of files (installation), then see mainly small reads and writes over the course of its lifetime (logs, home folder), with some occasional larger writes (software installation, maybe a CD rip). I believe Ubuntu's default partitioning scheme is one big file system plus a swap partition, so this is the configuration I'd expect to see with this test. So yes, assuming a 10% full file system is probably OK given this set of assumptions.


    If the defaults of the file system are not ok, either link to the bug report or it's not really an issue.
    You do realize that you're responding to Ted Ts'o, the creator of the ext4 file system, right?


  • tytso
    replied
    Originally posted by locovaca View Post
    Well, since the distribution is the end-user version of Ubuntu, which is marketed to a more casual user, I would expect the file system to receive a modest load of files (installation), then see mainly small reads and writes over the course of its lifetime (logs, home folder), with some occasional larger writes (software installation, maybe a CD rip). I believe Ubuntu's default partitioning scheme is one big file system plus a swap partition, so this is the configuration I'd expect to see with this test. So yes, assuming a 10% full file system is probably OK given this set of assumptions.
    Yes, but you're not constantly reformatting the file system (i.e., reinstalling the distribution) over and over again. That is, the file system is allowed to age. So a month later, with a copy-on-write file system, the free space will all have been written to and will potentially be quite fragmented. But the benchmarks don't take this into account. They use a freshly formatted file system each time --- which is good for reproducibility, but it doesn't model what you will see in real life a month or 3 months later.

    The right answer would be to use something like the Impressions tool to "age" the file system before doing the timed benchmark part of the test (see: http://www.usenix.org/events/fast09/...es/agrawal.pdf).
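
    As a rough illustration only (a crude stand-in, not the Impressions tool itself), an aging pass could create and delete files of varying sizes on the target file system so that free space is already fragmented before the timed run; the /mnt/test path and the loop counts here are placeholders:

    Code:
    	/* Crude file-system "aging" pass: create files of varying sizes,
    	   then delete roughly every other one, leaving fragmented free
    	   space behind before the timed benchmark starts. */
    	#include <stdio.h>
    	#include <stdlib.h>
    	#include <string.h>
    	
    	int main(void)
    	{
    	    char path[256];
    	    char buf[4096];
    	    memset(buf, 'x', sizeof buf);
    	
    	    for (int round = 0; round < 50; round++) {
    	        for (int i = 0; i < 1000; i++) {
    	            snprintf(path, sizeof path, "/mnt/test/age-%d-%d", round, i);
    	            FILE *f = fopen(path, "w");
    	            if (!f) { perror("fopen"); return 1; }
    	            int blocks = 1 + rand() % 256;          /* 4 KiB .. ~1 MiB */
    	            for (int b = 0; b < blocks; b++)
    	                fwrite(buf, 1, sizeof buf, f);
    	            fclose(f);
    	        }
    	        for (int i = 0; i < 1000; i += 2) {         /* leave holes behind */
    	            snprintf(path, sizeof path, "/mnt/test/age-%d-%d", round, i);
    	            remove(path);
    	        }
    	    }
    	    return 0;
    	}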

    The fundamental question is what are you trying to measure? What is more important? The experience the user gets when the file system is first installed, or what they get a month later, and moving forward after that?


  • skeetre
    replied
    Originally posted by stqn View Post
    Someone said it before, but I don't get the point of benchmarking ext4 on an SSD without the discard option (and maybe noatime.)

    An SSD benchmark would in fact be a good place to tell people they should use discard, for the few who wouldn't know it already.
    Here are my results with noatime and discard on OCZ Vertex 2:
    http://openbenchmarking.org/result/1...SKEE-110309125
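
    For anyone wanting to reproduce that configuration, the options normally go in /etc/fstab or on the mount command line; as a sketch, the equivalent through the mount(2) system call, with a placeholder device and mount point, would be:

    Code:
    	/* Sketch only: mount an ext4 file system with noatime + discard.
    	   /dev/sdb1 and /mnt/ssd are placeholders; the usual route is an
    	   /etc/fstab entry or `mount -o discard,noatime`. */
    	#include <stdio.h>
    	#include <sys/mount.h>
    	
    	int main(void)
    	{
    	    /* MS_NOATIME is a generic mount flag; "discard" is passed through
    	       to ext4 as a file-system-specific option string. */
    	    if (mount("/dev/sdb1", "/mnt/ssd", "ext4", MS_NOATIME, "discard") != 0) {
    	        perror("mount");
    	        return 1;
    	    }
    	    return 0;
    	}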


  • locovaca
    replied
    Originally posted by tytso View Post
    Sure but you're begging the question of what is "normal conditions". Are you always going to fill a file system to 10% of capacity, and then reformat it, and then fill it to 10% again? That's what many benchmarkers actually end up testing. And so a file system that depends on the garbage collector for correct long-term operation, but which never has to garbage collect, will look really good. But does that correspond to how you will use the file system?

    What is "basic conditions", anyway? That's fundamentally what I'm pointing out here. And is performance really all people should care about? Where does safety factor into all of this? And to be completely fair to btrfs, it has cool features --- which is cool, if you end up using those features. If you don't then you might be paying for something that you don't need. And can you turn off the features you don't need, and do you get the performance back?

    For example, at $WORK we run ext4 with journalling disabled and barriers disabled. That's because we keep replicated copies of everything at the cluster file system level. If I were to pull a Hans Reiser, and shipped ext4 with its defaults set to have the journal and barriers disabled, it would be faster than ext2 and ext3, and most of the other file systems in the Phoronix file system comparison. But that would be bad for the desktop users of ext4, and that to me is more important than winning a benchmark demolition derby.

    -- Ted
    Well, since the distribution is the end-user version of Ubuntu, which is marketed to a more casual user, I would expect the file system to receive a modest load of files (installation), then see mainly small reads and writes over the course of its lifetime (logs, home folder), with some occasional larger writes (software installation, maybe a CD rip). I believe Ubuntu's default partitioning scheme is one big file system plus a swap partition, so this is the configuration I'd expect to see with this test. So yes, assuming a 10% full file system is probably OK given this set of assumptions.

    Originally posted by TonsOfPeople
    Wah, the default didn't set xxx, that's horrible
    If the defaults of the file system are not ok, either link to the bug report or it's not really an issue.
