Large HDD/SSD Linux 2.6.38 File-System Comparison


  • #51
    Originally posted by skeetre View Post
    Here are my results with noatime and discard on OCZ Vertex 2:
    http://openbenchmarking.org/result/1...SKEE-110309125
    Note that as per Documentation/filesystems/ext4.txt in the kernel

    Code:
    	discard		Controls whether ext4 should issue discard/TRIM
    	nodiscard(*)	commands to the underlying block device when
    				blocks are freed.  This is useful for SSD devices
    				and sparse/thinly-provisioned LUNs, but it is off
    				by default until sufficient testing has been done.
    The option is potentially putting data at risk. Similar to tytso's comment earlier in the thread:

    Where does safety factor into all of this?
    I've argued similar points previously on these forums, as well as in QEMU/KVM. A blazingly fast SQLite result will usually imply that sync operations are being ignored, which puts the data at risk under other loads. In the QEMU/KVM issue I chased down, it was true that barriers were being dropped in the QEMU block layer. (That was three weeks of finger-pointing between projects I don't want to relive.)

    So until the maintainers of a filesystem are willing to enable a performance optimization by default, you need to be _really_ careful with it. If they even suggest it might be risky, then caveat emptor.
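
    To make the option concrete, here is a minimal, hypothetical sketch of how noatime and discard map onto the mount(2) system call: noatime is a generic VFS mount flag, while discard is ext4-specific and is passed in the data string. The device and mount point below are made up.

    Code:
    /* Hypothetical sketch only -- not a recommendation to enable discard.
     * /dev/sdb1 and /mnt/ssd are made-up names. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* MS_NOATIME is the VFS-level noatime flag; "discard" travels in
         * the filesystem-specific data argument. */
        if (mount("/dev/sdb1", "/mnt/ssd", "ext4", MS_NOATIME, "discard") != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }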



    • #52
      Originally posted by mtippett View Post
      Note that as per Documentation/filesystems/ext4.txt in the kernel

      Code:
      	discard		Controls whether ext4 should issue discard/TRIM
      	nodiscard(*)	commands to the underlying block device when
      				blocks are freed.  This is useful for SSD devices
      				and sparse/thinly-provisioned LUNs, but it is off
      				by default until sufficient testing has been done.
      The option is potentially putting data at risk.
      Actually, there's only been one report where using trim caused a disk drive (vendor withheld to protect the guilty, and because I don't know; the distribution which reported this to me had signed an NDA with the vendor) to brick itself. That was probably a case of a firmware bug --- but the problem is that regardless of whether the bug is with the disk drive or not, if anything goes wrong when previously things had been working O.K., they blame the kernel developers.

      The bigger problem is that for some SSD's, issuing a large number of TRIM requests actually trashes performance. That's because you have to flush the NCQ queue before you can issue a discard request, thanks to a brain-dead design decision by the good folks in the T13 standards committee. Hence, a discard costs almost as much as a barrier request, and for some SSD's, could actually be more expensive (because they take a long time to process a TRIM request) and so could cause a localized decrease in performance if you happen to have an operation mix that includes file deletes alongside other read/write operations.

      The current thinking is that it's better to batch discards, and every few hours, issue a FITRIM ioctl request which will cause the disk to send discards on blocks which it knows to be free. This should have less impact than issuing a discard after every single file delete, which is what currently happens if you enable the discard mount option in ext4. The FITRIM ioctl is in the latest kernels, and the userspace daemon will be coming soon. (It's posted on LKML, but I doubt any distro's have packaged it yet.)
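
      For the curious, here is a minimal, hypothetical sketch of what one batched FITRIM call looks like from userspace; this is an illustration, not the daemon posted to LKML, and the mount point is made up.

      Code:
      /* Hypothetical sketch: issue one batched discard over a whole
       * mounted filesystem via the FITRIM ioctl.  /mnt/ssd is made up. */
      #include <stdio.h>
      #include <string.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <sys/ioctl.h>
      #include <linux/fs.h>          /* FITRIM, struct fstrim_range */

      int main(void)
      {
          struct fstrim_range range;

          memset(&range, 0, sizeof(range));
          range.start  = 0;
          range.len    = (__u64)-1;  /* cover the entire filesystem */
          range.minlen = 0;          /* no minimum extent size */

          int fd = open("/mnt/ssd", O_RDONLY);
          if (fd < 0) {
              perror("open");
              return 1;
          }
          if (ioctl(fd, FITRIM, &range) < 0)
              perror("FITRIM");
          close(fd);
          return 0;
      }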

      In all likelihood, enabling discard for a file system probably won't help the benchmark a whole lot, since the performance advantage of using TRIM is a long-term advantage; and if the file system has been fully TRIM'ed at mkfs time, it's unlikely that the benchmark will have done enough writes that the SSD performance will degrade during the benchmark run. In fact, if the SSD takes time to process TRIM requests, you might actually get better performance by disabling the TRIM requests, just as you will get better short-term performance if you disable nilfs2's log cleaner. (Long-term it will hurt you badly, but often benchmarks don't test long-term results; that's my concern about benchmarks that don't pre-age the filesystem before beginning the benchmark run.)

      I've argued similar points previously on these forums, as well as in QEMU/KVM. A blazingly fast SQLite result will usually imply that sync operations are being ignored, which puts the data at risk under other loads. In the QEMU/KVM issue I chased down, it was true that barriers were being dropped in the QEMU block layer. (That was three weeks of finger-pointing between projects I don't want to relive.)

      So until the maintainers of a filesystem are willing to enable a performance optimization by default, you need to be _really_ careful with it. If they even suggest it might be risky, then caveat emptor.
      Very true. It's worse because we don't have technical writers at our disposal, so we don't always have time to write detailed memos describing how best to optimize your workload. I wish we did, and that's largely on us. But if people are willing to help out on http://ext4.wiki.kernel.org, please let me know. It needs a lot of love.

      BTW, one time when it might be OK to disable barriers is if you have a UPS that you absolutely trust, and the system is configured to shut itself down cleanly when the UPS reports that its battery is low. Oh, and it might be a good idea to put a big piece of tape (or an ungrounded wire) over the power switch....

      (In case it wasn't obvious, the ungrounded wire was a BOFH-style joke; please don't do it in real life. :-)
      Last edited by tytso; 11 March 2011, 07:30 PM. Reason: typo



      • #53
        Originally posted by tytso View Post
        The current thinking is that it's better to batch discards, and every few hours, issue a FITRIM ioctl request which will cause the disk to send discards on blocks which it knows to be free.

        Oops, that should be, "... issue a FITRIM ioctl request which will cause the file system to send discards..."

        Sorry for the typo, but this bboard doesn't let you edit posts a minute after they've been saved, and I didn't notice this until now.



        • #54
          One more thought... the fact that TRIM requests can hurt in the short term, while preserving SSD performance in the long term, is something that disadvantages btrfs (which has an SSD option which I believe does use TRIM) and might be an advantage for ext4. So it's another example of how not doing apples-to-apples comparisons can lead to misleading results --- and since this is one that ext4 benefits from, hopefully I won't be accused of complaining out of sour grapes just because I think ext4 should have done better in the benchmark comparisons.

          Yes, I understand the argument that most people don't mess with the defaults, and so the defaults should matter --- but at the same time, when some file systems are unsafe out of the box, it seems misleading not to call that out. And if a file system happens to have great performance when it is freshly formatted, but might degrade badly once the file system is aged, that is to me an indication that the benchmark isn't doing a good job.

          Quite frankly, the primary way I think benchmarks are useful is as a tool for improving a particular file system. I might compare against another file system just to understand what might be possible, but then I'll want to understand exactly why it was faster than my file system, in that particular workload and configuration --- and then I may decide to try to improve things, or might decide that on balance, disabling barriers by default isn't a fair thing to do to my user base.

          Competitive benchmarking is always subject to gaming, for people who are into doing that. And that's primarily driven by marketing folks who spend millions of dollars doing that in the commercial world for enterprise databases, for example. Very often those results are completely unrelated to how most people use their databases, but it's important for marketing purposes.

          A frequent complaint about the Phoronix benchmarks is that they are only useful for driving advertising revenue by driving page hits. I don't think that's entirely fair, but I do think they aren't as useful as they could be, and at least today they certainly aren't useful for helping users decide which file system to use. The main way I use them is to look at the long-term trends and see if there are any performance improvements or regressions. (And one shortcoming for this purpose is that it would be ideal if there were multiple hardware configurations, including some high-end configurations with 4, 8, and 16 CPU's, as well as high-end RAID storage. But I understand Phoronix is budget constrained and high-end hardware is expensive.)

          Note though that I'm comparing across kernel versions, not between file systems --- and sometimes there is a good reason for a performance drop, such as improving data safety in the face of a power crash. At least for me, that's going to be higher priority than performance. (Or at least, it should be, as the default option. Maybe I'll have an unsafe option for people with specialized needs and who know what they are doing; but the default should optimize for safety, assuming non-buggy application programs.)



          • #55
            Originally posted by tytso View Post
            The right answer would be to use something like the fs impressions tool to "age" the file system before doing the timed benchmark part of the test (see: http://www.usenix.org/events/fast09/...es/agrawal.pdf).
            Has this been done previously? The Impressions presentation talks only about making a filesystem look similar to an old one; the paper didn't actually attempt to benchmark it. I understand intellectually the value of it, but I would also assume that some filesystems would behave very differently between the two.

            The fundamental question is what are you trying to measure? What is more important? The experience the user gets when the file system is first installed, or what they get a month later, and moving forward after that?
            100% agree. There are thousands of measures and thousands of conditions that can be applied. What Michael and I try to listen for is the scenario and the potential measure that can be used. OpenBenchmarking and PTS provide the visibility and repeatability, respectively. The harder part is determining the Configuration Under Test and preparing the System Under Test to suit.

            We know that for each scenario presented, a vocal minority will see it as pointless...



            • #56
              Originally posted by tytso View Post
              Sure, but you're begging the question of what "normal conditions" are. Are you always going to fill a file system to 10% of capacity, and then reformat it, and then fill it to 10% again? That's what many benchmarkers actually end up testing. And so a file system that depends on the garbage collector for correct long-term operation, but which never has to garbage collect, will look really good. But does that correspond to how you will use the file system?

              What is "basic conditions", anyway? That's fundamentally what I'm pointing out here. And is performance really all people should care about? Where does safety factor into all of this? And to be completely fair to btrfs, it has cool features --- which is cool, if you end up using those features. If you don't then you might be paying for something that you don't need. And can you turn off the features you don't need, and do you get the performance back?

              For example, at $WORK we run ext4 with journalling disabled and barriers disabled. That's because we keep replicated copies of everything at the cluster file system level. If I were to pull a Hans Reiser, and shipped ext4 with defaults that disable the journal and barriers, it would be faster than ext2 and ext3, and most of the other file systems in the Phoronix file system comparison. But that would be bad for the desktop users of ext4, and that to me is more important than winning a benchmark demolition derby.

              -- Ted
              You mean reiser4? Which yells loudly if barriers are not supported, and goes into sync mode?
              Why not use your own creation as an example of dumb defaults - ext3?



              • #57
                Reiser4

                Originally posted by ayumu View Post
                As usual, reiser4 is missing. A shame.
                Is there a chance that Reiser4 gets added later on?



                • #58
                  Originally posted by mtippett View Post
                  Has this been done previously? The Impressions presentation talks only about making a filesystem look similar to an old one; the paper didn't actually attempt to benchmark it. I understand intellectually the value of it, but I would also assume that some filesystems would behave very differently between the two.
                  The need to use aged file systems to catch both performance and functional problems is something which is well known to industry practitioners. For example, xfstests (which was a functional test suite for file systems developed by SGI, but which has since been extended so it can be used on many file systems in general, and has started to have ext4-specific tests contributed to it) has provisions so that one file system is constantly reformatted, and another file system is kept across test runs, so it can be "aged" --- since some problems can only be reproduced on an aged file system.

                  However, there haven't been good tools to generate aged file systems; some people may have had ad hoc tools, but nothing general. And the academic community hasn't had this insight until very recently. (If it hasn't been published in a tenure-track journal, it doesn't exist as far as the academic community is concerned, and so for them it's a new idea. :-)

                  The author of the Impressions paper gave a talk at Google, and he used some of the same slides that he presented at the FAST conference. In those slides, he did show graphs that showed how much difference there was between freshly created file systems and aged file systems --- and the difference was quite noticeable. (Not a surprise.)

                  My opinion of the tool is that functionally it's O.K., but it could be better. It currently models the number, size, and distribution of files. And it also measures file fragmentation, which is a great way to predict future performance when reading the existing files. So that's all good. The one thing it doesn't measure is free space fragmentation, which would be a good predictor of future performance for newly created files. So there's room for improvement on that front.

                  On the implementation side, my primary complaint is that its method for measuring file fragmentation talks to debugfs over a pipe and parses the output of debugfs, which is (a) an ugly kludge, and (b) limits its functionality to ext2/3/4 file systems. Linux has a file-system-independent ioctl, FIEMAP, which will return the same information and would allow the file fragmentation module of the fs impressions tool to be used across multiple file systems.
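
                  As a rough sketch of that suggestion (the file path below is made up), counting a file's extents through FIEMAP takes a single ioctl and works on any filesystem that implements it:

                  Code:
                  /* Hypothetical sketch: count a file's extents via FS_IOC_FIEMAP
                   * instead of parsing debugfs output.  With fm_extent_count = 0 the
                   * kernel only reports how many extents exist, so no extent array
                   * needs to be allocated. */
                  #include <stdio.h>
                  #include <string.h>
                  #include <fcntl.h>
                  #include <unistd.h>
                  #include <sys/ioctl.h>
                  #include <linux/fs.h>          /* FS_IOC_FIEMAP */
                  #include <linux/fiemap.h>      /* struct fiemap, FIEMAP_MAX_OFFSET */

                  int main(void)
                  {
                      struct fiemap fm;

                      memset(&fm, 0, sizeof(fm));
                      fm.fm_start = 0;
                      fm.fm_length = FIEMAP_MAX_OFFSET;  /* map the whole file */
                      fm.fm_extent_count = 0;            /* just ask for the count */

                      int fd = open("/var/log/syslog", O_RDONLY);  /* made-up path */
                      if (fd < 0) {
                          perror("open");
                          return 1;
                      }
                      if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
                          perror("FIEMAP");
                          close(fd);
                          return 1;
                      }
                      printf("%u extents\n", fm.fm_mapped_extents);
                      close(fd);
                      return 0;
                  }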

                  The thing which makes the fs impressions tool really useful is that you can provide a statistical model for the number, distribution, and fragmentation of files --- and he has done some initial work on measuring these statistics on an existing file system, so that you can create a statistical model that matches a current file system. So if you have a file server which has been in service for 18 months, you could use his tool to gather statistics, and then use that to create a model that is reproducible and has the same characteristics as the source file system.

                  I talked to the author of the paper, and he's agreed in principle to put the source code of the tool up on github, and allow community members to submit patches to improve the tool. What is there is a good starting point, but he was the first to admit that it was a research vehicle, and once it was done and the researchers moved on to other research interests, the tool was never improved and productionalized. This probably explains why it still uses the debugfs hack and not a more general FIEMAP interface. So hopefully we can get this up on github, and people can help get the tool in shape so it can be used by folks such as your benchmarking operation. Some assembly will still be necessary, but it shouldn't be that much work.

                  100% agree. There are thousands of measures and thousands of conditions that can be applied. What Michael and I try to listen for is the scenario and the potential measure that can be used. OpenBenchmarking and PTS provide the visibility and repeatability, respectively. The harder part is determining the Configuration Under Test and preparing the System Under Test to suit.
                  One thing that I definitely need to give you guys kudos for is that you do document your hardware configurations for the System and Configuration Under Test, and you do strive for strong reproducibility. That's all good stuff.

                  Other folks who do a really good job are Eric Whitney at HP, who has helped me greatly in ext4 development, and Stephen Pratt at IBM. Both have done benchmarking professionally, and it shows. For an example of their work, see: http://free.linux.hp.com/~enw/ext4/2.6.36-rc6/ and http://btrfs.boxacle.net/.

                  One of the things they do that is incredibly helpful to file system developers is oprofile and (very important on larger CPU-count machines) lockstat runs. Enabling oprofile and/or lockstat will of course skew the benchmark results, so those runs have to be done separately and their performance results discarded, but the oprofile and lockstat information is very useful in showing the next things that can be optimized to further improve the file system.

                  Another very useful analysis tool for understanding why the results are the way they are is blktrace. The only caveat with blktrace is that the results can be very misleading on non-aged file systems. For example, btrfs is unquestionably better than ext4 at avoiding seeks on freshly created file systems. Chris Mason has some animations of blktrace output which make this very clear, and no doubt this is why btrfs performs better than ext4 on freshly created file systems on 1-2 processor systems (where lock contention isn't as important), and on workloads where a lot of files are created and written sequentially on a new file system (as opposed to database workloads which have a lot of random read/write operations). But I've been hesitant to put in some very simple and easy-to-make changes that would improve ext4's sequential file creation on freshly created file systems, because it would mean turning off the anti-fragmentation measures that we have put in to try to assure that ext4 will age more gracefully over the long term. As with any engineering discipline, a file system engineer often has to balance and trade off competing goals.

                  At the same time, I know we haven't done enough work to make sure ext4 could be improved on the long-term file system aging point of view. So much work to be done, and not enough time. :-)



                  • #59
                    Originally posted by energyman View Post
                    You mean reiser4? Which yells loudly if barriers are not supported, and goes into sync mode?
                    Why not use your own creation as an example of dumb defaults - ext3?
                    Ext3 is actually not my creation. Credit for implementing ext3 journaling, which was the key new feature for ext3, belongs to Stephen Tweedie. I ported and integrated Daniel Phillips' htree code and Andreas Gruenbacher's acl and xattr code into ext3, but I've never been the maintainer of the ext3 subsystem. That honor fell to Andrew Morton, and more recently, the maintainer is now Jan Kara of SuSE.

                    I actually pushed strongly for changing ext3's defaults to enable barriers, but that was vetoed by the then-maintainer of ext3, who was Andrew Morton. As I said earlier, we really should try again now that Jan Kara is the ext3 maintainer, since SuSE ships with the defaults changed in their enterprise product.



                    • #60
                      Originally posted by tytso View Post
                      Ext3 is actually not my creation. Credit for implementing ext3 journaling, which was the key new feature for ext3, belongs to Stephen Tweedie. I ported and integrated Daniel Phillips' htree code and Andreas Gruenbacher's acl and xattr code into ext3, but I've never been the maintainer of the ext3 subsystem. That honor fell to Andrew Morton, and more recently, the maintainer is now Jan Kara of SuSE.

                      I actually pushed strongly for changing ext3's defaults to enable barriers, but that was vetoed by the then-maintainer of ext3, who was Andrew Morton. As I said earlier, we really should try again now that Jan Kara is the ext3 maintainer, since SuSE ships with the defaults changed in their enterprise product.
                      Then I am sorry for blaming you. Please excuse my behaviour.
                      Hopefully that push comes soon. A filesystem optimized for benchmarks is not a file system I want to use. I want a filesystem that puts the data first.

                      Which means that ext4, with its 'sometimes a crash can mean original and destination are both 0', isn't good enough either.

