
Ted Ts'o: EXT4 Within Striking Distance Of XFS


  • #16
    So, why is nobody using reiser4? It is fast AND cares about data.

    And why is no other fs able to do the same? Why the same shit with extX and btrfs? A filesystem that might lose data WHICH WAS ALREADY ON THE PLATTER is braindead. And no matter what the devs write 'it was never guaranteed'... FUCK YOU.

    Just look at the btrfs FAQ. A simple rename can result in two empty files. BULLSHIT.

    Comment


    • #17
      Originally posted by kebabbert View Post
      Why would you want to run a fast filesystem? I don't get it.

      The most important thing for a filesystem is data integrity. Is your data safe and protected on disk? With XFS, the answer is no.

      http://www.zdnet.com/blog/storage/ho...ta-at-risk/169
      A researcher shows that:

      "XFS, NTFS, ext3, ReiserFS and JFS have. . . failure policies that are often inconsistent, sometimes buggy, and generally inadequate in their ability to recover from partial disk failures."

      I would not want to put my data on a fast but unreliable storage solution. Data safety is the most important thing for me, not speed.
      Do you use ECC RAM?

      Comment


      • #18
        It's not about competition!

        Originally posted by rohcQaH View Post
        Why don't you look at the graph in the linked blog post? There isn't much difference between ext4+journal and ext4-journal.

        Yes, XFS is still faster in the tested scenario. The point is that EXT4 got a huge speed boost on systems with many cores/threads and gets close to XFS, which had previously been all alone in that segment.
        If people are fixated on "win-loss" graphs, see the "random write" and "mail server" workloads, where ext4 actually does better than XFS. But really, I'm not seeing this primarily as an ext4 vs. XFS thing. If I had, I would have pointed at those graphs instead, and done the fanboy "Nyah, nyah" thing.

        We benchmark ourselves against XFS as a mark of respect. XFS has been optimized for HPC workloads, where they are often writing large files to large RAID arrays from big systems. (For SGI, 48 cores is a small system.) So the "large file create" workload, on the given hardware configuration, is basically on XFS's home ground, and as that graph shows, we still have more work to do.

        Some people like to treat file system benchmarks as a competition, and want to score wins and losses. That's not the way I look at it. I hack file systems because I'm passionate about working on that technology. I'm more excited about how I can make ext4 better, and not whether I can "beat down" some other file system. That's not what it's all about.

        -- Ted

        Comment


        • #19
          Why we benchmark ext4 in no journal mode

          Originally posted by Lynxeye View Post
          What's the point of a journaling filesystem like ext4 without active journaling? And I think the difference with many threads is still huge.
          There are two reasons why I asked Eric to benchmark ext4 in no journal mode.

          First of all, suppose you are using ext4 as the object store for a cluster filesystem, where you might have hundreds of servers in your cluster file system, with perhaps thousands of disks, and where each file is composed of "shards" which are replicated on multiple servers for redundancy in case a server dies (maybe the hard drive craps out, or a power supply explodes, etc.; when you have that many servers, the probability of some machine failing approaches 100%). In that scenario, the journal is overhead that's not worth the cost.

          (Note, by the way, that the "large file creates" workload is not a metadata-heavy workload, so the effect of the journal is not that pronounced. The "mail server" workload has many more metadata changes per transaction, and we see a much more pronounced difference between the journal and no-journal modes with 1 thread. The differences fall off at 48 and 192 threads because we still have scalability bottlenecks in the journal code that I need to fix.)

          The second reason why I asked Eric to benchmark ext4 in journal and no journal mode is that it helps me to see where potential bottlenecks are in ext4, so I know what is most profitable to tackle next. The main thrust of my LCA talk is actually about how to decide what optimizations to do next to improve a kernel system's scalability. If you look at the three benchmark reports which Eric produced for the work that I did during 2.6.34, 2.6.35, and 2.6.36, the lockstat report showed me where the ext4 code was hitting bottlenecks, and that told me what I should do next in order to make ext4 more scalable.

          It would be awfully nice if the Phoronix benchmarks actually gathered information using perf and lockstat, since that is actually what we kernel developers need to see how to improve the benchmarks. The graphs are pretty, but they are not what we need to understand how we can improve the filesystem. The graphs are what ESPN shows when they do 15-second clips of receivers catching touchdown passes; it may drive advertising revenue, but it doesn't help the football players improve their game. For that we need to study the game films, in slow motion, from multiple angles, and we need to study a lot more than just the receiver catching the touchdown pass. How the offensive and defensive linemen react to the play, etc., is far more important.

          Comment


          • #20
            You don't understand what Data Integrity is.

            It is not about a disk crashing or something similar. It is about retrieving the same data you put on the disk. Imagine you put this data on disk: "1234567890", but a corruption occurred, so you got back "2234567890". And the hardware does not even notice the data got corrupted. This is called Silent Corruption, and it occurs all the time.

            Now, imagine you have a fast filesystem, but there is silent corruption now and then. You can NOT trust the data you get back. As I have shown in links, this happens to XFS, ReiserFS, JFS, ext3, etc. It even happens to hardware RAID, all the time.

            CERN did a test: their 3,000 Linux storage servers showed hundreds of instances of silent corruption. (CERN wrote a known bit pattern to the disks and compared the result, and they differed.) CERN cannot trust the data on disk. Therefore CERN is now migrating to ZFS (which is actually the only modern solution that is designed from scratch to protect against silent corruption).

            I don't get it: who wants a fast filesystem which gives you false data?
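
            (For the curious: the CERN check described above amounts to writing a known bit pattern, fsyncing it, reading it back, and counting the bytes that differ. A minimal sketch; the file name and the particular pattern are my own invention:)

```python
import os
import tempfile

# A known alternating bit pattern (10101010 01010101 ...), 8 KiB worth.
PATTERN = bytes([0xAA, 0x55]) * 4096

path = os.path.join(tempfile.mkdtemp(), "probe.bin")
with open(path, "wb") as f:
    f.write(PATTERN)
    f.flush()
    os.fsync(f.fileno())  # push the data through the page cache to the disk

with open(path, "rb") as f:
    readback = f.read()

# Count the bytes that came back different from what was written.
mismatches = sum(a != b for a, b in zip(PATTERN, readback))
print("mismatching bytes:", mismatches)
```

            (On working storage the count is 0; CERN's finding was that across thousands of machines it was not always 0.)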

            Comment


            • #21
              Originally posted by kebabbert View Post
              You don't understand what Data Integrity is.

              It is not about a disk crashing or something similar. It is about retrieving the same data you put on the disk. Imagine you put this data on disk: "1234567890", but a corruption occurred, so you got back "2234567890". And the hardware does not even notice the data got corrupted. This is called Silent Corruption, and it occurs all the time.

              Now, imagine you have a fast filesystem, but there is silent corruption now and then. You can NOT trust the data you get back. As I have shown in links, this happens to XFS, ReiserFS, JFS, ext3, etc. It even happens to hardware RAID, all the time.

              CERN did a test: their 3,000 Linux storage servers showed hundreds of instances of silent corruption. (CERN wrote a known bit pattern to the disks and compared the result, and they differed.) CERN cannot trust the data on disk. Therefore CERN is now migrating to ZFS (which is actually the only modern solution that is designed from scratch to protect against silent corruption).

              I don't get it: who wants a fast filesystem which gives you false data?
              You obviously don't get it, or refuse to note the examples given to you, like the Google situation. You also don't get that there is no such thing as 100% data integrity, EVER. No chance. In all cases, choices are made by balancing cost, performance, and data-integrity needs.

              Google doesn't need perfect data integrity, or hardware quality for that matter. Given the volumes of data they process and the volume of hardware they deal with, they don't care if a server is of iffy quality; they toss it. Large data volumes might also mean a faster but more error-prone file system fits them best ROI-wise.

              Like I asked you: do you use ECC in all your computers, in all situations? Do you slightly overvolt/overclock/underclock any part of your computer, back up on tape, back up on 10-yr DVDs? The point is, you personally make trade-offs on data integrity on a daily basis.

              Gee, I don't get it! Ohhh, how can someone/some corporation live in a world without 100% data integrity?!!

              Comment


              • #22
                Originally posted by energyman View Post
                And no matter what the devs write 'it was never guaranteed'... FUCK YOU.
                POSIX says what you do or do not do, as an app, if you want your data safely stored. How much simpler does it need to be? Follow POSIX, or else it's the program author's fault when the data is not safely stored. Early filesystems like ext2 happened (not on purpose) to make it mostly a non-problem if software didn't save properly, so people used bad code for too long with no issues. Now we have filesystems which make use of everything possible, and broken software fails on them. So fix the software.
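
                (A rough Python sketch of that POSIX recipe — the helper name is my own invention: write to a temporary file, fsync it, rename over the target, then fsync the directory so the rename itself is durable.)

```python
import os
import tempfile

def atomic_write(path, data):
    """Durably replace `path` with `data` via write + fsync + rename."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)              # 1. flush the new contents to disk
    finally:
        os.close(fd)
    os.rename(tmp, path)          # 2. atomic swap: readers see old or new
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)             # 3. make the rename itself durable
    finally:
        os.close(dfd)

target = os.path.join(tempfile.mkdtemp(), "file")
atomic_write(target, b"content")
```

                (After a crash at any point in that sequence, the target should be either the old file or the new one, never a zero-length husk.)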

                Didn't know you were still around, Jade, since the Reiser confession. Or was that forced, or something, I suppose?

                Comment


                • #23
                  Originally posted by tytso View Post
                  Some people like to treat file system benchmarks as a competition, and want to score wins and losses. That's not the way I look at it. I hack file systems because I'm passionate about working on that technology. I'm more excited about how I can make ext4 better, and not whether I can "beat down" some other file system. That's not what it's all about.

                  -- Ted
                  That's a hell of a quote. Respect.

                  Comment


                  • #24
                    Originally posted by yotambien View Post
                    That's a hell of a quote. Respect.
                    I think the first thought that popped into my head was "Thank you!", for the hard work and sentiment.

                    Comment


                    • #25
                      Originally posted by Tgui View Post
                      You obviously don't get it, or refuse to note the examples given to you, like the Google situation. You also don't get that there is no such thing as 100% data integrity, EVER. No chance. In all cases, choices are made by balancing cost, performance, and data-integrity needs.

                      Google doesn't need perfect data integrity, or hardware quality for that matter. Given the volumes of data they process and the volume of hardware they deal with, they don't care if a server is of iffy quality; they toss it. Large data volumes might also mean a faster but more error-prone file system fits them best ROI-wise.

                      Like I asked you: do you use ECC in all your computers, in all situations? Do you slightly overvolt/overclock/underclock any part of your computer, back up on tape, back up on 10-yr DVDs? The point is, you personally make trade-offs on data integrity on a daily basis.

                      Gee, I don't get it! Ohhh, how can someone/some corporation live in a world without 100% data integrity?!!
                      Hmmm... You don't seem to understand what I mean.

                      CERN is storing lots of data because of the LHC (Large Hadron Collider), which cost billions and took decades to plan and build. They are trying to find the Higgs boson(?). CERN really thinks it is important that the bits from their experiments are stored correctly. Now they are migrating to ZFS:

                      http://blogs.sun.com/simons/entry/hp..._science_means

                      "Simultaneously, more LHC sites are beginning to use Sun's Thumper (Sun Fire x4500) ultra-dense disk storage systems.

                      Having conducted testing and analysis of ZFS, it is felt that the combination of ZFS and Solaris solves the critical data integrity issues that have been seen with other approaches. They feel the problem has been solved completely with the use of this technology. There is currently about one Petabyte of Thumper storage deployed across Tier1 and Tier2 sites. That number is expected to rise to approximately four Petabytes by the end of this summer."

                      Here is another link about LHC and ZFS:
                      http://hpc-events.com/sun-dresden07/CERN_Gasthuber.pdf

                      "Solaris 10/11 with ZFS solves critical data integrity cases seen on all types of HW (/$ independent) <--------- OBS!!!!
                      we feel that this issue is solved completely
                      Numerous sites now deploy ZFS + Thumper
                      other storage + ZFS still minor
                      >1 PB already in operation (T1 + T2)
                      - just count the well known sites (as of April 07)
                      - doubles soon"

                      Also, in finance (which is where I work), it is extremely important that the data is stored correctly. I hope you understand that there are some fields of work where performance is secondary.

                      Of course I am not saying that ZFS is 100% secure, but it is far more secure than other solutions. The thing is that ZFS uses end-to-end checksumming; no other solution does. End-to-end means from RAM, down to the controller, down to the disk: the entire chain. There might be bit flips in some part of the chain. No other solution compares checksums across the ends of the chain; they only compare checksums within a realm, not when data passes from one realm to the next.




                      For instance, here is an example where the ZFS end-to-end checksum immediately detected an error in a switch. The switch injected faulty bits into the data stream on its way down to the server. Before that, nobody noticed the faulty switch and the corrupt data:
                      http://jforonda.blogspot.com/2007/01...meets-zfs.html

                      "As it turns out our trusted SAN was silently corrupting data due to a bad/flaky FC port in the switch. DMX3500 faithfully wrote the bad data and returned normal ACKs back to the server, thus all our servers reported no storage problems.
                      ...
                      ZFS was the first one to pick up on the silent corruption"
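
                      (A toy sketch of the end-to-end idea. This is not ZFS code; the block store and function names are invented. Compute a checksum at the top of the stack when writing, verify it on every read, and a bit flipped anywhere in between is reported instead of being returned as good data:)

```python
import zlib

def write_block(store, key, data):
    # Checksum computed at the top of the stack, stored alongside the data.
    store[key] = (data, zlib.crc32(data))

def read_block(store, key):
    data, crc = store[key]
    if zlib.crc32(data) != crc:   # verify on every read
        raise IOError("silent corruption detected in block %r" % key)
    return data

store = {}
write_block(store, 0, b"1234567890")

# Simulate a bit flip somewhere along the I/O path: "1..." becomes "2...".
data, crc = store[0]
store[0] = (b"2" + data[1:], crc)
```

                      (Reading block 0 now raises an error instead of silently handing back "2234567890".)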




                      There is also research from computer scientists showing that hardware RAID and filesystems are not secure; you cannot rely on them. There is also research showing ZFS to be safe: ZFS detected all artificially introduced errors, and it would also have corrected them if RAID had been used (they used a single disk).

                      The first step is detecting errors. Then you can correct them. Unfortunately, only ZFS is designed from scratch to detect errors.




                      In summary, you can use a storage solution that scales to petabytes and is SAFE. So I don't see why you should focus on performance, if you then cannot use that solution in, for instance, finance.

                      (No, I do not use ECC RAM yet, because I am waiting to upgrade my old ZFS server. When I upgrade, I will surely use ECC RAM. I agree ECC is important.)

                      Comment


                      • #26
                        Originally posted by Ranguvar View Post
                        POSIX says what you do or do not do, as an app, if you want your data safely stored. How much simpler does it need to be? Follow POSIX, or else it's the program author's fault when the data is not safely stored. Early filesystems like ext2 happened (not on purpose) to make it mostly a non-problem if software didn't save properly, so people used bad code for too long with no issues. Now we have filesystems which make use of everything possible, and broken software fails on them. So fix the software.

                        Didn't know you were still around, Jade, since the Reiser confession. Or was that forced, or something, I suppose?
                        file A is on disk.

                        You want to rename it to B.

                        You call rename(). A crash at the wrong moment and both are gone. Or there is a file A or B, but its contents? Gone. That is fucking braindead idiocy.
                        from the btrfs faq:

                        What are the crash guarantees of rename?
                        Renames NOT overwriting existing files do not give additional guarantees. This means, a sequence like
                        echo "content" > file.tmp
                        mv file.tmp file

                        # *crash*
                        will most likely give you a zero-length "file". The sequence can give you either:
                        - neither file nor file.tmp exists, or
                        - file.tmp or file exists and is 0-size or contains "content"


                        That is unacceptable, no matter what POSIX says. POSIX is crap anyway (Windows NT is POSIX compliant too... yeah..).

                        Whoever thinks that some clusterfuck like that is acceptable has a major problem with reality.

                        In reality data is sacrosanct. Nuking it is not an option. A FS nuking data is fucking broken by fucking design.

                        Comment


                        • #27
                          Originally posted by Ranguvar View Post
                          POSIX says what you do or do not do, as an app, if you want your data safely stored. How much more simple does it need to be? Follow POSIX, or else it's the program author's fault when the data is not safely stored.
                          Fsync() is evil.

                          I mean, really, truly, horribly, satanically evil.

                          Flushing data on my laptop forces the disk to spin up just to write a file that I probably don't care that much about, thereby wasting my battery power. Flushing data on an ecommerce database server, on the other hand, is probably vital to ensure that databases are kept up to date.

                          But that's a system configuration choice, and should not ever be something that applications randomly decide to do. If I don't care that I might lose the last five minutes of files when I crash, then Firefox shouldn't be calling fsync() every time I visit a new web page. But it does because there are so many crappy filesystems which will corrupt your files if you crash before everything has been flushed to disk.

                          Having every application decide whether to force a write to the disk and waste my battery power is simply braindead. Filesystems should behave in a sensible manner so that we don't need this kind of hackery to make them work the way they should have worked in the first place. If I have file A on disk and I edit it and write it out, then the filesystem should be able to ensure that when I read it back after a reboot it will either be file A or file B and not an empty file or some corrupted mixture of the two. Anything else is unacceptable in a general use filesystem (special use filesystems may well prefer speed to consistency and be able to handle corruption issues).

                          Comment


                          • #28
                            Originally posted by kebabbert View Post
                            Hmmm... You don't seem to understand what I mean.
                            No, it's pretty clear you're the one who isn't understanding. Everyone agrees that in certain cases it's important to have data integrity. What you don't seem to get is that in certain situations it is perfectly acceptable to have data errors, even if you don't know they are there. This has been explained to you, but you keep repeating the same stuff, so I'm not sure if you're ignoring us or just don't understand the concept.

                            Comment


                            • #29
                              Originally posted by smitty3268 View Post
                              No, it's pretty clear you're the one who isn't understanding. Everyone agrees that in certain cases it's important to have data integrity. What you don't seem to get is that in certain situations it is perfectly acceptable to have data errors.
                              Ok, please enlighten me. I work in finance, so I have trouble seeing situations where it is acceptable that the data you get is not correct. Maybe my line of work has colored me, but please give me some real-life examples where it is acceptable that you get erroneous data (not some contrived examples).

                              I suppose you also advocate fast CPUs that every once in a while insist that 1 + 1 = 7?

                              To me, it is strange to suggest such storage solutions or hardware. I can promise you that in finance your suggestions would get kicked out faster than greased lightning.

                              Comment


                              • #30
                                Originally posted by kebabbert View Post
                                Ok, please enlighten me. I work in finance, so I have trouble seeing situations where it is acceptable that the data you get is not correct.
                                 Multimedia files usually tolerate being slightly corrupted. It might show up as a small artefact in a movie or an image, or a crackle in an audio stream.
                                 Applications such as video hosting would probably be great candidates for fast but less secure storage.

                                Comment
