Bcachefs Linux File-System Benchmarks vs. Btrfs, EXT4, F2FS, XFS


  • #31
    Originally posted by jacob View Post

    AFAIK RAID1 has been production quality for many years and has been tested in similar scenarios countless times. Your problem has probably some deeper cause that may or may not be related to BTRFS.
    My problem was caused by a hard shutdown. But the inability to fix it was completely caused by bugs in btrfs-progs v4.14. Problems that were fixed a long time ago keep getting broken again, whenever they try to fix another problem. If you talk with the developers, this is ridiculously common. The main cause of these frequent breaks is that the architecture of btrfs is a horrible mess.

    The bug I experienced is actually known to the developers and has been worked on since at least January, and it is still not fixed as of 4.16.1.

    If you're not convinced, head over to the btrfs mailing list and ask the developers themselves how long it has been since someone reported a btrfs raid-1 error resulting in a corrupted, unrecoverable and unmountable file system. I bet the answer will be less than one week, if not one day.

    Comment


    • #32
      Originally posted by PuckPoltergeist View Post
      If you have a crash between 1. and 3., your data is lost. It doesn't matter if you add any checksums. You can add them to detect data corruption, but writeout still suffers from the same problem. That is a different problem, one that checksums won't solve.
      You're missing the point. I'm not talking about data loss. If there is a crash before a commit, you WILL lose data, with or without COW and with or without checksum. Checksums have nothing to do with that.

      The way COW relates to checksums is in the fact that writing the data and writing the checksum are two distinct disk operations. You can't do them indivisibly at the same time. Which means that there can be a crash between the two. With COW, that doesn't matter, because you will lose the possibly messed up new data and the possibly messed up new checksum, but what remains available after reboot is the old data AND the old checksum, which are guaranteed to be mutually consistent.

      Without COW, you write the new data, at which point the old data are no longer available, and the disk thus contains new data and the old checksum, which is now inconsistent. If everything goes well, you can then write the new checksum and all is fine. If you have a crash, you reboot in a state where the disk contains a checksum that doesn't match the current state of the data and there is no way to recover reliably from there, other than HOPING that the current data are in fact correct (which may be the case, or not) and recalculating the new checksum from there.
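
To make the ordering argument concrete, here is a minimal Python sketch of the two cases. It is a toy model, not btrfs or bcachefs code: the `disk` dictionaries, `CrashError`, and both write functions are hypothetical names, and CRC32 merely stands in for whatever checksum a real filesystem uses. The crash is injected between the data write and the checksum write.

```python
import zlib


class CrashError(Exception):
    """Simulated power loss between two disk operations."""


def overwrite_in_place(disk, new_data, crash_after_data):
    disk["data"] = new_data                  # op 1: the old data is gone
    if crash_after_data:
        raise CrashError
    disk["csum"] = zlib.crc32(new_data)      # op 2: never reached on a crash


def overwrite_cow(disk, new_data, crash_after_data):
    # The new data goes to a fresh block; the live block is untouched.
    disk["blocks"].append({"data": new_data, "csum": None})
    if crash_after_data:
        raise CrashError
    disk["blocks"][-1]["csum"] = zlib.crc32(new_data)
    disk["root"] = len(disk["blocks"]) - 1   # commit: one pointer switch


def live_is_consistent(data, csum):
    return zlib.crc32(data) == csum


old, new = b"A" * 16, b"B" * 16

inplace = {"data": old, "csum": zlib.crc32(old)}
cow = {"blocks": [{"data": old, "csum": zlib.crc32(old)}], "root": 0}

for write, disk in ((overwrite_in_place, inplace), (overwrite_cow, cow)):
    try:
        write(disk, new, crash_after_data=True)
    except CrashError:
        pass  # "reboot" and inspect what is left on disk

inplace_ok = live_is_consistent(inplace["data"], inplace["csum"])
live = cow["blocks"][cow["root"]]
cow_ok = live_is_consistent(live["data"], live["csum"])
print(inplace_ok, cow_ok)  # prints: False True
```

After the simulated crash, the in-place variant holds new data with the old checksum, while the COW variant still points at the old data and its matching old checksum.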

      Originally posted by PuckPoltergeist View Post
      It does. The data doesn't change during writeout, and the checksum is calculated over the data in RAM. So it doesn't matter whether it is calculated before or after writeout.
      Once again, you can't do that except in the most trivial cases. For one thing, you generally don't have all the data available at the moment you create a new journal transaction, so you can't just precalculate the checksum "over the data in RAM". Secondly, imagine a situation where you have an extent AAAAAAAAAAAAAA, with a valid checksum, and you want to write an extent BBBBBB that overlaps with it. In practice that means splitting the original extent into three, resulting in AAAAAABBBBBBAAAA. Now you need to create and write a new checksum for the first chunk of A's, calculate and write the checksum for BBBBBB (which, let me repeat, you generally can't do in advance when you write the journal header), and then create and write a new checksum for the second chunk of A's. This must all be done transactionally for the reasons above, and the only way to do this is using COW.
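
The extent split described above can be sketched as follows. This is an illustration only, assuming one CRC32 per extent and a 16-byte original extent so that the result matches the AAAAAABBBBBBAAAA layout; `Extent`, `make_extent` and `split_for_overwrite` are hypothetical names, not any filesystem's API.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Extent:
    offset: int
    data: bytes
    csum: int


def make_extent(offset, data):
    return Extent(offset, data, zlib.crc32(data))


def split_for_overwrite(old, new_offset, new_data):
    """Return the extents that replace `old` once new_data lands inside it."""
    start = new_offset - old.offset
    end = start + len(new_data)
    if start < 0 or end > len(old.data):
        raise ValueError("overwrite must lie inside the old extent")
    pieces = []
    if start > 0:                      # leading remainder of the old extent
        pieces.append(make_extent(old.offset, old.data[:start]))
    pieces.append(make_extent(new_offset, new_data))
    if end < len(old.data):            # trailing remainder of the old extent
        pieces.append(make_extent(old.offset + end, old.data[end:]))
    return pieces


old = make_extent(0, b"A" * 16)
pieces = split_for_overwrite(old, 6, b"B" * 6)
for p in pieces:
    print(p.offset, p.data, hex(p.csum))
```

One overwrite produces three extents and three new checksums, none of which equals the checksum of the original extent. Under this model, those are the records that would have to become visible together.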

      Comment


      • #33
        Originally posted by AndyChow View Post
        By now, it's evident that BTRFS is badly designed. Bcachefs might not have all the features, but they are building them slowly, in a sane way, once the basics have been mastered and work. With BTRFS, everything was thrown together, and it doesn't work. A couple of weeks ago, a PCIe hardware failure caused my system to require a hard reboot. My raid-1 btrfs had the last leaf on the most recent tree corrupted. There was absolutely no way to repair it. The only thing I could do was dump the files to another array, then destroy and re-format the btrfs array. All attempts to recover and all commands were done by a btrfs developer.

        So with BTRFS, a raid-1 array that has the very last block of the very last write broken due to a power failure corrupts the entire filesystem. And there is no way to recover. So how is btrfs even COW? My understanding of COW is that you could just truncate the last modifications and recover everything not too new. But no, it doesn't, not with BTRFS.
        I was looking for your report on the BTRFS mailing list, but could not find anything. Can you please link or let me know the subject of your post?!
        From my experience BTRFS "RAID-1" works fine as long as you have two operational disks, and my system was able to work fine during a controller lockup. Apparently you did not suffer any data loss either, which does not sound like bad design to me. It seems that the filesystem went read-only to protect itself.
        You claim that this bug corrupted the entire filesystem. Is this correct? Was it *corrupt*, or was it just not read/write?!

        http://www.dirtcellar.net

        Comment


        • #34
          Originally posted by jacob View Post
          Once again, you can't do that except in the most trivial cases. For one thing, you generally don't have all the data available at the moment you create a new journal transaction, so you can't just precalculate the checksum "over the data in RAM". Secondly, imagine a situation where you have an extent AAAAAAAAAAAAAA, with a valid checksum, and you want to write an extent BBBBBB that overlaps with it. In practice that means splitting the original extent into three, resulting in AAAAAABBBBBBAAAA. Now you need to create and write a new checksum for the first chunk of A's, calculate and write the checksum for BBBBBB (which, let me repeat, you generally can't do in advance when you write the journal header), and then create and write a new checksum for the second chunk of A's. This must all be done transactionally for the reasons above, and the only way to do this is using COW.
          You're modifying some data, so why make it so complicated? You have your data with a valid checksum

          AAAAAAAAAAAAAA => checksum foo

          you're modifying to

          AAAAAABBBBBBAAAA => checksum foo doesn't match anymore

          So a new checksum bar must be generated and stored instead of foo. Everything is known before the data is written to disk. So this can be journaled without any problem. And no need for splitting anything.
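
The in-memory half of this proposal can be sketched as below, assuming a single CRC32 over the whole extent and a 16-byte original so the result matches the layout shown above. `apply_overwrite` and the `journal_entry` layout are hypothetical, for illustration only. The sketch shows only that `bar` can be computed once the full old contents are in RAM. It says nothing about whether the journal record and the data can be made durable together, which is the point under dispute.

```python
import zlib


def apply_overwrite(old, offset, new):
    """Return the extent contents after `new` overwrites `old` at `offset`."""
    return old[:offset] + new + old[offset + len(new):]


old = b"A" * 16
foo = zlib.crc32(old)                      # checksum currently on disk

patch = b"B" * 6
modified = apply_overwrite(old, 6, patch)  # needs the whole old extent in RAM
bar = zlib.crc32(modified)                 # checksum to store instead of foo

# Hypothetical journal record holding the change and its expected checksum.
journal_entry = {"offset": 6, "data": patch, "new_csum": bar}
print(modified, foo != bar)
```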

          Comment


          • #35
            Originally posted by boxie View Post

            It does however give a nice handy point in time performance snapshot. and even though you are biased towards BTRFS, you have to admit that it is not a bad start!
            Oh yes, I do welcome bcachefs, and thumbs up to the dev(s?) for trying

            http://www.dirtcellar.net

            Comment


            • #36
              Originally posted by waxhead View Post

              I was looking for your report on the BTRFS mailing list, but could not find anything. Can you please link or let me know the subject of your post?!
              From my experience BTRFS "RAID-1" works fine as long as you have two operational disks, and my system was able to work fine during a controller lockup. Apparently you did not suffer any data loss either, which does not sound like bad design to me. It seems that the filesystem went read-only to protect itself.
              You claim that this bug corrupted the entire filesystem. Is this correct? Was it *corrupt*, or was it just not read/write?!
              It couldn't mount, be made mountable, or be restored to any mountable state. I didn't go on the mailing list, just freenode. I only got my data back because I had a few disks I could migrate the data to, using the recovery tools. I would have documented it, but it was 4 a.m. and I'm lazy. The devs seemed to find my situation rather usual.

              Comment


              • #37
                Originally posted by AndyChow View Post

                It couldn't mount, be made mountable, or be restored to any mountable state. I didn't go on the mailing list, just freenode. I only got my data back because I had a few disks I could migrate the data to, using the recovery tools. I would have documented it, but it was 4 a.m. and I'm lazy. The devs seemed to find my situation rather usual.
                Well, from experience (not just BTRFS), devs have a tendency to always blame the hardware, but that is a digression. So what you are saying is that your BTRFS filesystem was not mountable either read-only or read-write, but it nevertheless seems that you did not lose data. Would you mind sharing which profile you used for data + metadata and how many disks were used for your filesystem? Also please (if you haven't already... I'm lazy too, and too lazy to go back and check earlier posts) share the kernel version as well.

                http://www.dirtcellar.net

                Comment


                • #38
                  Originally posted by PuckPoltergeist View Post
                  So a new checksum bar must be generated and stored instead of foo. Everything is known before the data is written to disk. So this can be journaled without any problem. And no need for splitting anything.
                  That's precisely what is impossible. Contrary to what you say, everything is NOT known at the time the journal entry is created (which usually happens way before the data writeout is initiated, and not in an atomic way). But that's not the whole problem. EVEN IF you could somehow journal the new checksum, you would need to do the following:

                  1.) Read the old file into memory, to calculate the new checksum with the relevant extents replaced. Writeout would be blocked during all that time;

                  2.) Journal the new checksum

                  3.) Start the writeout - don't forget that this is not atomic either. So unless the new extent is just a single block, you could very well have a crash in the middle of a write out. Upon reboot, there will be an invalid old checksum in the FS, an invalid new checksum in the journal, a partly-but-not-completely overwritten file and no way to recover.
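
The failure in step 3 can be reproduced with a small simulation. This is a sketch under stated assumptions: a 4-byte block size, CRC32 over the whole file, and a hypothetical `torn_write` helper that stops after a given number of blocks to imitate a crash in the middle of the writeout.

```python
import zlib

BLOCK = 4  # assumed block size for the toy model


def torn_write(disk, offset, new, blocks_written):
    """Write `new` block by block, stopping early to simulate a crash."""
    n = min(blocks_written * BLOCK, len(new))
    disk[offset:offset + n] = new[:n]


old = b"A" * 16
disk = bytearray(old)
old_csum = zlib.crc32(old)                  # checksum still stored in the FS

new = b"B" * 8                              # two blocks to be overwritten
expected = old[:4] + new + old[12:]
journal_csum = zlib.crc32(expected)         # step 2: journaled new checksum

torn_write(disk, 4, new, blocks_written=1)  # step 3: crash after one block

on_disk = zlib.crc32(bytes(disk))
matches_old = on_disk == old_csum
matches_new = on_disk == journal_csum
print(bytes(disk), matches_old, matches_new)
```

After the crash the disk holds `AAAABBBBAAAAAAAA`, which matches neither the old checksum in the filesystem nor the new one in the journal.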

                  Comment


                  • #39
                    Originally posted by phoronix View Post
                    Phoronix: Bcachefs Linux File-System Benchmarks vs. Btrfs, EXT4, F2FS, XFS

                    With Bcachefs on its trek towards the mainline Linux kernel, this week I conducted some benchmarks using the very latest Bcachefs file-system code and compared its performance to the mainline Btrfs, EXT4, F2FS, and XFS file-system competitors on both rotating and solid-state storage.

                    http://www.phoronix.com/vr.php?view=26357
                    This is an old topic, but Michael, the most important thing you failed to test was:
                    bcachefs on both SSD *and* HDD.
                    Since the other filesystems mentioned can't do RAID themselves, we can limit the tests to btrfs and bcachefs.
                    Now bcachefs knows the difference between SSD and HDD. It will migrate active files to SSD.
                    btrfs doesn't, so it will just trash it.

                    Now we can bring ext4 and xfs back into the test by creating two RAID1 devices and putting *bcache* on top of them before actually creating the filesystem.

                    So that's the setup I would love to see you make:

                    bcachefs on 2 SSDs and 2 HDDs, with the number of metadata and data copies set to 2, data and metadata checksums on
                    btrfs on 2 SSDs and 2 HDDs, with the number of metadata and data copies set to 2, data and metadata checksums on
                    ext4 on bcache on RAID1 across 2 HDDs and RAID1 across 2 SSDs, with all checksums on (my normal install, except for the checksums)
                    xfs with the same setup and all checksums on

                    Now rsync garbage to the filesystem (linux tree multiple times in different directories), until it certainly exceeds the size of the SSD's.
                    And then try to test whatever you like.
                    Now that would make some interesting benchmarks.
                    And you still need to realise that both btrfs and bcachefs have data checksums, so they know about *filedata* integrity, and they know which disk contains a correct copy.
                    Raid will never know unless you use raid6, as it doesn't checksum blocks to know which one is correct.
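
The difference can be illustrated with a toy read path. This is a sketch only, not how md, btrfs or bcachefs implement it: `read_mirrored` is a hypothetical function, and CRC32 stands in for the filesystem's data checksum.

```python
import zlib


def read_mirrored(copies, stored_csum=None):
    """Return one copy of a mirrored block.

    Without a stored checksum there is nothing to arbitrate with, so the
    first copy is returned as-is. With one, the first matching copy wins.
    """
    if stored_csum is None:
        return copies[0]
    for copy in copies:
        if zlib.crc32(copy) == stored_csum:
            return copy
    raise IOError("no copy matches the stored checksum")


good = b"important file data"
bad = bytearray(good)
bad[0] ^= 0xFF                        # silent corruption on one mirror
copies = [bytes(bad), good]

plain = read_mirrored(copies)                       # plain mirror read
checked = read_mirrored(copies, zlib.crc32(good))   # checksummed read
print(plain == good, checked == good)  # prints: False True
```

The plain mirror hands back whichever copy it reads first, while the checksummed read can tell the two copies apart and return the intact one.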

                    Now we go for statistics and count the number of times you had to reformat a partition, or reboot due to kernel locks with a filesystem that is called stable.
                    For data loss and kernel locks, btrfs is clearly the winner in my experience.
                    Second is xfs, but I heard it's stable now. I haven't used it as much as btrfs though.
                    Reiser4 also gave me low system uptime due to the panics.
                    Ext4 has a count of 0 for data loss for me. Reboots due to kernel locks (ext4 bugs) still happened, though.
                    Now bcachefs has not even been tagged experimental because it is not even upstream. But when I read bcachefs.org, I recognize all the pain that btrfs brought upon us.

                    Anyway: the clearest feature win for bcachefs over btrfs is its history as bcache: SSDs for fast data, HDDs for long-term storage. But if I read it correctly, the SSD is no longer used as a block cache but as a file cache.

                    Comment
