Btrfs Enjoys More Performance With Linux 6.3 - Including Some 3~10x Speedups

  • #61
    Originally posted by brucethemoose View Post

    ZFS still not a good fit for a Steam game directory though.

    What Steam games need is casefolding, sequential speed, low access latency, low task energy (especially for the Deck), and maybe compression. The game data integrity itself is not really important, and the library can easily be split across drives.

    TBH I think ext4 is a better fit than btrfs since it's faster. F2FS would be even better if its lz4 compression actually saved space.
    I think you're responding to the wrong comment. Neither I nor the person I quoted mentioned games.


    • #62
      Originally posted by dreamcat4 View Post
      [...]
      And I have 3 options to make them mirror drives (raid1): it can be zfs, btrfs, or mdadm. OK, but with the btrfs based solution, when 1 of the 2 disks fails and drops out, the filesystem automatically goes into readonly mode (and also maybe it requires a reboot or whatever).
      [...]
      I dropped a disk (inside qemu) from a btrfs raid1 filesystem, and the btrfs filesystem still worked; of course dmesg is filled with error messages, but the filesystem seems to work.
      I unmounted it and then re-mounted it (with -o degraded), and it still works.
      Code:
      # btrfs fi show /mnt/other/
      Label: none  uuid: 19526ff8-c664-4954-8f4d-564d23556559
              Total devices 2 FS bytes used 7.90GiB
              devid    1 size 100.00GiB used 9.01GiB path /dev/sdd
              devid    2 size 100.00GiB used 9.01GiB path /dev/sde
      
      # cd /mnt/other/
      # while true; do cp -ra usr/ bin/ ; sync ; done &  # write to the fs
      
      # echo 1 >/sys/devices/pci0000:00/0000:00:05.0/virtio1/host2/target2:0:2/2:0:2:0/delete  # drop /dev/sde from the system (simulate a disk failure)
      
      # # the filesystem is still healthy
      
      # kill %1  # kill the writing background process
      
      # cd /
      # umount /mnt/other
      
      # btrfs fi show
      Label: none  uuid: 19526ff8-c664-4954-8f4d-564d23556559
              Total devices 2 FS bytes used 7.90GiB
              devid    1 size 100.00GiB used 9.01GiB path /dev/sdd
              devid    2 size 0 used 0 path /dev/sde MISSING
      
      # # mount again
      # mount -o degraded /dev/sdd /mnt/other/
      
      
      # cd /mnt/other/
      # while true; do cp -ra usr/ bin/ ; sync ; done &  # write to the fs
      
      # # the filesystem is still healthy
      My tests didn't show that btrfs + raid1 is unusable as you reported. I didn't see any switch to readonly.


      • #63
        Originally posted by pkese View Post

        The "big" RAID5/6 patchset is in the works, now at revision 5, most of problems have been ironed out by now and is likely to hit mainstream sometime this year.


        And the patchset is not very big either - about 1000 lines of code (relative to 150,000 for the whole filesystem).
        This patch set has a different aim: allowing efficient use of zoned disks (i.e. append-only disks). I didn't look deeply into this patch set, but if I remember correctly it doesn't handle raid5/6 (yet?).

        The btrfs and raid5/6 incompatibility stems from the fact that COW and raid5/6 don't mix well. The reason is that raid5/6 expects to overwrite the parities in place.
        This means that in case of (e.g.) a power failure, the parities and the data can end up misaligned.

        The checksums may help to rebuild the parities; but a power failure plus a disk failure may lead to permanent data loss.

        From a theoretical point of view the stripe tree could solve this issue. However, it is another layer on an already complex filesystem: the stripe tree adds another level of indirection between the logical block and the physical block.

        Frankly speaking, I would prefer adding a journal/log, as md does, to avoid the write-hole problem.
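        To give an idea, this is a minimal sketch of what that looks like with md today (the device names are made up; the filesystem on top is just a plain single-device btrfs):
        Code:
        # /dev/sd[bcd] are the array members; /dev/nvme0n1 is a fast device
        # dedicated to the raid5 write journal, so stripes are logged there
        # first and a crash should not leave data and parity out of sync
        mdadm --create /dev/md0 --level=5 --raid-devices=3 \
              --write-journal /dev/nvme0n1 \
              /dev/sdb /dev/sdc /dev/sdd

        # put the filesystem on top of the journaled array
        mkfs.btrfs /dev/md0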

        I remember that a review of the raid5/6 code showed other problems as well, like the fact that data rebuilding doesn't verify the newly computed data against its checksums, allowing potential corruption to propagate.

        Some of these were addressed by Qu Wenruo. But my understanding is that there are some corner cases still unaddressed.


        • #64
          Originally posted by kreijack View Post

          This patch set has a different aim: allowing efficient use of zoned disks (i.e. append-only disks). I didn't look deeply into this patch set, but if I remember correctly it doesn't handle raid5/6 (yet?).
          In the first revision of the patch, when it was still marked RFC ONLY, it stated:

          Introduce a raid-stripe-tree to record writes in a RAID environment.

          In essence this adds another address translation layer between the logical
          and the physical addresses in btrfs and is designed to close two gaps. The
          first is the ominous RAID-write-hole we suffer from with RAID5/6 and the
          second one is the inability of doing RAID with zoned block devices due to the
          constraints we have with REQ_OP_ZONE_APPEND writes.

          This is an RFC/PoC only which just shows how the code will look like for a
          zoned RAID1. Its sole purpose is to facilitate design reviews and is not
          intended to be merged yet. Or if merged to be used on an actual file-system.
          So it is a solution for both RAID5/6 and RAID1 on zoned devices.
          The first use case of the stripe tree was targeting zoned devices, but the solution is applicable to RAID5/6 as well, if I understand it correctly.


          • #65
            kreijack I wonder how zfs solved this problem?


            • #66
              Originally posted by pkese View Post

              In the first revision of the patch, when it was still marked RFC ONLY, it stated:



              So it is a solution for both RAID5/6 and RAID1 on zoned devices.
              The first use case of the stripe tree was targeting zoned devices, but the solution is applicable to RAID5/6 as well, if I understand it correctly.
              Look at this

              https://lore.kernel.org/linux-btrfs/...E9@PH0PR04MB7416.namprd04.prod.outlook.com/

              The author reported that:
              I think both solutions have benefits and drawbacks.

              The stripe tree adds complexity, metadata (though at the moment only 16
              bytes per drive in the stripe per extent) and another address translation /
              lookup layer, it adds the benefit of being always able to do CoW and close
              the write-hole here. Also it can work with zoned devices and the Zone Append
              write command.

              The raid56j code will be simpler in the end I suspect, but it still doesn't
              do full CoW and isn't Zone Append capable. Two factors that can't work on
              zoned filesystems. And given that capacity drives will likely be more and more
              zoned drives, even outside of the hyperscale sector, I see this problematic.
              So yes, it can solve the raid5/6 problem. But it adds non-trivial complexity, so it may not be a general solution. raid56j (which has not materialized yet) may be a more general solution.


              • #67
                Originally posted by NobodyXu View Post
                kreijack I wonder how zfs solved this problem?
                My understanding is that zfs uses a variable stripe length.

                The raid5/6 write hole happens when a parity block is shared between different extents. If you update one extent, you need to update the shared parity too; if this is not done atomically, the parity and the stripe can end up out of sync.

                But if you use a variable stripe length, the parities are private to the extent. So if you need to update the extent, there is no risk of the parity and the extents going out of sync, because there is only ONE extent, and the extent is either fully written or not written at all thanks to COW.
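                For anyone who wants to try this, creating a RAID-Z pool is a one-liner (a minimal sketch, disk names made up); every record ZFS writes to it becomes its own stripe with its own parity, so no parity block is shared between unrelated writes:
                Code:
                # disk names are made up
                zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd
                zpool status tank   # shows the raidz1-0 vdev and its member disks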

                Of course even this has its own drawbacks:
                - fragmentation increases
                - space efficiency decreases

                My understanding is that ZFS apparently handles this efficiently only because of its concept of tiering (a faster disk used as cache). So it can group all the changes in the cache and then write them in big chunks to the destination disks.

                But I am not a zfs expert; this is only my understanding.

                Anyway, when I did some math to check whether the "variable length stripe" is better than a "journal", what I found is that the journal is a simpler and better way of handling the raid.
                What seems strange to me is having COW on top of a journal :-)


                • #68
                  Originally posted by kreijack View Post

                  This patch set has a different aim: allowing efficient use of zoned disks (i.e. append-only disks). I didn't look deeply into this patch set, but if I remember correctly it doesn't handle raid5/6 (yet?).

                  The btrfs and raid5/6 incompatibility stems from the fact that COW and raid5/6 don't mix well. The reason is that raid5/6 expects to overwrite the parities in place.
                  This means that in case of (e.g.) a power failure, the parities and the data can end up misaligned.
                  Actually, CoW plus combining the filesystem and block-level interfaces means that you can solve the RAID 5/6 write hole (which is the problem you allude to); ZFS does this, and that's why it doesn't have the issue.

                  The reason BTRFS RAID 5/6 doesn't solve this problem is that it wasn't designed for it initially. The problem is solvable; it just requires a change to the on-disk format (and this was stated by a BTRFS dev on a mailing list).

                  EDIT: Just noticed you posted references later.


                  • #69
                    Originally posted by S.Pam View Post

                    I seem to remember that ext4 isn't always faster. But it was a while back. I use reflink copies (cp --reflink) extensively and that is way faster than copying on ext4.
                    Reflinks are also much more space efficient. You can "store" data thousands of times as large as the underlying device as long as each copy is reflinked.
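                    A small sketch to see it in practice (file names are made up):
                    Code:
                    # a reflink copy shares the source's extents, so it completes almost
                    # instantly and initially costs only metadata
                    cp --reflink=always big.img big-copy.img

                    du -sh big.img big-copy.img   # each file reports its full apparent size
                    btrfs filesystem du -s .      # the shared extents are only stored once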


                    • #70
                      Originally posted by cynic View Post

                      Nope. If you back up garbage, you restore garbage, regardless of how you do your backups.
                      They probably meant how the backup software finds changes for incremental backups. It likely relies on file metadata rather than reading every file on every run. If a file is corrupted but its metadata is unchanged, the corrupted copy wouldn't be stored in the backup.
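                      rsync is a good example: its default quick check compares only size and modification time, and you have to ask for full reads to catch content-only changes (paths are made up):
                      Code:
                      # default quick check: files whose size and mtime are unchanged are
                      # skipped, so silent on-disk corruption never reaches the backup
                      rsync -a /data/ /backup/data/

                      # read and checksum every file instead (much slower, but it picks up
                      # content changes that the metadata doesn't show)
                      rsync -a --checksum /data/ /backup/data/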

                      Of course, if the metadata is corrupted then the backup would store the corrupted metadata, and if both the metadata and file are corrupted then both would be backed up. And if the backup device gets corrupted, then every deduplicated copy would be corrupted as well.
