Bcachefs File-System Plans To Try Again To Land In Linux 6.6


  • #31
    Originally posted by waxhead View Post
    And no, you are not totally screwed if one disk fails, depending on how you configure metadata. The failure mode may be perfectly acceptable, and besides, if one disk fails it does not have to be as taxing for the drives to duplicate the remaining replicas of the lost drive's data onto the other drives. E.g. with one drive lost, you *MAY* have a faster route to recovering the array, if you have spare space, than on a traditional RAID10.
    One disk failing is fine. As I said, any 2nd disk of the 3 remaining disks failing is not. All 3 of the remaining disks are basically guaranteed to have data chunks that were mirrored with the now dead disk. I know you know how all this works in btrfs, just trying to make it clear to others who haven't looked into it. I originally assumed I'd go with a btrfs RAID10 on my desktop to avoid out of tree modules, but the write strategy forced me over to ZFS here too.
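
    The "any second failure loses data" point can be illustrated with a toy simulation. This is not btrfs's actual allocator; it just models the key property that each chunk's two copies land on an arbitrary pair of disks rather than on a fixed mirror partner:

    ```python
    import random

    def allocate_chunks(num_disks=4, num_chunks=1000, seed=0):
        """Toy model of btrfs raid10-style chunk allocation: each chunk's
        two copies go to an arbitrary pair of disks, not to a fixed
        mirror partner as in traditional RAID10."""
        rng = random.Random(seed)
        return [tuple(rng.sample(range(num_disks), 2)) for _ in range(num_chunks)]

    def disks_sharing_data_with(dead_disk, chunks):
        """Disks holding the sole surviving copy of some chunk whose other
        copy was on dead_disk; losing any of them loses data."""
        partners = set()
        for a, b in chunks:
            if a == dead_disk:
                partners.add(b)
            elif b == dead_disk:
                partners.add(a)
        return partners

    chunks = allocate_chunks()
    at_risk = disks_sharing_data_with(0, chunks)
    # With ~1000 chunks over 4 disks, every surviving disk shares chunks
    # with disk 0, so any second failure loses data -- unlike fixed-pair
    # RAID10, where only 1 of the 3 remaining disks would be fatal.
    print(sorted(at_risk))  # [1, 2, 3]
    ```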

    Comment


    • #32
      Originally posted by pWe00Iri3e7Z9lHOX2Qx View Post

      I think the biggest problem with the RAID10 example I gave is that nobody who is familiar with RAID10 from other systems would expect a write pattern like this to be possible.

      Code:
      | SDA | SDB | SDC | SDD |
      |-----|-----|-----|-----|
      | A1  | A2  | A1  | A2  |
      | B1  | B1  | B2  | B2  |
      | C1  | D1  | D1  | C1  |
      | D2  | C2  | C2  | D2  |
      I think for btrfs they should have actually named these profiles something else, because a lot of assumptions get made based on a name and previous familiarity / experience. I certainly wouldn't instinctively assume that writes were like "mini RAID10s" going everywhere willy-nilly and that I was totally screwed if any second disk fails.
      I didn't know this.

      Thank you!


      Comment


      • #33
        Originally posted by EphemeralEft View Post

        It's actually BTRFS | LVM | DM-Crypt | BCache | DM-RAID, where DM-Crypt is managed by Cryptsetup and DM-RAID is managed by LVM. The top-most LVM Layer is split into different filesystems for different purposes.

        Although I'm not using it, LVM actually has the option to layer DM-Integrity over each RAID member for per-member corruption detection. Because DM-Integrity treats corruption as read errors, the other RAID members are automatically used if the data on one member is corrupt. The RAID layout is a 6x4TB “raid6_ls_6”, a non-standard layout combining left-symmetric RAID5 (distributed parity) with the last disk dedicated to Q-syndrome parity. This has the benefit that I can switch between RAID5 and "RAID6" without reshaping, at the expense of losing one disk's worth (1/6) of read performance. In theory, RAID6 should also be able to tell which member is invalid in the case of a mismatch (without per-member DM-Integrity), but DM-RAID/LVM doesn't currently have that feature.
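
        As a sketch of why RAID6 can in principle locate the invalid member: with the standard P/Q syndrome math over GF(2^8), a single silently corrupted data byte shifts P by the error value and Q by the error value scaled by g^i, so the member index i can be recovered from the two syndromes. This is illustrative textbook math, not dm-raid/LVM code, and it assumes exactly one corrupted data member:

        ```python
        # GF(2^8) exp/log tables with reduction polynomial 0x11d
        # (the polynomial the Linux RAID6 code uses).
        EXP = [0] * 512
        LOG = [0] * 256
        x = 1
        for i in range(255):
            EXP[i] = x
            LOG[x] = i
            x <<= 1
            if x & 0x100:
                x ^= 0x11d
        for i in range(255, 512):
            EXP[i] = EXP[i - 255]

        def gf_mul(a, b):
            if a == 0 or b == 0:
                return 0
            return EXP[LOG[a] + LOG[b]]

        def pq_syndromes(data):
            """P = XOR of data bytes; Q = sum over GF(2^8) of g^i * d_i."""
            p = q = 0
            for i, d in enumerate(data):
                p ^= d
                q ^= gf_mul(EXP[i], d)
            return p, q

        def find_corrupt_member(data, p_stored, q_stored):
            """Locate and repair a single silently corrupted data byte."""
            p_cur, q_cur = pq_syndromes(data)
            sp, sq = p_stored ^ p_cur, q_stored ^ q_cur
            if sp == 0 and sq == 0:
                return None                   # stripe is clean
            i = (LOG[sq] - LOG[sp]) % 255     # error was scaled by g^i
            return i, data[i] ^ sp            # (member index, corrected byte)

        data = [0x37, 0xA2, 0x05, 0xFF]       # one byte per data member
        p, q = pq_syndromes(data)
        data[2] ^= 0x5A                       # silent corruption on member 2
        idx, fixed = find_corrupt_member(data, p, q)
        print(idx, hex(fixed))                # identifies member 2, restores 0x05
        ```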

        BCache is used in write-through mode, so the SSD can fail without data loss. My boot partition is a RAID1 at the beginning of all RAID members (thanks to Grub) so truly any 2 drives could fail without losing any data. I use the integrity checking of BTRFS as a sanity check of the RAID, BCache, and the SSD. It also functions as a janky method of "authenticated encryption". Besides the BTRFS RAID56 issues, at-rest encryption is important to me. So until BTRFS supports encryption, I'd need to encrypt all RAID members individually.

        I honestly prefer having separate layers that I can manage myself. I can (and eventually will) switch BCache to DM-Cache. And move integrity checking from BTRFS to DM-Crypt for AEAD. A while ago I switched from MDAdm to DM-RAID. I couldn't mix and match implementations with an all-in-one solution. I also probably couldn't tweak as many settings.
        This is like the poster child for why so many of us whine about wanting ZFS to get merged. You can do very powerful things with all these layers, but it is extremely complicated, especially when something goes wrong. You obviously know what you are doing, but I've seen plenty of posts online from people attempting similar setups where they don't even order the layers correctly and end up negating some benefit they think they are getting. Having the volume management / encryption / file system / verification / etc. all baked into one thing is so much easier to grok and work with for most people.

        Comment


        • #34
          Originally posted by Mark Rose View Post

          Sure, that's for ZFS and bcachefs, but I've not heard of btrfs having tiered storage. I've actually been thinking it would be a fun project to get my feet wet with kernel development (no commitment yet).
          All Linux filesystems (including BTRFS) have tiered storage if you put them on top of BCache (what BCacheFS is based on) or DM-Cache. Same with encryption if you put it on top of DM-Crypt. It's honestly a better solution than duplicating that work for every filesystem. And it works for non-filesystem block devices, too.

          Comment


          • #35
            Originally posted by pWe00Iri3e7Z9lHOX2Qx View Post

             This is like the poster child for why so many of us whine about wanting ZFS to get merged. You can do very powerful things with all these layers, but it is extremely complicated, especially when something goes wrong. You obviously know what you are doing, but I've seen plenty of posts online from people attempting similar setups where they don't even order the layers correctly and end up negating some benefit they think they are getting. Having the volume management / encryption / file system / verification / etc. all baked into one thing is so much easier to grok and work with for most people.
            I get where you're coming from, and I agree that most people can't (or at least shouldn't) make setups like mine. But that's not a kernel issue; the technology is already there and it just needs a simple userspace tool to manage everything. There are only a few features that you would gain with an all-in-one solution, while duplicating existing functionality is almost always a bad idea.

            Comment


            • #36
              Just an anecdote:

              I used btrfs for 2 years in a raid10 configuration and experienced data loss; particularly bad data loss, where many files each picked up a small amount of corruption.

              Turns out it was some combination of raid10, autodefrag, and compression.

              I used bcachefs for 3 years with the same drives, and the only issues were the filesystem updates when I upgraded versions.

              Now I'm using btrfs raid10 with Proxmox, but I won't use any of the other features for fear of corruption again.

              Comment


              • #37
                Originally posted by woddy View Post

                I really don't understand... you criticize the alleged instability of Btrfs, and then you are looking forward to using an experimental fs, which as of today hasn't even been accepted into the kernel tree.
                Strange, isn't it?
                You have to look at the ethos and attitudes of the projects and their developers to understand this. BTRFS's development I can summarize as trigger-happy: historically, untested or ill-designed features were added into the Linux tree, and in the worst cases things blew up, something that has pissed off Linus a few times. RAID 5/6 support in BTRFS is a very good illustration of this: an implementation that can cause data loss was merged into the tree about a decade ago, and for the majority of its existence it didn't even warn users that it is highly experimental and that you probably should not create a RAID 5/6 array; the warning was only added around a year ago. In other words, BTRFS seems to follow the "move fast, break things" mentality of Facebook, which doesn't work well when you are dealing with filesystems.

                On the other hand, with both ZFS and bcachefs, the development teams have a very strict attitude when it comes to merging changes: they will only merge changes they consider properly designed and stable. The difference between ZFS and bcachefs is that, as is well known, ZFS is never going to be officially accepted into the Linux kernel tree, which brings its own set of problems.

                That's why people are looking forward to bcachefs: it's developed by someone with the same quality-control, attention-to-detail, and stability-first design mindset as ZFS, but it is a Linux-first filesystem. Being a filesystem, it will take years at least for the experimental flag to be lifted, but unlike with BTRFS, that flag will likely only be removed when the filesystem actually is stable.

                Originally posted by stormcrow View Post

                Personal note: I think the problem with BTRfs is that the only features that get enough attention to be stable and performant are the use cases the maintainers (mainly Facebook & Oracle?) utilize. For everyone else, we have to use ZFS which has a lot of big companies working on it so there's a more diverse user and developer base.
                Yes, and this is precisely the problem. Some parts of BTRFS are evidently very stable, but that's the issue: you need to have the same setup as Facebook/Oracle, and in the beginning this wasn't entirely clear. With ZFS, if a feature is in the code base, it is just as supported/tested/used as any other feature. Furthermore, the ZFS developers took different use cases into account right at the start when designing the on-disk format; BTRFS didn't. The reason the RAID 5/6 issues in BTRFS have remained unresolved for a decade is that solving them correctly would require a change to the on-disk format, whereas ZFS accounted for this decades ago. That raises the question of why RAID 5/6 support was added to BTRFS at all if they knew it didn't properly solve the write hole (something that is an expectation for a combined volume-manager + CoW filesystem).
                Last edited by mdedetrich; 12 July 2023, 11:48 PM.

                Comment


                • #38
                  Originally posted by EphemeralEft View Post

                  All Linux filesystems (including BTRFS) have tiered storage if you put them on top of BCache (what BCacheFS is based on) or DM-Cache. Same with encryption if you put it on top of DM-Crypt. It's honestly a better solution than duplicating that work for every filesystem. And it works for non-filesystem block devices, too.
                  The advantage of not doing things this way is that if you include everything in a single filesystem, as ZFS does, not only is it far simpler for end users, it also means the filesystem (in this case ZFS) is aware of everything, since nothing is abstracted away into separate layers such as DM-Cache or DM-Crypt.

                  In fact, that was the killer feature of ZFS in the first place: it combined the filesystem and volume management. That was supposedly very anti-Linux/Unix in design, but it also provided massive advantages, because being aware of both layers allowed ZFS to do things, both performance-wise and data-integrity-wise, that are not possible with mdadm + filesystem.

                  I personally experienced how annoying the separation of abstractions in Linux is when dealing with BTRFS, compared to ZFS, from a usability standpoint. I experimented with BTRFS on one project, and when I initially created the filesystem I didn't set up an SSD read cache using bcache. Unfortunately, I didn't realize at the time that if you want this functionality you have to create the BTRFS filesystem in a specific way from the start, which means that, officially speaking, you would need to reformat your BTRFS RAID setup to add bcache (IIRC, when I looked into this, there was a way around it, but it was some random script on the internet).

                  With ZFS this is a non-issue: you can create a filesystem and, at any point in the future, freely add and remove an L2ARC (i.e. SSD read cache) without reformatting. The same goes for compression and other settings.

                  Comment


                  • #39
                    Originally posted by EphemeralEft View Post

                    I get where you're coming from, and I agree that most people can't (or at least shouldn't) make setups like mine. But that's not a kernel issue; the technology is already there and it just needs a simple userspace tool to manage everything. There are only a few features that you would gain with an all-in-one solution, while duplicating existing functionality is almost always a bad idea.
                    These are the reasons I was excited about stratis when I first read about it. Pity they still don't even seem to have all the layers in place years later.

                    Comment


                    • #40
                      Originally posted by EphemeralEft View Post

                      All Linux filesystems (including BTRFS) have tiered storage if you put them on top of BCache (what BCacheFS is based on) or DM-Cache. Same with encryption if you put it on top of DM-Crypt. It's honestly a better solution than duplicating that work for every filesystem. And it works for non-filesystem block devices, too.
                      Well, it is more a case of "make things as simple as possible, but not simpler". All these layered structures tend to become overengineered, and btrfs already breaks the tiers by doing RAID and distributing data across devices; adding a cache just goes one step further. A caching subsystem can benefit from knowing about files, especially in the case of CoW. Imagine some "hot" extent, like a shared library: with all the access statistics gathered during the last year, this extent ought to have the highest caching priority, unless you know it was overwritten yesterday and now persists only because of some backup snapshot. So you either have to guess from the access pattern, or overcomplicate the tier API to pass data needed only by these two modules.
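
                      The hot-but-superseded extent case can be sketched in a few lines. The names and fields here are purely illustrative (not a real btrfs or bcache API); the point is that a block-level cache sees only access counts, while a filesystem-aware cache also knows a CoW overwrite made the extent cold:

                      ```python
                      from dataclasses import dataclass

                      @dataclass
                      class Extent:
                          """Toy model: access stats plus CoW liveness (illustrative only)."""
                          accesses_last_year: int
                          superseded: bool          # overwritten; a newer copy serves reads
                          pinned_by_snapshot: bool  # kept alive only by a backup snapshot

                      def cache_priority(e: Extent) -> int:
                          # A filesystem-aware cache can demote an extent that a CoW
                          # overwrite has orphaned; a pure block cache cannot know this.
                          if e.superseded and e.pinned_by_snapshot:
                              return 0
                          return e.accesses_last_year

                      hot_lib = Extent(accesses_last_year=50_000, superseded=False, pinned_by_snapshot=False)
                      stale   = Extent(accesses_last_year=50_000, superseded=True,  pinned_by_snapshot=True)
                      print(cache_priority(hot_lib), cache_priority(stale))  # 50000 0
                      ```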

                      Comment
