Btrfs RAID 5/6 Code Found To Be Very Unsafe & Will Likely Require A Rewrite


  • #71
    Originally posted by starshipeleven
    Raid1 is production ready, raid5/6 experimental.
    Is it so hard to understand that the two features have exactly jackshit of common code (apart from basic stuff anyway) so you can have one that is production ready and the one that is nowhere near?
    I can't stress how much starshipeleven is right here.

    RAID in mdadm/dm and in BTRFS is different.
    Even if they do actually share code (unlike ZFS, which seems to reimplement its own damn kitchen sink; that's one of the reasons it's a lot less liked than BTRFS, and why everyone complains about layer transgression in ZFS but not in BTRFS. ZFS is its own kitchen sink; BTRFS is actually a whole stack that implements a B-tree based filesystem but also shares code with other kernel facilities. It's more a kind of wrapper around B-trees + RAID shared with mdadm/dm + partitioning shared with dm/lvm/evm/dynvol/blah...),
    that shared code is only used to read and write stripes. That's it.

    Then you need a bunch of management code.
    Just like mdadm adds its RAID array management code, or all clients of device mapper like LVM add their own volume management.

    In the case of BTRFS that code is relatively simple. It's all about using checksumming to pick which of the present copies is the correct one.
    And handling adding/removing storage to the pool.
    And some balancing code to spread the copies around (that's what the whole damn B-tree technology is about).
    In BTRFS this code has existed for ages. It's rock solid. It has been put into production by companies like Facebook, Suse, etc.
    It has survived all the testing thrown at it.
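To make the "use checksums to pick the correct copy" idea concrete, here is a minimal sketch. It is not BTRFS's actual code: the function names are made up for illustration, and `zlib.crc32` merely stands in for whatever checksum the filesystem stores alongside each block.

```python
import zlib


def read_mirrored(copies, expected_crc):
    """Return (good_data, bad_indexes) for a block stored as several copies.

    `copies` is a list of bytes objects, one per mirror (hypothetical layout).
    `expected_crc` is the checksum recorded when the block was written.
    """
    good = None
    bad = []
    for index, data in enumerate(copies):
        if zlib.crc32(data) == expected_crc:
            if good is None:
                good = data
        else:
            bad.append(index)
    if good is None:
        raise IOError("every copy fails its checksum; nothing to repair from")
    return good, bad


def repair_mirrors(copies, expected_crc):
    """Overwrite every bad copy with a copy that passed verification."""
    good, bad = read_mirrored(copies, expected_crc)
    for index in bad:
        copies[index] = good
    return copies
```

A plain mirror without checksums can tell that two copies differ but not which one is right; the stored checksum is what breaks the tie.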

    The only missing feature is being able to parametrically set the number of copies (putting 3 instead of 2 to be able to lose 2 drives instead of 1, to have feature parity with RAID6).
    And that's currently being worked on, and will end up in production some day in the future.

    RAID5 is *completely different*. Its redundancy works around having 1 (or 2) parity blocks for every N-1 (or N-2) blocks of data.
    First you need a component that handles an array of drives, correctly reacts when one is missing and is able to rebuild a replacement drive.
    Although mdadm had it almost from day one (that was the whole point), and DMraid correctly leverages dm to do it (even if most firmware RAIDs only support combos of RAID0 and 1), LVM still doesn't feature it in mainstream.
    (And do you complain that LVM is "not production ready after a decade" just because it missed RAID5/6 for so long ? Nope. You just stack mdadm RAID5/6 under it or restrict yourself to RAID0/1 in LVM).
    BTRFS is in the same position as LVM here. (except that the experimental RAID5/6 feature is in the vanilla kernel code instead of floating as some patches on some obscure mailing list).
    Then for BTRFS you would need code that can leverage checksums to guess which combo of data and parity is correct, and which data or parity block is corrupt.
    That code doesn't exist yet in the kernel. That code needs to be written.
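As a rough illustration of the two pieces described above (parity-based rebuild, then checksums to decide what is corrupt), here is a single-parity sketch. All names are hypothetical, the stripe layout is simplified to "N-1 data blocks followed by one XOR parity block", and `zlib.crc32` stands in for the filesystem's checksum. It is not how the kernel does it.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, missing):
    """Recompute the block at index `missing` from all the other blocks."""
    return xor_blocks([b for i, b in enumerate(stripe) if i != missing])


def find_and_fix(stripe, data_crcs):
    """Use per-data-block checksums to decide which block is wrong."""
    n_data = len(stripe) - 1
    bad = [i for i in range(n_data) if zlib.crc32(stripe[i]) != data_crcs[i]]
    if not bad:
        # All data verifies, so any mismatch must be in parity: recompute it.
        stripe[-1] = xor_blocks(stripe[:n_data])
    elif len(bad) == 1:
        candidate = rebuild(stripe, bad[0])
        if zlib.crc32(candidate) != data_crcs[bad[0]]:
            raise IOError("data block and parity are both inconsistent")
        stripe[bad[0]] = candidate
    else:
        raise IOError("more bad blocks than single parity can repair")
    return stripe
```

Note that the parity block itself has no checksum in this sketch, which mirrors the weakness discussed later in the thread: parity can only be trusted indirectly, by checking what it reconstructs.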


    Originally posted by gbcox
    No, we're not talking about ZFS. Read the title of this article... It's not about ZFS. It's about BTRFS --- jeez...
    No. Nope. We, the other contributors to this thread, are talking about BTRFS, a real world FS, actual written code, that exists, that is already rock stable for some features, and still experimental for others.
    We talk about an FS that is into production at several companies.
    We're talking about something that is a great tool, but that comes with quite a few caveats that you need to pay attention to (or that you need to delegate to someone else to manage for you. You don't give a shit how FB runs their installation. And if you pick a product from Suse or Jolla, you rely on their code to handle the management for you).

    What you, gbcox, are talking about is your dream filesystem that doesn't exist yet.
    What you're constantly bitching about boils down, according to starshipeleven, to "I want a file system that is like ZFS, but has none of ZFS's drawbacks... and works for everything from embedded all the way to massive clusters... and BTRFS isn't that thing I dream of... Waaaaaaaa....."

    Originally posted by gbcox
    As far as production ready and experimental... if you believe that a file system can't be production ready in a decade, well then, that just kind of says it all, doesn't it.
    We're not talking about FAT, here. A filesystem that is so simple, it only requires a few lines of code, that was able to run on 8bit home microcomputers back then, and that can work on an arduino right now.

    We're talking about a project that is very complex, and has hundreds of features.

    Some of them have been rock solid for a couple of years.

    Some of them are still highly experimental (RAID5/6)

    Some don't even exist yet (on-line dedup, integrated crypto, integrated B-cache style multilayered caching).

    Do you complain that LVM is not complete because it doesn't have a fully functional and production ready RAID5/6 ?

    Did you use to complain that EXT2/3/4 is not production ready because it missed an integrated crypto layer for so long, or because its compression still isn't mainline even today ?

    Same for BTRFS: it IS production ready for some usage patterns, still experimental for others.



    Originally posted by gbcox
    The other was from Kent Overstreet - and he also has a bit of expertise in that area.
    Well, let's us first see how his BcacheFS turns out.
    It's one thing to complain that BTRFS was done too hastily.
    It's another thing to actually deliver all the things that he promises (snapshots for free, checksumming).
    And let's see if he ever gets production ready RAID5/6.

    I'm not doubting that he will manage to get it done eventually. I'm just saying, don't expect a feature perfect flawless BcacheFS to pop-up suddenly into existence in the next few months.

    Originally posted by gbcox
    If you read my comment, I clearly said that I took the necessary precautions to run it. What I was talking about was their inability to deliver a production ready product for what was being touted as the next Linux filesystem. Ten years later and it's not done.
    What have you been reading ? It has been in the wiki for ages :

    The one missing piece, from a reliability point of view, is that it is still vulnerable to the parity RAID "write hole", where a partial write as a result of a power failure will result in inconsistent parity data.
    • Parity may be inconsistent after a crash (the "write hole")
    • Parity data is not checksummed
    In other words: RAID5 isn't yet correctly plugged into the checksumming infrastructure. You know, the thing whose job it is to keep the data's integrity under BTRFS.
    Or said differently: "scrub", the basic functionality that you (or Suse's scripts) need to perform regularly to make sure that your data is okay, could very probably eat your data up.
    (And apparently that's exactly what was confirmed and requires a rewrite).
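The "write hole" quoted from the wiki can be shown with a toy model. This is only a sketch under simplified assumptions: three data blocks plus one XOR parity block, a crash that lands after the data write but before the parity write, and an invented function name.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def simulate_write_hole():
    """Return (block 1 as written, block 1 as rebuilt after crash + drive loss)."""
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)

    # Step 1: block 0 is rewritten, but power fails before parity is updated.
    data[0] = b"ZZZZ"
    # `parity` still describes the old stripe contents.

    # Step 2: some time later the drive holding block 1 dies.
    original_block_1 = data[1]
    rebuilt_block_1 = xor_blocks([data[0], data[2], parity])
    return original_block_1, rebuilt_block_1


if __name__ == "__main__":
    original, rebuilt = simulate_write_hole()
    print("written:", original, "rebuilt:", rebuilt)
```

The point of the toy: block 1 was never touched by the interrupted write, yet it is the one that comes back wrong. Without a checksum covering parity, nothing flags the stale parity block before it is used for reconstruction.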

    On the other hand, *that* is the part rewritten after ten years (the same part that isn't even handled properly by LVM and requires stacking with mdadm).
    In other news, RAID0/1 are still performing as they should, and are still giving good results in production.
    The only complaint is not yet being able to set the width of RAID1 to 3 (2 extra copies).

    I'm *also* following BTRFS closely, because I'm *also* using it personally and professionally.
    For me, it was clear BTRFS RAID6 should not be relied on yet.

    Currently, I stack it above a classical mdadm+lvm whenever raid6 is needed (at home).

    It needs a rewrite. Major portions are still experimental.
    Still other portions are production ready and have stood the test of time...

    Major server vendors are quietly moving to XFS instead. It has simply taken too long to stabilize.
    According to your definition, XFS is also (WARNING, sarcasm ahead) an unstable piece of shit that is still unfinished after nearly 20 years of development~

    - It still can only be grown and not reduced in mainstream code. That's a critical feature missing!~ XFS sucks~
    - There is no copy-on-write yet, nor log-structured. The system relies on a simple journal!~ In 2016!~ When UDF has featured it for ages!~
    - There are no snapshots for free. XFS relies on freezing the FS, doing a slow snapshot in LVM, and continuing only afterward. It has nothing better than EXT4~

    And now, they are rewriting part of the file allocation code. They are replacing the B+ trees with plain B-trees (hmm... where have I heard that one being used before ?) Their code is unfinished, and BTRFS is the better solution !~

    Fedora attempted several times to make it the default and wisely chose against it.
    And meanwhile Suse has picked it and it works for them.

    I would tend to agree however that based upon the apparent development cycle of BTRFS, that by the time it is ready, it very well could be a moot point - along with many other features they are "working" on - but not everyone expects a filesystem project to be decades in the making.
    Nobody expects a file system to be 20 years in the making, and XFS is still lacking modern features offered elsewhere.




    • #72
      Originally posted by DrYak

      For fuck's sake, please try to read and understand starshipeleven's explanation.

      On BTRFS, RAID1 has a fixed width of 2: it keeps one extra copy of the data so you can lose 1 drive.
      On a 5 drive BTRFS system, you have 2 copies of all data and you get 4x the size. You can at most lose 1 drive.

      On RAID5 system, you have 4 blocks of data for 1 block of parity. You *ALSO* get 4x the size. You *ALSO* can lose max 1 drive.

      Both solutions give THE EXACT SAME END RESULT, even if they use slightly different technologies under the hood (copies for BTRFS RAID1, hence the name RAID1. And parity for mdadm / DeviceMapper RAID5, hence the name).
      If you really believe this, then you are braindamaged somehow. Some checksums somewhere between your ears somehow do not meet. And there is no ECC to plaster over the hole. Early sign of Alzheimers, perhaps ? Please, see your doctor...



      • #73
        Well, it seems pretty obvious to me now that BTRFS isn't going anywhere anytime soon. I think that's unfortunate though. It's not stable enough for daily desktop usage even. Free space fragmentation means you will run out of free space eventually, even if there actually is enough free space. The counter argument to that is to script a solution using tools that their own developers strongly recommend never using.

        I've read through this thread several times now. I have a better understanding of the arguments being made now, but I still can't imagine a single use case where CoW techniques will do anything but hurt performance or break files with no possibility of recovery. This idea that RAID is somehow a better option at the file layer I personally think sounds like nonsense. It's -the- reason why BTRFS is so unstable and why it's taking so fucking long to develop. It needs at least another 10 years of development to be useful, probably more, and by then it'll be too late. It's already too late.



        • #74
          Originally posted by starshipeleven
          Facebook is using them on their webservers (as far as what I read), which are the more expendable of their infrastructure, and don't need huge storage arrays anyway.
          Please don't use a company such as Facebook as a reference for an honourable business. Mark Zuckerberg said "A squirrel dying in front of your house may be more relevant to your interests right now than people dying in Africa." and made it his business to support and to profit from such a mindset. So when anything good comes out of Facebook then it certainly wasn't intentional but is a mere by-product of his highly profit-driven business. You then wouldn't argue for BitCoin being an innovative technology and base your argument on the number of criminals who are using it or the number of trades of illegal items, or would you? Consequently, should Facebook (or any other company) ever make the mistake to fall over its business strategies, i.e. with financial fraud, then linking it with a technology will ultimately also harm the technology.

          It is often not clear how good or bad a technology ultimately is. A good technology often gets used by everyone. A bad technology can have a lot of use and may only be lacking the support by good people. Please keep this in mind when arguing over a technology and ideally keep it strictly technical.

          TL;DR: "Facebook is using it" doesn't sound like the best argument.



          • #75
            Originally posted by sdack
            Please don't use a company such as Facebook as a reference for an honourable business.
            I never did. I just said that Facebook, the company currently employing the btrfs lead dev, is using btrfs in production in their webservers, according to publicly available material.



            • #76
              Originally posted by duby229
              It's not stable enough for daily desktop usage even.
              Lolwhut? You get this from?
              Free space fragmentation means
              that you will have to defragment it every once in a while, just like NTFS on windows.

              I still can't imagine a single use case where CoW techniques will do anything but hurt performance or break files with no possibility of recovery.
              That's your own problem. Meanwhile ZFS has been a CoW filesystem on Unix and then on Linux too, btrfs is a CoW filesystem, and XFS is going to become a CoW filesystem soon.

              This idea that RAID is somehow a better option at the file layer I personally think sounds like nonsense.
              You are wrong.

              It's -the- reason why BTRFS is so unstable and why it's taking so fucking long to develop.
              No, the reason it is taking so long is that the features companies want most aren't the ones people want most.
              Companies wanted btrfs's RAID1 more, and that is production-grade. Companies don't seem to care much about RAID5/6 in btrfs, so that code isn't developed.



              • #77
                Originally posted by DrYak
                For fuck's sake, please try to read and understand starshipeleven's explanation.

                On BTRFS, RAID1 has a fixed width of 2: it keeps one extra copy of the data so you can lose 1 drive.
                On a 5 drive BTRFS system, you have 2 copies of all data and you get 4x the size. You can at most lose 1 drive.

                On RAID5 system, you have 4 blocks of data for 1 block of parity. You *ALSO* get 4x the size. You *ALSO* can lose max 1 drive.

                Both solutions give THE EXACT SAME END RESULT, even if they use slightly different technologies under the hood (copies for BTRFS RAID1, hence the name RAID1. And parity for mdadm / DeviceMapper RAID5, hence the name).
                No wait a sec. On a btrfs raid1 you get HALF the total array size as usable (the other is used for copies).
                On that hypothetical 5 drive array you get 2.5 drives of space and 2.5 drives are wasted for copies.
                That's still RAID1. To get the magic "we only waste one drive regardless of how many drives there are in the array" thing you need striping with parity, and the code for that isn't operational in btrfs.
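The arithmetic behind this correction is easy to check. The sketch below assumes equally sized drives and ignores metadata overhead; the function names are invented for the example.

```python
def usable_mirror(n_drives, drive_size, copies=2):
    """Usable space when every block is stored `copies` times (RAID1-style)."""
    return n_drives * drive_size / copies


def usable_parity(n_drives, drive_size, parity_drives=1):
    """Usable space when `parity_drives` worth of space holds parity (RAID5/6-style)."""
    return (n_drives - parity_drives) * drive_size


if __name__ == "__main__":
    n, size_tb = 5, 1.0
    print("RAID1, 2 copies:", usable_mirror(n, size_tb), "TB")
    print("RAID5:", usable_parity(n, size_tb), "TB")
    print("RAID6:", usable_parity(n, size_tb, parity_drives=2), "TB")
```

With five 1 TB drives, two-copy mirroring leaves 2.5 TB usable while single parity leaves 4 TB, so the two layouts tolerate the same single-drive loss but do not give "the exact same end result" in capacity.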



                • #78
                  Originally posted by Brane215
                  If you really believe this, then you are braindamaged somehow. Some checksums somewhere between your ears somehow do not meet. And there is no ECC to plaster over the hole. Early sign of Alzheimers, perhaps ? Please, see your doctor...
                  And here we can observe a very good way to still show you're inferior to your opponent even if you are right. I can understand calling people "moron" or something, but a full paragraph dedicated to personal offences? gimme a break.



                  • #79
                    Why would Alzheimer's be a personal offence ? Would it be acceptable to mock someone with it ? I was just worried.

                    "Lectures" like his last one faded into the background for me long ago. I've learned how to filter out various morons, but they still ruin the signal to noise ratio of a communication channel. And since they are not simple white noise, their chirps get annoying. This time I had my filters down for some reason and this "lecture" left me flabbergasted. This wasn't an ordinary moron, this guy is full of giant mental blind spots. So I tried to be useful.

                    But these lectures do get to be tiring. The world is full of people that read an article and all of a sudden they have their "42", the answer for the existence of the universe and everything within it. It's a pity that Douglas Adams died; he would be proud that he correctly predicted the explosion of the moron population.

                    It's not for nothing that SpaceX is working so feverishly on a mission to Mars and NASA is finding a "menacing comet" every 15 minutes.






                    • #80
                      Originally posted by Brane215
                      But these lectures do get to be tiring. World is full of people that read an article and all of a sudden they have their "42" - answer for the existence of universe and everything within it.
                      I forgot to add - and above all an IRRESISTIBLE URGE TO FORCE EVERYONE AROUND THEM into compliance with their new idea.
                      This is not about just BTRFS, it applies to everything - solar vs nuclear, batteries of the week, language of the future etc etc.

