Bcachefs Multi-Device Users Should Avoid Linux 6.7: "A Really Horrific Bug"


  • #71
    Originally posted by intelfx View Post

    I like bcachefs, but that's pure mental gymnastics. By this measure btrfs, too, never eats your data because btrfs restore can almost always recover it :-)
    I had an issue where my RAID10-configured btrfs silently introduced corruption into files, and I would only find out when attempting to access them.

    I've seen theories thrown around, but I have no idea what caused it. Most likely it was the result of some brownouts, but I have no way to confirm that. Since then, I've used RAID10 with btrfs without issue, but I'm also not making use of anything like autodefrag or compression, as I just don't trust them.

    Two takeaways:
    1. BTRFS development and Bcachefs development are very different. Bcachefs doesn't add a feature without a certain level of stability, and it also gates those features behind an EXPERIMENTAL kernel flag.
    2. Don't put important data on things you haven't already tested with unimportant things first.

    Originally posted by Quackdoc View Post

    I've posted about it in more depth before, but I've had issues using btrfs as a VM drive (with chattr applied). I've used it as a cache drive for video work (writing and deleting high-quality image sequences: EXR, TIFF, PNG, JXL, etc.), I've had issues using it as a portable HDD for movies and games (emulators, mostly), and more.
    This sounds very much like what happened to me. At least once a month I ended up having to do an integrity check with Steam, only to find that 3-5 files needed to be re-acquired. I was also using it for some fairly large video projects (over two hours long, multiple cameras), and the software would also create low-res proxy video to help deal with the project sizes. I came back to those videos months later and found there were chunks missing due to corruption.
    Last edited by lyamc; 18 March 2024, 02:15 PM.



    • #72
      Originally posted by Old Grouch View Post

      The plural of anecdote is not data.

      I don't want to go into the detail of generating and testing hypotheses, and all the background work necessary in any research following a scientific method. Anecdotes can generate ideas suitable for further testing, and whole swathes of academia rely on qualitative methods, but a collection of anecdotes, no matter how large, is not informative. Processing according to pre-formulated criteria is necessary to transform the collection into data from which information might possibly be derived.

      It's a nice quip, but wrong. Sorry.
      This is the stupidest thing I have ever read. If you go only with controlled variables, you will never hit outliers, and that in itself is extremely unreliable.



      • #73
        Originally posted by mdedetrich View Post

        I'm sorry to break it to you, but giving notice and then solving the bug within a week is the opposite of poor notice, unless it's a high-level CVE. Let me also remind you that bcachefs is officially marked as EXPERIMENTAL in the kernel; in other words, as a user, if you are creating this filesystem you know what you are signing up for (i.e. it will have bugs like this).

        I actually maintain a large open-source project that is used in mission-critical use cases, and the only thing that's poor about this is the stonewalling refusal to put it into 6.7 LTS; that's what would actually help users.

        Like, this is the problem: your incessant "can someone please think of the children" is ignoring all context and nuance.
        6.7 isn't LTS; 6.6 is. No one refused to take his patches into 6.7. It's just that he's yelled at the stable team a bunch of times, refuses to follow the rules and guidelines, and someone on the stable team is on vacation, further slowing down the hand-holding that Kent obviously needs. He hasn't maintained kernel code in a while (and it shows). No big deal, but he's making every bit of the learning curve someone else's problem. I don't even care; I'm not going to touch bcachefs until at least a few more releases, once this all settles down. I'm just worried that Kent is going to fall apart. Look at this email to the stable team:

        You guys need to get your shit together - I've already let people know
        that 6.7 is not safe to use and they need to immediately upgrade to 6.8,
        but unfortunately not everyone will see that.

        This is going to cause people real pain.
        This has nothing to do with Kent being rude (which obviously he is). It's that he's clearly angry, and that's not healthy, and he's clearly pushing away people who are under no obligation to help his little pet project. That would be a loss all around, because Kent has spent a lot of time on it, and there's a lot of promise in bcachefs.

        I resigned to keep all of his little fights out of this forum, but he was the one who wanted to make a big deal about this bug, so we have to tell the truth now about what led to his patch not making it into 6.7.10.



        • #74
          Originally posted by Quackdoc View Post

          this is the stupidest thing i have ever read. if you go only with controlled variables you will never hit outliers and that in itself is extremely unreliable.
          Where did I say controlled variables? I said 'pre-formulated criteria' which is not the same thing. Looking through a mass/mess of anecdotes with no specific plan isn't going to achieve much.

          There are lots of texts out there on experimental methods. One of the key points is deciding in advance what your hypothesis is, and how you are going to test it. You are going to, at least, need:

          • Background and Rationale
          • Explanation for choice of Comparators
          • Specific Objectives or Hypotheses
          • Study Design

          (The above is from the SPIRIT checklist.)

          There is a lot more. The Aristotelian idea of 'simply observing' went out of currency in the Renaissance.

          Try reading this: https://en.wikipedia.org/wiki/Scientific_method

          Then think about how you might rigorously compare filesystems. It's non-trivial.
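          To make that last point concrete, here is a toy sketch (in Python; the file names, sizes, and manifest format are invented for illustration, not taken from any real test suite) of what one pre-formulated criterion could look like: record a checksum for every file at write time, and define "corruption" strictly as any later digest mismatch.

```python
import hashlib
import json
import os

def write_corpus(root, n_files=8, size=4096, seed=0):
    """Write files of deterministic pseudo-random bytes and record their SHA-256 digests."""
    os.makedirs(root, exist_ok=True)
    manifest = {}
    for i in range(n_files):
        # Derive reproducible content from the seed so a re-run can regenerate it.
        data = hashlib.sha256(f"{seed}:{i}".encode()).digest() * (size // 32)
        path = os.path.join(root, f"file_{i:04d}.bin")
        with open(path, "wb") as f:
            f.write(data)
        manifest[path] = hashlib.sha256(data).hexdigest()
    with open(os.path.join(root, "manifest.json"), "w") as f:
        json.dump(manifest, f)
    return manifest

def verify_corpus(root):
    """Re-hash every file listed in the manifest; return the paths that no longer match."""
    with open(os.path.join(root, "manifest.json")) as f:
        manifest = json.load(f)
    bad = []
    for path, digest in manifest.items():
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != digest:
                bad.append(path)
    return bad
```

          The idea would be to run write_corpus on the filesystem under test, apply whatever stress you are studying (power cycles, heavy rewrites, a scrub), then run verify_corpus: the criterion is fixed before the experiment, so a mismatch is corruption by definition, not by impression. Actually controlling for hardware, kernel version, and workload is the hard part this sketch doesn't touch.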





          • #75
            Originally posted by fitzie View Post

            I resigned to keep all of his little fights out of this forum, but he was the one that wanted to make a big deal about this bug, so we have to tell the truth now about what led his patch not making it into 6.7.10.
            That's a paradox. If you want to resign to keep out of the little fights, then keep your word; no one is forcing you to "tell the truth", which you're just using as an excuse to always jump back in.



            • #76
              Originally posted by Old Grouch View Post

              Where did I say controlled variables? I said 'pre-formulated criteria' which is not the same thing. Looking through a mass/mess of anecdotes with no specific plan isn't going to achieve much.
              You can't really begin to form a plan without knowing there's a problem, and you don't hear about the problem until there's at least one person reporting that problem, and you won't necessarily consider it to be a big problem until many people (relative to the overall size of the userbase) also report that problem.

              But I don't even have to really take anyone else's word for it.

              Early on, when I was getting into Linux and distro-hopping, I was multibooting a lot of distros to try them out. I did the same with filesystems. The ones that used btrfs, jfs, and reiserfs ended up with some serious issues one way or another, so I avoided them. This would have been at least 10 years ago.

              Much later, I tried btrfs again, mainly to try different ways of sharing a disk across a dual-booted Windows/Linux machine. It sort of worked, but something about the combination of the WIP Windows driver and my frequent writes followed by reboots introduced problems. I chalked it up to the driver, which is fine, since it was very much WIP. It seemed to work okay before then.

              A little later, I tried a btrfs native distro, and that seemed to work okay. It made me think, maybe this time? So I did the RAID10 setup, and that was a mistake, probably because of the features I was using.

              I also tried a distro that did btrfs snapshots. Unfortunately, I managed to get it into a state where my snapshots were worthless and would fail when I tried to restore them. At this point I don't know if I'm just inept, unlucky, or what. Many different pieces of hardware and software have been tried at this point.

              Currently, I'm using btrfs in RAID10 without issue. I run a scheduled scrub task. I'm not using autodefrag, and I'm not using compression; I just don't trust them, and every time I had issues, they were enabled.
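              For anyone wanting to replicate the scheduled scrub mentioned above, a minimal sketch using a systemd timer could look like the following (the unit names, the monthly schedule, and the /data mount point are all placeholders, not details from the post):

```ini
# /etc/systemd/system/btrfs-scrub.service  (example unit name)
[Unit]
Description=Scrub the btrfs filesystem mounted at /data

[Service]
Type=oneshot
# -B runs the scrub in the foreground so the service reflects its exit status.
ExecStart=/usr/bin/btrfs scrub start -B /data

# /etc/systemd/system/btrfs-scrub.timer  (example unit name)
[Unit]
Description=Monthly btrfs scrub

[Timer]
OnCalendar=monthly
Persistent=true

[Install]
WantedBy=timers.target
```

              Enable it with `systemctl enable --now btrfs-scrub.timer`; `btrfs scrub status /data` then reports checksum errors found and, on redundant profiles like RAID10, repaired.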

              ----

              Using bcachefs, the only issues I had were related to the kernel and utility upgrades, which were fixed by just not blindly doing dist-upgrade -y, and doing specific mount options to fix it. Sometimes I just needed to grab from master again because the issue was fixed just before.



              • #77
                Originally posted by lyamc View Post

                You can't really begin to form a plan without knowing there's a problem, and you don't hear about the problem until there's at least one person reporting that problem, and you won't necessarily consider it to be a big problem until many people (relative to the overall size of the userbase) also report that problem.

                […]
                OK, how do you know there is a problem? What are the decision criteria that you apply, and how did you choose them? If you think there is a problem, what is your hypothesis, and how do you test it?

                Your story sounds like making random changes for no reason and then wondering why things didn't work.

                Note that if you look at bug reporting systems, people are incentivised to report problems, and not successes, which biases results. In addition, bug reports need triage to remove PEBKAC problems - filtering bug reports to get to actual problems is hard, and not a solved problem in itself. Clueless idiots can report real, actual bugs, and extremely experienced people can be hyper-focused on details without noticing the simple mistakes they have made. Triage is hard.

                It is certainly true that a real bug, e.g. a logic failure, can be an existence proof of a problem and of a failure in a filesystem. Getting to that bug can be hard.

                Given the number of lines of code in a typical filesystem, it is a racing certainty that there will be findable bugs in all of them. So you need a better evaluation than 'Ooh, there's a bug, this filesystem is crap!'.

                It can boil down to opinion: is an off-by-one error in the milliseconds of a file modification date better or worse than an off-by-one error that corrupts the journal in rare circumstances? It's the 'same' bug, but it has different effects. How do you evaluate the difference? All your file creation dates could be wrong - by a millisecond. Or, very occasionally, your journal is unusable. So you need to evaluate how 'serious' the bug is, and opinions will differ.

                Saying that your experience of a filesystem is poor might be telling me that you are a clueless idiot, or that you are using unreliable hardware, or that you have been unlucky and found bugs that affect you badly. Someone else using exactly the same code for different purposes might have a very different view - which is why you need to evaluate the coverage of any testing, including filesystem testing.

                A pile of anecdotes tells you very little about reality. It might inform the development of some hypotheses for further testing, but tells you very little about what is actually going on.

