XFS Metadata Corruption On Linux 6.3 Tracked Down To One Missing One-Line Patch

  • #11
    Originally posted by andyprough:
    Ahh, I see now, that makes more sense. So they were wondering why the 6.4 kernel did not have the same problem, and found that this line from a completely different bug fix was also unknowingly fixing the XFS metadata corruption problem in 6.3.
    The core XFS developer who suggested applying that specific fix from 6.4 seems to have honed in on it quite quickly once it was clear that this was an XFS issue present only in 6.3 and that there was a reproducer with no tainted kernel involved. That suggests that once the corruption was properly identified and acknowledged, and it was known to be fixed in 6.4, the developer had an epiphany about the possible implications of the (one-line) livelock fix in 6.4 (hindsight, and all that).

    • #12
      From a QA point of view I would expect the XFS developers to be asking themselves these questions:
      • Is this issue a problem that can be tested before code is ever released to the general public? Can that testing be automated?
      • Is this testing simple or complicated? And how long should this testing take to satisfy the subjective level of "good code quality"?
      • How did we miss this issue in our current battery of tests? Assuming the XFS developers have a battery of tests for their code that would have addressed this issue.
      People will develop more and more trust in Linux when developers make serious efforts to stop these "self-inflicted wounds" in Linux. Those efforts involve, in part, rigorous QA testing that goes beyond getting community feedback that generally responds with "yeah it works here" comments and no detail.

      Telling people that they should run an LTS kernel is not enough to avoid issues like this, even though an LTS kernel might not have had this issue. We all know people who have to live on the edge, take selfies while standing at cliff edges, and run with scissors in both hands. There will always be the issue of "people just wanna run cutting-edge kernels to play with new features or obtain needed device support", so there's no protecting them from Murphy's Law with "idiot-proof design", since those designs can always be defeated by a better idiot; but critical features (and metadata is critical to XFS operation) need to be thoroughly tested.

      Yeah, I know many developers are unpaid contributors who program in their spare time, but many key developer roles are held by people from major employers, and in the case of this article it is Red Hat. That brings into question the level of support provided by that employer to that developer. In this case I do not think "the optics" look good... major Linux employer; employee of major Linux employer; major bug in an important Linux module, a filesystem no less (losing people's files & data IS NEVER COOL); and an apparent lack of obvious QC and QA, IMHO.

      Granted, the issue was quickly caught after community feedback (see the Bugzilla report), but metadata is a key feature of making XFS work and screwing up metadata should not have happened. Simply put, if you are not certain of the code quality, then don't release the code to the public; test it some more.

      In this case the issue tracked back to uncertainty over the need for a specific patch, per the Dave Chinner quote in the Phoronix article, so is that a demonstration of "lack of patch documentation" in the XFS project? Or a "lack of understanding" regarding what a patch does, or what else that patch can impact? Documentation is generally considered the first thing ("we pay other people to do that") that programmers hate to do, with QA/QC testing ("we pay other people to do that") being the second.
      Last edited by NotMine999; 30 May 2023, 12:39 AM.

      • #13
        Originally posted by Developer12:

        Ah, yes, so we can have the filesystem corrupt itself on ALL kernel versions.
        At least with BTRFS you don't have a false sense of security, and you tend to back up your data more often 😬

        (of course it's a joke: I love and use btrfs everywhere I can)

        • #14
          Originally posted by NotMine999:
          ...
          Telling people how they should write software is easy.
          Actually writing software is hard.

          • #15
            Originally posted by ALRBP:

            I agree that XFS is more stable than BTRFS, but the only time I experienced corruption with BTRFS was due to faulty RAM, and it was still easily recoverable without significant data loss.
            The only file systems I have ever experienced corruption with are FAT (to be expected) and ext4. This means that my system is on Btrfs and my backups, somewhat unfortunately, on XFS. I was lucky enough to not get corruption from this bug.

            • #16
              Originally posted by NotMine999:
              From a QA point of view I would expect the XFS developers to be asking themselves these questions:
              • Is this issue a problem that can be tested before code is ever released to the general public? Can that testing be automated?
              • Is this testing simple or complicated? And how long should this testing take to satisfy the subjective level of "good code quality"?
              • How did we miss this issue in our current battery of tests? Assuming the XFS developers have a battery of tests for their code that would have addressed this issue.


              Yes, the XFS developers do write and maintain a battery of automated tests, and yes, the automated xfs test suite does get run before the general public ends up on the receiving end.

              The xfs testing is quite advanced, to the point that many other Linux filesystems also use large sections of the xfs test suite. But there is always a risk that you don't test for the way the filesystem driver actually fails.

              "lack of patch documentation"/"lack of understanding" Neither of these are true. This is a simple case of opps that happened to make a hard to detect fault.



              Do note that the defective kernel only fails some of the time when running through the complete automated xfs test suite.



              When the allocator code was rewritten in the 6.3 cycle, this fallback was broken - the old code used args->fsbno as both the allocation target and the allocation result, the new code passes the target as a separate parameter. The conversion didn't handle the aligned->unaligned fallback path correctly - it reset args->fsbno to the target fsbno on failure which broke allocation failure detection in the high level code and so it never fell back to unaligned allocations.
              That is from the patch that fixes the metadata corruption problem, which is also the patch for a completely different bug: "xfs: fix livelock in delayed allocation at ENOSPC".

              This is a case where they were reworking the allocator to fix a particular problem, one line of old code was not removed, and Pandora's box opened. The problem here is a race-condition form of Pandora's box, where particular conditions have to align for the issue to happen. That means automated tests run by many different parties could all pass with no issue showing up, and users testing the filesystem could also see no issue appear. This fault simply won the odds: everyone running the test suite and the pre-release kernels saw no defect, so everything appeared fine when it was in fact a dud.

              Please note that the line of code removed by this patch was correct before the allocator was changed. This is a simple case of a developer missing the removal of exactly one line, out of all the lines of code that make up the XFS driver, and it just happened to be a line that causes a defect that only appears sometimes.
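              To make the mechanism above easier to follow, here is a minimal, hypothetical C sketch of the failure pattern the commit message describes. It is not the real XFS allocator code; names such as alloc_args, NO_BLOCK, try_aligned_alloc() and try_unaligned_alloc() are invented for illustration. Compiling with -DBUGGY keeps the leftover line that clobbers the failure marker, so the caller never takes the unaligned fallback.

              ```c
              #include <stdbool.h>
              #include <stdint.h>
              #include <stdio.h>

              #define NO_BLOCK ((uint64_t)-1)  /* sentinel meaning "no block was allocated" */

              struct alloc_args {
                  uint64_t fsbno;  /* old convention: allocation target on entry, result on exit */
              };

              /* Pretend the aligned allocation cannot be satisfied (e.g. free space is fragmented). */
              static bool try_aligned_alloc(struct alloc_args *args, uint64_t target)
              {
                  (void)target;
                  args->fsbno = NO_BLOCK;  /* correct: signal failure to the high-level code */
              #ifdef BUGGY
                  args->fsbno = target;    /* leftover line: overwrites the failure marker */
              #endif
                  return false;
              }

              /* The unaligned fallback; pretend it always succeeds. */
              static bool try_unaligned_alloc(struct alloc_args *args, uint64_t target)
              {
                  args->fsbno = target;
                  return true;
              }

              int main(void)
              {
                  struct alloc_args args = { .fsbno = NO_BLOCK };
                  uint64_t target = 1234;

                  try_aligned_alloc(&args, target);

                  /* High-level code detects failure by inspecting args.fsbno ... */
                  if (args.fsbno == NO_BLOCK)
                      try_unaligned_alloc(&args, target);  /* fall back to unaligned allocation */
                  else
                      printf("caller believes the aligned allocation succeeded at %llu\n",
                             (unsigned long long)args.fsbno);

                  return 0;
              }
              ```

              With the stray assignment removed, the sketch falls back to the unaligned allocation exactly as the high-level code expects; with it present, the failure is silently masked, which is the kind of defect that only shows up when the aligned path actually fails.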

              • #17
                Originally posted by oiaohm:
                ...

                Please note that the line of code removed by this patch was correct before the allocator was changed. This is a simple case of a developer missing the removal of exactly one line, out of all the lines of code that make up the XFS driver, and it just happened to be a line that causes a defect that only appears sometimes.
                Thank you for this post, a very informative analysis. As a "hobby coder" at best, I found this an interesting look into things; it gave me some insight into what can happen in software engineering/development. Very useful, again much thanks.

                • #18
                  Originally posted by ALRBP:

                  I agree that XFS is more stable than BTRFS, but the only time I experienced corruption with BTRFS was due to faulty RAM, and it was still easily recoverable without significant data loss.
                  Well, the only time I experienced corruption with XFS was with a faulty HDD (a Seagate that failed hours later), and it recovered 98% of the data without significant loss.
                  It seems to be in our nature to take sides, but can we agree that the main filesystem options we have on Linux are all great and mature, except of course for very specific cases? I've been using ext4, Btrfs, XFS, and even F2FS for various cases and drives, and they're all pretty good at doing what they're supposed to.

                  • #19
                    You know what's a "btr fs" (better fs) than XFS? BTRFS!


                    Time to smell the coffee, lads: RAID5 everything and take selfies at the edge of the write-holes!

                    • #20
                      Originally posted by Developer12:
                      This smells suspiciously like a "magic fix." Hopefully there's more investigation later to ensure that it's complete and correct.
                      It would not be the first time a developer missed something when making a change to the code. It happens to all of us at some point.
