XFS Metadata Corruption On Linux 6.3 Tracked Down To One Missing One-Line Patch
- Likes 1
From a QA point of view, I would expect the XFS developers to be asking themselves these questions:
- Is this issue a problem that can be tested before code is ever released to the general public? Can that testing be automated?
- Is this testing simple or complicated? And how long should this testing take to satisfy the subjective standard of "good code quality"?
- How did we miss this issue in our current battery of tests? (Assuming the XFS developers have a battery of tests that would have addressed this issue.)
Telling people that they should run an LTS kernel is not enough to avoid issues like this, even though an LTS kernel might not have had this issue. We all know people who have to live on the edge, take selfies while standing at cliff edges, and run with scissors in both hands. There will always be people who just want to run cutting-edge kernels to play with new features or obtain needed device support, so there is no protecting them from Murphy's Law with "idiot-proof design" (those designs can always be defeated by a better idiot), but critical features need to be thoroughly tested, and metadata is critical to XFS operation.
Yeah, I know many developers are unpaid contributors who program in their spare time, but many key developer roles are held by people at major employers, and in the case of this article that employer is Red Hat. That brings into question the level of support provided by that employer to that developer. In this case I do not think "the optics" look good: a major Linux employer; an employee of that major Linux employer; a major bug in an important Linux module, a filesystem no less (losing people's files & data IS NEVER COOL); and an apparent lack of obvious QC and QA, IMHO.
Granted, the issue was quickly caught after community feedback (see the Bugzilla report), but metadata is key to making XFS work, and corrupting metadata should not have happened. Simply put: if you are not certain of the code quality, don't release the code to the public; test it some more.
In this case the issue traced back to uncertainty over the need for a specific patch, per the Dave Chinner quote in the Phoronix article. So is that a demonstration of a "lack of patch documentation" in the XFS project? Or a "lack of understanding" of what a patch does and what else that patch can impact? Documentation is generally considered the first thing ("we pay other people to do that") that programmers hate to do, with QA/QC testing ("we pay other people to do that") being the second.
Last edited by NotMine999; 30 May 2023, 12:39 AM.
- Likes 2
Comment
Originally posted by Developer12 View Post
Ah, yes, so we can have the filesystem corrupt itself on ALL kernel versions.
(of course it's a joke: I love and use btrfs everywhere I can)
- Likes 1
Comment
Originally posted by NotMine999 View Post
...
Actually writing software is hard.
- Likes 8
Comment
Originally posted by ALRBP View Post
I agree that XFS is more stable than BTRFS, but the only time I experienced corruption with BTRFS it was due to faulty RAM, and it was still easily recoverable without significant data loss.
Comment
Originally posted by NotMine999 View Post
From a QA point of view, I would expect the XFS developers to be asking themselves these questions:
- Is this issue a problem that can be tested before code is ever released to the general public? Can that testing be automated?
- Is this testing simple or complicated? And how long should this testing take to satisfy the subjective standard of "good code quality"?
- How did we miss this issue in our current battery of tests? (Assuming the XFS developers have a battery of tests that would have addressed this issue.)
Yes, the XFS developers do make and maintain a battery of automated tests. Yes, the automated XFS test suite does get run before the general public ends up on the receiving end.
The XFS testing is quite advanced, to the point that many other filesystems on Linux also use large sections of the XFS test suite. But there is always a risk that you don't test for the way the filesystem driver actually fails.
"Lack of patch documentation"/"lack of understanding": neither of these is true. This is a simple case of an oops that happened to produce a hard-to-detect fault.
Do note that the defective kernel fails only some of the time when running through the complete automated XFS test suite.
When the allocator code was rewritten in the 6.3 cycle, this fallback was broken - the old code used args->fsbno as both the allocation target and the allocation result; the new code passes the target as a separate parameter. The conversion didn't handle the aligned->unaligned fallback path correctly - it reset args->fsbno to the target fsbno on failure, which broke allocation failure detection in the high-level code, and so it never fell back to unaligned allocations.
This is a case where they were working on the allocator to fix a particular problem. One line of old code was not correctly removed, and a Pandora's box opened. The problem is that it is a race-condition form of Pandora's box, where particular conditions have to align for the issue to happen. That means automated tests run by many different parties could all pass with no issue displayed, and testing users could also use the filesystem with no issue appearing. This fault simply won the odds: everyone running the test suite and the pre-release kernels saw no defect, so everything appeared to be fine when it was in fact a dud.
Please note that the line of code removed by this patch was correct before the allocator was changed. This is a simple case of a developer missing exactly one line of code out of all the lines that make up the XFS driver, and it just happened to be a line that caused a defect that only appears some of the time.
- Likes 3
Comment
Originally posted by oiaohm View Post...
Comment
Originally posted by ALRBP View Post
I agree that XFS is more stable than BTRFS, but the only time I experienced corruption with BTRFS it was due to faulty RAM, and it was still easily recoverable without significant data loss.
It seems to be in our nature to take sides, but can we agree that the main filesystem options we have on Linux are all great and mature, except of course for very specific cases? I've been using ext4, btrfs, XFS, and even F2FS for several use cases and drives, and they're all pretty good at doing what they're supposed to.
Comment
Originally posted by Developer12 View Post
This smells suspiciously like a "magic fix." Hopefully there's more investigation later to ensure that it's complete and correct.
Comment