It Turns Out The Btrfs RAID 5/6 Issue Isn't Completely Fixed
Earlier this week we reported that the Btrfs RAID5/RAID6 code had been fixed, or at least it appeared to be. Now, however, the Btrfs developers have clarified that the situation isn't entirely resolved.
This all stems from the discovery months ago that the Btrfs RAID 5/6 code was unsafe. Btrfs contributor Zygo Blaxell wrote to clarify the situation:
...with headlines like "btrfs RAID5/RAID6 support is finally fixed" when that's very much not the case. Only one bug has been removed for the key use case that makes RAID5 interesting, and it's just the first of many that still remain in the path of a user trying to recover from a normal disk failure.
So pardon the confusion: the RAID5/6 code has improved, but not all is dandy.
Admittedly this is Michael's (Phoronix's) problem more than Qu's, but it's important to always be clear and _complete_ when stating bug status because people quote statements out of context. When the article quoted the text
"it's not a timed bomb buried deeply into the RAID5/6 code, but a race condition in scrub recovery code"
the commenters on Phoronix are clearly interpreting this to mean "famous RAID5/6 scrub error" had been fixed *and* the issue reported by Goffredo was the time bomb issue. It's more accurate to say something like
"Goffredo's issue is not the time bomb buried deeply in the RAID5/6 code, but a separate issue caused by a race condition in scrub recovery code"
Reading the Phoronix article, one might imagine RAID5 is now working as well as RAID1 on btrfs. To be clear, it's not--although the gap is now significantly narrower.
He also commented, "There are multiple bugs in the stress + remove device case. Some are quite easy to isolate. They range in difficulty from simple BUG_ON instead of error returns to finally solving the RMW update problem... To be able to use a RAID5 in production it must be possible to recover from one normal disk failure without being stopped by *any* bug in most cases. Until that happens, users should be aware that recovery doesn't work yet."
So for now it's probably best not to use Btrfs' native RAID 5/6 code in production.