Linux 5.18 Looks Like It Will Finally Land Btrfs Encoded I/O


  • #41
    Originally posted by coder View Post
    Where did you get that idea? RAID rebuilds only read from the existing drives and only write to the new drive. Rewriting the existing drives would be pretty stupid, if only because it'd more than double rebuild times.
    Because RAID needs to rebuild the parity, and on raid5/6 the parity is spread out over several drives, so when one drive fails and you have to rebuild, the parity will change. Many people only notice the failed drive on the next boot or after a while, and by then they have been running the array in degraded mode, which has further polluted the parity data that now has to be respread among the new set of drives on rebuild.

    Now I could be wrong here, but then why would rebuilding the RAID always risk nuking the working drives if all they did was reads, when it's basically 99.9999% writes that damage a drive? And nuking the existing drives in a raid5/6 recovery is an actual real-world problem that I've encountered myself several times over.
    Last edited by F.Ultra; 12 February 2022, 02:22 PM.

    • #42
      Originally posted by coder View Post
      I'm pretty sure that's not a mdraid thing. I'm guessing you previously used like ext2 on mdraid. It's probably that filesystem which wanted to do a fsck.
      Please stop playing the game that everyone else here is stupid; why go this route? Of course I was not running ext2, and of course it was the raid that was rebuilding, unless you want to somehow claim that the FS can trigger /proc/mdstat to show the raid as resyncing...

      Originally posted by coder View Post
      You can schedule scrubbing whenever you want. And you should still be scrubbing your RAID, even though it's using BTRFS.
      Of course you should, but the mdraid process is incredibly intrusive: save a code change in, say, gedit and perhaps 10s later you can run make. A BTRFS scrub, on the other hand, I don't even notice when it runs. Each time it happened my entire work day was gone, since the resync would take 8-9h on /home and in that time the entire machine was basically useless.

      Originally posted by coder View Post
      That usually only happens to people who don't do regular scrubbing. If you scrub frequently enough, and especially if you use RAID-6, then the risk of an array failure during rebuild is negligible (though it's higher for arrays with more drives).
      Well, from work I only have experience with "more drives", but this is a well-known phenomenon in the enterprise storage community and is why many refuse to run RAID5/6 in the enterprise.

      E.g. it happened to LTT twice on ZFS, though it has to be said that this was mostly down to them forgetting to schedule scrubbing:

      • #43
        Originally posted by F.Ultra View Post
        Because RAID needs to rebuild the parity, and on raid5/6 the parity is spread out over several drives, so when one drive fails and you have to rebuild, the parity will change.
        What a rebuild does is to recompute whatever the replaced drive would've stored, which you can do from any combination of blocks, parity or not. It's like how you can get data off the array after a drive fails or has been removed.
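
        To make that concrete, here is a minimal single-parity (RAID-5-style) reconstruction sketch; the block contents and stripe size are made up purely for illustration. Whichever block of a stripe is missing, data or parity, it is just the XOR of the surviving blocks:

        Code:
        def xor_blocks(blocks):
            # XOR equal-sized byte blocks together.
            out = bytearray(len(blocks[0]))
            for block in blocks:
                for i, b in enumerate(block):
                    out[i] ^= b
            return bytes(out)

        # Toy stripe: three data blocks plus one parity block (4-byte blocks).
        d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
        parity = xor_blocks([d0, d1, d2])

        # The "drive" holding d1 fails; the rebuild recomputes it from the survivors.
        assert xor_blocks([d0, d2, parity]) == d1

        # The same works if the failed drive held this stripe's parity block.
        assert xor_blocks([d0, d1, d2]) == parity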

        Originally posted by F.Ultra View Post
        why would rebuilding the RAID always risk nuking the working drives if all they did was reads
        Because drives are continually experiencing bit rot, mechanical wear, and even electrical wear that makes them more likely to fail a read with age. Doing a rebuild to compute the contents of a new drive reads the entire disk (or partition). If any of those blocks on any of the drives have degraded to the point where they're now unreadable, you'll get a failed read and that drive will also get kicked out of the array. That's why periodic scrubbing is so important, and that's why someone decided to bog down your machine once per month: otherwise, people would blissfully assume their data is safe because it's on a RAID, not knowing that blocks could be silently going bad in ways that will torpedo their next rebuild.

        Now, some hardware RAID controllers I've seen can slow down the scrubbing process (sometimes called "consistency check"), so that it's less likely to interfere with normal usage. Perhaps some can also limit it to when the drive is otherwise idle.

        It should also be noted that modern disks themselves do a periodic self-check, to refresh/relocate any blocks with recoverable errors. If you check the SMART logs of the drive, you can often see this happening at intervals of like 1000 hours.
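
        If you want to peek at that, here is a quick sketch (the device path /dev/sda is just an assumed example, and smartmontools must be installed):

        Code:
        # Dump the drive's SMART self-test log via smartctl (needs root).
        import subprocess

        result = subprocess.run(
            ["smartctl", "-l", "selftest", "/dev/sda"],
            capture_output=True, text=True, check=False)
        print(result.stdout)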

        • #44
          Originally posted by F.Ultra View Post
          Please stop playing the game that everyone else here is stupid; why go this route? Of course I was not running ext2, and of course it was the raid that was rebuilding, unless you want to somehow claim that the FS can trigger /proc/mdstat to show the raid as resyncing...
          Okay, thanks for confirming. I've never encountered that case, but I've seen plenty of (mostly older) filesystems fsck when they're first mounted after going down uncleanly.

          BTW, I did say "I'm pretty sure", which was intended to express a degree of uncertainty.

          mdraid is supposed to let you use an array while it's rebuilding, so I'm a little surprised to hear that yours didn't. Maybe that was due to someone being overly cautious in the init script.

          Originally posted by F.Ultra View Post
          Of course you should, but the mdraid process is incredibly intrusive: save a code change in, say, gedit and perhaps 10s later you can run make. A BTRFS scrub, on the other hand, I don't even notice when it runs. Each time it happened my entire work day was gone, since the resync would take 8-9h on /home and in that time the entire machine was basically useless.
          You can stop it, and then restart it at a time that's more convenient for you:
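
          A minimal sketch of one way to do that through mdraid's sysfs interface (the array name md0 is just an assumed example; this needs root):

          Code:
          from pathlib import Path

          SYNC_ACTION = Path("/sys/block/md0/md/sync_action")

          def pause_check():
              # Writing "idle" aborts a check that is currently running.
              SYNC_ACTION.write_text("idle\n")

          def start_check():
              # Writing "check" starts a scrub pass over the whole array.
              SYNC_ACTION.write_text("check\n")

          # Alternatively, lowering /proc/sys/dev/raid/speed_limit_max throttles
          # the resync instead of stopping it outright.
          print("current action:", SYNC_ACTION.read_text().strip())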


          Originally posted by F.Ultra View Post
          Well, from work I only have experience with "more drives", but this is a well-known phenomenon in the enterprise storage community and is why many refuse to run RAID5/6 in the enterprise.
          I've run RAID-5 and RAID-6 on small arrays at work (4-6 disks), for decades, and not lost an array. You just have to configure hot spares or be prompt about swapping out failed drives. Most of our machines with RAIDs are Dell servers with hardware RAID controllers.

          • #45
            Originally posted by coder View Post
            What a rebuild does is to recompute whatever the replaced drive would've stored, which you can do from any combination of blocks, parity or not. It's like how you can get data off the array after a drive fails or has been removed.
            Yes, and in many cases this happens after the array has been in degraded mode for a while (unless you have a script that shuts everything down the millisecond a failed drive is detected). Any writes done in degraded mode change the parity layout, since they're done over X-1 drives, so when you replace the failed drive this has to be completely redistributed among X drives again. Again, completely nuking the entire array when replacing a failed drive in raid5/6 is a well-known phenomenon in the enterprise world.

            Now it might not happen that easily in a small home setup with only 3 or 4 drives.

            • #46
              Originally posted by F.Ultra View Post
              Yes, and in many cases this happens after the array has been in degraded mode for a while (unless you have a script that shuts everything down the millisecond a failed drive is detected). Any writes done in degraded mode change the parity layout, since they're done over X-1 drives, so when you replace the failed drive this has to be completely redistributed among X drives again.
              I don't see why that should need to be the case, but I can't specifically refute what you're saying. I just think that would be an incredibly bone-headed implementation.

              Originally posted by F.Ultra View Post
              Again, completely nuking the entire array when replacing a failed drive in raid5/6 is a well-known phenomenon in the enterprise world.
              Yes, I already explained how & why array failures happen during rebuilds. It doesn't presume any writes to any drives except the new replacement drive.

              • #47
                Originally posted by coder View Post
                mdraid is supposed to let you use an array while it's rebuilding, so I'm a little surprised to hear that yours didn't. Maybe that was due to someone being overly cautious in the init script.
                Well, it's usable in theory, but in practice everything just takes so much time to do, since everything touches disk nowadays.

                Originally posted by coder View Post
                You can stop it, and then restart it at a time that's more convenient for you:
                I know, but after an unclean shutdown I would rather have it do a complete scrub, just to be sure that everything is OK.

                Originally posted by coder View Post
                I've run RAID-5 and RAID-6 on small arrays at work (4-6 disks), for decades, and not lost an array. You just have to configure hot spares or be prompt about swapping out failed drives. Most of our machines with RAIDs are Dell servers with hardware RAID controllers.
                "hardware" raid is no different; they are basically all mdraid on a SoC with a parity acceleration chip anyway. Spares do not help here, since it's the very nature of the resync that puts so much stress on the drives that you have a high chance of one or more of them failing during that time. Small sets like 4-6 drives are probably not as much of a problem here, but on 20 or 40+ drive arrays the array is so large that a rebuild can take days. This is why most enterprises switched to 1+0 at least a decade ago.

                • #48
                  Originally posted by coder View Post
                  I don't see why that should need to be the case, but I can't specifically refute what you're saying. I just think that would be an incredibly bone-headed implementation.
                  How else do you think that raid5/6 supports being written to in degraded mode? It has to do it this way to keep working and still maintain a running parity at the same time.

                  • #49
                    Originally posted by F.Ultra View Post
                    How else do you think that raid5/6 supports being written to in degraded mode?
                    I'm pretty sure it works by writing the exact same thing to each disk that it would if the array were healthy. There is no need to change the parity layout, as you suggest. Continuing to write the same data to the same places works fine and is simpler. The missing disk can still be reconstructed.
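
                    As a toy single-parity sketch (stripe contents made up for illustration): a write while one member is missing produces exactly the parity a healthy array would have written, the layout stays put, and the missing block is still reconstructible afterwards:

                    Code:
                    def xor_blocks(blocks):
                        # XOR equal-sized byte blocks together.
                        out = bytearray(len(blocks[0]))
                        for block in blocks:
                            for i, b in enumerate(block):
                                out[i] ^= b
                        return bytes(out)

                    d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
                    parity = xor_blocks([d0, d1, d2])

                    # The drive holding d2 fails; we overwrite d0 while degraded.
                    new_d0 = b"XXXX"

                    # Reconstruct the missing block from the survivors...
                    missing_d2 = xor_blocks([d0, d1, parity])

                    # ...then compute parity exactly as a healthy array would.
                    new_parity = xor_blocks([new_d0, d1, missing_d2])

                    # Same slots, same layout; a later rebuild still recovers d2.
                    assert xor_blocks([new_d0, d1, new_parity]) == d2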

                    • #50
                      Originally posted by F.Ultra View Post
                      I know, but after an unclean shutdown I would rather have it do a complete scrub, just to be sure that everything is OK.
                      Oh, I thought you were talking about the regularly-scheduled scrubs.

                      Originally posted by F.Ultra View Post
                      "hardware" raid is no different; they are basically all mdraid on a SoC with a parity acceleration chip anyway.
                      That's not accurate. These machines have RAID controllers with onboard batteries, which will hold any non-committed writes until next boot. Then, you're presented with an option to commit or discard those pending writes.

                      Originally posted by F.Ultra View Post
                      Spares do not help here
                      Hot spares are mainly a solution to the problem of not noticing for some extended period of time that an array is degraded. As long as you don't let much time go by before rebuilding your RAID, they don't meaningfully impact the probability of a rebuild failure.

                      Originally posted by F.Ultra View Post
                      since it's the very nature of the resync that puts so much stress on the drives that you have a high chance of one or more of them failing during that time.
                      I think this is a common misconception. For sure, rebuilds are more stressful than the drive idling, but they're far from the most stressful thing you can do to a drive. That's because the accesses are largely sequential: there's very little seeking, which is what exerts the most stress on a drive. Still, you do see drive temperatures tick up during rebuilds.

                      The misconception part is that the rebuild actually triggers the drive failures. However, what's probably happening is that sectors have already failed and it's just a question of when you're going to discover that fact. Since the rebuild involves doing reads of the entire drive, any bad sectors are going to be detected. Doing regular scrubs is how you can detect these failed sectors on your drives as soon as they happen, so you can replace that drive before the next one fails.

                      Originally posted by F.Ultra View Post
                      Small sets like 4-6 drives are probably not as much of a problem here, but on 20 or 40+ drive arrays the array is so large that a rebuild can take days.
                      I think the main factors in rebuild times are generally the drive capacity and its media transfer rate. I guess you could have a situation where a controller simply can't cope with all drives streaming at full speed, but most of them can. Anyway, it's true that drive capacities are getting to a point where a rebuild could probably take more than a day.

                      The main issue with big arrays is just the probability of drive failures, which stacks up as a function of the number of drives. And the failure probability has to be considered in the light of rebuild times, if you want to predict the likelihood of an array failure.
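
                      As a crude back-of-the-envelope sketch of both points (capacity, throughput, failure rate and array size below are illustrative assumptions, not measurements):

                      Code:
                      capacity_tb = 18       # assumed drive size
                      throughput_mb_s = 200  # assumed sustained transfer rate
                      afr = 0.02             # assumed 2% annualized failure rate per drive
                      survivors = 19         # e.g. a 20-drive array minus the failed one

                      # Lower bound on rebuild time: stream the whole drive once.
                      rebuild_h = capacity_tb * 1e12 / (throughput_mb_s * 1e6) / 3600

                      # Crude chance that another whole drive fails inside that window,
                      # treating failures as independent and spread evenly over a year.
                      p_drive = afr * rebuild_h / (365 * 24)
                      p_any = 1 - (1 - p_drive) ** survivors

                      print(f"rebuild ~{rebuild_h:.0f} h, "
                            f"P(another failure during rebuild) ~{p_any:.2%}")

                      With these made-up numbers that is roughly a day per rebuild, and the second-failure probability scales with both the number of surviving drives and the length of the rebuild window.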

                      Originally posted by F.Ultra View Post
                      This is why most enterprises switched to 1+0 at least a decade ago.
                      Even this isn't the ideal solution, IMO. It's much better to use object stores with replication, especially as you scale up. Using a distributed object store, you could even tolerate the loss of an entire machine.
