Reiser5 File-System Working On New Features Like Data Tiering, Burst Buffers


  • #41
    Originally posted by cynic View Post
    actually, with btrfs it also has a data consistency function and self-healing function in case of data corruption on one disk of the array (i.e. bad sectors or bitrot)
    That's still not a backup.
    Backups mainly protect from human error (someone deletes or modifies something by mistake), and also protect from catastrophic damage to the computer, from internal events (e.g. the PSU blows up and kills all drives in one fell swoop; yes, it can do that, so please don't use shit-grade PSUs, people, it's bad) to external events (something destroys the entire building or room: fire, alien attack, third world war, angry ex-girlfriend with a hammer).

    Normal backup methods don't usually protect from data corruption, as they aren't checksumming the data and checking timestamps, and you can easily overwrite your (good) backups with corrupted files from the working system, only realizing they are corrupted years later, when all backups have been overwritten and you have no chance of getting them back.
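    A minimal sketch of what that kind of checksum verification could look like, in Python (the manifest format and paths here are made up for illustration, not any particular backup tool's):

    ```python
    import hashlib
    import json
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        """Stream a file through SHA-256 so big files don't need to fit in RAM."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify_against_manifest(root: Path, manifest: Path) -> list[str]:
        """Return the files whose current checksum no longer matches the manifest.

        Any mismatch means the working copy was corrupted (or modified), so it
        must NOT be allowed to overwrite the older, still-good backup generation.
        """
        expected = json.loads(manifest.read_text())  # {"relative/path": "hexdigest"}
        return [rel for rel, digest in expected.items()
                if sha256_of(root / rel) != digest]

    # Hypothetical usage: refuse to rotate backups if anything mismatches.
    # bad = verify_against_manifest(Path("/data/photos"), Path("photos.manifest.json"))
    # if bad: raise SystemExit(f"corruption detected, not rotating backups: {bad}")
    ```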



    • #42
      Yep. This is why I only use filesystems that do data checksums (btrfs, ZFS). I've had too much data loss in my photo archives over the years due to bit rot. As you say, you need to carry the checksums through to your incremental backups to have a sane and reliable backup.





      • #43
        Originally posted by starshipeleven View Post
        A RAID5 system that sees a parity error will rebuild the data using the parity information. You don't need to lose a drive.

        The point is that a RAID5 does not check the checksums on read, and the rebuild process is long and slow as it has to rescan the entire array.

        So while this would be able to deal with data corruption too, it sucks in a very big way compared to filesystem-level RAID, which checks the checksums on read and can self-heal in real time, as it knows where the data is.
        A RAID5 system that gets a hard read failure on one drive can rebuild based on the other drives.

        But a RAID5 that does get a parity error without seeing a hard read error on one of the drives will not know which of the drives has incorrect data. It can recompute the parity, but for a 5-drive system there is then an 80% chance that it was the parity that was correct and one of the data drives that had the soft error.

        That's why RAID-5 needs help with data integrity. It's also why so many people lose data with RAID-5: when they finally get a read error and try to swap a drive, there are already more read errors that haven't been noticed. People then often blame it on multiple drives bought at the same time failing at the same time, but quite likely the failures have been slowly cropping up for quite some time without any data scrub performed to notice them.

        So ZFS, Btrfs, or SnapRAID, or some other means of integrity checking, is really needed when we get to the really big drives; otherwise people are going to take backups of already-corrupt data without knowing about it.
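        To make the 80% figure concrete, here's a toy Python sketch of a 5-drive RAID5 stripe (single bytes standing in for whole blocks); the scrub sees the mismatch but has no way to localize it:

        ```python
        from functools import reduce
        from operator import xor

        # One stripe across five "drives": four data blocks plus XOR parity.
        data = [0x11, 0x22, 0x33, 0x44]
        parity = reduce(xor, data)          # what RAID5 stores on the fifth drive

        data[2] ^= 0x01                     # a soft error silently flips one bit

        # A scrub recomputes the parity and sees a mismatch...
        assert reduce(xor, data) != parity

        # ...but plain RAID5 keeps no per-block checksum, so the mismatch alone
        # can't say WHICH of the five drives is wrong. With the error equally
        # likely on any of the five drives, the bad block sits on a data drive
        # 4 times out of 5, so blindly recomputing the parity from the data
        # drives just bakes the corruption in 80% of the time.
        ```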



        • #44
          Originally posted by zyxxel View Post
          But a RAID5 that does get a parity error without seeing a hard read error on one of the drives will not know which of the drives has incorrect data. It can recompute the parity, but for a 5-drive system there is then an 80% chance that it was the parity that was correct and one of the data drives that had the soft error.

          That's why RAID-5 needs help with the data integrity.
          Hm ok, pure RAID5 needs help, but what is actually done in practice when you select RAID5 on some devices may differ.

          I was thinking of hardware RAID cards or higher-end storage appliances. I know high-end appliances either use disks with a 520-byte sector size to store the "magic 8 bytes" (aka various checksumming), as EMC VNX does https://www.dataanalyzers.com/servic...-vnx-recovery/
          or use one block every 9 blocks for the checksums, like NetApp https://library.netapp.com/ecmdocs/E...BA71ABBC6.html

          Disk formats supported by Data ONTAP
          The disk format determines how much of the disk’s raw capacity can be used for data storage. Some disk formats cannot be combined in the same aggregate.
          Most disks used in storage systems are block checksum disks (BCS disks).
          The amount of space available for data depends on the bytes per sector (bps) of the disk:
          • Disks that use 520 bps provide 512 bytes per sector for data. 8 bytes per sector are used for the checksum.
          • Disks that use 512 bps use some sectors for data and others for checksums. For every 9 sectors, 1 sector is used for the checksum, and 8 sectors are available for data.
          The disk formats by Data ONTAP disk type are as follows:
          • FCAL and SAS BCS disks use 520 bps.
          • ATA, SATA, and BSAS BCS disks use 512 bps.
          • SSD BCS disks use 512 bps.
          If you have an older storage system, it might have zoned checksum disks (ZCS disks). In ZCS disks, for every 64 (4,096 byte) blocks, one block is used for the checksum, and 63 blocks are available for data. There are rules about combining BCS disks and ZCS disks in the same aggregate.

          I just don't know if any RAID card does something like that (probably not). AFAIK mdadm, aka Linux software RAID, does not, and you can only get block-level checksumming with a newish DM target called dm-integrity, but it's kind of clunky (as mdadm also is).
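          The space cost of the two layouts quoted above, worked out in a few lines of Python (plain arithmetic from the ONTAP description, not any vendor tool):

          ```python
          DATA_BYTES = 512  # data payload per sector

          def usable_520bps(total_sectors: int) -> int:
              """520-byte sectors: every sector carries 512 data bytes, and the
              8-byte checksum hides in the oversized sector, costing no sectors."""
              return total_sectors * DATA_BYTES

          def usable_512bps_bcs(total_sectors: int) -> int:
              """512-byte sectors: of every 9 sectors, 1 holds checksums and 8 hold
              data, so roughly 11% of the raw sectors are given up."""
              return (total_sectors // 9) * 8 * DATA_BYTES

          print(usable_520bps(9))      # 4608 bytes of data from 9 sectors
          print(usable_512bps_bcs(9))  # 4096 bytes of data from 9 sectors
          ```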



          • #45
            Originally posted by starshipeleven View Post
            Hm ok, pure RAID5 needs help, but what is actually done in practice when you select RAID5 on some devices may differ.

            I was thinking of hardware RAID cards or higher-end storage appliances. I know high-end appliances either use disks with a 520-byte sector size to store the "magic 8 bytes" (aka various checksumming), as EMC VNX does https://www.dataanalyzers.com/servic...-vnx-recovery/
            or use one block every 9 blocks for the checksums, like NetApp https://library.netapp.com/ecmdocs/E...BA71ABBC6.html

            Disk formats supported by Data ONTAP
            The disk format determines how much of the disk’s raw capacity can be used for data storage. Some disk formats cannot be combined in the same aggregate.
            Most disks used in storage systems are block checksum disks (BCS disks).
            The amount of space available for data depends on the bytes per sector (bps) of the disk:
            • Disks that use 520 bps provide 512 bytes per sector for data. 8 bytes per sector are used for the checksum.
            • Disks that use 512 bps use some sectors for data and others for checksums. For every 9 sectors, 1 sector is used for the checksum, and 8 sectors are available for data.
            The disk formats by Data ONTAP disk type are as follows:
            • FCAL and SAS BCS disks use 520 bps.
            • ATA, SATA, and BSAS BCS disks use 512 bps.
            • SSD BCS disks use 512 bps.
            If you have an older storage system, it might have zoned checksum disks (ZCS disks). In ZCS disks, for every 64 (4,096 byte) blocks, one block is used for the checksum, and 63 blocks are available for data. There are rules about combining BCS disks and ZCS disks in the same aggregate.

            I just don't know if any RAID card does something like that (probably not). AFAIK mdadm, aka Linux software RAID, does not, and you can only get block-level checksumming with a newish DM target called dm-integrity, but it's kind of clunky (as mdadm also is).
            Every single disk in existence has a physical sector larger than what is announced - there is always a number of bytes for ECC (Error Correcting Code).

            This can detect and repair smaller bit errors, and can detect (but not repair) almost all larger read errors. But it isn't perfect, which is why there is always a specification in the datasheets stating the probability of non-recoverable errors.

            But it's also possible to use disks where the ECC area is made available to the software itself, instead of just being used by the disk firmware.

            But besides recovery/detection using ECC, there are silent errors that may happen during the write - the transfer chain (software in PC, RAM, transfer over disk cable, ...) accidentally corrupts a bit of the data before writing to the sector.

            The hardware RAID cards can't do anything that software RAID can't do when it comes to recovery from bit errors. The main advantage of a high-end hardware RAID card is that it supports battery backup. So the OS gets a sign-off from the RAID card: "all data received and synchronized". Then you may lose power, and next time you power up, the RAID card can double-check with each individual disk that the battery-backed data actually got written to each disk. This is otherwise known as the "write hole": your software RAID prepares the data to update 3 disks in a RAID-5, but only 2 of the 3 get the data. Now you have an inconsistency in the system, and no way to tell which of the three disks is missing the last write in case you want to rebuild and/or validate the parity.

            In a system where the file system itself has checksums and you have a small error, you can make one guess per disk and see if recovery using the other disks results in the file (or, normally, file block) getting a correct checksum. If the file system doesn't do it, then you really want a storage system that uses additional sectors for checksums - but those basically need battery-backed controller cards to avoid inconsistencies where not all sectors got updated on one or more disks.
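            That guess-per-disk recovery is easy to sketch in Python; here XOR stands in for RAID5 parity and zlib.crc32 stands in for whatever checksum the file system actually uses:

            ```python
            import zlib
            from functools import reduce
            from operator import xor

            def xor_blocks(blocks):
                """XOR equally sized byte strings (RAID5 parity/reconstruction)."""
                return bytes(reduce(xor, col) for col in zip(*blocks))

            def recover_stripe(stripe, expected_crc):
                """stripe = [d0, d1, d2, d3, parity]; one block may be silently bad.

                Assume each drive in turn is the bad one, rebuild it from the
                other four, and keep the candidate whose data checksums correctly.
                """
                for bad in range(len(stripe)):
                    rebuilt = xor_blocks(stripe[:bad] + stripe[bad + 1:])
                    candidate = stripe[:bad] + [rebuilt] + stripe[bad + 1:]
                    if zlib.crc32(b"".join(candidate[:-1])) == expected_crc:
                        return candidate[:-1]      # data blocks, repaired
                return None                        # more than one bad block

            data_blocks = [bytes([i]) * 4 for i in range(4)]
            stripe = data_blocks + [xor_blocks(data_blocks)]
            crc = zlib.crc32(b"".join(data_blocks))
            stripe[1] = b"\xff" + stripe[1][1:]    # silent corruption on drive 1

            assert b"".join(recover_stripe(stripe, crc)) == b"".join(data_blocks)
            ```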



            • #46
              Originally posted by zyxxel View Post
              Every single disk in existence has a physical sector larger than what is announced
              No, it's not the same thing.
              I'm talking about the RAID engine, either in software or in the card, that needs the drives to advertise a bigger physical sector because it will use it to store a block-level checksum and more stuff.

              I know about this because it's possible to buy older used SAS drives that may come from a NetApp or EMC VNX appliance and were using 520-byte sectors; you need some command-line tools to get the drive to do a low-level format back to 512-byte sectors, which is the standard for other systems.
              Been there, done that. In SAS drives it is a thing.

              But besides recovery/detection using ECC, there are silent errors that may happen during the write - the transfer chain (software in PC, RAM, transfer over disk cable, ...) accidentally corrupts a bit of the data before writing to the sector.
              The abovementioned systems are as protected as btrfs/ZFS; they just do their checksumming at the block level instead of the filesystem level.

              btrfs/ZFS won't save you from RAM issues either, you need ECC RAM in either case.

              The hardware RAID cards can't do anything that software RAID can't do when it comes to recovery from bit errors. The main advantage of a high-end hardware RAID card is that it supports battery backup. So the OS gets a sign-off from the RAID card: "all data received and synchronized". Then you may lose power, and next time you power up, the RAID card can double-check with each individual disk that the battery-backed data actually got written to each disk. This is otherwise known as the "write hole": your software RAID prepares the data to update 3 disks in a RAID-5, but only 2 of the 3 get the data. Now you have an inconsistency in the system, and no way to tell which of the three disks is missing the last write in case you want to rebuild and/or validate the parity.
              This is irrelevant if the system has a UPS and isn't running a shitshow OS; then again, a battery backup on a RAID card is much cheaper than a server-grade UPS.

              In a system where the file system itself has checksums and you have a small error, you can make one guess per disk and see if recovery using the other disks results in the file (or, normally, file block) getting a correct checksum. If the file system doesn't do it, then you really want a storage system that uses additional sectors for checksums - but those basically need battery-backed controller cards to avoid inconsistencies where not all sectors got updated on one or more disks.
              Not really; the EMC VNX uses those 8 bytes to also store timestamps for various activities (from the other article I linked), and I suspect NetApp does the same: "Additional unique aspects are that these arrays do not use standard disk sector formatting. They use what is referred to as the Magic 8 bytes, which produces a 520 byte sector format, which includes 8 bytes of metadata in addition to the 512 bytes of traditional data. These additional bytes include a linear check sum, write stamp, parity shed stamp and time stamp."

              This would allow detecting which blocks were not updated yet, among other things.
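              A toy Python version of such a 520-byte sector, with the metadata cut down to a checksum plus a write stamp (the field layout is my guess for illustration; the real VNX/NetApp formats pack more in):

              ```python
              import struct
              import zlib

              # Toy 520-byte sector: 512 data bytes + 8 metadata bytes
              # (4-byte CRC + 4-byte write stamp; the real arrays also squeeze
              # in parity shed stamps and time stamps).
              META = struct.Struct("<II")

              def write_sector(data: bytes, stamp: int) -> bytes:
                  assert len(data) == 512
                  return data + META.pack(zlib.crc32(data), stamp)

              def check_sector(sector: bytes, expected_stamp: int) -> str:
                  data, (crc, stamp) = sector[:512], META.unpack(sector[512:])
                  if zlib.crc32(data) != crc:
                      return "corrupt"   # bit rot, torn or misdirected write
                  if stamp != expected_stamp:
                      return "stale"     # sector missed the latest stripe update
                  return "ok"

              old = write_sector(b"a" * 512, stamp=41)
              new = write_sector(b"b" * 512, stamp=42)
              print(check_sector(new, 42))  # ok
              print(check_sector(old, 42))  # stale: this drive missed the write
              ```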



              • #47
                Originally posted by starshipeleven View Post
                No, it's not the same thing.
                I'm talking about the RAID engine, either in software or in the card, that needs the drives to advertise a bigger physical sector because it will use it to store a block-level checksum and more stuff.

                I know about this because it's possible to buy older used SAS drives that may come from a NetApp or EMC VNX appliance and were using 520-byte sectors; you need some command-line tools to get the drive to do a low-level format back to 512-byte sectors, which is the standard for other systems.
                Been there, done that. In SAS drives it is a thing.
                See two paragraphs later in my post where I wrote:

                "But it's also possible to use disks where ECC area is available for the software itself, instead of just used by the disk firmware."

                This is what you get with disks that announce 520-byte sectors.

                The abovementioned systems are as protected as btrfs/ZFS; they just do their checksumming at the block level instead of the filesystem level.

                btrfs/ZFS won't save you from RAM issues either, you need ECC RAM in either case.


                This is irrelevant if the system has a UPS and isn't running a shitshow OS; then again, a battery backup on a RAID card is much cheaper than a server-grade UPS.
                No, the battery-backed disk controller cards really do have advantages compared to just having a UPS, because they introduce transactions over multiple disks. It doesn't matter how good an OS you have - the OS itself can only access disks one by one and can't force a write to commit to all disks. But a battery-backed RAID card can work as a man-in-the-middle and provide this functionality.

                All high-end systems are running with a UPS - there wouldn't be a need for battery backup in the RAID controller cards if a good OS + UPS were enough. But a good OS only manages write barriers for a single disk interface at a time.


                Not really; the EMC VNX uses those 8 bytes to also store timestamps for various activities (from the other article I linked), and I suspect NetApp does the same: "Additional unique aspects are that these arrays do not use standard disk sector formatting. They use what is referred to as the Magic 8 bytes, which produces a 520 byte sector format, which includes 8 bytes of metadata in addition to the 512 bytes of traditional data. These additional bytes include a linear check sum, write stamp, parity shed stamp and time stamp."

                This would allow detecting which blocks were not updated yet, among other things.
                It's common to store transaction numbers when writing data - if you look at an mdadm disk mirror, you can see that in that implementation the sequence number is in the partition header. So when the machine boots, it's happy if both headers show the same sequence number. Otherwise, the mirror will auto-synchronize from the drive with the higher sequence number.

                And yes - similar concepts are used when doing block-level checksumming - it doesn't matter if it's handled by the hardware or if it's done by software. Sequence numbers are a way to implement transactions over multiple devices.
                But if you don't have battery backup to cache a multi-device write, then the software must instead perform a two-step write: first to a journal, and then, in the second step, to the live data area. If it doesn't, the sequence number will only be good enough to catch a half write, but not enough to figure out how to roll back to the original state - unless the system was lucky enough that the write went through on enough drives that it's possible to recompute the missing data for the remaining device(s).
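                A sketch of that two-step write in Python, with dicts standing in for the journal and live areas of each disk (all names made up for illustration):

                ```python
                class Device:
                    """Stand-in for one disk: a journal area plus a live-data area."""
                    def __init__(self):
                        self.journal = {}   # seq -> block
                        self.live = {}      # addr -> block

                def write_stripe(devices, seq, blocks, addr, crash_at=None):
                    # Step 1: persist the whole stripe to every device's journal.
                    for i, (dev, blk) in enumerate(zip(devices, blocks)):
                        if crash_at == ("journal", i):
                            return                      # power lost mid-journal
                        dev.journal[seq] = blk
                    # Step 2: copy the journaled blocks to the live area.
                    for i, (dev, blk) in enumerate(zip(devices, blocks)):
                        if crash_at == ("live", i):
                            return                      # power lost mid-commit
                        dev.live[addr] = blk

                def recover(devices, seq, addr):
                    if all(seq in dev.journal for dev in devices):
                        for dev in devices:             # journal complete: replay it
                            dev.live[addr] = dev.journal[seq]
                    else:
                        for dev in devices:             # journal incomplete: discard,
                            dev.journal.pop(seq, None)  # old live data stays consistent

                devs = [Device() for _ in range(3)]
                write_stripe(devs, seq=1, blocks=[b"A", b"B", b"P"], addr=0)
                write_stripe(devs, seq=2, blocks=[b"X", b"Y", b"Q"], addr=0,
                             crash_at=("live", 1))     # only 1 of 3 live writes landed
                recover(devs, seq=2, addr=0)
                print([d.live[0] for d in devs])       # [b'X', b'Y', b'Q'] - consistent
                ```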



                • #48
                  Partition my wife

