**movieman** · 05 November 2010, 03:46 PM

Originally posted by misiu_mp

Say I write data to the FS but don't flush it to disk. The RAM is corrupted, and when the fsync comes, something else flows out of my RAM and into the storage controller.

Silent corruption is fairly hard unless you're working with plain ASCII text. If you're using a document format like OpenOffice, which essentially packs everything in a .zip file, then any bit error is likely to make the file unreadable. Similarly, a database has a lot of complex data structures which are going to cause serious errors if a bit gets corrupted somewhere in there, and many store the data in a format which is likely to cause errors if a bit gets corrupted in the data too.
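
As a rough illustration of why a packed format tends to fail loudly, the sketch below builds a small .zip in memory, flips one bit in the stored payload, and lets the CRC-32 check in Python's `zipfile` flag it. The member name `doc.txt`, the payload, and the helper names are made up for the example, and `ZIP_STORED` is used only so the payload bytes are easy to locate; real documents are typically compressed as well.

```python
import io
import zipfile


def make_zip(payload: bytes) -> bytes:
    """Build a one-member archive in memory (uncompressed, for simplicity)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("doc.txt", payload)  # hypothetical member name
    return buf.getvalue()


def flip_bit(data: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of data with a single bit inverted."""
    out = bytearray(data)
    out[byte_index] ^= 1 << bit
    return bytes(out)


def first_bad_member(archive: bytes):
    """Return the name of the first member failing its CRC check, or None."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.testzip()


payload = b"hello silent corruption " * 4
good = make_zip(payload)
# Flip one bit inside the stored payload, leaving headers untouched.
bad = flip_bit(good, good.find(payload) + 5)

print(first_bad_member(good))  # None
print(first_bad_member(bad))   # doc.txt
```

The error is not silent here only because the container carries its own checksum; a plain text file with the same bit flip would still open without complaint.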

**Jimbo** · 05 November 2010, 03:58 PM

Originally posted by misiu_mp

Say I write data to the FS but don't flush it to disk. The RAM is corrupted, and when the fsync comes, something else flows out of my RAM and into the storage controller.
What CRC check is it that will signal this? ...

Really, please stop assuming that silent errors are common. If you don't know how CRC, parity, etc. work, please stop guessing that silent errors happen and how this works; accept that you don't know how it works.

Memory uses parity checks, and ECC memory is an improved version of parity checking. If your memory changes because of "high energy particles in the air trying to change them", your memory controller detects it.

Hitting a silent error is really *very* hard; bad RAID 5 controllers seem to be one rare case. Some engineers believe that silent errors are more myth than fact.

How many times have you seen a corrupted file or file system? A few, I guess. But how many times have you, your friends, or the people at your work experienced silent corruption?

**misiu_mp** · 05 November 2010, 05:07 PM

Originally posted by Jimbo

Really, please stop assuming that silent errors are common. If you don't know how CRC, parity, etc. work, please stop guessing that silent errors happen and how this works; accept that you don't know how it works.

Memory uses parity checks, and ECC memory is an improved version of parity checking. If your memory changes because of "high energy particles in the air trying to change them", your memory controller detects it.

Hitting a silent error is really *very* hard; bad RAID 5 controllers seem to be one rare case. Some engineers believe that silent errors are more myth than fact.

How many times have you seen a corrupted file or file system? A few, I guess. But how many times have you, your friends, or the people at your work experienced silent corruption?

The problem with silent corruption is that it is _silent_, and thus more difficult to discover, so it would be difficult to see it in everyday life.

I don't say it is common to get space-radiation induced memory corruption, but it could happen. Also, electrostatic-induced corruption is much more common. And there is most likely no CRC to cover that. You may only hope that if you fried your chips, you fried them well enough for it to show quickly.

I know very well what parity checks and CRCs are and how they work.
The problem is that there is no automatic CRC done on memory, and parity checking requires parity memory, which, although not the same as ECC memory, is equally unpopular amongst ordinary users. We have to live with no memory error detection below the application/kernel level whatsoever.
Oh, there is memtest86, but by then it is too late for our bits.

And you know what? That's fine. When it comes to storage, it's good enough for home users and it's good enough for enterprises like Google, for whom data integrity is not an absolute requirement and who most likely use ECC memory anyway, which largely reduces the probability of errors.

Everyone else, worrying about the electrons as they leave the RAM and flow through the wires and registers and caches of the motherboard (all possibly corrupted and exposed to the elements), to the writing heads of the disk, can use ZFS or anything else that does checksumming of everything it writes to disk.
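
A minimal sketch of what "checksumming everything it writes" means in practice, assuming a toy in-memory block store. `ChecksummedStore` is hypothetical, and CRC-32 is used only because it ships with Python; this is not ZFS's on-disk format or its checksum algorithm.

```python
import zlib


class ChecksummedStore:
    """Toy block store that keeps a CRC-32 next to every block it writes."""

    def __init__(self):
        self.blocks = {}

    def write(self, key, data: bytes) -> None:
        # The checksum is taken at write time and stored with the data.
        self.blocks[key] = (bytearray(data), zlib.crc32(data))

    def read(self, key) -> bytes:
        data, crc = self.blocks[key]
        # Every read re-computes the checksum and compares it.
        if zlib.crc32(bytes(data)) != crc:
            raise OSError(f"checksum mismatch in block {key!r}")
        return bytes(data)


store = ChecksummedStore()
store.write("a", b"1234567890")
print(store.read("a"))  # a clean read passes the check

# Simulate a bit error after the checksum was taken:
# '1' (0x31) becomes '2' (0x32).
store.blocks["a"][0][0] ^= 0x03
try:
    store.read("a")
except OSError as err:
    print("detected:", err)
```

The check only covers what happens between the checksum being computed and the data being read back; anything that goes wrong earlier is outside its reach.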

**misiu_mp** · 05 November 2010, 05:14 PM

Btw, most of the errors discovered by memory checkers such as memtest86 are actually silent errors. They do usually cause crashes and hangs, because they change in-memory data structures and code in random ways. But if they don't affect any vital part of the memory, they might just as well be silently corrupting your ISOs as you download or copy them.
Would you trust the data written by a system that has been shown to have bad RAM?

**Jimbo** · 05 November 2010, 06:00 PM

Originally posted by misiu_mp

I don't say it is common to get space-radiation induced memory corruption, but it could happen. Also, electrostatic-induced corruption is much more common. And there is most likely no CRC to cover that. You may only hope that if you fried your chips, you fried them well enough for it to show quickly.

I don't agree. When an error appears in a 64-bit word, usually only one bit is erroneous (the BER of today's memory is very low) and ECC can correct this case; when there are 2 erroneous bits in a 64-bit word (extremely unusual), ECC can detect it. So your data is not compromised if you use ECC memory. Otherwise your data may be compromised, and no FS can help, because your memory is telling the FS to write this data.
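
The correct-one, detect-two behaviour described above can be sketched with a toy (8,4) extended Hamming code. This is an illustration only: the function names are made up, and real ECC modules use a much wider code over 64 data bits.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits; index 0 is overall parity, 1..7 are Hamming positions."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions with bit 2 set
    code[0] = sum(code[1:]) % 2            # overall parity for double-error detection
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, nibble); nibble is None if the word is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                # XOR of set positions is 0 for a valid word
    odd_flips = sum(code) % 2 == 1
    if syndrome == 0 and not odd_flips:
        status = "ok"
    elif odd_flips:
        code[syndrome] ^= 1                # assume a single flip and undo it
        status = "corrected"
    else:
        return "uncorrectable", None       # even number of flips: detect only
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


def flip(code: List[int], *positions: int) -> List[int]:
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for p in positions:
        out[p] ^= 1
    return out


word = encode(0b1011)
print(decode(word))                 # ('ok', 11)
print(decode(flip(word, 5)))        # ('corrected', 11)
print(decode(flip(word, 3, 5)))     # ('uncorrectable', None)
print(decode(flip(word, 3, 5, 6)))  # ('corrected', 12)
```

The last line shows the limit of such a code: three flipped bits look like a single error, so the decoder "corrects" the word into a different valid value without raising anything.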

Originally posted by misiu_mp

Btw, most of the errors discovered by memory checkers such as memtest86 are actually silent errors. They do usually cause crashes and hangs, because they change in-memory data structures and code in random ways. But if they don't affect any vital part of the memory, they might just as well be silently corrupting your ISOs as you download or copy them.
Would you trust the data written by a system that has been shown to have bad RAM?

If the system is so fucked up, no FS can help to store data securely. ECC memory is a must to avoid those nasty situations (which are not frequent anyway), because errors are detected, and the system (at least Linux does) counts them and warns the user that memory is corrupted.

**misiu_mp** · 05 November 2010, 07:28 PM

Originally posted by Jimbo

I don't agree. When an error appears in a 64-bit word, usually only one bit is erroneous (the BER of today's memory is very low) and ECC can correct this case; when there are 2 erroneous bits in a 64-bit word (extremely unusual), ECC can detect it. So your data is not compromised if you use ECC memory. Otherwise your data may be compromised, and no FS can help, because your memory is telling the FS to write this data.

How is it that you don't agree with me but I do agree with you? Must be the age difference.
I said 'most likely' because home PCs generally do not use ECC or parity memory, nor do they CRC the data written.

Originally posted by Jimbo

If the system is so fucked up, no FS can help to store data securely. ECC memory is a must to avoid those nasty situations (which are not frequent anyway), because errors are detected, and the system (at least Linux does) counts them and warns the user that memory is corrupted.

Again, you are right. Bad RAM equals a ruined system. ECC will almost always save you. The thing is, when it happens undetected (e.g. when not having ECC or when 3 bits are changed), a checksumming file system might help discover it sooner, before you lose more data and work.

Btw, damaging memory (RAM, disk cache) is neither difficult nor unlikely if you handle memory modules or HDDs without taking the necessary precautions, and there is a high risk even if you do take precautions (because you might not have taken precautions in taking precautions, e.g. paint on the heating element might prevent the wrist strap from working). It's a good idea to run memtest after handling/touching memory.

**Jimbo** · 06 November 2010, 02:10 AM

Originally posted by misiu_mp

How is it that you don't agree with me but I do agree with you? Must be the age difference.
I said 'most likely' because home PCs generally do not use ECC or parity memory, nor do they CRC the data written.

Again, you are right. Bad RAM equals a ruined system. ECC will almost always save you. The thing is, when it happens undetected (e.g. when not having ECC or when 3 bits are changed), a checksumming file system might help discover it sooner, before you lose more data and work.

Btw, damaging memory (RAM, disk cache) is neither difficult nor unlikely if you handle memory modules or HDDs without taking the necessary precautions, and there is a high risk even if you do take precautions (because you might not have taken precautions in taking precautions, e.g. paint on the heating element might prevent the wrist strap from working). It's a good idea to run memtest after handling/touching memory.

Better not to speak about ages. If you want to believe you are older, go ahead! I have no problem believing that I am young.

ZFS is a superior and more robust FS; it detects errors better than other filesystems.

If you have corrupted memory, neither ZFS nor any other FS can guarantee your data.

If you are using ext4, ext3... and write 1234567890 to your disk, you don't get 2234567890 back because of read errors. When your data is corrupted you get read errors; computers don't tolerate errors, because there are checks all the time. Errors happen, noise exists, and so does electromagnetic interference. When this happens, the error is either corrected by convolutional codes or detected by checks. It is *very* hard to hit silent errors. As I said, CERN found them because of a firmware malfunction on some RAID 5 controllers, and here ZFS proved its value. But there are not many more cases of silent errors.

So trying to confuse others by saying that you must be protected because your data is changed on your disk all the time due to errors is basically not true. Critical data servers have worked and continue to work with ext4, ext3... filesystems without problems.

**misiu_mp** · 06 November 2010, 07:07 AM

Originally posted by Jimbo

Better not to speak about ages. If you want to believe you are older, go ahead! I have no problem believing that I am young.

I never said I'm older. Your juvenile behaviour could stem from other reasons.

Originally posted by Jimbo

If you are using ext4, ext3... and write 1234567890 to your disk, you don't get 2234567890 back because of read errors,

Yes you might, if you fried your memory. It does not always mean your system crashes. It is compromised, but may appear correct. You might not notice it. And forget about ECC. It's not used in homes.

Originally posted by Jimbo

when your data is corrupted you get read errors; computers don't tolerate errors, because there are checks all the time.
Errors happen, noise exists, and so does electromagnetic interference. When this happens, the error is either corrected by convolutional codes or detected by checks,

You have to realize that most operations are not checked. The risk of hardware-caused errors is too small for the cost of it (we are talking about the consumer market here). That means that if they DO happen, you're toast. What is checked is the notoriously erroneous storage media, but not the data on the way to it.

Originally posted by Jimbo

it is *very* hard to hit silent errors. As I said, CERN found them because of a firmware malfunction on some RAID 5 controllers, and here ZFS proved its value. But there are not many more cases of silent errors.

I just remembered that a few years ago I had a consumer-grade motherboard with a built-in SATA controller that was corrupting written data (quite a lot). Again, it was fixable by a firmware update (even though it could have been caused by a hardware bug), but it indicates that shit hits the fan more often than extremely rarely. It's not surprising, after all: software has bugs, especially firmware. (Btw, I never got any warning from any part of the system. I discovered it the hard way.)
Again, it's easy to fry your memory, which may cause silent errors, but it is also easy to verify with memtest that you don't have broken memory.
Keep in mind, though, that memory exposed to electrostatic shock might not fail right away. I had a computer that started to show memory corruption months after the last time the modules were touched.

Originally posted by Jimbo

So trying to confuse others by saying that you must be protected because your data is changed on your disk all the time due to errors is basically not true. Critical data servers have worked and continue to work with ext4, ext3... filesystems without problems.

I am NOT promoting ZFS or 'confusing others'. Data won't be changed _on disk_ without detection, but it is not impossible for wrong data to get written there in the first place.
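
A small sketch of that ordering problem, under the assumption that the checksum is computed over whatever bytes the file system is handed. `write_block` and `verify` are hypothetical helpers, not any real file system's API.

```python
import zlib


def write_block(data: bytes):
    """Checksum exactly the bytes handed over, right or wrong."""
    return data, zlib.crc32(data)


def verify(block) -> bool:
    """Re-compute the checksum and compare it with the stored one."""
    data, crc = block
    return zlib.crc32(data) == crc


intended = b"1234567890"

# Bad RAM flips bits before the write reaches the file system:
# '1' (0x31) becomes '2' (0x32).
in_ram = bytearray(intended)
in_ram[0] ^= 0x03

block = write_block(bytes(in_ram))
print(verify(block))          # True: the checksum matches the wrong data
print(block[0] == intended)   # False: what was stored is not what was meant
```

The stored block is internally consistent, so no later check on disk can tell that it differs from what the application meant to write.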

I agree that bitflipping caused by cosmic rays is *very* hard to get. A computer case protects well from external electromagnetic interference and you can generally trust the designers of the hardware to have given it good margins for internal interference.

The greatest risks are memory corruption and faulty firmware or hardware. If you really have critical data servers, you probably run ECC memory, so the first part is taken care of. The only way to protect against firmware errors is to choose proven hardware. But that usually means old hardware, and you are trading performance for reliability again. We could do more about it if the firmware was free software, but it's not.

Having said that, I am not using ZFS, and if I am missing anything about it, it would be the snapshots and built-in volume management. For me data checksumming is just an extra.

**Jimbo** · 06 November 2010, 08:47 AM

Attacking me makes you more juvenile.

The main source of errors is electromagnetic noise and electromagnetic interference from other hardware (on the same motherboard or external hardware), which may be exacerbated by manufacturing errors.

This is known, and vendors implement features to check for errors; it is fucking hard to get a silent error. You could look at the SATA T13 and SCSI T10 standards, which propose methods for checksumming correctly. Even on hardware that doesn't support those standards, vendors have always implemented their own algorithms to add some type of checksumming.

I repeat, 1234567890 is not converted to 2234567890 due to errors on your hard disk, and this is not happening all the time as kebabbert says (this was my main reply), and even with memory errors your hardware controller and your OS protect you. Yeah, silent errors exist, but it is not easy to see them.

You are guessing that simple errors lead to data corruption and this is not true!!

**Jimbo** · 06 November 2010, 08:49 AM

In my last sentence I meant data change, not data corruption; data corruption happens all the time.


Ted Ts'o: EXT4 Within Striking Distance Of XFS
