Ted Ts'o: EXT4 Within Striking Distance Of XFS
Originally posted by misiu_mp: Say I write data to the FS but don't flush it to disk. The RAM gets corrupted, and when the fsync comes, something else flows out of my RAM and into the storage controller.
What CRC check is it that will signal this? ...
Memory uses parity checks, and ECC memory is an improved version of parity checking. If your memory changes because of "high energy particles in the air trying to change them", your memory controller detects it.
Actually hitting a silent error is *very* hard; bad RAID 5 controllers seem to be one rare case. Some engineers believe that silent errors are more myth than fact.
How many times have you seen a corrupted file or file system? I guess a few. But how many times have you, your friends, or people at your work experienced silent corruption?
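To make the parity claim concrete, here is a minimal sketch of a single even-parity bit over one byte. The function names are made up for illustration and this is not how any particular memory controller implements it. It shows that one flipped bit is caught, while two flipped bits in the same byte slip through:

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit for an 8-bit value: 1 if the number of set bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def check(byte: int, stored_parity: int) -> bool:
    """True if the byte still agrees with the parity bit stored next to it."""
    return parity_bit(byte) == stored_parity


original = 0b10110010
stored = parity_bit(original)

one_flip = original ^ 0b00000100    # a single bit flipped
two_flips = original ^ 0b00010100   # two bits flipped in the same byte

single_detected = not check(one_flip, stored)
double_detected = not check(two_flips, stored)
print(single_detected, double_detected)
```

Plain parity only notices an odd number of flipped bits, and it cannot say which bit changed, so it can detect but never repair.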
-
Originally posted by Jimbo: Really, please stop guessing that silent errors are something usual. If you don't know how CRC, parity ... work, please stop guessing that silent errors happen, stop guessing how this works, and accept that you don't know how it works.
Memory uses parity checks, and ECC memory is an improved version of parity checking. If your memory changes because of "high energy particles in the air trying to change them", your memory controller detects it.
Actually hitting a silent error is *very* hard; bad RAID 5 controllers seem to be one rare case. Some engineers believe that silent errors are more myth than fact.
How many times have you seen a corrupted file or file system? I guess a few. But how many times have you, your friends, or people at your work experienced silent corruption?
I'm not saying it is common to get space-radiation-induced memory corruption, but it could happen. Also, electrostatic-induced corruption is much more common. And there is most likely no CRC to cover that. You can only hope that if you fried your chips, you fried them well enough for it to show quickly.
I know very well what parity checks and CRCs are and how they work.
The problem is that there is no automatic CRC done on memory, and a parity check requires parity memory, which, although not the same as ECC memory, is equally unpopular among ordinary users. We have to live with no memory error detection whatsoever below the application/kernel level.
Oh, there is memtest86, but by then it is too late for our bits.
And you know what? That's fine. When it comes to storage, it's good enough for home users and it's good enough for enterprises like Google, for whom data integrity is not an absolute requirement and which most likely use ECC memory anyway, greatly reducing the probability of errors.
Everyone else, worrying about the electrons as they leave the RAM and flow through the wires and registers and caches of the motherboard (all possibly corrupted and exposed to the elements) to the write heads of the disk, can use ZFS or anything else that checksums everything it writes to disk.
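As a rough sketch of what "checksumming everything it writes" buys you, here is a toy block store that keeps a CRC32 with each block and verifies it on every read. `ChecksummedStore` is a hypothetical class for illustration only; real filesystems differ in which checksum they use and where they keep it.

```python
import zlib


class ChecksummedStore:
    """Toy block store that records a CRC32 for every block (illustration only)."""

    def __init__(self):
        self.blocks = {}  # block number -> (data bytes, crc32 of data)

    def write(self, n: int, data: bytes) -> None:
        self.blocks[n] = (bytes(data), zlib.crc32(data))

    def read(self, n: int) -> bytes:
        data, crc = self.blocks[n]
        if zlib.crc32(data) != crc:
            # The data no longer matches what was checksummed at write time.
            raise IOError(f"checksum mismatch on block {n}")
        return data


store = ChecksummedStore()
store.write(0, b"1234567890")

# Simulate corruption somewhere between the checksum and the platter:
# the stored data changes, the stored checksum does not.
_, crc = store.blocks[0]
store.blocks[0] = (b"2234567890", crc)

try:
    store.read(0)
    detected = False
except IOError:
    detected = True
print(detected)
```

The corruption turns into a loud read error instead of silently wrong data, which is the whole point of checksumming at the filesystem level.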
-
Btw, most of the errors discovered by memory checkers such as memtest86 are actually silent errors. They usually do cause crashes and hangs, because they change in-memory data structures and code in random ways. But if they don't affect any vital part of the memory, they might just as well be silently corrupting your ISOs as you download or copy them.
Would you trust the data written by a system that has been shown to have bad RAM?
-
Originally posted by misiu_mp: I'm not saying it is common to get space-radiation-induced memory corruption, but it could happen. Also, electrostatic-induced corruption is much more common. And there is most likely no CRC to cover that. You can only hope that if you fried your chips, you fried them well enough for it to show quickly.
Originally posted by misiu_mp: Btw, most of the errors discovered by memory checkers such as memtest86 are actually silent errors. They usually do cause crashes and hangs, because they change in-memory data structures and code in random ways. But if they don't affect any vital part of the memory, they might just as well be silently corrupting your ISOs as you download or copy them.
Would you trust the data written by a system that has been shown to have bad RAM?
-
Originally posted by Jimbo: I don't agree. When an error appears in a 64-bit word, usually only one bit is erroneous (the BER of today's memory is very low) and ECC can correct this case; when 2 bits in a 64-bit word are erroneous (extremely unusual), ECC can detect it. So your data is not compromised if you use ECC memory; otherwise your data may be compromised, and no FS can help, because your memory is telling the FS to write this data.
I said 'most likely' because home PCs generally do not use ECC or parity memory, nor do they CRC the data written.
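The "correct one bit, detect two" behaviour described in the quote can be sketched with a scaled-down code. ECC modules typically protect 64 data bits with 8 check bits; the toy below protects only 4 data bits with an extended Hamming(8,4) code, and the `encode`/`decode` helpers are made up for illustration rather than taken from any real controller.

```python
def encode(nibble):
    """Encode 4 data bits into 8 bits: index 0 is overall parity, 1..7 is Hamming(7,4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d       # data positions
    code[1] = code[3] ^ code[5] ^ code[7]        # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]        # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]        # parity over positions 4,5,6,7
    code[0] = sum(code[1:]) % 2                  # overall parity makes the total even
    return code


def decode(received):
    """Return (nibble, status); nibble is None when the error cannot be corrected."""
    code = list(received)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                      # XOR of set positions is 0 for a valid word
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        code[syndrome] ^= 1                      # single flip: syndrome names the position
        status = "corrected"
    else:
        return None, "uncorrectable"             # even parity but bad syndrome: two flips
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(word)
two_flips[5] ^= 1
two_flips[6] ^= 1

print(decode(word), decode(one_flip), decode(two_flips))
```

Three or more flips in one word can fool a code like this, which is the "3 bits are changed" case raised elsewhere in the thread.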
Originally posted by Jimbo: If the system is that fucked up, no FS can help to store data securely. ECC memory is a must to avoid those nasty situations (which are not frequent anyway), because errors are detected, and the system (at least Linux does) counts them and warns the user that memory is corrupted.
Btw, damaging memory (RAM, disk cache) is neither difficult nor unlikely if you handle memory modules or HDDs without taking the necessary precautions, and there is a high risk even if you do take precautions (because you might not have taken precautions in taking precautions - e.g. paint on the heating element might prevent the wrist strap from working). It's a good idea to run memtest after handling/touching memory.
-
Originally posted by misiu_mp: How is it that you don't agree with me but I do agree with you? Must be the age difference.
I said 'most likely' because home PCs generally do not use ECC or parity memory, nor do they CRC the data written.
Again, you are right. Bad RAM equals a ruined system. ECC will almost always save you. The thing is, when it happens undetected (e.g. when you don't have ECC or when 3 bits are changed), a checksumming file system might help you discover it sooner, before you lose more data and work.
Btw, damaging memory (RAM, disk cache) is neither difficult nor unlikely if you handle memory modules or HDDs without taking the necessary precautions, and there is a high risk even if you do take precautions (because you might not have taken precautions in taking precautions - e.g. paint on the heating element might prevent the wrist strap from working). It's a good idea to run memtest after handling/touching memory.
ZFS is a superior and more robust FS; it detects errors better than other filesystems.
If you have corrupted memory, neither ZFS nor any other FS can guarantee your data.
If you are using ext4, ext3... and write 1234567890 to your disk, you don't get 2234567890 back because of read errors; when your data is corrupted you get read errors. Computers don't tolerate errors, because there are checks all the time. Errors happen, and noise and electromagnetic interference exist too; when this happens, the error is corrected by convolutional codes or detected by checks. It is *very* hard to hit silent errors. As I said, CERN found them because of a firmware malfunction on some RAID 5 controllers, and here ZFS proved its value. But there are not many more cases of silent errors.
So trying to confuse others by saying that you must be protected because your data is changed on your disk all the time due to errors is basically not true. Critical data servers have worked and continue to work with ext4, ext3... filesystems without problems.
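The point about corrupted memory can be illustrated with a small sketch. It assumes a hypothetical writer that computes a CRC32 over whatever is in the buffer at write time; the helper names are invented for the example. If the bits flip in RAM before the checksum is taken, the checksum faithfully protects the wrong data:

```python
import zlib


def write_block(buffer: bytes):
    """Store the buffer with a checksum computed from its contents at write time."""
    return buffer, zlib.crc32(buffer)


def verify(block) -> bool:
    """True if the stored data still matches the stored checksum."""
    data, crc = block
    return zlib.crc32(data) == crc


intended = b"1234567890"

in_ram = bytearray(intended)
in_ram[0] ^= 0x03          # '1' (0x31) becomes '2' (0x32) while still in RAM

block = write_block(bytes(in_ram))

passes = verify(block)                  # the checksum matches the corrupted data
matches_intent = block[0] == intended   # but the data is not what was meant
print(passes, matches_intent)
```

A checksum only covers the path after the point where it is computed, so this is the case where ECC memory, not the filesystem, has to do the work.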
-
Originally posted by Jimbo: Better not to speak about ages; if you want to believe you are older, go ahead! I have no problem believing that I am young.
Originally posted by Jimbo: If you are using ext4, ext3... and write 1234567890 to your disk, you don't get 2234567890 back because of read errors,
Originally posted by Jimbo: when your data is corrupted you get read errors. Computers don't tolerate errors, because there are checks all the time.
Errors happen, and noise and electromagnetic interference exist too; when this happens, the error is corrected by convolutional codes or detected by checks,
Originally posted by Jimbo: it is *very* hard to hit silent errors. As I said, CERN found them because of a firmware malfunction on some RAID 5 controllers, and here ZFS proved its value. But there are not many more cases of silent errors.
Again, it's easy to fry your memory, which may cause silent errors, but it is also easy to verify with memtest that you don't have broken memory.
Keep in mind, though, that memory exposed to electrostatic shock might not fail right away. I had a computer that started to show memory corruption months after the last time the modules were touched.
Originally posted by Jimbo: So trying to confuse others by saying that you must be protected because your data is changed on your disk all the time due to errors is basically not true. Critical data servers have worked and continue to work with ext4, ext3... filesystems without problems.
I agree that bit flips caused by cosmic rays are *very* rare. A computer case protects well against external electromagnetic interference, and you can generally trust the designers of the hardware to have given it good margins for internal interference.
The greatest risks are memory corruption and faulty firmware or hardware. If you really have critical data servers, you probably run ECC memory, so the first part is taken care of. The only way to protect against firmware errors is to choose proven hardware. But that usually means old hardware, and you are trading performance for reliability again. We could do more about it if the firmware were free software, but it's not.
Having said that, I am not using ZFS, and if I am missing anything about it, it would be the snapshots and built-in volume management. For me, data checksumming is just an extra.
-
Attacking me makes you more juvenile.
The main source of errors is electromagnetic noise and electromagnetic interference from other hardware (on the same motherboard or external hardware), which may be exacerbated by manufacturing defects.
This is known, and vendors implement features to check for errors; it is fucking hard to get a silent error. You can look at the SATA (T13) and SCSI (T10) standards, which propose methods for checksumming correctly. Even on hardware that doesn't support those standards, vendors have always implemented their own algorithms to add some type of checksumming.
I repeat, 1234567890 is not converted to 2234567890 due to errors on your hard disk, and this is not happening all the time as kebabbert says (this was my main reply), and even with memory errors your hardware controller and your OS protect you. Yeah, silent errors exist, but it is not easy to see them.
You are guessing that simple errors lead to data corruption, and this is not true!!