Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC
-
http://www.dirtcellar.net
-
Originally posted by waxhead
I think you forget about memory channels and memory banks. If you have dual, triple, or even quad channel, then when one memory module goes bad you simply offline the entire memory bank, i.e. you offline all two, three, or four memory modules in that particular bank. Then you replace the broken module and online the memory bank again. So interleaving does not in itself determine whether memory can be offlined. Do you agree?
Last edited by coder; 10 February 2021, 09:53 PM.
-
Originally posted by schmidtbag
You act like RAM is known to regularly and steadily degrade over a relatively short period of time.
We have one server, probably with a bad motherboard or CPU memory controller, that occasionally reports correctable memory errors and has done so for years. We tried replacing the DIMM, but with no effect. When it gets bad enough, we will replace the server (if it's not replaced for other reasons before then). However, because only one bit is affected, the errors have all been corrected and it continues to run stably.
In most cases where the actual DIMM was bad, I've seen it just get worse and worse, until we could replace it. There was a machine used by our QA folks for testing builds and they hadn't been monitoring for ECC errors. Eventually, the machine had become too unstable to use, and when I checked the logs, I found ECC errors on two of their DIMMs and one had gotten so bad that it was routinely generating uncorrectable multi-bit errors. So, there definitely is some degradation over time. That's not to say that you couldn't have some sort of catastrophic failure that suddenly starts causing multi-bit errors from one moment to the next, but the degradation from "perfectly functioning" to the point of having a high error rate often takes some time.
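For anyone who wants to watch for this kind of drift rather than find it in the logs after the fact, here is a minimal sketch. It assumes a Linux host with an EDAC driver loaded, which exposes per-controller `ce_count` (correctable) and `ue_count` (uncorrectable) counters under `/sys/devices/system/edac/mc`. The function names `read_ecc_counts` and `worsened` are made up for illustration, and other platforms report ECC events differently.

```python
from pathlib import Path

# Assumed location of the Linux EDAC counters; adjust for your system.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ecc_counts(root=EDAC_ROOT):
    """Return {controller_name: (correctable, uncorrectable)} for each mc* dir."""
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts  # EDAC driver not loaded, or not a Linux host
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


def worsened(previous, current):
    """List controllers whose error counts grew between two snapshots."""
    grown = []
    for name, (ce, ue) in current.items():
        old_ce, old_ue = previous.get(name, (0, 0))
        if ce > old_ce or ue > old_ue:
            grown.append(name)
    return grown


if __name__ == "__main__":
    for name, (ce, ue) in read_ecc_counts().items():
        print(f"{name}: correctable={ce} uncorrectable={ue}")
```

Taking a snapshot on a schedule and passing two of them to `worsened` would flag a controller whose counts are climbing, which is the pattern described above.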
Originally posted by schmidtbag
Most of such systems are so low-end with so little processing that they're not handling any especially critical data. They use slow RAM, which tends to be more stable. Of course, that doesn't mean it's failproof.
Or maybe it's something fairly non-critical, but it's just installed somewhere that's difficult, costly, time-consuming, or dangerous to access. So, you want to maximize reliability simply to reduce the need for physical maintenance.
Last edited by coder; 11 February 2021, 03:25 AM.
-
Originally posted by coder
You're right -- a page where there are errors could be blocked, regardless of interleaving. I guess I was mainly thinking about offlining an entire DIMM, which can't be done in an interleaved setup.
http://www.dirtcellar.net
- Likes 1
-
Originally posted by coder
I'm just speaking from my experience that once a DIMM starts experiencing correctable ECC errors, the rate of those errors only increases with time.
Industrial applications need greater reliability than "it tends to be stable". There can be significant safety or financial liabilities if the hardware malfunctions. Or we could talk about medical equipment or telecoms (the P-series Atom I linked is aimed at transceiver base stations).
Originally posted by Zan Lynx
That is exactly what RAM does in its most common failure modes. The big server farms like Google have published data on this and they say that the most common indicator of future errors is past errors.
In other words, as soon as a memory stick starts reporting errors, it often starts to get worse.
-
Originally posted by waxhead
As for our friends that use operating systems named after the principle of punching a hole in a load bearing wall, then covering that hole with a thin sheet of breakable glass that completely ruins your privacy and allows for easy entry by intruders -
- Likes 1
-
Originally posted by schmidtbag
Once RAM gets a physical defect, the problems aren't going to stop and ECC at that point is mostly just there for error reporting rather than correction, before bad gets worse.
When you have a physical degradation, it both lets you know and buys you time to take corrective action, without suffering data loss or unplanned downtime.
- Likes 1
-
Originally posted by coder
Huh? It absolutely does correct bad cells or wires! Again, ECC works the same, regardless of the underlying cause. A bad bit is a bad bit. Period.
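To make the "a bad bit is a bad bit" point concrete, here is a toy sketch using a Hamming(7,4) code. This is an illustration only: ECC DIMMs use a wider code (typically 64 data bits plus 8 check bits), and the function names `encode` and `decode` are made up for this example. The principle carries over, though: the decoder sees only which bit disagrees with the check bits, not whether a cosmic ray, a worn cell, or a bad trace flipped it.

```python
def encode(nibble):
    """Encode 4 data bits (list of 0/1) into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(word):
    """Return (data_bits, corrected_position); position 0 means no error seen."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    pos = s1 + 2 * s2 + 4 * s4  # syndrome is the 1-based index of the bad bit
    if pos:
        w[pos - 1] ^= 1  # flip it back, whatever caused the flip
    return [w[2], w[4], w[5], w[6]], pos


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    stored = encode(data)
    stored[5] ^= 1  # simulate one stuck or flipped bit
    recovered, pos = decode(stored)
    print(recovered == data, pos)
```

Note that this sketch handles only a single flipped bit per codeword. With two flipped bits, plain Hamming(7,4) would "correct" the wrong position, which is why real ECC adds an extra parity bit so that double-bit errors are detected and reported as uncorrectable instead.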