**coder** · 14 February 2021, 06:49 AM

Originally posted by schmidtbag

You seem to forget that bit flips occur from more than just damaged hardware. If there is a permanent defect, it's just a matter of time until the parity bits will encounter an accidental flip, in which case the bad cells won't be corrected.

Not forgetting. Just that in my experience, the ECC errors I've seen have been overwhelmingly caused by marginal hardware. So, you get a bit of runtime with a failing DIMM before an uncorrectable, multi-bit error is likely to occur. Importantly, if one does occur, it's not silent. As noted before, the process triggering the error will abort, so you don't silently persist the error to disk or propagate it over the network.
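To make the "corrected single-bit error vs. detected multi-bit error" distinction concrete, here is a minimal Python sketch of a toy SECDED code (Hamming(7,4) plus one overall parity bit). It's only an illustration under stated assumptions, not how any particular memory controller works: ECC DIMMs commonly protect 64 data bits with 8 check bits rather than 4 with 4, and the names `encode`, `decode`, and `read_with_faults` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED word (index 0 = overall parity)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d      # data bits
    code[1] = code[3] ^ code[5] ^ code[7]       # check bit for positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]       # check bit for positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]       # check bit for positions 4,5,6,7
    code[0] = sum(code[1:]) % 2                 # overall parity: whole word is even
    return code


def decode(code):
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                     # XOR of set positions is 0 for a valid word
    overall_odd = sum(code) % 2 == 1
    word = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        word[syndrome] ^= 1                     # single-bit error: flip it back
        status = "corrected"
    else:
        return None, "uncorrectable"            # two bits wrong: detected, not fixable
    nibble = word[3] | word[5] << 1 | word[6] << 2 | word[7] << 3
    return nibble, status


def read_with_faults(code, stuck_at_one=None, flipped=None):
    """Simulate a read with an optional stuck-at-1 cell and an optional random flip."""
    word = list(code)
    if stuck_at_one is not None:
        word[stuck_at_one] = 1                  # hard error: cell always reads 1
    if flipped is not None:
        word[flipped] ^= 1                      # soft error: transient bit flip
    return word


stored = encode(0b1001)
print(decode(read_with_faults(stored, stuck_at_one=5)))             # (9, 'corrected')
print(decode(read_with_faults(stored, stuck_at_one=5, flipped=2)))  # (None, 'uncorrectable')
```

The stuck cell alone is corrected on every read. The stuck cell plus a second flip in the same word can't be corrected, but it is still detected rather than returned as good data.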

Originally posted by schmidtbag

ECC isn't supposed to compensate for damaged hardware.

I don't know why you keep saying that, as if you're someone who decides these things.

Originally posted by schmidtbag

In this context, it basically just buys you enough time to replace the defective DIMM without suffering catastrophic losses.

Yes, that's my point. It notifies the admin and gives them a chance to perform maintenance without data loss or unplanned downtime. Is there a slight risk of a hardware error or other cause triggering an uncorrectable multi-bit error? Sure. But it's just like any other fault-tolerance mechanism in that respect. There's always a level of catastrophic failure that can still happen, but as long as it's sufficiently improbable, not undetectable, and still manageable (i.e. you can resort to backups), then that's okay.
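As a rough illustration of the "notifies the admin" part: on Linux, when an EDAC driver is loaded for the memory controller, corrected and uncorrected error counts are exposed as `ce_count` and `ue_count` under `/sys/devices/system/edac/mc/mcN/`. The sketch below reads those counters and suggests an action. The function names and the threshold of 10 corrected errors are assumptions for the example, not a vendor recommendation.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue                            # counter missing or unreadable: skip
        counts[mc.name] = (ce, ue)
    return counts


def triage(ce, ue, ce_threshold=10):
    """Map error counts to an action; ce_threshold is a hypothetical cutoff."""
    if ue > 0:
        return "replace now"                    # uncorrectable errors already seen
    if ce >= ce_threshold:
        return "schedule maintenance"           # DIMM looks marginal; swap it soon
    return "ok"


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue} -> {triage(ce, ue)}")
```

On a machine without EDAC support the directory is empty or absent, and the script simply prints nothing.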

**schmidtbag** · 14 February 2021, 05:21 PM

Originally posted by coder

Not forgetting. Just that in my experience, the ECC errors I've seen have been overwhelmingly caused by marginal hardware.

Maybe stop buying cheap crap then?

Originally posted by coder

I don't know why you keep saying that, as if you're someone who decides these things.

It's not a matter of opinion, but seeing as you're too cheap to buy good RAM, I'm not surprised you're also cheap enough to think ECC can compensate for defects. The parity bits don't guarantee a fix, especially if there's a permanent defect. Even if the parity bits can fix a hard error, it's just a matter of time until a second bit in the same word flips, and then the error can't be corrected. If data integrity is important enough to warrant ECC, it is objectively irresponsible not to replace RAM with a physical defect. It is widely agreed throughout the industry that hard errors require replacing the module.

What are the Common Memory Error Types and How Do ECC DIMMs Work?

https://www.atpinc.com/blog/ecc-dimm-memory-ram-errors-types-chipkill

Explaining ECC mechanism for DIMMs and how DRAM functionality, system tests plays a vital role in ensuring the reliability of your server, mission critical systems

Troubleshooting DIMM Problems

https://docs.oracle.com/cd/E19121-01/sf.x4150/820-4213-11/dimms.html

The Sun Fire X4150, X4250, and X4450 Servers Diagnostics Guide contains information and procedures for using available tools to diagnose problems with the servers.

https://www.cisco.com/c/dam/en/us/support/docs/servers-unified-computing/ucs-b-series-blade-servers/ManagingCorrectableMemoryErrorsFinalJuly142020.pdf

By your logic, that's like having a kidney removed and saying "it's fine, the spleen and liver also filter blood". There might be some overlap in what they can filter but they aren't drop-in replacements. Having them is nice but your quality of life will be compromised with only one kidney.

Originally posted by coder

Yes, that's my point. It notifies the admin and gives them a chance to perform maintenance without data loss or unplanned downtime. Is there a slight risk of a hardware error or other cause triggering an uncorrectable multi-bit error? Sure. But it's just like any other fault-tolerance mechanism in that respect. There's always a level of catastrophic failure that can still happen, but as long as it's sufficiently improbable, not undetectable, and still manageable (i.e. you can resort to backups), then that's okay.

If that really were your point, you wouldn't question why I say ECC isn't meant to compensate for defects.
The fact of the matter is, as long as your RAM is healthy, all errors are "sufficiently improbable" and manageable. ECC is needed where "sufficient", ironically, isn't good enough.

Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC
