Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC


  • Originally posted by schmidtbag View Post
    You seem to forget that bit flips occur from more than just damaged hardware. If there is a permanent defect, it's just a matter of time until the parity bits will encounter an accidental flip, in which case the bad cells won't be corrected.
    Not forgetting. Just that in my experience, the ECC errors I've seen have been overwhelmingly caused by marginal hardware. So, you have a bit of runtime with a failing DIMM before an uncorrectable, multi-bit error is likely to occur. Importantly, if it does, it's not silent. As noted before, the process triggering the error will abort, so that you don't silently persist the error on disk or propagate it via the network.

    Originally posted by schmidtbag View Post
    ECC isn't supposed to compensate for damaged hardware.
    I don't know why you keep saying that, as if you're someone who decides these things.

    Originally posted by schmidtbag View Post
    In this context, it basically just buys you enough time to replace the defective DIMM without suffering catastrophic losses.
    Yes, that's my point. It notifies the admin and gives them a chance to perform maintenance without data loss or unplanned downtime. Is there a slight risk of a hardware error or other cause triggering an uncorrectable multi-bit error? Sure. But it's just like any other fault-tolerance mechanism, in that respect. There's always a level of catastrophic failure that can still happen, but as long as it's sufficiently improbable, not undetectable, and still manageable (i.e. you can resort to backups), then that's okay.
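
To make the corrected-versus-detected distinction concrete, here is a toy sketch of a SECDED (single-error-correct, double-error-detect) code in Python. It is an illustration only: it protects 4 data bits with an extended Hamming (8,4) code, whereas real ECC DIMMs use much wider words and the exact code depends on the memory controller. The names `encode` and `decode` are made up for this example, and the "process aborts" behaviour mentioned above is an OS-level reaction that is not modelled here.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions, with check bits at 1, 2 and 4.
    """
    code = [0] * 8
    # Data bits go into the non-power-of-two positions.
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    # Each check bit makes the parity of its covered positions even.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Overall parity over the whole word enables double-error detection.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # XOR of the positions of all set bits: 0 for a valid word,
    # otherwise the position of a single flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 0 if total parity is still even

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single-bit error and fix it.
        # (syndrome == 0 here means the overall parity bit itself flipped.)
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        # The error is detected, but the data cannot be trusted.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)

    # Simulated hard error: one cell always reads back the wrong value.
    bad_cell = list(word)
    bad_cell[5] ^= 1
    print(decode(bad_cell))

    # The same bad cell plus one transient flip elsewhere in the word.
    two_errors = list(bad_cell)
    two_errors[2] ^= 1
    print(decode(two_errors))
```

With a single bad cell, every read comes back as "corrected" with the original data, which is the window in which the admin gets notified. Once a second bit in the same word flips, the decoder reports "uncorrectable" instead of handing back wrong data, so the failure is loud rather than silent.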



    • Originally posted by coder View Post
      Not forgetting. Just that in my experience, the ECC errors I've seen have been overwhelmingly caused by marginal hardware.
      Maybe stop buying cheap crap then?

      Originally posted by coder View Post
      I don't know why you keep saying that, as if you're someone who decides these things.
      It's not a matter of opinion, but seeing as you're too cheap to buy good RAM, I'm not surprised you also think ECC is capable of compensating for defects. The parity bits don't guarantee a fix, especially if there's a permanent defect. Even if the parity bits can correct a hard error, it's just a matter of time until a parity bit (or any other bit in the same word) also flips, and a second error in the word can no longer be corrected. If data integrity is important enough to warrant ECC, it is objectively irresponsible not to replace RAM with a physical defect. It is widely agreed throughout the industry that hard errors require replacing the module.
      Explaining ECC mechanism for DIMMs and how DRAM functionality, system tests plays a vital role in ensuring the reliability of your server, mission critical systems

      The Sun Fire X4150, X4250, and X4450 Servers Diagnostics Guide contains information and procedures for using available tools to diagnose problems with the servers.


      By your logic, that's like having a kidney removed and saying "it's fine, the spleen and liver also filter blood". There might be some overlap in what they can filter but they aren't drop-in replacements. Having them is nice but your quality of life will be compromised with only one kidney.

      Originally posted by coder View Post
      Yes, that's my point. It notifies the admin and gives them a chance to perform maintenance without data loss or unplanned downtime. Is there a slight risk of a hardware error or other cause triggering an uncorrectable multi-bit error? Sure. But it's just like any other fault-tolerance mechanism, in that respect. There's always a level of catastrophic failure that can still happen, but as long as it's sufficiently improbable, not undetectable, and still manageable (i.e. you can resort to backups), then that's okay.
      If that really was your point then you wouldn't question why I say ECC isn't meant to compensate for defects.
      The fact of the matter is, as long as your RAM is healthy, all errors are "sufficiently improbable" and manageable. ECC is needed where "sufficient", ironically, isn't good enough.
