Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC


  • schmidtbag
    replied
    Originally posted by coder View Post
    Not forgetting. Just that in my experience, the ECC errors I've seen have been overwhelmingly caused by marginal hardware.
    Maybe stop buying cheap crap then?
    Originally posted by coder View Post
    I don't know why you keep saying that, as if you're someone who decides these things.
    It's not a matter of opinion, but seeing as you're too cheap to buy good RAM, I'm not surprised you're also too cheap to replace it and would rather count on ECC to compensate for defects. The parity bits don't guarantee a fix, especially if there's a permanent defect. Even if the parity bits are capable of fixing a hard error, it's just a matter of time until a parity bit itself suffers a flip. If data integrity is important enough to warrant ECC, it is objectively irresponsible not to replace RAM with a physical defect. It is widely agreed upon throughout the industry that hard errors require replacing the module.
    https://www.atpinc.com/blog/ecc-dimm...types-chipkill
    https://docs.oracle.com/cd/E19121-01...-11/dimms.html
    https://www.cisco.com/c/dam/en/us/su...July142020.pdf
    By your logic, that's like having a kidney removed and saying "it's fine, the spleen and liver also filter blood". There might be some overlap in what they can filter but they aren't drop-in replacements. Having them is nice but your quality of life will be compromised with only one kidney.
    Originally posted by coder View Post
    Yes, that's my point. It notifies the admin and gives them a chance to perform maintenance without data loss or unplanned downtime. Is there a slight risk of a hardware error or other cause triggering an uncorrectable multi-bit error? Sure. But it's just like any other fault-tolerance mechanism, in that respect. There's always a level of catastrophic failure that can still happen, but as long as it's sufficiently improbable, not undetectable, and still manageable (i.e. you can resort to backups), then that's okay.
    If that really was your point then you wouldn't question why I say ECC isn't meant to compensate for defects.
    The fact of the matter is, as long as your RAM is healthy, all errors are "sufficiently improbable" and manageable. ECC is needed where "sufficient", ironically, isn't good enough.
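    For illustration of the correctable-vs-uncorrectable distinction being argued here, below is a toy SECDED sketch in Python. It uses a tiny extended Hamming(8,4) code rather than the (72,64) code real ECC DIMMs implement in the memory controller, so it is only a model of the principle: one flipped bit is silently corrected, two flipped bits are detected but cannot be fixed.
    Code:
# Toy SECDED demo: extended Hamming(8,4). Real DRAM ECC uses a wider code
# (typically 64 data + 8 check bits) and runs in the memory controller;
# this is just a software model of the correct-one / detect-two behaviour.

def encode(d):
    """d: four data bits -> 8-bit extended Hamming codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                       # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4                       # covers positions 2,3,6,7
    p4 = d2 ^ d3 ^ d4                       # covers positions 4,5,6,7
    word = [0, p1, p2, d1, p4, d2, d3, d4]  # index 0 holds overall parity
    word[0] = sum(word[1:]) % 2
    return word

def decode(word):
    """Return (status, recovered data bits or None)."""
    syndrome = 0
    for pos in range(1, 8):                 # XOR of positions holding a 1
        if word[pos]:
            syndrome ^= pos
    overall = sum(word) % 2                 # extended parity over all 8 bits
    if syndrome == 0 and overall == 0:
        pass                                # clean codeword
    elif overall == 1:
        word = word[:]                      # single-bit error: correct it
        word[syndrome] ^= 1                 # syndrome 0 = the parity bit itself
    else:
        return "uncorrectable (double-bit error)", None
    return "ok", [word[3], word[5], word[6], word[7]]

cw = encode([1, 0, 1, 1])

one_flip = cw[:]; one_flip[5] ^= 1          # one bad cell
print(decode(one_flip))                     # ('ok', [1, 0, 1, 1])

two_flips = cw[:]; two_flips[5] ^= 1; two_flips[6] ^= 1   # two bad cells
print(decode(two_flips))                    # ('uncorrectable (double-bit error)', None)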

  • coder
    replied
    Originally posted by schmidtbag View Post
    You seem to forget that bit flips occur from more than just damaged hardware. If there is a permanent defect, it's just a matter of time until the parity bits encounter an accidental flip, in which case the bad cells won't be corrected.
    Not forgetting. Just that in my experience, the ECC errors I've seen have been overwhelmingly caused by marginal hardware. So, you have a bit of runtime with a failing DIMM before an uncorrectable, multi-bit error is likely to occur. Importantly, if it does, it's not silent. As noted before, the process triggering the error will abort, so that you don't silently persist the error on disk or propagate it via the network.

    Originally posted by schmidtbag View Post
    ECC isn't supposed to compensate for damaged hardware.
    I don't know why you keep saying that, as if you're someone who decides these things.

    Originally posted by schmidtbag View Post
    In this context, it basically just buys you enough time to replace the defective DIMM without suffering catastrophic losses.
    Yes, that's my point. It notifies the admin and gives them a chance to perform maintenance without data loss or unplanned downtime. Is there a slight risk of a hardware error or other cause triggering an uncorrectable multi-bit error? Sure. But it's just like any other fault-tolerance mechanism, in that respect. There's always a level of catastrophic failure that can still happen, but as long as it's sufficiently improbable, not undetectable, and still manageable (i.e. you can resort to backups), then that's okay.
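    For what it's worth, on Linux that notification path can also be watched from userspace. Below is a rough sketch (not a production tool) that polls the kernel's EDAC counters under /sys/devices/system/edac; the one-error warning threshold and the print-based alerting are made up for the example, and real setups would typically run rasdaemon or similar instead.
    Code:
# Rough sketch: poll the Linux EDAC sysfs counters for corrected (ce_count)
# and uncorrected (ue_count) memory errors, per memory controller.
# The threshold and the print-based "alerting" are placeholders.
import glob
import os

CE_WARN_THRESHOLD = 1   # arbitrary: any corrected error is worth a look

def read_count(path):
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0

def check_edac():
    for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
        ce = read_count(os.path.join(mc, "ce_count"))   # corrected errors
        ue = read_count(os.path.join(mc, "ue_count"))   # uncorrected errors
        name = os.path.basename(mc)
        if ue:
            print(f"ALERT {name}: {ue} uncorrectable error(s), replace hardware")
        elif ce >= CE_WARN_THRESHOLD:
            print(f"WARN  {name}: {ce} correctable error(s), schedule DIMM replacement")

if __name__ == "__main__":
    check_edac()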

  • make_adobe_on_Linux!
    replied
    Ryzen PRO mobile CPUs support ECC... So how can we get Lenovo to stop being lazy and support it on their ThinkPad T-series motherboards? I doubt it would cost them extra... And those of us who care can just buy & install ECC RAM ourselves instead of having it included.

  • schmidtbag
    replied
    Originally posted by coder View Post
    Huh? It absolutely does correct bad cells or wires! Again, ECC works the same, regardless of the underlying cause. A bad bit is a bad bit. Period.
    You seem to forget that bit flips occur from more than just damaged hardware. If there is a permanent defect, it's just a matter of time until the parity bits encounter an accidental flip, in which case the bad cells won't be corrected. ECC isn't supposed to compensate for damaged hardware. In this context, it basically just buys you enough time to replace the defective DIMM without suffering catastrophic losses.

  • make_adobe_on_Linux!
    replied
    Maybe Lenovo will actually listen to ThinkPad customers? Would be a breeze because Ryzen PRO mobile CPUs already have ECC support, but Lenovo just needs to stop being lazy:
    https://www.reddit.com/r/thinkpad/co...d_support_ecc/

  • coder
    replied
    Originally posted by schmidtbag View Post
    Once RAM gets a physical defect, the problems aren't going to stop, and at that point ECC is mostly just there for error reporting rather than correction, before bad gets worse.
    Huh? It absolutely does correct bad cells or wires! Again, ECC works the same, regardless of the underlying cause. A bad bit is a bad bit. Period.

    When you have a physical degradation, it both lets you know and buys you time to take corrective action, without suffering data loss or unplanned downtime.

  • coder
    replied
    Originally posted by waxhead View Post
    As for our friends that use operating systems named after the principle of punching a hole in a load bearing wall, then covering that hole with a thin sheet of breakable glass that completely ruins your privacy and allows for easy entry by intruders -
    LOL -- you forgot to mention damaging energy efficiency.

  • schmidtbag
    replied
    Originally posted by coder View Post
    I'm just speaking from my experience that once a DIMM starts experiencing correctable ECC errors, the rate of those errors only increases with time.
    I agree - once RAM starts to degrade, it only gets worse. My point is RAM these days doesn't begin to degrade that often anymore.
    Originally posted by coder View Post
    Industrial applications need greater reliability than "it tends to be stable". There can be significant safety or financial liabilities if the hardware malfunctions. Or we could talk about medical equipment or telecoms (the P-series Atom I linked is aimed at transceiver base stations).
    In such cases, they will use ECC. If it doesn't come with ECC and is bought anyway, that shows either it isn't that critical of a machine or the sysadmin is an idiot.

    Originally posted by Zan Lynx View Post
    That is exactly what RAM does in its most common failure modes. The big server farms like Google have published data on this and they say that the most common indicator of future errors is past errors.

    In other words, as soon as a memory stick starts reporting errors, it often starts to get worse.
    I'm well aware. That doesn't contradict what I was saying. Once RAM gets a physical defect, the problems aren't going to stop, and at that point ECC is mostly just there for error reporting rather than correction, before bad gets worse. My point is RAM doesn't physically degrade so easily anymore. Obviously, it still happens - that's inevitable, especially when we're talking petabytes of it in a big server farm.

  • waxhead
    replied
    Originally posted by coder View Post
    You're right -- a page where there are errors could be blocked, regardless of interleaving. I guess I was mainly thinking about offlining an entire DIMM, which can't be done in an interleaved setup.
    Precisely. The very fact that memory (and CPU) hotplug is even possible is really stunning to me. It is like doing a heart transplant while running a marathon. As for our friends that use operating systems named after the principle of punching a hole in a load bearing wall, then covering that hole with a thin sheet of breakable glass that completely ruins your privacy and allows for easy entry by intruders - well, I think only hot-add of CPU/memory is supported, so you need to take the entire machine offline if something goes bad.
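    As a concrete illustration of that offlining step, here is a minimal sketch against the Linux memory-hotplug sysfs interface. The block number below is just a placeholder; it needs root, a kernel built with memory hot-remove support, and the offline only succeeds if the kernel can migrate every page out of that block.
    Code:
# Minimal sketch: take one memory block offline via the Linux memory-hotplug
# sysfs interface. "memory32" is a placeholder block; requires root and a
# kernel with memory hot-remove support, and fails if pages can't be migrated.
BLOCK = "/sys/devices/system/memory/memory32"

def offline_block(block):
    with open(f"{block}/state") as f:
        print("before:", f.read().strip())      # "online" or "offline"
    try:
        with open(f"{block}/state", "w") as f:
            f.write("offline")                  # kernel tries to migrate pages away
    except OSError as err:
        print("offline failed (unmovable pages?):", err)
    with open(f"{block}/state") as f:
        print("after:", f.read().strip())

if __name__ == "__main__":
    offline_block(BLOCK)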

  • coder
    replied
    Originally posted by schmidtbag View Post
    You act like RAM is known to regularly and steadily degrade over a relatively short period of time.
    I'm just speaking from my experience that once a DIMM starts experiencing correctable ECC errors, the rate of those errors only increases with time.

    We have one server with probably a bad motherboard or CPU memory controller that occasionally reports correctable memory errors and has done so for years. We tried replacing the DIMM, but with no effect. When it gets bad enough, we will replace the server (if it's not replaced for other reasons, before then). However, because only one bit is affected, the errors have all been corrected and it continues to run stably.

    In most cases where the actual DIMM was bad, I've seen it just get worse and worse, until we could replace it. There was a machine used by our QA folks for testing builds and they hadn't been monitoring for ECC errors. Eventually, the machine had become too unstable to use, and when I checked the logs, I found ECC errors on two of their DIMMs and one had gotten so bad that it was routinely generating uncorrectable multi-bit errors. So, there definitely is some degradation over time. That's not to say that you couldn't have some sort of catastrophic failure that suddenly starts causing multi-bit errors from one moment to the next, but the degradation from "perfectly functioning" to the point of having a high error rate often takes some time.

    Originally posted by schmidtbag View Post
    Most of such systems are so low-end with so little processing that they're not handling any especially critical data. They use slow RAM, which tends to be more stable. Of course, that doesn't mean it's failproof.
    Industrial applications need greater reliability than "it tends to be stable". There can be significant safety or financial liabilities if the hardware malfunctions. Or we could talk about medical equipment or telecoms (the P-series Atom I linked is aimed at transceiver base stations).

    Or maybe it's something fairly non-critical, but it's just installed somewhere that's difficult, costly, time-consuming, or dangerous to access. So, you want to maximize reliability simply to reduce the need for physical maintenance.
    Last edited by coder; 11 February 2021, 03:25 AM.
