Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC


  • Originally posted by coder
    Whether it's interleaved determines whether it can be offlined. And by interleaved, I mean that each 64 bits (assuming we're talking about DDR4 DIMMs) alternates which memory channel it goes to. In order to offline a DIMM (or pages of one), you need either a non-interleaved configuration or a memory setup that's at least partially mirrored (e.g. a server CPU where half of the channels are mirrors of the others).

    In other words, you won't find memory hot-plugging supported on any fully-interleaved configuration.
    I think you're forgetting about memory channels and memory banks. With a dual-, triple-, or even quad-channel setup, if one memory module goes bad you simply offline the entire memory bank, i.e. all two, three, or four modules in that bank. Then you replace the broken module and online the bank again. So interleaving does not in itself determine whether memory can be offlined. Do you agree?

    http://www.dirtcellar.net

    • Originally posted by waxhead
      I think you're forgetting about memory channels and memory banks. With a dual-, triple-, or even quad-channel setup, if one memory module goes bad you simply offline the entire memory bank, i.e. all two, three, or four modules in that bank. Then you replace the broken module and online the bank again. So interleaving does not in itself determine whether memory can be offlined. Do you agree?
      You're right -- a page where there are errors could be blocked, regardless of interleaving. I guess I was mainly thinking about offlining an entire DIMM, which can't be done in an interleaved setup.
      Last edited by coder; 10 February 2021, 09:53 PM.
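
To make the distinction above concrete, here is a minimal Python sketch of how addresses might map to channels. The 8-byte (64-bit) interleave granule follows the description in this thread; the page size, DIMM size, and function names are illustrative assumptions, and real memory controllers often interleave at larger granules.

```python
# Sketch: which memory channels back a given page under two layouts.
# All constants are illustrative assumptions, not values from real hardware.

PAGE_SIZE = 4096      # bytes per OS page (typical value, assumed)
GRANULE = 8           # bytes per interleave unit (64 bits, as described above)
CHANNELS = 2          # dual-channel example
DIMM_SIZE = 1 << 30   # 1 GiB per channel (illustrative)


def channel_interleaved(addr: int) -> int:
    """Consecutive 8-byte units alternate between channels."""
    return (addr // GRANULE) % CHANNELS


def channel_linear(addr: int) -> int:
    """Non-interleaved: each channel owns one contiguous address range."""
    return addr // DIMM_SIZE


def channels_touched(page_number: int, mapping) -> set:
    """Return the set of channels that back one page under a mapping."""
    base = page_number * PAGE_SIZE
    return {mapping(base + off) for off in range(0, PAGE_SIZE, GRANULE)}


if __name__ == "__main__":
    print(channels_touched(0, channel_interleaved))  # every page uses both channels
    print(channels_touched(0, channel_linear))       # a page lives on one channel
```

Under the interleaved mapping every page draws on both channels, so pulling one DIMM would take a slice out of every page, while retiring a single bad page works the same way under either layout.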

      • Originally posted by schmidtbag
        You act like RAM is known to regularly and steadily degrade over a relatively short period of time.
        I'm just speaking from my experience that once a DIMM starts experiencing correctable ECC errors, the rate of those errors only increases with time.

        We have one server with probably a bad motherboard or CPU memory controller that occasionally reports correctable memory errors and has done so for years. We tried replacing the DIMM, but with no effect. When it gets bad enough, we will replace the server (if it's not replaced for other reasons, before then). However, because only one bit is affected, the errors have all been corrected and it continues to run stably.

        In most cases where the actual DIMM was bad, I've seen it just get worse and worse, until we could replace it. There was a machine used by our QA folks for testing builds and they hadn't been monitoring for ECC errors. Eventually, the machine had become too unstable to use, and when I checked the logs, I found ECC errors on two of their DIMMs and one had gotten so bad that it was routinely generating uncorrectable multi-bit errors. So, there definitely is some degradation over time. That's not to say that you couldn't have some sort of catastrophic failure that suddenly starts causing multi-bit errors from one moment to the next, but the degradation from "perfectly functioning" to the point of having a high error rate often takes some time.
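
For anyone who wants to catch this kind of trend before the machine becomes unstable, a minimal sketch of polling the Linux EDAC counters follows. It assumes an EDAC driver is loaded and exposes `ce_count` and `ue_count` under `/sys/devices/system/edac/mc/`; the function names and status wording are made up for illustration.

```python
from pathlib import Path

# Standard Linux EDAC sysfs location (only present when an EDAC driver is loaded).
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ecc_counts(root: Path = EDAC_ROOT) -> dict:
    """Return {controller: (correctable, uncorrectable)} from EDAC sysfs."""
    counts = {}
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers without readable counters
        counts[mc.name] = (ce, ue)
    return counts


def report(counts: dict) -> list:
    """Turn raw counters into one human-readable line per controller."""
    lines = []
    for name, (ce, ue) in counts.items():
        if ue:
            status = "UNCORRECTABLE errors seen"
        elif ce:
            status = "correctable errors seen, watch the trend"
        else:
            status = "clean"
        lines.append(f"{name}: ce={ce} ue={ue} -> {status}")
    return lines


if __name__ == "__main__":
    for line in report(read_ecc_counts()) or ["no EDAC memory controllers found"]:
        print(line)
```

Run periodically (from cron, say) and compare against the previous reading: a correctable count that keeps climbing is the early warning described above.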

        Originally posted by schmidtbag
        Most of such systems are so low-end with so little processing that they're not handling any especially critical data. They use slow RAM, which tends to be more stable. Of course, that doesn't mean it's failproof.
        Industrial applications need greater reliability than "it tends to be stable". There can be significant safety or financial liabilities if the hardware malfunctions. Or we could talk about medical equipment or telecoms (the P-series Atom I linked is aimed at transceiver base stations).

        Or maybe it's something fairly non-critical, but it's just installed somewhere that's difficult, costly, time-consuming, or dangerous to access. So, you want to maximize reliability simply to reduce the need for physical maintenance.
        Last edited by coder; 11 February 2021, 03:25 AM.

        • Originally posted by coder
          You're right -- a page where there are errors could be blocked, regardless of interleaving. I guess I was mainly thinking about offlining an entire DIMM, which can't be done in an interleaved setup.
          Precisely. The very fact that memory (and CPU) hotplug is even possible is really stunning to me. It is like doing a heart transplant while running a marathon. As for our friends that use operating systems named after the principle of punching a hole in a load-bearing wall, then covering that hole with a thin sheet of breakable glass that completely ruins your privacy and allows for easy entry by intruders - well, I think only hot-add of CPU/memory is supported there, so you need to take the entire machine offline if something goes bad.

          http://www.dirtcellar.net

          • Originally posted by coder
            I'm just speaking from my experience that once a DIMM starts experiencing correctable ECC errors, the rate of those errors only increases with time.
            I agree - once RAM starts to degrade, it only gets worse. My point is RAM these days doesn't begin to degrade that often anymore.
            Industrial applications need greater reliability than "it tends to be stable". There can be significant safety or financial liabilities if the hardware malfunctions. Or we could talk about medical equipment or telecoms (the P-series Atom I linked is aimed at transceiver base stations).
            In such cases, they will use ECC. If it doesn't come with ECC and is bought anyway, that shows either that it isn't that critical a machine or that the sysadmin is an idiot.

            Originally posted by Zan Lynx
            That is exactly what RAM does in its most common failure modes. The big server farms like Google have published data on this and they say that the most common indicator of future errors is past errors.

            In other words, as soon as a memory stick starts reporting errors, it often starts to get worse.
            I'm well aware. That doesn't contradict what I was saying. Once RAM gets a physical defect, the problems aren't going to stop and ECC at that point is mostly just there for error reporting rather than correction, before bad gets worse. My point is RAM doesn't physically degrade so easily anymore. Obviously, it still happens - that's inevitable, especially when we're talking petabytes of it in a big server farm.

            • Originally posted by waxhead
              As for our friends that use operating systems named after the principle of punching a hole in a load-bearing wall, then covering that hole with a thin sheet of breakable glass that completely ruins your privacy and allows for easy entry by intruders -
              LOL -- you forgot to mention damaging energy efficiency.

              • Originally posted by schmidtbag
                Once RAM gets a physical defect, the problems aren't going to stop and ECC at that point is mostly just there for error reporting rather than correction, before bad gets worse.
                Huh? It absolutely does correct bad cells or wires! Again, ECC works the same, regardless of the underlying cause. A bad bit is a bad bit. Period.

                When you have a physical degradation, it both lets you know and buys you time to take corrective action, without suffering data loss or unplanned downtime.

                • Maybe Lenovo will actually listen to ThinkPad customers? Would be a breeze because Ryzen PRO mobile CPUs already have ECC support, but Lenovo just needs to stop being lazy:

                  • Originally posted by coder
                    Huh? It absolutely does correct bad cells or wires! Again, ECC works the same, regardless of the underlying cause. A bad bit is a bad bit. Period.
                    You seem to forget that bit flips occur from more than just damaged hardware. If there is a permanent defect, it's just a matter of time until the parity bits will encounter an accidental flip, in which case the bad cells won't be corrected. ECC isn't supposed to compensate for damaged hardware. In this context, it basically just buys you enough time to replace the defective DIMM without suffering catastrophic losses.
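
Both sides of this exchange can be seen in a toy SECDED (single-error-correct, double-error-detect) code. The sketch below uses an extended Hamming (8,4) code purely for illustration; real ECC DIMMs typically protect 64 data bits with 8 check bits, and the stuck-at cell position chosen here is an arbitrary assumption.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) word.

    Index 0 holds overall parity; indices 1..7 are Hamming(7,4) positions,
    with check bits at 1, 2, 4 and data bits at 3, 5, 6, 7.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    w = [0] * 8
    w[3], w[5], w[6], w[7] = d
    w[1] = w[3] ^ w[5] ^ w[7]
    w[2] = w[3] ^ w[6] ^ w[7]
    w[4] = w[5] ^ w[6] ^ w[7]
    w[0] = sum(w[1:]) % 2  # make the total number of 1 bits even
    return w


def decode(word: list):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    w = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if w[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    odd_parity = sum(w) % 2
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        w[syndrome] ^= 1  # single flip; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return "uncorrectable", None  # two flips: detected, not fixable
    return status, w[3] | (w[5] << 1) | (w[6] << 2) | (w[7] << 3)


if __name__ == "__main__":
    stored = encode(0b0000)
    stored[5] = 1              # a defective cell stuck at 1 (assumed position)
    print(decode(stored))      # ('corrected', 0)
    stored[2] ^= 1             # plus one transient flip in the same word
    print(decode(stored))      # ('uncorrectable', None)
```

The decoder neither knows nor cares why a bit is wrong, so a stuck cell is corrected on every read just like a transient flip. The catch is that the stuck cell uses up the single-error budget for that word, so one additional flip in the same word is detected but can no longer be corrected.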

                    • Ryzen PRO mobile CPUs support ECC... So how can we get Lenovo to stop being lazy and support it on their ThinkPad T-series motherboards? I doubt it would cost them extra... And those of us who care can just buy & install ECC instead of it being included.
