Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC


  • #11
    Hope he used the proper pronouns. THIMM/DIMM/SHIM.

    Comment


    • #12
      Originally posted by waxhead

      Absolutely, but remember that Linux, contrary to other popular OSes, actually tries to use your memory for something useful. After all, if you have a system with 16GB of RAM, would you not rather want it to be used, even if your own use typically occupies no more than 2-3GB? As you stop and start programs, they may be allocated to different ranges in the 16GB pool each time, eventually making your program hit one of those bad spots. But of course, if you have some ridiculous amount of RAM, it will take longer to hit the bad spot.

      While I agree with you that a dropped cache may remove the bad effect of the bitflip, you must not forget that there is still a bitflip-sensitive area somewhere in your 16GB pool. With non-ECC memory you may never detect this other than by suddenly finding a corrupted file (with BTRFS/ZFS you at least know which file it is). If you have a large 8GB video file you might not notice a bitflip at all other than an odd pixel artifact, but if you have some music you may hear a "pop" or a small burst of noise. Does it matter? For the average Joe, maybe not so much, but yes, the corruption is still there and your system is flaky. It is pure luck whether "disposable" data such as video or music is stored in the bad location, or whether actual program logic is overwritten, which may further corrupt your system.
      As if debugging program logic was not hard enough, a bitflip may mislead the programmer into adjusting logic that in the worst case is completely unrelated.

      One way to work around the lack of ECC would be a background task, e.g. a kernel driver, that does the same job as Memtest86+: it runs on idle or reserves a very small amount of CPU time (2-3%) and slowly but surely walks the memory, continuously. If it finds a bad spot, it could blacklist the memory area or memory module, building a "bad block list" over time. That would be a poor man's ECC, i.e. software memory scrubbing. It would be interesting if something like this existed, as I am sure it would expose how big the problem probably is, and it might build a strong case for Torvalds's call for more ECC memory.
      No disagreements whatsoever - let me just add a few things:

      * Cache hits tend to be asymmetric in most applications. As in, a few hot areas account for the vast majority of hits. However, caching is a brute force approach - the cache (at whatever level) doesn't know ahead of time which blocks to keep, so it keeps them all (with some form of LRU replacement) ... and then one in a thousand blocks is a winner (more or less).

      * Yes, from what I've heard, ZFS is far more memory-intensive (it uses a lot of RAM, and that data doesn't eventually go dead, unlike most of the stuff in a cache). I don't have direct experience with ZFS, so anyone should correct me if I'm wrong.

      * Treating your customers like idiots who don't need reliability is a really bad idea in the long term. And yes, kudos to Linus for calling Intel on it.
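
To illustrate the first point about asymmetric hits, here is a minimal Python sketch of an LRU cache fed by a skewed access pattern. Everything in it is an assumption made up for illustration (the `LRUCache` class, the `simulate` helper, the 90% hot fraction and the block counts); it is not how any particular kernel or filesystem cache is implemented.

```python
from collections import OrderedDict
import random


class LRUCache:
    """Toy block cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def access(self, block_id):
        if block_id in self.blocks:
            # Hit: mark the block as most recently used.
            self.blocks.move_to_end(block_id)
            self.hits += 1
        else:
            # Miss: the cache cannot know whether this block will be
            # hot later, so it keeps it and evicts the oldest entry.
            self.misses += 1
            self.blocks[block_id] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)


def simulate(accesses=10000, hot_blocks=10, cold_blocks=10000,
             hot_fraction=0.9, capacity=100, seed=0):
    """Hypothetical workload: most accesses go to a few hot blocks."""
    rng = random.Random(seed)
    cache = LRUCache(capacity)
    for _ in range(accesses):
        if rng.random() < hot_fraction:
            block = rng.randrange(hot_blocks)
        else:
            block = hot_blocks + rng.randrange(cold_blocks)
        cache.access(block)
    return cache


if __name__ == "__main__":
    cache = simulate()
    total = cache.hits + cache.misses
    print(f"hit rate: {cache.hits / total:.2%}")
```

Under these assumed numbers nearly all hits come from the ten hot blocks, while most of the cached cold blocks are evicted without ever being touched again.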

      Comment


      • #13
        Totally support Linus's sentiment, although it is not fully correct. AMD officially supports (validates) ECC RAM ONLY on their PRO processors. ECC is enabled on non-PRO processors, but validation and implementation are left to the motherboard vendor. So ECC memory will most likely work on a non-PRO processor, but it is not clear whether it will work in ECC mode and how it will report errors. You can watch Wendel for the current state of ECC on Ryzen - in short, it is messy.
        You can argue that Ryzen's "support" of ECC can give a false sense of security to some users....
        Anyway, whoever needs ECC is most likely to buy a solution validated by the vendor, which means Ryzen PRO and Xeon CPUs, so there is not a big difference here. So to be precise, instead of "AMD did it", I think the correct statement is more like "AMD raised a valid point, made some noise and scored some marketing points for ECC RAM support".
        Last edited by Ivan Dimitrov; 03 January 2021, 07:37 PM.

        Comment


        • #14
          Originally posted by Ivan Dimitrov
          Totally support Linus's sentiment, although it is not fully correct. AMD officially supports (validates) ECC RAM ONLY on their PRO processors. ECC is enabled on non-PRO processors, but validation and implementation are left to the motherboard vendor. So ECC memory will most likely work on a non-PRO processor, but it is not clear whether it will work in ECC mode and how it will report errors. You can watch Wendel for the current state of ECC on Ryzen - in short, it is messy.
          You can argue that Ryzen's "support" of ECC can give a false sense of security to some users....
          Anyway, whoever needs ECC is most likely to buy a solution validated by the vendor, which means Ryzen PRO and Xeon CPUs, so there is not a big difference here. So to be precise, instead of "AMD did it", I think the correct statement is more like "AMD raised a valid point, made some noise and scored some marketing points for ECC RAM support".
          The video you linked is about overclocking ECC memory ... did you mean to include a different video?

          Comment


          • #15
            -- The "modern DRAM is so reliable that it doesn't need ECC"

            How dare they lie with a straight face?
            DRAM is much, much less reliable than even five years ago. MemTest is mandatory nowadays even if you're not doing anything special.

            I purchased DRAM modules ~4 years ago and again last year. The older ones run quite happily in my semi-server (without ECC, because I misread the spec), but one of the newer modules can't even pass MemTest for 1 minute.
            Another module, which my colleague bought last year, was also broken from the beginning.

            A funny fact about modern DDR4 is that most modules are DDR4-2133 under the JEDEC spec (1.2V), and people simply overclock them to 3200+ at 1.35V under the obscure name "Intel XMP".
            The desktop DRAM market is not trustworthy anymore when overclocking becomes the "new normal".
            Last edited by zxy_thf; 03 January 2021, 06:35 PM.

            Comment


            • #16
              AMD's unofficial support on consumer platforms is great, but they need to make it official and have some kind of pathway for motherboard vendors to officially close the support loop. And Intel needs to do the same.

              The transition to DDR5 is an excellent time to get it done in a way that's simple to communicate to consumers.

              Comment


              • #17
                Good to see Linus back.
                And yes, I agree, when he "loses it" he's often right. I mean you can't blame him for being pissed off by repeated misbehaviour.

                It's also interesting what Ian Cutress adds in that discussion.


                I'd also be glad to see more "official" ECC support. Especially as capacities, speeds and possibly other factors increase, it would be a good idea to have this stuff officially supported and advertised, for all sorts of machines, big boxes and embedded ones (it can be absolutely crucial in industrial control). I can easily live with a small performance hit if I can then be sure my system won't mess up because of some random bit flip.
                Stop TCPA, stupid software patents and corrupt politicians!

                Comment


                • #18
                  I'm surprised that you talk about "bad spots". Isn't ECC's main job to detect and correct bits that were accidentally flipped by external sources, like a stray cosmic ray or radiation from the plastic covering the chip? In fact, Torvalds talks about row-hammer, a bug found in DRAM that can make a bit flip just by reading adjacent rows very quickly, and asks how many accidental bit flips happen because a program triggers that by accident during its normal operation... That has nothing to do with "bad spots" or "bad bits".
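
As a toy illustration of that detect-and-correct job, here is a minimal Python sketch of a Hamming(7,4) code that repairs one flipped bit. The function names are made up for this example, and it is only a small-scale stand-in: as far as I know, ECC DIMMs typically use a wider code (64 data bits plus 8 check bits) that can also detect double-bit errors, which this sketch cannot do.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (0..15) into 7 bits, returned for positions 1..7."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8  # index 0 unused so indices match bit positions
    code[3], code[5], code[6], code[7] = d
    # Each parity bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_decode(bits):
    """Return (nibble, syndrome); syndrome is the corrected position or 0."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1  # a single flipped bit sits at this position
    d = [code[3], code[5], code[6], code[7]]
    nibble = sum(bit << i for i, bit in enumerate(d))
    return nibble, syndrome


if __name__ == "__main__":
    word = hamming74_encode(0b1011)
    word[4] ^= 1  # simulate one bit flipped in memory (position 5)
    print(hamming74_decode(word))
```

A flip at any single position, whether caused by a cosmic ray or by row-hammer, yields a nonzero syndrome that names the position, so the original data comes back intact. Two flips in the same word would be miscorrected by this small code.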
                  Last edited by rastersoft; 03 January 2021, 07:34 PM.

                  Comment


                  • #19
                    Linus argues that error-correcting code (ECC) memory "absolutely matters" but that "Intel has been instrumental in killing the whole ECC industry with it's horribly bad market segmentation... Intel has been detrimental to the whole industry and to users because of their bad and misguided policies wrt ECC."

                    Hallelujah! Sing it, Linus!

                    Comment


                    • #20
                      Originally posted by waxhead
                      One way to work around the lack of ECC would be a background task, e.g. a kernel driver, that does the same job as Memtest86+: it runs on idle or reserves a very small amount of CPU time (2-3%) and slowly but surely walks the memory, continuously. If it finds a bad spot, it could blacklist the memory area or memory module, building a "bad block list" over time. That would be a poor man's ECC, i.e. software memory scrubbing. It would be interesting if something like this existed, as I am sure it would expose how big the problem probably is, and it might build a strong case for Torvalds's call for more ECC memory.
                      I had the same idea, and apparently so did others:
                      I haven't personally tried either, always opting for ECC RAM in my hardware. I'm surprised it's not more popular.
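
For anyone curious what the pattern-test part of the quoted idea looks like, here is a rough userspace sketch in Python. It is only an illustration under heavy assumptions: `fill`, `verify`, `scrub` and the `disturb` hook are hypothetical names, the offsets are positions in a Python buffer rather than physical addresses, and a real scrubber would have to live in the kernel and could only overwrite memory that is not in use.

```python
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # all zeros, all ones, alternating bits


def fill(buffer, pattern):
    """Write the same byte pattern over the whole buffer, in place."""
    buffer[:] = bytes([pattern]) * len(buffer)


def verify(buffer, pattern):
    """Return the offsets whose contents no longer match the pattern."""
    return [i for i, byte in enumerate(buffer) if byte != pattern]


def scrub(buffer, patterns=PATTERNS, blacklist=None, disturb=None):
    """One scrubbing pass; returns the accumulated set of bad offsets.

    `disturb` is a test hook (an assumption of this sketch) that is called
    between the write and the read-back to simulate a bit flip.
    """
    blacklist = set() if blacklist is None else blacklist
    for pattern in patterns:
        fill(buffer, pattern)
        if disturb is not None:
            disturb(buffer)
        blacklist.update(verify(buffer, pattern))
    return blacklist


if __name__ == "__main__":
    memory = bytearray(1 << 20)  # 1 MiB stand-in for a region of RAM

    def flip_one_bit(buf):
        buf[12345] ^= 0x04  # simulate a flaky cell at one offset

    print(sorted(scrub(memory, disturb=flip_one_bit)))
```

Passing the same `blacklist` set into repeated passes gives the "bad block list" built over time that the quoted post describes.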

                      Comment
