Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC

    Phoronix: Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC

    There's nothing quite like some fun holiday-weekend reading in the form of a fiery mailing list post by Linus Torvalds. The Linux creator is out with one of his classic messages, this time arguing for the importance of ECC memory and giving his opinion on how Intel's "bad policies" and market segmentation have made ECC memory less widespread...

  • #2
    Better late than never.

    • #3
      In a DRAM, the difference between a one and a zero is roughly a couple thousand electrons in a tiny capacitor (order of femtofarads) ... that discharges in a couple of ms.

      Torvalds is 101% right about this one.

      From my personal experience, the reason we don't see their effects as much is twofold:

      1. Most of the memory won't ever be used. So applications may hold gigs of data in RAM, but they won't use most of it; when that memory is discarded, the bit flip goes away as well.

      2. Software error rates are still higher (so misattribution).


      Edit: #1 is strongly linked to the principles that make microprocessor caching work so well - the working set of an application is typically pretty small, much, much smaller than the RSS. So a ~128KiB L1 cache makes a lot of sense (coupled with a larger L2 cache and a considerably larger L3 cache). The L1/L2 caches, mind you, are static RAM and ECC-protected.
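
      As a rough illustration of that working-set-vs-RSS gap (my own sketch, nothing official): on Linux you can clear the per-page "referenced" bits via /proc/self/clear_refs and then read the Referenced: field of /proc/self/smaps_rollup to see how much of the resident set actually got touched afterwards. Something like:

      Code:
      /* Sketch: compare RSS with the pages actually touched recently.
       * Assumes Linux with /proc/self/clear_refs and /proc/self/smaps_rollup. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      static long read_kb(const char *key)
      {
          FILE *f = fopen("/proc/self/smaps_rollup", "r");
          char line[256];
          long kb = -1;

          if (!f)
              return -1;
          while (fgets(line, sizeof(line), f))
              if (!strncmp(line, key, strlen(key)))
                  sscanf(line + strlen(key), "%ld", &kb);
          fclose(f);
          return kb;
      }

      int main(void)
      {
          size_t big = 512UL << 20;           /* keep 512 MiB resident...        */
          char *buf = malloc(big);
          FILE *cr;

          if (!buf)
              return 1;
          memset(buf, 1, big);                /* ...by touching every page once  */

          cr = fopen("/proc/self/clear_refs", "w");
          if (cr) {
              fputs("1", cr);                 /* clear the "referenced" bits     */
              fclose(cr);
          }

          for (size_t i = 0; i < (4UL << 20); i += 4096)
              buf[i]++;                       /* now touch only ~4 MiB of it     */

          printf("Rss:        %ld kB\n", read_kb("Rss:"));
          printf("Referenced: %ld kB\n", read_kb("Referenced:"));
          free(buf);
          return 0;
      }

      On my understanding, Rss stays near the full 512 MiB while Referenced drops to a few MiB plus whatever libc and the stack touched - the same effect that makes small caches so effective.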
      Last edited by vladpetric; 05 January 2021, 03:26 PM.

      • #4
        Those random single-bit errors become easier to manifest as the process scales down. What's worse - the likelihood of introducing a non-repairable multi-bit error increases as well.

        But aside from that, ECC is also quite helpful for identifying failing RAM sticks in time. Those can produce errors without visible symptoms for days, even weeks, before the failure becomes apparent. You can end up with tons of bad data that is incredibly difficult to detect and repair.
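
        A small sketch of how to actually watch for that on Linux (my own illustration; it assumes an EDAC driver for your memory controller is loaded, otherwise the sysfs files simply won't be there): the kernel exposes per-controller counts of corrected and uncorrected ECC errors under /sys/devices/system/edac/.

        Code:
        /* Sketch: dump the ECC error counters exposed by the Linux EDAC subsystem.
         * Assumes an EDAC driver for the platform's memory controller is loaded. */
        #include <glob.h>
        #include <stdio.h>

        static long read_count(const char *path)
        {
            long n = -1;
            FILE *f = fopen(path, "r");
            if (f) {
                fscanf(f, "%ld", &n);
                fclose(f);
            }
            return n;
        }

        int main(void)
        {
            glob_t g;
            if (glob("/sys/devices/system/edac/mc/mc*", 0, NULL, &g) != 0) {
                puts("No EDAC memory controllers found (driver not loaded?)");
                return 1;
            }
            for (size_t i = 0; i < g.gl_pathc; i++) {
                char path[512];
                /* ce_count = corrected errors, ue_count = uncorrected errors */
                snprintf(path, sizeof(path), "%s/ce_count", g.gl_pathv[i]);
                long ce = read_count(path);
                snprintf(path, sizeof(path), "%s/ue_count", g.gl_pathv[i]);
                long ue = read_count(path);
                printf("%s: corrected=%ld uncorrected=%ld\n", g.gl_pathv[i], ce, ue);
            }
            globfree(&g);
            return 0;
        }

        Tools like rasdaemon do this job properly, but a corrected-error count that keeps creeping up is exactly the early warning of a failing stick described above.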

        • #5
          Finally! Linus' amazing rants are back! (to some degree)

          • #6
            Originally posted by tildearrow View Post
            Finally! Linus' amazing rants are back! (to some degree)
            Remarkably, I think that the rants only happen when he's absolutely right (ok, maybe 99% of the time).

            It's not that he's never wrong (he absolutely can be, and parts of the Linux architecture are a testament to that). I'm saying that when he "loses" it, he is almost always right.

            • #7
              Will 2021 bring the good old Linus back?
              I sure hope so but I am not so optimistic that a single rant is strong enough evidence that Torvalds is back...

              • #8
                Wonder if Intel will use their political tools, like the $300 million they gave to postmodern feminists, to "cancel" Torvalds if he keeps blasting Intel with the truth about their harmful anti-consumer practices - and let's not even get into the anti-competitive practices Intel has employed. I can totally see Intel looking at that comment by Torvalds and going "Oh no, looks like someone needs another reeducation, I mean soft skills".

                • #9
                  Originally posted by vladpetric View Post
                  In a DRAM, the difference between a one and a zero is roughly a couple thousand electrons in a tiny capacitor (order of femtofarads) ... that discharges in a couple of ms.

                  Torvalds is 101% right about this one.

                  From my personal experience, the reason we don't see their effects as much is twofold:

                  1. Most of the memory won't ever be used. So applications may hold gigs of data in RAM, but they won't use most of it; when that memory is discarded, the bit flip goes away as well.

                  2. Software error rates are still higher (so misattribution).
                  Absolutely, but remember that Linux, unlike other popular OSes, actually tries to use your memory for something useful. After all, if you have a system with 16GB of RAM, would you not rather have it be used, even if your own workload typically occupies no more than 2-3GB? As you stop and start programs they may be allocated to different ranges of that 16GB pool each time, so eventually a program will land on one of those bad spots. Of course, if you have some ridiculous amount of RAM it will take longer to hit the bad spot.

                  While I agree with you that a dropped cache may remove the bad effect of the bit flip, you must not forget that there is still a bit-flip-sensitive area somewhere in your 16GB pool. With non-ECC memory you may never detect this other than by suddenly finding a corrupted file (with BTRFS/ZFS you at least know which file it is). In a large 8GB video file you might not notice a bit flip at all beyond an odd pixel artifact, and in music you may hear a "pop" or a short burst of noise. Does it matter? For the average Joe, maybe not so much, but yes, the corruption is still there and your system is flaky. It is pure luck whether "disposable" data such as video or music sits in the bad location, or whether actual program logic gets overwritten there and corrupts your system further.
                  As if debugging program logic was not hard enough, a bit flip may mislead the programmer into adjusting logic that, in the worst case, is completely unrelated.

                  One way to work around the lack of ECC would be a background task, e.g. a kernel driver, that does the same job as Memtest86+: it runs when the system is idle, or reserves a very small amount of CPU time (2-3%), and slowly but surely walks all of memory, continuously. If it finds a bad spot it could blacklist that memory area or module, building a "bad block list" over time. That would be a poor man's ECC - software memory scrubbing - but it would be interesting if something like this existed; I am sure it would expose how big the problem really is, and it might build a strong case for Torvalds's call for more ECC memory.
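
                  A minimal userspace sketch of that idea (my own illustration - as far as I know no such driver exists): lock a buffer, fill it with a known pattern, and keep re-reading it on a slow cycle, logging any word that no longer matches. A real scrubber would have to walk physical memory from kernel space instead of its own allocation, but the core loop would look something like this:

                  Code:
                  /* Toy software "scrubber": watch a locked buffer for spontaneous bit flips.
                   * Illustration only - it can only see flips inside its own allocation. */
                  #include <stdint.h>
                  #include <stdio.h>
                  #include <stdlib.h>
                  #include <sys/mman.h>
                  #include <unistd.h>

                  #define SCRUB_BYTES (64UL << 20)            /* watch 64 MiB in this toy  */
                  #define PATTERN     0xA5A5A5A5A5A5A5A5ULL   /* alternating-bit pattern   */

                  int main(void)
                  {
                      size_t words = SCRUB_BYTES / sizeof(uint64_t);
                      uint64_t *buf = malloc(SCRUB_BYTES);

                      if (!buf)
                          return 1;
                      /* keep it resident; may need a raised RLIMIT_MEMLOCK */
                      mlock(buf, SCRUB_BYTES);
                      for (size_t i = 0; i < words; i++)
                          buf[i] = PATTERN;

                      for (;;) {
                          for (size_t i = 0; i < words; i++) {
                              if (buf[i] != PATTERN) {
                                  fprintf(stderr, "bit flip at word %zu: %#llx\n",
                                          i, (unsigned long long)buf[i]);
                                  buf[i] = PATTERN;           /* "repair" and keep going   */
                              }
                          }
                          sleep(60);                          /* keep the CPU cost tiny    */
                      }
                  }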


                  http://www.dirtcellar.net

                  • #10
                    Keep it up Linus, we love the sweary rants and they provide a wonderful antithesis to these politically correct snowflake times.
