
Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC


  • "Rant on, Rant on,

    What you got they can't deny it,

    Can't sell it, or buy it."

    With apologies to U2.


    • I agree with Torvalds. Intel is a menace, and so are its bug-ridden CPUs.
      Last edited by Azrael5; 06 January 2021, 09:48 AM.


      • For me the difference is between running a process once, or running it twice and hoping for the same result. I can rerun a pass once a month and still get a single-bit error from memory.

        A trick I found is to chunk the data to fit in one of the ECC- or parity-checked areas, then check the input against the original... Even this sucks...
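        That chunk-and-verify idea can be sketched in a few lines of Python. This is only an illustration of the scheme described above; the chunk size and helper names are made up, and it assumes you still hold (or can re-read) the original data to compare against:

```python
import hashlib

CHUNK = 4096  # hypothetical chunk size; pick one that fits the checked area

def chunk_digests(data: bytes) -> list[bytes]:
    """Hash each fixed-size chunk so later corruption can be localized."""
    return [hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]

def find_corrupt_chunks(data: bytes, expected: list[bytes]) -> list[int]:
    """Re-hash the data and return indices of chunks that no longer match."""
    return [i for i, d in enumerate(chunk_digests(data)) if d != expected[i]]
```

        Only the mismatching chunks then need to be re-read or recomputed, rather than the whole pass.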


        • Originally posted by BingoNightly View Post

          Tell that to my DCO guys, they replace memory and CPUs daily.
          So, somebody is doing a poor job. How long is the renewal cycle? A decade?


          • Intel has been a basket case for some time. It has lagged badly in many areas, such as process node. It pushed impractical architectures without a solid basis, such as Itanium, which never would have worked well. It neglected mobile. There is a bubble-up effect with CPUs where today's consumer CPU ends up being tomorrow's server CPU; x86 started as a cheap desktop CPU and then moved into servers, for instance. History has shown that selling pricey premium server gear alone doesn't work once people figure out they can save a lot of money by buying and load-balancing inexpensive consumer hardware. Intel seems to be repeating the mistakes of companies like DEC and IBM, and of failed architectures such as MIPS, SPARC, and Alpha, in becoming seen as an expensive, high-end premium product while letting ARM own the value segment. History shows that doesn't work; people figure out how to use inexpensive hardware in servers.


            • Originally posted by pal666 View Post
              it is clear if you look at motherboard specs
              It is not. Some companies like Gigabyte claim their motherboards support "ECC memory", but consumer-grade boards don't actually support validation or correction, which only adds to the fragmentation. As far as I know, only Asus supports proper validation/correction, and only on some boards; everyone else is a "no". ASRock, I think, only claims ECC correction with those "pro" processors.
              Last edited by piotrj3; 04 January 2021, 02:22 PM.


              • Originally posted by magallanes View Post

                It is exactly what I said: "ECC does not automatically correct memories." It can correct a wrong read, but not ensure the information is stored correctly.

                But even more. Let's say we have a faulty module:

                ECC = crashes, and maybe an error is logged.
                NO-ECC = crashes anyway

                So why do we even need ECC memory for simple usage? 🤷‍♂️
                Wrong. ECC memory is based on Hamming codes, Reed–Solomon codes, or other error-correcting codes. The principle is that a single-bit error is automatically corrected, a 2-bit error can be detected but not fixed, and with 3 or more flipped bits the corruption may go undetected.

                When your computer crashes, it is mostly because just one bit went wrong: a pointer now leads to memory it can't access, and a crash happens. With a single bit wrong, ECC memory still runs fine (with some additional latency for the correction); when it detects 2 errors, it refuses to read the corrupted data, preventing you from using wrong data. So ECC memory more or less ensures the correctness of what you read, because you can say almost for sure that a first 2-bit error will crash the computer before a 3-bit error makes undetected changes. It also decreases the chance of a crash by fixing single-bit errors. So it makes sense on a server that needs to run non-stop for months or even years, where potential downtime costs a lot more than ECC memory does.
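                The single-bit-correct behavior described above is easiest to see in a toy Hamming(7,4) code. Real DIMMs use a wider 72/64 SECDED arrangement, but the mechanism is the same: the parity-check syndrome directly names the position of the flipped bit. A minimal Python sketch:

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # bit positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Return (corrected data bits, error position); position 0 means no error."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity check over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity check over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity check over positions 4,5,6,7
    pos = s1 + 2 * s2 + 4 * s3       # syndrome = 1-based position of the bad bit
    c = c[:]
    if pos:
        c[pos - 1] ^= 1              # single-bit error: flip it back
    return [c[2], c[4], c[5], c[6]], pos
```

                Flipping any one bit of a codeword yields a non-zero syndrome equal to that bit's position. A second simultaneous flip makes the syndrome point at the wrong bit, which is why Hamming codes are extended with an overall parity bit (SECDED) to detect, rather than miscorrect, double errors.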

                It also makes rowhammer attacks harder to exploit, though "harder", not impossible. DDR4 memory also refreshes adjacent rows when a row is activated repeatedly (target-row refresh), which raises the bar further; it has been proven that DDR4 can still be rowhammered, but it is more difficult. That said, mounting a rowhammer attack against memory with row refresh (which most is) plus ECC is genuinely hard. Another trick to make rowhammer harder is to decrease the refresh interval (the default is 64 ms, I think).

                Anyway, it is 1000 times more likely your data was corrupted by faulty hardware (your fault for not making sure everything works right with memtest etc.). I mean, come on: most electronics have RMA rates of 2–5%, which means you have probably owned something in your lifetime that should have been returned. ECC makes sense in environments with higher radiation, or with software that must guarantee stability for a client, but there you often don't use a GUI and you have much different update policies etc. Unless you live in the high mountains, near a uranium mine, or use a computer close to a nuclear reactor, I doubt you will ever need ECC memory. If what you do is extremely high priority, first ensure you use stable software, and only then add ECC.
                Last edited by piotrj3; 04 January 2021, 02:42 PM.


                • What probably set Linus off is that the new RAM sticks require a more complex controller on the DIMM that could just as well do ECC at that BoM, so the only thing keeping non-ECC DIMMs in demand is market-segmentation strategy from the likes of Intel.


                  • Originally posted by schmidtbag View Post
                    As I mentioned earlier, you checksum against the source data. Then you don't face such problems.
                    This is a topic of ECC on memory, not disk corruption. ECC is the actual checksum for RAM that you keep promoting.
                    Stop thinking of memory data as files. Your file is organized data, but in memory it is spread all over. And guess what: your checksum routine is executed from RAM, over contents in RAM. That's how CPUs work.
                    If you keep data in RAM for a long enough time, it can get corrupted (a bit flipped). Then you compute a checksum over corrupted data and write that to disk, make a backup, etc., all from corrupted data.
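                    That ordering problem can be demonstrated in a few lines: if a bit flips while the data sits in RAM, a checksum computed afterwards covers the already-corrupted bytes and verifies perfectly. A small Python illustration (not a memory simulation, just the ordering):

```python
import hashlib

original = b"important payload"
in_ram = bytearray(original)
in_ram[3] ^= 0x01          # a bit flips while the data sits in RAM

# The checksum is computed *after* the flip, so it covers corrupted bytes.
stored_sum = hashlib.sha256(bytes(in_ram)).hexdigest()

# Later verification passes: the data is self-consistent...
assert hashlib.sha256(bytes(in_ram)).hexdigest() == stored_sum
# ...but it is not what we meant to save.
assert bytes(in_ram) != original
```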
                    Originally posted by schmidtbag View Post
                    Trends seem to suggest that modules are becoming more reliable. This could be due to better engineering or better software-level error detection, but it could also be that in modern times, memory usage and availability is so vast that the probability of a major failure plummets. Remember, for the average PC, the vast amount of memory used is in software and assets rather than personal data. Although a single flipped bit in a 64-bit register could result in an astronomically big difference in a variable, the probability of that 1 bit affecting your workload is preposterously small.
                    When it comes to servers and workstations, it's the exact opposite: most of the data is personal/critical and therefore the probability of a flipped bit causing a problem is significant. That's why you'd be a fool for not using ECC in such situations.
                    The difference between servers/workstations and regular PCs: regular PCs are intended to be restarted daily, and if not, Windows symptoms will make you restart at least monthly. Home/small-business PCs also run small workloads, where a compounding error can be caught by a human and the work redone, or will just cause crashes; they also tend not to sit in high-EMF areas.
                    On workstations, the cost of redoing work is much larger, so you pay for the extra insurance.
                    Servers, on the other hand, with all the power fluctuation in a data center, probably see a much higher bit-flip rate, so ECC offers not only correction but also reporting of the correction rate, letting you plan a server replacement before an uncorrectable error occurs. They also tend to keep content in memory for months, if not years (e.g. the kernel).
                    Originally posted by schmidtbag View Post
                    ECC isn't a miracle worker. It has a high chance of recovery, but not 100%. It's meant to recover from a flipped bit here and there, but in the situation you described, it could be entire bytes that were erroneous. And, it's possible those bits/bytes physically can't be corrected. If your modules are physically defective, there's not much ECC can do to protect you, at least not without some severe performance penalties.
                    Actually, I see you did not diagnose with memtest and did not read what I said.
                    Memtest tells you what pattern it expects and what it found... and I had exactly 2 addresses with a 1-bit difference each within 5 seconds of the test, and no other error during the following hour (when I stopped it). ECC would have spared me the headache of wondering why I had to press reset after each power-on of the day.

                    And to get back to the cost:
                    The consumer cost of implementing ECC is just in the memory module: as I said, +12.5% capacity, for which you pay 100% more because of the low production volume of such modules (a premium that could come down massively if ECC were widely implemented). The CPU already has it (but disabled in i5/i7/i9, which is Linus's main point), the memory slot has pins for it, and the motherboard cost is negligible for 16 more traces (dual channel) between CPU and RAM slots (+$1??). The memory controller in the CPU is what does all of this, even for reads/writes coming from PCI(e), and it requires a single extra clock cycle to do any correction (for a complete burst access). Oh... and CPU caches already have ECC enabled.
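                    The +12.5% figure follows directly from the DIMM geometry: a non-ECC DDR4 rank is 64 data bits wide, and the ECC variant adds 8 SECDED check bits for a 72-bit bus, i.e. one extra x8 DRAM chip per eight. A quick sanity check of the arithmetic:

```python
data_bits = 64   # width of a non-ECC DDR4 rank
ecc_bits = 8     # extra SECDED check bits on an ECC DIMM (72-bit bus)

overhead = ecc_bits / data_bits            # extra DRAM capacity required
print(f"ECC capacity overhead: {overhead:.1%}")   # 12.5%
```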


                    • Originally posted by pal666 View Post
                      i want to use ecc memory, but it's significantly slower, so i don't use it
                      Do you mean it's not available in the rated speed that you desire, or that memory of the same rated speed performs worse?

                      If the former, then the problem is limited to UDIMMs (i.e. unbuffered memory), not that it would help you much. Once you switch to RDIMMs, the speeds go back up, though there's a slight penalty due to the buffering built into those DIMMs. However, RDIMMs are only supported in some Intel workstations (and all servers). You won't find them supported by mainstream desktop boards.

                      If you mean the latter, please provide a source for this claim.
                      Last edited by coder; 04 January 2021, 03:25 PM.