Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC

  • Soo.. With that in mind, performance UDIMM for AM4 platform..
    Kingston seems to have 3200 MHz UDIMMs available, but the question is: will they work on X570 boards? (Or Intel Xeon boards, for that matter?) The SKU for the part is KSM32ED8K2/64ME; I'm not going to link the website since I don't know Phoronix's rules regarding links.

    How would one go about checking whether ECC is indeed working as intended?
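One place to look on Linux, sketched under assumptions: when the kernel's EDAC driver for the platform is loaded, each memory controller appears under /sys/devices/system/edac/mc/ with a corrected-error counter (ce_count) and an uncorrected-error counter (ue_count). The helper name read_edac_counts below is made up for illustration. An empty result usually suggests ECC reporting is not active; zero counts only mean no errors have been logged so far, not that correction has been proven to work.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} read from EDAC sysfs.

    Hypothetical helper: an empty dict means no EDAC memory controller
    is registered (driver not loaded, ECC not active, or not Linux).
    """
    root = Path(edac_root)
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found; ECC reporting is probably not active.")
    for name, (ce, ue) in found.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A corrected count that rises over time (for example while deliberately running the memory slightly out of spec) is a much stronger sign that ECC is doing its job than the mere presence of the directory.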

    Edit: QNAP memory for non-QNAP products probably isn't the best example, so I changed the example to a Kingston ECC module.
    Last edited by Entzilla; 05 January 2021, 10:23 AM. Reason: jumped the gun on UDIMM brand, edited to reflect something more consumer-friendly.


    • Originally posted by schmidtbag
      Yes, and so in the unlikely event a bit is flipped, whether that was during the backup/copy process or the checksum process, an error will be found. If an error was found, re-run the checksum to double-check if that process failed or if the transfer failed.
      Not a hard concept to grasp, but considering you're clearly rabid with anger, you aren't thinking clearly.
      Again, you literally have no idea what you're talking about. Why do I have to spell it out for you?

      Here's a made up example just to illustrate. You open a text file in some encrypted archive format with a special editor. It's all in RAM now. The old one has checksum 0x12345678. You edit it a bit and save it. The bit flip happens during your edit in RAM. You save it. Since it's encrypted, it is totally different on-disk. The checksum is calculated. It is now 0x87654321. It's saved on disk along with it.

      You back it up with the same data that is in RAM, not on disk. A disk failure here (i.e. inability to write it properly) does not impact you in the slightest, since it's cached. So now you have two copies, on two different devices.

      Sometime later, you need to read it. You check the first disk. Bad sector, checksum fails! Not a tragedy, you know you have backups! Because disk failures are insignificant, you check your other backup. Not a bad sector. Checksum is OK. You think all is good.

      Since the first disk failed, now you think you should make more copies. Not a problem, you copy the other backup onto 100 different disks. All have the same checksum, and are identical. Now you're safe, right?

      You open it up 1 year later. It opens. But wait, there's a bit flip somewhere in there, completely silent and undetected. It happened during your edit in RAM before the fucking checksum was calculated. The checksum was calculated on the bit flipped RAM data. Again, you are beyond fucked, imagine if it was in some password or code you stored it up there.

      Now ALL of your backups are bad, because they were all told to copy the bit flipped data. For crying out loud, you have to MANUALLY verify each and every file.

      If you can't trust your RAM, you cannot trust ANY DATA since everything is cached into RAM at one point. Not so with disk failures, those are inconsequential.
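The scenario above can be reproduced in a few lines. This is only a sketch with made-up data: flip_bit and checksum are illustrative helpers, SHA-256 stands in for whatever checksum the archive format uses, and the bit position is arbitrary. It contrasts a flip that happens before the checksum is computed (every backup verifies, yet all are wrong) with a flip that happens afterwards during a copy (caught immediately).

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


intended = b"password=correct horse battery staple"

# Case 1: the flip happens in RAM BEFORE the checksum is calculated.
in_ram = flip_bit(intended, 75)
stored_sum = checksum(in_ram)            # checksum of already-corrupt data
backups = [bytes(in_ram) for _ in range(100)]
all_verify = all(checksum(b) == stored_sum for b in backups)
silently_corrupt = in_ram != intended    # nothing on disk can reveal this

# Case 2: the checksum is taken from good data, the flip happens during a copy.
good_sum = checksum(intended)
damaged_copy = flip_bit(intended, 75)
copy_detected = checksum(damaged_copy) != good_sum
```

In case 1 all_verify and silently_corrupt are both true at the same time, which is the whole problem: the checksum faithfully protects the wrong bytes.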

      Originally posted by schmidtbag
      Do you realize how you're proving my point here?
      If you do a checksum, you'll figure out immediately if a flipped bit occurred (or if your drive is faulty), in which case, run the backup again. Problem solved. The chance of an error occurring twice on a non-critical system (since critical systems need ECC) is astronomically small.
      If you don't do a checksum, then the problem is silent because you won't know something went wrong until you need it.
      Please read the example above. I still don't think you realize how low level RAM goes.
      Last edited by Weasel; 05 January 2021, 09:21 AM.


      • Originally posted by strtj

        I went and looked up the details because I couldn't remember them (I never personally ran it) - they called it Memory Trolling, which in and of itself is a great enough reason to use it ;-) Documentation here:

        This certainly wouldn't be too difficult to re-implement on Linux, but I think you are right that manufacturers weren't scrambling to implement something like this. Especially in the days when the vendor and the OS were generally one and the same, there would have been very little motivation to increase the number of memory returns and/or service calls. If the customer doesn't notice some bad bits and never complains, who cares, right? :-P

        As far as I am aware the only significant feature of Tru64 that was carried forward in any way was the AdvFS filesystem which was open sourced and ported to Linux, but I'm not aware of anyone actually using it. The only reason that I can think to have used it was if you had a huge investment in AdvFS storage and for some reason couldn't easily migrate it to something else. AdvFS wasn't a bad filesystem but the management tools were awful and by 2008 there were plenty of reasonable large/clustering filesystem options. As far as I can tell, HP's only interest in Tru64 was killing it and migrating its customers to HP-UX.
        Ok, I quickly read the man page. It seems like memory trolling is "just" memory scrubbing and that it requires ECC to work at all. At least that is what I am thinking. If it finds an error, it will trigger a machine check exception (MCE), in which case you either soft offline the memory module (e.g. if the error count is high enough, you migrate data off it until it is empty and then disable it) or, in the case of an uncorrectable error, you simply hard offline the memory module and accept crashes in programs that use that memory. If I remember correctly, this is not much different from what Linux does today if ECC memory is present. I may be wrong about this, so I'll look into the details later. Thanks!
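The soft/hard offline decision described above could be sketched roughly like this. It is a toy model and not how Tru64 or Linux actually implement it: the names ScrubPolicy, PageHealth and Action, the unit being tracked (a page number here) and the threshold of 3 corrected errors are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    NONE = "none"
    SOFT_OFFLINE = "soft_offline"   # migrate data away, then retire the unit
    HARD_OFFLINE = "hard_offline"   # retire immediately, users of it may crash


@dataclass
class PageHealth:
    corrected_errors: int = 0
    offline: bool = False


@dataclass
class ScrubPolicy:
    ce_threshold: int = 3                       # hypothetical threshold
    pages: dict = field(default_factory=dict)   # page number -> PageHealth

    def report(self, page: int, correctable: bool) -> Action:
        """Record one error found by the scrubber and decide what to do."""
        health = self.pages.setdefault(page, PageHealth())
        if health.offline:
            return Action.NONE                  # already retired
        if not correctable:
            health.offline = True
            return Action.HARD_OFFLINE
        health.corrected_errors += 1
        if health.corrected_errors >= self.ce_threshold:
            health.offline = True
            return Action.SOFT_OFFLINE
        return Action.NONE
```

Note that both branches depend on ECC: without it the scrubber has no way to tell that an error occurred, let alone whether it was correctable.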


        • Originally posted by sentry66
          ZFS solved this issue via software, not hardware.
          Why can't ECC be solved via software checksums?
          I know performance would drop, but seems like it should be an option you enable in the BIOS if you're willing to lose performance.

          ECC memory can still go bad and doesn't even address the worst errors in the first place

          This isn't just system memory. There's GPU memory, CPU L1, L2, and L3 cache, SSD and hard drive cache, RAID card cache, highspeed network card cache.
          BOTH ZFS and BTRFS are prone to memory errors on the host system. E.g. if corrupt data is written to disk, you will get the same corrupted data back. With ZFS and BTRFS you just know that your corruption has not been further corrupted on disk.

          ECC can't easily be solved by software checksums, as this (as others have said) will completely ruin your performance. The next best thing would be for Linux to checksum each and every memory page, which will again ruin your performance. The best method would be to set up zswap to use all the memory it can and change swappiness to 100. If you get memory corruption, zswap will not be able to restore the page and will fail. You can try that and see if you are happy with the performance. You most certainly will not be.
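To make the cost of per-page checksumming concrete, here is a toy model. It is not how Linux or zswap work internally; ChecksummedMemory, PageCorruption and the 4096-byte page size are assumptions for illustration. Every write rehashes the whole page and every read verifies it, so even a one-byte access touches 4 KiB.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size


class PageCorruption(Exception):
    """Raised when a page no longer matches its stored checksum."""


class ChecksummedMemory:
    def __init__(self, num_pages):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.sums = [zlib.crc32(p) for p in self.pages]

    def write(self, page, offset, data):
        if offset < 0 or offset + len(data) > PAGE_SIZE:
            raise ValueError("write crosses the page boundary")
        self.pages[page][offset:offset + len(data)] = data
        # The whole page is rehashed on every write, however small.
        self.sums[page] = zlib.crc32(self.pages[page])

    def read(self, page, offset, length):
        # The whole page is verified on every read, however small.
        if zlib.crc32(self.pages[page]) != self.sums[page]:
            raise PageCorruption(f"checksum mismatch on page {page}")
        return bytes(self.pages[page][offset:offset + length])
```

Two limits are worth keeping in mind: this only detects a flip, it cannot repair it, and a flip in the data before write() is called gets checksummed as if it were correct.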

          Enterprise-class hardware usually has memory mirroring and/or memory patrolling. Both require ECC, if I remember correctly. With memory mirroring you halve your usable memory but will survive any number of bit failures, since you have a copy on another memory chip. ECC lets the memory controller know which memory chip is bad. With memory patrolling, the memory controller constantly reads/writes a test pattern to the memory to catch any errors before the memory is (hopefully) used by the system.
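For anyone wondering what the ECC check itself looks like, below is a minimal sketch of a SECDED code (single error correction, double error detection) on a 4-bit value, using an extended Hamming(8,4) layout. Real ECC DIMMs typically protect 64 data bits with 8 check bits and may use stronger schemes, so treat this only as a scaled-down illustration; the function names are made up.

```python
from functools import reduce
from operator import xor

# Bit 0 is the overall parity bit; positions 1, 2 and 4 hold Hamming parity
# bits; the four data bits live at positions 3, 5, 6 and 7.
DATA_POSITIONS = (3, 5, 6, 7)


def encode(nibble):
    """Encode a 4-bit value into an 8-bit SECDED codeword (list of bits)."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = reduce(xor, bits[1:])      # makes overall parity even
    return bits


def decode(word):
    """Return (value, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos              # XOR of the positions of all set bits
    overall = reduce(xor, bits)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: one bit flipped, and the syndrome is its position
        # (syndrome 0 means the overall parity bit itself was hit).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Even overall parity but nonzero syndrome: two bits flipped.
        return None, "uncorrectable"
    value = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return value, status
```

Three or more flips in the same word can be miscorrected or missed entirely by a code like this, which is the sense in which plain ECC does not cover the worst errors.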

          Yes, you can do magical things in software, but the only correct way forwards is as Linus says: we need more ECC, not less!


          • Originally posted by Weasel
            Again, you literally have no idea what you're talking about. Why do I have to spell it out for you?
            Because maybe it's you who doesn't know what you're talking about. Go wipe the foam off your mouth and come back with a clear head.
            Here's a made up example just to illustrate. You open a text file in some encrypted archive format with a special editor. It's all in RAM now. The old one has checksum 0x12345678. You edit it a bit and save it. The bit flip happens during your edit in RAM. You save it. Since it's encrypted, it is totally different on-disk. The checksum is calculated. It is now 0x87654321. It's saved on disk along with it.

            You back it up with the same data that is in RAM, not on disk. A disk failure here (i.e. inability to write it properly) does not impact you in the slightest, since it's cached. So now you have two copies, on two different devices.
            If the bit flipped while you were editing it, you would either notice the change (and since you said it was encrypted, a single flipped bit is going to have a cascading effect, making it more noticeable) or it would trigger the application to crash. So, you wouldn't get to the point where you save it in the first place.
            Please read the example above. I still don't think you realize how low level RAM goes.
            I don't think you know how low-level RAM goes, because when RAM is fucked, you know it immediately.
            There are of course exceptions, but where those exceptions are important you get ECC.
            As I've said a hundred times already: the average home user doesn't need to worry about such things. Your grandma doesn't give a shit about encrypting a letter to family members. The average teen isn't going to flunk out of school because the calculator for their math homework gave the wrong answer.
            It must suck to be you, living in paranoia over the most mundane things.
            Last edited by schmidtbag; 05 January 2021, 10:38 AM.


            • Originally posted by coder
              The reason I draw a distinction is:
              1. The size of high-value data, relative to memory capacity.
              2. The potential for errors to get persisted or for corruption to multiply.

              If we look at point #1, a document someone is editing will occupy only a tiny amount of RAM. Your error rate would have to be extremely high for there to be a reasonable likelihood of it getting corrupted, and long before it gets that bad, the machine would be so unstable that it would be unusable and the user would replace it. While photos and videos are much larger, they're also fairly error-tolerant.

              Compare that to a large CAD model that a workstation user might edit, and the math starts to look a lot different. Worse, you could have a scenario where an error creeps into the model at some point and gets persisted, contaminating all subsequent iterations of the model. If, at some point it's finally noticed, the user might have to revert many iterations to get rid of it, resulting in much lost work. And if it's not noticed until production (or later), the cost of the error could multiply further.
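The size argument can be put into rough numbers. This is a back-of-the-envelope sketch that assumes bit flips land uniformly at random across RAM and independently of each other, which real failures do not necessarily do; the function name and the example sizes (1 MiB document, 2 GiB model, 16 GiB of RAM) are made up.

```python
def p_hit(data_bytes, ram_bytes, flips):
    """Chance that at least one of `flips` random bit flips lands in the data.

    Assumes each flip hits a uniformly random location in RAM, independently.
    """
    fraction = data_bytes / ram_bytes
    return 1.0 - (1.0 - fraction) ** flips


MIB = 1024 ** 2
GIB = 1024 ** 3

small_document = p_hit(1 * MIB, 16 * GIB, flips=10)
large_cad_model = p_hit(2 * GIB, 16 * GIB, flips=10)
```

Under these assumptions the large model is orders of magnitude more likely to be hit than the small document for the same number of flips, which is the distinction being drawn.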

              Regarding point #2: for servers, bad RAM can mean serving up the same (or different) incorrect data to multiple clients, on multiple occasions. Those clients might edit the data and resubmit it to be saved, resulting in the errors being persisted. Even worse would be if memory errors actually corrupt filesystem datastructures, which could result in much larger data corruption that you hopefully notice soon enough to revert to a backup, but maybe not before users have made additional changes that would also be lost by reverting.
              I fully agree with what you say, but that's beside the point.

              Originally posted by coder
              Suffice to say that the economics of data integrity looks a lot different for "workstation" datasets and server usage patterns than typical end-consumer client PC usage. That's not to say I don't wish everyone had reliable computing devices with excellent data integrity, but it merely serves to illuminate the conventional distinction. We cannot deny that the consumer world has been managing alright without ECC, so it deserves an honest examination of why that might be so.

              If a government or industry body wants to establish such a ratings scheme, I'd be in favor of that. In general, you cannot blame the industry for doing no more than what customers or regulations demand. Well, you can, but you're just wasting your breath.
              This is the problematic part. Although you do have a much more nuanced position than schmidtbag there, you are still missing the point.

              Whether it's youngsters just entering the job market and building projects to showcase their skills...
              Or an association that tries to host its own webserver to organize events and manage its members...
              Or an individual who hosts a whole personal cloud stack because he wants to de-Google himself...
              Or medical practitioners who need to store personal files and radiology images of their patients...
              Or SMEs that use specific applications for their business...
              Or even larger business entities that are oblivious to these kinds of questions and grab "standard" hardware for their employees.

              Very few people know even 1% about how their software and operating system work in the first place, because it's not their core activity nor their passion so they don't care. Whether that is a good thing or not is irrelevant.
              Likewise, and even "worse" because it is more complex and multi-disciplinary (so this time the ignorance may persist in spite of wanting to know), they are completely unaware of how hardware works.

              How then should they be able to know that there is a qualitative and significant difference in reliability between mainstream hardware and industrial grade "for Google like entities" hardware?

              Whether people are ready to acknowledge that or not, a majority of data, productive or not, is created on hardware geared for "regular average chump". Data which is often critical for people as well.

              That using a specific kind of ECC reduces the risk of failure only ever so slightly is irrelevant to the point. The same goes for the disk comparison or "software bugs create more risk": if you have two weak points, A and B, and there are ways to address each of them, the fact that A is likely to fail "first" does not mean you should not be able to address B as well if data integrity is a priority for you.

              The point is: everyone should be informed that "regular" memory has a small but real chance of silently corrupting data, and everyone should get to decide for themselves whether it's worth paying a reasonable premium for better reliability over time.
              Or it should become the default: everyone rants for a while because prices go up slightly, then two years later everyone forgets there ever was a world where ECC wasn't the universal standard of quality. Honestly I'm far more in favor of that, but it would require the kind of multilateral governmental action one can only dream of.


              • If you use computers for even remotely important things, and generate / accumulate important data in excess of a gigabyte per month, you most likely need ECC memory, as simple as that.


                • The way things are looking, with HBM2 or GDDR6X memory potentially starting to move on-board the CPU with 3D stacking, coupled with an on-board GPU or maybe an on-die ASIC or FPGA, is there a way for alternative hardware to give better performance than the CPU if you didn't have ECC memory?

                  ECC seems like the simplest solution, but it's still only system memory. There's still the issue of software errors elsewhere and other devices and firmware. It seems like there should be a more total solution available than ECC which doesn't even fix the worst errors.


                  • Originally posted by piotrj3

                    Your ECC didn't work at all on the FX8320 unless you bought a motherboard worth far more than the CPU itself. You have to understand that supporting ECC memory and ECC itself actually working are two different things.

                    AMD consumer chips do support USING ECC memory, but it is up to the motherboard manufacturer to make ECC work. There are tons of people who bought a consumer Ryzen chip, used ECC memory, and then found out in Memtest that it is not working. For me that is even worse than Intel, because it is in large part false advertising: before, you had "ECC supported/not supported", and wherever it was supported it simply worked. Now with AMD it is "ECC supported/ECC supported by motherboard (which means the CPU memory controller might not catch all errors)/ECC not supported". And motherboard makers of course won't add extra chips between CPU and RAM to handle ECC. This is why 99% of the motherboards you see don't and won't support ECC memory, yet there are a ton of users who think ECC works for them when it doesn't. As far as I know, only some ASUS motherboards support ECC on consumer AMD chips correctly, and ASRock "sometimes". For everyone else it's a no.
                    I had a 65€ Gigabyte AM3 mainboard and yes, ECC worked. I checked it in Memtest.

                    Over time I had 3 different mainboards; on the ASRock board ECC did not work.

                    However, choosing the mainboard wisely is still much better than what you get with Intel.
                    Phantom circuit Sequence Reducer Dyslexia


                    • Originally posted by schmidtbag
                      If ECC became the only type of memory you could get, it wouldn't be low-production anymore.
                      ECC for a CPU cache is an absolute necessity, regardless of what workload you have.
                      The right policy would be to make any non-ECC RAM against the law.

                      The damage that non-ECC RAM does is far, far, far greater than the cost of the RAM.

                      This means a law banning non-ECC RAM would have a positive effect.