Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC


  • #91
    Originally posted by zxy_thf View Post
    It would be great to have some memtesting kernel module mainlined, but running in background all the time is probably not that great. It will drain the battery on a laptop.
    So, make it optional at boot, so those of us on desktops, not limited by those pesky batteries, can use it freely and others can disable it to protect their batteries.

    Comment


    • #92
      Originally posted by mathew7 View Post
      All permanent storage devices have error correction built in. When you receive a bad-sector report, their correction code has failed.
      That is only when the bad sector is known. It's not uncommon to see drives that steadily lose data integrity over time. For HDDs, sometimes the data is proven clean when written but gets corrupted later.
      <grandpa voice>Back in the day, when an HDD was about 3 times the size of a CD</grandpa voice>, I prepared a folder to write to a CD and copied it over the network to a friend (over NetBIOS, not TCP-IP) and everything seemed fine. Even compared the copied data on the PC that had written it.
      But when I put the CD in my PC and compared it with my original data, almost everything was corrupted (no zips could be extracted and all .exe files crashed). I don't know which part failed: the MEM-MEM copy in the network driver or the actual network transfer. But it failed silently. And this WAS a backup.
      Well if you didn't do TCP-IP, that's most likely the problem.
      Also, getting back to ECC necessity, I've had a RAM+MB combination that would always freeze loading Windows on the 1st start of the day (at least 8h powered off). It took me a month to remember to load memtest 1st, and when I did, I got two 1-bit errors on the 1st pass, followed by hours of stability. ECC would have allowed me to use that reliably, instead of getting angry every time while resetting my PC. Now think about someone who doesn't know anything about PCs and tries to go through the seller's support channel.
      ECC is supposed to protect you from things like cosmic rays and neutrons, which are basically unpredictable accidents. When you have something as consistent as what you described, that's not a fluke of nature, that's faulty hardware. I'm not sure if ECC would actually fix that, though it might have made it less of a problem. Simply swapping out the module with a different one (regardless of ECC) might have been enough to stop the problem.

      Comment


      • #93
        I want to use ECC memory, but it's significantly slower, so I don't use it.

        Comment


        • #94
          Originally posted by schmidtbag View Post
          That is only when the bad sector is known. It's not uncommon to see drives that steadily lose data integrity over time. For HDDs, sometimes the data is proven clean when written but gets corrupted later.
          And that happens because the actual reading is below threshold, and I can think of 3 reasons: head damage, a strong external magnet, or particles from previous head crashes getting stuck between head and disk and continuously scratching the disk. I've had those in the past.
          Originally posted by schmidtbag View Post
          Well if you didn't do TCP-IP, that's most likely the problem.
          Doooh!...the point was that a backup could back up the corrupted data.
          Originally posted by schmidtbag View Post
          ECC is supposed to protect you from things like cosmic rays and neutrons, which are basically unpredictable accidents. When you have something as consistent as what you described, that's not a fluke of nature, that's faulty hardware. I'm not sure if ECC would actually fix that, though it might have made it less of a problem. Simply swapping out the module with a different one (regardless of ECC) might have been enough to stop the problem.
          Well, yes, but look at the size of modules throughout the years....you have the same die shrinking as CPUs. But the smaller the cell, the more susceptible it is to that radiation. Also, the voltage of the cell gets lowered.
          As for my RAM problem, ECC would have hidden it and continued to work. My case was an edge case which proved problematic. If it had been a total failure (like an address failing at every pass), then I would have been delayed by 2 days for shipping/exchanging with new RAM. But due to its nature, it bugged me for months (and I did run memtest with no errors...just forgot to do it at 1st start when Windows froze).
          Also, a BSOD with 2-bit-error detection gives more info (like a specific address which I could see every time) than a silent crash/freeze, which could be from any HW. Oh...and also a faulty (or on its way out) PSU could create sporadic memory errors where Windows would just freeze when it could give you a BSOD triggered by a specific source (ECC ....Error checking and correction).

          Comment


          • #95
            Originally posted by Ivan Dimitrov View Post
            AMD supports (validates) ECC RAM ONLY on their PRO processors. ECC is enabled on non-PRO processors, but validation and implementation are left to the motherboard vendor. So ECC RAM will most likely work with a non-PRO processor, but it is not clear whether it will work in ECC mode and how it will report errors.
            it is clear if you look at motherboard specs
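
Beyond the spec sheet, on Linux one way to see whether errors are actually being reported is the EDAC sysfs tree. Below is a minimal sketch, assuming an EDAC driver for the memory controller is loaded and exposes the usual `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc`. The function name and return shape are made up for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    An empty dict means no memory controller was found under `root`,
    which usually means no EDAC driver is active for this platform.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                # File missing or unreadable: report "unknown", not zero.
                entry[name] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found; ECC reporting may not be active.")
    for controller, entry in result.items():
        print(controller, entry)
```

An empty result does not prove the DIMMs lack ECC, only that nothing is being reported through EDAC on that machine.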

            Comment


            • #96
              Originally posted by carewolf View Post

              Error _correction_ does correct errors, it is in the name.. I think you are confusing parity with ECC, ECC can correct one bit errors, and can detect 2 bit errors, but not correct them.
              It is exactly what I said: "ECC does not automatically correct memories." It can correct a wrong reading, but not if the information is stored incorrectly.
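
The "correct one bit, detect two" behaviour quoted above can be shown with a toy code. This is a minimal sketch using an extended Hamming (8,4) code on 4 data bits. Real ECC DIMMs protect 64 data bits with 8 check bits, so this only illustrates the principle and is not what a memory controller literally runs. The function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of 0/1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8                      # index 0 holds the overall parity bit
    word[3], word[5], word[6], word[7] = d
    word[1] = word[3] ^ word[5] ^ word[7]   # covers positions 1, 3, 5, 7
    word[2] = word[3] ^ word[6] ^ word[7]   # covers positions 2, 3, 6, 7
    word[4] = word[5] ^ word[6] ^ word[7]   # covers positions 4, 5, 6, 7
    word[0] = sum(word[1:]) % 2             # makes total parity even
    return word


def decode(word):
    """Return (status, nibble): 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    parity_odd = sum(word) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Exactly one bit flipped; the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "uncorrectable", None
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble


if __name__ == "__main__":
    cw = encode(0b1011)
    cw[6] ^= 1                          # one flipped bit
    print(decode(cw))
    cw[2] ^= 1                          # a second flipped bit
    print(decode(cw))
```

With one flip the decoder returns the original data; with two it can only report the error, which is the case where a system would log or halt instead of continuing silently.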

              But even more. Let's say we have a faulty module:

              ECC = crashes and maybe an error log.
              NO-ECC = crashes anyways

              So why do we even need ECC memory for simple usage? 🤷‍♂️

              Comment


              • #97
                Originally posted by waxhead View Post

                One way to work around the lack of ECC would be a background task, e.g. a kernel driver that does the same job as Memtest86+. It would run on idle or reserve a very small amount of CPU time (2-3%) and slowly but surely walk the memory all the time, continuously. If it finds a bad spot, it could blacklist the memory area or memory module, building a "bad block list" over time. That would be a poor man's ECC, i.e. software memory scrubbing. It would be interesting if something like this existed, as I am sure it would expose how big the problem probably is, and it might build a strong case for Torvalds's call for more ECC memory.
                This existed ~25 years ago in Tru64 (Digital Unix / OSF/1). I can't remember if it was a paid add-on or if it existed in the base OS, but it did pretty much exactly what you describe. You could set it to do a certain amount of memory in a certain period of time and CPU use would scale accordingly. I'm always surprised that I've never seen it anywhere else.
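
The idea described in the quote (walk memory in small budgeted steps, blacklist pages that fail a pattern test) can be sketched in user space. This is a toy simulation, not kernel code: `SimulatedRam` and `Scrubber` are hypothetical names, the "RAM" is a bytearray with an optional stuck bit, and a real implementation would also have to deal with pages that are in use.

```python
PAGE_SIZE = 4096
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


class SimulatedRam:
    """Fake memory; the bits in `stuck_mask` at `stuck_addr` always read as 1."""

    def __init__(self, size, stuck_addr=None, stuck_mask=0x01):
        self.cells = bytearray(size)
        self.stuck_addr = stuck_addr
        self.stuck_mask = stuck_mask

    def __len__(self):
        return len(self.cells)

    def write(self, start, data):
        self.cells[start:start + len(data)] = data
        if self.stuck_addr is not None:
            self.cells[self.stuck_addr] |= self.stuck_mask  # simulate the fault

    def read(self, start, length):
        return bytes(self.cells[start:start + length])


class Scrubber:
    """Tests a few pages per step and remembers the ones that fail."""

    def __init__(self, ram, page_size=PAGE_SIZE):
        self.ram = ram
        self.page_size = page_size
        self.num_pages = len(ram) // page_size
        self.cursor = 0
        self.bad_pages = set()

    def _test_page(self, page):
        start = page * self.page_size
        saved = self.ram.read(start, self.page_size)
        ok = True
        for pattern in PATTERNS:
            expected = bytes([pattern]) * self.page_size
            self.ram.write(start, expected)
            if self.ram.read(start, self.page_size) != expected:
                ok = False
                break
        self.ram.write(start, saved)    # put the original contents back
        return ok

    def step(self, budget):
        """Visit at most `budget` pages, continuing where the last step ended."""
        for _ in range(min(budget, self.num_pages)):
            page = self.cursor
            self.cursor = (self.cursor + 1) % self.num_pages
            if page in self.bad_pages:
                continue                # already blacklisted
            if not self._test_page(page):
                self.bad_pages.add(page)


if __name__ == "__main__":
    ram = SimulatedRam(8 * PAGE_SIZE, stuck_addr=3 * PAGE_SIZE + 17)
    scrubber = Scrubber(ram)
    for _ in range(4):
        scrubber.step(budget=2)         # small budget per "idle tick"
    print(sorted(scrubber.bad_pages))
```

The `budget` argument stands in for the "certain amount of memory in a certain period of time" setting mentioned for Tru64; how that was actually implemented there is not something this sketch claims to reproduce.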

                Comment


                • #98
                  Originally posted by magallanes View Post
                  Hi everybody:

                  Also, memories are way less prone to fail than in the past. For example, if you have worked in a datacenter, then replacing a hard disk is part of, and included in, the global costs. In contrast, it is not common to replace the memory or the CPU, mainly because those components sit well inside the motherboard, protected by several layers of capacitors.

                  Tell that to my DCO guys, they replace memory and CPUs daily.
                  Don't expect much and seldom disappointed.

                  Comment


                  • #99
                    Originally posted by mathew7 View Post
                    Doooh!...the point was that a backup could back up the corrupted data.
                    As I mentioned earlier, you checksum against the source data. Then you don't face such problems.
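
Checksumming a backup against the source could look roughly like the sketch below. It assumes both trees are readable from the machine that holds the originals and uses SHA-256 from the standard library; `file_digest` and `verify_copy` are names invented for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_copy(source_dir, backup_dir):
    """Return relative paths whose backup copy is missing or differs."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    mismatches = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = backup_dir / rel
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            mismatches.append(str(rel))
    return mismatches


if __name__ == "__main__":
    import sys
    bad = verify_copy(sys.argv[1], sys.argv[2])
    print("all files match" if not bad else "\n".join(bad))
```

The check only helps if the backup is read back from the medium and compared with the original source, not with the intermediate copy that was written. It also cannot notice data that was already corrupted in RAM before the source was hashed.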
                    Well, yes, but look at the size of modules throughout the years....you have the same die shrinking as CPUs. But the smaller the cell, the more susceptible it is to that radiation. Also, the voltage of the cell gets lowered.
                    Trends seem to suggest that modules are becoming more reliable. This could be due to better engineering or better software-level error detection, but it could also be that in modern times, memory usage and availability is so vast that the probability of a major failure plummets. Remember, for the average PC, the vast amount of memory used is in software and assets rather than personal data. Although a single flipped bit in a 64-bit register could result in an astronomically big difference in a variable, the probability of that 1 bit affecting your workload is preposterously small.
                    When it comes to servers and workstations, it's the exact opposite: most of the data is personal/critical and therefore the probability of a flipped bit causing a problem is significant. That's why you'd be a fool for not using ECC in such situations.
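
To put numbers on how much a single flipped bit can change a 64-bit value, here is a small sketch. It flips one chosen bit in an unsigned 64-bit integer and in the IEEE 754 bit pattern of a Python float; the helper names are made up for illustration.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def flip_bit_u64(value, bit):
    """Flip one bit (0 = least significant) of an unsigned 64-bit integer."""
    return (value ^ (1 << bit)) & MASK64


def flip_bit_double(value, bit):
    """Flip one bit in the 64-bit IEEE 754 representation of a float."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    print(flip_bit_u64(1000, 0))       # low bit: off by one
    print(flip_bit_u64(1000, 62))      # high bit: off by 2**62
    print(flip_bit_double(1.0, 0))     # lowest mantissa bit: tiny change
    print(flip_bit_double(1.0, 62))    # top exponent bit
    print(flip_bit_double(1.0, 63))    # sign bit
```

Which bit gets hit matters as much as whether one gets hit: a flip in the low bits of a value is barely noticeable, while a flip in the top exponent bit turns 1.0 into infinity.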
                    As for my RAM problem, ECC would have hidden it and continued to work. My case was an edge case which proved problematic. If it had been a total failure (like an address failing at every pass), then I would have been delayed by 2 days for shipping/exchanging with new RAM. But due to its nature, it bugged me for months (and I did run memtest with no errors...just forgot to do it at 1st start when Windows froze).
                    ECC isn't a miracle worker. It has a high chance of recovery, but not 100%. It's meant to recover from a flipped bit here and there, but in the situation you described, it could be entire bytes that were erroneous. And, it's possible those bits/bytes physically can't be corrected. If your modules are physically defective, there's not much ECC can do to protect you, at least not without some severe performance penalties.
                    Oh...and also a faulty (or on its way out) PSU could create sporadic memory errors where Windows would just freeze when it could give you a BSOD triggered by a specific source (ECC ....Error checking and correction).
                    Indeed it could. I considered mentioning that but I didn't want to ramble about all the situations where a bit could possibly flip. Crappy VRMs could also trigger it. Overclocking could trigger it. A lightning strike could trigger it. The list goes on and on.
                    Last edited by schmidtbag; 04 January 2021, 12:37 PM.

                    Comment


                    • I am searching / waiting for an excellent AMD Epyc motherboard with an expanded selection of BIOS RAS features, specifically ECC related; in that respect Intel motherboards for some Intel processors are way ahead of AMD. AMD motherboard manufacturers seem to only offer "ECC On/Off" and "Patrol Scrub", while some of the better Intel server motherboards (for the upper tier CPUs) offer an expanded range of features:

                      - Adaptive double device data correction (ADDDC)
                      - Single Device Data Correction (SDDC)
                      - Memory Address Range Mirroring (MARM)
                      - Memory error storm response and Auto self-healing (Analysis Engine)
                      - Post Package Repair to spare and replace defective portions of DRAMs.

                      - Not "ECC", but a useful addition: Enhanced Machine Check Architecture Gen 2 (eMCA2)

                      Each section of an Intel BIOS seems to offer more options than AMD BIOS's rather sparse selection of knobs to twiddle.
                      Last edited by JustRob; 04 January 2021, 01:06 PM.

                      Comment
