AMD Introducing FRU Memory Poison Manager In Linux 6.9


  • AMD Introducing FRU Memory Poison Manager In Linux 6.9

    Phoronix: AMD Introducing FRU Memory Poison Manager In Linux 6.9

    Queued for introduction in the upcoming Linux 6.9 kernel cycle is an FRU Memory Poison Manager "FMPM" developed by AMD that may later be adapted for other, non-AMD platforms. The FRU Memory Poison Manager persists information about known bad/faulty memory across reboots...


  • #2
    It seems like a good idea not to have to throw out a product just because of a "minor" problem.

    I think persisting this information on disk would be good for diagnosis too...
    You know, why did the server restart, and is it bad?!
    OK, you have logs, but 0.01% of your memory being bad versus 10% are very different problems.

    Nice!
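
    In the meantime, the kernel's EDAC layer already exposes corrected / uncorrected error counts through sysfs, which helps with exactly this kind of diagnosis. A minimal sketch of reading them (assumes an EDAC driver is loaded for your memory controller and the standard /sys/devices/system/edac layout):

    import glob, os

    def read(path):
        with open(path) as f:
            return f.read().strip()

    # Each memory controller appears as mc0, mc1, ... in the EDAC sysfs tree.
    for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc[0-9]*")):
        name = read(os.path.join(mc, "mc_name"))
        ce = read(os.path.join(mc, "ce_count"))  # corrected (ECC-fixed) errors
        ue = read(os.path.join(mc, "ue_count"))  # uncorrected errors
        print(f"{os.path.basename(mc)} ({name}): {ce} corrected, {ue} uncorrected")
        for dimm in sorted(glob.glob(os.path.join(mc, "dimm*"))):  # per-DIMM, where provided
            print(f"  {read(os.path.join(dimm, 'dimm_label'))}: "
                  f"{read(os.path.join(dimm, 'dimm_ce_count'))} corrected")

    rasdaemon (ras-mc-ctl --summary / --errors) records the same events to an on-disk database, which covers the survives-a-reboot part for the error log itself, though not the firmware-level FRU records that FMPM manages.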



    • #3
      If only they had cared enough about the impacts of memory errors to have made ECC RAM standard in their chipsets / CPUs / BIOSes (including consumer / SMB) since, well, decades ago.



      • #4
        Originally posted by pong View Post
        If only they had cared enough about the impacts of memory errors to have made ECC RAM standard in their chipsets / CPUs / BIOSes (including consumer / SMB) since, well, decades ago.
        IIRC, DDR5 RAM includes a base level of "not exposed to the rest of the system"/internal-to-the-DIMM ECC as a required part of the spec because data density has crossed the same "ECC is required for reliable normal operation... not just for exceptional circumstances" threshold that rotating platter hard drives crossed decades ago.

        What it doesn't have as mandatory is the full-stack ECC that's normally meant by ECC RAM, which requires motherboard support because it's designed to also catch corruption introduced at other points between the CPU and the DIMM. (i.e. the RAM equivalent to how ZFS's full data checksumming can also catch corruption which occurs after the data leaves the hard drive, such as from a flaky SATA cable or dying controller chip.)



        • #5
          Originally posted by pong View Post
          If only they had cared enough about the impacts of memory errors to have made ECC RAM standard in their chipsets / CPUs / BIOSes (including consumer / SMB) since, well, decades ago.
          Many AMD platforms have ECC support. They just don't certify it because of the cost (which is a major factor for consumer hardware).
          My AMD NAS runs consumer hardware and uses ECC memory just fine.
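
          A quick way to sanity-check that ECC is actually active rather than silently ignored, a sketch assuming dmidecode is installed (needs root): ECC DIMMs report a total width 8 bits wider than the data width (72 vs. 64 bits).

          import subprocess

          # Per-DIMM "Total Width: 72 bits" alongside "Data Width: 64 bits" means
          # the check bits are present and in use; equal widths mean no ECC.
          out = subprocess.run(["dmidecode", "-t", "memory"],
                               capture_output=True, text=True, check=True).stdout
          for line in out.splitlines():
              line = line.strip()
              if line.startswith(("Total Width:", "Data Width:", "Error Correction Type:")):
                  print(line)

          If the platform's EDAC driver (amd64_edac on these CPUs) loads and /sys/devices/system/edac/mc/mc0 appears, corrected errors also get counted and reported by the kernel.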



          • #6
            Originally posted by euduvda View Post
            It seems like a good idea not to have to throw out a product just because of a "minor" problem.

            I think persisting this information on disk would be good for diagnosis too...
            You know, why did the server restart, and is it bad?!
            OK, you have logs, but 0.01% of your memory being bad versus 10% are very different problems.

            Nice!
            Linux already has a mechanism that protects you against this kind of problem.
            I ran 64 GB for a few years without realizing a section of RAM was corrupt, purely because that region was rarely hit and I had problems running a RAM checker from GRUB. I thought it was the CPU.

            A year or so ago I got the RAM checker to work, and after a couple of passes realized one localized area on a RAM stick is bad.

            I started to dig and found there is an option in the kernel that makes my boot about 30-40 seconds longer, but it runs a test and blocks the damaged RAM area from being used.

            Since then I've been very happy: no problems compiling browsers/OpenOffice, etc. Not planning to change my RAM either.
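
            The option isn't named here, but the description matches the kernel's built-in boot-time memory test (CONFIG_MEMTEST): passing memtest=N on the kernel command line runs N pattern passes during early boot and reserves any failing regions so they are never handed out. A sketch of enabling it via GRUB:

            # /etc/default/grub -- add the parameter, then regenerate the config
            # (update-grub on Debian/Ubuntu, grub2-mkconfig elsewhere) and reboot.
            GRUB_CMDLINE_LINUX_DEFAULT="quiet splash memtest=4"

            For a range that is already known bad there is also memmap=nn$ss, which reserves the region unconditionally without the per-boot test time (the $ typically needs escaping in the GRUB config).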



            • #7
              Thanks for the information! Yes, I've heard cursory details suggesting something similar, though I've never seen / researched the details of how it works / how good it is / etc. Your comment has filled in some of the missing pieces, e.g. "not exposed to the rest of the system" -- even if there aren't literally additional ECC parity bits per RAM DWORD wired between the DIMM and the memory controller, I'd imagined they MIGHT have ways to expose it a bit to the rest of the system, e.g. multiplexing "spare" signals in time / sequence as parity, or querying the DIMM's internal RAM-correction statistics (via some status registers / ...), whatever.

              I realize that any information channel with limited bandwidth and non-zero noise has a non-zero probability of errors over time, so anything from "non-volatile" memory to a data bus, RAM, Ethernet, ... will occasionally flip / miss a correct bit value due to degradation (temperature, inadequate refresh, cosmic rays, electrical noise). To achieve acceptable BER statistics it will have to use ECC or some other EDAC protocol to provide redundancy or integrity verification / maintenance, to keep things working at any guaranteed level of reliability.

              I have wondered, even with such a "we really need it" DDR5 scheme, whether a typical consumer motherboard with 128 GB of DDR4 is in a better or worse position for RAM errors (statistical / empirical rate over time, per byte read / written) than the same system with DDR5. It could be that DDR5 is consequently more robust thanks to those internally detected / corrected errors, or it could be that the achieved BER "meets specification" but is still no better than, or even worse than, the same case with DDR2/3/4.

              I suppose that's somewhat academic since in a few years DDR2/3/4 systems will mostly be gone and we've already got 192-256 GB+ systems built with DDR5. So the practical question, as ever, is: how risky is it really to have, say, 256 GB of DDR5 on a consumer motherboard, and, with active computational use of much of that RAM, how many bit errors per week / month / year is one likely to see from whatever causes (electrical SI noise, cosmic rays, RAM cell problems, ...)?

              It is kind of hard to believe we've kept frequently doubling the amount (and, less often, the bandwidth) of installed RAM in consumer architecture systems and still haven't got universal end-to-end ECC. I would have thought that in all but the most cost-constrained low-end systems, "spending" another ~12.5% of RAM (8 check bits per 64 data bits for classic SECDED) on reliability would be a pretty acceptable trade-off, particularly these days.

              In other bus systems like PCIe, or even SSD / SD, one sends a CRC along with the data in-band in the transmission "packet".
              One could have devised some scheme to do similarly on a RAM channel, conveying either extra (ECC) bits or just error-detection (CRC, ...) bits at the end of a burst, or multiplexing them into the read / write cycle some other way. It could even have been optionally enabled as a bandwidth vs. reliability trade-off, at least to detect / check integrity after a R/W cycle.
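
              To make the overhead concrete, a toy sketch of the detect-only variant (pure illustration, nothing like a real memory-channel protocol): a CRC32 per 64-byte burst costs 4 extra bytes (~6%) and catches a flipped bit on read-back, though unlike ECC it cannot say which bit flipped or repair it.

              import zlib

              BURST = 64  # bytes per burst, i.e. one cache-line-sized transfer

              def write_burst(data: bytes) -> bytes:
                  # Append a CRC32 "in-band", the way PCIe / SD frames carry one.
                  return data + zlib.crc32(data).to_bytes(4, "little")

              def read_burst(frame: bytes) -> bytes:
                  data, stored = frame[:BURST], int.from_bytes(frame[BURST:], "little")
                  if zlib.crc32(data) != stored:
                      raise IOError("burst failed CRC check -- corruption detected")
                  return data

              frame = bytearray(write_burst(bytes(range(BURST))))
              frame[10] ^= 0x04            # simulate one flipped bit in the cell array
              try:
                  read_burst(bytes(frame))
              except IOError as e:
                  print(e)                 # detected, but (unlike ECC) not correctable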

              When you're commonly storing / using more than a trillion bits (e.g. 128 GB of RAM), PPB-level error rates are not particularly comforting, nor even PPT-level ones.

              Then again, we're still often using file systems without stored checksumming / integrity hashes (vs. BTRFS, ZFS), though at least the SSDs themselves have some layers of protection.

              Originally posted by ssokolow View Post

              IIRC, DDR5 RAM includes a base level of "not exposed to the rest of the system"/internal-to-the-DIMM ECC as a required part of the spec because data density has crossed the same "ECC is required for reliable normal operation... not just for exceptional circumstances" threshold that rotating platter hard drives crossed decades ago.

              What it doesn't have as mandatory is the full-stack ECC that's normally meant by ECC RAM, which requires motherboard support because it's designed to also catch corruption introduced at other points between the CPU and the DIMM. (i.e. the RAM equivalent to how ZFS's full data checksumming can also catch corruption which occurs after the data leaves the hard drive, such as from a flaky SATA cable or dying controller chip.)
              Last edited by pong; 06 March 2024, 04:31 PM.



              • #8
                I found a bad DIMM in a (consumer type, DDR4) system I built from new parts a year or so ago.
                With 4 DIMMs installed, it'd pass most memory tests, e.g. the "quick" ones.
                But running memtest86+ across all test types would reveal it: on the couple of most frequently failing test types it'd show up as a few errors per hour of testing.

                Tragically, the consumer motherboards I've seen don't even give you a way to know which physical DIMM corresponds to which byte addresses, given one's chosen settings / configuration for DIMM sizes installed in which slots, bank interleaving, other interleaving, etc.

                In the end I had to guess which DIMM of the four it might be and pull it; no errors since, though I still have to replace it with a new DIMM and test again.

                I'm thinking of writing an online memory tester that can run while the system is in use, for motherboard RAM and for GPU memory too.

                It would be much less needed / useful if the system had ECC RAM and could just scrub / test routinely in service, but alas I have never had that.
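
                A very rough sketch of the idea (userspace only, so it can only exercise pages the OS hands it; the mlock() is best effort and assumes enough privilege / ulimit headroom; real testers like memtester lock and retest far more aggressively):

                import ctypes, time

                SIZE = 64 * 1024 * 1024            # test 64 MB at a time, in service
                PATTERNS = (0x00, 0xFF, 0x55, 0xAA)

                buf = ctypes.create_string_buffer(SIZE)
                libc = ctypes.CDLL(None, use_errno=True)
                libc.mlock(buf, SIZE)              # keep it resident so we test RAM, not swap

                for p in PATTERNS:
                    ctypes.memset(buf, p, SIZE)    # write the pattern...
                    time.sleep(1)                  # ...let it sit (catches weak / decaying cells)
                    if buf.raw != bytes([p]) * SIZE:
                        print(f"mismatch with pattern {p:#04x}!")

                Run repeatedly over fresh allocations, it can gradually sample different physical pages; the same write / sleep / verify loop in a compute kernel would cover GPU memory.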


                Originally posted by dimko View Post

                Linux already has a mechanism that protects you against this kind of problem.
                I ran 64 GB for a few years without realizing a section of RAM was corrupt, purely because that region was rarely hit and I had problems running a RAM checker from GRUB. I thought it was the CPU.

                A year or so ago I got the RAM checker to work, and after a couple of passes realized one localized area on a RAM stick is bad.

                I started to dig and found there is an option in the kernel that makes my boot about 30-40 seconds longer, but it runs a test and blocks the damaged RAM area from being used.

                Since then I've been very happy: no problems compiling browsers/OpenOffice, etc. Not planning to change my RAM either.



                • #9
                  Yes, I've noticed a few blurbs here and there over the past generation or two of motherboards about more "ability" to use it, even if not (as you said) "supporting" it in the sense of certifying it / standardizing on it, etc.

                  I've seen some user-level commentary / questions about how well it really works if you do install it, mostly wrt. having all the right BIOS-related options to enable / configure it and handle / correct detected errors etc., but yeah, it's nice to think that at least with the right CPU + chipset + motherboard + BIOS one has some option to get it working in practice.

                  I'll look forward to seeing what I can do with it when I build another generation of NAS and workstation TBD.

                  So do you have any rough statistics about what kinds of error rates you've personally encountered over time / use / installed memory? DDR4 or 5?


                  Originally posted by flower View Post

                  Many AMD platforms have ECC support. They just don't certify it because of the cost (which is a major factor for consumer hardware).
                  My AMD NAS runs consumer hardware and uses ECC memory just fine.



                  • #10
                    Originally posted by pong View Post
                    So do you have any rough statistics about what kinds of error rates you've personally encountered over time / use / installed memory? DDR4 or 5?
                    I use an ASRock Rack X570D4U-2L2T with 4x32 GB Kingston dual-rank unbuffered ECC DDR4-3200 DIMMs (I am only running them at 2666 MT/s because otherwise it would overclock my memory controller). The CPU is an AMD 5950X.

                    I've had that system for about 15 months without any problems, but this month the first DIMM has started showing about one bit error a week. They show up in bursts of 6-10 (but it's always the same bit). All of them are corrected, so I am not too worried, but I'm planning to replace the DIMM at some point.

