AMD Introducing FRU Memory Poison Manager In Linux 6.9

  • #11
    Originally posted by pong
    I found a bad DIMM in a consumer-type DDR4 system I built from new parts a year or so ago.
    With 4 DIMMs installed, it'd pass most memory tests, e.g. the "quick" ones.
    But running memtest86+ across all test types would reveal it: the couple of test types that failed most often
    showed a few errors per hour of testing.

    Tragically, the consumer motherboards I've seen don't even give you a way to know which physical DIMM corresponds to which byte addresses, since that depends
    on one's chosen settings / configuration: DIMM sizes installed in which slots, bank interleaving, other interleaving, etc.

    In the end I had to guess which DIMM of the four it might be and pull it; no errors since, though I still have to replace it with a new DIMM and test again.

    I'm thinking of writing an online memory tester that can run while the system is in use for motherboard RAM and GPU memory too.

    It would be much less needed / useful if the system had ECC RAM and could just scrub / test routinely in service but alas I have never had that.



    Dunno if we need an active 'RAM slot' checker. A desktop user has either 2 or 4 sticks.
    Remove them one by one and run a test for even 10 minutes, and you will figure out which one is damaged.

    • #12
      pong, please put your replies beneath the text you are quoting instead of above it. In email and Usenet this is called "top-posting" and is a pet peeve of old-timers, but I've never seen anyone do it on a web forum.

      • #13
        I don't want my memory to be poisoned 😢

        • #14
          I recall finding a couple of bad DIMMs over the years, one "recently" (like 9-12 months ago), another several years ago. Both were "fresh from the factory" failures in
          newly built systems. They could actually run memtest86 for an hour or more without the error being detected. In the recent case, out of the dozen or so tests memtest86+
          can run in the series, only a couple showed the error at all or with any frequency. The other ~10 tests rarely or never showed it.
          On the two failing tests I think it showed maybe a couple of error detections per hour when configured to run ONLY those two "frequently failing" test types in a loop.
          Of course that system had 128 GB of not-so-fast RAM, so it takes a fair while to run even a single test pass.

          As for plugging and removing DIMMs, I'm decidedly not that brave about doing that anymore, i.e. it seems physically risky. I've seen how the motherboard
          actually bends quite a bit when installing DIMMs, or to a lesser extent when pulling them, while it is fixed in the case by the ~9 ATX standard screws + stand-offs;
          in between the screw / stand-off points it just floats without being braced.

          So if you physically remove the entire MB from the case and lay it on a hard, flat-ish cushioned ESD mat, yeah, no problem: pull DIMMs, put DIMMs in, and the
          back of the motherboard will be supported well enough by the mat that you're not going to flex it much or crack any solder joints / vias / components etc.
          Inside a case? Yeah, not so much. I've seen PCBs damaged by less flexing than it takes to install a DIMM, some of the connectors, the CPU heatsink, etc. on these
          motherboards.

          So all in all I'd prefer to have, well, ECC, and a BIOS that tells me which DIMM is good / bad (why do they have such useless memory testers in the BIOS vs. memtest86+?!),
          so that if I have to pull / replace a DIMM I at least know which one.


          Originally posted by dimko

          Dunno if we need an active 'RAM slot' checker. A desktop user has either 2 or 4 sticks.
          Remove them one by one and run a test for even 10 minutes, and you will figure out which one is damaged.

          • #15
            Originally posted by yump
            pong, please put your replies beneath the text you are quoting instead of above it. In email and Usenet this is called "top-posting" and is a pet peeve of old-timers, but I've never seen anyone do it on a web forum.

            Agreed. Will do.
