
Fedora Developers Brainstorming Options For Better Memory Testing


  • #21
    Originally posted by starshipeleven View Post
    The EDAC subsystem does not notify the user directly; it all goes to dmesg and possibly the system logs, and there is no GUI way to inform the user.

    That's what he was talking about.
    True. I guess a netlink notifier for edac/mce (I think edac/mce has netlink messages?) on Fedora could serve a purpose.
    But for RHEL/servers in general I'd guess that you want remote journald log parsing or something similar.

    Comment


    • #22
      Originally posted by milkylainen View Post
      But for RHEL/servers in general I'd guess that you want remote journald log parsing or something similar.
      Yes, but the article is about Fedora, so I'm assuming desktop users are the main target and what the discussion is about.

      Comment


      • #23
        Originally posted by waxhead View Post
        Memory testing or "memory scrubbing" should definitely be part of the kernel. Servers (with some load) usually use ECC RAM (and so should everything else in my opinion), but sadly that is not the case.
        I don't think patrol scrubbing like that is an option in Linux, nor am I sure it should be.
        Usually, advanced memory controllers in servers can do patrol scrubbing with ECC memory.
        Scrubbing without ECC seems kinda pointless? What would you compare with?

        It does not add much. Single-bit errors are detected and corrected on reads without scrubbing.
        It might safeguard against multiple-bit errors on the same page/bank/cell since you are actively correcting them.
        But in cases where you're actually getting multiple-bit errors you don't really need to scrub things, you need to retire parts of the memory structure.

        Watching the soft-correct ECC counter is usually more than enough to know that your server has memory issues.

        Machines without ECC need a really good memory tester to test the memory subsystem stability.
        That won't stop your machine from going to bits if a memory cell fails at runtime though, scrubbing or not.
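
As a rough illustration of "watching the soft-correct ECC counter": on Linux systems where an EDAC driver is loaded, per-controller corrected and uncorrected error counts are exposed as `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/mcN/`. The sketch below is not from the thread; the function names `read_edac_counts` and `new_errors` are made up for this example, and it assumes the standard sysfs layout is present (on machines without ECC or without an EDAC driver it simply finds nothing).

```python
from pathlib import Path

# Standard EDAC sysfs location; only present when an EDAC driver is loaded.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {controller: (corrected, uncorrected)} for each mcN directory."""
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts  # no EDAC support on this machine
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers we cannot read or parse
        counts[mc.name] = (ce, ue)
    return counts


def new_errors(previous, current):
    """Return controllers whose counters grew between two snapshots."""
    grown = {}
    for name, (ce, ue) in current.items():
        old_ce, old_ue = previous.get(name, (0, 0))
        if ce > old_ce or ue > old_ue:
            grown[name] = (ce - old_ce, ue - old_ue)
    return grown


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Taking a snapshot periodically and passing two consecutive snapshots to `new_errors` gives the "is the counter incrementing?" signal; how that signal reaches a desktop user is the open question in this thread.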

        Comment


        • #24
          Originally posted by milkylainen View Post
          I don't think patrol scrubbing like that is an option in Linux, nor am I sure it should be.
          Usually, advanced memory controllers in servers can do patrol scrubbing with ECC memory.
          Scrubbing without ECC seems kinda pointless? What would you compare with?
          Yes, I know that some servers have memory mirroring and patrol scrubbing. If you are on hardware that does not support this, then a memory test module could reserve a number of areas, test those (which technically is not scrubbing) and move on. The point would be that you could detect bad memory before the OS uses it or before it causes errors.
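
To make the "reserve an area, test it, move on" idea concrete, here is a minimal userspace sketch of the pattern-write-and-verify step. It is only an illustration: a real implementation would live in the kernel and work on physical pages, whereas this code tests whatever memory the allocator hands out, and reads may well be served from CPU caches rather than DRAM. The names `PATTERNS`, `find_mismatches` and `check_region` are invented for the example.

```python
# Classic fill patterns: all zeros, all ones, and alternating bits.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def find_mismatches(data, pattern):
    """Return (offset, expected, actual) for every byte that differs from pattern."""
    return [(offset, pattern, value)
            for offset, value in enumerate(data)
            if value != pattern]


def check_region(buf, patterns=PATTERNS):
    """Fill buf with each pattern in turn and read it back.

    buf must be a writable bytes-like object such as a bytearray.
    Returns a list of mismatches; an empty list means no error was seen.
    """
    view = memoryview(buf)
    failures = []
    for pattern in patterns:
        view[:] = bytes([pattern]) * len(view)  # write phase
        failures.extend(find_mismatches(view, pattern))  # verify phase
    return failures


if __name__ == "__main__":
    region = bytearray(1 << 20)  # 1 MiB stand-in for a "reserved area"
    print("mismatches:", len(check_region(region)))
```

Note that the test destroys the contents of the region, which is why the area has to be reserved (unused) before it is tested.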

          Originally posted by milkylainen View Post
          It does not add much. Single-bit errors are detected and corrected on reads without scrubbing.
          It might safeguard against multiple-bit errors on the same page/bank/cell since you are actively correcting them.
          But in cases where you're actually getting multiple-bit errors you don't really need to scrub things, you need to retire parts of the memory structure.

          Watching the soft-correct ECC counter is usually more than enough to know that your server has memory issues.

          Machines without ECC need a really good memory tester to test the memory subsystem stability.
          That won't stop your machine from going to bits if a memory cell fails at runtime though, scrubbing or not.
          True, but my point was of course about systems without ECC RAM. How can you improve the reliability of those? I am not talking about preventing problems (for that you would need ECC RAM); I am merely talking about how to minimize and detect the effects of bad memory.


          http://www.dirtcellar.net

          Comment


          • #25
            Originally posted by milkylainen View Post
            It does not add much. Single-bit errors are detected and corrected on reads without scrubbing.
            It might safeguard against multiple-bit errors on the same page/bank/cell since you are actively correcting them.
            But in cases where you're actually getting multiple-bit errors you don't really need to scrub things, you need to retire parts of the memory structure.
            Background scrubbing can deal with a random single-bit-flip soft error (due to something like cosmic rays) by correcting the flip before the next random bit flip turns the correctable error into an uncorrectable one.
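
The argument can be put into numbers under a simplifying assumption that is not from the thread: if soft flips hit a given ECC word independently at a constant rate (a Poisson process), then the chance that two or more flips accumulate in that word between scrubs is 1 - e^(-x)(1 + x), where x is rate times scrub interval. The sketch below uses hypothetical names (`p_multi_flip`, `rate_per_word`, `interval`) and deliberately plugs in no real-world error rates.

```python
import math


def p_multi_flip(rate_per_word, interval):
    """Probability of >= 2 flips in one ECC word during one scrub interval.

    Assumes independent flips at a constant rate (Poisson model).
    rate_per_word and interval must use matching time units.
    """
    x = rate_per_word * interval  # expected number of flips in the interval
    if x < 1e-4:
        # Series expansion avoids cancellation for very small x.
        return x * x / 2.0 - x ** 3 / 3.0
    return 1.0 - math.exp(-x) * (1.0 + x)
```

For small x the result is roughly x²/2, so halving the scrub interval cuts the chance of an uncorrectable double flip to about a quarter. The model says nothing about weak or intermittent cells, which do not fail independently.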

            Comment


            • #26
              Originally posted by CommunityMember View Post

              Background scrubbing can deal with a random single-bit-flip soft error (due to something like cosmic rays) by correcting the flip before the next random bit flip turns the correctable error into an uncorrectable one.
              Yes. But the odds of that happening are close to zilch.
              Cosmic radiation is almost more of a myth than a likelihood.
              You are several orders of magnitude more likely to get a weak intermittent DRAM address/row/column/bank than a cosmic bit-flip (let alone a double one).

              If you are getting a lot of random bit-flips, your refresh rate is probably weak (in which case you'd be susceptible to row-hammering), or you're doing high-altitude flights or are in a radiation-heavy environment. In the latter case you're using the wrong hardware.

              You still need ECC to have something to correct against.
              And good memory controllers can scrub independently of software.
              So, a good memory controller will scrub and watch the soft error counters.
              If they are incrementing, it will report that to the OS so it can retire the module in question.
              This is how it's done in all serious high-end computers.

              Scrubbing belongs in hardware.

              Comment
