Fedora Developers Brainstorming Options For Better Memory Testing


  • milkylainen
    replied
    Originally posted by CommunityMember

    Background scrubbing can deal with the random single bit flip soft error (due to something like cosmic rays) by correcting the flip before the next random bit flip turns the correctable into an uncorrectable error.
    Yes, but the odds of those happening are close to zilch.
    Cosmic radiation is almost more of a myth than a likelihood.
    You are several orders of magnitude more likely to get a weak, intermittent DRAM address/row/column/bank than a cosmic bit flip (let alone a double one).

    If you are getting a lot of random bit flips, your refresh rate is probably weak (in which case you'd be susceptible to row hammering), or you're doing high-altitude flights or working in a radiation-heavy environment. In the latter case you're using the wrong hardware.

    You still need ECC to have something to correct against.
    And good memory controllers can scrub independently of software.
    So a good memory controller will scrub and watch the soft error counters.
    If they keep incrementing, it will report that to the OS so that the OS can retire the module in question.
    This is how it's done in all serious high-end computers.

    Scrubbing belongs in hardware.


  • CommunityMember
    replied
    Originally posted by milkylainen
    It does not add much. Single bit errors are detected and corrected on reads without scrubbing.
    It might safeguard against multiple-bit errors on the same page/bank/cell, since you are actively correcting them.
    But in cases where you're actually getting multiple-bit errors, you don't really need to scrub things; you need to retire parts of the memory structure.
    Background scrubbing can deal with the random single bit flip soft error (due to something like cosmic rays) by correcting the flip before the next random bit flip turns the correctable into an uncorrectable error.
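
    As a rough illustration of that argument, here is a minimal sketch using a toy extended Hamming (8,4) code, which corrects one flipped bit per word and detects (but cannot correct) two. Real ECC DIMMs typically protect 64 data bits with 8 check bits, so this only shows the principle. The function names encode, decode and scrub are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8                      # index 0 holds the overall parity bit
    word[3], word[5], word[6], word[7] = d
    word[1] = word[3] ^ word[5] ^ word[7]   # covers positions 1, 3, 5, 7
    word[2] = word[3] ^ word[6] ^ word[7]   # covers positions 2, 3, 6, 7
    word[4] = word[5] ^ word[6] ^ word[7]   # covers positions 4, 5, 6, 7
    word[0] = sum(word[1:]) % 2             # makes total parity even
    return word


def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos             # XOR of set positions is 0 when valid
    parity_broken = sum(word) % 2 == 1
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips: treat as a single flip at index `syndrome`
        # (index 0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is non-zero: two bits flipped.
        return "uncorrectable", None
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble


def scrub(memory):
    """Rewrite every word that needed correction; return how many were fixed."""
    repaired = 0
    for addr, word in enumerate(memory):
        status, nibble = decode(word)
        if status == "corrected":
            memory[addr] = encode(nibble)
            repaired += 1
    return repaired


# Two soft errors hit the same word, with and without a scrub in between.
unscrubbed = [encode(0b1011)]
unscrubbed[0][5] ^= 1
unscrubbed[0][2] ^= 1
print(decode(unscrubbed[0]))

scrubbed = [encode(0b1011)]
scrubbed[0][5] ^= 1
scrub(scrubbed)                         # first flip is repaired here
scrubbed[0][2] ^= 1
print(decode(scrubbed[0]))
```

    Without the scrub pass the two flips accumulate and the word is lost; with it, each flip is handled as an independent single-bit error. Whether that window matters in practice depends on how often flips actually occur.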


  • waxhead
    replied
    Originally posted by milkylainen
    I don't think that patrol scrubbing like that is a Linux option, nor am I sure it should be.
    Usually, advanced memory controllers in servers can do patrol scrubbing with ECC memory.
    Scrubbing without ECC seems kinda pointless? What would you compare with?
    Yes, I know that some servers have memory mirroring and patrol scrubbing. If you are on hardware that does not support this, then a memory test module could reserve a number of areas, test those (which technically is not scrubbing) and move on. The point is that you could detect bad memory before the OS uses it or before it causes errors.

    Originally posted by milkylainen
    It does not add much. Single bit errors are detected and corrected on reads without scrubbing.
    It might safeguard against multiple-bit errors on the same page/bank/cell, since you are actively correcting them.
    But in cases where you're actually getting multiple-bit errors, you don't really need to scrub things; you need to retire parts of the memory structure.

    Watching the soft-correct ECC counter is usually more than enough to know that your server has memory issues.

    Machines without ECC need a really good memory tester to test the memory subsystem stability.
    That won't stop your machine from going to bits if a memory cell fails at runtime though, scrubbing or not.
    True, but my point was of course about systems without ECC RAM. How can you improve the reliability of those? I am not talking about preventing problems (for that you would need ECC RAM); I am merely talking about how to detect bad memory and minimize its effects.
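
    The "reserve an area, test it, move on" idea can be sketched in user space. This is only an illustration of the pattern test itself: a Python process cannot choose physical pages, so a real tester would have to live in the kernel or firmware. The names PATTERNS, check_region and StuckBitBuffer are hypothetical, and the faulty cell is simulated.

```python
# Classic fill patterns: all zeros, all ones, and two alternating-bit bytes.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def check_region(buf, patterns=PATTERNS):
    """Fill buf with each pattern, read it back, and collect mismatches.

    Returns a list of (offset, expected_byte, actual_byte) tuples.
    """
    errors = []
    for pattern in patterns:
        buf[:] = bytes([pattern]) * len(buf)      # write phase
        for offset, value in enumerate(buf):      # read-back phase
            if value != pattern:
                errors.append((offset, pattern, value))
    return errors


class StuckBitBuffer(bytearray):
    """Simulated bad region: one bit at bad_offset always reads back as 1."""

    bad_offset = 3
    mask = 0x04

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        # Whatever was written, force the stuck bit high afterwards.
        super().__setitem__(self.bad_offset, self[self.bad_offset] | self.mask)


print(check_region(bytearray(4096)))      # healthy region: no mismatches
for offset, expected, actual in check_region(StuckBitBuffer(16)):
    print(f"offset {offset}: wrote {expected:#04x}, read {actual:#04x}")
```

    Only patterns that try to store a 0 in the stuck bit expose the fault, which is why testers cycle through complementary patterns. A test like this finds persistent defects; it does nothing for a cell that fails later at runtime.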


  • milkylainen
    replied
    Originally posted by waxhead
    Memory testing or "memory scrubbing" should definitely be part of the kernel. Servers (with some load) usually use ECC RAM (and so should everything else in my opinion), but sadly that is not the case.
    I don't think that patrol scrubbing like that is a Linux option, nor am I sure it should be.
    Usually, advanced memory controllers in servers can do patrol scrubbing with ECC memory.
    Scrubbing without ECC seems kinda pointless? What would you compare with?

    It does not add much. Single bit errors are detected and corrected on reads without scrubbing.
    It might safeguard against multiple-bit errors on the same page/bank/cell, since you are actively correcting them.
    But in cases where you're actually getting multiple-bit errors, you don't really need to scrub things; you need to retire parts of the memory structure.

    Watching the soft-correct ECC counter is usually more than enough to know that your server has memory issues.

    Machines without ECC need a really good memory tester to test the memory subsystem stability.
    That won't stop your machine from going to bits if a memory cell fails at runtime though, scrubbing or not.
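
    On Linux, watching that counter can be done from sysfs when an EDAC driver for the memory controller is loaded: each controller appears under /sys/devices/system/edac/mc/ as mcN, with ce_count (corrected errors) and ue_count (uncorrected errors). A minimal sketch assuming that layout; the helper names read_edac_counts and newly_failing are made up for the example.

```python
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {'ce': int, 'ue': int}} for every mcN directory.

    Returns an empty dict when no EDAC driver is loaded (directory missing).
    """
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue                    # skip controllers we cannot read
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


def newly_failing(previous, current):
    """List controllers whose corrected-error count grew between samples."""
    return [
        name
        for name, sample in current.items()
        if sample["ce"] > previous.get(name, {"ce": 0})["ce"]
    ]


if __name__ == "__main__":
    snapshot = read_edac_counts()
    if not snapshot:
        print("no EDAC memory controllers found")
    for name, sample in snapshot.items():
        print(f"{name}: corrected={sample['ce']} uncorrected={sample['ue']}")
```

    Calling read_edac_counts periodically and passing two consecutive snapshots to newly_failing gives a crude "counter is climbing" signal. On a machine without ECC there is nothing to read, which is the limitation described above.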


  • starshipeleven
    replied
    Originally posted by milkylainen
    But for RHEL/servers in general I'd guess that you want remote log journald parsing or something similar.
    Yes, but the article is about Fedora, so I'm assuming desktop users are the main target and what the discussion is about.


  • milkylainen
    replied
    Originally posted by starshipeleven
    The EDAC subsystem does not notify the user directly; it all goes into dmesg and possibly the system logs, but there is no GUI way to inform the user.

    That's what he was talking about.
    True. I guess a netlink notifier for EDAC/MCE (I think EDAC/MCE has netlink messages?) on Fedora could serve a purpose.
    But for RHEL/servers in general I'd guess that you want remote log journald parsing or something similar.


  • CommunityMember
    replied
    Originally posted by birdie
    I've long switched to memtest86 which works great and I couldn't care less that it's a closed source application.
    For some people, free and open is an important goal in their software selection. To each their own.

    And for those who take testing memory seriously at scale, there are the $5000+ hardware-based memory testers, which also allow tweaking the parameters (over/undervolting and over/underclocking) to detect memory that is living on the edge.


  • starshipeleven
    replied
    Originally posted by CommunityMember
    (outside the Threadripper group, proper ECC support is hit-or-miss, and reportedly mostly miss).
    ASRock is the only brand that has that available on nearly all boards (and it's clearly stated in the specs). They aren't playing stupid games like saying "yes, it supports ECC RAM but runs in non-ECC mode" (ASUS) or just bullshitting about ECC support (Gigabyte).


  • starshipeleven
    replied
    Originally posted by birdie
    I've long switched to https://www.memtest86.com/download.htm which works great and I couldn't care less that it's a closed source application.
    Well, yeah, you are using Windows already so it's fine I guess.


  • CommunityMember
    replied
    Originally posted by waxhead
    Servers (with some load) usually use ECC RAM (and so should everything else in my opinion), but sadly that is not the case.
    Due to, of course, a bifurcation of the market by certain CPU and mainboard manufacturers, where you only get ECC support on server-class systems (yes, there are a couple of exceptions). Reportedly all the Ryzen chips support ECC DRAM, and at least some Threadripper mainboards (properly) support ECC (outside the Threadripper group, proper ECC support is hit-or-miss, and reportedly mostly miss).
