Fedora Developers Brainstorming Options For Better Memory Testing

  • milkylainen
    replied
    Originally posted by bugmenot3
    Maybe the Linux kernel could do a "simple" short test before booting the operating system if there was an unexpected reboot, or even do plausibility tests in the background like file systems do?
    There are no "safe" heuristics in memory testing. Also, on machines with large amounts of memory, even heuristics take substantial time to complete.
    Heuristics also usually catch only the really broken configurations, which you would notice anyway.

    To catch really subtle errors, you need long and agonizingly painful memory tests.
    This makes them rather annoying in servers with uptime requirements, when doing tests from cold boots.

    Originally posted by bugmenot3
    In addition, a nice way to notify the user is necessary. Errors from ECC RAM can appear in a log file, but it would be nicer if there was a message after boot or a message during boot that has to be accepted.
    The kernel has had this for ages. It's the EDAC subsystem.
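
For readers who want to see what EDAC exposes, here is a minimal sketch of reading its error counters from sysfs. It assumes the usual layout under /sys/devices/system/edac/mc (one mcN directory per memory controller, each with ce_count and ue_count files) and that an EDAC driver is loaded for the platform. The function name read_edac_counts is made up for this example.

```python
from pathlib import Path

# Usual sysfs location of EDAC memory controller data (assumed layout).
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {controller_name: (corrected_errors, uncorrected_errors)}.

    Returns an empty dict when the directory is missing, for example
    when no EDAC driver is loaded or the machine has no ECC memory.
    """
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters cannot be read or parsed.
            continue
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A desktop notification or boot-time message like the one asked for above could be built on top of such counters, but that part is not shown here.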


  • milkylainen
    replied
    There is nothing wrong with memtester. It is hw-agnostic and runs on just about anything.
    I have ported memtester to various bootloaders and other environments.
    PPC of various sorts, ARMs and whatnot.

    The only fly in the ointment is relocating whatever is running so that the memory underneath it can be tested.
    Memtest does that just fine; everything else does not.

    If they want to test memory live while running Linux, I suggest relocating everything live if that is possible;
    otherwise, restrict memory testing to the smallest environment possible, e.g. the bootloader.

    The memory testing in itself is trivial.
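
To illustrate how small the core of a pattern test is, here is a rough sketch that writes a few fixed byte patterns into a buffer and reads them back. It is not memtester's actual algorithm, and a Python bytearray only exercises whatever physical pages the OS happens to map, so it says nothing about the relocation problem described above. The names pattern_test and StuckBitBuffer are invented for this example; StuckBitBuffer is a fake "bad RAM" used only to show a failure being reported.

```python
# Byte patterns that together set and clear every bit at least once.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def pattern_test(buf, patterns=PATTERNS):
    """Write each pattern to every byte, read it back, collect mismatches.

    Returns a list of (offset, expected, actual) tuples.
    """
    errors = []
    for pattern in patterns:
        for i in range(len(buf)):
            buf[i] = pattern
        for i in range(len(buf)):
            got = buf[i]
            if got != pattern:
                errors.append((i, pattern, got))
    return errors


class StuckBitBuffer:
    """Hypothetical faulty memory: some bits at one offset always read as 1."""

    def __init__(self, size, bad_offset, stuck_mask):
        self._data = bytearray(size)
        self._bad = bad_offset
        self._mask = stuck_mask

    def __len__(self):
        return len(self._data)

    def __setitem__(self, index, value):
        self._data[index] = value

    def __getitem__(self, index):
        value = self._data[index]
        if index == self._bad:
            value |= self._mask  # simulate the stuck-at-1 bits
        return value


if __name__ == "__main__":
    print(pattern_test(bytearray(1024)))                 # healthy buffer
    print(pattern_test(StuckBitBuffer(1024, 10, 0x04)))  # simulated fault
```

The stuck bit only shows up for patterns that try to store a 0 in that position, which is why real testers cycle through many patterns and why subtle faults need long runs.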


  • bugmenot3
    replied
    One time the memory of my mom's PC was broken (after it had been fine for years) and there were a bunch of different errors when booting, and each time when booting it hung in a different state. Tests with memtest showed many errors already within the first second. I was a little worried about the file system writing garbage to the hard drive. Also, I was a little unhappy that a system that was so unstable would try to boot at all costs and that the kernel would not recognize that there were such huge problems.

    Maybe the Linux kernel could do a "simple" short test before booting the operating system if there was an unexpected reboot, or even do plausibility tests in the background like file systems do? By the way, it is nice that e.g. BTRFS helps to detect such RAM errors.

    In addition, a nice way to notify the user is necessary. Errors from ECC RAM can appear in a log file, but it would be nicer if there was a message after boot or a message during boot that has to be accepted.

    I know that some of these errors are rare and certainly not all RAM errors can be detected with such tools, but if there are such errors, the user must be notified, because bad RAM simply needs to be replaced.


  • Auzy
    replied
    It's actually incredible that in this day and age we are building quantum computers and have supercomputers on our desks, but fully-buffered ECC RAM still isn't standard on desktops (which would surely help weed out many faulty RAM modules).


  • Serafean
    replied
    BTRFS checksumming found faulty RAM/Motherboard for me once...
    Once upon a time, memory controllers were in a part of the motherboard called the Northbridge (today this is integrated on the CPU). This chip was silently failing.
    I was getting random reboots (about once a week), and some other weird behaviour. The weirdest of which was data corruption on my disks. After buying a new SSD, and the checksum errors persisting, I finally figured it out. A new LGA775 motherboard found "in the dumpster", and all was good.
    Memtest was useless, as at the time it had a giant false positive bug.
    And the final funny part: the northbridge was an Nvidia chip. Always overheating, and basically no good.
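
The general idea behind checksumming catching this kind of fault can be sketched in a few lines: store a checksum alongside each block when it is written, recompute it on read, and treat a mismatch as corruption somewhere along the path (RAM, controller, or disk). This is only an illustration of the principle. It uses plain CRC-32 from Python's zlib module rather than whatever checksum a given BTRFS filesystem is configured with, and the helper names are invented for this example.

```python
import zlib


def store_block(data: bytes):
    """Return the block together with the checksum computed at write time."""
    return data, zlib.crc32(data)


def verify_block(data: bytes, stored_checksum: int) -> bool:
    """Recompute the checksum on read and compare with the stored one."""
    return zlib.crc32(data) == stored_checksum


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Simulate a single bit flip, as faulty RAM might cause."""
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)


if __name__ == "__main__":
    block, checksum = store_block(b"some file contents" * 100)
    print(verify_block(block, checksum))                  # intact block
    print(verify_block(flip_bit(block, 5, 3), checksum))  # corrupted block
```

A checksum like this reports that the data changed, not where the fault is, which matches the experience above of swapping the SSD before finding the real culprit.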


  • Fedora Developers Brainstorming Options For Better Memory Testing

    Phoronix: Fedora Developers Brainstorming Options For Better Memory Testing

    In looking beyond the massive Fedora 33 release in development, Fedora developers have begun discussing options for allowing better memory testing on their distribution for evaluating possible faulty RAM issues that otherwise often get mixed in with other software bugs and other sporadic behavior...
