**Serafean** · 21 July 2020, 04:52 AM

BTRFS checksumming found faulty RAM/Motherboard for me once...
Once upon a time, memory controllers were in a part of the motherboard called the Northbridge (today this is integrated on the CPU). This chip was silently failing.
I was getting random reboots (about once a week) and some other weird behaviour, the weirdest of which was data corruption on my disks. After I bought a new SSD and the checksum errors persisted, I finally figured it out. A new LGA775 motherboard found "in the dumpster", and all was good.
Memtest was useless, as at the time it had a giant false positive bug.
And the final funny part: the northbridge was an Nvidia chip. Always overheating, and basically no good.

**Auzy** · 21 July 2020, 04:58 AM

It's actually incredible that in this day and age we are building quantum computers and have supercomputers on our desks, but fully-buffered ECC RAM still isn't standard on desktops (which surely would help weed out many faulty RAM modules).

**bugmenot3** · 21 July 2020, 05:30 AM

One time the memory of my mom's PC was broken (after it had been fine for years) and there were a bunch of different errors when booting, and each time when booting it hung in a different state. Tests with memtest showed many errors already within the first second. I was a little worried about the file system writing garbage to the hard drive. Also, I was a little unhappy that a system that was so unstable would try to boot at all costs and that the kernel would not recognize that there were such huge problems.

Maybe the Linux kernel could do a "simple" short test before booting the operating system if there was an unexpected reboot, or even do plausibility tests in the background like file systems do? Nice, by the way, that e.g. BTRFS helps to detect such RAM errors.

In addition, a nice way to notify the user is necessary, errors from ECC RAM can appear in a log file, but it would be nicer if there was a message after boot or a message during boot that has to be accepted.

I know that some of these errors are rare and certainly not all RAM errors can be detected with such tools, but if there are such errors, the user must be notified, because bad RAM simply needs to be replaced.

**milkylainen** · 21 July 2020, 05:35 AM

There is nothing wrong with memtester. It is hw-agnostic and runs on just about anything.
I have ported memtester to various bootloaders and other environments.
PPC of various sorts, ARMs and whatnot.

The only fly in the ointment is relocating whatever is running while the underlying memory is being tested.
Memtest does that just fine; everything else does not.

If they want to test memory live while running Linux, I suggest relocating everything live if that is possible;
otherwise, restrict memtesting to the smallest environment possible, e.g. the bootloader.

The memory testing in itself is trivial.
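
To illustrate why the pattern checking is the easy part, here is a minimal sketch in Python that runs against a simulated buffer rather than real RAM. `SimulatedRam`, its stuck-bit fault, and `pattern_test` are made-up names for illustration, and the four patterns are a common textbook choice, not necessarily what Memtest or memtester run. The hard part mentioned above, relocating whatever is live, is not modeled at all.

```python
from typing import List, Optional

# All zeros, all ones, and the two alternating-bit patterns.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


class SimulatedRam:
    """A bytearray stand-in for RAM; optionally one bit is stuck at 0 (hypothetical fault)."""

    def __init__(self, size: int, stuck_addr: Optional[int] = None, stuck_bit: int = 0):
        self._cells = bytearray(size)
        self._stuck_addr = stuck_addr
        self._stuck_mask = 1 << stuck_bit

    def __len__(self) -> int:
        return len(self._cells)

    def write(self, addr: int, value: int) -> None:
        if addr == self._stuck_addr:
            # The faulty cell silently drops the stuck bit.
            value &= ~self._stuck_mask & 0xFF
        self._cells[addr] = value

    def read(self, addr: int) -> int:
        return self._cells[addr]


def pattern_test(ram: SimulatedRam) -> List[int]:
    """Write each pattern to every address, read it back, return the addresses that mismatched."""
    bad = set()
    for pattern in PATTERNS:
        for addr in range(len(ram)):
            ram.write(addr, pattern)
        for addr in range(len(ram)):
            if ram.read(addr) != pattern:
                bad.add(addr)
    return sorted(bad)


if __name__ == "__main__":
    print(pattern_test(SimulatedRam(64)))
    print(pattern_test(SimulatedRam(64, stuck_addr=10, stuck_bit=3)))
```

Note that a stuck-at-0 bit only shows up with patterns that set that bit, which is one reason real testers cycle through several patterns.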

**milkylainen** · 21 July 2020, 05:44 AM

Originally posted by bugmenot3

Maybe the Linux kernel could do a "simple" short test before booting the operating system if there was an unexpected reboot, or even do plausibility tests in the background like file systems do?

There are no "safe" heuristics in memory tests. Also, on large memory machines, even heuristics take substantial time to complete.
Also, heuristics usually only catch the really broken configs. Which means you'd notice, anyway.

To catch really subtle errors, you need long and agonizingly painful memory tests.
This makes them rather annoying in servers with uptime requirements, when doing tests from cold boots.

Originally posted by bugmenot3

In addition, a nice way to notify the user is necessary, errors from ECC RAM can appear in a log file, but it would be nicer if there was a message after boot or a message during boot that has to be accepted.

The kernel has had this for ages. It's the EDAC subsystem.
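
For reference, EDAC drivers expose per-memory-controller error counters in sysfs, so a script can read them without parsing logs. A minimal sketch, assuming an EDAC driver is loaded and the usual layout `/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count`; the helper names `read_edac_counts` and `summarize` are made up for illustration.

```python
from pathlib import Path
from typing import Dict, Tuple

# Usual sysfs location of EDAC memory controllers (assumed present only with ECC + driver).
EDAC_MC_DIR = Path("/sys/devices/system/edac/mc")


def read_edac_counts(base: Path = EDAC_MC_DIR) -> Dict[str, Tuple[int, int]]:
    """Return {controller: (corrected, uncorrected)} for every mcN directory under base."""
    counts: Dict[str, Tuple[int, int]] = {}
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed counter: skip this controller
        counts[mc.name] = (ce, ue)
    return counts


def summarize(counts: Dict[str, Tuple[int, int]]) -> str:
    """Turn the counters into a short human-readable report."""
    if not counts:
        return "No EDAC memory controllers found (no ECC RAM, or no EDAC driver loaded)."
    lines = []
    for name, (ce, ue) in counts.items():
        status = "OK" if ce == 0 and ue == 0 else "ERRORS LOGGED"
        lines.append(f"{name}: corrected={ce} uncorrected={ue} [{status}]")
    return "\n".join(lines)


if __name__ == "__main__":
    print(summarize(read_edac_counts()))
```

Something like this could be run at login or from a timer to surface non-zero counters, though how to present that to the user is left open here.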

**bugmenot3** · 21 July 2020, 06:27 AM

Originally posted by milkylainen

There are no "safe" heuristics in memory tests. Also, on large memory machines, even heuristics take substantial time to complete.
Also, heuristics usually only catch the really broken configs. Which means you'd notice, anyway.

To catch really subtle errors, you need long and agonizingly painful memory tests.
This makes them rather annoying in servers with uptime requirements, when doing tests from cold boots.

That's why I wrote that I know not all errors can be detected with such tools, and that a basic checkup should only be performed after an unexpected shutdown or reboot, not on every cold boot. I also was not primarily thinking about server environments (they will probably have admins who know what they are doing), but about "normal" PCs, where the PC crashes during boot and after a restart tries again and again, and it is not clear to the end user that it is a RAM failure, although that is easy to find out via memtest.

The kernel has had this for ages. It's the EDAC subsystem

I know that the kernel logs those errors, but how does the user of a Linux distribution notice them if they never look at the logs?

**waxhead** · 21 July 2020, 07:43 AM

Memory testing or "memory scrubbing" should definitely be part of the kernel. Servers (with some load) usually use ECC RAM (and so should everything else in my opinion), but sadly that is not the case.

As I see it, what is needed - or at least what could be quite useful - is to utilize free CPU time and very slowly and gradually walk the physical memory locations.
Since all memory (to my knowledge) is accessed through virtual addresses, the memory subsystem could relocate data from untested pages to tested memory locations while the old page location is temporarily freed up and reserved for testing. Once a page has been tested it is considered good; if bad memory is found, it could simply be blacklisted and the user could be warned. You could (and probably should) blacklist the entire memory module.

The system could slowly and steadily walk the physical memory chunk by chunk at a set interval (a rate limit should be configurable as well as a CPU time limit). Eventually, after hours or even days, all memory on a system should be tested and the process could repeat. While this approach may be dead slow compared to running a dedicated memtest86+, and while it would probably impact performance (CPU caches), it could be a process that runs continuously on idle (or part of the idle) time. If reliability is important to you, it would still be better than today, where you run a memtest and, once it passes, don't run it again for months or even years.

This approach would make it possible to test nearly all physical memory as the system runs, as far as I can tell. The only downside would be that lots of memory manufacturers would probably fight against such functionality like it was the plague itself.
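
To make the idea a bit more concrete, here is a toy simulation in Python of that walk: one frame per step, in-use frames are migrated to an already tested spare before being tested, and failing frames are blacklisted. Everything here (`ScrubberSim`, the `faulty` set standing in for a real test result, strings as page contents) is hypothetical and only illustrates the bookkeeping; it has nothing to do with how the kernel's memory management is actually structured.

```python
from typing import Dict, List, Optional, Set


class ScrubberSim:
    """Toy model: physical frames, a page table, and a scrubber that walks the frames."""

    def __init__(self, num_frames: int, faulty: Set[int]):
        self.frames: List[Optional[str]] = [None] * num_frames  # None means free
        self.page_table: Dict[int, int] = {}  # virtual page -> physical frame
        self.faulty = set(faulty)             # frames whose (simulated) test fails
        self.tested: Set[int] = set()
        self.blacklist: Set[int] = set()
        self.cursor = 0

    def map_page(self, vpage: int, frame: int, content: str) -> None:
        self.frames[frame] = content
        self.page_table[vpage] = frame

    def _spare_frame(self) -> Optional[int]:
        """Find a free frame that has already passed a test."""
        for f in range(len(self.frames)):
            if self.frames[f] is None and f in self.tested and f not in self.blacklist:
                return f
        return None

    def step(self) -> str:
        """Process one frame and advance; meant to be called at a rate-limited interval."""
        frame = self.cursor
        self.cursor = (self.cursor + 1) % len(self.frames)
        if frame in self.blacklist:
            return "skipped"
        if self.frames[frame] is not None:
            spare = self._spare_frame()
            if spare is None:
                return "deferred"  # nowhere safe to move the data yet; retry next pass
            vpage = next(v for v, f in self.page_table.items() if f == frame)
            self.frames[spare] = self.frames[frame]
            self.frames[frame] = None
            self.page_table[vpage] = spare
        if frame in self.faulty:
            self.tested.discard(frame)
            self.blacklist.add(frame)
            return "blacklisted"
        self.tested.add(frame)
        return "ok"


if __name__ == "__main__":
    sim = ScrubberSim(4, faulty={1})
    sim.map_page(7, 1, "data")
    print([sim.step() for _ in range(4)], sim.blacklist, sim.page_table)
```

In this toy, data sitting on a bad frame is copied out before the fault is found, so it may already be corrupt; the sketch only shows that the frame stops being used afterwards.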

**Auzy** · 21 July 2020, 07:50 AM

Originally posted by waxhead

Memory testing or "memory scrubbing" should definitely be part of the kernel. Servers (with some load) usually use ECC RAM (and so should everything else in my opinion), but sadly that is not the case.

As I see it, what is needed - or at least what could be quite useful - is to utilize free CPU time and very slowly and gradually walk the physical memory locations.
Since all memory (to my knowledge) is accessed through virtual addresses, the memory subsystem could relocate data from untested pages to tested memory locations while the old page location is temporarily freed up and reserved for testing. Once a page has been tested it is considered good; if bad memory is found, it could simply be blacklisted and the user could be warned. You could (and probably should) blacklist the entire memory module.

The system could slowly and steadily walk the physical memory chunk by chunk at a set interval (a rate limit should be configurable as well as a CPU time limit). Eventually, after hours or even days, all memory on a system should be tested and the process could repeat. While this approach may be dead slow compared to running a dedicated memtest86+, and while it would probably impact performance (CPU caches), it could be a process that runs continuously on idle (or part of the idle) time. If reliability is important to you, it would still be better than today, where you run a memtest and, once it passes, don't run it again for months or even years.

This approach would make it possible to test nearly all physical memory as the system runs, as far as I can tell. The only downside would be that lots of memory manufacturers would probably fight against such functionality like it was the plague itself.

Another consideration could be power consumption, though. But that would also be an issue with ECC, I guess.

**zxy_thf** · 21 July 2020, 08:07 AM

The current situation has less to do with ECC. It's more that manufacturers are SPAMMING the market with overclocked RAM.

Nowadays it is really hard to find non-overclocked RAM at 2666MHz or higher.
All you can buy is 2133MHz chips XMPed to 3200 or something -- unless you want to pay the premium and go for a full server-grade build.

Fedora Developers Brainstorming Options For Better Memory Testing
