**tsuru** · 10 August 2018, 08:36 PM

I hope this isn't a conflict, but Level One has some tests of a 1807B board using the Phoronix Test Suite: https://level1techs.com/article/sapp...-fpv5-review-0

**Michael** · 10 August 2018, 08:49 PM

Originally posted by tsuru

I hope this isn't a conflict, but Level One has some tests of a 1807B board using the Phoronix Test Suite: https://level1techs.com/article/sapp...-fpv5-review-0

I don't mind other people using PTS at all, though I was surprised, looking at that article, that they really didn't compare the performance directly to anything.

**linuxgeex** · 11 August 2018, 04:47 AM

Wow I thought the EDAC code had stopped being maintained. After the last few large-scale tests were performed, it was proven that unless you had a bad memory module the odds of getting a double fault in ECC memory at sea level were something silly like 49 million years MTBF per 1Mbit*hour, and everyone stopped caring about testing any more.

**starshipeleven** · 11 August 2018, 06:42 PM

Originally posted by linuxgeex

Wow I thought the EDAC code had stopped being maintained. After the last few large-scale tests were performed, it was proven that unless you had a bad memory module the odds of getting a double fault in ECC memory at sea level were something silly like 49 million years MTBF per 1Mbit*hour, and everyone stopped caring about testing any more.

EDAC is infrastructure for reading ECC errors reported by the hardware (not necessarily in RAM, but also in PCIe or other system buses that support it), so the system can react or just log them somehow (as the facilities to do so in the UEFI may or may not be present and usable).

It is probably one of the few reliable ways of actually testing whether ECC RAM is working at all: do shenanigans on the RAM modules (like covering some data traces) and check through EDAC whether errors are detected. Most "ECC checking" software only looks at the registers in the processor to see if ECC is enabled, but doesn't tell you whether it is actually working.
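As a rough illustration of "checking through EDAC", here is a minimal Python sketch that reads the per-controller error counters the kernel exposes in sysfs. It assumes the usual layout under `/sys/devices/system/edac/mc/` with `ce_count` (corrected) and `ue_count` (uncorrected) files per memory controller; the function name `read_edac_counts` is made up for this example, and an empty result only means no EDAC driver is loaded for the platform.

```python
from pathlib import Path

# Usual EDAC sysfs location on Linux (assumed here); the root is a
# parameter so the function can be pointed at a test directory.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {controller: {"ce": int|None, "ue": int|None}} for each mc* dir."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or no ECC-capable controller
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

If the corrected count goes up after you deliberately disturb a module, ECC is actually doing something; if it never moves, either nothing was disturbed or ECC is not really active.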

I don't understand why you are talking of the chance of double bitflip in ECC affecting EDAC development.

**bridgman** · 12 August 2018, 11:17 PM

Originally posted by linuxgeex

it was proven that unless you had a bad memory module the odds of getting a double fault in ECC memory at sea level were something silly like 49 million years MTBF per 1Mbit*hour, and everyone stopped caring about testing any more.

You don't need a double bit flip to benefit from ECC/EDC... you only need a single bit flip to corrupt data.

In a properly running system with error correction and a hardware (or software) scrubber you are normally detecting and fixing single bit flips sufficiently quickly that a double bit error almost never happens.
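To make the scrubbing argument concrete, here is a toy SECDED (single-error-correct, double-error-detect) sketch in Python using an extended Hamming(8,4) code. This is only an illustration: real ECC DIMMs protect 64-bit words with wider codes in hardware, and the function names `encode`, `decode` and `scrub` are invented for the example.

```python
def encode(nibble):
    """Encode 4 data bits into 8 bits: index 0 is overall parity, 1..7 Hamming."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4,5,6,7
    code[0] = sum(code[1:]) % 2            # makes the total number of ones even
    return code


def decode(code):
    """Return (nibble, status); nibble is None when the word is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                # XOR of set positions is 0 if valid
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        code[syndrome] ^= 1                # single flip; syndrome 0 = parity bit
        status = "corrected"
    else:
        return None, "uncorrectable"       # even parity, bad syndrome: two flips
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


def scrub(code):
    """Rewrite a correctable word with a clean encoding, as a scrubber would."""
    nibble, _ = decode(code)
    return code if nibble is None else encode(nibble)
```

With this, one flipped bit decodes as "corrected", two flips in the same word decode as "uncorrectable", and calling `scrub` between the two flips keeps the word recoverable, which is the point about scrubbers made above.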

**drSeehas** · 13 August 2018, 03:19 AM

Originally posted by bridgman

You don't need a double bit flip to benefit from ECC/EDC... you only need a single bit flip to corrupt data.

In a properly running system with error correction and a hardware (or software) scrubber you are normally detecting and fixing single bit flips sufficiently quickly that a double bit error almost never happens.

So it is a shame that Ryzen CPUs support ECC, but AM4 motherboards officially don't.

**chithanh** · 13 August 2018, 06:07 AM

Originally posted by linuxgeex

unless you had a bad memory module the odds of getting a double fault in ECC memory at sea level were something silly like 49 million years MTBF per 1Mbit*hour

It is possible to provoke dual bit flips with Rowhammer, in less than 49 million years.

**linuxgeex** · 21 August 2018, 10:10 AM

Originally posted by chithanh

It is possible to provoke dual bit flips with Rowhammer, in less than 49 million years.

That's one example of faulty modules / faulty refresh timing configuration, yes.

**linuxgeex** · 21 August 2018, 10:14 AM

Originally posted by bridgman

You don't need a double bit flip to benefit from ECC/EDC... you only need a single bit flip to corrupt data.

Is there a software memory bitflip mitigation in place in the Linux kernel based on EDAC? That would be news to me. AFAIK it only reports the number of errors which have been detected in which modules. Software which is aware of the reporting then has a chance to discard or repeat the work. I.e. long-running scientific threads can be snapshotted and rolled back to before the error was detected and then continue, or a database could roll its journal back and re-commit its witness log. But I'm unaware of any software error correction in the kernel that applications can rely on without work on their own part. It would be nice if the VFS would flush buffers for the affected module so that cached blocks would get reloaded.
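The snapshot-and-redo idea could look roughly like the Python sketch below. Everything in it is hypothetical: `run_with_rollback` is an invented helper, and `read_error_count` stands in for whatever the application uses to poll the error counters (for example a sum of the EDAC `ue_count` values). It also assumes the process survives the error at all, which may not hold if the kernel treats an uncorrected error as fatal.

```python
import copy


def run_with_rollback(state, steps, read_error_count, max_retries=3):
    """Apply each step to state; redo a step if the error counter moved during it."""
    for step in steps:
        for _ in range(max_retries + 1):
            snapshot = copy.deepcopy(state)  # checkpoint before the step
            before = read_error_count()
            result = step(state)             # the step may mutate state in place
            if read_error_count() == before:
                state = result               # no new errors reported: commit
                break
            state = snapshot                 # errors reported: roll back, retry
        else:
            raise RuntimeError("memory errors kept occurring; giving up")
    return state
```

The application still has to do all the work itself here (checkpointing, polling, retrying), which is exactly the complaint: nothing in this loop is provided by the kernel.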

Linux EDAC Support For AMD's Great Horned Owl
