Linux EDAC Support For AMD's Great Horned Owl

  • Linux EDAC Support For AMD's Great Horned Owl

    Phoronix: Linux EDAC Support For AMD's Great Horned Owl

    The latest Linux kernel patch is for supporting ECC error detection via the Error Detection And Correction (EDAC) code with AMD's Great Horned Owl...

  • #2
    I hope this isn't a conflict, but Level One has some tests of a 1807B board using the Phoronix Test Suite: https://level1techs.com/article/sapp...-fpv5-review-0

    • #3
      Originally posted by tsuru
      I hope this isn't a conflict, but Level One has some tests of a 1807B board using the Phoronix Test Suite: https://level1techs.com/article/sapp...-fpv5-review-0
      I don't mind other people using PTS at all, though I was surprised, looking at that article, that they didn't really compare the performance directly to anything.
      Michael Larabel
      https://www.michaellarabel.com/

      • #4
        Wow I thought the EDAC code had stopped being maintained. After the last few large-scale tests were performed, it was proven that unless you had a bad memory module the odds of getting a double fault in ECC memory at sea level were something silly like 49 million years MTBF per 1Mbit*hour, and everyone stopped caring about testing any more.

        • #5
          Originally posted by linuxgeex
          Wow I thought the EDAC code had stopped being maintained. After the last few large-scale tests were performed, it was proven that unless you had a bad memory module the odds of getting a double fault in ECC memory at sea level were something silly like 49 million years MTBF per 1Mbit*hour, and everyone stopped caring about testing any more.
          EDAC is infrastructure for reading ECC errors reported by the hardware (not necessarily in RAM, but also on PCIe or other system buses that support it), so the system can react to them or just log them (the facilities to do so in UEFI may or may not be present and usable).

          It is probably one of the few reliable ways of actually testing whether ECC RAM is working at all: do shenanigans on the RAM modules (like covering some data traces) and check through EDAC whether errors are detected. Most "ECC checking" software only looks at the registers in the processor to see if ECC is enabled, but doesn't say whether it is actually working.
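
For anyone who wants to check this from userspace, here is a minimal sketch of reading the per-memory-controller counters that the EDAC core exposes in sysfs. It assumes an EDAC driver is loaded for the platform and that the counters live under `/sys/devices/system/edac/mc/mc*/` as `ce_count` (corrected errors) and `ue_count` (uncorrected errors). The function name `read_edac_counts` and the `root` parameter are hypothetical, chosen only for illustration.

```python
from pathlib import Path

# Default sysfs location of EDAC memory controllers (assumed to be present
# only when an EDAC driver is loaded for this platform).
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {"mc0": {"ce_count": int|None, "ue_count": int|None}, ...}.

    A value of None means the counter file was missing or unreadable.
    An empty dict means no memory controllers were found under `root`.
    """
    root = Path(root)
    if not root.is_dir():
        return {}

    counts = {}
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[mc.name] = entry
    return counts
```

Calling `read_edac_counts()` before and after provoking errors on a module and comparing the two results would show whether the hardware is actually reporting anything, which is the point being made above. An empty result more likely means no EDAC driver is bound than that the memory is healthy.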

          I don't understand why the chance of a double bit flip in ECC memory would affect EDAC development.

          • #6
            Originally posted by linuxgeex
            it was proven that unless you had a bad memory module the odds of getting a double fault in ECC memory at sea level were something silly like 49 million years MTBF per 1Mbit*hour, and everyone stopped caring about testing any more.
            You don't need a double bit flip to benefit from ECC/EDC... you only need a single bit flip to corrupt data.

            In a properly running system with error correction and a hardware (or software) scrubber you are normally detecting and fixing single bit flips sufficiently quickly that a double bit error almost never happens.
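
To make the single-flip versus double-flip distinction concrete, here is a toy sketch of a SECDED (single error correct, double error detect) code: Hamming(7,4) plus one overall parity bit. This is only an illustration on 4 data bits. Real ECC DIMMs protect 64-bit words with wider codes, and the function names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword.

    Bit 0 is the overall parity bit; bits 1..7 are Hamming positions,
    with parity at positions 1, 2, 4 and data at positions 3, 5, 6, 7.
    """
    bits = [0] * 8
    bits[3] = (nibble >> 0) & 1
    bits[5] = (nibble >> 1) & 1
    bits[6] = (nibble >> 2) & 1
    bits[7] = (nibble >> 3) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    for pos in range(1, 8):
        bits[0] ^= bits[pos]
    return sum(b << i for i, b in enumerate(bits))


def decode(word):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]

    # Syndrome: XOR of the positions of all set Hamming bits (0 if consistent).
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos

    # Overall parity across all 8 bits (0 for an even number of flips).
    overall = 0
    for b in bits:
        overall ^= b

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips, assumed to be one: the syndrome points at the
        # flipped bit (syndrome 0 means the overall parity bit itself).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: two flips, cannot be repaired.
        return None, "uncorrectable"

    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, status
```

A single flipped bit comes back as the original data with status `corrected`, which corresponds to what a scrubber fixes in the background. Two flips in the same word come back as `uncorrectable`, which is the case that scrubbing is meant to keep rare.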
            Test signature

            • #7
              Originally posted by bridgman

              You don't need a double bit flip to benefit from ECC/EDC... you only need a single bit flip to corrupt data.

              In a properly running system with error correction and a hardware (or software) scrubber you are normally detecting and fixing single bit flips sufficiently quickly that a double bit error almost never happens.
              So it is a shame that Ryzen CPUs support ECC, but AM4 motherboards officially don't.

              • #8
                Originally posted by linuxgeex
                unless you had a bad memory module the odds of getting a double fault in ECC memory at sea level were something silly like 49 million years MTBF per 1Mbit*hour
                It is possible to provoke dual bit flips with Rowhammer, in less than 49 million years.

                • #9
                  Originally posted by chithanh
                  It is possible to provoke dual bit flips with Rowhammer, in less than 49 million years.
                  That's one example of faulty modules / faulty refresh timing configuration, yes.

                  • #10
                    Originally posted by bridgman

                    You don't need a double bit flip to benefit from ECC/EDC... you only need a single bit flip to corrupt data.
                    Is there a software memory bit-flip mitigation in place in the Linux kernel based on EDAC? That would be news to me. AFAIK it only reports the number of errors that have been detected and in which modules. Software that is aware of the reporting then has a chance to discard or repeat the work. I.e., long-running scientific threads can be snapshotted and rolled back to before the error was detected and then continue, or a database could roll its journal back and re-commit its witness log. But I'm unaware of any software error correction in the kernel that applications can rely on without work on their own part. It would be nice if the VFS would flush buffers for the affected module so that cached blocks would get reloaded.
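
A rough sketch of the "snapshot and roll back" idea described above, done entirely in the application. It assumes the uncorrected-error counters are readable as `ue_count` under `/sys/devices/system/edac/mc/mc*/`. The names `total_uncorrected` and `run_with_rollback` are hypothetical, and a plain deep copy stands in for a real checkpoint mechanism.

```python
import copy
from pathlib import Path


def total_uncorrected(root="/sys/devices/system/edac/mc"):
    """Sum ue_count over all memory controllers found under `root`."""
    total = 0
    root = Path(root)
    if not root.is_dir():
        return total
    for mc in root.glob("mc[0-9]*"):
        try:
            total += int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            pass  # counter missing or unreadable: treat as no information
    return total


def run_with_rollback(state, steps, read_errors=total_uncorrected, max_retries=3):
    """Apply each step to a copy of `state`; commit only if no new error appeared.

    If the error counter changed while a step ran, the copy is discarded and
    the step is repeated from the last committed state.
    """
    for step in steps:
        for _attempt in range(max_retries + 1):
            before = read_errors()
            candidate = step(copy.deepcopy(state))
            if read_errors() == before:
                state = candidate  # counter unchanged: commit this step
                break
        else:
            raise RuntimeError("memory errors persisted across retries")
    return state
```

Passing `read_errors` as a parameter keeps the polling source swappable, for example a per-module counter instead of a system-wide sum. The sketch cannot tell whether an error actually touched the application's own pages, so it is conservative and repeats work on any change.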
                    Last edited by linuxgeex; 21 August 2018, 11:04 AM.
