Linux EDAC Support For AMD's Great Horned Owl


  • #11
    Originally posted by linuxgeex View Post
    Is there a software bitflip mitigation in place in the Linux kernel based on EDAC? That would be news to me. AFAIK it only reports how many errors have been detected and in which modules. Software that is aware of the reporting then has a chance to discard or repeat the work. I.e. long-running scientific threads can be snapshotted and rolled back to before the error was detected and then continue, or a database could roll its journal back and re-commit its witness log. But I'm unaware of any software error correction in the kernel that applications can rely on without work on their own part.
    I am not aware of any software mitigation, and like you I understand EDAC to primarily be a place to store errors detected (and usually corrected) by hardware.
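For reference, the error counts mentioned here are exposed by the EDAC memory-controller drivers through sysfs. A minimal Python sketch of reading them, assuming the usual `/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count` files are present (they only exist when an EDAC driver is loaded); the function name `read_mc_counts` is made up for illustration:

```python
from pathlib import Path

# Standard sysfs location for EDAC memory controllers (absent without a driver).
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_mc_counts(root=EDAC_MC_ROOT):
    """Return {"mc0": {"ce": int, "ue": int}, ...} for each controller found."""
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())  # corrected errors
            ue = int((mc / "ue_count").read_text().strip())  # uncorrected errors
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


if __name__ == "__main__":
    for name, c in read_mc_counts().items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

This only reads what the hardware and driver have already logged; it does not correct anything itself.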

    Normally application or framework notification is handled via separate paths, either by the application checking vendor-specific error codes or (more recently) robustness-type extensions being added to various APIs.

    That said, the general industry philosophy seems to be drawing a line between <HW + OS> and <application>, i.e. the HW + OS is responsible for keeping the hardware reliable and understanding when something has gone wrong, while the application is supposed to run in a protected and reliable world and not have to care about things like hardware failures.

    The missing link AFAICS is that the model was developed when mainframes primarily did batch processing, and so things like checkpoint/restart processing could be done at an OS level. Now that we have dozens of different frameworks all doing sort-of-batch-processing in a vendor-independent way it feels like some more standardization is required. There seems to be a fair amount of that happening on the CPU side and a few of us are working on extending that to GPUs as well.
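The checkpoint/restart idea discussed in this post can be sketched at application level roughly as follows. This is illustrative Python only: `run_with_rollback` and the `read_ue_count` callable are hypothetical names, and it assumes the application has some way to read a running total of uncorrected errors (for instance by summing the EDAC `ue_count` files). In practice an uncorrected error may well be handled more drastically by the platform than a counter bump, so treat this as a picture of the idea rather than a recipe.

```python
import copy


def run_with_rollback(state, steps, read_ue_count, max_retries=3):
    """Apply each step in order, redoing a step if uncorrected errors were seen.

    state          -- any deep-copyable object; the last good copy is the checkpoint
    steps          -- callables that take a state and return the new state
    read_ue_count  -- hypothetical callable returning the running total of
                      uncorrected errors (e.g. summed from EDAC ue_count files)
    """
    for step in steps:
        for _attempt in range(max_retries + 1):
            before = read_ue_count()
            # Work on a copy so the checkpoint stays untouched if we must redo.
            candidate = step(copy.deepcopy(state))
            if read_ue_count() == before:
                state = candidate  # no new errors: commit and move on
                break
        else:
            raise RuntimeError("step kept coinciding with uncorrected errors")
    return state
```

The design choice here is to commit a step's result only after the counter is seen to be unchanged, so the previous state doubles as the checkpoint.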


    • #12
      Originally posted by starshipeleven View Post
      EDAC is infrastructure for reading ECC errors reported by the hardware (not necessarily in RAM, but also on PCIe or other system buses that support it), so the system can react or just log them somehow (as the facilities to do so in the UEFI may or may not be present and usable).

      It is probably one of the few reliable ways of actually testing whether ECC RAM is working at all, by doing shenanigans on the RAM modules (like covering some data traces) and checking through EDAC whether errors are detected. Most "ECC checking" software only looks at the registers in the processor to see if ECC is enabled, but doesn't say whether it is actually working.

      I don't understand why you are talking of the chance of double bitflip in ECC affecting EDAC development.
      Because errors in things other than memory are usually corrected by drivers repeating ops until they get an error-free result, and applications can ignore those. That's a solved problem. I'd be shocked if the majority of drivers bother implementing EDAC.

      BTW a heat gun is an easy way to provoke double faults from ECC memory modules to help quickly identify whether you have a faulty one.



      • #13
        Originally posted by linuxgeex View Post
        Because errors in things other than memory are usually corrected by drivers repeating ops until they get an error-free result, and applications can ignore those. That's a solved problem. I'd be shocked if the majority of drivers bother implementing EDAC.
        Another person that didn't get it. EDAC is just a LOGGER for ECC errors reported by hardware. ECC happens in HARDWARE, you can only ENABLE or DISABLE it by writing to some registers.

        BTW a heat gun is an easy way to provoke double faults from ECC memory modules to help quickly identify whether you have a faulty one.
        It's also a great way to ruin the DIMM, or the mobo, or both, and risk your work and your life if you just destroyed company hardware.

        Using tape on some data traces does not have the same risks; at most it causes a system crash.
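The check being argued about here (provoke some errors, then see whether EDAC actually counted them) comes down to comparing two snapshots of the counters. A small Python sketch, assuming each snapshot is a dict shaped like `{"mc0": {"ce": 3, "ue": 0}}` built from the `ce_count` and `ue_count` sysfs files; the function name and dict layout are illustrative, not anything EDAC defines:

```python
def new_errors(before, after):
    """Return per-controller increases in ce/ue counts between two snapshots.

    Both arguments are assumed to look like {"mc0": {"ce": 3, "ue": 0}, ...}.
    Controllers with no change are left out of the result.
    """
    delta = {}
    for mc, now in after.items():
        # A controller missing from the first snapshot is treated as starting at zero.
        prev = before.get(mc, {"ce": 0, "ue": 0})
        d_ce = now["ce"] - prev["ce"]
        d_ue = now["ue"] - prev["ue"]
        if d_ce or d_ue:
            delta[mc] = {"ce": d_ce, "ue": d_ue}
    return delta
```

An empty result after a deliberate attempt to provoke errors would suggest that errors are either not being detected or not being reported.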



        • #14
          Originally posted by linuxgeex View Post
          That's one example of faulty modules / faulty refresh timing configuration, yes.
          Rowhammer doesn't require faulty modules to work.
          It works on virtually all DDR3 and also on many DDR4 systems.



          • #15
            Originally posted by chithanh View Post
            Rowhammer doesn't require faulty modules to work.
            It works on virtually all DDR3 and also on many DDR4 systems.
            Apparently you think that Rowhammer isn't exposing a hardware flaw. That's a novel point of view.



            • #16
              linuxgeex
              Rowhammer exploits characteristics of the DRAM cells. Even if they work completely to spec (and thus are not faulty), they can be susceptible.

              It is more a design flaw in DRAM than a hardware flaw.



              • #17
                Originally posted by starshipeleven View Post
                Another person that didn't get it. EDAC is just a LOGGER for ECC errors reported by hardware. ECC happens in HARDWARE, you can only ENABLE or DISABLE it by writing to some registers.
                Sorry... what are you suggesting that I don't get?

                I said that EDAC is useful to watch for ECC double faults so that long-running scientific applications can respond by repeating work that may have been corrupted. Are you saying that's wrong?

                I said that EDAC in device drivers can actually correct errors (by repeating reads for example) whereas it can't correct corrupted memory, only report it. Are you saying that's wrong?

                I said that the reporting is pointless for most people, because the drivers do the right thing and ECC double faults at sea level are so infrequent that they're not worth the effort of considering. Are you saying that's wrong?

                I said that most drivers are responding to detected errors by repeating operations so they can return good data, and that I believe the majority of drivers don't even bother reporting those errors, and the whole premise of my first post was that most new drivers being written aren't bothering to implement the reporting, so EDAC is largely unmaintained, and so I was surprised to see it mentioned that it had received an update. Are you saying that's wrong?


                Last edited by linuxgeex; 27 August 2018, 08:44 AM.



                • #18
                  Originally posted by linuxgeex View Post
                  I said that EDAC is useful to watch for ECC double faults so that long-running scientific applications can respond by repeating work that may have been corrupted. Are you saying that's wrong?
                  Quite frankly, I have difficulties finding where you said this. Maybe you tried to condense too much and garbled up your answer?

                  I said that EDAC in device drivers can actually correct errors (by repeating reads for example) whereas it can't correct corrupted memory, only report it. Are you saying that's wrong?
                  AFAIK drivers don't have any say in ECC in any shape or form. It's always handled by the hardware unless you like wasting a ton of performance with software-side double checks of incoming data. The only case where I know the driver handles ECC is for reading/writing raw NAND flash (like in crappy embedded devices that can't afford eMMC); in all other situations it's handled by the chipset, storage controller or RAM controller.

                  I said that the reporting is pointless for most people, because the drivers do the right thing and ECC double faults at sea level are so infrequent that they're not worth the effort of considering. Are you saying that's wrong?
                  Quite frankly, I have difficulties finding where you said this. Maybe you tried to condense too much and garbled up your answer?

                  I said that most drivers are responding to detected errors by repeating operations so they can return good data, and that I believe the majority of drivers don't even bother reporting those errors, and the whole premise of my first post was that most new drivers being written aren't bothering to implement the reporting, so EDAC is largely unmaintained, and so I was surprised to see it mentioned that it had received an update. Are you saying that's wrong?
                  Yes, this is wrong. That's not what EDAC is for.
                  EDAC is there to report ECC errors happening IN HARDWARE and dealt with IN HARDWARE. Mostly ECC RAM and ECC on PCIe or other system buses, using a dedicated hardware interface to do this reporting.

                  Device drivers never had any support for EDAC, as each device implements its own ECC (or not) however it likes, without using the system interface used by EDAC.



                  • #19
                    Originally posted by starshipeleven View Post
                    AFAIK drivers don't have any say in ECC in any shape or form. It's always handled by the hardware unless you like wasting a ton of performance with software-side double checks of incoming data. The only case where I know the driver handles ECC is for reading/writing raw NAND flash (like in crappy embedded devices that can't afford eMMC); in all other situations it's handled by the chipset, storage controller or RAM controller.
                    The drivers at best can toggle whether the hardware error correction is enabled, receive interrupts when there are errors, or poll for error data. Oh, and report what they find to EDAC. My point is that the vast majority do not, even though it is the mandate of EDAC to harvest that data. And that is why I say it is largely unmaintained. It has gone years without any significant work on reporting new kinds of reportable data. For example: storage (controllers, smart devices like HDDs, raw devices like MTD), network controllers, USB hosts. All of them have error detection; none of them are reporting it via EDAC. As a result, the RAS goals of EDAC are diminished.

                    Originally posted by starshipeleven View Post
                    Yes, this is wrong. That's not what EDAC is for.
                    EDAC is there to report ECC errors happening IN HARDWARE and dealt with IN HARDWARE. Mostly ECC RAM and ECC on PCIe or other system buses, using a dedicated hardware interface to do this reporting.

                    Device drivers never had any support for EDAC, as each device implements its own ECC (or not) however it likes, without using the system interface used by EDAC.
                    You like your allcaps a lot don't you?

                    You're correct that the /mc/ and /pci/ EDAC sysfs paths have a lot more support than anything else. Applied Micro's X-Gene is one of the few which exports SATA and Net EDAC data, and AMD also has support for HyperTransport CE reporting. Yelling doesn't change the fact that EDAC would be more valuable to all concerned if it reported all sources of correctable errors. And sadly it doesn't. And no, I'm not volunteering to fix it, lol... I don't feel like I have the free time right now.
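To see which error sources a given machine actually reports, one can simply list what the EDAC drivers have registered in sysfs. A rough Python sketch, assuming the standard `/sys/devices/system/edac` root; `edac_sources` is a made-up helper name and the output depends entirely on which drivers are loaded:

```python
from pathlib import Path


def edac_sources(root="/sys/devices/system/edac"):
    """Map each directory under the EDAC root (e.g. 'mc', 'pci') to its subdirectories."""
    root = Path(root)
    if not root.is_dir():
        return {}  # EDAC not available on this system
    return {
        cls.name: sorted(p.name for p in cls.iterdir() if p.is_dir())
        for cls in sorted(root.iterdir())
        if cls.is_dir()
    }


if __name__ == "__main__":
    for cls, entries in edac_sources().items():
        print(cls, entries)
```

On a real system the listing may also include generic sysfs housekeeping directories alongside the actual error sources.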
