Linux 6.9 Adding AMD MI300 Row Retirement Support For Problematic HBM Memory


  • Linux 6.9 Adding AMD MI300 Row Retirement Support For Problematic HBM Memory

    Phoronix: Linux 6.9 Adding AMD MI300 Row Retirement Support For Problematic HBM Memory

    For the upcoming Linux 6.9 kernel cycle there are a number of AMD Instinct MI300 additions to the EDAC (Error Detection And Correction) and RAS (Reliability, Availability and Serviceability) drivers...


  • #2
    AMD MI300 systems have on-die High Bandwidth Memory. This memory has a relatively higher error rate, and it is not individually replaceable like DIMMs...
    When I read this, my brain translates it to "do not buy or use an AMD MI300 until and unless they get this corrected by using better quality hardware."

    These types of issues are not going to show up in benchmarks with freshly unboxed products that are sent to reviewers; they are going to show up with continuous use.

    I would never spend $20, much less 20 grand, on a product that needs this kind of workaround.



    • #3
      Originally posted by sophisticles View Post

      When I read this, my brain translates it to "do not buy or use an AMD MI300 until and unless they get this corrected by using better quality hardware."

      These types of issues are not going to show up in benchmarks with freshly unboxed products that are sent to reviewers; they are going to show up with continuous use.

      I would never spend $20, much less 20 grand, on a product that needs this kind of workaround.
      This doesn't look like a workaround, but a conscious reliability improvement. Hardware isn't infallible; the best you can do is handle failures gracefully.
      NVIDIA implements similar mechanisms (page retirement and, since the A100, row remapping) in its datacenter offerings.



      • #4
        These cards have to use HBM for the bandwidth. Unless you want cards the size of a dustbin lid to fit all the GDDR chips.



        • #5
          Originally posted by sophisticles View Post
          When I read this, my brain translates it to "do not buy or use an AMD MI300 until and unless they get this corrected by using better quality hardware." These types of issues are not going to show up in benchmarks with freshly unboxed products that are sent to reviewers; they are going to show up with continuous use. I would never spend $20, much less 20 grand, on a product that needs this kind of workaround.
          Hmm... you may want to avoid any products with current generation memory then, including DDR5, since they all have higher data rates -> higher error rates -> built-in RAS features to maintain the same reliability levels as previous generations.

          This is a feature designed into the HBM chips, not a workaround.



          • #6
            Originally posted by bridgman View Post

            Hmm... you may want to avoid any products with current generation memory then, including DDR5, since they all have higher data rates -> higher error rates -> built-in RAS features to maintain the same reliability levels as previous generations.

            This is a feature designed into the HBM chips, not a workaround.
            Absolutely. Errors occur, and methods exist to mitigate them.
            But why retire the entire row for what I assume is the duration until a complete reset or a reset of state?
            That still smells of something with very small margins.
            And perhaps something that will improve with better SerDes handling?

            Has it always been row retirement for a single ECC (presumably correctable) error?



            • #7
              Originally posted by sophisticles View Post
              When I read this, my brain translates it to "do not buy or use an AMD MI300 until and unless they get this corrected by using better quality hardware."
              Why do you think DDR5 had to incorporate on-die ECC? It's because smaller lithography is leading to an increase in the intrinsic DRAM error rate.

              This is an issue with the DRAM, and there are only 3 HBM makers (although the 2 Korean makers seem to dominate that market). AMD doesn't fab its own HBM. The same issues should be affecting Nvidia and other HBM users. The only question is how their handling of it might differ.
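
              For anyone wondering what on-die ECC actually does, here's a minimal Hamming SECDED sketch over a 4-bit value. It's purely illustrative, not any vendor's implementation; DDR5's on-die ECC reportedly applies the same single-error-correct idea to 128 data bits with 8 check bits.

              /* Hamming SECDED over a nibble: single-error correct,
               * double-error detect. Uses the GCC/Clang __builtin_parity. */
              #include <stdio.h>
              #include <stdint.h>

              /* Encode nibble d into 8 bits: Hamming parity at positions 1,2,4
               * (stored in bits 0-6), plus an overall parity bit p0 in bit 7. */
              static uint8_t encode(uint8_t d)
              {
                  uint8_t d1 = d & 1, d2 = (d >> 1) & 1, d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
                  uint8_t p1 = d1 ^ d2 ^ d4;   /* covers positions 3,5,7 */
                  uint8_t p2 = d1 ^ d3 ^ d4;   /* covers positions 3,6,7 */
                  uint8_t p4 = d2 ^ d3 ^ d4;   /* covers positions 5,6,7 */
                  uint8_t word = (p1 << 0) | (p2 << 1) | (d1 << 2) |
                                 (p4 << 3) | (d2 << 4) | (d3 << 5) | (d4 << 6);
                  return word | ((uint8_t)__builtin_parity(word) << 7);
              }

              /* Decode: 0 = clean, 1 = corrected single error, 2 = uncorrectable. */
              static int decode(uint8_t cw, uint8_t *out)
              {
                  uint8_t w = cw & 0x7F, s = 0;
                  for (int pos = 1; pos <= 7; pos++)
                      if ((w >> (pos - 1)) & 1)
                          s ^= pos;                            /* syndrome = error position */
                  int parity_ok = (__builtin_parity(cw) == 0); /* p0 makes parity even */

                  int status = 0;
                  if (s && parity_ok)
                      return 2;                                /* two flips: detect only */
                  if (s) {
                      w ^= 1 << (s - 1);                       /* one flip: correct it */
                      status = 1;
                  }
                  *out = ((w >> 2) & 1) | (((w >> 4) & 1) << 1) |
                         (((w >> 5) & 1) << 2) | (((w >> 6) & 1) << 3);
                  return status;
              }

              int main(void)
              {
                  uint8_t cw = encode(0xB);
                  cw ^= 1 << 4;                                /* inject a single-bit flip */
                  uint8_t d;
                  printf("status=%d data=0x%X\n", decode(cw, &d), d); /* status=1 data=0xB */
                  return 0;
              }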

              Originally posted by sophisticles View Post
              These types of issues are not going to show up in benchmarks with freshly unboxed products that are sent to reviewers, they are going to show up with continuous use.
              Talking about regular DIMM memory, I found a new laptop DIMM to have memory errors when I tested it. I do this every time I install new memory. The memory maker replaced it under warranty.

              However, it's true that DRAM reliability decreases with usage. At my job, I've seen DIMMs start to encounter errors only after years of usage.
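
              For anyone who wants to do the same: dedicated tools like memtester or MemTest86+ are the right way to burn in new DIMMs, but the core idea is just write-a-pattern-then-verify. A toy sketch:

              /* Toy memory test: fill a buffer with known patterns, read it back,
               * count mismatches. A real test uses far more patterns, touches far
               * more memory, and runs for hours; this is illustrative only. */
              #include <stdio.h>
              #include <stdlib.h>
              #include <stdint.h>

              int main(void)
              {
                  size_t words = (256u * 1024 * 1024) / sizeof(uint64_t); /* 256 MiB */
                  volatile uint64_t *buf = malloc(words * sizeof(uint64_t));
                  if (!buf) { perror("malloc"); return 1; }

                  const uint64_t patterns[] = { 0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL,
                                                0xAAAAAAAAAAAAAAAAULL, 0x5555555555555555ULL };
                  size_t errors = 0;

                  for (size_t p = 0; p < sizeof(patterns) / sizeof(patterns[0]); p++) {
                      for (size_t i = 0; i < words; i++)   /* write pass */
                          buf[i] = patterns[p];
                      for (size_t i = 0; i < words; i++)   /* verify pass */
                          if (buf[i] != patterns[p])
                              errors++;
                  }
                  printf("%zu mismatches\n", errors);
                  free((void *)buf);
                  return 0;
              }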



              • #8
                Originally posted by geerge View Post
                These cards have to use HBM for the bandwidth. Unless you want cards the size of a dustbin lid to fit all the GDDR chips.
                HBM is for bandwidth, but also power and potentially cost savings.

                With the Nvidia H200 delivering up to 4.8 TB/s of bandwidth, you'd need almost 5x as wide a memory bus as the RTX 4090 to match it with GDDR6. That's infeasible, not to mention it would consume much more power. Worse, it would probably do nothing to address the reliability concerns, other than maybe being a little easier to cool.
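
                To put numbers on that, a quick back-of-envelope (assuming RTX 4090-class GDDR6X at 21 Gbit/s per pin, which is an assumption; actual per-pin rates vary):

                /* How wide would a GDDR6X bus need to be to match the H200's
                 * ~4.8 TB/s of HBM bandwidth? */
                #include <stdio.h>

                int main(void)
                {
                    const double target_gbit_s  = 4800.0 * 8.0; /* 4.8 TB/s in Gbit/s */
                    const double per_pin_gbit_s = 21.0;         /* GDDR6X rate per pin */
                    const double pins = target_gbit_s / per_pin_gbit_s;

                    /* ~1829 bits vs. the RTX 4090's 384-bit bus: ~4.8x wider */
                    printf("required bus width: ~%.0f bits (%.1fx the 4090's 384)\n",
                           pins, pins / 384.0);
                    return 0;
                }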



                • #9
                  Originally posted by milkylainen View Post
                  But why retire the entire row for what I assume is the duration until a complete reset or a reset of state?
                  IMO, just be glad it's not an entire page, which was the traditional Linux approach for excluding address ranges.

                  My guess is that you disable the whole row because it's not that big and there's a reasonable chance the error is located somewhere that could affect multiple cells. Or maybe it's just a granularity thing, where it would be too much bookkeeping to disable address ranges at a finer granularity.
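
                  As a rough illustration of the bookkeeping side, here's a hypothetical threshold-based row-retirement sketch. The real EDAC/RAS code differs (and per the article, MI300 may retire on the first correctable error); every name and number below is made up.

                  /* Hypothetical sketch: count correctable errors per row and
                   * retire a row once it crosses a threshold, on the theory that
                   * a flaky cell often signals a weak row. Not the kernel's code. */
                  #include <stdio.h>
                  #include <stdbool.h>

                  #define NUM_ROWS         16
                  #define RETIRE_THRESHOLD 3   /* correctable errors before retiring */

                  struct row_state {
                      unsigned ce_count;       /* correctable errors seen on this row */
                      bool retired;
                  };

                  static struct row_state rows[NUM_ROWS];

                  /* Called for each corrected ECC error reported against a row. */
                  static void on_correctable_error(unsigned row)
                  {
                      struct row_state *r = &rows[row];
                      if (r->retired)
                          return;
                      if (++r->ce_count >= RETIRE_THRESHOLD) {
                          r->retired = true;   /* hardware would remap to a spare row */
                          printf("row %u retired after %u correctable errors\n",
                                 row, r->ce_count);
                      }
                  }

                  int main(void)
                  {
                      for (int i = 0; i < 4; i++)
                          on_correctable_error(7);  /* simulated repeat offender */
                      return 0;
                  }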



                  • #10
                    Originally posted by bridgman View Post
                    This is a feature designed into the HBM chips, not a workaround.
                    Are you saying this is similar to wear leveling found in SSDs?

