Linux 6.9 Adding AMD MI300 Row Retirement Support For Problematic HBM Memory


  • Linux 6.9 Adding AMD MI300 Row Retirement Support For Problematic HBM Memory

    Phoronix: Linux 6.9 Adding AMD MI300 Row Retirement Support For Problematic HBM Memory

    For the upcoming Linux 6.9 kernel cycle there are a number of AMD Instinct MI300 additions to the EDAC (Error Detection And Correction) and RAS (Reliability, Availability and Serviceability) drivers...


  • #2
    AMD MI300 systems have on-die High Bandwidth Memory. This memory has a relatively higher error rate, and it is not individually replaceable like DIMMs...
    When I read this, my brain translates it to "do not buy or use an AMD MI300 until and unless they get this corrected by using better quality hardware."

    These types of issues are not going to show up in benchmarks with freshly unboxed products that are sent to reviewers; they are going to show up with continuous use.

    I would never spend $20, much less 20 grand, on a product that needs this kind of workaround.



    • #3
      Originally posted by sophisticles View Post

      When I read this, my brain translates it to "do not buy or use an AMD MI300 until and unless they get this corrected by using better quality hardware."

      These types of issues are not going to show up in benchmarks with freshly unboxed products that are sent to reviewers; they are going to show up with continuous use.

      I would never spend $20, much less 20 grand, on a product that needs this kind of workaround.
      This doesn't look like a workaround, but a conscious reliability improvement. Hardware isn't infallible; the best you can do is handle failures gracefully.
      NVIDIA implements similar mechanisms (page retirement and, since the A100, row remapping) in its datacenter offerings.



      • #4
        These cards have to use HBM for the bandwidth. Unless you want cards the size of a dustbin lid to fit all the GDDR chips.



        • #5
          Originally posted by sophisticles View Post
          When I read this, my brain translates it to "do not buy or use an AMD MI300 until and unless they get this corrected by using better quality hardware." These types of issues are not going to show up in benchmarks with freshly unboxed products that are sent to reviewers; they are going to show up with continuous use. I would never spend $20, much less 20 grand, on a product that needs this kind of workaround.
          Hmm... you may want to avoid any products with current generation memory then, including DDR5, since they all have higher data rates -> higher error rates -> built-in RAS features to maintain the same reliability levels as previous generations.

          This is a feature designed into the HBM chips, not a workaround.



          • #6
            Originally posted by bridgman View Post

            Hmm... you may want to avoid any products with current generation memory then, including DDR5, since they all have higher data rates -> higher error rates -> built-in RAS features to maintain the same reliability levels as previous generations.

            This is a feature designed into the HBM chips, not a workaround.
            Absolutely. Errors occur, and methods exist to mitigate them.
            But why retire the entire row for what I assume is the duration until a complete reset or a reset of state?
            That still smells of something with very small margins.
            And perhaps something that will improve with better SerDes handling?

            Has it always been row retirement for a single ECC (presumably correctable) error?



            • #7
              Originally posted by sophisticles View Post
              When I read this, my brain translates it to "do not buy or use an AMD MI300 until and unless they get this corrected by using better quality hardware."
              Why do you think DDR5 had to incorporate on-die ECC? It's because smaller lithography is leading to an increase in the intrinsic DRAM error rate.

              This is an issue with the DRAM, and there are only 3 HBM makers (although the 2 Korean makers seem to dominate that market). AMD doesn't fab its own HBM. The same issues should be affecting Nvidia and other HBM users. The only question is how their handling of it might differ.
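
              For anyone wondering what on-die ECC actually does, here's a minimal Hamming SECDED sketch over a 4-bit value. It's purely illustrative, not any vendor's implementation; DDR5's on-die ECC reportedly applies the same single-error-correct idea to 128 data bits with 8 check bits.

              /* Hamming SECDED over a nibble: single-error correct,
               * double-error detect. Uses the GCC/Clang __builtin_parity. */
              #include <stdio.h>
              #include <stdint.h>

              /* Encode nibble d into 8 bits: Hamming parity at positions 1,2,4
               * (stored in bits 0-6), plus an overall parity bit p0 in bit 7. */
              static uint8_t encode(uint8_t d)
              {
                  uint8_t d1 = d & 1, d2 = (d >> 1) & 1, d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
                  uint8_t p1 = d1 ^ d2 ^ d4;   /* covers positions 3,5,7 */
                  uint8_t p2 = d1 ^ d3 ^ d4;   /* covers positions 3,6,7 */
                  uint8_t p4 = d2 ^ d3 ^ d4;   /* covers positions 5,6,7 */
                  uint8_t word = (p1 << 0) | (p2 << 1) | (d1 << 2) |
                                 (p4 << 3) | (d2 << 4) | (d3 << 5) | (d4 << 6);
                  return word | ((uint8_t)__builtin_parity(word) << 7);
              }

              /* Decode: 0 = clean, 1 = corrected single error, 2 = uncorrectable. */
              static int decode(uint8_t cw, uint8_t *out)
              {
                  uint8_t w = cw & 0x7F, s = 0;
                  for (int pos = 1; pos <= 7; pos++)
                      if ((w >> (pos - 1)) & 1)
                          s ^= pos;                            /* syndrome = error position */
                  int parity_ok = (__builtin_parity(cw) == 0); /* p0 makes parity even */

                  int status = 0;
                  if (s && parity_ok)
                      return 2;                                /* two flips: detect only */
                  if (s) {
                      w ^= 1 << (s - 1);                       /* one flip: correct it */
                      status = 1;
                  }
                  *out = ((w >> 2) & 1) | (((w >> 4) & 1) << 1) |
                         (((w >> 5) & 1) << 2) | (((w >> 6) & 1) << 3);
                  return status;
              }

              int main(void)
              {
                  uint8_t cw = encode(0xB);
                  cw ^= 1 << 4;                                /* inject a single-bit flip */
                  uint8_t d;
                  printf("status=%d data=0x%X\n", decode(cw, &d), d); /* status=1 data=0xB */
                  return 0;
              }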

              Originally posted by sophisticles View Post
              These types of issues are not going to show up in benchmarks with freshly unboxed products that are sent to reviewers, they are going to show up with continuous use.
              Talking about regular DIMM memory, I found a new laptop DIMM to have memory errors when I tested it. I do this every time I install new memory. The memory maker replaced it under warranty.

              However, it's true that DRAM reliability decreases with usage. At my job, I've seen DIMMs start to encounter errors only after years of usage.
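
              For anyone who wants to do the same: dedicated tools like memtester or MemTest86+ are the right way to burn in new DIMMs, but the core idea is just write-a-pattern-then-verify. A toy sketch:

              /* Toy memory test: fill a buffer with known patterns, read it back,
               * count mismatches. A real test uses far more patterns, touches far
               * more memory, and runs for hours; this is illustrative only. */
              #include <stdio.h>
              #include <stdlib.h>
              #include <stdint.h>

              int main(void)
              {
                  size_t words = (256u * 1024 * 1024) / sizeof(uint64_t); /* 256 MiB */
                  volatile uint64_t *buf = malloc(words * sizeof(uint64_t));
                  if (!buf) { perror("malloc"); return 1; }

                  const uint64_t patterns[] = { 0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL,
                                                0xAAAAAAAAAAAAAAAAULL, 0x5555555555555555ULL };
                  size_t errors = 0;

                  for (size_t p = 0; p < sizeof(patterns) / sizeof(patterns[0]); p++) {
                      for (size_t i = 0; i < words; i++)   /* write pass */
                          buf[i] = patterns[p];
                      for (size_t i = 0; i < words; i++)   /* verify pass */
                          if (buf[i] != patterns[p])
                              errors++;
                  }
                  printf("%zu mismatches\n", errors);
                  free((void *)buf);
                  return 0;
              }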



              • #8
                Originally posted by geerge View Post
                These cards have to use HBM for the bandwidth. Unless you want cards the size of a dustbin lid to fit all the GDDR chips.
                HBM is for bandwidth, but also power and potentially cost savings.

                With the Nvidia H200 delivering up to 4.8 TB/s of bandwidth, you'd need almost 5x as wide a memory bus as the RTX 4090 to match it with GDDR6. That's infeasible, not to mention it would consume much more power. Worse, it would probably do nothing to address the reliability concerns, other than maybe being a little easier to cool.
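
                To put numbers on that, a quick back-of-envelope (assuming RTX 4090-class GDDR6X at 21 Gbit/s per pin, which is an assumption; actual per-pin rates vary):

                /* How wide would a GDDR6X bus need to be to match the H200's
                 * ~4.8 TB/s of HBM bandwidth? */
                #include <stdio.h>

                int main(void)
                {
                    const double target_gbit_s  = 4800.0 * 8.0; /* 4.8 TB/s in Gbit/s */
                    const double per_pin_gbit_s = 21.0;         /* GDDR6X rate per pin */
                    const double pins = target_gbit_s / per_pin_gbit_s;

                    /* ~1829 bits vs. the RTX 4090's 384-bit bus: ~4.8x wider */
                    printf("required bus width: ~%.0f bits (%.1fx the 4090's 384)\n",
                           pins, pins / 384.0);
                    return 0;
                }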



                • #9
                  Originally posted by milkylainen View Post
                  But why retire the entire row for what I assume is the duration until a complete reset or a reset of state?
                  IMO, just be glad it's not an entire page, which was the traditional Linux approach for excluding address ranges.

                  My guess is that you disable the whole row because it's not that big and there's a reasonable chance the error is located somewhere that could affect multiple cells. Or maybe it's just a granularity thing, where it would be too much bookkeeping to disable address ranges at a finer granularity.
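
                  As a rough illustration of the bookkeeping side, here's a hypothetical threshold-based row-retirement sketch. The real EDAC/RAS code differs (and per the article, MI300 may retire on the first correctable error); every name and number below is made up.

                  /* Hypothetical sketch: count correctable errors per row and
                   * retire a row once it crosses a threshold, on the theory that
                   * a flaky cell often signals a weak row. Not the kernel's code. */
                  #include <stdio.h>
                  #include <stdbool.h>

                  #define NUM_ROWS         16
                  #define RETIRE_THRESHOLD 3   /* correctable errors before retiring */

                  struct row_state {
                      unsigned ce_count;       /* correctable errors seen on this row */
                      bool retired;
                  };

                  static struct row_state rows[NUM_ROWS];

                  /* Called for each corrected ECC error reported against a row. */
                  static void on_correctable_error(unsigned row)
                  {
                      struct row_state *r = &rows[row];
                      if (r->retired)
                          return;
                      if (++r->ce_count >= RETIRE_THRESHOLD) {
                          r->retired = true;   /* hardware would remap to a spare row */
                          printf("row %u retired after %u correctable errors\n",
                                 row, r->ce_count);
                      }
                  }

                  int main(void)
                  {
                      for (int i = 0; i < 4; i++)
                          on_correctable_error(7);  /* simulated repeat offender */
                      return 0;
                  }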



                  • #10
                    Originally posted by bridgman View Post
                    This is a feature designed into the HBM chips, not a workaround.
                    Are you saying this is similar to wear leveling found in SSDs?

