**piorunz** · 05 October 2023, 07:14 PM

Originally posted by Veto

I have often wondered whether I should start using ECC RAM for my NAS/server, which runs 24/7. However, I have never noticed any issues that I could attribute to RAM errors.

Does anyone have any experience with running ECC RAM? Do you get errors/corrections reported in your logs regularly or at all? Is it really necessary in real life?

I have also experienced data corruption caused by non-ECC DDR4 RAM, and it cost me a lot of time to fix. A couple of years ago I upgraded my home server to ECC RAM and have not had any problems since. I now run my main workstation with ECC as well. Both are Ryzens: the server is a first-gen 1600X and the workstation is a 5800X, both on ASUS Prime boards. I will keep running ECC sticks; I don't want to downgrade to unreliable memory. It costs too much time when important data gets corrupted and the corruption quietly replicates to every backup.

**piorunz** · 05 October 2023, 07:18 PM

Originally posted by atmartens

The data really show how pathetic it is that ECC isn't standard. The performance penalty is negligible!

Unfortunately, it's not that easy. Normal non-ECC sticks have much lower latency and are faster because of that. ECC sticks have very high CL and are slower. Just inserting an ECC stick with ECC disabled costs you some performance.
For me this is worth it, but for extreme gamers maybe not, because they can simply redownload their games and so on.

**unwind-protect** · 05 October 2023, 07:23 PM

In summary, I am surprised that there is a measurable difference from turning ECC on and off. Even 2-3% isn't what I expected.

7% slower surprises me. It doesn't match my mental model of how ECC works.

Now I'm afraid that somebody will do the same testing for registered vs. unbuffered RAM.

**gary7** · 05 October 2023, 07:27 PM

I've been running my consumer-grade ASRock AM4 motherboard NFS server with ECC memory for over a year now and have yet to see any errors reported by edac-util or any reliability problems. I would not even consider running an NFS server without ECC memory. Most ASRock AM4 motherboards support ECC. A quick check of their AM5 motherboards shows that ECC support is 50/50 or less, with few of the low-end and even some of the high(est)-end motherboards not supporting it.
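For anyone who wants to check the same counters without edac-util, here is a minimal sketch that reads them straight from the kernel's EDAC sysfs tree. It assumes the usual Linux layout (`/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count`) and that an EDAC driver is loaded; the function name `read_edac_counts` is made up for this example.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters.

    Assumes the standard Linux EDAC layout; returns {} if it is absent.
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    counts = {}
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("No EDAC memory controllers found (driver not loaded or ECC inactive?)")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

An empty result does not prove the RAM is non-ECC; it may only mean that no EDAC driver is bound to the memory controller.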

**Psyord** · 05 October 2023, 07:30 PM

Originally posted by schmidtbag

This reminds me of the days when someone was arguing with me that ECC was going to be standard on DDR5 even for desktops and laptops. Sure seems that didn't pan out.

But it did pan out... ECC is the standard for DDR5. As in, every DDR5 stick has on-die ECC. The caveat is that it's not full ECC on both chip and signal, and it was never supposed to be. So by default, DDR5 "non-ECC"/"half-ECC" can only detect errors in the chip, but not in transit, as "full ECC" can.

**undersuit** · 05 October 2023, 08:53 PM

Originally posted by Yalok

According to the DDR5 Wikipedia article, there is in fact some form of ECC by design on all DDR5 sticks that was not present on standard DDR4.

I don't trust it to truly secure my data.

I trust it to get DDR5 working well enough to be competitive with DDR4. We don't know how often the on-die ECC is correcting errors introduced by the chip itself, and if a cosmic ray hits your memory near data where your overclocked DDR5 DIMM is already triggering repeated errors, you've got corruption.

**unwind-protect** · 05 October 2023, 10:14 PM

Originally posted by Yalok

According to the DDR5 Wikipedia article, there is in fact some form of ECC by design on all DDR5 sticks that was not present on standard DDR4.

That's nice, but you get no reporting. For all you know, you could have a broken module that is constantly spewing 1-bit errors, and the next thing you know you get a 2-bit error and wrong data, again without that fact being disclosed to you. In a way, this in-module-only ECC functionality is worse than no ECC.
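To illustrate the 1-bit vs. 2-bit point, here is a toy sketch using a Hamming(7,4) code. This is not the code DDR5 uses on-die; it only assumes that on-die ECC is a single-error-correcting scheme, as it is commonly described. The function names `encode` and `decode` are made up for this example. One flipped bit is repaired, while two flipped bits make the decoder "correct" a third bit and hand back wrong data without any complaint.

```python
def encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (bit positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 unused so list indices match bit positions
    code[3], code[5], code[6], code[7] = d
    for p in (1, 2, 4):  # parity bits live at the power-of-two positions
        for pos in range(1, 8):
            if pos != p and pos & p:
                code[p] ^= code[pos]
    return code[1:]


def decode(bits):
    """Correct at most one flipped bit; return (nibble, syndrome)."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1  # decoder assumes exactly one bit flipped
    d = (code[3], code[5], code[6], code[7])
    return sum(bit << i for i, bit in enumerate(d)), syndrome


if __name__ == "__main__":
    word = encode(0b1011)

    one_flip = word.copy()
    one_flip[2] ^= 1
    print("1-bit error recovered:", decode(one_flip)[0] == 0b1011)

    two_flips = word.copy()
    two_flips[2] ^= 1
    two_flips[5] ^= 1
    print("2-bit error recovered:", decode(two_flips)[0] == 0b1011)
```

In this tiny code every double error is miscorrected. Side-band ECC on an ECC DIMM typically adds double-error detection on top of single-error correction, and the memory controller can report the event to the OS.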

**coder** · 06 October 2023, 02:58 AM

Originally posted by piorunz

Unfortunately, it's not that easy. Normal non-ECC sticks have much lower latency and are faster because of that. ECC sticks have very high CL and are slower. Just inserting an ECC stick with ECC disabled costs you some performance.

This is the most ignorant thing I've read in a while. What's scary is how confidently you proclaim it.

ECC UDIMMs only differ from non-ECC UDIMMs in terms of the number of DRAM chips on them. The DRAM chips, themselves, are exactly the same as the type used on non-ECC UDIMMs.

So, why the higher latency spec? That's only because they use conservative timings and adhere strictly to JEDEC specifications. They also tend not to have heatspreaders on them, like you often see on "gaming" DIMMs. So, I'm not sure how much the specs are down to reducing heat dissipation or power consumption.

When the CPU is running in ECC mode, the CPU's integrated memory controller performs the ECC checking & computation. That can cost you a couple extra nanoseconds, at most. I'd further speculate that perhaps enabling ECC might disable burst chop, as that's just about the only way I can make sense of the more extreme outliers, assuming those results are stable.

**coder** · 06 October 2023, 03:10 AM

Originally posted by Yalok

According to the DDR5 Wikipedia article, there is in fact some form of ECC by design on all DDR5 sticks that was not present on standard DDR4.

On-die ECC is much less dense. With an ECC UDIMM, you get 8 check bits per 32 data bits (external). At the chip level, DDR5 typically uses just 8 bits per 128. So, the level of protection is much less.

Source: https://www.synopsys.com/designware-...-features.html
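As a quick sanity check on those ratios, the arithmetic can be sketched like this. The 8-per-32 and 8-per-128 figures are the ones quoted above; the helper name `ecc_overhead` is made up for this example.

```python
def ecc_overhead(check_bits, data_bits):
    """Fraction of extra storage spent on check bits, relative to the data bits."""
    return check_bits / data_bits


side_band = ecc_overhead(8, 32)   # ECC UDIMM, per 32-bit DDR5 subchannel
on_die = ecc_overhead(8, 128)     # typical DDR5 on-die ECC

print(f"side-band ECC overhead: {side_band:.2%}")
print(f"on-die ECC overhead:    {on_die:.2%}")
print(f"side-band spends {side_band / on_die:.0f}x more check bits per data bit")
```

Check-bit density is only a rough proxy for protection, since what a code can detect or correct also depends on how it is constructed.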

Furthermore, as others have noted, DDR5 doesn't report these on-die errors. So, your machine will be blissfully unaware of any error detections or corrections, which I think is a wasted opportunity. Imagine the kernel keeping a list of which pages have ECC errors and then simply removing from circulation any pages in which more than one error has been detected! I believe this is similar to what modern SSDs do to help compensate for NAND aging.
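Purely as an illustration of that idea, here is a hypothetical sketch of such a bookkeeping policy. Nothing here is a real kernel interface: `PageRetirementTracker`, the 4 KiB page size, and the threshold of two errors are all assumptions chosen to match "more than one error".

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed page size in bytes


class PageRetirementTracker:
    """Count ECC errors per physical page and retire pages that repeat."""

    def __init__(self, threshold=2):
        self.threshold = threshold  # errors needed before a page is retired
        self.errors = Counter()
        self.retired = set()

    def record_error(self, phys_addr):
        """Record one error at phys_addr; return True if its page is now retired."""
        page = phys_addr // PAGE_SIZE
        self.errors[page] += 1
        if self.errors[page] >= self.threshold:
            self.retired.add(page)
        return page in self.retired


if __name__ == "__main__":
    tracker = PageRetirementTracker()
    for addr in (0x1000, 0x2000, 0x1FFF):
        retired = tracker.record_error(addr)
        print(f"error at {addr:#x}: page retired = {retired}")
```

A real implementation would also have to migrate the page's contents before taking it out of circulation, which this sketch ignores.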

From what I've read, there are two reasons DDR5 has on-die ECC:

1. Relative to DDR4, it runs at lower voltage and increased the refresh interval. Both of these changes provide power savings, but increase the likelihood of bit errors.
2. Shrinking cell sizes (and die-stacking?) also increases the likelihood of bit errors.

So, to compensate for its higher intrinsic error rate, the DDR5 spec allows for on-die ECC. The expectation is that manufacturers will use just enough to achieve external error rates comparable to DDR4's. So, don't expect a net benefit from DDR5's on-die ECC. It's there to provide adequate reliability for mass-market applications, not extra. That's why the DDR5 DIMM spec still allows for end-to-end ECC!

**coder** · 06 October 2023, 03:17 AM

Originally posted by undersuit

We don't know how often the on-die ECC is correcting errors introduced by the chip itself, and if a cosmic ray hits your memory near data where your overclocked DDR5 DIMM is already triggering repeated errors, you've got corruption.

In my experience, most ECC errors occur due to manufacturing defects and normal wear out. One server had recurring ECC errors in a particular channel, but they persisted after replacing that DIMM, meaning the defect is either in the motherboard (including the DIMM socket) or the CPU.

For a sysadmin, ECC errors are a sign that you should replace a DIMM. Without (visible) ECC, you won't know it has worn out until it's so bad the machine has become unstable or corrupted data -- and then, just maybe you happen to have the presence of mind to run memtest.

Long ago, I read that poor quality power supplies can also increase the frequency of memory errors, though I'm not sure if that's due more to noise or voltages being off.

AMD Ryzen 9 7900X Performance With ECC DDR5 Memory