AMD Ryzen 9 7900X Performance With ECC DDR5 Memory


  • coder
    replied
    Originally posted by Old Nobody View Post
    All reported errors got corrected. You have to decide on your own if it's worth the extra money.
    A double-bit (detected) error will abort your process, if it happens in userspace. In the kernel, I'd imagine you'll get a kernel panic. It's not nice, but at least it stops you from continuing and possibly propagating corrupt data.
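
To make the "corrected" vs. "detected" distinction concrete, here is a toy SECDED (single-error-correct, double-error-detect) code in Python. It is only a sketch: it protects 4 data bits with an extended Hamming(8,4) code, whereas a real memory controller uses a much wider code, and the function names are made up for illustration.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4,5,6,7
    code[0] = sum(code[1:]) % 2             # overall parity over everything
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    overall = sum(code) % 2             # 0 when the total parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but consistent overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[6] ^= 1
print(decode(word), decode(one_flip), decode(two_flips))
```

The two-flip case is the one that would abort your process: the decoder knows the data is bad but cannot tell which bits to fix.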

    I experienced this, in practice. We had a server where apps were crashing a lot. I think there were also reports of the machine locking up or hitting kernel panics, but I'm foggy on those details. Upon investigation, syslog was filled with reams of ECC errors, some of them double-bit!
    Last edited by coder; 06 October 2023, 03:33 AM.



  • coder
    replied
    Originally posted by unwind-protect View Post
    In summary, I am surprised that there is a measurable difference from turning ECC on and off. Even 2-3% isn't what I expected.

    7% slower surprises me. It doesn't match my mental model of how ECC works.
    The problem is you just looked at the outliers. The geomean across all 242 benchmarks showed that enabling ECC still gives you an average of 99.74% of the performance you get with it disabled.
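
To illustrate why a couple of outliers barely move a geomean over 242 results, here is a small Python sketch. The ratios are made up for the example (240 results at parity and two at 0.93); they are not the article's data.

```python
import math


def geomean(ratios):
    """Geometric mean of positive ratios, computed via the mean of their logs."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical ECC-on / ECC-off performance ratios, NOT the real benchmark data:
# 240 tests with no difference plus two tests that are 7% slower.
ratios = [1.0] * 240 + [0.93, 0.93]
print(f"geomean = {geomean(ratios):.4f}")
```

Even with two 7% regressions in the set, the overall figure stays well above 99.9% in this made-up case.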




    Without further investigation, we don't know why the outliers are outliers. I wonder how consistent their scores are. It could be that they're just some of the more highly-variable benchmarks included in the suite, and much of the discrepancy we're seeing is due to random variation. Or, maybe it's something else, like that enabling ECC has the effect of disabling burst-chop.

    The bigger performance penalty is having to use lower-spec'd RAM, because ECC UDIMMs lag non-ECC in terms of the speed and timings available. Speaking of which, Kingston provides DDR5-5600 ECC UDIMMs, Michael (the article says only DDR5-5200):

    Trust Kingston for all of your servers, desktops and laptops memory needs. Kingston DRAM is designed to maximize the performance of a specific computer system. Find memory for your device here.



    P.S. thanks for the benchmarks, Michael. I'd love to know more about some of those more extreme outliers (how consistent were the scores, without changing the memory setting?).
    Last edited by coder; 06 October 2023, 03:54 AM.



  • NM64
    replied
    Originally posted by peterdk View Post
    Are there any consumer motherboards that support ECC on AMD 7000 series?
    Originally posted by wertigon View Post
    Asus got support across pretty much the whole range, do double check but seems like most Prime and ROG boards support it, and possibly more.
    Originally posted by gary7 View Post
    Most ASRock AM4 mother boards support ECC. A quick check of their AM5 motherboards shows that ECC support is 50/50 or less with few of the low end and even some of the high(est) end motherboards not supporting it.
    Just a quick protip: you can typically find out if a given motherboard supports ECC by just downloading the PDF manual and doing a search for the likes of "error" or "ecc" or "e.c.c" (yes, without the final period to make sure they didn't derp up)

    Originally posted by coder View Post
    So, why the higher latency spec? That's only because they use conservative timings and adhere strictly to JEDEC specifications.
    One of the fun things about using consumer boards that support ECC is that you can overclock the RAM and, due to the aforementioned error-correcting functionality, it makes it not just way easier to find stability but also gives you that extra peace of mind that, even if some random teeny bit of instability managed to slip through, it'll get corrected on-the-fly anyway.

    Fun fact: I have some 2x4GB DDR3-1066 Crucial unbuffered ECC that easily runs at 1600 9-9-9-24 even when undervolted some. It's similarly paired with some SuperTalent 2x8GB DDR3-1333 that also runs at 1600 9-9-9-24 when undervolted (one of the sticks even runs at 1866 with a minor undervolt, but the system it's currently on can't properly do 1866 anyway so I have all four DIMMs at 1600 9-9-9-24)

    Oh yeah, undervolting is also a great way to test your RAM's stability. If it's stable at 1.40v, then it sure as heck should be stable at 1.50v with the error-correcting even as a fallback.

    ...that being said, I did find a funny thing where you can have RAM pass memtest86+ but fail OCCT with medium data set + extreme, yet not due to any sort of overclock but rather the combination of DIMMs and/or motherboard just not jiving (I have a 2x8GB DDR3-1600 Kingston kit that just wouldn't jive with a 2x8GB DDR3-1600 Corsair kit, but the Kingston kit jives just fine with a 2x4GB DDR3-1866 Corsair kit, and that same 2x4GB DDR3-1866 Corsair kit jives just fine with the 2x8GB DDR3-1600 Corsair kit on even the same motherboard and CPU, so...)
    Last edited by NM64; 06 October 2023, 03:32 AM.



  • Smurphy
    replied
    Originally posted by Veto View Post
    I have often wondered, if I should begin to use ECC RAM for my NAS/server running 24/7. However, I have not really realized any issues being due to RAM errors.

    Does anyone have any experience with running ECC RAM? Do you get errors/corrections reported in your logs regularly or at all? Is it really necessary in real life?
    I do use ECC RAM in my server (build blog here: https://www.solsys.org/dyntbl.php?mo...3&op=View&id=9 ) and haven't had any errors show up yet.
    My old NAS had to be retired as it started showing erratic behavior, with occasional data loss. These problems no longer exist since I switched to that setup.



  • coder
    replied
    Originally posted by undersuit View Post
    We don't know how often the on-die ECC is correcting errors introduced by the chip itself, and if a cosmic ray hits your memory near data where your overclocked DDR5 DIMM is already triggering repeated errors, you've got corruption.
    In my experience, most ECC errors occur due to manufacturing defects and normal wear out. One server had recurring ECC errors in a particular channel, but they persisted after replacing that DIMM, meaning the defect is either in the motherboard (including the DIMM socket) or the CPU.

    For a sysadmin, ECC errors are a sign that you should replace a DIMM. Without (visible) ECC, you won't know it has worn out until it's so bad the machine has become unstable or corrupted data -- and then, just maybe you happen to have the presence of mind to run memtest.
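
If you'd rather poll the counters than grep syslog, here is a minimal Python sketch. It assumes a Linux box with an EDAC driver loaded for your memory controller, which exposes per-controller `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc`. The function name is made up, and on a machine without EDAC it simply returns nothing.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from the EDAC sysfs tree."""
    counts = {}
    root = Path(base)
    if not root.is_dir():
        return counts                   # no EDAC driver loaded (or not Linux)
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue                    # skip controllers we cannot read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A steadily climbing corrected count on one controller is the "replace a DIMM" signal described above; any uncorrected count deserves immediate attention.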

    Long ago, I read that poor quality power supplies can also increase the frequency of memory errors, though I'm not sure if that's due more to noise or voltages being off.
    Last edited by coder; 06 October 2023, 03:56 AM.



  • coder
    replied
    Originally posted by Yalok View Post
    According to DDR5 Wikipedia article, there is in fact some form of ECC by design on all DDR5 sticks, that was not present on standard DDR4.
    On-die ECC is much less dense. With an ECC UDIMM, you get 8 bits per 32 (external). At the chip-level DDR5 typically uses just 8 bits per 128. So, the level of protection is much less.
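
A quick way to see what those ratios buy you is the Hamming bound: correcting one flipped bit among m data bits needs r check bits with 2^r >= m + r + 1, and guaranteed double-error detection costs one more. The Python sketch below assumes a plain Hamming-style code, which is my simplification (the thread doesn't say which codes DRAM vendors actually use), and it takes the 8-per-32 and 8-per-128 figures straight from the post above.

```python
def sec_check_bits(data_bits):
    """Minimum check bits for single-error correction (Hamming bound)."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits):
    """SEC plus one overall parity bit for double-error detection."""
    return sec_check_bits(data_bits) + 1


# (label, data bits, check bits available) using the figures quoted in the post.
for label, data, available in [("ECC UDIMM", 32, 8), ("on-die", 128, 8)]:
    print(
        f"{label}: overhead {available / data:.2%}, "
        f"SEC needs {sec_check_bits(data)}, "
        f"SECDED needs {secded_check_bits(data)}, has {available}"
    )
```

Under that assumption, 8 check bits over 128 data bits is exactly the minimum for single-error correction, with nothing left over for guaranteed double-error detection, while 8 bits over 32 leaves room to spare.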

    Furthermore, as others have noted, DDR5 doesn't report these on-die errors. So, your machine will be blissfully unaware of any error detections or corrections, which I think is a wasted opportunity. Imagine the kernel keeping a list of which pages have ECC errors and then simply removing from circulation any pages in which more than one error has been detected! I believe this is similar to what modern SSDs do to help compensate for NAND aging.
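
The page-retirement idea could look roughly like this. It is purely a hypothetical sketch in Python, not how any kernel does it: the class, its methods, the 4 KiB page size, and the "retire on the second error" threshold are all assumptions chosen to match the "more than one error" wording above.

```python
from collections import Counter


class PageErrorTracker:
    """Hypothetical bookkeeping: count corrected errors per page, retire repeat offenders."""

    def __init__(self, threshold=2, page_size=4096):
        self.threshold = threshold      # errors on a page before it is retired
        self.page_size = page_size      # assumed 4 KiB pages
        self.errors = Counter()
        self.retired = set()

    def record_corrected_error(self, phys_addr):
        """Log one corrected error; return True if this call retires the page."""
        page = phys_addr // self.page_size
        if page in self.retired:
            return False                # already out of circulation
        self.errors[page] += 1
        if self.errors[page] >= self.threshold:
            self.retired.add(page)
            return True
        return False


tracker = PageErrorTracker()
print(tracker.record_corrected_error(0x1234))   # first error on page 1
print(tracker.record_corrected_error(0x1FF0))   # second error on the same page
print(sorted(tracker.retired))
```

None of this is possible with on-die ECC today, precisely because the DRAM never tells the host that a correction happened.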

    From what I've read, there are two reasons DDR5 has on-die ECC:
    1. Relative to DDR4, it runs at a lower voltage and with an increased refresh interval. Both of these changes provide power savings, but increase the likelihood of bit errors.
    2. Shrinking cell sizes (and die-stacking?) also increases the likelihood of bit errors.

    So, to compensate for its higher intrinsic error rate, the DDR5 spec allows for on-die ECC. The expectation is that manufacturers will use just enough to achieve external error rates comparable to DDR4's. So, don't expect a net benefit from DDR5's on-die ECC. It's there to provide adequate reliability for mass-market applications, not extra. That's why the DDR5 DIMM spec allows for end-to-end ECC!
    Last edited by coder; 06 October 2023, 03:59 AM.



  • coder
    replied
    Originally posted by piorunz View Post
    Unfortunately it's not that easy. Normal non-ECC sticks have much lower latency and are faster because of that. ECC sticks have very high CL and are slower. Just inserting ECC stick with ECC disabled cost you some performance.
    This is the most ignorant thing I've read in a while. What's scary is how confidently you proclaim it.

    ECC UDIMMs only differ from non-ECC UDIMMs in terms of the number of DRAM chips on them. The DRAM chips, themselves, are exactly the same as the type used on non-ECC UDIMMs.

    So, why the higher latency spec? That's only because they use conservative timings and adhere strictly to JEDEC specifications. They also tend not to have heatspreaders on them, like you often see on "gaming" DIMMs. So, I'm not sure how much the specs are down to reducing heat dissipation or power consumption.

    When the CPU is running in ECC mode, the CPU's integrated memory controller performs the ECC checking & computation. That can cost you a couple extra nanoseconds, at most. I'd further speculate that perhaps enabling ECC might disable burst chop, as that's just about the only way I can make sense of the more extreme outliers, assuming those results are stable.
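
To put CL numbers and "a couple extra nanoseconds" in the same units: CAS latency in nanoseconds is CL divided by the memory clock, and the memory clock is half the transfer rate. The Python sketch below shows the conversion. The two example kits are illustrative timings I picked (a JEDEC-style DDR5-5600 CL46 part and an overclocked-style DDR5-6000 CL30 kit), not figures from the article.

```python
def cas_latency_ns(data_rate_mts, cl):
    """First-word CAS latency in ns for a DDR part.

    DDR transfers twice per clock, so the clock in MHz is data_rate_mts / 2
    and one cycle lasts 2000 / data_rate_mts nanoseconds.
    """
    return cl * 2000.0 / data_rate_mts


# Illustrative timings only (assumptions, not measurements from the thread).
for name, rate, cl in [("DDR5-5600 CL46", 5600, 46), ("DDR5-6000 CL30", 6000, 30)]:
    print(f"{name}: {cas_latency_ns(rate, cl):.2f} ns")
```

On these example numbers, the gap from conservative timings is several nanoseconds, which is larger than the couple of nanoseconds attributed to the ECC check itself.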
    Last edited by coder; 06 October 2023, 04:01 AM.



  • unwind-protect
    replied
    Originally posted by Yalok View Post

    According to DDR5 Wikipedia article, there is in fact some form of ECC by design on all DDR5 sticks, that was not present on standard DDR4.
    That's nice but you get no reporting. For all you know you could have a broken module that is spewing 1-bit errors on a constant basis and next thing you know you get a 2-bit error and wrong data - again without that fact being disclosed to you. In a way this in-module-only ECC functionality is worse than no ECC.



  • undersuit
    replied
    Originally posted by Yalok View Post

    According to DDR5 Wikipedia article, there is in fact some form of ECC by design on all DDR5 sticks, that was not present on standard DDR4.
    I don't trust it to truly secure my data.

    I trust it to get DDR5 working well enough to be competitive with DDR4. We don't know how often the on-die ECC is correcting errors introduced by the chip itself, and if a cosmic ray hits your memory near data where your overclocked DDR5 DIMM is already triggering repeated errors, you've got corruption.



  • Psyord
    replied
    Originally posted by schmidtbag View Post
    This reminds me of the days when someone was arguing with me that ECC was going to be standard on DDR5 even for desktops and laptops. Sure seems that didn't pan out.
    But it did pan out... ECC is the standard for DDR5. As in, every DDR5 stick has on-die ECC. The caveat is that it's not full ECC on both chip and signal. And it was never supposed to be. So by default DDR5 "non-ECC"/"half-ECC" can only detect errors in the chip, but not in transit like "full ECC" can.

