8 vs. 12 Channel DDR5-6000 Memory Performance With AMD 5th Gen EPYC

  • coder
    replied
    Originally posted by fairydreaming View Post
    This one is for memory-to-memory operations, so it's still not a general purpose store instruction that you could use to store values from registers in memory.
    Memory-to-memory doesn't mean DRAM-to-DRAM. It just means using addresses instead of registers. Those addresses might be in the cache hierarchy.

So, just write the data into a stack-based buffer and then use REP MOVSB to copy it where you want. Stack-based local variables are virtually guaranteed to be in L1, so this avoids RFO when initially populating the buffer. It might seem inefficient to populate a local buffer before copying it to DRAM, but I think it's worth the trouble if your system is memory-bottlenecked and assuming REP MOVSB can indeed avoid RFO.
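    A minimal sketch of that pattern, assuming GNU-style inline asm (the function names are mine, purely illustrative):

        #include <stddef.h>
        #include <stdint.h>

        /* Populate a small stack buffer (virtually guaranteed to sit in L1),
         * then let REP MOVSB stream it to the destination, hoping the
         * string-copy path elides the RFO on the destination lines. */
        static inline void rep_movsb(void *dst, const void *src, size_t n)
        {
            __asm__ volatile("rep movsb"
                             : "+D"(dst), "+S"(src), "+c"(n)
                             :
                             : "memory");
        }

        void stream_out(uint8_t *dst, size_t n)
        {
            uint8_t buf[4096];                        /* stack buffer, L1-resident */
            for (size_t off = 0; off < n; off += sizeof(buf)) {
                size_t chunk = n - off < sizeof(buf) ? n - off : sizeof(buf);
                for (size_t i = 0; i < chunk; i++)    /* generate data in-cache */
                    buf[i] = (uint8_t)(off + i);
                rep_movsb(dst + off, buf, chunk);     /* memory-to-memory copy */
            }
        }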

    Leave a comment:


  • fairydreaming
    replied
    Originally posted by coder View Post
    If REP STOS works that way, why wouldn't REP MOVS?
    This one is for memory-to-memory operations, so it's still not a general purpose store instruction that you could use to store values from registers in memory.
    I guess we are stuck with non-temporal stores after all.

    Leave a comment:


  • coder
    replied
    Originally posted by fairydreaming View Post
    But is REP STOS really that useful? I mean, it's not a general-purpose store that you can use to write arbitrary data to memory. As far as I know, the way it works restricts its use cases to memset-like memory initialization.
    If REP STOS works that way, why wouldn't REP MOVS?

    I also noticed there are a couple of write-oriented prefetch instructions, PREFETCHW and PREFETCHWT1, which makes me wonder how they differ from regular PREFETCH.
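    For what it's worth, GCC and Clang expose the write-intent variant through __builtin_prefetch with rw=1, which can lower to PREFETCHW on targets that support it. A sketch (the wrapper name is mine):

        /* Request the line in an exclusive (writable) state instead of
         * shared, so a later store doesn't need a separate ownership
         * request. */
        static inline void prefetch_for_write(const void *p)
        {
            __builtin_prefetch(p, 1 /* rw: prepare for write */, 3 /* keep in cache */);
        }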

    I don't really have time to dig into this now, but I might be tempted to cook up a little benchmark and try it on some Zen 4, Zen 5, and Xeon 4+ instances.
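    If anyone beats me to it, a bare-bones harness along these lines should do; the memset() call is just a placeholder for whichever store strategy is under test (note that glibc's memset may itself pick REP STOSB or non-temporal paths depending on size):

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <time.h>

        #define BUF_SIZE   (1u << 30)   /* 1 GiB, larger than any L3 segment */
        #define ITERATIONS 10

        int main(void)
        {
            char *buf = malloc(BUF_SIZE);
            if (!buf) return 1;

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int i = 0; i < ITERATIONS; i++)
                memset(buf, i, BUF_SIZE);   /* store strategy under test */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            printf("%.1f GB/s\n", (double)BUF_SIZE * ITERATIONS / secs / 1e9);
            free(buf);
            return 0;
        }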

    Leave a comment:


  • fairydreaming
    replied
    Originally posted by coder View Post
    IMO, non-temporal loads & stores should be used when reuse isn't anticipated. This includes cases where you're dealing with a chunk of data that's larger than the largest L3 segment, such as in this recent memset() optimization. It also includes cases where something like a realtime system (which includes games, if you squint a little bit) might be updating a data structure that's not going to be accessed before the next time interval.

    If you simply use them in that way, then I think you're unlikely to get burned by their side-effects. I don't regard them as a good, general purpose way to avoid RFO, which is why it's really nice to know about the REP STOSB option.
    But is REP STOS really that useful? I mean, it's not a general-purpose store that you can use to write arbitrary data to memory. As far as I know, the way it works restricts its use cases to memset-like memory initialization.

    Leave a comment:


  • coder
    replied
    Originally posted by fairydreaming View Post
    The article mentions that REP STOS and non-temporal writes are two ways of avoiding RFO on Zen 4 but doesn't mention any other mechanisms:

    I guess if AVX-512 stores were another way to avoid RFOs on Zen 4, they would explicitly mention that.
    Maybe? They didn't talk about AVX-512 in that context, so it's hard to know whether it was an oversight, or a baseline assumption on their part.

    I guess I could rent a Zen 4 cloud instance and try it for myself, but I don't currently have a need to know the answer (other than idle curiosity).

    Originally posted by fairydreaming View Post
    While digging for more info on this matter, I found that in Ice Lake Intel introduced yet another mechanism for avoiding RFOs, called SpecI2M. This seems to be a dynamic mechanism that kicks in only for specific workloads (the STREAM benchmark seems to be one of them). I couldn't find anything comparable for AMD CPUs.
    I don't know if this is where you found it mentioned, but it explains not only a bit about SpecI2M but also mentions a key detail about why non-temporal stores avoid RFO:

    From IRMA's presentation on Icelake server it said: "Convert RFO to specI2M when memory subsystem is heavily loaded. Reduces mem bandwidth demand on streaming WLs that do full cache line writes (25% efficiency increase)." So I would like to understand what is specI2M and how it differs from RFO (...


    Specifically:

    "Non-temporal stores are weakly ordered and require the binary to include extra fence instructions when switching back to ordinary stores."

    I was unaware of this, which is why I didn't expect non-temporal stores to avoid RFO. I assumed the strong memory ordering of x86 applied to all instructions. Good to know.
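    Concretely, that means the streaming loop has to end with an SFENCE before ordinary stores resume. A minimal sketch with the SSE2 intrinsics:

        #include <emmintrin.h>   /* _mm_stream_si64, _mm_sfence */
        #include <stddef.h>
        #include <stdint.h>

        /* Non-temporal fill: the NT stores are weakly ordered, so the
         * SFENCE is needed to make them globally visible before any
         * ordinary stores that follow. */
        void nt_fill(uint64_t *dst, uint64_t value, size_t count)
        {
            for (size_t i = 0; i < count; i++)
                _mm_stream_si64((long long *)&dst[i], (long long)value);
            _mm_sfence();   /* drain the write-combining buffers */
        }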

    Also, with regard to another point made in that post:

    "The concept of 'memory' is becoming less clear, with the addition of layers 'beyond' DRAM (e.g., persistent memory), and/or the addition of caching layers between DRAM and the traditional caches"

    IMO, non-temporal loads & stores should be used when reuse isn't anticipated. This includes cases where you're dealing with a chunk of data that's larger than the largest L3 segment, such as in this recent memset() optimization. It also includes cases where something like a realtime system (which includes games, if you squint a little bit) might be updating a data structure that's not going to be accessed before the next time interval.

    If you simply use them in that way, then I think you're unlikely to get burned by their side-effects. I don't regard them as a good, general purpose way to avoid RFO, which is why it's really nice to know about the REP STOSB option.

    BTW, that post also helpfully reminds us that ARM's weaker memory model apparently makes avoiding RFO easier. I did a little searching to see if the ordering constraints on REP STOSB are at all relaxed, as in the case of the non-temporal writes, but didn't turn up anything on it.
    Last edited by coder; 30 November 2024, 05:38 PM.

    Leave a comment:


  • fairydreaming
    replied
    Originally posted by coder View Post

    Chips & Cheese compared 3 different write strategies on Zen 4: a normal write, a string write (i.e. a variable-length write of known size), and a non-temporal write. They found the string performed 0.9% better than the non-temporal write, which performed 71.5% better than a naive write.

    I don't see why an aligned AVX-512 write should behave any different than the string write, but it's possible that Zen 4 simply didn't optimize that case. Regardless, it does appear the non-temporal write is doing something that minimizes or eliminates the RFO penalty, assuming the benefit shown isn't simply due to it being a 512-bit write when the normal write isn't.
    The article mentions that REP STOS and non-temporal writes are two ways of avoiding RFO on Zen 4 but doesn't mention any other mechanisms:

    The REP STOSB instruction tells the CPU upfront how much data to set to a particular value, letting it avoid RFOs when it knows an entire cacheline will be overwritten. Non-temporal writes (with MOVNTPS) use a write-combining memory protocol that also avoids RFOs, and bypasses caches.
    I guess if AVX-512 stores were another way to avoid RFOs on Zen 4, they would explicitly mention that.

    While digging for more info on this matter, I found that in Ice Lake Intel introduced yet another mechanism for avoiding RFOs, called SpecI2M. This seems to be a dynamic mechanism that kicks in only for specific workloads (the STREAM benchmark seems to be one of them). I couldn't find anything comparable for AMD CPUs.
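    For reference, the REP STOSB pattern the article describes boils down to something like this (GNU-style inline asm sketch):

        #include <stddef.h>
        #include <stdint.h>

        /* memset-style fill: RCX carries the full byte count upfront, so
         * the CPU can see that entire cachelines will be overwritten and
         * skip the RFO for them. */
        static inline void stosb_fill(void *dst, uint8_t value, size_t n)
        {
            __asm__ volatile("rep stosb"
                             : "+D"(dst), "+c"(n)
                             : "a"(value)
                             : "memory");
        }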

    Leave a comment:


  • coder
    replied
    Originally posted by fairydreaming View Post
    Yes, but in the STREAM benchmark most of the time we are dealing with a case where all 64 bytes of the cache line buffer have been written, and in this case the memory controller can simply overwrite the value in memory with the new data without reading it first,
    If you atomically write all of the data in a cache line, I think it should always be true that it can be written without first being read. How well this is implemented in practice is another matter.

    Originally posted by fairydreaming View Post
    that's how I understand this explanation by John McCalpin: https://community.intel.com/t5/Softw.../1267912#M7845
    That post incorrectly claims that a non-temporal store will invalidate the cacheline elsewhere in the cache hierarchy, regardless of whether you overwrite the whole thing. That's not true: non-temporal writes are still cache-coherent. Maybe the invalidate happens if the entire line is atomically overwritten, though.

    Or, perhaps the author meant to say that it triggers a flush of a dirty copy, should one exist. That would at least be consistent with the memory controller doing the merge (which still sounds a bit odd, to me).

    Originally posted by fairydreaming View Post
    likwid-bench benchmark results from my Epyc 9374F seem to support this:
    • stream - 287 GB/s
    • stream_avx512 - 285 GB/s
    • stream_mem_avx512 - 376 GB/s
    As you can see, using AVX-512 alone doesn't improve the bandwidth. Only when it's used along with non-temporal stores (_mem_) does the performance improve, by 31%.
    Chips & Cheese compared 3 different write strategies on Zen 4: a normal write, a string write (i.e. a variable-length write of known size), and a non-temporal write. They found the string performed 0.9% better than the non-temporal write, which performed 71.5% better than a naive write.


    I don't see why an aligned AVX-512 write should behave any different than the string write, but it's possible that Zen 4 simply didn't optimize that case. Regardless, it does appear the non-temporal write is doing something that minimizes or eliminates the RFO penalty, assuming the benefit shown isn't simply due to it being a 512-bit write when the normal write isn't.


    Leave a comment:


  • fairydreaming
    replied
    Originally posted by coder View Post
    Non-temporal stores don't avoid read-for-overwrite. They're normal, cache-coherent stores, and using one to write a partial cacheline will certainly first have to do a fetch.
    Yes, but in the STREAM benchmark most of the time we are dealing with a case where all 64 bytes of the cache line buffer have been written, and in this case the memory controller can simply overwrite the value in memory with the new data without reading it first, at least that's how I understand this explanation by John McCalpin: https://community.intel.com/t5/Softw.../1267912#M7845

    likwid-bench benchmark results from my Epyc 9374F seem to support this:
    • stream - 287 GB/s
    • stream_avx512 - 285 GB/s
    • stream_mem_avx512 - 376 GB/s
    As you can see, using AVX-512 alone doesn't improve the bandwidth. Only when it's used along with non-temporal stores (_mem_) does the performance improve, by 31%.

    Leave a comment:


  • coder
    replied
    Originally posted by fairydreaming View Post
    As for the "phantom reads", the "mem" benchmark variants use non-temporal stores that avoid the overhead of "phantom reads".
    Non-temporal stores don't avoid read-for-overwrite. They're normal, cache-coherent stores, and using one to write a partial cacheline will certainly first have to do a fetch.

    The benefit of non-temporal stores is avoiding cache pollution. Depending on how the CPU's memory prefetchers work, this could be the win. If they prefetch into L3, then using non-temporal reads & writes would minimize thrashing caused by the prefetcher. That said, I expect the prefetchers exist per-core and actually fetch into L2, where there shouldn't be a thrashing problem.

    Leave a comment:


  • coder
    replied
    Originally posted by fairydreaming View Post
    Thanks for this information, very helpful. Based on what you said, the highest possible STREAM TRIAD benchmark results would be:
    • for Epyc Turin would be 75% * 576 GB/s = 432 GB/s
    • for Epyc Genoa would be 75% * 460.8 GB/s = 345.6 GB/s.
    But, for example, in the Fujitsu Server PRIMERGY Performance Report for RX1440 M2 servers (Epyc Genoa) we can observe values close to 400 GB/s. Link to report here: https://sp.ts.fujitsu.com/dmsp/Publi...0-m2-ww-en.pdf

    Any idea how that is possible?
    Thanks for posting that!

    The linked document shows a peak single-socket STREAM TRIAD score of 398 GB/s. First, I'd like to point out that it was run in NPS4 mode, which should help. However, that doesn't explain the discrepancy you noted between the measured results and the theoretical ceiling when accounting for RFO (read-for-overwrite).

    Maybe AVX-512 optimizations are in effect, which could avoid the RFO penalty. That could work because a full AVX-512 register is 64 bytes, which matches the cacheline size. If you do a cache-aligned write of an entire cacheline, then there's no point in fetching the old value first. RFO is something that only makes sense for partial writes.
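    The case I have in mind is a 64-byte-aligned store covering a whole line, something like the sketch below. Whether Zen 4 actually elides the RFO here is exactly the open question:

        #include <immintrin.h>
        #include <stddef.h>

        /* One aligned AVX-512 store covers an entire 64-byte cacheline, so
         * in principle the old contents never need to be fetched. dst must
         * be 64-byte aligned. */
        void fill_lines(void *dst, __m512i v, size_t nlines)
        {
            char *p = (char *)dst;
            for (size_t i = 0; i < nlines; i++)
                _mm512_store_si512((void *)(p + i * 64), v);
        }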

    That reopens the question around Michael's performance measurements, since he compiled STREAM TRIAD with -march=native, which should mean he also gets the benefits of AVX-512 in this case. So I have to wonder whether his NPS mode setting and other configuration might help explain it. Also, they ran with SMT disabled.

    It would be interesting to dig deeper into this.
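    For anyone following along, the 75% factor above comes from TRIAD's traffic accounting: a[i] = b[i] + s*c[i] is credited with two reads and one write per element, but with RFO the hardware also reads the destination line, so only 3 of every 4 transfers count toward the score. In numbers (peak figures as quoted above; the channel/speed attributions are my assumption):

        #include <stdio.h>

        /* TRIAD counts 3 streams, RFO makes the hardware move 4, so the
         * reported ceiling is 3/4 of peak DRAM bandwidth. */
        int main(void)
        {
            const double turin_peak = 576.0;    /* GB/s, 12ch DDR5-6000 assumed */
            const double genoa_peak = 460.8;    /* GB/s, 12ch DDR5-4800 assumed */
            printf("Turin ceiling: %.1f GB/s\n", 0.75 * turin_peak);   /* 432.0 */
            printf("Genoa ceiling: %.1f GB/s\n", 0.75 * genoa_peak);   /* 345.6 */
            return 0;
        }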

    Leave a comment:
