DDR5 Memory Channel Scaling Performance With AMD EPYC 9004 Series


  • filbo
    replied
    To keep the total test matrix somewhat sane, do all #channels x same memory size, then all NUMA configs x full 12 channels x still the same memory size, and x full memory size, chopping out the intermediate steps...



  • filbo
    replied
These benchmarks are interesting, but I feel they are mixing the answers to two semi-correlated questions: how sensitive a workload is to bandwidth/channel count, and how sensitive it is to total available memory.

    These are 64GiB DIMMs, 2 per channel, from 6 to 12 channels, so from 768 to 1536GiB by steps of 256GiB. There are kernel command line switches to limit total memory; I'm pretty sure (and I guess benchmarking would clarify) that limiting each N-channel config to the same 768GiB memory horizon would have the desired effect. That is, all N channels would be spread across the low 768GiB, with increasingly large ignored sections at the top.
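One way to phrase that cap, assuming a GRUB-based distro (the file path and regeneration command are illustrative; check your distro's docs):

```shell
# Hypothetical sketch: cap usable RAM at 768 GiB via the kernel command line.
# In /etc/default/grub (path assumed; varies by distro):
GRUB_CMDLINE_LINUX_DEFAULT="mem=768G"
# Then regenerate the GRUB config and reboot, e.g.:
#   sudo update-grub && sudo reboot
```

With fine-grained channel interleaving, the remaining 768GiB should still be striped across all populated channels, which is what makes the comparison fair.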

    Such tests would truly focus on #channels sensitivity. They would address questions like 'all else being equal, are 6 channels of 64GiB DIMMs likely to underperform 12 channels of 32GiB?' (and by how much). This could be a rather significant question if there's a big elbow in the DIMM price curve. (I was going to give an actual example based on Michael's quoted prices, but $100/16GiB, $183/32, $350/64 is actually almost perfectly flat!)
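As a quick worked check of that "almost perfectly flat" claim, here is the per-GiB cost from the quoted prices:

```python
# Per-GiB cost of the DIMM prices quoted above (USD).
prices = {16: 100, 32: 183, 64: 350}  # capacity in GiB -> price
per_gib = {size: usd / size for size, usd in prices.items()}
# 16 GiB: $6.25/GiB, 32 GiB: ~$5.72/GiB, 64 GiB: ~$5.47/GiB --
# the curve even slopes slightly *down* toward larger DIMMs.
```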

    These questions might well be explored at the same time as the NUMA modes exploration. That is, do the NUMA modes testing with flat total memory size, varied number of channels. A slice of those tests focused on only one NUMA mode answers the channels-vs-size question.



  • coder
    replied
    Originally posted by kobblestown View Post
I would still have liked to see the 2c and 4c scenarios. You are probably correct that they are not very likely for real deployments (maybe for the SKUs with fewer cores?), but they could illuminate some intricacies of the workloads.
I was thinking it should start at quad-channel. Dual-channel would be a little funny, but it would probably also shed a bit more light on how memory-sensitive certain workloads are.

    Originally posted by kobblestown View Post
    For instance, I am a bit surprised by the bad scaling of compilation benchmarks.
I've got to wonder if that's not due to I/O bottlenecks. It'd be cool if PTS did something like running iostat -x 1 in the background and capturing the output. Then we could see how often drive utilization was maxed out.
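As a minimal sketch of what that post-processing could look like, assuming iostat's -x layout with %util as the last column (the sample lines below are illustrative, not from the actual run):

```python
# Sketch: find peak device utilization from an `iostat -x 1` log.
# The sample text is made up for illustration.
sample_log = """\
Device            r/s     w/s   rkB/s   wkB/s  %util
nvme0n1         120.0    30.0  480000   12000   97.50
nvme0n1          80.0    10.0  320000    4000   42.10
"""

def peak_util(log_text, device):
    utils = []
    for line in log_text.splitlines():
        fields = line.split()
        if fields and fields[0] == device:
            utils.append(float(fields[-1]))  # %util is the last column
    return max(utils)

print(peak_util(sample_log, "nvme0n1"))  # -> 97.5
```

A sustained %util near 100 during the compile phases would point at the drive rather than memory bandwidth.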

    Originally posted by kobblestown View Post
for big projects it has proven beneficial to use unity builds, whereby the compilation units are auto-generated files which #include (as in #include "file1.cpp", etc.) the C/C++ files, so many source files are compiled at once.
I don't like it, because there are too many pitfalls and relatively limited room for benefits (e.g. clashing static variable and function names; lots of disjoint control flow between the source files).

    Fortunately, LTO should render such practices mostly obsolete.

    Originally posted by kobblestown View Post
    Then a single compiler process can go up to 2-3 GB of memory and there would be many of those.
    I've seen C++ code easily use a couple GB per compilation unit. If you looked, I'm guessing you'd see some of LLVM's files do that.

    Originally posted by kobblestown View Post
    BTW, such platforms probably have several different memory interleaving modes. I would be interested to see how much difference they make for the different workloads.
    That touches on something I've been wondering about, which is how memory gets interleaved on these CPUs. Is it interleaved at cacheline granularity, at page granularity, or do the channels map linearly and then it's up to the OS to interleave via the page table (if at all)?

    And then, what exactly are the NUMA modes doing (e.g. NPS1, NPS2, NPS4)? Do they simply change the address mapping of the memory channels? That's mostly what Anandtech seems to suggest, in their review of Milan.

    Would be nice to get some more insight into these matters, if anyone has solid information on them.
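The difference between those two hypotheses can be sketched as a simple address-to-channel mapping (purely illustrative; the real Genoa address decoding is not public in this detail, and the constants here are assumptions):

```python
CACHELINE_BITS = 6   # 64-byte cachelines
PAGE_BITS = 12       # 4 KiB pages

def channel_cacheline(addr, num_channels):
    # Cacheline-granularity interleave: consecutive 64 B lines
    # rotate across the channels.
    return (addr >> CACHELINE_BITS) % num_channels

def channel_page(addr, num_channels):
    # Page-granularity interleave: consecutive 4 KiB pages rotate.
    return (addr >> PAGE_BITS) % num_channels

# Two adjacent cachelines land on different channels in the first
# scheme, but on the same channel in the second:
print(channel_cacheline(0, 12), channel_cacheline(64, 12))  # -> 0 1
print(channel_page(0, 12), channel_page(64, 12))            # -> 0 0
```

Which scheme the hardware uses matters a lot for single-stream bandwidth: cacheline interleave spreads even one sequential reader across all channels, while linear or page mapping leaves distribution up to access patterns and the OS.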



  • kobblestown
    replied
Michael, I would still have liked to see the 2c and 4c scenarios. You are probably correct that they are not very likely for real deployments (maybe for the SKUs with fewer cores?), but they could illuminate some intricacies of the workloads. For instance, I am a bit surprised by the bad scaling of the compilation benchmarks. I assume the build systems use the classic each-file-is-its-own-compilation-unit approach. I imagine, then, that each compilation process does not require much memory and the caches do a good job covering most of it.

However, for big projects it has proven beneficial to use unity builds, whereby the compilation units are auto-generated files which #include (as in #include "file1.cpp", etc.) the C/C++ files, so many source files are compiled at once. Then a single compiler process can go up to 2-3 GB of memory, and there would be many of those.
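As a sketch of what such an auto-generated unity compilation unit looks like (all file names here are hypothetical; build systems like CMake's UNITY_BUILD generate these automatically):

```cpp
// unity_0.cpp -- hypothetical auto-generated unity compilation unit.
// Each #include pulls an entire source file into this one translation
// unit, so the compiler front-end runs once for all of them.
#include "parser.cpp"    // hypothetical project sources
#include "lexer.cpp"
#include "codegen.cpp"
// Pitfall: a `static int counter;` in parser.cpp and another one in
// lexer.cpp now collide, because they share a single translation unit.
```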

Sorry, I cannot point you to any OSS project using such a build that could be included as a benchmark. That's why I think it's a good idea to include tests with 2c and 4c.

I understand that this requires extra work, and you are within your rights to disregard it. And you wrote a justification for your choice, so kudos for that! But you might want to consider it for the future.

BTW, such platforms probably have several different memory interleaving modes. I would be interested to see how much difference they make for the different workloads. Whenever I've searched for more information about them, I couldn't find any detailed source explaining what they really mean, so I cannot even hypothesize about the impact. Having them benchmarked could therefore prove very useful.

    Still, many thanks for these benchmarks. There aren't many sites that pay attention to these things (are there any at all?).

    Cheers



  • dkokron
    replied
    coder Much appreciated.



  • coder
    replied
    Originally posted by dkokron View Post
    I read the anandtech article. The benchmarks aren't the type that I'm interested in. I am looking for HPC benchmarks like the type Michael runs.
Okay, then you probably won't find this very interesting, but at least they're on the right platform and using RDIMMs. It also claims dual-rank memory is needed to extract the most bandwidth from Genoa.

    I don't know if that means "at least dual-rank" or "only dual-rank".



  • dkokron
    replied
    Originally posted by coder View Post
    For unbuffered, you want dual-rank, but 1 DIMM per channel.



    However, I can't say if that applies to the registered memory used in these servers. For most, there's no choice. DDR5 is too new, so memory capacity requirements will likely drive decisions about which size, type, and number of DIMMs to buy.
    I read the anandtech article. The benchmarks aren't the type that I'm interested in. I am looking for HPC benchmarks like the type Michael runs.



  • coder
    replied
    Originally posted by Michael View Post
The drive used was a P5800X. (AMD didn't supply storage with Titanite besides some odd WD ~SN750 NVMe SSD for use as a boot drive....)
That's good to hear. The system description at the top says "800GB INTEL SSDPF21Q800GB", which is an 800 GB Intel DC P3600 MLC SSD.

    Any idea how the error crept in? Was that data collected from a different run?



  • Michael
    replied
    Originally posted by coder View Post
    Overall, mostly what I expected. I'm a little surprised some of the deep learning benchmarks weren't more sensitive to memory bandwidth, but I guess those must've used models small enough to fit in L3 cache.

    As for the rendering benchmarks, I wasn't too surprised after having seen:



For me, the biggest surprise was that compilation wasn't more bandwidth-intensive! Could it possibly have been I/O-bottlenecked? The drive appears to be an 800 GB Intel DC P3600 MLC SSD. Read-oriented, probably. ark.intel.com has already scrubbed it from their database, and I'm too lazy to search Solidigm's site.

    Here are the specs from Google's cache:
    • Sequential Bandwidth - 100% Read (up to) 2600 MB/s
    • Sequential Bandwidth - 100% Write (up to) 1000 MB/s
    • Random Read (100% Span) 430000 IOPS (4K Blocks)
    • Random Write (100% Span) 50000 IOPS (4K Blocks)
    Michael, I don't suppose this was a SSD that AMD sent you? Or was the platform shipped without storage?
The drive used was a P5800X. (AMD didn't supply storage with Titanite besides some odd WD ~SN750 NVMe SSD for use as a boot drive....)



  • coder
    replied
    Originally posted by dkokron View Post
    I didn't find what I am looking for via the search feature so I'm asking here. Do you have any comparisons between single-rank and dual-rank DDR5?
    For unbuffered, you want dual-rank, but 1 DIMM per channel.



    However, I can't say if that applies to the registered memory used in these servers. For most, there's no choice. DDR5 is too new, so memory capacity requirements will likely drive decisions about which size, type, and number of DIMMs to buy.

