DDR5 Memory Channel Scaling Performance With AMD EPYC 9004 Series


  • #11
    Originally posted by dkokron View Post
    I read the anandtech article. The benchmarks aren't the type that I'm interested in. I am looking for HPC benchmarks like the type Michael runs.
    Okay, then you probably won't find this very interesting, but at least they're on the right platform and using RDIMMs. It also claims dual-rank memory is needed to extract the most bandwidth from Genoa:

    I don't know if that means "at least dual-rank" or "only dual-rank".



    • #12
      coder, much appreciated.



      • #13
        Michael, I would still have liked to see the 2c and 4c scenarios. You are probably correct that they are not very likely for real deployments (maybe for the SKUs with fewer cores?) but could illuminate some intricacies of the workloads. For instance, I am a bit surprised by the bad scaling of compilation benchmarks. I assume the build systems use the classic "each file is its own compilation unit" approach. I imagine, then, that each compilation process does not require much memory and the caches are doing a good job covering most of it.

        However, for big projects it has proved beneficial to use unity builds, whereby the compilation units are auto-generated files that #include the individual .c/.cpp files (as in #include "file1.cpp", etc.), so many source files are compiled at once. Then a single compiler process can go up to 2-3 GB of memory and there would be many of those.
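
        For anyone who hasn't seen one, such an auto-generated unity ("jumbo") compilation unit is essentially just a list of textual includes; a minimal sketch, with made-up file names, might look like this:

            // unity_0.cpp -- generated by the build system; the file names here
            // are hypothetical, purely for illustration
            #include "renderer.cpp"
            #include "mesh_loader.cpp"
            #include "physics.cpp"
            // ...and potentially dozens more .cpp files, so a single compiler
            // invocation (and a single large memory footprint) covers them all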

        Sorry, I cannot point you to any OSS project that uses such a build setup and could be included as a benchmark. That's why I think it's a good idea to include tests with 2c and 4c.

        I understand that this requires extra work and you are within your rights to disregard it. And you did write a justification for your choice, so kudos for that! But you might want to consider it in the future.

        BTW, such platforms probably have several different memory interleaving modes. I would be interested to see how much difference they make for the different workloads. Whenever I've searched for more information about them, I could not find any detailed source explaining what they really mean, so I cannot even hypothesize about the impact. Having them benchmarked could prove very useful.

        Still, many thanks for these benchmarks. There aren't many sites that pay attention to these things (are there any at all?).

        Cheers



        • #14
          Originally posted by kobblestown View Post
          I would still have liked to see the 2c and 4c scenarios. You are probably correct that they are not very likely for real deployments (maybe for the SKUs with fewer cores?) but could illuminate some intricacies of the workloads.
          I was thinking it should start at quad-channel. Dual-channel would be a little odd, but it would probably also shed a bit more light on how memory-sensitive certain workloads are.

          Originally posted by kobblestown View Post
          For instance, I am a bit surprised by the bad scaling of compilation benchmarks.
          I've got to wonder if that's not due to I/O bottlenecks. It'd be cool if PTS did something like running iostat -x 1 in the background and capturing the output. Then we could see how often drive utilization was maxed out.
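
          Not PTS code, just a sketch of the mechanism I mean, assuming a POSIX system with sysstat's iostat available: start iostat -x 1 writing to a side log, run the workload, then stop the sampler.

              // background_iostat.cpp -- illustration only; "sleep 10" stands in
              // for the real benchmark command
              #include <cstdlib>
              #include <fcntl.h>
              #include <signal.h>
              #include <sys/wait.h>
              #include <unistd.h>

              int main() {
                  pid_t pid = fork();
                  if (pid == 0) {
                      // Child: send iostat's output to a log file and exec it.
                      int fd = open("iostat.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
                      if (fd >= 0) dup2(fd, STDOUT_FILENO);
                      execlp("iostat", "iostat", "-x", "1", (char*)nullptr);
                      _exit(1);  // only reached if exec failed
                  }

                  std::system("sleep 10");   // placeholder for the actual benchmark run

                  kill(pid, SIGTERM);        // stop the sampler...
                  waitpid(pid, nullptr, 0);  // ...and reap it
                  return 0;
              }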

          Originally posted by kobblestown View Post
          for big projects it has proved beneficial to use unity builds, whereby the compilation units are auto-generated files that #include the individual .c/.cpp files (as in #include "file1.cpp", etc.), so many source files are compiled at once.
          I don't like it, because there are too many pitfalls and relatively limited room for benefits (e.g. clashing static variable and function names; lots of disjoint control flow between the source files).

          Fortunately, LTO should render such practices mostly obsolete.

          Originally posted by kobblestown View Post
          Then a single compiler process can go up to 2-3 GB of memory and there would be many of those.
          I've seen C++ code easily use a couple GB per compilation unit. If you looked, I'm guessing you'd see some of LLVM's files do that.

          Originally posted by kobblestown View Post
          BTW, such platforms probably have several different memory interleaving modes. I would be interested to see how much difference they make for the different workloads.
          That touches on something I've been wondering about, which is how memory gets interleaved on these CPUs. Is it interleaved at cacheline granularity, at page granularity, or do the channels map linearly and then it's up to the OS to interleave via the page table (if at all)?
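
          Just to make that question concrete (purely illustrative, not how Genoa actually maps addresses), the three options correspond to three different physical-address-to-channel functions, e.g. with 64-byte cachelines, 4 KiB pages and equally sized channels:

              // channel_mapping.cpp -- hypothetical mappings, for illustration only
              #include <cstdint>
              #include <cstdio>

              // Adjacent cachelines land on different channels.
              uint64_t channel_by_cacheline(uint64_t addr, uint64_t channels) {
                  return (addr / 64) % channels;
              }

              // Adjacent 4 KiB pages land on different channels.
              uint64_t channel_by_page(uint64_t addr, uint64_t channels) {
                  return (addr / 4096) % channels;
              }

              // Each channel owns one contiguous chunk of the address space;
              // any interleaving is then left to the OS page allocator.
              uint64_t channel_linear(uint64_t addr, uint64_t bytes_per_channel) {
                  return addr / bytes_per_channel;
              }

              int main() {
                  const uint64_t addr = 1ull << 30;  // an arbitrary sample address (1 GiB)
                  std::printf("cacheline: %llu  page: %llu  linear: %llu\n",
                              (unsigned long long)channel_by_cacheline(addr, 12),
                              (unsigned long long)channel_by_page(addr, 12),
                              (unsigned long long)channel_linear(addr, 64ull << 30));
                  return 0;
              }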

          And then, what exactly are the NUMA modes doing (e.g. NPS1, NPS2, NPS4)? Do they simply change the address mapping of the memory channels? That's mostly what Anandtech seems to suggest in their review of Milan.

          Would be nice to get some more insight into these matters, if anyone has solid information on them.



          • #15
            These benchmarks are interesting, but I feel they are mixing the answers to two semi-correlated questions: how sensitive a workload is to bandwidth/channel count, and how sensitive it is to total available memory.

            These are 64GiB DIMMs, 2 per channel, from 6 to 12 channels, so from 768 to 1536GiB in steps of 256GiB. There are kernel command-line switches to limit total memory (mem=768G, for instance); I'm pretty sure (and I guess benchmarking would clarify) that limiting each N-channel config to the same 768GiB ceiling would have the desired effect. That is, all N channels would be spread across the low 768GiB, with increasingly large ignored sections at the top.

            Such tests would truly focus on channel-count sensitivity. They would address questions like 'all else being equal, are 6 channels of 64GiB DIMMs likely to underperform 12 channels of 32GiB?' (and by how much). This could be a rather significant question if there's a big elbow in the DIMM price curve. (I was going to give an actual example based on Michael's quoted prices, but $100/16GiB, $183/32GiB, $350/64GiB works out to roughly $6.25, $5.72 and $5.47 per GiB, which is actually almost perfectly flat!)

            These questions might well be explored at the same time as the NUMA-mode exploration. That is, do the NUMA-mode testing with a flat total memory size and a varied number of channels. The slice of those tests covering just one NUMA mode then answers the channels-vs-size question.



            • #16
              To keep the total test matrix somewhat sane: do all channel counts at the same memory size, then all NUMA configs at the full 12 channels, first at that same memory size and then at the full memory size, chopping out the intermediate steps...
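
              Spelled out (this is just my reading of that proposal, and the 768/1536GiB figures come from the 2-DIMM-per-channel numbers above, not from the article's actual setup), the matrix would enumerate something like:

                  // test_matrix.cpp -- one possible interpretation of the proposal above
                  #include <cstdio>

                  int main() {
                      const int channels[] = {6, 8, 10, 12};
                      const char* numa_modes[] = {"NPS1", "NPS2", "NPS4"};

                      // Pass 1: sweep channel count at a fixed (capped) memory size.
                      for (int c : channels)
                          std::printf("channels=%2d  mem=768GiB   numa=default\n", c);

                      // Pass 2: sweep NUMA mode at the full 12 channels, at the capped
                      // size and again at the full size.
                      for (const char* m : numa_modes) {
                          std::printf("channels=12  mem=768GiB   numa=%s\n", m);
                          std::printf("channels=12  mem=1536GiB  numa=%s\n", m);
                      }
                      return 0;
                  }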

