8 vs. 12 Channel DDR5-6000 Memory Performance With AMD 5th Gen EPYC

  • coder
    Senior Member
    • Nov 2014
    • 8863

    #11
    Originally posted by GPTshop.ai:
    I decided to buy ASRock Rack BERGAMOD8-2L2T. It has only 8x DIMM slots but PCIe 5.0 M.2. I expect faster disk speeds to have a greater impact on performance than faster RAM speeds.
    Their non-Pro Threadripper platform has only 8 channels, so they clearly think that's enough for some stuff. How many cores does the CPU have?

    If you're spending serious money on a machine, like that, I'd go with U.2 SSDs for the main storage workload. They're easier to cool, if you mount them in a drive bay, provide better tail latencies, and withstand abuse much better than M.2 drives.

    A while ago, I read one account of someone who was advocating hard disks for database workloads, because he had previously tried using a Samsung 980 Pro (M.2) SSD and it would pause for a couple of seconds every now and then. He thought that, because it had "Pro" in the name and was a fast drive, it should be suitable for what he was trying to do. Except it's a consumer drive and Samsung knows this, because they put it in their consumer lineup, not among their datacenter products. So, instead of using an SSD that's actually designed to handle database workloads, he took the wrong lesson and went back to hard drives.

    On server boards, the main reason they even have an M.2 slot is for the boot drive. They don't expect you to use it for anything serious. A lot of those server boards don't even have a built-in heatsink for it.

    Here's a good place to find reviews of datacenter SSDs:
    Last edited by coder; 21 November 2024, 10:09 AM.


    • GPTshop.ai
      Junior Member
      • Feb 2024
      • 38

      #12
      Originally posted by coder:
      Their non-Pro Threadripper platform has only 8 channels, so they clearly think that's enough for some stuff. How many cores does the CPU have?

      If you're spending serious money on a machine, like that, I'd go with U.2 SSDs for the main storage workload. They're easier to cool, if you mount them in a drive bay, provide better tail latencies, and withstand abuse much better than M.2 drives.

      A while ago, I read one account of someone who was advocating hard disks for database workloads, because he had previously tried using a Samsung 980 Pro (M.2) SSD and it would pause for a couple of seconds every now and then. He thought that, because it had "Pro" in the name and was a fast drive, it should be suitable for what he was trying to do. Except it's a consumer drive and Samsung knows this, because they put it in their consumer lineup, not among their datacenter products. So, instead of using an SSD that's actually designed to handle database workloads, he took the wrong lesson and went back to hard drives.

      On server boards, the main reason they even have an M.2 slot is for the boot drive. They don't expect you to use it for anything serious. A lot of those server boards don't even have a built-in heatsink for it.

      Here's a good place to find reviews of datacenter SSDs:
      For a simple webserver, IMHO you do not need much storage. I will use 2x Crucial T705 2TB, M.2, the fastest SSD available to my knowledge....

      PS: CPU Epyc 9135


      • coder
        Senior Member
        • Nov 2014
        • 8863

        #13
        Originally posted by GPTshop.ai:
        For a simple webserver, IMHO you do not need much storage.
        Depending on what it's doing, it could be a read-mostly workload. Consumer SSDs are probably okay for that, but make sure they have heatsinks!

        Originally posted by GPTshop.ai:
        PS: CPU Epyc 9135
        For 16 cores, even 8-channel memory should be overkill.


        • GPTshop.ai
          Junior Member
          • Feb 2024
          • 38

          #14
          Originally posted by coder:
          Depending on what it's doing, it could be a read-mostly workload. Consumer SSDs are probably okay for that, but make sure they have heatsinks!


          For 16 cores, even 8-channel memory should be overkill.
          The Crucial T705 2TB (M.2) has absolutely massive heatsinks. And yes, Epyc Turin for a webserver is overkill...


          • jruhe
            Junior Member
            • Nov 2024
            • 4

            #15
            Michael, you might consider running Likwid-Bench, which is NUMA-aware.

            What I’m particularly interested in is a comparison with another bandwidth-intensive processor, Apple’s M4 CPU: The token generation speed of LLMs is entirely dependent on memory bandwidth, so llama-bench of llama.cpp gives valuable insights: https://github.com/ggerganov/llama.cpp/discussions/4167
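            The bandwidth argument can be put into numbers with a back-of-the-envelope bound. This is only a sketch, assuming a dense model whose weights are all read from memory once per generated token and ignoring compute, caches, and KV-cache traffic. The function name and the example figures below are hypothetical, not measurements from this thread.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on token generation speed.

    Assumption: every weight is streamed from memory once per token,
    so one token costs about `model_size_gb` of memory traffic.
    """
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # Hypothetical inputs: 400 GB/s of usable bandwidth, 40 GB of weights.
    print(max_tokens_per_second(400.0, 40.0))  # 10.0
```

            Measured llama-bench results should land below this bound; how far below is what makes the comparison between platforms interesting.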


            • Michael
              Phoronix
              • Jun 2006
              • 14296

              #16
              Originally posted by jruhe:
              Michael, you might consider running Likwid-Bench, which is NUMA-aware.

              What I’m particularly interested in is a comparison with another bandwidth-intensive processor, Apple’s M4 CPU: The token generation speed of LLMs is entirely dependent on memory bandwidth, so llama-bench of llama.cpp gives valuable insights: https://github.com/ggerganov/llama.cpp/discussions/4167
              I don't think I've heard of likwid-bench before, but from a quick look it appears straightforward enough that I should be able to add it to PTS for future benchmarks.
              Michael Larabel
              https://www.michaellarabel.com/


              • jruhe
                Junior Member
                • Nov 2024
                • 4

                #17
                Originally posted by Michael:

                I don't think I've heard of likwid-bench before, but from a quick look it appears straightforward enough that I should be able to add it to PTS for future benchmarks.
                Right. Easy to build, easy to use. And every benchmark type comes in avx512, avx2, etc. variants.

                But the same applies to llama-bench, which can also be parameterized to generate sequences of results.


                • fairydreaming
                  Junior Member
                  • Oct 2024
                  • 9

                  #18
                  jruhe, I tried likwid-bench and it's very good: finally a way to perform NUMA-aware benchmarks without much hassle. Do I understand correctly that it already takes into account the "phantom read" that coder talked about when performing the triad benchmark and presenting the results? I got ~380 GB/s on my 9374F in the triad_mem_avx512 bench, which is almost the same as the ~387 GB/s in the load_avx512 bench.

                  Edit: It's more complicated than I expected. The "triad" benchmarks in likwid-bench are different from the original STREAM TRIAD benchmark, as they use the kernel A[i] = B[i] + C[i] * D[i]. That's why they use 3 loads and 1 store per update. The original STREAM TRIAD benchmarks, with the kernel A[i] = B[i] + a * C[i], are also present, but they have a "stream" prefix and use 2 loads and 1 store per update. As for the "phantom reads", the "mem" benchmark variants use non-temporal stores, which avoid the overhead of "phantom reads".
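                  To make the load/store accounting concrete, here is a small sketch of how much of the real memory traffic a benchmark would report for each kernel. It assumes the benchmark counts only the explicit loads and stores, and that a regular (write-allocate) store costs one extra read of the destination line, which is the "phantom read". The dictionary and function names are made up for illustration and are not part of likwid-bench.

```python
from fractions import Fraction

# (loads, stores) per update, as described for the two kernels above.
KERNELS = {
    "stream_triad": (2, 1),  # A[i] = B[i] + a * C[i]
    "likwid_triad": (3, 1),  # A[i] = B[i] + C[i] * D[i]
}


def counted_fraction(loads: int, stores: int, write_allocate: bool) -> Fraction:
    """Share of the actual memory traffic that the benchmark counts.

    With write-allocate stores, each store also triggers a read of the
    destination cache line that the benchmark does not count.
    Non-temporal stores (write_allocate=False) avoid that extra read.
    """
    counted = loads + stores
    actual = counted + (stores if write_allocate else 0)
    return Fraction(counted, actual)


if __name__ == "__main__":
    for name, (loads, stores) in KERNELS.items():
        regular = counted_fraction(loads, stores, write_allocate=True)
        non_temporal = counted_fraction(loads, stores, write_allocate=False)
        print(f"{name}: regular stores {regular}, non-temporal stores {non_temporal}")
```

                  Under these assumptions the STREAM-style kernel reports 3/4 of the real traffic with regular stores and the likwid "triad" kernel reports 4/5, while the non-temporal variants report all of it.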
                  Last edited by fairydreaming; 24 November 2024, 04:37 AM.


                  • dkokron
                    Junior Member
                    • May 2021
                    • 19

                    #19
                    Originally posted by fairydreaming:

                    Thanks for this information, very helpful. Based on what you said, the highest possible STREAM TRIAD benchmark result:
                    • for Epyc Turin would be 75% * 576 GB/s = 432 GB/s
                    • for Epyc Genoa would be 75% * 460.8 GB/s = 345.6 GB/s.
                    But, for example, in the Fujitsu Server PRIMERGY Performance Report for RX1440 M2 servers (Epyc Genoa) we can observe values close to 400 GB/s. Link to report here: https://sp.ts.fujitsu.com/dmsp/Publi...0-m2-ww-en.pdf

                    Any idea how that is possible?
                    On my 2-socket 7402 system, STREAM triad gives the best results when run with 1 process per CCX. Dell has a nice write-up of their findings at https://www.dell.com/support/kbdoc/e...pc-performance
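                    For reference, the peak figures in the quoted estimate can be reproduced from channel count and transfer rate. This is a sketch, assuming 12 channels of DDR5-6000 for Turin and 12 channels of DDR5-4800 for Genoa, 8 bytes (64 data bits) per transfer per channel, and a single socket. The function names are hypothetical, and the 8-channel row is added only because it is the other configuration in the thread title.

```python
def peak_bandwidth_gb_s(channels: int, mt_per_s: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes).

    channels * MT/s * bytes gives MB/s; dividing by 1000 gives GB/s.
    """
    return channels * mt_per_s * bytes_per_transfer / 1000


def triad_ceiling_gb_s(peak_gb_s: float, counted_fraction: float = 0.75) -> float:
    """Highest reportable TRIAD figure if only `counted_fraction` of the
    real traffic is counted (75% in the quoted estimate)."""
    return peak_gb_s * counted_fraction


if __name__ == "__main__":
    configs = [
        ("Turin, 12 x DDR5-6000", 12, 6000),
        ("Turin, 8 x DDR5-6000", 8, 6000),
        ("Genoa, 12 x DDR5-4800", 12, 4800),
    ]
    for name, channels, speed in configs:
        peak = peak_bandwidth_gb_s(channels, speed)
        ceiling = triad_ceiling_gb_s(peak)
        print(f"{name}: peak {peak:.1f} GB/s, 75% ceiling {ceiling:.1f} GB/s")
```

                    A reported value above the 75% ceiling suggests that the run avoided the extra reads, for example by using non-temporal stores, rather than that it exceeded the hardware peak.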


                    • coder
                      Senior Member
                      • Nov 2014
                      • 8863

                      #20
                      Originally posted by dkokron:
                      On my 2-socket 7402 system, STREAM triad gives the best results when run with 1 process per CCX. Dell has a nice write-up of their findings at https://www.dell.com/support/kbdoc/e...pc-performance
                      From what I recall, AMD reduced the NUMA effects in subsequent generations of EPYC, which would reduce the benefit of running in NPS4 mode.

                      As for the one-thread-per-CCX, I wonder if that helps simply by making the accesses more linear.
