
AMD Ryzen 9 3900X Linux Memory Scaling Performance


  • #11
    Originally posted by BillBroadley View Post

    Yes, CAS 14 @ 1.6 GHz has lower latency than CAS 16 @ 1.8 GHz, but that only matters for codes that are bottlenecked on memory latency. Codes that are bandwidth-limited will run better on the higher-bandwidth memory.
    So, games can still benefit from using 3600 MHz / CAS 16 rather than 3200 MHz / CAS 14? Games are usually RAM-bandwidth sensitive.

    Comment


    • #12
      Whoa! What happened to the Apache 3800MHz benchmarks?

      Comment


      • #13
        You can see how proud they are of Zen 2: they shipped the full review kit out to Michael, including the Trident Z Royal DIMMs.

        Comment


        • #14
          Originally posted by shmerl View Post

          So, games can still benefit from using 3600 MHz / CAS 16 rather than 3200 MHz / CAS 14? Games are usually RAM-bandwidth sensitive.
          I'd imagine that would be highly game-dependent. Between uploading textures to the GPU, running the AI for any machine-controlled units/players, tracking who can see what, enforcing rules, physics, and complex audio, I find it a bit surprising that games would be bound by bandwidth more than by latency.

          Comment


          • #15
            Originally posted by profoundWHALE View Post
            Whoa! What happened to the Apache 3800MHz benchmarks?

            I believe that benchmark does significant CPU<->CPU communication, and running the RAM that fast halves the interconnect speed.

            Comment


            • #16
              Originally posted by microcode View Post
              You can see how proud they are of Zen 2: they shipped the full review kit out to Michael, including the Trident Z Royal DIMMs.
              I hope they do the same next year with the laptops/notebooks.

              Comment


              • #17
                Originally posted by existensil View Post
                ... The regression isn't related to increased latency in memory above that speed, but is instead caused by the dramatic reduction in the speed of the infinity fabric above DDR4-3600. From Anandtech:

                The big reduction in performance with Apache Siege leads me to believe that benchmark involves a heavy amount of cross-core communication and is saturating the infinity fabric, especially when it's nearly halved in speed.
                Slight clarification, since the AnandTech article is wrong: the 2:1 DDR-to-Infinity-Fabric clock divider kicks in above DDR4-3733. Technically that's still "above 3600", but Michael correctly points out in his article that 3733 would still be a high-performance option even though it wasn't a tested config.

                I believe you are spot on with the Apache benchmark. Given the way Apache spawns workers, it's quite possible that benchmark is both spilling threads to the other CCD and thrashing the parent thread's cache back on the other die.

                The original source of the AMD RAM and Fabric upgrades is here: https://www.techpowerup.com/img/dF4sjxFh6HNk7GXn.jpg


                Interestingly, @BeardedHardware on YouTube (https://www.youtube.com/channel/UCHc...zq231nAS5zFmsw) did a lot of testing yesterday with DDR4 overclocking and the fabric speed on a Ryzen 7 3700X. At least with the MSI X570 board he had, there was a "disable gear down" option that let him overclock the Infinity Fabric to 1900 MHz, therefore running 1:1 with DDR4-3800. But it sounds like stability falls off a cliff right around there, and some chips' silicon lottery might only manage a 1800 or 1866 MHz fabric clock. With his overclocks he was seeing 64 ns RAM latency with DDR4-3800 CL15 and DDR4-3733 CL14.
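
                As a toy model of that divider behavior (my own sketch, not from AMD's documentation; it assumes FCLK simply tracks MEMCLK 1:1 up to DDR4-3733 and halves above that, and it ignores manual FCLK overrides like the one BeardedHardware used):

                #include <stdio.h>

                /* MEMCLK is half the DDR4 transfer rate; assume the Infinity Fabric
                   clock (FCLK) runs 1:1 with MEMCLK up to DDR4-3733 and 2:1 above. */
                static double fabric_clock_mhz(double ddr_rate) {
                    double memclk = ddr_rate / 2.0;   /* e.g. DDR4-3600 -> 1800 MHz */
                    return ddr_rate <= 3733.0 ? memclk : memclk / 2.0;
                }

                int main(void) {
                    const double rates[] = { 3200, 3600, 3733, 3800, 4000 };
                    for (unsigned i = 0; i < sizeof rates / sizeof rates[0]; i++)
                        printf("DDR4-%.0f -> FCLK %.1f MHz\n",
                               rates[i], fabric_clock_mhz(rates[i]));
                    return 0;
                }

                Under those assumptions DDR4-3800 drops the fabric to 950 MHz, which is why forcing the 1:1 ratio at 1900 MHz is such a big deal.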

                Comment


                • #18
                  Originally posted by shmerl View Post
                  So you used the same memory kit, running it at different frequencies? I don't think this is very informative, since different RAM can also have different latency at different frequencies.

                  E.g., let's say you have 3200 MHz dual-channel RAM with 14 CAS latency, and a 3600 MHz one with 16 CAS.

                  So timing is usually calculated as CL / (single-channel frequency in MHz) * 1000, giving ns.

                  I.e.:

                  14 / 1600 * 1000 = 8.75 ns.
                  16 / 1800 * 1000 ≈ 8.89 ns.

                  So if I understand it correctly, 3200 MHz RAM with 14 CAS latency should perform better than a 3600 MHz one with 16 CAS. Though I've never tested that, it would be interesting to confirm.
                  I played around a lot with memory overclocking, and in my experience, when the latencies are roughly as close as in your example, the higher speed always wins in benchmarks. I guess what's happening is that once the transfer starts, the faster memory starts catching up to the lower-latency one.

                  One clock cycle at 1600 MHz is 0.625 ns and at 1800 MHz is 0.556 ns. This would then make the 16/1800 example catch up to the 14/1600 one if a second clock cycle is needed to satisfy whatever the CPU wanted from the memory.

                  I don't know how often it happens that the CPU is satisfied after just one clock cycle and how often it wants more from the memory. One clock cycle means 16 bytes of data, so that's already quite a bit.
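
                  Here's a quick sketch of that arithmetic in C (just the formulas already in this thread, nothing more):

                  #include <stdio.h>

                  /* Latency in ns: (CAS cycles + extra cycles) / memory clock in MHz * 1000.
                     The memory clock is half the DDR4 transfer rate. */
                  static double latency_ns(int cl, double clock_mhz, int extra_cycles) {
                      return (cl + extra_cycles) * 1000.0 / clock_mhz;
                  }

                  int main(void) {
                      for (int extra = 0; extra <= 2; extra++) {
                          printf("after %d extra cycle(s): "
                                 "CL14 @ 1600 MHz = %.2f ns, CL16 @ 1800 MHz = %.2f ns\n",
                                 extra,
                                 latency_ns(14, 1600, extra),
                                 latency_ns(16, 1800, extra));
                      }
                      return 0;
                  }

                  That prints 8.75 vs 8.89 ns for the first word, 9.38 vs 9.44 ns after one extra cycle, and a dead heat at 10.00 ns after two, beyond which the 3600 MHz kit pulls ahead.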

                  About using the same memory kit at different speeds: the motherboard is likely scaling the timings from the XMP profile. You don't need different kits for experimenting; the board's BIOS will do exactly the calculation you did and turn the CL 14 timing at 3200 MHz into a CL 16 timing at 3600 MHz. EDIT: I checked again, and my current motherboard is not doing what I describe here; it only scales tRFC and tREFI when I change the speed. I'm probably misremembering how the old motherboard behaved where I got this idea about the board scaling the timings automatically.
                  Last edited by Ropid; 10 July 2019, 09:46 PM.

                  Comment


                  • #19
                    To get the full picture we need to know whether the cache line length is still 64 bytes or something else. If you use only 1 byte out of each fetch, the penalty is worse and latency matters more.

                    Maybe someone can write a strided-loop benchmark, something like the sketch below. While at it, it could be fun to check whether the larger L3 is the main reason for the real performance gain or whether it's pure IPC gain.
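
                    A minimal strided-read sketch as a starting point (my own toy code, not from the thread; the 256 MiB buffer and the strides are arbitrary picks, and a serious version would pin the thread and defeat the hardware prefetchers):

                    #include <stdio.h>
                    #include <stdlib.h>
                    #include <time.h>

                    #define BUF_BYTES (256u * 1024u * 1024u)  /* well past any L3 */

                    int main(void) {
                        unsigned char *buf = malloc(BUF_BYTES);
                        if (!buf) return 1;
                        for (size_t i = 0; i < BUF_BYTES; i++)
                            buf[i] = (unsigned char)i;        /* fault the pages in */

                        /* Read one byte per stride; once the stride passes the cache
                           line size, every access pays for a full line fetch. */
                        for (size_t stride = 16; stride <= 256; stride *= 2) {
                            struct timespec t0, t1;
                            unsigned long sum = 0;
                            clock_gettime(CLOCK_MONOTONIC, &t0);
                            for (size_t i = 0; i < BUF_BYTES; i += stride)
                                sum += buf[i];
                            clock_gettime(CLOCK_MONOTONIC, &t1);
                            double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                                      + (t1.tv_nsec - t0.tv_nsec);
                            printf("stride %3zu: %.2f ns/access (sum %lu)\n",
                                   stride, ns / (double)(BUF_BYTES / stride), sum);
                        }
                        free(buf);
                        return 0;
                    }

                    The stride at which ns/access stops growing hints at the line size, and rerunning with a buffer that fits inside the 3900X's 64 MB L3 versus one that doesn't would say something about how much of Zen 2's gain is cache versus pure IPC.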

                    Comment


                    • #20
                      So the goal is DDR4-3733 at the lowest CAS latency available. That, or a DDR4-5100 Ryzen kit that doesn't exist yet.

                      Comment
