AVX-512 Performance With 256-bit vs. 512-bit Data Path For AMD EPYC 9005 CPUs


  • #11
    Originally posted by Teggs View Post

    I don't think this is what you meant, but I hope we don't literally hear it. Actively cooled RAM is not on my wish list.
    We might not hear it, but we might be able to see it from space. MRDIMM is essentially another layer of multiplexing DDR5 chips, right? That's a lot of chips, and a lot of density. I wouldn't bet against active cooling, TBH, which is already in place for the server target market anyway.

    At some point pretty soon it's going to be untenable to have sticks of RAM in servers. They already take up a lot of server motherboard real estate, and the signalling issues get harder to overcome as more channels are added. Unless motherboards go 3D, with daughterboards dedicated to RAM sticks *shudder*.



    • #12
      Originally posted by drakonas777 View Post
      I have not seen any solid memory scaling AVX512 benchmarks which would confirm this theory that ZEN5 AVX512 is limited by memory bandwidth in any meaningful margin. Someone wrote this speculation after ZEN4->ZEN5 AVX512 comparison and AFAIK it somehow became a "given truth" without any decent research.
      It's always a balance between compute and bandwidth. They'll have balanced it to cover the vast majority of workloads as best they can, but it's trivial to create a workload that maxes out memory bandwidth.

      A good example is Prime95 LL/PRP testing, which is pretty much always memory bandwidth limited. The best CPUs for this workload (bang for buck) normally aren't at the top end but halfway down the product stack. Technically the top end can be slightly faster, but that's mostly down to the extra cache the extra cores bring, which slightly reduces the memory bandwidth required for a given chunk of work. Things like the two-chiplet Zen parts are a special case where the top end is still relevant: they don't just add an incremental amount of cache, they add a bucketload.
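      A quick way to see that balance is a roofline-style estimate: whether a kernel is bandwidth-bound or compute-bound depends on its arithmetic intensity. A minimal sketch, with illustrative (not measured) figures:

```python
# Roofline-style sketch of the compute-vs-bandwidth balance.
# All figures below are illustrative assumptions, not measurements.

def attainable_gflops(peak_gflops, mem_bw_gbs, flops_per_byte):
    """Attainable throughput = min(compute roof, bandwidth * intensity)."""
    return min(peak_gflops, mem_bw_gbs * flops_per_byte)

peak = 5000.0  # GFLOP/s, assumed peak for a many-core AVX-512 part
bw = 576.0     # GB/s, assumed for 12 channels of DDR5-6000

# FFT-style pass over data far larger than cache: few flops per byte
fft_like = attainable_gflops(peak, bw, flops_per_byte=2.0)
# Cache-blocked matrix multiply: heavy operand reuse
gemm_like = attainable_gflops(peak, bw, flops_per_byte=50.0)

print(f"FFT-like: {fft_like} GFLOP/s, GEMM-like: {gemm_like} GFLOP/s")
```

      Prime95's LL/PRP transforms sit near the low-intensity end of that model, which is why extra cores stop helping once the bandwidth roof is hit.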



      • #13
        I'm not seeing any tests where disabling half the FPU helps, so why does this BIOS option exist?



        • #14
          Originally posted by yump View Post
          I'm not seeing any tests where disabling half the FPU helps, so why does this BIOS option exist?
          I don't know the actual reason, but it seems handy for testing purposes, to "emulate" Zen 5c. It might be in preparation for AVX10.256 and AVX10.512 somehow. Or there might be a binned Zen 5 (non-c) part down the line that disables the half with defects and forces this option.



          • #15
            Originally posted by yump View Post
            I'm not seeing any tests where disabling half the FPU helps, so why does this BIOS option exist?
            Good question. The setting isn't even mentioned in AMD's performance tuning guide for the 9005 series, except for DPDK, where they mention it should be disabled if you configured/compiled DPDK for AVX-256. So you might get a slight performance gain in that particular case, even though compiling for 512-bit gives a larger boost (some people might be stuck with 256-bit for some reason).
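            For reference, DPDK can also cap its own vector path at runtime (the EAL option below has existed since DPDK 20.11; the testpmd invocation and core/channel counts are purely illustrative):

```shell
# Illustrative only: cap DPDK's SIMD width at 256 bits at runtime,
# to match a build configured for AVX2/AVX-256 code paths.
dpdk-testpmd -l 0-3 -n 4 --force-max-simd-bitwidth=256 -- -i
```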



            • #16
              Originally posted by drakonas777 View Post
              I have not seen any solid memory scaling AVX512 benchmarks which would confirm this theory that ZEN5 AVX512 is limited by memory bandwidth in any meaningful margin. Someone wrote this speculation after ZEN4->ZEN5 AVX512 comparison and AFAIK it somehow became a "given truth" without any decent research.

              An example of a program that is limited by memory bandwidth on Zen 5, so that it cannot exploit the enhanced computational throughput of 512-bit AVX-512, is y-cruncher.

              See the explanation at:




              The same must be true for all benchmarks where Granite Rapids with faster MRDIMMs beats Turin, despite having slower CPU cores.

              For any program where most of the data used in AVX or AVX-512 computations does not come from the L1 or L2 cache, the throughput of the computational units cannot be reached. To stay occupied, 512-bit functional units need a higher fraction of their operands to come from L1/L2 than 256-bit functional units with half the throughput do.
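              To put that operand-locality point in numbers, here is a toy model with assumed (not measured) per-core bandwidth figures:

```python
# Toy model of the argument above: if the functional units demand operands
# faster than DRAM can supply them, the shortfall must come from L1/L2.
# All bandwidth figures are illustrative assumptions.

def min_cache_fraction(unit_demand_gbs, dram_supply_gbs):
    """Smallest fraction f of operand bytes that must hit in L1/L2 so
    that (1 - f) * demand <= DRAM supply (i.e., no stalls on memory)."""
    return max(0.0, 1.0 - dram_supply_gbs / unit_demand_gbs)

dram = 4.0         # GB/s of DRAM bandwidth per core (assumed)
demand_256 = 16.0  # GB/s operand demand of 256-bit units (assumed)
demand_512 = 32.0  # doubled demand at full 512-bit width

print(min_cache_fraction(demand_256, dram))  # 0.75
print(min_cache_fraction(demand_512, dram))  # 0.875
```

              In this toy model, doubling the datapath width raises the required L1/L2 hit fraction from 75% to 87.5%, which is the sense in which 512-bit units are harder to keep fed.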
              Last edited by AdrianBc; 12 October 2024, 11:55 AM.



              • #17
                The SVT-AV1 benchmarks perfectly illustrate why I say these high core count CPUs from AMD are a scam.

                A 4K Bosphorus encode at preset 3 runs at 13-14 fps; the source is 4K 120 fps, meaning a top-of-the-line 128C/256T EPYC is epically slow, encoding video at about 1/8 real time.

                Meanwhile I can buy a cheap $100 Intel Arc video card, put it into a cheap low-end computer, and encode in real time with much lower power consumption.

                What an EPIC failure.



                • #18
                  Originally posted by AdrianBc View Post


                  An example of a program that is limited by memory bandwidth on Zen 5, so that it cannot exploit the enhanced computational throughput of 512-bit AVX-512, is y-cruncher.

                  See the explanation at:




                  The same must be true for all benchmarks where Granite Rapids with faster MRDIMMs beats Turin, despite having slower CPU cores.

                  For any program where most of the data used in AVX or AVX-512 computations does not come from the L1 or L2 cache, the throughput of the computational units cannot be reached. To stay occupied, 512-bit functional units need a higher fraction of their operands to come from L1/L2 than 256-bit functional units with half the throughput do.
                  It's mostly a theoretical analysis. In theory everything is limited by BW (more precisely, by RAM), because RAM is always going to be slower than some theoretical memory model with register-level latency and BW. What I meant in my previous comment is an average over a wide range of real-world AVX-512 workloads, and the impact of BW on the AVX-512 performance delta, after isolating variables such as frequency drops. If, say, you get an average of 1-5% better performance by increasing BW by 20-30% (or more), that's not much of a practical BW limitation in my eyes, despite some outliers.

                  I'm not categorically opposed to this AVX-512 BW limitation idea, but I'd like to see more empirical data on its practical ramifications, that's all. I can sense a notion forming that Zen 5's AVX-512 is somewhat "semi-broken" because of a BW limitation, when in reality we lack data from practical use cases, and we especially lack data comparing Zen 5 to competing platforms on this point, where it may well be no more "problematic" than its competitors.
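                  The kind of test I mean can be framed as a simple sensitivity ratio, sketched here with the same hypothetical numbers as above:

```python
# Sketch of the bandwidth-scaling test described above: hold frequency
# fixed, scale memory bandwidth, compare relative gains. The numbers
# are hypothetical, as in the comment, not measurements.

def bw_sensitivity(perf_gain_pct, bw_gain_pct):
    """~1.0 means fully bandwidth-bound; ~0.0 means insensitive to BW."""
    return perf_gain_pct / bw_gain_pct

# Hypothetical case from the comment: +3% perf from +25% bandwidth
print(bw_sensitivity(3.0, 25.0))  # 0.12 -> mostly compute-bound
```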
                  Last edited by drakonas777; 13 October 2024, 08:29 AM.



                  • #19
                    Originally posted by sophisticles View Post
                    The SVT-AV1 benchmarks perfectly illustrate why I say these high core count CPUs from AMD are a scam.

                    A 4K Bosphorus encode at preset 3 runs at 13-14 fps; the source is 4K 120 fps, meaning a top-of-the-line 128C/256T EPYC is epically slow, encoding video at about 1/8 real time.

                    Meanwhile I can buy a cheap $100 Intel Arc video card, put it into a cheap low-end computer, and encode in real time with much lower power consumption.

                    What an EPIC failure.
                    Did you consider writing comments in WCCFtech comment section? I think people there are far more compatible with your mentality.



                    • #20
                      Originally posted by Teggs View Post
                      I don't think this is what you meant, but I hope we don't literally hear it. Actively cooled RAM is not on my wish list.
                      Yeah. FWIW, only the interface runs at 17600 MT/s. The actual DRAM chips are clocked at only half that.

                      Originally posted by geerge View Post
                      MRDIMM is essentially another layer of multiplexing DDR5 chips, right? That's a lot of chips, and a lot of density. I wouldn't bet against active cooling, TBH, which is already in place for the server target market anyway.
                      With 2 DIMMs per channel basically going away due to crowding by ever-larger CPU sockets, there might not actually be more DRAM chips than we had in the 2DPC era.

                      Also, keep in mind that we've already seen quad-ranked RDIMMs.

                      Originally posted by geerge View Post
                      At some point pretty soon it's going to be untenable to have sticks of RAM in servers. They already take up a lot of server motherboard real estate, and the signalling issues get harder to overcome as more channels are added. Unless motherboards go 3D, with daughterboards dedicated to RAM sticks *shudder*.
                      CXL will allow more flexibility in form factor. It also supports switching and pooling.
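                      For scale, the rates above work out as follows (a back-of-envelope sketch; the 17600 MT/s figure is from this thread, the rest is standard 8-byte-channel arithmetic):

```python
# Back-of-envelope for the MRDIMM rates mentioned above: the host
# interface runs at 17600 MT/s while the multiplexed DDR5 devices
# behind it run at half that rate.

def channel_bw_gbs(mt_per_s, bus_bytes=8):
    """Peak bandwidth of one 64-bit (8-byte) memory channel, decimal GB/s."""
    return mt_per_s * bus_bytes / 1000.0

host_side = channel_bw_gbs(17600)  # 140.8 GB/s seen by the CPU
dram_side = channel_bw_gbs(8800)   # 70.4 GB/s per half-rate DDR5 side
print(host_side, dram_side)
```

                      The mux lets two half-rate DDR5 sides together fill the 140.8 GB/s host interface, which is why the DRAM dies themselves don't need to run at the full 17600 MT/s.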
                      Last edited by coder; 14 October 2024, 01:26 AM.

