AMD 4th Gen EPYC 9654 "Genoa" AVX-512 Performance Analysis

  • AMD 4th Gen EPYC 9654 "Genoa" AVX-512 Performance Analysis

    Phoronix: AMD 4th Gen EPYC 9654 "Genoa" AVX-512 Performance Analysis

    AMD 4th Gen EPYC has shown a significant generational uplift on Linux and dominates the current Xeon Scalable "Ice Lake" competition. That result comes from a combination of twelve channels of DDR5 system memory, up to 96 cores per socket, the introduction of AVX-512, and other Zen 4 micro-architectural improvements. As a follow-up to the Genoa data delivered thus far, over the weeks ahead I have additional benchmark results to share that look more closely at these different areas of improvement for AMD 4th Gen EPYC. Today's article looks at EPYC 9654 2P performance with AVX-512 on and off, along with the impact on CPU power consumption, CPU clock frequencies, and thermals.


  • #2
    This is one of those things where Linus has proven to be wrong to disapprove of something. In most cases, the compute performance outpaces the increase in power consumption. It's an all-around win, and in some cases, the performance difference is pretty significant.
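The "performance outpaces power" claim can be made concrete with a small calculation. The following is only a sketch: the function name is made up for illustration, and the numbers are hypothetical placeholders, not measurements from the article.

```python
def perf_per_watt_change(perf_off, perf_on, watts_off, watts_on):
    """Compare a run with AVX-512 off against a run with it on.

    Returns (speedup, power_ratio, efficiency_ratio). An efficiency_ratio
    above 1.0 means performance grew faster than power draw.
    """
    speedup = perf_on / perf_off
    power_ratio = watts_on / watts_off
    return speedup, power_ratio, speedup / power_ratio


# Hypothetical placeholder values (NOT results from the article):
# throughput in arbitrary units, average CPU power in watts.
speedup, power_ratio, efficiency = perf_per_watt_change(
    perf_off=100.0, perf_on=150.0, watts_off=300.0, watts_on=330.0
)
print(f"speedup {speedup:.2f}x, power {power_ratio:.2f}x, perf/W {efficiency:.2f}x")
```

With real numbers, a perf/W ratio above 1.0 for a given benchmark is what "an all-around win" would look like.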



    • #3
      I am keen to see benchmarks of the Sapphire Rapids AVX-512 implementation in comparison to this in January.



      • #4
        Originally posted by spiral_23
        I am keen to see benchmarks of the Sapphire Rapids AVX-512 implementation in comparison to this in January.
        When is Sapphire Rapids due?



        • #5
          Originally posted by schmidtbag
          This is one of those things where Linus has proven to be wrong to disapprove of something. In most cases, the compute performance outpaces the increase in power consumption. It's an all-around win, and in some cases, the performance difference is pretty significant.
          I wouldn't say Linus is wrong.
          It's hardly as simple as throwing more instructions and wider data handling at any problem.
          And adding more instructions to an endless instruction space like CISC adds even more fragmentation and problems.
          Both Intel and AMD are inventing instructions left and right (mostly because CISC can support this type of behavior).

          Intel's implementation caused a lot of grief in a lot of situations.
          It's not even guaranteed to be present everywhere. Sure, AVX512 can forgo some frontend handling for the same data.
          But depending on what tasks you do, lugging around this type of complexity can be a bad tradeoff.

          I think this mostly boils down to the balanced tradeoffs made in AMD's implementation.
          So, congrats to AMD?
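Since AVX-512 is not guaranteed to be present everywhere, software usually has to check for it at runtime. A minimal sketch of such a check on x86 Linux, assuming the kernel reports CPU features on the `flags` line of `/proc/cpuinfo` (names such as `avx512f` and `avx512vl`); the helper names here are hypothetical:

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def avx512_subsets(cpuinfo_text):
    """List the AVX-512 subsets (avx512f, avx512bw, ...) the CPU reports."""
    return sorted(f for f in cpu_flags(cpuinfo_text) if f.startswith("avx512"))


if __name__ == "__main__":
    try:
        # Assumption: x86 Linux, where /proc/cpuinfo has a "flags" line.
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""
    found = avx512_subsets(text)
    print("AVX-512 subsets:", ", ".join(found) if found else "none reported")
```

The result is a list rather than a yes/no because AVX-512 is split into subsets, and different CPUs report different combinations of them.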



          • #6
            Originally posted by schmidtbag
            This is one of those things where Linus has proven to be wrong to disapprove of something. In most cases, the compute performance outpaces the increase in power consumption. It's an all-around win, and in some cases, the performance difference is pretty significant.
            Sorry, but Linus Torvalds was absolutely right about AVX-512 back when he talked about it. Please take a look at this extract of the original comment where Linus wishes AVX-512 a painful death:
            And AVX512 has real downsides. I'd much rather see that transistor budget used on other things that are much more relevant. Even if it's still FP math (in the GPU, rather than AVX512). Or just give me more cores (with good single-thread performance, but without the garbage like AVX512) like AMD did.

            I want my power limits to be reached with regular integer code, not with some AVX512 power virus that takes away top frequency (because people ended up using it for memcpy!) and takes away cores (because those useless garbage units take up space).
            Linus is arguing that AVX-512, which at the time was only implemented on Intel CPUs, takes a lot of space (dark silicon problem), takes a frequency hit and takes a power limit hit. He also says that the transistor budget assigned to AVX-512 could be used for something better.

            More than two years later, all of his claims are still absolutely true and real problems, but only on Intel CPUs. AMD has done a much better implementation of AVX-512 with its double-pumped design that does not take away top frequency, nor cores (up to 128 on Bergamo) and it does not take much space.

            Sometimes he is wrong, but not this time.



            • #7
              Originally posted by pete910

              When is Sapphire Rapids due?


              January 10th, according to Intel. The Xeons will have two AVX-512 units per core, as far as I could gather from the marketing material, so I want to see how they compete with the AMD EPYC.



              • #8
                I have not been really impressed with AVX-512. Sure, you get some (major) uplift in some workloads, but most of those are still better done on either a GPU or, better, an AI accelerator. Most consumer workloads don't touch it, and the enterprise can pay for better solutions.

                Comment


                • #9
                  On Linus: Intel's other mistake was not standardizing AVX512, or even trying to proliferate it by putting it in "narrow" 256 bit designs like AMD, and instead using it as a tool for product segmentation. He was right about that segmentation issue at the time.

                  avx512_uarchs.png
                  (Note that this image is from 2019)
                  Last edited by brucethemoose; 19 December 2022, 01:43 PM.



                  • #10
                    Originally posted by milkylainen

                    I wouldn't say Linus is wrong.
                    It's hardly as simple as throwing more instructions and wider data handling at any problem.
                    And adding more instructions to an endless instruction space like CISC adds even more fragmentation and problems.
                    Both Intel and AMD are inventing instructions left and right (mostly because CISC can support this type of behavior).

                    Intel's implementation caused a lot of grief in a lot of situations.
                    It's not even guaranteed to be present everywhere. Sure, AVX512 can forgo some frontend handling for the same data.
                    But depending on what tasks you do, lugging around this type of complexity can be a bad tradeoff.

                    I think this mostly boils down to the balanced tradeoffs made in AMD's implementation.
                    So, congrats to AMD?
                    Context does indeed matter. No statements should be evaluated outside of the context in which they are delivered.
                    Both Intel and AMD are inventing instructions left and right (mostly because CISC can support this type of behavior).
                    There's nothing stopping RISC-oriented designs from doing the same thing, and they have been doing so for decades now. The RISC vs. CISC debate is obsolete at this point.
