AVX / AVX2 / AVX-512 Performance + Power On Intel Rocket Lake


  • #31
    Originally posted by lucasbekker View Post
    AVX-512 is primarily aimed at software that has to perform a LOT of similar mathematical operations on large amounts of data. These kinds of programs mostly fall into two categories:
    You forgot crypto.

    Originally posted by lucasbekker View Post
    It is unfortunate that AVX-512 is getting a bad reputation because of these kinds of benchmarks, because if you are making use of AVX-512 in the intended way, the performance benefits can be HUGE.
    AVX-512 can be a disaster for performance! Note what Michael said about compilers now defaulting the vector width to 256 bits, to try and limit clock-throttling.

    Here's the worst-case scenario, for AVX-512: https://blog.cloudflare.com/on-the-d...uency-scaling/

    Intel screwed themselves by getting ahead of what the process technology could support. Just like they did with AVX2, except worse. Dropping the CPU from a base clock of 2.1 GHz to 1.4 is just not forgivable! Especially when you're just executing 512-bit instructions for a small % of the time!

    The only time AVX-512 is a net win (performance-wise, not even to speak of perf/W) is when your workload is using it very heavily. That's why Torvalds was complaining about the possibility of some idiot using it for memcpy().
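
To put rough numbers on that claim, here is a toy model in C. It is only a sketch under loud assumptions that are not from the linked article: the whole run executes at the throttled clock (worst case), the vectorizable part speeds up by a fixed factor in cycles, and nothing else changes. The function names `breakeven_vector_fraction` and `relative_time` are made up for illustration.

```c
#include <assert.h>

/* Hypothetical toy model, not measured data.
 * clock_ratio: throttled clock / normal clock (e.g. 1.4 / 2.1).
 * speedup:     assumed cycle-count speedup of the vectorizable part when
 *              it moves to 512-bit instructions (e.g. 2.0).
 * Returns the fraction of baseline run time that must be vectorizable
 * before the throttled run is no slower than the baseline. A result
 * above 1.0 means this model never breaks even. */
double breakeven_vector_fraction(double clock_ratio, double speedup)
{
    assert(clock_ratio > 0.0 && clock_ratio <= 1.0);
    assert(speedup > 1.0);
    /* Relative time with AVX-512 is (1 - p + p / speedup) / clock_ratio.
     * Setting that equal to 1 and solving for p gives: */
    return (1.0 - clock_ratio) / (1.0 - 1.0 / speedup);
}

/* Relative run time (baseline = 1.0) for a vectorizable fraction p. */
double relative_time(double p, double clock_ratio, double speedup)
{
    return (1.0 - p + p / speedup) / clock_ratio;
}
```

Plugging in the 2.1 GHz to 1.4 GHz drop mentioned above and an assumed 2x speedup, `breakeven_vector_fraction(1.4 / 2.1, 2.0)` comes out at about 0.67, i.e. roughly two-thirds of the baseline run time would have to be vectorizable just to break even in this model.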

    Now, my hope and expectation is for Ice Lake SP (and maybe even Rocket Lake) to exercise more care and less latency around clock-speed adjustments, so that it wouldn't be a liability. However, I have yet to see good data on whether Intel managed to effectively mitigate the performance pitfalls of moderate AVX-512 usage, in Ice Lake (or Rocket Lake).


    • #32
      Originally posted by willmore View Post
      I disagree. Linus probably doesn't give a crap about the ugliness of the instruction set, etc. He is concerned with ...
      Ugh. Why do people try to decide for themselves what he doesn't like about it? Is it really that hard to google "torvalds AVX-512" and find his actual statement?


      • #33
        Originally posted by jayN View Post
        Intel's avx512 extensions are coming from the server chip world. They've targeted networking and AI processing applications with the projection that both will become ubiquitous.
        AVX-512 is many things to many people. That's part of the reason they split it up into numerous different subsets.

        https://en.wikipedia.org/wiki/AVX-512

        However, the "Foundation" set includes 64-bit floating-point. This is just like SSE2 and original AVX, and not at all useful for either AI or "networking". In fact, the first Intel product with AVX-512 was their HPC-oriented Xeon Phi (2nd Gen), which was aimed at competing with GPU compute accelerators. Ironically, even though Phi failed and gave way to Intel building true GPU cards, we're still stuck with AVX-512.

        Really, I think AVX-512 is just Intel being lazy. They had success with MMX, SSE1/2/3/4.1/4.2, SSSE3, and AVX1/2. So, it was almost a foregone conclusion that they'd try the same move, yet again. However, Intel's ambition was its own downfall. The fragmentation and power/clock-throttling issues have done a lot to undermine this generation of vector extensions.

        FWIW, I think ARM's SVE is a much more elegant approach. Yet, with AMX, Intel looks to be doubling down and going all the way to 8192-bit tile registers! Yeah, for real.
        Last edited by coder; 08 April 2021, 08:50 AM.


        • #34
          Originally posted by torsionbar28 View Post
          yet nobody has any examples of this usage benefiting from AVX-512.
          Did you actually look at the benchmarks? It's not entirely an exercise in futility!

          Originally posted by torsionbar28 View Post
          AVX-512 feels like the CPU instruction equivalent of an herbal supplement, with promises of increased vitality, improved clarity, and stronger constitution. Not FDA approved. Not intended to treat or cure any disease. Consult your doctor before taking. Results not guaranteed. Advertisement contains paid actors. Batteries not included. Void where prohibited. Not for sale in ME, TX, CA, NY, or NJ.
          Cute.


          • #35
            Originally posted by ms178 View Post
            But I guess developers need more time to make use of these and adoption of AVX-512 is still slow because next to no processor with any meaningful market penetrations supports it yet.
            The bigger issue is that it has significant downsides if you're only using it modestly. Check out the links I already posted, plus the part in the article where Michael had to override the default vector width, due to compilers "fixing" performance regressions by reverting to 256 bits.
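
For anyone wondering what overriding the default vector width looks like in practice, here is a minimal sketch in C. The `saxpy` loop and the file name are hypothetical, and the exact flags used for the article's benchmarks are not stated here; `-mprefer-vector-width=` is the GCC/Clang option that controls this preference.

```c
#include <stddef.h>

/* Plain C loop that GCC or Clang can auto-vectorize. Whether it becomes
 * 256-bit (ymm) or 512-bit (zmm) code depends on the compiler flags,
 * not on anything in the source. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Example builds (file name is hypothetical):
 *   cc -O3 -march=skylake-avx512 -c saxpy.c
 *   cc -O3 -march=skylake-avx512 -mprefer-vector-width=512 -c saxpy.c
 * Comparing the generated assembly (-S) for ymm versus zmm registers
 * shows which width the compiler actually picked. */
```

The point of the sketch is that the source stays the same; only the build flags decide whether the 512-bit path (and its clock behaviour) gets exercised.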

            Originally posted by ms178 View Post
            AVX2 also took a long time to become important,
            AVX2 didn't have nearly the same level of issues. Even so, you can read about a certain amount of consternation among Haswell users about AVX2-induced clock-throttling.

            Originally posted by ms178 View Post
            With the new x86 feature levels AVX-512 support is a requirement for v4, so you will need it sooner or later if you want to be compatible with the latest feature level.
            Ha! Even that doesn't contain most of the subsets. That's just Skylake SP-level.
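
As a rough illustration of how small that required set is, the sketch below checks only the five AVX-512 subsets that the x86-64-v4 level asks for (F, BW, CD, DQ, VL). It assumes GCC or Clang on x86, uses the `__builtin_cpu_supports` builtin, and the function name `has_v4_avx512_subsets` is made up for this example. It does not check the non-AVX-512 parts of the feature level.

```c
#include <stdbool.h>

/* Returns true when the running CPU reports the five AVX-512 subsets
 * required by x86-64-v4. Newer subsets (VNNI, VBMI, and so on) are
 * deliberately not checked, since v4 does not require them.
 * On non-x86 targets or other compilers this sketch just returns false. */
bool has_v4_avx512_subsets(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f")
        && __builtin_cpu_supports("avx512bw")
        && __builtin_cpu_supports("avx512cd")
        && __builtin_cpu_supports("avx512dq")
        && __builtin_cpu_supports("avx512vl");
#else
    return false;
#endif
}
```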

            I'm not going to worry about having an AVX-512-capable CPU for quite a while, if ever. Intel's latest low-power cores don't even support it. Those still don't even support regular AVX!


            • #36
              Originally posted by torsionbar28 View Post
              So it's completely useless for 99.999% of today's workloads, client or server, and the few pieces of software that theoretically could use it, haven't been written yet. Got it.
              You could still use it down inside libraries like BLAS and MKL, without higher-level code having to know anything about it. That's how we get the benefit of crypto acceleration features and even things like TSX, for instance.
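
A minimal sketch of that idea in C, assuming GCC or Clang on x86-64: the library exposes one function and picks an implementation at run time, so callers never see AVX-512. This is not how BLAS or MKL are actually implemented; `lib_sum`, `sum_avx512`, and `sum_scalar` are hypothetical names used only for illustration.

```c
#include <stddef.h>

#if defined(__x86_64__) && defined(__GNUC__)
#define HAVE_X86_DISPATCH 1
#include <immintrin.h>

/* Compiled for AVX-512F regardless of the global -march setting. */
__attribute__((target("avx512f")))
static float sum_avx512(const float *x, size_t n)
{
    __m512 acc = _mm512_setzero_ps();
    size_t i = 0;
    for (; i + 16 <= n; i += 16)            /* 16 floats per 512-bit vector */
        acc = _mm512_add_ps(acc, _mm512_loadu_ps(x + i));
    float total = _mm512_reduce_add_ps(acc);
    for (; i < n; ++i)                      /* scalar tail */
        total += x[i];
    return total;
}
#endif

/* Portable fallback used when AVX-512F is not available. */
static float sum_scalar(const float *x, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += x[i];
    return total;
}

/* Public entry point: the caller never needs to know which path ran. */
float lib_sum(const float *x, size_t n)
{
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f"))
        return sum_avx512(x, n);
#endif
    return sum_scalar(x, n);
}
```

Note that a dispatch like this is exactly where the caveat below bites: a short call into the 512-bit path can be enough to change the core's clock behaviour for the surrounding code.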

              The huge caveat is that using even a small amount of AVX-512 on Skylake SP and Cascade Lake CPUs is enough to trigger massive clock-throttling that will kill your overall application performance (unless you're using it quite heavily). So, that blows a huge crater in the normal adoption pipeline for new CPU instructions.

              The world is going to be wary of AVX-512 for quite some time. And, by the time it's not, maybe ARM will finally be having its day in the sun.


              • #37
                Originally posted by carewolf View Post
                You can also do inline-assembler with clang/gcc. That way you avoid having to worry about registers. I find intrinsics better for SSE/AVX though.
                Been there, done that. Not relevant to my (main) point, which was about C compiler options not affecting stuff written in pure ASM.

                Originally posted by carewolf View Post
                For NEON, however, sometimes using inline assembler is better, because NEON has some weird instructions that operate on groups of 2-4 adjacent registers, and the intrinsics just have a tendency to not end up in an optimal form in machine code (you can end up with a move for each register before the instructions and a move for each after, instead of just fixing the allocated registers to some that are adjacent).
                That's a shame. Unless I really needed every last ounce of performance, I'd be tempted to stick with the intrinsics and hope the compilers eventually catch up.

                That's weird, besides. There are plenty of other examples where CPUs (or GPUs) operate on register pairs or otherwise constrained/fixed registers.


                • #38
                  Originally posted by coder View Post
                  You forgot crypto.


                  AVX-512 can be a disaster for performance! Note what Michael said about compilers now defaulting the vector width to 256 bits, to try and limit clock-throttling.

                  Here's the worst-case scenario, for AVX-512: https://blog.cloudflare.com/on-the-d...uency-scaling/

                  Intel screwed themselves by getting ahead of what the process technology could support. Just like they did with AVX2, except worse. Dropping the CPU from a base clock of 2.1 GHz to 1.4 is just not forgivable! Especially when you're just executing 512-bit instructions for a small % of the time!

                  The only time AVX-512 is a net win (performance-wise, not even to speak of perf/W) is when your workload is using it very heavily. That's why Torvalds was complaining about the possibility of some idiot using it for memcpy().

                  Now, my hope and expectation is for Ice Lake SP (and maybe even Rocket Lake) to exercise more care and less latency around clock-speed adjustments, so that it wouldn't be a liability. However, I have yet to see good data on whether Intel managed to effectively mitigate the performance pitfalls of moderate AVX-512 usage, in Ice Lake (or Rocket Lake).
                  It is even worse. Intel had been adding AVX versions because they weren't able to use GPGPU better. There is no reason for AVX-512 when you can have GPGPU for that job. AMD had the right idea with Bulldozer and Fusion (RIP), trying to decouple some math from the CPU cores with the intention of potentially moving it to a GPU. Sadly, AMD at the time didn't have the marketing power to force such a paradigm shift, Intel didn't want it because their GPU tech sucked, and Nvidia didn't want it because they didn't have x86 CPU cores (they did experiment with ARM cores though).

                  How much better would the x86 world be today if x86 cores were stripped of all this SIMD bullshit and these functions were moved to GPU cores? Perhaps with an emulation layer for legacy code. I bet the CPU clocks could be raised higher, and the TDPs would be lower for them. Perhaps this would push memory technology like HBM to mature faster and be cheaper.

                  In any case, now that Intel is investing heavily in GPUs, I can see them pushing for Fusion in the future, ironically.


                  • #39
                    Originally posted by coder View Post
                    The bigger issue is that it has significant downsides if you're only using it modestly. Check out the links I already posted, plus the part in the article where Michael had to override the default vector width, due to compilers "fixing" performance regressions by reverting to 256 bits.


                    AVX2 didn't have nearly the same level of issues. Even so, you can read about a certain amount of consternation among Haswell users about AVX2-induced clock-throttling.


                    Ha! Even that doesn't contain most of the subsets. That's just Skylake SP-level.

                    I'm not going to worry about having an AVX-512-capable CPU for quite a while, if ever. Intel's latest low-power cores don't even support it. Those still don't even support regular AVX!
                    I am on Haswell-EP with 12C/24T, with a -80/-70/-70 mV undervolt, a -0 AVX2 offset, and the Turbo Boost unlock applied. I see some downclocking in heavy AVX2 workloads when using all 24 threads, but the clock still stays above base, at around 3.1 GHz. Cinebench R23 multi-core performance is 11,200 after a single round and 10,800 after ten minutes, which is a tad below the 11600K, for a price of around 75 EUR; that is what I call a "value offering". I also agree that Intel's AVX-512 implementation is far from perfect; let's wait and see if AMD can do it better in Zen 4. But I am sure these implementation issues with downclocking will fade over time when Intel finally moves on to newer nodes. And even if AVX-512 might not be relevant today, it will be the day after tomorrow, and if you want to keep your CPU for more than five years, that could be of importance further down the road. At least we have seen the rise of ISA relevance with the introduction of the feature levels; long-term CPU users should keep that in mind (which would mean better long-term value for Zen 4 compared to Zen 3, so it may be worth the wait for some).
                    Last edited by ms178; 08 April 2021, 06:51 AM.


                    • #40
                      Originally posted by TemplarGR View Post
                      It is even worse. Intel had been adding AVX versions because they weren't able to use GPGPU better. There is no reason for AVX-512 when you can have GPGPU for that job. AMD had the right idea with Bulldozer and Fusion (RIP), trying to decouple some math from the CPU cores with the intention of potentially moving it to a GPU. Sadly, AMD at the time didn't have the marketing power to force such a paradigm shift, Intel didn't want it because their GPU tech sucked, and Nvidia didn't want it because they didn't have x86 CPU cores (they did experiment with ARM cores though).
                      I know this isn't a popular opinion, but I think Intel's iGPU is actually better-suited to GPGPU than most. It has more, narrower cores that are a lot more CPU-like than most modern GPUs. They also had more double-precision compute than you usually find in consumer GPUs, but Intel started dialing that back in Gen11. Even their vector capabilities are a lot more like SSE, with horizontal operations you don't normally find in GPUs.

                      I think the main problem is that their iGPU group lacked political clout and was relegated to cheap graphics until more recently. Not having them in servers, where compute is at a premium, meant that Intel was always going to reach for something like AVX-512.

                      Originally posted by TemplarGR View Post
                      How much better would the x86 world be today if x86 cores were stripped of all this SIMD bullshit and these functions were moved to GPU cores? Perhaps with an emulation layer for legacy code. I bet the CPU clocks could be raised higher, and the TDPs would be lower for them. Perhaps this would push memory technology like HBM to mature faster and be cheaper.
                      I think SSE should definitely stay. For scalar arithmetic, it's a net win over x87 math. And some basic degree of vector arithmetic capability in the CPU cores is definitely something you want. I could even be persuaded that having 4-element fp64 vectors via AVX is something we need.

                      But I tend to agree that all the packed-int stuff in AVX2 is maybe getting to the point where you'd rather be using a DSP/GPU-like architecture. As for true SIMD programming, GPUs just do that so much better.

                      BTW, I don't know if you saw my comments about AMX, but Intel looks to be doubling/tripling-down on special-purpose CPU units/instructions!
                      Last edited by coder; 08 April 2021, 08:09 AM.
