Intel Contributes AVX-512 Optimizations To Numpy, Yields Massive Speedups

  • coder
    Senior Member
    replied
    Originally posted by smitty3268 View Post
    If you're running a compute application, that's not a problem.
    If you're running a game, though, do you want the physics engine using AVX-512 while slowing down the rest of the game code? Or is it better to keep everything off AVX and trust that it can keep up?
    And what about a random Gnome app that happens to hit an AVX path in the GTK code that's just copying memory around?
    Agreed, but let's be clear about the distinction between AVX/AVX2 and AVX-512. On Skylake and newer CPUs, there isn't much of a penalty for AVX/AVX2, and certainly nothing on the scale of the AVX-512 penalty.

    Even Haswell wasn't as bad with AVX2 as Skylake SP was with AVX-512. You'd take a hit of a few hundred MHz, but nothing like what I observed with AVX-512.

  • coder
    Senior Member
    replied
    Originally posted by ezekiel68 View Post
    I've seen a good bit of FUD in these forum comments about latency and "massive downclocking" for AVX-512. Here are some facts:
    According to specs, a Xeon W-2104 CPU (released in 2017) running at non-AVX turbo can hit 3.6 GHz (the advertised turbo speed for the chip), AVX/AVX2 turbo could hit 2.8 GHz, and AVX-512 would turbo at 2.4 GHz. That's a 33.333...% decrease in turbo clock speed on a core of this Skylake-architecture chip when it is running AVX-512. Seems massive. BUT recall that vector operations are running on multiple values (the "MD" part of "SIMD") at a time.
    I have firsthand experience with a Xeon Silver Skylake SP, with a base clock of 2.1 GHz and a turbo of 3.0 GHz. Running an AVX-512-heavy workload, it downclocked as low as 1.3 GHz on heavily loaded cores. Temperature monitoring showed the temps were absolutely fine; this was entirely down to AVX-512.

    Once all of the code was recompiled to use only AVX2, the chip stayed at or above its base clock of 2.1 GHz, and aggregate throughput increased substantially. Although the code was AVX-512-heavy, it wasn't all AVX-512, so AVX-512's doubled width over AVX2 (and its other ISA enhancements) couldn't make up for the loss in clock speed on everything else.
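    A rough way to see why, in code: a back-of-envelope Amdahl-style model of a mixed workload under downclocking. All numbers here are hypothetical, picked only to show the shape of the tradeoff, not measurements:

        # Back-of-envelope model: relative runtime of a mixed workload.
        # All numbers are hypothetical; real penalties depend on the chip.
        def runtime(vector_fraction, vector_speedup, clock_penalty):
            """Relative runtime vs. an AVX2-only baseline at full clock.

            vector_fraction: share of the work that actually vectorizes
            vector_speedup:  per-clock speedup of the vector path
                             (e.g. 2x for AVX-512 over AVX2 on the same loop)
            clock_penalty:   clock multiplier applied to *everything*
                             while the wide vectors are active
            """
            scalar_time = (1 - vector_fraction) / clock_penalty
            vector_time = vector_fraction / (vector_speedup * clock_penalty)
            return scalar_time + vector_time

        # 70% vectorizable, AVX-512 twice as fast per clock as AVX2,
        # but the whole core drops from 2.1 GHz to 1.3 GHz:
        print(runtime(0.70, 2.0, 1.3 / 2.1))  # ~1.05 (slower overall)
        print(runtime(0.70, 1.0, 1.0))        # 1.00, AVX2 at base clock

    The vector loops themselves get faster, but the other 30% of the code eats the full clock penalty, and that's enough to lose the race.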

    Oh, and the workload in question involved deep learning in PyTorch and Intel's own OpenVINO framework. It wasn't 100% deep learning, but probably around 60-75%.

    Originally posted by ezekiel68 View Post
    Let's illustrate this
    No amount of theoretical examples can outweigh real world experience.

    I'll grant that if the code running on the CPU is something like 90% highly vectorizable computation, then AVX-512 will yield a net improvement over AVX2. The problems come when there's enough AVX-512 to trigger significant downclocking, but not enough to yield much benefit from its additional width. Here's another case in point:

    https://blog.cloudflare.com/on-the-d...uency-scaling/

    Originally posted by ezekiel68 View Post
    this float or integer example creates a 16:1 advantage of AVX-512 operations over normal instructions
    And OMG, why are you comparing AVX-512 with scalar code? Try reading about AVX2, please.

    Originally posted by ezekiel68 View Post
    So much for the impact of massive downclocking.
    I didn't even see much mention of downclocking in this thread, before you decided to make a big post about it.

    Originally posted by ezekiel68 View Post
    The Ice Lake Xeons do not exhibit this 33% level of downclocking.
    Yes, I'm aware of that, at least. Unfortunately, Ice Lake SP only started shipping to the general public a few months ago, while millions of Skylake SP and Cascade Lake CPUs will remain in service for several years to come.

    I do agree that it shouldn't be as much of an issue going forward. However, Intel jumped the gun by putting AVX-512 in their mainstream CPUs. It's a bit like the AVX2-based clock throttling people hit on Haswell, but even worse. I think AMD is coming to it at just the right time.

    Originally posted by ezekiel68 View Post
    On the desktop side, the Rocket Lake chips such as the i9-11700K don't downclock at all.
    This is only if you're using a gaming motherboard that disables the power limits, and if you have copious cooling. The reason those chips can afford not to downclock is that they aren't constrained by the strict power limits that server CPUs must obey. Furthermore, I'm pretty sure Rocket Lake has only one AVX-512 FMA port per core, as was mentioned above. The following data shows just what a space heater that little "125 W" CPU can become (it can sustain nearly twice that dissipation)!

    https://www.anandtech.com/show/16495...1700k-11600k/5

    Originally posted by ezekiel68 View Post
    I hope this information will clear up some misconceptions about the behavior of AVX-512 and perhaps give pause to any who might be tempted to parrot outdated tropes on the subject.
    Ironically, it just gave me an opportunity to reiterate my firsthand experience with the subject and correct your correction. Thanks, I guess?

  • smitty3268
    Senior Member
    replied
    Originally posted by ezekiel68 View Post
    I've seen a good bit of FUD in these forum comments about latency and "massive downclocking" for AVX-512. Here are some facts:
    According to specs, a Xeon W-2104 CPU (released in 2017) running at non-AVX turbo can hit 3.6 GHz (the advertised turbo speed for the chip), AVX/AVX2 turbo could hit 2.8 GHz, and AVX-512 would turbo at 2.4 GHz. That's a 33.333...% decrease in turbo clock speed on a core of this Skylake-architecture chip when it is running AVX-512. Seems massive. BUT recall that vector operations are running on multiple values (the "MD" part of "SIMD") at a time.
    The reason people hate the downclocking isn't that they think their AVX-512 compute workload will be slower than it would be without it; the benchmarks are very clear about the kinds of speedups it provides.

    The reason it's got a bad rap is that non-compute workloads can trigger it too: a random system library your app happens to call can contain a few AVX-512 instructions, which then slow down everything else running on your computer.

    If you're running a compute application, that's not a problem.
    If you're running a game, though, do you want the physics engine using AVX-512 while slowing down the rest of the game code? Or is it better to keep everything off AVX and trust that it can keep up?
    And what about a random Gnome app that happens to hit an AVX path in the GTK code that's just copying memory around?

    Anyway, I think that was more of an issue with the old 14 nm process; the 10 nm processors are doing a better job. (As I see you noted, too.)
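    FWIW, the NumPy dispatcher this article is about at least makes this inspectable. A rough sketch, assuming NumPy >= 1.20 with the runtime CPU dispatch infrastructure (__cpu_features__ is an internal attribute, so no promises it stays put):

        # Show which SIMD features NumPy detected on this machine.
        # __cpu_features__ is internal to NumPy's dispatch machinery
        # (NumPy >= 1.20); treat it as informational only.
        from numpy.core._multiarray_umath import __cpu_features__

        avx512 = {k: v for k, v in __cpu_features__.items() if "AVX512" in k}
        print(avx512)  # e.g. {'AVX512F': True, 'AVX512_SKX': True, ...}

    And IIRC there's an NPY_DISABLE_CPU_FEATURES environment variable, so something like NPY_DISABLE_CPU_FEATURES="AVX512F AVX512_SKX" should push NumPy back onto the AVX2 paths without a rebuild, if the downclocking worries you.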
    Last edited by smitty3268; 13 October 2021, 08:02 PM.

  • ezekiel68
    Junior Member
    replied
    I've seen a good bit of FUD in these forum comments about latency and "massive downclocking" for AVX-512. Here are some facts:
    According to specs, a Xeon W-2104 CPU (released in 2017) running at non-AVX turbo can hit 3.6 GHz (the advertised turbo speed for the chip), AVX/AVX2 turbo could hit 2.8 GHz, and AVX-512 would turbo at 2.4 GHz. That's a 33.333...% decrease in turbo clock speed on a core of this Skylake-architecture chip when it is running AVX-512. Seems massive. BUT recall that vector operations are running on multiple values (the "MD" part of "SIMD") at a time.

    Let's illustrate this -- but also keep the math simple by imagining the AVX-512 turbo penalty were 50% per core, rather than the 33% it actually is. Say you've got a core with regular instructions coming in to multiply two numbers together. Normally it can do this at up to 3.6 GHz. If you realize you have quite a few numbers to multiply, you could instead set up a loop over vectors/arrays and do it via AVX-512. You might be tempted to feel sad that you can only perform the vector multiplications at a lower clock speed. Yet if those numbers are 32-bit floats or integers (yes, AVX-512 works on integers too), each AVX-512 register packs 16 pairs of factors at a time. There is some additional latency (measured at a max of 30 microseconds) as the core transitions to vector instructions, but this float or integer example creates a 16:1 advantage of AVX-512 operations over normal instructions. So -- with our fictitious 50% speed penalty, we could expect something in the ballpark of an 8X speedup for multiplying the numbers in our two arrays with AVX-512, even with the clock-speed penalty.
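    The same arithmetic in code (a toy model only; it ignores memory bandwidth, and the clocks are the W-2104 figures above with my exaggerated 50% penalty):

        # Toy throughput model: 32-bit multiplies per second, ignoring memory.
        scalar_clock = 3.6e9   # non-AVX turbo, one multiply per cycle
        avx512_clock = 1.8e9   # pretend 50% AVX-512 penalty (really ~33%)
        lanes = 512 // 32      # 16 float32 (or int32) elements per register

        speedup = (avx512_clock * lanes) / (scalar_clock * 1)
        print(speedup)  # -> 8.0, the ballpark figure above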

    So much for the impact of massive downclocking.

    But what isn't widely known yet (judging by the comments I see in this forum) is that the above is old news. The Ice Lake Xeons do not exhibit this 33% level of downclocking. If one core is using AVX-512, that core's turbo speed only lowers from 3.7 GHz to 3.6 GHz. If four cores are using AVX-512, those four cores' turbo speeds only lower from 3.7 GHz to 3.3 GHz. On the desktop side, the Rocket Lake chips such as the i9-11700K don't downclock at all. So -- I hope this information will clear up some misconceptions about the behavior of AVX-512 and perhaps give pause to any who might be tempted to parrot outdated tropes on the subject.

    Two great articles (by the same researcher) with tons of data on the topic:
    AVX-512 behavior on older chips: https://travisdowns.github.io/blog/2.../avxfreq1.html
    AVX-512 behavior on newer chips: https://travisdowns.github.io/blog/2...x512-freq.html
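    If you'd rather check your own chip than take anyone's word for it, here's a quick-and-dirty sketch: hammer a vectorized NumPy op while sampling the core clock from sysfs. Linux-only; scaling_cur_freq is only a rough proxy for the true AVX frequency, so treat the output as indicative:

        # Watch a core's clock while NumPy runs heavily vectorized FP work.
        import threading, time
        import numpy as np

        def sample_freq(stop, cpu=0):
            # cpufreq reports kHz; divide by 1e6 to get GHz.
            path = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_cur_freq"
            while not stop.is_set():
                with open(path) as f:
                    print(int(f.read()) / 1e6, "GHz")
                time.sleep(0.5)

        stop = threading.Event()
        t = threading.Thread(target=sample_freq, args=(stop,))
        t.start()

        a = np.random.rand(10_000_000).astype(np.float32)
        for _ in range(200):          # vectorized math; hits AVX-512 paths
            a = np.sqrt(a * a + 1.0)  # on builds dispatched for it

        stop.set()
        t.join()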
    Last edited by ezekiel68; 13 October 2021, 05:23 PM.

  • ezekiel68
    Junior Member
    replied
    No Xeon Scalable chips (or HEDT, desktop, or mobile) have "two AVX-512" per core. In the Xeon Scalable line, the older-generation Bronze and Silver chips (and a few older Golds) had only one port per core for AVX-512 FMA (fused multiply-add) operations, whereas most of the older Golds and all Platinums had 2 FMA ports per core. HOWEVER, all Ice Lake Scalable Xeons have 2 FMA ports per core. The HEDT chips launched in 2019 (Cascade Lake-X) have 2 ports per core. I'm pretty sure (but not certain) that cores in the Rocket Lake desktop chips (which do have AVX-512) have only one port per core, and the same goes for all mobile chips with AVX-512.
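    The port count matters because it's a direct multiplier on peak throughput. Back-of-envelope math (a sketch; the 2.4 GHz is just the AVX-512 turbo figure quoted elsewhere in this thread):

        # Peak single-precision GFLOPS per core from AVX-512 FMA ports.
        # An FMA counts as 2 FLOPs; 512 bits hold 16 float32 lanes.
        def peak_gflops(fma_ports, clock_ghz, lanes=16):
            return fma_ports * lanes * 2 * clock_ghz

        print(peak_gflops(1, 2.4))  # one-port core:   76.8 GFLOPS
        print(peak_gflops(2, 2.4))  # two-port core:  153.6 GFLOPS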

  • RedEyed
    Senior Member
    replied
    Originally posted by mdedetrich View Post

    Very few hyperscale/high-end servers use Intel + AVX-512 for compute, because for what you're paying it's extremely expensive. In those cases all compute is offloaded onto GPUs or accelerators.

    The best case for AVX-512 is actually AI-related tasks on a local machine or workstation; think of what Apple did recently with the iPhone, offloading some speech recognition onto the phone rather than into the cloud.
    Yes, local machine or workstation, that's the case.

  • mdedetrich
    Senior Member
    replied
    Originally posted by RedEyed View Post

    Those who buy hardware for compute buy Intel + NVIDIA, rather than AMD + NVIDIA
    Very few hyperscale/high-end servers use Intel + AVX-512 for compute, because for what you're paying it's extremely expensive. In those cases all compute is offloaded onto GPUs or accelerators.

    The best case for AVX-512 is actually AI-related tasks on a local machine or workstation; think of what Apple did recently with the iPhone, offloading some speech recognition onto the phone rather than into the cloud.
    Last edited by mdedetrich; 13 October 2021, 06:28 AM.

  • coder
    Senior Member
    replied
    Originally posted by keivan View Post
    AMD Epyc Zen 3 has two AVX2 units (256-bit, so "AVX-256" in a sense) per core, for those who did not know.
    This is a little simplistic. Zen 3 has 2 AVX multiply/FMA units, but it also has 2 dedicated AVX adders, as well as an FP store unit and an FP store/int-convert unit. So the theoretical GFLOPS per core is probably higher than the dual FMAs alone would suggest (rough math below the link).


    Source: https://www.anandtech.com/show/16214...5700x-tested/3
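    Rough math on that (a sketch: an FMA counts as 2 FLOPs and a plain add as 1, AVX2 registers hold 8 float32 lanes, and whether the adders can actually be kept fed alongside the FMAs depends on the code):

        # Per-cycle single-precision FLOPs for a Zen 3 core, per the
        # AnandTech diagram: 2 FMA pipes + 2 dedicated FADD pipes, 256-bit.
        lanes = 256 // 32        # 8 float32 lanes per AVX2 register
        fma   = 2 * lanes * 2    # two FMA pipes, 2 FLOPs per lane
        add   = 2 * lanes * 1    # two add pipes, 1 FLOP per lane
        print(fma, fma + add)    # 32 FLOPs/cycle FMA-only, 48 with the adders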

    Originally posted by keivan View Post
    Xeon platinum has two avx512 per core. Other Xeon processors have one avx512 per core.
    I thought some Golds also had dual FMAs, in Cascade Lake? Anyway, Ice Lake finally rolled out dual AVX-512 FMAs across the whole server CPU product range. I'm not sure about the Xeon W workstation CPUs, but I assume those all have it, too.

  • coder
    Senior Member
    replied
    Originally posted by evergreen View Post
    You realize this innovation is meant for server chips, like future Sapphire Rapid ? Not for your typical gamer rig ...
    The mainstream CPUs are also sold as E-series Xeons and used in entry-level workstations. So, it's not just "gamer rigs".

    Also, with but a handful of exceptions, laptops do not use the W-series Xeon or HEDT CPUs. They use basically the same mainstream dies as desktops and small servers/workstations.

    Intel has now shipped 2 generations of laptop chips (Ice Lake & Tiger Lake) and one generation of desktop chips (Rocket Lake) with AVX-512. So, there's been every indication they didn't intend it to be a technology reserved exclusively for server & big workstation CPUs.
    Last edited by coder; 12 October 2021, 10:12 PM.

  • Paradigm Shifter
    Senior Member
    replied
    Originally posted by RedEyed View Post
    Unfortunately, those who buy hardware explicitly for computation tasks choose Intel.
    It would be nice if AMD contributed to compute libraries such as numpy, at least.
    At least in the field I work in, whether it's AMD or Intel is immaterial - what matters is price and performance... and how many nVidia GPUs we can cram in.


    Originally posted by numacross View Post
    Don't you mean Nvidia?
    My thoughts exactly. Although I'd love to be able to use AMD cards too.


    Originally posted by RedEyed View Post
    AFAIK, there is runtime dispatch in *BLAS libs which explicitly checks for Intel (rather than for the instruction set)
    Hey, if you can't win honestly, dirty tricks work too, right...?


    Originally posted by numacross View Post
    EPYC is more performant (apart from AVX-512), has more PCIe lanes, was first to market with PCIe 4.0 and is cheaper than Xeon. All the deployments I know of went for AMD, and only a few ended up with Intel because of supply issues.
    Same. The only Intel systems we've purchased recently were chosen explicitly because of AVX-512. If AMD's AVX-512 implementation (if real) can get within 10% of Intel's per-thread performance, we'd be back on a level playing field, because the systems would either cost us half as much for 90% of the performance, or cost the same with double the cores and more PCI-E connectivity.


    Originally posted by jabl View Post
    I don't know whether that's the intention of @clownstown, but that is completely incorrect. Making a fast AND accurate math library is VERY hard. Even more so if you want it vectorized.

    And no, CPUs since the x87 FPU haven't had hardware implementations of math functions (and in many respects, even the x87 HW implementations were/are shit), so library implementations have to build them from numerical algorithms based on +, -, *, /.
    Computers are bad at things which aren't powers of two.

    I still get headaches when I remember having to give a lecture explaining how floating point loses accuracy... and why.

    And yes, doing complex maths accurately is tough.
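    The lecture demo practically writes itself; plain Python will do, since every IEEE 754 double has the same problem:

        # 0.1 has no exact binary representation, so error creeps in:
        print(0.1 + 0.2 == 0.3)                    # False
        print(f"{0.1 + 0.2:.20f}")                 # 0.30000000000000004441
        print(sum(0.1 for _ in range(10)) == 1.0)  # False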


    Originally posted by evergreen View Post
    You realize this innovation is meant for server chips, like future Sapphire Rapid ? Not for your typical gamer rig ...
    And this is why, if the AVX-512 implementation in Zen 4 is a) real, b) not hobbled by the compiler, and c) covers all the various sub-instruction sets, AMD will be of great interest. With the chiplet design, it will probably filter down to desktop chips (or at least to more affordable CPUs).


    Originally posted by Slithery View Post
    But how much more power does it use?
    Enough to clock down hard.
    Last edited by Paradigm Shifter; 12 October 2021, 09:08 PM.
