Intel Contributes AVX-512 Optimizations To Numpy, Yields Massive Speedups


  • #21
    Sorry, with all due respect, I do not believe these results until I see independent benchmarks.

    AMD EPYC Zen 3 has two AVX2 (256-bit) units per core, for those who did not know.

    Xeon Platinum has two AVX-512 units per core. Other Xeon processors have one AVX-512 unit per core.

    Should we compile NumPy ourselves to get these optimizations, or does the version in Anaconda already enable AVX?
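
    For anyone wondering the same thing, below is a quick sketch of how to check what an existing NumPy build will dispatch to at runtime. np.show_config() is the stable entry point; __cpu_features__ is an internal attribute of newer NumPy releases (roughly 1.20 onwards) and its location may change, so treat it as a convenience rather than a guaranteed API.

    # Inspect the installed NumPy build: BLAS backend plus SIMD baseline/dispatch.
    import numpy as np

    np.show_config()  # recent versions also list the supported SIMD extensions here

    try:
        # Internal attribute, may move between releases: a dict of CPU feature -> bool
        from numpy.core._multiarray_umath import __cpu_features__
        print("AVX512F detected:", __cpu_features__.get("AVX512F", False))
    except ImportError:
        print("This NumPy build does not expose __cpu_features__")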

    Comment


    • #22
      It is my understanding that a big chunk of the matrix operations performed by NumPy (SP and DP) are just direct calls to BLAS/LAPACK (often provided by OpenBLAS), and OpenBLAS already supports AVX-512 in many of its routines. So my guess is that the AVX-512 optimizations referenced in the article pertain to matrix-related operations involving half- or quarter-precision floats (boring AI stuff) that BLAS/LAPACK does not provide.
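
      If you want to see which BLAS your NumPy is actually calling into for those matrix operations, one convenient way is the third-party threadpoolctl package; this is just an inspection helper for illustration, not something NumPy itself requires.

      # Run a BLAS-backed matrix product, then report which BLAS library was loaded.
      # Requires: pip install threadpoolctl
      import numpy as np
      from threadpoolctl import threadpool_info

      a = np.random.rand(1000, 1000)
      b = np.random.rand(1000, 1000)
      _ = a @ b  # np.matmul dispatches to the linked BLAS (*gemm), e.g. OpenBLAS or MKL

      for lib in threadpool_info():
          print(lib.get("internal_api"), lib.get("version"), lib.get("filepath"))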

      Comment


      • #23
        Originally posted by RedEyed View Post
        Unfortunately, those who buy hardware explicitly for computation tasks choose Intel.
        It would be nice if AMD contributed to compute libraries such as NumPy, at least.
        At least in the field I work in, whether it's AMD or Intel is immaterial; what matters is price and performance... and how many nVidia GPUs we can cram in.


        Originally posted by numacross View Post
        Don't you mean Nvidia?
        My thoughts exactly. Although I'd love to be able to use AMD cards too.


        Originally posted by RedEyed View Post
        AFAIK, there is runtime dispatch in *BLAS libs which explicitly checks for Intel (rather than for the instruction set).
        Hey, if you can't win honestly, dirty tricks work too, right...?


        Originally posted by numacross View Post
        EPYC is more performant (apart from AVX-512), has more PCIe lanes, was first to market with PCIe 4.0 and is cheaper than Xeon. All the deployments I know of went for AMD, and only a few ended up with Intel because of supply issues.
        Same. The only Intel systems we've purchased recently were chosen specifically for AVX-512. If AMD's AVX-512 implementation (assuming the reports are true) can get within 10% of Intel's per-thread performance, we'd be back on a level playing field: the systems would either cost us half as much for 90% of the performance, or the same for double the cores and more PCIe connectivity.


        Originally posted by jabl View Post
        I don't know whether that's the intention of @clownstown, but that is completely incorrect. Making a fast AND accurate math library is VERY hard. Even more so if you want it vectorized.

        And no, since the x87, FPUs haven't had hardware implementations of math functions (and in many respects even the x87 HW implementations were/are shit), so library implementations have to build them from numerical algorithms based on +, -, *, /.
        Computers are bad at things which aren't powers of two.

        I still get headaches when I remember having to give a lecture explaining how floating point loses accuracy... and why.

        And yes, doing complex maths accurately is tough.
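
        To make the "built from +, -, *, /" point concrete, here is a deliberately crude sketch of an exp() assembled from range reduction plus a short polynomial. It is not any real libm's algorithm (production libraries use carefully derived minimax coefficients and handle rounding, overflow and special values), but it is the general shape of what vectorized math libraries compute per SIMD lane.

        import math

        def exp_sketch(x: float) -> float:
            """Toy exp(x): write x = k*ln(2) + r with small r, so exp(x) = 2**k * exp(r),
            then approximate exp(r) with a short Taylor-style polynomial in Horner form."""
            LN2 = 0.6931471805599453
            k = round(x / LN2)           # nearest integer multiple of ln(2)
            r = x - k * LN2              # reduced argument, |r| <= ~0.35
            p = 1.0 + r * (1.0 + r * (0.5 + r * (1/6 + r * (1/24 + r * (1/120)))))
            return math.ldexp(p, k)      # p * 2**k, exact scaling by a power of two

        print(exp_sketch(1.0), math.exp(1.0))  # close, but not correctly rounded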


        Originally posted by evergreen View Post
        You realize this innovation is meant for server chips, like the future Sapphire Rapids? Not for your typical gamer rig...
        And this is why, if the AVX-512 implementation in Zen 4 is a) real, b) not hobbled by the compiler and c) covers all the various sub-instructions, AMD will be of great interest. With the chiplet design, it will probably filter down to desktop chips (or at least more affordable CPUs).


        Originally posted by Slithery View Post
        But how much more power does it use?
        Enough to clock down hard.
        Last edited by Paradigm Shifter; 12 October 2021, 09:08 PM.

        Comment


        • #24
          Originally posted by evergreen View Post
          You realize this innovation is meant for server chips, like the future Sapphire Rapids? Not for your typical gamer rig...
          The mainstream CPUs are also sold as E-series Xeons and used in entry-level workstations. So, it's not just "gamer rigs".

          Also, with but a handful of exceptions, laptops do not use the W-series Xeon or HEDT CPUs. They use basically the same mainstream dies as desktops and small servers/workstations.

          Intel has now shipped 2 generations of laptop chips (Ice Lake & Tiger Lake) and one generation of desktop chips (Rocket Lake) with AVX-512. So, there's been every indication they didn't intend it to be a technology reserved exclusively for server & big workstation CPUs.
          Last edited by coder; 12 October 2021, 10:12 PM.

          Comment


          • #25
            Originally posted by keivan View Post
             AMD EPYC Zen 3 has two AVX2 (256-bit) units per core, for those who did not know.
             This is a little simplistic. Zen 3 has 2 AVX multiply/FMA units, but it also has 2 dedicated AVX adders, as well as an FP store unit and an FP store/int-convert unit. So the theoretical GFLOPS per core is probably higher than the dual-FMA figure alone would suggest.


            Source: https://www.anandtech.com/show/16214...5700x-tested/3

            Originally posted by keivan View Post
             Xeon Platinum has two AVX-512 units per core. Other Xeon processors have one AVX-512 unit per core.
             I thought some Gold SKUs also had dual FMAs, in Cascade Lake? Anyway, Ice Lake finally rolled out dual AVX-512 FMAs across the whole server CPU product range. I'm not sure about the Xeon-W workstation CPUs, but I assume those all have it, too.
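
             For a rough sense of what one versus two FMA units and 256-bit versus 512-bit vectors mean for peak throughput, here is a back-of-the-envelope calculation. The clock speeds below are placeholders for illustration, not measured figures.

             # Theoretical peak FP64 FLOPS per core:
             #   FMA units * vector lanes * 2 ops per FMA (multiply + add) * clock
             def peak_gflops(fma_units, vector_bits, ghz, elem_bits=64):
                 lanes = vector_bits // elem_bits
                 return fma_units * lanes * 2 * ghz

             print(peak_gflops(2, 256, 3.5))  # 2x 256-bit FMA @ 3.5 GHz -> 56.0 GFLOPS
             print(peak_gflops(2, 512, 3.0))  # 2x 512-bit FMA @ 3.0 GHz -> 96.0 GFLOPS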

            Comment


            • #26
              Originally posted by RedEyed View Post

               Those who buy HW for compute buy Intel + NVIDIA, rather than AMD + NVIDIA.
               Very few hyperscale/high-end servers use Intel + AVX-512 for compute because, for what you are paying, it's extremely expensive. In those cases all the compute is offloaded onto GPUs or accelerators.

               The best case for AVX-512 is actually AI-related tasks on a local machine or workstation; think of what Apple did recently on the iPhone, offloading some speech recognition onto the phone rather than running it in the cloud.
              Last edited by mdedetrich; 13 October 2021, 06:28 AM.

              Comment


              • #27
                Originally posted by mdedetrich View Post

                 Very few hyperscale/high-end servers use Intel + AVX-512 for compute because, for what you are paying, it's extremely expensive. In those cases all the compute is offloaded onto GPUs or accelerators.

                 The best case for AVX-512 is actually AI-related tasks on a local machine or workstation; think of what Apple did recently on the iPhone, offloading some speech recognition onto the phone rather than running it in the cloud.
                 Yes, local machine or workstation, that's the case.

                Comment


                • #28
                   No Xeon Scalable chips (or HEDT, desktop, or mobile) have "two AVX-512" per core as such; the distinction is in FMA ports. In the Xeon Scalable line, the older-generation Bronze and Silver chips (and a few older Gold chips) only had one port per core for AVX-512 FMA (fused multiply-add) operations, whereas most of the older Gold and all Platinum chips had 2 AVX-512 FMA ports per core. HOWEVER, all Ice Lake Scalable Xeon chips have 2 FMA ports per core. The HEDT chips launched in 2019 (Cascade Lake-X) also have 2 ports per core. I'm pretty sure (but not certain) that cores in the Rocket Lake desktop chips (which do have AVX-512) only have one port per core, and the same goes for all mobile chips with AVX-512.

                  Comment


                  • #29
                    I've seen a good bit of FUD in these forum comments about latency and "massive downclocking" for AVX-512. Here are some facts:
                    According to specs, a Xeon W-2104 CPU (released in 2017) running at non-AVX turbo can hit 3.6 GHz (the advertised turbo speed for the chip), AVX/AVX2 turbo could hit 2.8 GHz, and AVX-512 would turbo at 2.4 GHz. That's a 33.333...% decrease in turbo clock speed on a core of this Skylake-architecture chip when it is running AVX-512. Seems massive. BUT recall that vector operations are running on multiple values (the "MD" part of "SIMD") at a time.

                     Let's illustrate this -- but also keep the math simple by imagining the AVX-512 turbo speed penalty were 50% per core, rather than the 33% it actually is. Say you've got a core with regular instructions coming in to multiply two numbers together. Normally it can do this at up to 3.6 GHz. If you realize you have quite a few numbers to multiply, you could instead loop over vectors/arrays and do it via AVX-512. You might be tempted to feel sad that you can only perform the vector multiplications at a lower clock speed. Yet if those numbers are 32-bit floats or integers (yes, AVX-512 works on integers too), this packs 16 pairs of factors into the AVX-512 registers at a time. There is some other latency (measured at a max of 30 microseconds) involved as the core transitions to vector instructions, but this float or integer example creates a 16:1 advantage of AVX-512 operations over normal instructions. So -- with our fictitious 50% speed penalty, we could expect something in the ballpark of an 8x speedup for multiplying the numbers in our two arrays with AVX-512, even with the clock-speed penalty.
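
                     The same arithmetic in a couple of lines, first with the hypothetical 50% penalty from above and then with the actual W-2104 turbo figures. This ignores issue width, memory bandwidth and the transition latency mentioned above; it's purely the width-versus-clock trade-off.

                     # Relative throughput of 16-wide AVX-512 versus scalar, per unit time
                     def speedup(lanes, scalar_ghz, avx512_ghz):
                         return lanes * (avx512_ghz / scalar_ghz)

                     print(speedup(16, 3.6, 1.8))  # fictitious 50% clock penalty -> 8.0x
                     print(speedup(16, 3.6, 2.4))  # actual W-2104 turbo figures  -> ~10.7x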

                    So much for the impact of massive downclocking.

                    But what isn't widely known yet (judging by the comments I see in this forum) is that the above is old news. The Ice Lake Xeons do not exhibit this 33% level of downclocking. If one core is using AVX-512, that core's turbo speed only lowers from 3.7 GHz to 3.6 GHz. If four cores are using AVX-512, those four cores' turbo speeds only lower from 3.7 GHz to 3.3 GHz. On the desktop side, the Rocket Lake chips such as the i9-11700K don't downclock at all. So -- I hope this information will clear up some misconceptions about the behavior of AVX-512 and perhaps give pause to any who might be tempted to parrot outdated tropes on the subject.

                    Two great articles (by the same researcher) with tons of data on the topic:
                    AVX-512 behavior on older chips: https://travisdowns.github.io/blog/2.../avxfreq1.html
                    AVX-512 behavior on newer chips: https://travisdowns.github.io/blog/2...x512-freq.html
                    Last edited by ezekiel68; 13 October 2021, 05:23 PM.

                    Comment


                    • #30
                      Originally posted by ezekiel68 View Post
                      I've seen a good bit of FUD in these forum comments about latency and "massive downclocking" for AVX-512. Here are some facts:
                      According to specs, a Xeon W-2104 CPU (released in 2017) running at non-AVX turbo can hit 3.6 GHz (the advertised turbo speed for the chip), AVX/AVX2 turbo could hit 2.8 GHz, and AVX-512 would turbo at 2.4 GHz. That's a 33.333...% decrease in turbo clock speed on a core of this Skylake-architecture chip when it is running AVX-512. Seems massive. BUT recall that vector operations are running on multiple values (the "MD" part of "SIMD") at a time.
                       The reason people hate the downclocking isn't that they think their AVX-512 compute workload will end up slower than it would be without AVX-512. The benchmarks are very clear about the kinds of speedups it provides.

                       The reason it's got a bad rap is that non-compute workloads, like a random system library your app happens to call, can contain a few AVX-512 instructions, which then slow down everything else running on your machine.

                      If you're running a compute application, that's not a problem.
                      If you're running a game, though, do you want the physics engine using AVX-512 while slowing down the rest of the game code? Or is it better to keep everything off AVX and trust that it can keep up?
                      And what about a random Gnome app that happens to hit an AVX path in the GTK code that's just copying memory around?

                       Anyway, I think that was more of an issue with the old 14 nm process; the 10 nm processors are doing a better job (as I see you noted too).
                      Last edited by smitty3268; 13 October 2021, 08:02 PM.

                      Comment
