AVX / AVX2 / AVX-512 Performance + Power On Intel Rocket Lake


  • #11
    Originally posted by ddriver View Post
    I don't think it is that. The thing is, so far SIMD units have been fairly general-purpose, while Intel is cramming a bunch of highly purpose-specific stuff into AVX-512. This is not a problem with the width of execution or power efficiency, but with the support hell of continually introducing new niche instructions and having no instruction-set and feature uniformity between platforms.
    I disagree. Linus probably doesn't give a crap about the ugliness of the instruction set, etc. He is concerned with what the processor context looks like and how easy or hard it is to keep track of and save/restore on context switches. Like the FPU state in the early days, a lot of work has to go into tracking whether the FPU context needs to be saved/restored on a context switch. Adding even more *huge* registers to this for the AVX-512 extension doesn't sit well with OS people. Context switches are already horribly expensive.
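
    To put rough numbers on the state growth (a back-of-the-envelope sketch only; the real XSAVE area adds headers, padding, and other state components on top of the raw register files):

    avx2_state   = 16 * 32           # 16 YMM registers x 32 bytes (256 bits each)
    avx512_state = 32 * 64 + 8 * 8   # 32 ZMM registers x 64 bytes, plus 8 x 64-bit k-mask registers
    print(avx2_state, avx512_state)  # 512 vs 2112 bytes of vector state per thread

    That is roughly 4x more vector state to save and restore on every context switch.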

    Comment


    • #12
      Originally posted by ddriver View Post

      Intel appears to have given up on improving general-purpose performance and is doing a lot of work to boost purpose-specific tasks, both in hardware and on the software front. This also explains why their CPUs show disproportionately big gains in some corner cases even while being more or less stuck in general.
      Intel's AVX-512 extensions come from the server chip world. They've targeted networking and AI processing applications with the projection that both will become ubiquitous, and they have big customers that want this: for example, Facebook wanting AVX-512 bfloat16 operations for training in its Zion platform.

      I read recently that the Rocket Lake implementation combines its two AVX2 (256-bit) FMA units when running AVX-512 operations. If so, does this explain the small gains between the AVX2 and AVX-512 configurations?

      The Ice Lake Server chips are documented as having two AVX-512 FMA units... even on the 8-core products. It would be interesting to compare those against Rocket Lake on the AI tests to see which benchmarks can benefit.
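
      If the fusion claim is right (I can't confirm it from these benchmarks alone), a quick peak-throughput sanity check shows why the gains would be small:

      # Peak double-precision FLOPs per core per cycle; an FMA counts as 2 FLOPs.
      avx2_peak   = 2 * (256 // 64) * 2   # two 256-bit FMA units -> 16 DP FLOPs/cycle
      avx512_peak = 1 * (512 // 64) * 2   # one fused 512-bit FMA -> 16 DP FLOPs/cycle
      print(avx2_peak, avx512_peak)       # identical peaks

      With identical peaks, any AVX-512 gains would have to come from fewer instructions, masking, or the new operations, not from raw FMA throughput.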

      Comment


      • #13
        Originally posted by lucasbekker View Post
        AVX-512 is primarily aimed at software that has to perform a LOT of similar mathematical operations on large amounts of data. These kinds of programs mostly fall into two categories:
        - Math libraries like openBLAS or MKL, which use intrinsics
        - Custom math kernels, mostly written in CUDA or other SPMD compilers like Intel® Implicit SPMD Program Compiler (ispc.github.io)

        These benchmarks only show that the software being compiled does not fall into either of these categories (except for some of the mining stuff, probably hacky custom kernels...)
        This is the oft-cited answer for what AVX-512 is useful for, yet nobody has any examples of this usage benefiting from AVX-512. Do you have any real-world examples? Actual software products that use it, and specific workloads where there is a demonstrated benefit? These always seem to be missing from AVX-512 discussions. IMO the benefit of AVX-512 seems far more theoretical than practical at this point.

        Originally posted by lucasbekker View Post
        It is unfortunate that AVX-512 is getting a bad reputation because of these kinds of benchmarks, because if you are making use of AVX-512 in the intended way, the performance benefits can be HUGE.
        What is the intended way? Can you quantify the benefits? Where can I go to see this benefit demonstrated?

        Edit: AVX-512 feels like the CPU instruction equivalent of an herbal supplement, with promises of increased vitality, improved clarity, and stronger constitution. Not FDA approved. Not intended to treat or cure any disease. Consult your doctor before taking. Results not guaranteed. Advertisement contains paid actors. Batteries not included. Void where prohibited. Not for sale in ME, TX, CA, NY, or NJ.
        Last edited by torsionbar28; 07 April 2021, 02:02 PM.

        Comment


        • #14
          Originally posted by torsionbar28 View Post
          IMO the benefit of AVX-512 seems far more theoretical than practical at this point. [...] What is the intended way? Can you quantify the benefits? Where can I go to see this benefit demonstrated?
          There are a couple of highly specialized applications where AVX-512 is used already: we have seen Garlicoin here, and AnandTech has some other benchmarks, such as NAMD. So there are benefits when it is properly used. I would partially agree that home users aren't affected that much at this point. But I guess developers need more time to make use of these instructions, and adoption of AVX-512 is still slow because next to no processor with any meaningful market penetration supports it yet. AVX2 also took a long time to become important; history might repeat itself with AVX-512. With the new x86-64 feature levels, AVX-512 support is a requirement for v4 (e.g. GCC's -march=x86-64-v4), so you will need it sooner or later if you want to be compatible with the latest feature level.

          As far as I have read, AVX-512 is also designed to allow for more general-purpose usage (scatter and gather instructions) than earlier vector extensions, as illustrated below. I suggest this blog post by Matt Pharr (who wrote ISPC) on the benefits: https://pharr.org/matt/blog/2018/04/...s-and-ooo.html (and the follow-ups)
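
          As a rough illustration of what gather/scatter means, numpy's fancy indexing expresses the same access pattern that those instructions accelerate (this shows the pattern only, not a claim about what code numpy generates):

          import numpy as np

          x   = np.arange(100.0)
          idx = np.array([3, 17, 42, 99])
          y = x[idx]        # "gather": load from non-contiguous indices in one operation
          x[idx] = y * 2.0  # "scatter": store back to the same scattered indices

          Earlier extensions forced this kind of irregular access back into scalar loops; AVX2 added gather, and AVX-512 adds scatter as well.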
          Last edited by ms178; 07 April 2021, 02:23 PM.

          Comment


          • #15
            Seems the benefit of AVX in general is too small. Really, really small. Unless software is specifically written for it.

            Comment


            • #16
              Except in a few minor cases, AVX-512 appears to be better in almost all the comparisons. In some cases AVX-512 reaches notably higher rates.

              Comment


              • #17
                Originally posted by Azrael5 View Post
                Except in a few minor cases, AVX-512 appears to be better in almost all the comparisons. In some cases AVX-512 reaches notably higher rates.
                Until you look at power efficiency, and then it's horrible.

                Comment


                • #18
                  Originally posted by TemplarGR View Post
                  Seems the benefit of AVX in general is too small. Really, really small. Unless software is specifically written for it.
                  I get 3-4x using AVX2-enabled MKL vs. scalar BLAS on matrix tasks such as eigenvector calculation or matrix multiplication. These calculations are increasingly common across consumer workloads, especially for creators, which is basically where the desktop is going.

                  My Python 2 numpy is linked against a scalar BLAS, whereas my Python 3 numpy is linked against an AVX2-enabled BLAS:

                  🐅tbrowne@RyVe:~$ python2
                  Python 2.7.18 (default, Mar 8 2021, 13:02:45)
                  [GCC 9.3.0] on linux2
                  Type "help", "copyright", "credits" or "license" for more information.
                  >>> import datetime as dt
                  >>> import numpy as np
                  >>> def calc():
                   ...     nn = dt.datetime.utcnow()
                   ...     xx = np.random.rand(1000, 1000)
                   ...     yy = np.linalg.eig(xx)
                   ...     print (dt.datetime.utcnow() - nn)
                  ...
                  >>> calc()
                  0:00:04.335998


                  🐅tbrowne@RyVe:~$ python3
                  Python 3.8.5 (default, Jan 27 2021, 15:41:15)
                  [GCC 9.3.0] on linux
                  Type "help", "copyright", "credits" or "license" for more information.
                  >>> import datetime as dt
                  >>> import numpy as np
                  >>> def calc():
                   ...     nn = dt.datetime.utcnow()
                   ...     xx = np.random.rand(1000, 1000)
                   ...     yy = np.linalg.eig(xx)
                   ...     print(dt.datetime.utcnow() - nn)
                  ...
                  >>> calc()
                  0:00:01.299623


                  This is on a Ryzen 2700X. Intel AVX is even faster.
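
                  If you want to check which BLAS your own numpy build is linked against (and therefore which SIMD path it can take), numpy can print its build configuration:

                  >>> import numpy as np
                  >>> np.show_config()  # lists the BLAS/LAPACK libraries this build links to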
                  Last edited by vegabook; 07 April 2021, 04:01 PM.

                  Comment


                  • #19
                    Originally posted by vegabook View Post
                    This is on a Ryzen 2700X. Intel AVX is even faster.
                    Intel AVX is not faster. It's really not: by default, Intel MKL does not use AVX on AMD CPUs; it falls back to something like SSE2 or SSE4. Quite slow.

                    You need to binary-patch the Intel MKL libraries to bypass this rather arbitrary limitation. See:
                    - https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html
                    - https://www.extremetech.com/computin...eadripper-cpus
                    - http://www.swallowtail.org/naughty-intel.shtml
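
                    For what it's worth, those links also describe an environment-variable workaround for MKL releases up to 2020.0 (Intel removed the variable in later releases, which is why the binary patch exists). It has to be set before MKL is loaded, e.g.:

                    import os
                    # Honored only by MKL <= 2020.0; removed in later releases.
                    os.environ["MKL_DEBUG_CPU_TYPE"] = "5"  # force the AVX2 path on non-Intel CPUs
                    import numpy as np  # the import (and MKL load) must come after the variable is set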

                    Comment


                    • #20
                      Originally posted by TemplarGR View Post
                      Seems the benefit of AVX in general is too small. Really, really small. Unless software is specifically written for it.
                      The performance gains are huge, but they require you to understand certain CPU basics that most developers simply ignore or assume the compiler will handle. So, effectively, 99% of the code you will find sees no benefit, or very little, from a simple recompile, because the code itself is not in a vectorizable state and there is nothing the compiler can do about that without introducing very nasty undefined behavior.

                      Now, if your code is in a vectorizable state, a recompile will show some nice gains (see the sketch below), but for maximum performance specific implementations are best, simply because the developer is the one who understands the code and can go further than the compiler's safe approach.
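
                      A minimal sketch of what "vectorizable state" means, using numpy as pseudocode for the two loop shapes:

                      import numpy as np

                      x = np.random.rand(1_000_000)

                      # Vectorizable: every element is independent, so iterations
                      # map cleanly onto SIMD lanes.
                      y = x * 2.0 + 1.0

                      # Not vectorizable as written: each iteration depends on the
                      # previous one's result, so the lanes cannot run in parallel
                      # without restructuring the algorithm.
                      out = np.empty_like(x)
                      acc = 0.0
                      for i in range(len(x)):
                          acc = acc * 0.5 + x[i]
                          out[i] = acc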

                      Think of it this way: regular C/C++ and co. are akin to OpenGL/DX11, while SIMD C/C++ and co. are akin to Vulkan/DX12.

                      Also, SIMD, regardless of the platform or acronym, does a hell of a lot more than simple math, as claimed in some other posts. Sure, BLAS and other software make use of the "math" side of SIMD, but it also greatly speeds up memory, cache, shifting, comparison, crypto (not only AES-NI does this, by the way), and similar operations, and those can be used in any app/library as long as you understand what you are doing.

                      Comment
