AVX / AVX2 / AVX-512 Performance + Power On Intel Rocket Lake
Originally posted by ddriver View Post
Intel appears to have given up on improving general-purpose performance and is instead doing a lot of work to accelerate specific tasks, both in hardware and on the software front. This also explains why their CPUs show disproportionately big gains in some corner cases even while being more or less stuck in general.
I read recently that the Rocket Lake implementation fuses two AVX2 FMA units when running AVX-512 operations. If so, does that explain the small gains between the AVX2 and AVX-512 configurations?
The Ice Lake Server chips are documented as having two full AVX-512 FMA units... even on the 8-core products. It would be interesting to compare those against Rocket Lake on the AI tests to see which benchmarks can benefit.
Originally posted by lucasbekker View Post
AVX-512 is primarily aimed at software that has to perform a LOT of similar mathematical operations on large amounts of data. These kinds of programs mostly fall into two categories:
- Math libraries like openBLAS or MKL, which use intrinsics
- Custom math kernels, mostly written in CUDA or other SPMD compilers like Intel® Implicit SPMD Program Compiler (ispc.github.io)
These benchmarks only show that the software being compiled does not fall into either of these categories (except for some of the mining stuff, probably hacky custom kernels...)
Originally posted by lucasbekker View Post
It is unfortunate that AVX-512 is getting a bad reputation because of these kinds of benchmarks, because if you are making use of AVX-512 in the intended way, the performance benefits can be HUGE.
Edit: AVX-512 feels like the CPU-instruction equivalent of an herbal supplement, with promises of increased vitality, improved clarity, and stronger constitution. Not FDA approved. Not intended to treat or cure any disease. Consult your doctor before taking. Results not guaranteed. Advertisement contains paid actors. Batteries not included. Void where prohibited. Not for sale in ME, TX, CA, NY, or NJ.
Last edited by torsionbar28; 07 April 2021, 02:02 PM.
Originally posted by torsionbar28 View Post
IMO the benefit of AVX 512 seems far more theoretical than practical at this point. [...] What is the intended way? Can you quantify the benefits? Where can I go to see this benefit demonstrated?
As far as I have read, AVX-512 is also designed to allow for more general-purpose usage (scatter and gather instructions) than former vector extensions. I suggest this blog post from Matt Pharr (who wrote ISPC) on the benefits: https://pharr.org/matt/blog/2018/04/...s-and-ooo.html (and the follow-ups)
Last edited by ms178; 07 April 2021, 02:23 PM.
Originally posted by TemplarGR View Post
Seems the benefit of AVX in general is too small. Really, really, small. Unless specifically written for it.
My Python 2 numpy is linked against a scalar BLAS, whereas my Python 3 numpy is linked against an AVX2-enabled BLAS:
🐅tbrowne@RyVe:~$ python2
Python 2.7.18 (default, Mar 8 2021, 13:02:45)
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import datetime as dt
>>> import numpy as np
>>> def calc():
... nn = dt.datetime.utcnow()
... xx = np.random.rand(1000, 1000)
... yy = np.linalg.eig(xx)
... print (dt.datetime.utcnow() - nn)
...
>>> calc()
0:00:04.335998
🐅tbrowne@RyVe:~$ python3
Python 3.8.5 (default, Jan 27 2021, 15:41:15)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import datetime as dt
>>> import numpy as np
>>> def calc():
... nn = dt.datetime.utcnow()
... xx = np.random.rand(1000, 1000)
... yy = np.linalg.eig(xx)
... print(dt.datetime.utcnow() - nn)
...
>>> calc()
0:00:01.299623
This is on a Ryzen 2700X. Intel AVX is even faster.
Last edited by vegabook; 07 April 2021, 04:01 PM.
Originally posted by vegabook View Post
This is on a Ryzen 2700X. Intel AVX is even faster.
You need to binary patch the Intel MKL binaries to bypass this rather arbitrary limitation (MKL's CPU dispatcher falls back to slow code paths on non-Intel CPUs). See:
- https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html
- https://www.extremetech.com/computin...eadripper-cpus
- http://www.swallowtail.org/naughty-intel.shtml
Originally posted by TemplarGR View Post
Seems the benefit of AVX in general is too small. Really, really, small. Unless specifically written for it.
Now if your code is in a vectorizable state, a recompile will show some nice gains, but for maximum performance, yeah, specific implementations are best, simply because the developer is the one who understands the code and can go further than the compiler's safe approach.
Think of it this way: regular C/C++ and co. are akin to OpenGL/DX11, while SIMD C/C++ and co. are akin to Vulkan/DX12.
Also, SIMD, regardless of the platform/acronym, does a hell of a lot more than simple math, as claimed by some other posts. Sure, BLAS and other software make use of the "math" side of SIMD, but it also greatly speeds up memory, cache, shifting, comparison, crypto (not only AES-NI does this, btw), etc. operations, and those can be used in any app/library as long as you understand what you are doing.