Core i9 7980XE GCC 9 AVX Compiler Tuning Performance Benchmarks


  • willmore
    replied
    Could it be that a lot of this code is using SSE/AVX intrinsics that don't play well with the compiler throwing in 256-bit AVX and AVX-512 vectors, because the data isn't laid out in a way that lets them work ideally?
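    For what it's worth, a rough sketch (with made-up names) of the kind of layout problem being described: an array-of-structs interleaves fields, so consecutive x values aren't adjacent in memory, while a struct-of-arrays keeps them contiguous and trivially vectorizable.
    ```
    /* Hypothetical example of array-of-structs vs struct-of-arrays layout. */
    #include <stddef.h>

    struct particle_aos { double x, y, z; };   /* interleaved: x y z x y z ... */

    struct particles_soa {                     /* contiguous:  x x x ... y y y ... */
        double *x, *y, *z;
    };

    void shift_x_aos(struct particle_aos *p, double dx, size_t n) {
        for (size_t i = 0; i < n; ++i)
            p[i].x += dx;                      /* stride of 3 doubles: awkward for wide loads */
    }

    void shift_x_soa(struct particles_soa *p, double dx, size_t n) {
        for (size_t i = 0; i < n; ++i)
            p->x[i] += dx;                     /* unit stride: easy to vectorize */
    }
    ```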

    Leave a comment:


  • celrod
    replied
    On Hacker News, someone named dragontamer just made a great post about SIMD and GPGPU, here.

    I'd suggest reading all of it, but in particular I want to highlight the problem that folks too often think of SIMD as a way to compute a single answer faster, rather than as a way to evaluate many at a time, as well as:
    My short summary is... the AMD Vega64 is effectively a 16384-wide SIMD unit. (or perhaps... 64 parallel CU clusters of 256-wide SIMD units).
    In my applications, taking advantage of 8-wide SIMD units is easy. 256, on the other hand, is much too wide.

    Leave a comment:


  • edwaleni
    replied
    Originally posted by pegasus View Post
    If you can't get your code ported to OpenMP/OpenCL/CUDA to eventually run on GPUs, take a look at NEC Aurora Tsubasa cards. They're super-wide vector engines built with NEC's decades of vector supercomputer know-how.
    Correct me if I am wrong, but don't Tsubasa vector cards run their own OS and require you to use their offload API to process your data?

    This isn't much different from Intel's Knights Corner (Xeon Phi), where the card ran its own OS and you had to use Intel's API to get data onto the vector engine.

    Intel Movidius, I think, works the same way.

    With OpenCL, don't you just declare or request a device ID? That makes it much more portable, I would think, since you could run your data set on any vector platform that supports it.
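    For reference, a minimal sketch of what that looks like with the plain OpenCL C API (the array sizes here are just illustrative and error handling is omitted): enumerate the platforms, ask each for its device IDs, and the rest of the code works the same whichever device you pick.
    ```
    /* Minimal sketch: enumerate OpenCL devices and pick one by ID.
     * Link with -lOpenCL. */
    #include <CL/cl.h>
    #include <stdio.h>

    int main(void) {
        cl_platform_id platforms[8];
        cl_uint nplat = 0;
        clGetPlatformIDs(8, platforms, &nplat);

        for (cl_uint p = 0; p < nplat; ++p) {
            cl_device_id devices[8];
            cl_uint ndev = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &ndev);

            for (cl_uint d = 0; d < ndev; ++d) {
                char name[256];
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof name, name, NULL);
                printf("platform %u, device %u: %s\n", p, d, name);
                /* A context/queue built on devices[d] works the same way
                 * whether it's a CPU, GPU, or other accelerator. */
            }
        }
        return 0;
    }
    ```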



    Leave a comment:


  • edwaleni
    replied
    Originally posted by pegasus View Post
    Very interesting indeed.

    Are these tests run at a fixed cpu frequency? If not, can the frequency be monitored and displayed with the results as well?
    I was thinking the same thing.

    Charting CPU frequency and temp/power during the scalar operations would be useful.
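    As a rough sketch of how that could be sampled on Linux while a benchmark runs (assuming the cpufreq sysfs interface is present; the file reports kHz):
    ```
    /* Rough sketch: print cpu0's current frequency once per second.
     * Assumes Linux with the cpufreq sysfs interface; stop with Ctrl-C. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        for (;;) {
            FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
            if (!f) return 1;
            long khz = 0;
            if (fscanf(f, "%ld", &khz) == 1)
                printf("%.2f MHz\n", khz / 1000.0);   /* sysfs value is in kHz */
            fclose(f);
            sleep(1);
        }
    }
    ```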

    Leave a comment:


  • celrod
    replied
    I think Bayesian statistics would be a great niche for avx512, but statisticians in general (Bayesian or otherwise) are computer-and-optimization-illiterate; the most popular BLAS library is the reference BLAS you get when downloading R with RStudio, which is 30x slower than OpenBLAS & co.
    Even software written by computer scientists and meant for doing Bayes, like Stan, has its most basic data structure specified in a way that is incompatible with vectorization: like elements are not stored contiguously in arrays.
    In my experience, it isn't too difficult to get anywhere from >4x to >100x better performance.

    Of course, as long as the reference BLAS and Stan are what's popular, hardly anyone will benefit from any vectorization at all.
    But programs running for days or weeks are fairly routine; the need is there, just not the knowledge that better is possible. It is a field that has a lot to gain from better-optimized software, if only its practitioners were as computer-literate as the machine-learning folks.

    Double precision is a must. I'm still a graduate student (I will finish in a couple of months), so for now good double-precision performance on a GPU is out of my budget (maybe I'll get a Radeon VII, which has quarter-rate double precision -- or a Navi successor, if any are similar).
    And I do intend to play around eventually. Typical Bayesian models are nowhere near as clean to vectorize as machine learning.
    A neural network is just a bunch of large matrix multiplications and non-linear transforms applied to large vectors.
    With a Bayesian model, you have a lot of smaller-grained stuff. 256-wide vectors will be overkill for much of it. And I'm worried about memory problems if you try to run too many chains on simulated data sets at once.
    But I have very little experience with GPGPU, so it still seems alien/intimidating/hard to reason about.

    Leave a comment:


  • pegasus
    replied
    Originally posted by celrod View Post
    In many of my workloads, avx512 provides a huge benefit, and I don't know how to get them working on a GPU (the standard avx512 counterargument).
    But benchmarks like these (and ones I ran on the Polyhedron test suite in an earlier Phoronix discussion) suggest I'm in the minority, and that to most, avx512 is wasted silicon. =(
    You are indeed. I've seen most of the math-heavy workloads migrate to GPUs in the past few years, and nobody cares for avx512 these days.
    If you can't get your code ported to OpenMP/OpenCL/CUDA to eventually run on GPUs, take a look at NEC Aurora Tsubasa cards. They're super-wide vector engines built with NEC's decades of vector supercomputer know-how.

    Leave a comment:


  • celrod
    replied
    Originally posted by carewolf View Post

    But note that with AVX512VL support, the compiler can also emit masked operations at 128- and 256-bit widths.

    Edit: Basically, the AVX512VL extension makes it possible for the compiler to replace SSE and AVX instructions with more powerful AVX512 instructions. I never checked whether it really does so, but I assume it does, since SSE instructions are replaced with AVX ones when using -mavx -mprefer-vector-width=128
    Two simple cases where I think auto-vectorizers really ought to be able to take advantage of masking:
    1. Loops; the remainder of a loop (i.e., loop iterations mod vector width) ought to be handled in a single masked iteration, not serially -- see the sketch after this list.
    2. SLP vectorization. There are lots of opportunities where repeated calculations show up in multiples that aren't a power of 2. These don't get vectorized, or are handled inefficiently. E.g., if there are 7 elements, auto-vectorizers will often break that into vectors of length 4, 2, and 1. That means they'll use 3 loads/stores, consume 3 registers, and use 3 instructions per operation they want to perform on the 7 numbers. If they instead just used masked loads/stores, all those "3"s become "1"s. For a simple column-major matrix multiplication of a 7x13 by a 13x10 matrix, this makes the difference between taking 23 ns vs 130 ns (requiring 3 registers per result leads to spilling).
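    For illustration, a rough sketch of what point 1 looks like when written by hand with AVX-512 intrinsics -- the function itself is just a made-up example:
    ```
    /* Handle the loop remainder with one masked AVX-512 iteration
     * instead of a scalar tail. Compile with e.g. -O2 -mavx512f. */
    #include <immintrin.h>
    #include <stddef.h>

    void scale(double *x, double a, size_t n) {
        __m512d va = _mm512_set1_pd(a);
        size_t i = 0;
        for (; i + 8 <= n; i += 8)                    /* full 8-wide iterations */
            _mm512_storeu_pd(x + i, _mm512_mul_pd(_mm512_loadu_pd(x + i), va));

        size_t rem = n - i;                           /* 0..7 leftover elements */
        if (rem) {
            __mmask8 m = (__mmask8)((1u << rem) - 1); /* enable only the leftover lanes */
            __m512d v = _mm512_maskz_loadu_pd(m, x + i);
            _mm512_mask_storeu_pd(x + i, m, _mm512_mul_pd(v, va));
        }
    }
    ```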

    As you point out, Skylake-X and later can even use masks with the earlier instruction sets. Those sorts of tricks make it much easier to explicitly vectorize avx512 code than avx2 code -- meaning even cases where the optimal width is 4 or less should often get a boost. But compilers do not seem to take advantage of it.

    Originally posted by Virtus View Post
    The frequency is lower when using AVX instead of scalar instructions, and even lower when using AVX-512. I tried it on a Xeon Gold 6128, and the frequency is 1.9 GHz when using AVX-512, if I remember correctly.
    Yeah. You can mess with these settings in the BIOS / when overclocking.
    I have water cooling with a nice 360mm radiator. But avx512-heavy loads do generate a lot of heat, and will crash the computer if you set their clocks too high and run them on all cores.
    The crashes were practically instantaneous when doing linear algebra, which is more or less pure avx512, and took about a minute when running statistical software I wrote. Obviously, time to crash depends on how high your overclocks were.
    My point here is that -- as it wouldn't crash under avx2 loads -- the need to downclock on avx512 is real.

    But, on the other hand, all these test cases ran much faster thanks to avx512, because the wider vectors more than made up for the lower clocks.
    Unfortunately, it seems that to see this benefit, you have to deliberately write the software with avx512 vectorization in mind...



    In many of my workloads, avx512 provides a huge benefit, and I don't know how to get them working on a GPU (the standard avx512 counterargument).
    But benchmarks like these (and ones I ran on the Polyhedron test suite in an earlier Phoronix discussion) suggest I'm in the minority, and that to most, avx512 is wasted silicon. =(
    Selfishly, I hope Intel doesn't decide to abandon avx512, that AMD decides to at least support it the way first-gen Ryzen supported avx2, and that compilers improve.

    Processor manufacturers obviously can't afford to give too many of them away, but I can't help but think they'd be better off if folks working on LLVM's and GCC's optimizers had chips with the latest instructions. Same for libraries like OpenBLAS and FFTW. In the case of OpenBLAS, its primary maintainer (martin-frbg) said [regarding avx-512](https://github.com/xianyi/OpenBLAS/p...nt-427099496):
    Fixed the naming for you, will leave the algorithm discussion to the adults with SkylakeX access.
    OpenBLAS was very slow to get avx512 support, and AFAIK it is still pretty bad. Odds are, if he had access to such a chip, avx512 would've started looking good in OpenBLAS benchmarks much sooner. Julia's official binaries ship with OpenBLAS, as does R on most Linux distros.

    Leave a comment:


  • Mani
    replied
    Additional charts normalized to clock speed would indeed be very interesting, especially since that should correlate directly with power draw.

    Leave a comment:


  • Virtus
    replied
    Originally posted by pegasus View Post
    Very interesting indeed.

    Are these tests run at a fixed cpu frequency? If not, can the frequency be monitored and displayed with the results as well?
    The frequency is lower when using AVX instead of scalar instructions, and even lower when using AVX-512. I tried it on a Xeon Gold 6128, and the frequency is 1.9 GHz when using AVX-512, if I remember correctly.

    Leave a comment:


  • carewolf
    replied
    Originally posted by celrod View Post
    Great! I've been wanting to see benchmarks comparing different `-mprefer-vector-width=` options for a while.
    It's disappointing to see `=512` do so poorly. I think compilers do a bad job of taking advantage of it; e.g., I've never seen one emit masked operations on its own.
    But note that with AVX512VL support, the compiler can also emit masked operations at 128- and 256-bit widths.

    Edit: Basically, the AVX512VL extension makes it possible for the compiler to replace SSE and AVX instructions with more powerful AVX512 instructions. I never checked whether it really does so, but I assume it does, since SSE instructions are replaced with AVX ones when using -mavx -mprefer-vector-width=128
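    One quick way to check (a sketch with a made-up kernel) is to compile the same loop with each width setting and compare the generated asm to see whether xmm, ymm, or zmm registers (and EVEX masking) show up:
    ```
    /* Build with e.g.
     *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=128 -S axpy.c
     * and compare the .s output against -mprefer-vector-width=256 and =512. */
    void axpy(double *restrict y, const double *restrict x, double a, int n) {
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }
    ```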
    Last edited by carewolf; 28 May 2019, 03:46 AM.

    Leave a comment:
