Core i9 7980XE GCC 9 AVX Compiler Tuning Performance Benchmarks

  • Core i9 7980XE GCC 9 AVX Compiler Tuning Performance Benchmarks

    Phoronix: Core i9 7980XE GCC 9 AVX Compiler Tuning Performance Benchmarks

    Continuing with this month's benchmarks of the newly released GCC 9 compiler, here are some additional numbers for the AVX-512-enabled Intel Core i9 7980XE processor on Ubuntu Linux when tuning for various AVX vector widths...


  • #2
    Great! I've been wanting to see benchmarks comparing different `-mprefer-vector-width=` options for a while.
    It's disappointing to see `=512` do so poorly. I think compilers do a bad job of taking advantage of it; e.g., I've never seen one emit masked operations on their own.
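
    For anyone who wants to try this locally, here is a minimal loop of the kind these options steer; the file name and compile lines below are just an illustration, not taken from the article:

    ```c
    /* saxpy.c -- a simple loop whose vector width GCC picks based on
     * -mprefer-vector-width (illustrative example, not from the article). */
    #include <stddef.h>

    void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Compare the generated assembly, e.g.:
     *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=256 -S saxpy.c
     *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 -S saxpy.c
     * With =512 the hot loop should use zmm registers; with =256, ymm. */
    ```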



    • #3
      This may be due to the use of -fvect-cost-model=dynamic with -O3: either the overhead of the runtime checks or lower rates of vectorization due to the profitability model slows things down. -O2 uses -fvect-cost-model=cheap, but I've found -fvect-cost-model=unlimited to be the best for a small set of benchmarks on Skylake-X. Michael, would it be possible to rerun these benchmarks with -fvect-cost-model varied as well?

      -fvect-cost-model=model
      Alter the cost model used for vectorization. The model argument should be one of ‘unlimited’, ‘dynamic’ or ‘cheap’. With the ‘unlimited’ model the vectorized code-path is assumed to be profitable while with the ‘dynamic’ model a runtime check guards the vectorized code-path to enable it only for iteration counts that will likely execute faster than when executing the original scalar loop. The ‘cheap’ model disables vectorization of loops where doing so would be cost prohibitive for example due to required runtime checks for data dependence or alignment but otherwise is equal to the ‘dynamic’ model.
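
      For concreteness, here is the sort of minimal comparison I have in mind; the file name and exact flags are just an illustration:

      ```c
      /* bench.c -- a loop where the cost models diverge: without 'restrict',
       * the vectorized path needs a runtime overlap check, which 'dynamic'
       * pays for and 'cheap' may refuse outright (illustrative example). */
      #include <stddef.h>

      void scale_add(double *y, const double *x, double a, size_t n)
      {
          for (size_t i = 0; i < n; i++)
              y[i] += a * x[i];
      }

      /* Possible comparison builds:
       *   gcc -O3 -march=skylake-avx512 -fvect-cost-model=cheap     -S bench.c
       *   gcc -O3 -march=skylake-avx512 -fvect-cost-model=dynamic   -S bench.c
       *   gcc -O3 -march=skylake-avx512 -fvect-cost-model=unlimited -S bench.c
       * Adding -fopt-info-vec-missed reports the loops each model declines to vectorize. */
      ```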
      Last edited by wolfwood; 27 May 2019, 10:35 PM.



      • #4
        Very interesting indeed.

        Are these tests run at a fixed CPU frequency? If not, can the frequency be monitored and displayed alongside the results?
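
        In case it helps, one low-tech way to log the clock during a run is sketched below; it assumes the cpufreq sysfs interface is available and is not what the Phoronix Test Suite actually does:

        ```c
        /* freq_log.c -- sample the current frequency of core 0 once per second.
         * Hypothetical helper; reads the cpufreq sysfs file, which reports kHz. */
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            const char *path = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";
            for (;;) {
                FILE *f = fopen(path, "r");
                if (!f)
                    return 1;
                long khz = 0;
                if (fscanf(f, "%ld", &khz) == 1)
                    printf("%.2f GHz\n", khz / 1e6);
                fclose(f);
                sleep(1);
            }
        }
        ```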



        • #5
          Originally posted by celrod View Post
          Great! I've been wanting to see benchmarks comparing different `-mprefer-vector-width=` options for a while.
          It's disappointing to see `=512` do so poorly. I think compilers do a bad job of taking advantage of it; e.g., I've never seen one emit masked operations on their own.
          But note that with AVX512VL support the compiler can also emit masked operations at 128- and 256-bit widths.

          Edit: Basically, the AVX512VL extension makes it possible for the compiler to replace SSE and AVX instructions with more powerful AVX-512 instructions. I've never checked whether it actually does so, but I assume it does, since SSE instructions are replaced with AVX ones when using -mavx -mprefer-vector-width=128.
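
          One way to check is to look at the assembly yourself; the file name below is made up and this is only a sketch:

          ```c
          /* check_vl.c -- see whether GCC uses AVX-512VL at narrow widths
           * (illustrative test file, not from this thread). */
          #include <stddef.h>

          void add_arrays(float *restrict dst, const float *restrict src, size_t n)
          {
              for (size_t i = 0; i < n; i++)
                  dst[i] += src[i];
          }

          /* For example:
           *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=128 -S check_vl.c
           * AVX-512VL shows up as xmm/ymm instructions that use the k mask registers
           * or xmm16-xmm31; plain VEX-encoded AVX code uses neither. */
          ```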
          Last edited by carewolf; 28 May 2019, 03:46 AM.



          • #6
            Originally posted by pegasus View Post
            Very interesting indeed.

            Are these tests run at a fixed CPU frequency? If not, can the frequency be monitored and displayed alongside the results?
            The frequency is lower when using AVX instead of scalar instructions, and even lower when using AVX-512. I tried it on a Xeon Gold 6128, and if I remember correctly the frequency was 1.9 GHz when using AVX-512.



            • #7
              Additional charts normalized to clock speed would indeed be very interesting, especially since that should correlate directly with power draw.



              • #8
                Originally posted by carewolf View Post

                But note that with AVX512VL support the compiler can also emit masked operations at 128- and 256-bit widths.

                Edit: Basically, the AVX512VL extension makes it possible for the compiler to replace SSE and AVX instructions with more powerful AVX-512 instructions. I've never checked whether it actually does so, but I assume it does, since SSE instructions are replaced with AVX ones when using -mavx -mprefer-vector-width=128.
                Two simple cases where I think auto-vectorizers really ought to be able to take advantage of masking:
                1. Loops: the remainder of a loop (i.e., loop iterations mod vector width) ought to be done in a single masked iteration, not serially (see the sketch after this list).
                2. SLP vectorization. There are lots of opportunities where repeated calculations show up in multiples that aren't a power of 2. These don't get vectorized, or are handled inefficiently. E.g., if there are 7 elements, auto-vectorizers will often break that into vectors of length 4, 2, and 1. That means they'll use 3 loads/stores, consume 3 registers, and issue 3 instructions per operation they want to perform on the 7 numbers. If they instead just used masked loads/stores, all of those "3"s become "1". For a simple column-major matrix multiplication of a 7x13 by a 13x10 matrix, this makes the difference between 23 ns and 130 ns (requiring 3 registers per result leads to spilling).
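
                Here is roughly what I mean for point 1, written with intrinsics; this is an illustrative sketch, not compiler output:

                ```c
                /* Masked loop remainder: the final n % 8 doubles are handled in one
                 * masked iteration instead of a scalar tail loop (illustrative sketch). */
                #include <immintrin.h>
                #include <stddef.h>

                void scale(double *x, double a, size_t n)
                {
                    __m512d va = _mm512_set1_pd(a);
                    size_t i = 0;
                    for (; i + 8 <= n; i += 8)
                        _mm512_storeu_pd(x + i, _mm512_mul_pd(_mm512_loadu_pd(x + i), va));
                    if (i < n) {                              /* 1..7 elements left */
                        __mmask8 m = (__mmask8)((1u << (n - i)) - 1u);
                        __m512d v = _mm512_maskz_loadu_pd(m, x + i);
                        _mm512_mask_storeu_pd(x + i, m, _mm512_mul_pd(v, va));
                    }
                }
                ```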

                Like you point out, Skylake-X and later can even use masks with the earlier instruction sets. That sort of trick makes it much easier to explicitly vectorize AVX-512 code than AVX2 code -- meaning even when the optimal width is 4 or less, it should often get a boost. But compilers do not seem to take advantage of it.

                Originally posted by Virtus View Post
                The frequency is lower when using AVX instead of scalar instructions, and even lower when using AVX-512. I tried it on a Xeon Gold 6128, and if I remember correctly the frequency was 1.9 GHz when using AVX-512.
                Yeah. You can mess with these settings in the BIOS / when overclocking.
                I have water cooling with a nice 360mm radiator, but AVX-512-heavy loads generate a lot of heat, and they will crash the computer if you set their clocks too high and run them on all cores.
                The crashes were practically instantaneous when doing linear algebra, which is more or less pure AVX-512, and took about a minute when running statistical software I wrote. Obviously, time to crash depends on how high your overclocks are.
                My point here is that -- since it wouldn't crash under AVX2 loads -- the need to downclock on AVX-512 is real.

                But, on the other hand, all these test cases ran much faster thanks to AVX-512, because the wider vectors more than made up for the lower clocks.
                Unfortunately, it seems that to see this benefit, you have to deliberately write the software with AVX-512 vectorization in mind...



                In many of my workloads, AVX-512 provides a huge benefit, and I don't know how to get them working on a GPU (the standard AVX-512 counterargument).
                But seeing benchmarks like these (and ones I ran on the Polyhedron test suite in an earlier Phoronix discussion) suggests I'm in the minority; that to most people AVX-512 is wasted silicon. =(
                Selfishly, I hope Intel doesn't decide to abandon AVX-512, that AMD decides to at least support it the way first-gen Ryzen supported AVX2, and that compilers improve.

                Processor manufacturers obviously can't afford to give too many of them away, but I can't help but think they'd be better off if the folks working on LLVM's and GCC's optimizers had chips with the latest instructions. Same for libraries like OpenBLAS and FFTW. In the case of OpenBLAS, its primary maintainer (martin-frbg) said [regarding avx-512](https://github.com/xianyi/OpenBLAS/p...nt-427099496):
                Fixed the naming for you, will leave the algorithm discussion to the adults with SkylakeX access.
                OpenBLAS was very slow to get AVX-512 support, and AFAIK it is still pretty bad. Odds are, if he had had access to such a chip, AVX-512 would've started looking good on OpenBLAS benchmarks much sooner. Julia's official binaries ship with OpenBLAS, as does R on most Linux distros.



                • #9
                  Originally posted by celrod View Post
                  In many of my workloads, AVX-512 provides a huge benefit, and I don't know how to get them working on a GPU (the standard AVX-512 counterargument).
                  But seeing benchmarks like these (and ones I ran on the Polyhedron test suite in an earlier Phoronix discussion) suggests I'm in the minority; that to most people AVX-512 is wasted silicon. =(
                  You are indeed. I've seen most math-heavy workloads migrate to GPUs in the past few years, and nobody cares about AVX-512 these days.
                  If you can't get your code ported to OpenMP/OpenCL/CUDA to eventually run on GPUs, take a look at NEC Aurora Tsubasa cards. They're super-wide vector engines built with NEC's decades of vector-supercomputer know-how.



                  • #10
                    I think Bayesian statistics would be a great niche for AVX-512, but statisticians in general (Bayesian or otherwise) are computer- and optimization-illiterate; the most popular BLAS library is the reference BLAS you get when downloading R with RStudio, which is about 30x slower than OpenBLAS & co.
                    Even software written by computer scientists for doing Bayesian inference, like Stan, has its most basic data structure specified in a way that is incompatible with vectorization: like elements are not stored contiguously in arrays (see the sketch below).
                    In my experience, it isn't too difficult to get >4x, and sometimes >100x, better performance.
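
                    To make the contiguity point concrete, here is an illustrative sketch; these are not Stan's actual types:

                    ```c
                    /* Array-of-structs vs struct-of-arrays: only the latter gives the
                     * auto-vectorizer unit-stride loads over "like elements". */
                    #include <stddef.h>

                    struct param_aos  { double mu; double sigma; };             /* mu values are strided     */
                    struct params_soa { double *mu; double *sigma; size_t n; }; /* mu values are contiguous  */

                    void scale_mu(struct params_soa *p, double a)
                    {
                        for (size_t i = 0; i < p->n; i++)   /* unit-stride loads/stores over p->mu */
                            p->mu[i] *= a;
                    }
                    ```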

                    Of course, as long as reference BLAS implementations and Stan are popular, no one will benefit from any vectorization at all.
                    But programs running for days or weeks are fairly routine; there is a need, just no awareness that better is possible. It is a field that has a lot to gain from better-optimized software, if only its practitioners were as computer-literate as the machine-learning folks.

                    Double precision is a must. I'm still a graduate student (I will finish in a couple of months), so for now good double-precision performance on a GPU is out of my budget (maybe I'll get a Radeon VII, which has quarter-rate double precision -- or a Navi successor if any are similar).
                    And I do intend to play around eventually. Typical Bayesian models are nowhere near as clean to vectorize as machine learning.
                    A neural network is just a bunch of large matrix multiplications and non-linear transforms applied to large vectors.
                    With a Bayesian model, you have a lot of smaller-grained stuff. 256-wide vectors will be overkill for much of it. And I'm worried about memory problems if you try to run too many chains on simulated data sets at once.
                    But I have very little experience with GPGPU, so it still seems alien/intimidating/hard to reason about.

