**celrod** · 08 March 2019, 01:33 AM

FYI, when tuning for skylake-avx512, GCC prefers 256-bit vectors and will not use 512-bit vector instructions without `-mprefer-vector-width=512`. That flag is where much of the potential gain from skylake-avx512 performance tuning comes from.

EDIT:
I guess I'm disappointed by these results, because in my own workloads I tend to see close to a 2x performance gain from AVX-512.
Part of that is because I made sure those workloads (which are numerical) actually get vectorized; when they do, AVX-512 delivers as promised.

So my impression is that, if all you're doing is running other people's code, and that code does not include BLAS/LAPACK, don't bother with avx512, and don't bother to enable the instruction set while compiling.

But, if you're actually writing the code you will run, that code can be vectorized, and you put in the effort to enable that, you will see the expected benefit.
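As a rough illustration of "code that can be vectorized", here is a minimal sketch of a loop that GCC's auto-vectorizer can usually handle. The function name `saxpy` and the file name in the comment are only illustrative; the flags are the ones discussed in this thread plus GCC's `-fopt-info-vec` report option.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] += a * x[i]: independent iterations over contiguous arrays.
 * `restrict` promises the compiler that x and y do not overlap, so it
 * does not need a runtime aliasing check before vectorizing.
 *
 * One way to build it and ask GCC what it vectorized (file name assumed):
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 \
 *       -fopt-info-vec -c saxpy.c
 */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Whether 512-bit instructions were actually emitted can be checked by looking for `zmm` registers in the disassembly; any speedup still depends on the workload.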

**Setif** · 08 March 2019, 01:37 AM

What about "-Ofast", "-Ofast -flto" and "-Ofast -march=skylake-avx512 -flto"?
I have a Haswell i7 processor and I got the best performance with "-Ofast -flto".

**pegasus** · 08 March 2019, 03:04 AM

Originally posted by celrod

So my impression is that, if all you're doing is running other people's code, and that code does not include BLAS/LAPACK, don't bother with avx512, and don't bother to enable the instruction set while compiling.

But, if you're actually writing the code you will run, that code can be vectorized, and you put in the effort to enable that, you will see the expected benefit.

This is a very good observation. Let me add another one: wide vectorization only makes sense if your data fits into the CPU caches and/or your access pattern is very cache friendly. If it does not, the available memory bandwidth is not enough to feed the vector units, and you see marginal, if any, performance improvement.
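One common way to put numbers on this is the roofline model (a standard estimate, not something measured in this thread): attainable throughput is the smaller of the compute peak and memory bandwidth times arithmetic intensity. The sketch below assumes that model; the function name and every input value are hypothetical.

```c
#include <assert.h>

/* Roofline estimate: performance is capped either by the compute peak
 * or by how fast memory can deliver operands.
 *   peak_gflops     - peak arithmetic throughput (GFLOP/s)
 *   bandwidth_gbs   - sustained memory bandwidth (GB/s)
 *   flops_per_byte  - arithmetic intensity of the kernel
 */
double attainable_gflops(double peak_gflops, double bandwidth_gbs,
                         double flops_per_byte)
{
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0 && flops_per_byte > 0.0);
    double memory_bound = bandwidth_gbs * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For a streaming kernel such as `y[i] += a * x[i]` on floats (2 flops per 12 bytes moved), the memory term is the limit for out-of-cache data, so doubling the compute peak with wider vectors leaves the estimate unchanged.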

If anyone out there has code that can really benefit from vectorization, I recommend taking a look at NEC SX-Aurora TSUBASA cards. They're the latest iteration of NEC's long line of vector supercomputers and come with six HBM2 memory stacks per chip in order to feed their vector units.

**FireBurn** · 08 March 2019, 03:05 AM

It would be great to see a test that reports both the time to compile and the performance of the resulting binary, giving a performance-per-second-of-compile-time metric, so we can see where the sweet spot is.
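One possible reading of such a metric is a break-even count: how many runs it takes before a slower-to-compile build pays for itself. The sketch below is only that interpretation; the helper name and its parameters are hypothetical and no numbers from the article are used.

```c
#include <assert.h>

/* Hypothetical helper: number of runs after which the extra compile time
 * of a more aggressive build is repaid by its shorter run time.
 * Returns -1.0 when the optimized binary is not faster (never repaid). */
double break_even_runs(double extra_compile_s, double base_run_s,
                       double opt_run_s)
{
    assert(extra_compile_s >= 0.0 && base_run_s > 0.0 && opt_run_s > 0.0);
    double saved_per_run = base_run_s - opt_run_s;
    if (saved_per_run <= 0.0)
        return -1.0;
    return extra_compile_s / saved_per_run;
}
```

With placeholder inputs, `break_even_runs(30.0, 10.0, 8.0)` means 30 extra seconds of compilation against 2 seconds saved per run.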

**Herem** · 08 March 2019, 03:05 AM

Michael, I was wondering: do the geometric mean results include the compilation time results, or only the application performance results?

**CochainComplex** · 08 March 2019, 04:20 AM

Originally posted by Herem

Michael, I was wondering: do the geometric mean results include the compilation time results, or only the application performance results?

I was wondering too.

**-MacNuke-** · 08 March 2019, 04:25 AM

Originally posted by Setif

What about "-Ofast", "-Ofast -flto" and "-Ofast -march=skylake-avx512 -flto"?
I have a Haswell i7 processor and I got the best performance with "-Ofast -flto".

-Ofast enables unsafe math optimizations (it turns on -ffast-math) and can break applications and change results. It is not a good candidate for comparison at all.
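To make the risk concrete, here is a small sketch of two patterns whose behaviour relies on strict IEEE 754 semantics. The function names are illustrative. Under `-Ofast`, GCC is allowed to assume NaNs never occur and to reassociate floating-point sums, so either function may behave differently than it does at `-O2`.

```c
#include <assert.h>

/* Classic NaN test: only NaN compares unequal to itself.
 * With -ffinite-math-only (implied by -Ofast) the compiler may assume
 * NaNs do not exist and fold this comparison to 0. */
int is_nan(double x)
{
    return x != x;
}

/* Floating-point addition is not associative. With strict IEEE 754
 * double evaluation, sum_left(1e16, -1e16, 1.0) gives 1.0, while
 * sum_right with the same arguments gives 0.0 because the 1.0 is lost
 * when added to -1e16. -Ofast permits the compiler to pick either order. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

Code that validates input with a NaN check, or that depends on a specific summation order for accuracy, is the kind of code that can silently change behaviour under these flags.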

**tildearrow** · 08 March 2019, 05:13 AM

dav1d:

-O0: 72.83fps
-O1: 131.83fps
-O2: 132.56fps
-O2 -march=skylake-avx512: 135.35fps
-O3: 133.94fps
-O3 -march=x86-64: 133.69fps
-O3 -march=skylake: 136.53fps
-O3 -march=skylake-avx512: 134.69fps
-Ofast -march=skylake-avx512: 134.04fps

**babali** · 08 March 2019, 05:18 AM

-flto shows poor results!

GCC 9 Compiler Tuning Benchmarks On Intel Skylake AVX-512
