
A Look At The GCC Compiler Tuning Performance Impact For Intel Ice Lake


    Phoronix: A Look At The GCC Compiler Tuning Performance Impact For Intel Ice Lake

    For those wondering whether recompiling your key Linux binaries with the Ice Lake microarchitecture instruction set extensions and tuning is worthwhile for performance, here are some GCC compiler benchmarks looking at that impact on the Core i7 1065G7 in the Dell XPS 7390.

    http://www.phoronix.com/vr.php?view=28523

  • #2
    And no results without -march at all and just -mtune=generic. That would be great to see as well.



    • #3
      Originally posted by birdie View Post
      And no results without -march at all and just -mtune=generic. That would be great to see as well.
      Yeah, some baseline would be nice.

      The results are still strange: the optimizations added since -march=skylake seem to make no difference for either processor manufacturer.



      • #4
        These results show precisely what is to be expected: automatic vectorization is not going to increase the performance of already performance-optimized code.

        However, the problem with these benchmark results is that they might suggest to the uninitiated that AVX-512 is to be avoided if one wants the best possible performance, which is obviously far from the truth.

        Also interesting is the very strange choice of both GCC and LLVM to limit the vector length of AVX-512 to 256 bit by default. This is like NVIDIA making a driver change to only use half the amount of available CUDA cores...



        • #5
          Originally posted by lucasbekker View Post
          Also interesting is the very strange choice of both GCC and LLVM to limit the vector length of AVX-512 to 256 bit by default.
          Using the full 512 bits can cause an additional processor downclock (beyond the 256-bit downclock). So careful tuning is required to know which is better for which type of application, and the compiler's heuristics will never be perfect (compilers are still not omniscient).



          • #6
            Originally posted by CommunityMember View Post
            Using the full 512 bits can cause an additional processor downclock (beyond the 256-bit downclock). So careful tuning is required to know which is better for which type of application, and the compiler's heuristics will never be perfect (compilers are still not omniscient).
            I am aware of that. But like you said, one needs to test both 256 bit and 512 bit vector lengths to achieve the best performance. My gripe is that LLVM and GCC use 256 bit as the default. In my opinion, "-march=native" implies that ALL the resources of the CPU are available, not just half.



            • #7
              You could try with -O2. I suspect the problem is autovectorization using AVX-512 slowing things down. It should probably only be used selectively.

              edit: Or perhaps -mprefer-vector-width=128 vs 256 vs 512 together with -O3 would be more interesting.
              Last edited by carewolf; 11-22-2019, 04:07 AM.
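
              The sweep carewolf suggests could be run on a tiny kernel like the following. This is a hypothetical dot-product loop written for illustration; the compile lines in the header comment use the actual GCC flags under discussion, and the printed sum should be identical at every vector width:

              ```cpp
              // A minimal kernel to sweep the flags over, e.g.:
              //   gcc -O3 -march=icelake-client -mprefer-vector-width=128 kernel.c -o k128
              //   gcc -O3 -march=icelake-client -mprefer-vector-width=256 kernel.c -o k256
              //   gcc -O3 -march=icelake-client -mprefer-vector-width=512 kernel.c -o k512
              #include <stdio.h>

              static double a[4096], b[4096];

              int main(void) {
                  for (int i = 0; i < 4096; i++) { a[i] = i; b[i] = 2.0 * i; }
                  double s = 0.0;
                  for (int i = 0; i < 4096; i++)  // the loop the preferred vector width applies to
                      s += a[i] * b[i];
                  printf("%.0f\n", s);            // same result at every width; only speed differs
                  return 0;
              }
              ```

              Timing each binary (e.g. with `perf stat`) would show whether the wider vectors pay for the downclock on a given workload.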



              • #8
                Originally posted by lucasbekker View Post
                However, the problem with these benchmark results is that they might suggest to the uninitiated that AVX-512 is to be avoided if one wants the best possible performance, which is obviously far from the truth.

                Also interesting is the very strange choice of both GCC and LLVM to limit the vector length of AVX-512 to 256 bit by default. This is like NVIDIA making a driver change to only use half the amount of available CUDA cores...
                I routinely get substantial benefits from avx512, but I always hold vectorization in mind as I code, and routinely use explicit SIMD rather than rely on the auto-vectorizers.
                Some things, like taking advantage of masks, seem simple enough that I'd guess once someone writes or extends a pass, compilers will start vectorizing non-power-of-2 lengths (eg, treat 5-7 as 8 with some lanes masked, and 3 as 4 with 1 masked), which should help in quite a few situations. IIRC, ISPC can already do this.
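
                The masked-lane idea can be sketched in portable scalar C++; the width-8 arrays here are a stand-in for an 8-wide AVX-512 double vector, and the `bool` mask stands in for a k-register:

                ```cpp
                #include <cstdio>

                constexpr int W = 8;  // stand-in for an 8-wide AVX-512 double vector

                // Add b into a for exactly n <= W elements by running all W lanes
                // branch-free and masking off the tail -- the pattern that AVX-512
                // mask registers make cheap in real vector code.
                void masked_add(double* a, const double* b, int n) {
                    bool mask[W];
                    for (int l = 0; l < W; ++l) mask[l] = l < n;  // build the lane mask
                    for (int l = 0; l < W; ++l)                   // full-width body, no branch on n
                        a[l] = mask[l] ? a[l] + b[l] : a[l];
                }

                int main() {
                    double a[W] = {1, 1, 1, 1, 1, 100, 100, 100};
                    double b[W] = {2, 2, 2, 2, 2, 2, 2, 2};
                    masked_add(a, b, 5);  // length 5 treated as 8 with 3 lanes masked off
                    printf("%.0f %.0f\n", a[4], a[5]);  // prints "3 100": lane 5 untouched
                    return 0;
                }
                ```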

                An example I found out about yesterday was replacing a single loop that used an if-check to conditionally evaluate a function with two loops. The first of these used compressed stores (VCOMPRESSPD) to filter the values that needed to be evaluated into much smaller vectors, and then the second loop could run SIMD on top of that.
                There are a lot of goodies in that instruction set that make vectorization easier. And yes, the computer may run 0.2 GHz slower than with avx2 (see Silicon Lottery's historical binning statistics), but the performance gain is often fairly substantial (ie, >50%).
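
                The two-loop idea can be sketched in portable scalar C++. The condition and the `std::log` call are made-up placeholders, and the compress pass here is a plain loop; the real version would do it per-vector with intrinsics such as `_mm512_mask_compressstoreu_pd`:

                ```cpp
                #include <cmath>
                #include <cstdio>
                #include <vector>

                // Pass 1: "compress" -- gather only the elements that satisfy the
                // condition into a dense buffer (VCOMPRESSPD does this per-vector).
                // Pass 2: evaluate the expensive function over the dense buffer,
                // which is now a contiguous, branch-free, SIMD-friendly loop.
                void conditional_eval(std::vector<double>& x) {
                    std::vector<double> dense;
                    std::vector<size_t> idx;
                    dense.reserve(x.size());
                    idx.reserve(x.size());
                    for (size_t i = 0; i < x.size(); ++i) {
                        if (x[i] > 0.0) {          // the original if-check
                            dense.push_back(x[i]);
                            idx.push_back(i);
                        }
                    }
                    for (size_t j = 0; j < dense.size(); ++j)  // dense, vectorizable pass
                        dense[j] = std::log(dense[j]);
                    for (size_t j = 0; j < dense.size(); ++j)  // scatter results back
                        x[idx[j]] = dense[j];
                }

                int main() {
                    std::vector<double> v{-1.0, 1.0, -2.0, std::exp(1.0)};
                    conditional_eval(v);  // only the positive elements get log()'d
                    printf("%.1f %.1f %.1f %.1f\n", v[0], v[1], v[2], v[3]);
                    return 0;
                }
                ```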

                The one thing I'd disagree with is...
                Originally posted by lucasbekker View Post
                These results show precisely what is to be expected. Automatic vectorization is not going to increase the performance of already performance optimized code.
                I think code that isn't performance-optimized is actually less likely to benefit. Code written without vectorization in mind probably can't even take advantage of avx2, because the data layout and computations aren't arranged in a way to actually allow for SIMD.
                This goes even for software that markets itself as high performance, eg:
                Stan is a state-of-the-art platform for statistical modeling and high-performance statistical computation.
                https://mc-stan.org/
                Yet, that library's data structures preclude SIMD. There's a discussion there of the problem, and I give an example of code evaluating several thousand times faster (I used a CPU with avx512, but avx2 benefits substantially too -- it's SIMD on one side vs chasing pointers on the other).
                I like Stan, which is why I'm familiar with it enough to use it as an example. But my impression is that this is a pervasive problem.
                Last edited by celrod; 11-23-2019, 01:15 AM.



                • #9
                  Originally posted by celrod View Post

                  I think code that isn't performance-optimized is actually less likely to benefit. Code written without vectorization in mind probably can't even take advantage of avx2, because the data layout and computations aren't arranged in a way to actually allow for SIMD.
                  Still, why write code with SIMD in mind without going all the way? Projects like XSIMD (https://github.com/xtensor-stack/xsimd) will probably work better than the compiler's automatic vectorization.



                  • #10
                    Originally posted by lucasbekker View Post
                    Still, why write code with SIMD in mind without going all the way? Projects like XSIMD (https://github.com/xtensor-stack/xsimd) will probably work better than the compiler's automatic vectorization.
                    I agree with you. In regards to "why...without going all the way?", you're preaching to the choir.
                    I'm actually on the list of xsimd contributors, although with a measly 2++/2-- change (fixing a couple of typos), and I maintain several similar Julia libraries (one of which wraps XSIMD) that almost all my projects depend on, either directly or indirectly through abstractions built on top.

                    Those libraries do work better than the autovectorizer, and they also give you special functions (exp, log, sin and friends). Recent versions of GCC's autovectorizer can handle these special functions, but Julia does not, nor does Clang without `-fveclib=`.
                    The autovectorizer is also prone to making mistakes. Maybe it doesn't vectorize because of failed alias analysis, or perhaps because it accidentally landed on a different, suboptimal vectorization pattern.
                    When I played around with Fortran, which (AFAIK) doesn't have any such library, I had to compile some files with `-fdisable-tree-cunrolli`, and other files without that option, otherwise the autovectorizer would do stupid things.
                    All that could change with different compilers or different versions of the same compiler, therefore you'd actually have to check each one to guarantee it's fast on them.

                    So yeah, to make code fast you do need to be thinking "it should be vectorized in this way", but at that point you really should just use a library like xsimd, to give yourself peace of mind and save yourself from having to read all the assembly (or LLVM IR) to verify that what you think should be happening actually is.

                    But I don't know about other people, how much they know, or what it would take to get them 50% or 90% of the way there performance wise.
                    Maybe it would be easier to teach/encourage them to do simple things like "store data that could be operated on in parallel with the same operations contiguously", or "use structs of arrays, not arrays of structs (when applicable)".
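
                    The layout advice can be illustrated with a hypothetical particle example (the names are made up for illustration). In the AoS form, a loop over one field strides through memory; in the SoA form, each field is a contiguous, unit-stride stream the autovectorizer handles well:

                    ```cpp
                    #include <cstdio>
                    #include <vector>

                    // Array of structs: x, y, z of one particle are adjacent, so a loop
                    // over x alone has stride 3 and resists vectorization.
                    struct ParticleAoS { double x, y, z; };

                    // Struct of arrays: all x values are contiguous, all y values are
                    // contiguous -- updates like x[i] += v * dt become unit-stride
                    // streams the autovectorizer likes.
                    struct ParticlesSoA {
                        std::vector<double> x, y, z;
                    };

                    int main() {
                        const int n = 4;
                        ParticlesSoA p{std::vector<double>(n, 0.0),
                                       std::vector<double>(n, 0.0),
                                       std::vector<double>(n, 0.0)};
                        const double v = 2.0, dt = 0.5;
                        for (int i = 0; i < n; ++i)  // contiguous, SIMD-friendly update
                            p.x[i] += v * dt;
                        printf("%.1f\n", p.x[3]);    // prints "1.0"
                        return 0;
                    }
                    ```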

