Announcement

**carewolf** · 08 March 2019, 02:36 PM

Originally posted by birdie View Post

Nowadays I compile most apps with

Code:

-O2 -march=native -flto

- this seems like the best combo for maximum performance.

Add -ftree-loop-vectorize to get auto-vectorization, and you have a really good combo.

**vegabook** · 08 March 2019, 03:16 PM

The problem with AVX512 is that workloads which benefit from it are getting too close to those which would simply benefit from a wholesale move to the GPU. Intel panicked as NVIDIA ate compute market share, gave us the failed Knight's Landing, and now tries to rescue the idea by putting AVX512 into mainstream processors.

AVX256 is probably the optimum when it comes to the tradeoff between latency and bandwidth. You're getting to a point where the 512 instructions are sooo power hungry that you're better off making the jump to GPU, which will absolutely smash it in terms of throughput.

**birdie** · 08 March 2019, 06:01 PM

Originally posted by carewolf View Post

Add -ftree-loop-vectorize to get auto-vectorization, and you have a really good combo.

I need to recheck this flag since back in the GCC 4.x days it often used to cause a performance loss.

**carewolf** · 08 March 2019, 06:13 PM

Originally posted by birdie View Post

I need to recheck this flag since back in the GCC 4.x days it often used to cause a performance loss.

It makes binaries bigger, but many small tight loops are automatically 2-4 times faster when using it. Though it is has to be a very clean and efficiently written loop. It makes a huge difference in the code that i write.

**celrod** · 08 March 2019, 08:47 PM

Originally posted by vegabook View Post

The problem with AVX512 is that workloads which benefit from it are getting too close to those which would simply benefit from a wholesale move to the GPU. Intel panicked as NVIDIA ate compute market share, gave us the failed Knight's Landing, and now tries to rescue the idea by putting AVX512 into mainstream processors.

AVX256 is probably the optimum when it comes to the tradeoff between latency and bandwidth. You're getting to a point where the 512 instructions are sooo power hungry that you're better off making the jump to GPU, which will absolutely smash it in terms of throughput.

My workloads are statistical simulations, using Hamiltonian Monte Carlo. Primarily verifying Bayesian models.
These consist of generating a lot of different fake data sets from known true values, possibly varying inputs such as prior information, and making sure the model performs correctly by recovering the "truth" used to generate the data reasonably well.

A single fit can easily take hours. To get a good sample of runs for validation, we may want hundreds of fits.
Hundreds is not "embarrassingly parallel". Each fit also uses too much data to realistically run 4000 simultaneously on a graphics card.
But they do not use enough data to become memory bound on a CPU core, when running just (number of physical cores) Markov Chains at a time.

Each fit consists of several Markov chains, which require repeatedly evaluating a model and calculating the gradients.
These can consist of a wide variety of computations. The most expensive of which can easily be vectorized with avx512, but some of it is serial. Lots of small/moderate sized matrix multiplications, vectorized special functions, etc.
The cost of dispatching the vectorizable pieces to a GPU/coprocessor would often be more expensive than simply using the CPU.

Avx512 is a really easy way to make this sort of work roughly twice as fast.

Now, my Skylake-X CPU did cost more money than my Vega-64 graphics card, yet that graphics card is about 10x faster at matrix multiplication than the Skylake-X (which in turn is about 4x faster than a similarly priced Ryzen Threadripper).
So, some day, I really do want to spend the time and see if it's possible to get things to work on a graphics card.

If I understand correctly, Vega64 could be thought of as 256 units that each have 64-wide vectors. So, running 256 chains at a time with 64-wide vector units each MIGHT be something that could work for some models.
If it's only 25% efficient, that would still be a big win: 25% of the 10x faster large matrix multiplication achieves is two and a half times faster. So there is a lot of potential.

But I am still not optimistic about the prospects, and imagine it will be a steep (although fun!) learning curve if it's even possible and practical. And will take time I don't have right now.

Versus avx512, which can just makes those workloads about twice as fast.

**vegabook** · 08 March 2019, 10:45 PM

Originally posted by celrod View Post

My workloads are statistical simulations, using Hamiltonian Monte Carlo. Primarily verifying Bayesian models.
These consist of generating a lot of different fake data sets from known true values, possibly varying inputs such as prior information, and making sure the model performs correctly by recovering the "truth" used to generate the data reasonably well.

A single fit can easily take hours. To get a good sample of runs for validation, we may want hundreds of fits.
Hundreds is not "embarrassingly parallel". Each fit also uses too much data to realistically run 4000 simultaneously on a graphics card.
But they do not use enough data to become memory bound on a CPU core, when running just (number of physical cores) Markov Chains at a time.

Each fit consists of several Markov chains, which require repeatedly evaluating a model and calculating the gradients.
These can consist of a wide variety of computations. The most expensive of which can easily be vectorized with avx512, but some of it is serial. Lots of small/moderate sized matrix multiplications, vectorized special functions, etc.
The cost of dispatching the vectorizable pieces to a GPU/coprocessor would often be more expensive than simply using the CPU.

Avx512 is a really easy way to make this sort of work roughly twice as fast.

Now, my Skylake-X CPU did cost more money than my Vega-64 graphics card, yet that graphics card is about 10x faster at matrix multiplication than the Skylake-X (which in turn is about 4x faster than a similarly priced Ryzen Threadripper).
So, some day, I really do want to spend the time and see if it's possible to get things to work on a graphics card.

If I understand correctly, Vega64 could be thought of as 256 units that each have 64-wide vectors. So, running 256 chains at a time with 64-wide vector units each MIGHT be something that could work for some models.
If it's only 25% efficient, that would still be a big win: 25% of the 10x faster large matrix multiplication achieves is two and a half times faster. So there is a lot of potential.

But I am still not optimistic about the prospects, and imagine it will be a steep (although fun!) learning curve if it's even possible and practical. And will take time I don't have right now.

Versus avx512, which can just makes those workloads about twice as fast.

Look I hear you. I got some amazing linalg wins using AVX256, which, when I moved to Intel's MKL BLAS libraries a few years ago I was getting something like 6x improvements! But then I moved to GPU and won 50x more....

AVX has been good to me, and it happens to be everywhere including on AMD, therefore is well supported in software. Unlike the mess which is the GPU software ecosystem outside of CUDA, and for which you need to pay very serious money if you need FP64, as do I (btw, on that, and if you have the patience to do OpenCL or Rocm, Radeon VII is an awesome price performance thanks to AMD only slightly gimping its FP64 units).

Mainstream GPU architecture has only been "democratic" for just over 10 years or so, whereas with the CPU it is more like 40 years. So coding for it is tough, unless you're doing something fairly straightforward such as what I am. But it's probably worth doing because the performance improvements can be sooo huge that you literally open up a whole new class of problems to feasability. Think several extra dimensions to your parameter search space. If I understand Hamiltonian Monte Carlo and I think I have a basic grasp, you should be able to do quite a lot GPU-side and it might be a revelation. But I'm not an expert.

BTW Zen 2, aka Ryzen 3, due later this year, will roughly double performance per core on AVX (apparently), so if AMD persists with its insane core counts, we might get some very interesting Threadrippers around August or so.

**Grinch** · 08 March 2019, 10:52 PM

Originally posted by -MacNuke- View Post

-Ofast introduces unsafe math operations and can break applications and results. It is not a good candidate for comparison at all.

Only in software that requires very high precision math in order to function correctly. As an example, I've been compiling Blender with -Ofast for extra speed when rendering using its Cycles path tracer for years without any problems.

I'd say a vast majority of software out there works perfectly with -Ofast, on the other hand the amount of software where using it makes a worthwhile performance difference is likely quite small.

**_Alex_** · 11 March 2019, 05:06 PM

How is it possible to have zero gains on average with avx512, when even if you don't use it for its 512 bit vector sizes you still get xmm0-31 / ymm0-31 + 8 k registers...

In my experience rewriting some asm code on sse and avx2 (128/256 bit vectors) with the avx512 implementation that allows accessing xmm16-31 / ymm16-31, the extra registers alone gave me 10% performance because it eliminated shuffling data around to the cache...

**max0x7ba** · 12 March 2019, 12:25 PM

Originally posted by carewolf View Post

Add -ftree-loop-vectorize to get auto-vectorization, and you have a really good combo.

-ftree-loop-vectorize is included in -O3 and it doesn't vectorize much.

Here is an example that neither g++-8 not clang++-6 vectorize, but explicit SIMD takes it to the next level: https://stackoverflow.com/a/55088736/412080

Announcement

GCC 9 Compiler Tuning Benchmarks On Intel Skylake AVX-512

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment