Intel Cascade Lake Xeon Benchmarks With GCC 8 vs. GCC 9 Compilers


  • Intel Cascade Lake Xeon Benchmarks With GCC 8 vs. GCC 9 Compilers

    Phoronix: Intel Cascade Lake Xeon Benchmarks With GCC 8 vs. GCC 9 Compilers

    With today's release of the GCC 9 compiler, among the many features offered in this annual update to the GNU Compiler Collection is the initial "cascadelake" target/tuning support, which enables AVX-512 VNNI by default and is the most notable change over the former "skylake-avx512" GCC target. For those wondering how GCC 9's performance compares with last year's GCC 8 compiler release, here are some preliminary benchmarks...


  • #2
    That looks like a decent raising of the bar from GCC 9.x. Nice!



    • #3
      And now the same with CPUs without AVX(-512) and/or Zen-based ones. Thanks.



      • #4
        I think you will find it has also greatly improved when not using -march (as most Linux distros do not).



        • #5
          Originally posted by nuetzel View Post
          And now the same with CPUs without AVX(-512) and/or Zen-based ones. Thanks.
          He compiled with -march=skylake, which would only use AVX2. Of course, more varied data would be better; I suspect it is under way, though.



          • #6
            Originally posted by nuetzel View Post
            And now the same with CPUs without AVX(-512) and/or Zen-based ones. Thanks.
            Originally posted by carewolf View Post
            He compiled with -march=skylake, which would only use AVX2. Of course, more varied data would be better; I suspect it is under way, though.
            Actually, `-march=cascadelake` does not enable AVX-512 either. For that, you need `-march=cascadelake -mprefer-vector-width=512`.

            Check out Godbolt, the compiler explorer: compare using `-mprefer-vector-width=512` versus not using it. Specifically, look at line 13 in both of those examples and you will notice a difference: the former uses zmm registers, the latter ymm registers.
            zmm registers are 512 bits and ymm registers are 256 bits. You'll see that the first one uses a `vfmadd231pd` instruction with zmm registers, while the second instead uses ymm registers. That means only when using `-mprefer-vector-width=512` will it actually use AVX-512 for that example.

            This is even true when you tell it the specific number of loop iterations. For example, if N is hardcoded to 16:
            -march=cascadelake -O3 -mprefer-vector-width=512: 2x vfma with zmm registers (each fma operates on 8 doubles)
            -march=cascadelake -O3: 4x vfma with ymm registers (each fma operates on 4 doubles)
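
            For reference, a minimal C loop along these lines (my own sketch, not the exact snippet behind the Godbolt links above) shows the behavior: with `gcc -O3 -march=cascadelake` the vectorizer sticks to 256-bit ymm registers, and adding `-mprefer-vector-width=512` switches it to 512-bit zmm registers.
            Code:
            /* Fused multiply-add loop: gcc -O3 -march=cascadelake vectorizes this
               with 256-bit ymm registers by default; adding
               -mprefer-vector-width=512 makes it use 512-bit zmm registers,
               i.e. vfmadd231pd on 8 doubles at a time instead of 4. */
            void fma_loop(double *restrict a, const double *restrict b,
                          const double *restrict c, int n)
            {
                for (int i = 0; i < n; i++)
                    a[i] += b[i] * c[i];
            }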



            • #7
              Originally posted by celrod View Post



              Actually, `-march=cascadelake` does not enable AVX-512 either. For that, you need `-march=cascadelake -mprefer-vector-width=512`.

              Check out Godbolt, the compiler explorer: compare using `-mprefer-vector-width=512` versus not using it. Specifically, look at line 13 in both of those examples and you will notice a difference: the former uses zmm registers, the latter ymm registers.
              zmm registers are 512 bits and ymm registers are 256 bits. You'll see that the first one uses a `vfmadd231pd` instruction with zmm registers, while the second instead uses ymm registers. That means only when using `-mprefer-vector-width=512` will it actually use AVX-512 for that example.

              This is even true when you tell it the specific number of loop iterations. For example, if N is hardcoded to 16:
              -march=cascadelake -O3 -mprefer-vector-width=512: 2x vfma with zmm registers (each fma operates on 8 doubles)
              -march=cascadelake -O3: 4x vfma with ymm registers (each fma operates on 4 doubles)
              It does enable AVX-512, but the CPU tuning for cascadelake says that 256-bit vectors are preferred. This is because of the issue of the CPU lowering its clock speed when executing 512-bit instructions.
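
              For illustration (a minimal sketch, not from the thread; the function name is made up): explicit 512-bit intrinsics still compile with plain `-march=cascadelake`, since that flag does enable AVX-512F; the 256-bit preference only affects what the autovectorizer chooses on its own.
              Code:
              #include <immintrin.h>

              /* Accepted with gcc -O3 -march=cascadelake alone, because the -march
                 flag enables AVX-512F; only the autovectorizer's default width
                 preference is 256-bit. */
              void fma512_once(double *a, const double *b, const double *c)
              {
                  __m512d va = _mm512_loadu_pd(a);
                  __m512d vb = _mm512_loadu_pd(b);
                  __m512d vc = _mm512_loadu_pd(c);
                  va = _mm512_fmadd_pd(vb, vc, va);   /* va = vb*vc + va */
                  _mm512_storeu_pd(a, va);
              }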



              • #8
                If it doesn't use them, that's more or less the same thing as them not being enabled.

                I write and run numerical code. I have Julia libraries for inserting LLVM vector intrinsics and vectorized special functions, and many more libraries that depend on them.

                In my code, using 512-bit vectors is definitely faster than 256-bit, despite downclocking.*

                Much of the time, I also simply rely on the autovectorizer. Thankfully, LLVM (and therefore Julia) does not share that preference with GCC; otherwise I could never rely on the autovectorizer and would have to explicitly vectorize every bit of performance-sensitive code.

                Code must actually be written to be efficiently vectorizable, e.g. by paying attention to memory layout.
                Traditional performance advice can sometimes run contrary to that. The common advice for column-/row-major matrices is to iterate over the first/second dimension fastest, but that will often prevent vectorization. That is, treating the matrix as an array of structs doesn't offer many vectorization opportunities, while a struct of arrays does (sketched below). That this is what SPMD encourages is one of the reasons the ISPC compiler often gets such great performance.

                My point in this last paragraph is that code often isn't written with vectorizability in mind, and therefore does not benefit. But in vectorizable numerical code, AVX-512 is much faster. As an extreme example, just compare BLAS and LAPACK benchmarks.
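
                To make the array-of-structs versus struct-of-arrays contrast concrete, here is a toy C sketch (my own example; the type and function names are made up). With -Ofast/-ffast-math, which allow the floating-point reduction to be reassociated, the SoA version vectorizes cleanly while the AoS version does not.
                Code:
                /* Array of structs: the x values are strided in memory, so a
                   reduction over p[i].x vectorizes poorly. */
                struct particle_aos { double x, y, z; };

                double sum_x_aos(const struct particle_aos *p, int n)
                {
                    double s = 0.0;
                    for (int i = 0; i < n; i++)
                        s += p[i].x;
                    return s;
                }

                /* Struct of arrays: each field is contiguous, so the same
                   reduction maps cleanly onto wide vector registers. */
                struct particles_soa { double *x, *y, *z; };

                double sum_x_soa(const struct particles_soa *p, int n)
                {
                    double s = 0.0;
                    for (int i = 0; i < n; i++)
                        s += p->x[i];
                    return s;
                }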

                *Although I have to confess that it is hard to benchmark the downclocking, so I'm not sure how much I benefit in my real workloads. (I can confirm that it is much faster than Ryzen, at least.)
                In microbenchmarks the difference is often close to 2x, but I only run my microbenchmarks on a single core, which doesn't result in much downclocking.
                Running an AVX-512-heavy process on all cores, however, does downclock (and if I mess with BIOS settings to stop it from downclocking, running on all cores will cause CPU temperatures to rise until the computer crashes).

                EDIT:
                Are there benchmarks across a variety of workloads that GCC based its decision on?
                I'll try the Polyhedron Benchmark Suite and report back with what I find. I have not looked at the code base, though, so I do not know how much of it can be vectorized.

                EDIT:
                I just realized your name is Hubicka. Off topic, but I loved that blog post, and amazing work!
                Last edited by celrod; 04 May 2019, 06:29 AM.



                • #9
                  Originally posted by hubicka View Post
                  It does enable AVX-512, but the CPU tuning for cascadelake says that 256-bit vectors are preferred. This is because of the issue of the CPU lowering its clock speed when executing 512-bit instructions.
                  I maintain that for highly vectorizable code, it does give a large improvement (linear algebra software being the most obvious example). But I will concede that highly vectorizable code probably took enough deliberate effort to write (it rarely happens by accident) that the authors might not be particularly burdened by also specifying their preferred vector width.

                  That said, using the Polyhedron Fortran benchmark suite (Fortran benchmarks are the easiest way to find numerical workloads), I see no significant differences between `-Ofast -march=native` and `-Ofast -march=native -mprefer-vector-width=512` on an i9 9940X CPU.
                  These tests were using Clear Linux with gfortran 9.1.1.

                  Summary without specifying preferred vector width:
                  Code:
                  $ cat preferdefault.sum
                  ================================================================================
                  Date & Time     :  4 May 2019  8:35:28
                  Test Name       : preferdefault
                  Compile Command : gfortran -Ofast -march=native %n.f90 -o %n
                  Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
                  Maximum Times   :    10000.0
                  Target Error %  :      0.100
                  Minimum Repeats :    10
                  Maximum Repeats :   100
                   
                     Benchmark   Compile  Executable   Ave Run  Number   Estim
                          Name    (secs)     (bytes)    (secs) Repeats   Err %
                     ---------   -------  ----------   ------- -------  ------
                            ac      0.37       43176      5.63      12  0.0262
                        aermod     12.35     1089648      5.99      12  0.0622
                           air      2.87      152536      2.02      36  0.0981
                      capacita      1.20      106472     11.75      10  0.0743
                      channel2      0.20       31216     57.80      18  0.0965
                         doduc      2.02      165240      7.26      16  0.0890
                      fatigue2      0.82       77928     48.05      11  0.0979
                      gas_dyn2      1.02       95584     51.62      23  0.0913
                       induct2      1.65      175232     19.46      18  0.0957
                         linpk      0.25       34928      2.18      15  0.0978
                          mdbx      0.80       90856      3.81      10  0.0417
                  mp_prop_desi      0.38       47944     50.23      13  0.0825
                            nf      0.43       47600      4.28      16  0.0911
                       protein      1.03       95112     13.79      10  0.0813
                        rnflow      1.66      115416     14.59      13  0.0906
                     test_fpu2      1.23       93608     19.11      14  0.0990
                         tfft2      0.45       47520     24.67      14  0.0989
                  
                  Geometric Mean Execution Time =      12.15 seconds
                  
                  ================================================================================

                  Preferring 512 bit vectors:
                  Code:
                  $ cat prefer512.sum
                  ================================================================================
                  Date & Time     :  4 May 2019  6:53:28
                  Test Name       : prefer512
                  Compile Command : gfortran -Ofast -march=native -mprefer-vector-width=512 %n.f90 -o %n
                  Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
                  Maximum Times   :    10000.0
                  Target Error %  :      0.100
                  Minimum Repeats :    10
                  Maximum Repeats :   100
                   
                     Benchmark   Compile  Executable   Ave Run  Number   Estim
                          Name    (secs)     (bytes)    (secs) Repeats   Err %
                     ---------   -------  ----------   ------- -------  ------
                            ac      0.37       43176      5.63      10  0.0329
                        aermod     12.36     1089648      6.00      13  0.0959
                           air      2.85      152536      2.03      28  0.0959
                      capacita      1.20      106472     11.74      10  0.0814
                      channel2      0.19       31216     58.04      19  0.0874
                         doduc      2.02      165240      7.25      10  0.0826
                      fatigue2      0.82       77928     48.53      10  0.0676
                      gas_dyn2      1.03       95584     51.91      16  0.0811
                       induct2      1.65      175232     19.39      10  0.0596
                         linpk      0.25       34928      2.18      17  0.0831
                          mdbx      0.80       90856      3.82      10  0.0600
                  mp_prop_desi      0.39       47944     50.11      18  0.0953
                            nf      0.40       47600      4.25      14  0.0919
                       protein      1.04       95112     13.75      14  0.0574
                        rnflow      1.64      115416     14.54      12  0.0781
                     test_fpu2      1.22       93608     19.13      14  0.0860
                         tfft2      0.45       47520     24.55      18  0.0910
                  
                  Geometric Mean Execution Time =      12.16 seconds
                  
                  ================================================================================
                  For all I know, most of these executables may have been exactly the same. Note that the actual error is much higher than the listed figures. This is because the error is computed on an average of 10 runs: when the number of repeats is greater than 10, it drops the (Number Repeats - 10) runs furthest from the mean. It also drops them one by one, so if you had 5 low times and 6 high times, it will drop a low time, pushing the average up and making it more likely to drop low times moving forward.
                  If instead it had 6 low times and 5 high times, it will do the opposite. This makes the average it converges on rather noisy -- much noisier than the stated "Estim Err %".
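
                  To make that bias concrete, here is a minimal C sketch of the dropping procedure as I understand it (my own reconstruction, not the suite's actual code):
                  Code:
                  #include <math.h>

                  /* Repeatedly drop the time furthest from the current mean until
                     only `keep` (10 in the suite) remain, then report the mean of
                     the survivors. Which side gets dropped first depends on where
                     the current mean sits, which is what makes the result noisy. */
                  double trimmed_mean(double *t, int n, int keep)
                  {
                      while (n > keep) {
                          double mean = 0.0;
                          for (int i = 0; i < n; i++) mean += t[i];
                          mean /= n;

                          int worst = 0;                 /* index furthest from mean */
                          for (int i = 1; i < n; i++)
                              if (fabs(t[i] - mean) > fabs(t[worst] - mean))
                                  worst = i;

                          t[worst] = t[--n];             /* drop it and shrink */
                      }
                      double mean = 0.0;
                      for (int i = 0; i < n; i++) mean += t[i];
                      return mean / n;
                  }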

                  They should really take the approach of only dropping high times, or not dropping times at all and loosening the requirement on reported Estim Err before completing.

                  That said, I'd say all those results seem to be identical.



                  • #10
                    Link to the benchmarks.
                    One last comment on the above: I had to set `ulimit -s unlimited`, otherwise a few of those benchmarks would segfault due to wanting to allocate a lot of stack space.

                    For fun, I also tested ifort 19.0.3.199 and Flang built with LLVM 7.

                    For these benchmarks, ifort was the clear winner:
                    Code:
                    $ cat ifort.sum
                    ================================================================================
                    Date & Time     :  4 May 2019 10:05:14
                    Test Name       : ifort
                    Compile Command : ifort -fast -qopt-zmm-usage=high %n.f90 -o %n
                    Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
                    Maximum Times   :    10000.0
                    Target Error %  :      0.100
                    Minimum Repeats :    10
                    Maximum Repeats :   100
                     
                       Benchmark   Compile  Executable   Ave Run  Number   Estim
                            Name    (secs)     (bytes)    (secs) Repeats   Err %
                       ---------   -------  ----------   ------- -------  ------
                              ac      1.41     8129584      4.06      12  0.0891
                          aermod     17.15    10420728      6.47      10  0.0844
                             air      4.03     8406208      1.51      26  0.0962
                        capacita      1.70     8233712     10.07      14  0.0813
                        channel2      0.46     8156672     55.34      16  0.0894
                           doduc      2.38     8329416      7.24      12  0.0888
                        fatigue2      2.08     8444576     47.62      12  0.0260
                        gas_dyn2      1.49     8216560     24.10      42  0.0991
                         induct2      4.86     8760336     21.19      13  0.0910
                           linpk      0.45     8071840      2.15      15  0.0937
                            mdbx      1.66     8186544      2.66      15  0.0841
                    mp_prop_desi      1.06     8529808     41.07      23  0.0936
                              nf      0.86     8232200      4.34      15  0.0855
                         protein      5.06     8411944     15.66      10  0.0887
                          rnflow     23.57     8418128      7.65      10  0.0525
                       test_fpu2      3.02     8396336     16.11      10  0.0650
                           tfft2      0.62     8139256     41.51      18  0.0989
                    
                    Geometric Mean Execution Time =      10.71 seconds
                    
                    ================================================================================

                    Flang, meanwhile, was clearly well behind gfortran. It failed the second test and exited immediately, hence the exceptionally fast time there.

                    Code:
                    $ cat flang.sum
                    ================================================================================
                    Date & Time     :  4 May 2019 13:47:07
                    Test Name       : flang
                    Compile Command : flang -Ofast -march=native %n.f90 -o %n
                    Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
                    Maximum Times   :    10000.0
                    Target Error %  :      0.100
                    Minimum Repeats :    10
                    Maximum Repeats :   100
                     
                       Benchmark   Compile  Executable   Ave Run  Number   Estim
                            Name    (secs)     (bytes)    (secs) Repeats   Err %
                       ---------   -------  ----------   ------- -------  ------
                              ac      0.35       54776      5.90      13  0.0632
                          aermod     29.88     1387176      0.01     100 12.3405
                             air      1.91      132240      2.25      15  0.0812
                        capacita      1.13       92592      9.81      13  0.0814
                        channel2      0.34       44952     62.17      12  0.0870
                           doduc      2.28      156560      7.15      12  0.0966
                        fatigue2      0.77      113128     75.23      17  0.0995
                        gas_dyn2      0.80      100312     40.00      17  0.0784
                         induct2      1.74      258128     49.57      12  0.0635
                           linpk      0.33       42344      3.29      18  0.0914
                            mdbx      0.95      111240      4.42      10  0.0399
                    mp_prop_desi      0.30       49928     87.90      13  0.0580
                              nf      1.02       71872      6.65      13  0.0764
                         protein      2.06      154880     14.70      10  0.0289
                          rnflow      3.19      184816     13.41      10  0.0413
                       test_fpu2      3.59      154432     23.60      13  0.0870
                           tfft2      0.23       34896     40.01      17  0.0974
                    
                    Geometric Mean Execution Time =      10.18 seconds
                    
                    ================================================================================

