Announcement

**hubicka** · 05 May 2019, 04:09 AM

Originally posted by celrod

Link to the benchmarks.
One last comment on the above: I had to set `ulimit -s unlimited`, otherwise a few of those benchmarks would segfault due to wanting to allocate a lot of stack space.

For fun, I also tested ifort 19.0.3.199 and Flang built with LLVM 7.

For these benchmarks, ifort was the clear winner:

Code:

$ cat ifort.sum
================================================================================
Date & Time     :  4 May 2019 10:05:14
Test Name       : ifort
Compile Command : ifort -fast -qopt-zmm-usage=high %n.f90 -o %n
Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
Maximum Times   :    10000.0
Target Error %  :      0.100
Minimum Repeats :    10
Maximum Repeats :   100

   Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac      1.41     8129584      4.06      12  0.0891
      aermod     17.15    10420728      6.47      10  0.0844
         air      4.03     8406208      1.51      26  0.0962
    capacita      1.70     8233712     10.07      14  0.0813
    channel2      0.46     8156672     55.34      16  0.0894
       doduc      2.38     8329416      7.24      12  0.0888
    fatigue2      2.08     8444576     47.62      12  0.0260
    gas_dyn2      1.49     8216560     24.10      42  0.0991
     induct2      4.86     8760336     21.19      13  0.0910
       linpk      0.45     8071840      2.15      15  0.0937
        mdbx      1.66     8186544      2.66      15  0.0841
mp_prop_desi      1.06     8529808     41.07      23  0.0936
          nf      0.86     8232200      4.34      15  0.0855
     protein      5.06     8411944     15.66      10  0.0887
      rnflow     23.57     8418128      7.65      10  0.0525
   test_fpu2      3.02     8396336     16.11      10  0.0650
       tfft2      0.62     8139256     41.51      18  0.0989

Geometric Mean Execution Time =      10.71 seconds

================================================================================

Flang, by contrast, was clearly well behind gfortran. It also failed the second test (aermod) and exited immediately, hence the exceptionally fast time there.

Code:

$ cat flang.sum
================================================================================
Date & Time     :  4 May 2019 13:47:07
Test Name       : flang
Compile Command : flang -Ofast -march=native %n.f90 -o %n
Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
Maximum Times   :    10000.0
Target Error %  :      0.100
Minimum Repeats :    10
Maximum Repeats :   100

   Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac      0.35       54776      5.90      13  0.0632
      aermod     29.88     1387176      0.01     100 12.3405
         air      1.91      132240      2.25      15  0.0812
    capacita      1.13       92592      9.81      13  0.0814
    channel2      0.34       44952     62.17      12  0.0870
       doduc      2.28      156560      7.15      12  0.0966
    fatigue2      0.77      113128     75.23      17  0.0995
    gas_dyn2      0.80      100312     40.00      17  0.0784
     induct2      1.74      258128     49.57      12  0.0635
       linpk      0.33       42344      3.29      18  0.0914
        mdbx      0.95      111240      4.42      10  0.0399
mp_prop_desi      0.30       49928     87.90      13  0.0580
          nf      1.02       71872      6.65      13  0.0764
     protein      2.06      154880     14.70      10  0.0289
      rnflow      3.19      184816     13.41      10  0.0413
   test_fpu2      3.59      154432     23.60      13  0.0870
       tfft2      0.23       34896     40.01      17  0.0974

Geometric Mean Execution Time =      10.18 seconds

================================================================================
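The 10.18 s geometric mean in the Flang summary includes the failed aermod run at 0.01 s, which pulls the figure down. A rough sketch that recomputes the mean with and without that entry, using the "Ave Run" column of the Flang table above. Python is used purely for illustration, and the names `flang_secs` and `geometric_mean` are hypothetical, not part of the benchmark harness.

```python
import math

# "Ave Run (secs)" column copied from the flang.sum table above.
flang_secs = {
    "ac": 5.90, "aermod": 0.01, "air": 2.25, "capacita": 9.81,
    "channel2": 62.17, "doduc": 7.15, "fatigue2": 75.23, "gas_dyn2": 40.00,
    "induct2": 49.57, "linpk": 3.29, "mdbx": 4.42, "mp_prop_design": 87.90,
    "nf": 6.65, "protein": 14.70, "rnflow": 13.41, "test_fpu2": 23.60,
    "tfft2": 40.01,
}


def geometric_mean(values):
    """Geometric mean computed as exp(mean(log(x)))."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Mean over all 17 entries, as the harness reports it.
gm_all = geometric_mean(flang_secs.values())

# Mean over the 16 benchmarks that actually ran to completion.
gm_valid = geometric_mean(t for name, t in flang_secs.items() if name != "aermod")

print(f"all 17 benchmarks:     {gm_all:.2f} s")
print(f"without failed aermod: {gm_valid:.2f} s")
```

With the failed run excluded, Flang's mean covers one benchmark fewer than the other compilers' means, so it is still not directly comparable to them.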

It would be useful to report this to GCC bugzilla (along with the ICC numbers). The vectorizer's cost metrics are far from ideal, and one of the things we plan to work on is making it better at choosing the proper vector width (and of course improving codegen).

**celrod** · 05 May 2019, 05:43 AM

Originally posted by hubicka

It would be useful to report this to GCC bugzilla (along with the ICC numbers). The vectorizer's cost metrics are far from ideal, and one of the things we plan to work on is making it better at choosing the proper vector width (and of course improving codegen).

I am excited about that.

But I have to issue a correction on my earlier tests. I was using Clear Linux, and it seems Clear Linux's gcc was patched to behave differently from Godbolt's.

Code:

$ g++ -S -Ofast -march=skylake-avx512 UpdateGroupProbs.cpp -o UpdateGroupProbs.s
$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-generic-linux/9/lto-wrapper
Target: x86_64-generic-linux
Configured with: ../gcc-9.1.0/configure --prefix=/usr --with-pkgversion='Clear Linux OS for Intel Architecture' --libdir=/usr/lib64 --enable-libstdcxx-pch --libexecdir=/usr/lib64 --with-system-zlib --enable-shared --enable-gnu-indirect-function --disable-vtable-verify --enable-threads=posix --enable-__cxa_atexit --enable-plugin --enable-ld=default --enable-clocale=gnu --disable-multiarch --enable-multilib --enable-lto --disable-werror --enable-linker-build-id --build=x86_64-generic-linux --target=x86_64-generic-linux --enable-languages=c,c++,fortran,go --enable-bootstrap --with-ppl=yes --with-isl --includedir=/usr/include --exec-prefix=/usr --with-glibc-version=2.19 --disable-libunwind-exceptions --with-gnu-ld --with-tune=haswell --with-arch=westmere --enable-cet --disable-libmpx --with-gcc-major-version-only --enable-default-pie
Thread model: posix
gcc version 9.1.1 20190503 gcc-9-branch@270849 (Clear Linux OS for Intel Architecture)

With Clear Linux's gcc, I actually have to specify -mprefer-vector-width=256 to get the default behavior of Godbolt's compiler. That may explain why I did not see any difference in those Fortran benchmarks. I'll rerun them and report back.

The function in the Godbolt link was the slowest part of a Gibbs sampler I wrote for my dissertation. It was called billions of times (4000 samples/fit * 36 fits / dataset * >17000 datasets), with N ranging from the hundreds to the thousands, and "K" (outer loop iterations, hard-coded to 6 in the above example) ranged from 2 to 16.

Benchmarking it from Julia with `-mprefer-vector-width=256` vs `-mprefer-vector-width=512`, with N=2827 (I could provide all the code to reproduce this example):

Code:

julia> @benchmark update_group_probs_cxx256!($indiv_probs2, $base_p, $revcholwisharts, $X, $Nv)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.150 μs (0.00% GC)
  median time:      21.249 μs (0.00% GC)
  mean time:        21.476 μs (0.00% GC)
  maximum time:     43.230 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark update_group_probs_cxx512!($indiv_probs2, $base_p, $revcholwisharts, $X, $Nv)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.216 μs (0.00% GC)
  median time:      14.313 μs (0.00% GC)
  mean time:        14.400 μs (0.00% GC)
  maximum time:     43.261 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

It was about 50% faster (14.3 μs vs. 21.2 μs median) with the full-width 512-bit vectors (the second run) than with the half-width 256-bit vectors (the first). At roughly 7 μs saved per call, a billion calls adds up to a difference of a couple of hours.
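The arithmetic behind that estimate can be checked with a few lines of Python (used here only for illustration). The sketch takes the two median times from the BenchmarkTools output above and assumes every call costs the median measured at N=2827, which is a simplification since N and K varied across datasets. All variable and function names are hypothetical.

```python
# Median times from the two BenchmarkTools runs above, in microseconds.
median_256_us = 21.249
median_512_us = 14.313

# How many times faster the 512-bit build is, and the gap per call.
speedup = median_256_us / median_512_us
saved_per_call_us = median_256_us - median_512_us


def hours_saved(calls):
    """Wall-clock hours saved over `calls` invocations at the median gap."""
    return calls * saved_per_call_us * 1e-6 / 3600.0


# Lower bound on the call count quoted in the post:
# 4000 samples/fit * 36 fits/dataset * 17000 datasets.
dissertation_calls = 4000 * 36 * 17000

print(f"speedup:               {speedup:.2f}x")
print(f"saved per 1e9 calls:   {hours_saved(1e9):.2f} h")
print(f"saved over {dissertation_calls:,} calls: {hours_saved(dissertation_calls):.2f} h")
```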

After the Fortran benchmark runs, I'll file an issue on gcc's bugzilla.

**celrod** · 06 May 2019, 01:05 PM

`-mprefer-vector-width=256` seems to slightly outperform `-mprefer-vector-width=512` on the Polyhedron benchmarks (geometric mean of 12.08 s vs 12.24 s).

I reran the benchmarks, now with exactly 10 runs each time. I believe the first is discarded, so the averages are of 9 runs. I believe these should be more accurate than the earlier results, besides taking less time to run.

gfortran with 512-bit vectors:

Code:

$ cat gfortran512.sum
================================================================================
Date & Time     :  5 May 2019 13:04:09
Test Name       : prefer512
Compile Command : gfortran -Ofast -march=native -mprefer-vector-width=512 %n.f90 -o %n
Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
Maximum Times   :    10000.0
Target Error %  :      0.100
Minimum Repeats :    10
Maximum Repeats :    10
 
   Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac      0.37       43176      5.66      10  0.0342
      aermod     12.73     1089648      6.01      10  0.1469
         air      2.87      152536      2.11      10  0.7105
    capacita      1.21      106472     11.80      10  0.0697
    channel2      0.20       31216     57.66      10  0.0790
       doduc      2.03      165240      7.31      10  0.1554
    fatigue2      0.84       77928     46.47      10  0.2654
    gas_dyn2      1.03       95584     52.11      10  0.1648
     induct2      1.67      175232     19.54      10  0.1429
       linpk      0.25       34928      2.18      10  0.1739
        mdbx      0.82       90856      3.83      10  0.0872
mp_prop_desi      0.39       47944     53.32      10  1.3054
          nf      0.40       47600      4.32      10  0.2134
     protein      1.04       95112     13.84      10  0.1219
      rnflow      1.67      115416     14.64      10  0.1113
   test_fpu2      1.24       93608     19.18      10  0.1698
       tfft2      0.47       47520     24.78      10  0.1479

Geometric Mean Execution Time =      12.24 seconds

gfortran with 256-bit vectors:

Code:

$ cat gfortran256.sum
================================================================================
Date & Time     :  5 May 2019 14:02:05
Test Name       : prefer256
Compile Command : gfortran -Ofast -march=native -mprefer-vector-width=256 %n.f90 -o %n
Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
Maximum Times   :    10000.0
Target Error %  :      0.100
Minimum Repeats :    10
Maximum Repeats :    10
 
   Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac      0.37       43176      4.92      10  0.0495
      aermod     12.19     1069064      5.50      10  0.0918
         air      2.02      119768      1.96      10  0.3457
    capacita      1.02       85992     11.31      10  0.1991
    channel2      0.19       31216     60.59      10  0.1206
       doduc      2.00      157048      7.13      10  0.1736
    fatigue2      0.78       73832     46.40      10  0.8044
    gas_dyn2      0.90       87392     52.02      10  0.1442
     induct2      1.70      175232     19.79      10  0.1299
       linpk      0.23       30832      2.40      10  0.1288
        mdbx      0.75       82664      3.81      10  0.1003
mp_prop_desi      0.35       43848     53.21      10  1.3351
          nf      0.34       43504      4.30      10  0.1648
     protein      0.92       86920     13.85      10  0.1292
      rnflow      1.51       99032     13.49      10  0.1446
   test_fpu2      1.07       77224     20.80      10  0.0451
       tfft2      0.37       39328     24.69      10  0.2454

Geometric Mean Execution Time =      12.08 seconds

================================================================================
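Since the two geometric means are close, a per-benchmark view is more informative. The sketch below divides the 512-bit time by the 256-bit time for each benchmark, using the "Ave Run" columns of the two gfortran tables above; a ratio above 1 means the 512-bit build was slower. Python is only an illustration language here, and the variable names are hypothetical.

```python
import math

# "Ave Run (secs)" columns from the gfortran512.sum and gfortran256.sum tables above.
secs_512 = {
    "ac": 5.66, "aermod": 6.01, "air": 2.11, "capacita": 11.80,
    "channel2": 57.66, "doduc": 7.31, "fatigue2": 46.47, "gas_dyn2": 52.11,
    "induct2": 19.54, "linpk": 2.18, "mdbx": 3.83, "mp_prop_design": 53.32,
    "nf": 4.32, "protein": 13.84, "rnflow": 14.64, "test_fpu2": 19.18,
    "tfft2": 24.78,
}
secs_256 = {
    "ac": 4.92, "aermod": 5.50, "air": 1.96, "capacita": 11.31,
    "channel2": 60.59, "doduc": 7.13, "fatigue2": 46.40, "gas_dyn2": 52.02,
    "induct2": 19.79, "linpk": 2.40, "mdbx": 3.81, "mp_prop_design": 53.21,
    "nf": 4.30, "protein": 13.85, "rnflow": 13.49, "test_fpu2": 20.80,
    "tfft2": 24.69,
}

# Ratio > 1: the 512-bit build took longer than the 256-bit build.
ratio = {name: secs_512[name] / secs_256[name] for name in secs_512}

for name, r in sorted(ratio.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>14}  {r:6.3f}")

# Geometric mean of the per-benchmark ratios.
geo_ratio = math.exp(sum(math.log(r) for r in ratio.values()) / len(ratio))
slowest_with_512 = max(ratio, key=ratio.get)
fastest_with_512 = min(ratio, key=ratio.get)
print(f"geometric mean ratio: {geo_ratio:.3f}")
```

Given that the estimated errors in the tables reach about 1.3% for some benchmarks, ratios within a percent or so of 1 are probably noise.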

Similarly, ifort with `-qopt-zmm-usage=high`:

Code:

$ cat ifort.sum
================================================================================
Date & Time     :  5 May 2019 11:10:44
Test Name       : ifort
Compile Command : ifort -fast -qopt-zmm-usage=high %n.f90 -o %n
Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
Maximum Times   :    10000.0
Target Error %  :      0.100
Minimum Repeats :    10
Maximum Repeats :    10
 
   Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac      0.72     8129584      4.08      10  0.0418
      aermod     17.07    10420728      6.51      10  0.1651
         air      4.02     8406208      1.52      10  0.1871
    capacita      1.70     8233712     10.19      10  0.3839
    channel2      0.46     8156672     55.36      10  0.0736
       doduc      2.39     8329416      7.32      10  0.2329
    fatigue2      2.08     8444576     47.82      10  0.0126
    gas_dyn2      1.49     8216560     24.54      10  0.2795
     induct2      4.84     8760336     21.59      10  0.2213
       linpk      0.46     8071840      2.15      10  0.2217
        mdbx      1.67     8186544      2.66      10  0.0806
mp_prop_desi      1.06     8529808     44.78      10  1.7498
          nf      0.86     8232200      4.37      10  0.1005
     protein      5.07     8411944     15.78      10  0.1087
      rnflow     24.07     8418128      7.69      10  0.0717
   test_fpu2      3.00     8396336     16.20      10  0.1125
       tfft2      0.61     8139256     41.81      10  0.1581

Geometric Mean Execution Time =      10.83 seconds

================================================================================

versus low:

Code:

$ cat ifortzmmlow.sum
================================================================================
Date & Time     :  5 May 2019 12:11:32
Test Name       : ifortzmmlow
Compile Command : ifort -fast -qopt-zmm-usage=low %n.f90 -o %n
Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
Maximum Times   :    10000.0
Target Error %  :      0.100
Minimum Repeats :    10
Maximum Repeats :    10
 
   Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac      0.85     8125488      3.37      10  0.1828
      aermod     15.86    10319112      5.93      10  0.1404
         air      3.97     8416280      1.44      10  0.3119
    capacita      1.56     8226744      9.49      10  0.1397
    channel2      0.44     8160784     57.27      10  0.0895
       doduc      2.20     8316928      6.37      10  0.1760
    fatigue2      2.06     8440376     40.10      10  0.0994
    gas_dyn2      1.47     8204072     24.29      10  0.1562
     induct2      4.81     8760336     24.66      10  0.8962
       linpk      0.42     8067544      2.28      10  0.4775
        mdbx      1.17     8153576      3.49      10  0.0862
mp_prop_desi      1.15     8549000     44.21      10  1.5834
          nf      0.87     8227904      4.22      10  0.0494
     protein      5.15     8412664     14.37      10  0.0801
      rnflow     23.86     8401544      7.49      10  0.3231
   test_fpu2      3.01     8383952     17.74      10  0.1773
       tfft2      0.60     8135840     41.03      10  0.2156

Geometric Mean Execution Time =      10.62 seconds

================================================================================

I would have to investigate further to figure out to what extent the time differences are caused by lower clock speeds from thermal throttling versus missed vectorization of loops with dynamic trip counts. E.g., if a loop runs only 6 iterations and that 6 is unknown at compile time, `-mprefer-vector-width=512` would mean the vector loop body is never entered (assuming 8-byte elements, so 8 lanes per vector), while with 256-bit vectors the first four iterations would be covered by one vector iteration.

I'd expect llvm to be much more vulnerable to that, since it seems to unroll much more than gcc (based on looking at asm of functions like dot products).
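The trip-count argument can be made concrete with a small model. The sketch below is a deliberate simplification: it assumes a plain vector main loop followed by a scalar remainder, 8-byte elements, and no masked or narrower-vector epilogue, all of which real compilers may do differently. The function name and the unroll factor of 4 are hypothetical, chosen only to illustrate the effect of heavier unrolling.

```python
def split_trip_count(n, vector_bits, elem_bits=64, unroll=1):
    """Return (main_loop_iterations, leftover_elements).

    Models a vector main loop plus a scalar remainder, with no masked
    or narrower-vector epilogue. elem_bits=64 assumes double precision.
    """
    # Elements consumed by one iteration of the (possibly unrolled) main loop.
    step = (vector_bits // elem_bits) * unroll
    return n // step, n % step


# The six-iteration loop from the post, under three hypothetical configurations.
for bits, unroll in [(512, 1), (256, 1), (256, 4)]:
    main, rest = split_trip_count(6, bits, unroll=unroll)
    print(f"{bits}-bit, unroll x{unroll}: "
          f"{main} main-loop iteration(s), {rest} leftover element(s)")
```

Under this model the 512-bit build leaves all six elements to the remainder, the 256-bit build handles four of them in one vector iteration, and a 256-bit loop unrolled four times is back to handling none in the main loop.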

Flang performed exactly the same with `-mprefer-vector-width=256` as with `-mprefer-vector-width=512`. A brief test suggests it is ignoring the option (the same asm is produced).

Intel Cascade Lake Xeon Benchmarks With GCC 8 vs. GCC 9 Compilers
