Intel Cascade Lake Xeon Benchmarks With GCC 8 vs. GCC 9 Compilers

  • #11
    Originally posted by celrod
    Link to the benchmarks.
    One last comment on the above: I had to set `ulimit -s unlimited`, otherwise a few of those benchmarks would segfault due to wanting to allocate a lot of stack space.

    For fun, I also tested ifort 19.0.3.199 and Flang built with LLVM 7.

    For these benchmarks, ifort was the clear winner:
    Code:
    $ cat ifort.sum
    ================================================================================
    Date & Time     :  4 May 2019 10:05:14
    Test Name       : ifort
    Compile Command : ifort -fast -qopt-zmm-usage=high %n.f90 -o %n
    Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
    Maximum Times   :    10000.0
    Target Error %  :      0.100
    Minimum Repeats :    10
    Maximum Repeats :   100
    
       Benchmark   Compile  Executable   Ave Run  Number   Estim
            Name    (secs)     (bytes)    (secs) Repeats   Err %
       ---------   -------  ----------   ------- -------  ------
              ac      1.41     8129584      4.06      12  0.0891
          aermod     17.15    10420728      6.47      10  0.0844
             air      4.03     8406208      1.51      26  0.0962
        capacita      1.70     8233712     10.07      14  0.0813
        channel2      0.46     8156672     55.34      16  0.0894
           doduc      2.38     8329416      7.24      12  0.0888
        fatigue2      2.08     8444576     47.62      12  0.0260
        gas_dyn2      1.49     8216560     24.10      42  0.0991
         induct2      4.86     8760336     21.19      13  0.0910
           linpk      0.45     8071840      2.15      15  0.0937
            mdbx      1.66     8186544      2.66      15  0.0841
    mp_prop_desi      1.06     8529808     41.07      23  0.0936
              nf      0.86     8232200      4.34      15  0.0855
         protein      5.06     8411944     15.66      10  0.0887
          rnflow     23.57     8418128      7.65      10  0.0525
       test_fpu2      3.02     8396336     16.11      10  0.0650
           tfft2      0.62     8139256     41.51      18  0.0989
    
    Geometric Mean Execution Time =      10.71 seconds
    
    ================================================================================

    Flang, on the other hand, was clearly well behind gfortran. It failed the second test (aermod) and exited immediately, hence the exceptionally fast time there, which also drags down the reported geometric mean.

    Code:
    $ cat flang.sum
    ================================================================================
    Date & Time     :  4 May 2019 13:47:07
    Test Name       : flang
    Compile Command : flang -Ofast -march=native %n.f90 -o %n
    Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
    Maximum Times   :    10000.0
    Target Error %  :      0.100
    Minimum Repeats :    10
    Maximum Repeats :   100
    
       Benchmark   Compile  Executable   Ave Run  Number   Estim
            Name    (secs)     (bytes)    (secs) Repeats   Err %
       ---------   -------  ----------   ------- -------  ------
              ac      0.35       54776      5.90      13  0.0632
          aermod     29.88     1387176      0.01     100 12.3405
             air      1.91      132240      2.25      15  0.0812
        capacita      1.13       92592      9.81      13  0.0814
        channel2      0.34       44952     62.17      12  0.0870
           doduc      2.28      156560      7.15      12  0.0966
        fatigue2      0.77      113128     75.23      17  0.0995
        gas_dyn2      0.80      100312     40.00      17  0.0784
         induct2      1.74      258128     49.57      12  0.0635
           linpk      0.33       42344      3.29      18  0.0914
            mdbx      0.95      111240      4.42      10  0.0399
    mp_prop_desi      0.30       49928     87.90      13  0.0580
              nf      1.02       71872      6.65      13  0.0764
         protein      2.06      154880     14.70      10  0.0289
          rnflow      3.19      184816     13.41      10  0.0413
       test_fpu2      3.59      154432     23.60      13  0.0870
           tfft2      0.23       34896     40.01      17  0.0974
    
    Geometric Mean Execution Time =      10.18 seconds
    
    ================================================================================
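    A side note on the Flang summary: since the failed aermod run is recorded as 0.01 s, it pulls the reported geometric mean down. The small Python sketch below uses the run times copied from the flang.sum table and compares the mean with and without that entry. The function and variable names are illustrative only and are not part of the benchmark harness.

```python
import math

# Average run times (seconds) copied from the flang.sum table above.
flang_times = {
    "ac": 5.90, "aermod": 0.01, "air": 2.25, "capacita": 9.81,
    "channel2": 62.17, "doduc": 7.15, "fatigue2": 75.23, "gas_dyn2": 40.00,
    "induct2": 49.57, "linpk": 3.29, "mdbx": 4.42, "mp_prop_design": 87.90,
    "nf": 6.65, "protein": 14.70, "rnflow": 13.41, "test_fpu2": 23.60,
    "tfft2": 40.01,
}


def geometric_mean(values):
    """Geometric mean, computed as exp of the mean of the logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Mean over all 17 entries, including the failed aermod run.
gm_all = geometric_mean(flang_times.values())

# Mean over the 16 benchmarks that actually ran to completion.
gm_valid = geometric_mean(
    t for name, t in flang_times.items() if name != "aermod"
)

print(f"all 17 benchmarks:   {gm_all:.2f} s")
print(f"without failed test: {gm_valid:.2f} s")
```

    With these inputs the first figure lands at about 10.2 s, in line with the reported 10.18 s, while the second comes out near 15.7 s. The latter is probably the fairer number to hold against the other compilers, although their means would then need aermod excluded as well.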
    It would be useful to report this to the GCC Bugzilla (along with the ICC numbers). The vectorizer's cost metrics are far from ideal, and one of the things we plan to work on is making it better at choosing the proper vector width (and, of course, improving codegen).



    • #12
      Originally posted by hubicka

      It would be useful to report this to the GCC Bugzilla (along with the ICC numbers). The vectorizer's cost metrics are far from ideal, and one of the things we plan to work on is making it better at choosing the proper vector width (and, of course, improving codegen).
      I am excited about that.

      But I have to issue a correction on my earlier tests. I was using Clear Linux, and it seems Clear Linux's gcc was patched to behave differently from Godbolt's.
      Code:
      $ g++ -S -Ofast -march=skylake-avx512 UpdateGroupProbs.cpp -o UpdateGroupProbs.s
      $ g++ -v
      Using built-in specs.
      COLLECT_GCC=g++
      COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-generic-linux/9/lto-wrapper
      Target: x86_64-generic-linux
      Configured with: ../gcc-9.1.0/configure --prefix=/usr --with-pkgversion='Clear Linux OS for Intel Architecture' --libdir=/usr/lib64 --enable-libstdcxx-pch --libexecdir=/usr/lib64 --with-system-zlib --enable-shared --enable-gnu-indirect-function --disable-vtable-verify --enable-threads=posix --enable-__cxa_atexit --enable-plugin --enable-ld=default --enable-clocale=gnu --disable-multiarch --enable-multilib --enable-lto --disable-werror --enable-linker-build-id --build=x86_64-generic-linux --target=x86_64-generic-linux --enable-languages=c,c++,fortran,go --enable-bootstrap --with-ppl=yes --with-isl --includedir=/usr/include --exec-prefix=/usr --with-glibc-version=2.19 --disable-libunwind-exceptions --with-gnu-ld --with-tune=haswell --with-arch=westmere --enable-cet --disable-libmpx --with-gcc-major-version-only --enable-default-pie
      Thread model: posix
      gcc version 9.1.1 20190503 gcc-9-branch@270849 (Clear Linux OS for Intel Architecture)
      With Clear Linux's gcc, I actually have to specify -mprefer-vector-width=256 to get the default behavior of Godbolt's compiler. That may explain why I did not see any difference in those Fortran benchmarks. I'll rerun them and report back.

      The function in the Godbolt link was the slowest part of a Gibbs sampler I wrote for my dissertation. It was called billions of times (4000 samples/fit * 36 fits / dataset * >17000 datasets), with N ranging from the hundreds to the thousands, and "K" (outer loop iterations, hard-coded to 6 in the above example) ranged from 2 to 16.

      Benchmarking it from Julia with -mprefer-vector-width=256 vs. 512, with N=2827 (I could provide all the code to reproduce this example):
      Code:
      julia> @benchmark update_group_probs_cxx256!($indiv_probs2, $base_p, $revcholwisharts, $X, $Nv)
      BenchmarkTools.Trial:
        memory estimate:  0 bytes
        allocs estimate:  0
        --------------
        minimum time:     21.150 μs (0.00% GC)
        median time:      21.249 μs (0.00% GC)
        mean time:        21.476 μs (0.00% GC)
        maximum time:     43.230 μs (0.00% GC)
        --------------
        samples:          10000
        evals/sample:     1
      
      julia> @benchmark update_group_probs_cxx512!($indiv_probs2, $base_p, $revcholwisharts, $X, $Nv)
      BenchmarkTools.Trial:
        memory estimate:  0 bytes
        allocs estimate:  0
        --------------
        minimum time:     14.216 μs (0.00% GC)
        median time:      14.313 μs (0.00% GC)
        mean time:        14.400 μs (0.00% GC)
        maximum time:     43.261 μs (0.00% GC)
        --------------
        samples:          10000
        evals/sample:     1
      The full-width (512-bit) version, benchmarked second, was about 50% faster than the half-width (256-bit) version benchmarked first: a median of roughly 14.3 μs versus 21.2 μs. Over the course of a billion calls, that can make a difference of a couple of hours.
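      The arithmetic behind that estimate can be sketched in a few lines of Python. It uses the two median times from the benchmark output and the call count quoted earlier in this post. The helper name `hours_saved` is made up for this sketch, and applying the N=2827 timings to every call is a simplifying assumption, since N varied across the real runs.

```python
# Median times (microseconds) taken from the two BenchmarkTools outputs above.
median_256_us = 21.249
median_512_us = 14.313

# Lower bound on the number of calls from the figures quoted in the post:
# 4000 samples/fit * 36 fits/dataset * 17000 datasets.
calls_lower_bound = 4000 * 36 * 17000


def hours_saved(calls, slow_us, fast_us):
    """Wall-clock hours saved when each of `calls` invocations gets faster."""
    seconds = calls * (slow_us - fast_us) * 1e-6
    return seconds / 3600


speedup = median_256_us / median_512_us
per_billion = hours_saved(1_000_000_000, median_256_us, median_512_us)
per_workload = hours_saved(calls_lower_bound, median_256_us, median_512_us)

print(f"speedup of 512-bit over 256-bit: {speedup:.2f}x")
print(f"saved over 1e9 calls: {per_billion:.1f} h")
print(f"saved over {calls_lower_bound:,} calls: {per_workload:.1f} h")
```

      Under those assumptions the gap is a bit under two hours per billion calls, and closer to five hours for the roughly 2.4 billion calls implied by the counts above.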

      After the Fortran benchmark runs, I'll file an issue on gcc's bugzilla.



      • #13
        -mprefer-vector-width=256 seems to slightly outperform -mprefer-vector-width=512 on the Polyhedron benchmarks (geometric mean of 12.08 vs. 12.24 seconds).

        I reran the benchmarks, now with exactly 10 runs each time. I believe the first is discarded, so the averages are of 9 runs. I believe these should be more accurate than the earlier results, besides taking less time to run.

        gfortran with 512-bit vectors:
        Code:
        $ cat gfortran512.sum
        ================================================================================
        Date & Time     :  5 May 2019 13:04:09
        Test Name       : prefer512
        Compile Command : gfortran -Ofast -march=native -mprefer-vector-width=512 %n.f90 -o %n
        Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
        Maximum Times   :    10000.0
        Target Error %  :      0.100
        Minimum Repeats :    10
        Maximum Repeats :    10
         
           Benchmark   Compile  Executable   Ave Run  Number   Estim
                Name    (secs)     (bytes)    (secs) Repeats   Err %
           ---------   -------  ----------   ------- -------  ------
                  ac      0.37       43176      5.66      10  0.0342
              aermod     12.73     1089648      6.01      10  0.1469
                 air      2.87      152536      2.11      10  0.7105
            capacita      1.21      106472     11.80      10  0.0697
            channel2      0.20       31216     57.66      10  0.0790
               doduc      2.03      165240      7.31      10  0.1554
            fatigue2      0.84       77928     46.47      10  0.2654
            gas_dyn2      1.03       95584     52.11      10  0.1648
             induct2      1.67      175232     19.54      10  0.1429
               linpk      0.25       34928      2.18      10  0.1739
                mdbx      0.82       90856      3.83      10  0.0872
        mp_prop_desi      0.39       47944     53.32      10  1.3054
                  nf      0.40       47600      4.32      10  0.2134
             protein      1.04       95112     13.84      10  0.1219
              rnflow      1.67      115416     14.64      10  0.1113
           test_fpu2      1.24       93608     19.18      10  0.1698
               tfft2      0.47       47520     24.78      10  0.1479
        
        Geometric Mean Execution Time =      12.24 seconds
        gfortran with 256-bit vectors:
        Code:
        $ cat gfortran256.sum
        ================================================================================
        Date & Time     :  5 May 2019 14:02:05
        Test Name       : prefer256
        Compile Command : gfortran -Ofast -march=native -mprefer-vector-width=256 %n.f90 -o %n
        Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
        Maximum Times   :    10000.0
        Target Error %  :      0.100
        Minimum Repeats :    10
        Maximum Repeats :    10
         
           Benchmark   Compile  Executable   Ave Run  Number   Estim
                Name    (secs)     (bytes)    (secs) Repeats   Err %
           ---------   -------  ----------   ------- -------  ------
                  ac      0.37       43176      4.92      10  0.0495
              aermod     12.19     1069064      5.50      10  0.0918
                 air      2.02      119768      1.96      10  0.3457
            capacita      1.02       85992     11.31      10  0.1991
            channel2      0.19       31216     60.59      10  0.1206
               doduc      2.00      157048      7.13      10  0.1736
            fatigue2      0.78       73832     46.40      10  0.8044
            gas_dyn2      0.90       87392     52.02      10  0.1442
             induct2      1.70      175232     19.79      10  0.1299
               linpk      0.23       30832      2.40      10  0.1288
                mdbx      0.75       82664      3.81      10  0.1003
        mp_prop_desi      0.35       43848     53.21      10  1.3351
                  nf      0.34       43504      4.30      10  0.1648
             protein      0.92       86920     13.85      10  0.1292
              rnflow      1.51       99032     13.49      10  0.1446
           test_fpu2      1.07       77224     20.80      10  0.0451
               tfft2      0.37       39328     24.69      10  0.2454
        
        Geometric Mean Execution Time =      12.08 seconds
        
        ================================================================================
        Similarly, ifort with `-qopt-zmm-usage=high`:
        Code:
        $ cat ifort.sum
        ================================================================================
        Date & Time     :  5 May 2019 11:10:44
        Test Name       : ifort
        Compile Command : ifort -fast -qopt-zmm-usage=high %n.f90 -o %n
        Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
        Maximum Times   :    10000.0
        Target Error %  :      0.100
        Minimum Repeats :    10
        Maximum Repeats :    10
         
           Benchmark   Compile  Executable   Ave Run  Number   Estim
                Name    (secs)     (bytes)    (secs) Repeats   Err %
           ---------   -------  ----------   ------- -------  ------
                  ac      0.72     8129584      4.08      10  0.0418
              aermod     17.07    10420728      6.51      10  0.1651
                 air      4.02     8406208      1.52      10  0.1871
            capacita      1.70     8233712     10.19      10  0.3839
            channel2      0.46     8156672     55.36      10  0.0736
               doduc      2.39     8329416      7.32      10  0.2329
            fatigue2      2.08     8444576     47.82      10  0.0126
            gas_dyn2      1.49     8216560     24.54      10  0.2795
             induct2      4.84     8760336     21.59      10  0.2213
               linpk      0.46     8071840      2.15      10  0.2217
                mdbx      1.67     8186544      2.66      10  0.0806
        mp_prop_desi      1.06     8529808     44.78      10  1.7498
                  nf      0.86     8232200      4.37      10  0.1005
             protein      5.07     8411944     15.78      10  0.1087
              rnflow     24.07     8418128      7.69      10  0.0717
           test_fpu2      3.00     8396336     16.20      10  0.1125
               tfft2      0.61     8139256     41.81      10  0.1581
        
        Geometric Mean Execution Time =      10.83 seconds
        
        ================================================================================
        versus low:
        Code:
        $ cat ifortzmmlow.sum
        ================================================================================
        Date & Time     :  5 May 2019 12:11:32
        Test Name       : ifortzmmlow
        Compile Command : ifort -fast -qopt-zmm-usage=low %n.f90 -o %n
        Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
        Maximum Times   :    10000.0
        Target Error %  :      0.100
        Minimum Repeats :    10
        Maximum Repeats :    10
         
           Benchmark   Compile  Executable   Ave Run  Number   Estim
                Name    (secs)     (bytes)    (secs) Repeats   Err %
           ---------   -------  ----------   ------- -------  ------
                  ac      0.85     8125488      3.37      10  0.1828
              aermod     15.86    10319112      5.93      10  0.1404
                 air      3.97     8416280      1.44      10  0.3119
            capacita      1.56     8226744      9.49      10  0.1397
            channel2      0.44     8160784     57.27      10  0.0895
               doduc      2.20     8316928      6.37      10  0.1760
            fatigue2      2.06     8440376     40.10      10  0.0994
            gas_dyn2      1.47     8204072     24.29      10  0.1562
             induct2      4.81     8760336     24.66      10  0.8962
               linpk      0.42     8067544      2.28      10  0.4775
                mdbx      1.17     8153576      3.49      10  0.0862
        mp_prop_desi      1.15     8549000     44.21      10  1.5834
                  nf      0.87     8227904      4.22      10  0.0494
             protein      5.15     8412664     14.37      10  0.0801
              rnflow     23.86     8401544      7.49      10  0.3231
           test_fpu2      3.01     8383952     17.74      10  0.1773
               tfft2      0.60     8135840     41.03      10  0.2156
        
        Geometric Mean Execution Time =      10.62 seconds
        
        ================================================================================
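        To see which benchmarks drive the small gap between the two gfortran runs, the per-benchmark ratios can be computed from the gfortran512.sum and gfortran256.sum tables above. This is only a rough Python sketch: the 5% threshold is an arbitrary choice for illustration, and the variable names are not from any tool.

```python
benchmarks = [
    "ac", "aermod", "air", "capacita", "channel2", "doduc", "fatigue2",
    "gas_dyn2", "induct2", "linpk", "mdbx", "mp_prop_design", "nf",
    "protein", "rnflow", "test_fpu2", "tfft2",
]

# Average run times (seconds) copied from gfortran512.sum and gfortran256.sum.
t512 = [5.66, 6.01, 2.11, 11.80, 57.66, 7.31, 46.47, 52.11, 19.54,
        2.18, 3.83, 53.32, 4.32, 13.84, 14.64, 19.18, 24.78]
t256 = [4.92, 5.50, 1.96, 11.31, 60.59, 7.13, 46.40, 52.02, 19.79,
        2.40, 3.81, 53.21, 4.30, 13.85, 13.49, 20.80, 24.69]

THRESHOLD = 0.05  # arbitrary cut-off chosen for this sketch

# ratio > 1 means the 512-bit build took longer than the 256-bit build.
ratio = {name: a / b for name, a, b in zip(benchmarks, t512, t256)}

faster_with_256 = [n for n in benchmarks if ratio[n] > 1 + THRESHOLD]
faster_with_512 = [n for n in benchmarks if ratio[n] < 1 - THRESHOLD]

for name in benchmarks:
    print(f"{name:>14}  t512/t256 = {ratio[name]:.3f}")
print("faster with 256-bit:", faster_with_256)
print("faster with 512-bit:", faster_with_512)
```

        With that threshold, ac, aermod, air and rnflow come out faster with 256-bit vectors, linpk and test_fpu2 come out faster with 512-bit vectors, and the remaining benchmarks are within 5% either way.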
        I would have to investigate further to figure out to what extent the time differences are caused by lower clock speeds due to thermal throttling versus missed vectorization of loops whose trip count is only known at run time. E.g., if a loop runs for only 6 iterations and that 6 is unknown at compile time, -mprefer-vector-width=512 would mean the loop effectively isn't vectorized, while with 256-bit vectors the first vector iteration would still be covered.

        I'd expect LLVM to be much more vulnerable to that, since it seems to unroll much more than GCC (based on looking at the asm of functions like dot products).
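        The short-loop argument can be illustrated with a toy model in Python. It assumes 64-bit elements (doubles) and a plain scalar remainder loop. Real compilers may emit masked or narrower vector epilogues, so this is a simplification and not a description of what GCC or LLVM actually generate. The function name and its parameters are hypothetical.

```python
def split_trip_count(n, vector_bits, element_bits=64, unroll=1):
    """Split n loop iterations into (vectorized, leftover) element counts.

    Toy model: the main loop consumes `lanes * unroll` elements per pass,
    and whatever does not fill a whole pass is left to a scalar remainder.
    """
    lanes = vector_bits // element_bits   # elements per vector register
    step = lanes * unroll                 # elements per main-loop pass
    vectorized = (n // step) * step
    return vectorized, n - vectorized


# A 6-iteration loop, as in the example above.
print("512-bit:          ", split_trip_count(6, 512))
print("256-bit:          ", split_trip_count(6, 256))
# Same 256-bit vectors, but with the main loop unrolled 4x.
print("256-bit, unroll 4:", split_trip_count(6, 256, unroll=4))
```

        In this model a 6-iteration loop never enters the 512-bit main loop, gets one full vector pass with 256-bit vectors, and again falls through entirely to the remainder once the 256-bit loop is unrolled four times. That last case is the reason heavier unrolling would be expected to hurt more on short loops.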

        Flang performed exactly the same with `-mprefer-vector-width=256` and `-mprefer-vector-width=512`. A brief test suggests it is ignoring the argument (the same asm is produced).
        Last edited by celrod; 06 May 2019, 01:20 PM.
