Intel Cascade Lake Xeon Benchmarks With GCC 8 vs. GCC 9 Compilers


  • Intel Cascade Lake Xeon Benchmarks With GCC 8 vs. GCC 9 Compilers

    Phoronix: Intel Cascade Lake Xeon Benchmarks With GCC 8 vs. GCC 9 Compilers

    With today's release of the GCC 9 compiler, among the many features offered in this annual update to the GNU Compiler Collection is the initial "cascadelake" target/tuning support, which enables AVX-512 VNNI by default and is the most notable change over the former "skylake-avx512" GCC target. For those wondering how GCC 9's performance compares with last year's GCC 8 compiler release, here are some preliminary benchmarks...


  • #2
    That looks like a decent raising of the bar from GCC 9.x. Nice!



    • #3
      And now the same with CPUs without AVX(-512) and/or Zen-based ones. Thanks.



      • #4
        I think you will find it has also greatly improved when not using -march (as most Linux distros do not).



        • #5
          Originally posted by nuetzel View Post
          And now the same with CPUs without AVX(-512) and/or Zen-based ones. Thanks.
          He compiled with -march=skylake, which would only use AVX2. Of course, more varied data would be better; I suspect it is under way, though.



          • #6
            Originally posted by nuetzel View Post
            And now the same with CPUs without AVX(-512) and/or Zen-based ones. Thanks.
            Originally posted by carewolf View Post
            He compiled with -march=skylake, which would only use AVX2. Of course, more varied data would be better; I suspect it is under way, though.
            Actually, `-march=cascadelake` does not enable AVX-512 either. For that, you need `-march=cascadelake -mprefer-vector-width=512`.

            Check out Godbolt, the compiler explorer: compare using `-mprefer-vector-width=512` versus not using it. Specifically, look at line 13 in both of those examples and you will notice a difference: the former uses zmm registers, the latter ymm registers.
            zmm registers are 512 bits and ymm registers are 256 bits. You'll see that the first one uses a `vfmadd231pd` instruction with zmm registers, while the second instead uses ymm registers. That means only when using `-mprefer-vector-width=512` will it actually use AVX-512 for that example.

            This is even true when you tell it the specific number of loop iterations. For example, if N is hardcoded to 16:
            -march=cascadelake -O3 -mprefer-vector-width=512: 2x vfma with zmm registers (each fma operates on 8 doubles)
            -march=cascadelake -O3: 4x vfma with ymm registers (each fma operates on 4 doubles)
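
            For reference, a minimal C loop along these lines (my own sketch, not the exact snippet behind the Godbolt links above) shows the behavior: with `gcc -O3 -march=cascadelake` the vectorizer sticks to 256-bit ymm registers, and adding `-mprefer-vector-width=512` switches it to 512-bit zmm registers.
            Code:
            /* Fused multiply-add loop: gcc -O3 -march=cascadelake vectorizes this
               with 256-bit ymm registers by default; adding
               -mprefer-vector-width=512 makes it use 512-bit zmm registers,
               i.e. vfmadd231pd on 8 doubles at a time instead of 4. */
            void fma_loop(double *restrict a, const double *restrict b,
                          const double *restrict c, int n)
            {
                for (int i = 0; i < n; i++)
                    a[i] += b[i] * c[i];
            }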



            • #7
              Originally posted by celrod View Post



              Actually, `-march=cascadelake` does not enable AVX-512 either. For that, you need `-march=cascadelake -mprefer-vector-width=512`.

              Check out Godbolt, the compiler explorer: compare using `-mprefer-vector-width=512` versus not using it. Specifically, look at line 13 in both of those examples and you will notice a difference: the former uses zmm registers, the latter ymm registers.
              zmm registers are 512 bits and ymm registers are 256 bits. You'll see that the first one uses a `vfmadd231pd` instruction with zmm registers, while the second instead uses ymm registers. That means only when using `-mprefer-vector-width=512` will it actually use AVX-512 for that example.

              This is even true when you tell it the specific number of loop iterations. For example, if N is hardcoded to 16:
              -march=cascadelake -O3 -mprefer-vector-width=512: 2x vfma with zmm registers (each fma operates on 8 doubles)
              -march=cascadelake -O3: 4x vfma with ymm registers (each fma operates on 4 doubles)
              It does enable AVX-512, but the CPU tuning for cascadelake says that 256-bit vectors are preferred. This is because of the issue of the CPU lowering its clock speed when executing 512-bit instructions.
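
              For illustration (a minimal sketch, not from the thread; the function name is made up): explicit 512-bit intrinsics still compile with plain `-march=cascadelake`, since that flag does enable AVX-512F; the 256-bit preference only affects what the autovectorizer chooses on its own.
              Code:
              #include <immintrin.h>

              /* Accepted with gcc -O3 -march=cascadelake alone, because the -march
                 flag enables AVX-512F; only the autovectorizer's default width
                 preference is 256-bit. */
              void fma512_once(double *a, const double *b, const double *c)
              {
                  __m512d va = _mm512_loadu_pd(a);
                  __m512d vb = _mm512_loadu_pd(b);
                  __m512d vc = _mm512_loadu_pd(c);
                  va = _mm512_fmadd_pd(vb, vc, va);   /* va = vb*vc + va */
                  _mm512_storeu_pd(a, va);
              }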



              • #8
                If it doesn't use them, that's more or less the same thing as them not being enabled.

                I write and run numerical code. I have Julia libraries for inserting LLVM vector intrinsics and vectorized special functions, and many more libraries that depend on them.

                In my code, using 512-bit vectors is definitely faster than 256-bit, despite downclocking.*

                Much of the time, I also simply rely on the autovectorizer. Thankfully, LLVM (and therefore Julia) does not share that preference with GCC; otherwise I could never rely on the autovectorizer and would have to explicitly vectorize every bit of performance-sensitive code.

                Code must actually be written to be efficiently vectorizable, e.g. by paying attention to memory layout.
                Traditional performance advice can sometimes run contrary to that. The common advice for column-/row-major matrices is to iterate over the first/second dimension fastest, but that will often prevent vectorization. That is, treating the matrix as an array of structs doesn't offer many vectorization opportunities, while a struct of arrays does (sketched below). That this is what SPMD encourages is one of the reasons the ISPC compiler often gets such great performance.

                My point in this last paragraph is that code often isn't written with vectorizability in mind, and therefore does not benefit. But in vectorizable numerical code, AVX-512 is much faster. As an extreme example, just compare BLAS and LAPACK benchmarks.
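
                To make the array-of-structs versus struct-of-arrays contrast concrete, here is a toy C sketch (my own example; the type and function names are made up). With -Ofast/-ffast-math, which allow the floating-point reduction to be reassociated, the SoA version vectorizes cleanly while the AoS version does not.
                Code:
                /* Array of structs: the x values are strided in memory, so a
                   reduction over p[i].x vectorizes poorly. */
                struct particle_aos { double x, y, z; };

                double sum_x_aos(const struct particle_aos *p, int n)
                {
                    double s = 0.0;
                    for (int i = 0; i < n; i++)
                        s += p[i].x;
                    return s;
                }

                /* Struct of arrays: each field is contiguous, so the same
                   reduction maps cleanly onto wide vector registers. */
                struct particles_soa { double *x, *y, *z; };

                double sum_x_soa(const struct particles_soa *p, int n)
                {
                    double s = 0.0;
                    for (int i = 0; i < n; i++)
                        s += p->x[i];
                    return s;
                }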

                *Although I have to confess that it is hard to benchmark the downclocking, so I'm not sure how much I benefit in my real workloads. (I can confirm that it is much faster than Ryzen, at least.)
                In microbenchmarks the difference is often close to 2x, but I only run my microbenchmarks on a single core, which doesn't result in much downclocking.
                Running an AVX-512-heavy process on all cores, however, does downclock (and if I mess with BIOS settings to stop it from downclocking, running on all cores will cause CPU temperatures to rise until the computer crashes).

                EDIT:
                Are there benchmarks across a variety of workloads that GCC based its decision on?
                I'll try the Polyhedron Benchmark Suite and report back with what I find. I have not looked at the code base, though, so I do not know how much of it can be vectorized.

                EDIT:
                I just realized your name is Hubicka. Off topic, but I loved that blog post, and amazing work!
                Last edited by celrod; 04 May 2019, 06:29 AM.



                • #9
                  Originally posted by hubicka View Post
                  It does enable AVX-512, but the CPU tuning for cascadelake says that 256-bit vectors are preferred. This is because of the issue of the CPU lowering its clock speed when executing 512-bit instructions.
                  I maintain that for highly vectorizable code, it does give a large improvement (linear algebra software being the most obvious example). But I will concede that highly vectorizable code probably took enough deliberate effort to write (it rarely happens by accident) that the authors might not be particularly burdened by also specifying their preferred vector width.

                  That said, using the Polyhedron Fortran benchmark suite (Fortran benchmarks are the easiest way to find numerical workloads), I see no significant differences between `-Ofast -march=native` and `-Ofast -march=native -mprefer-vector-width=512` on an i9 9940X CPU.
                  These tests were using Clear Linux with gfortran 9.1.1.

                  Summary without specifying preferred vector width:
                  Code:
                  $ cat preferdefault.sum
                  ================================================================================
                  Date & Time     :  4 May 2019  8:35:28
                  Test Name       : preferdefault
                  Compile Command : gfortran -Ofast -march=native %n.f90 -o %n
                  Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
                  Maximum Times   :    10000.0
                  Target Error %  :      0.100
                  Minimum Repeats :    10
                  Maximum Repeats :   100
                   
                     Benchmark   Compile  Executable   Ave Run  Number   Estim
                          Name    (secs)     (bytes)    (secs) Repeats   Err %
                     ---------   -------  ----------   ------- -------  ------
                            ac      0.37       43176      5.63      12  0.0262
                        aermod     12.35     1089648      5.99      12  0.0622
                           air      2.87      152536      2.02      36  0.0981
                      capacita      1.20      106472     11.75      10  0.0743
                      channel2      0.20       31216     57.80      18  0.0965
                         doduc      2.02      165240      7.26      16  0.0890
                      fatigue2      0.82       77928     48.05      11  0.0979
                      gas_dyn2      1.02       95584     51.62      23  0.0913
                       induct2      1.65      175232     19.46      18  0.0957
                         linpk      0.25       34928      2.18      15  0.0978
                          mdbx      0.80       90856      3.81      10  0.0417
                  mp_prop_desi      0.38       47944     50.23      13  0.0825
                            nf      0.43       47600      4.28      16  0.0911
                       protein      1.03       95112     13.79      10  0.0813
                        rnflow      1.66      115416     14.59      13  0.0906
                     test_fpu2      1.23       93608     19.11      14  0.0990
                         tfft2      0.45       47520     24.67      14  0.0989
                  
                  Geometric Mean Execution Time =      12.15 seconds
                  
                  ================================================================================

                  Preferring 512 bit vectors:
                  Code:
                  $ cat prefer512.sum
                  ================================================================================
                  Date & Time     :  4 May 2019  6:53:28
                  Test Name       : prefer512
                  Compile Command : gfortran -Ofast -march=native -mprefer-vector-width=512 %n.f90 -o %n
                  Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
                  Maximum Times   :    10000.0
                  Target Error %  :      0.100
                  Minimum Repeats :    10
                  Maximum Repeats :   100
                   
                     Benchmark   Compile  Executable   Ave Run  Number   Estim
                          Name    (secs)     (bytes)    (secs) Repeats   Err %
                     ---------   -------  ----------   ------- -------  ------
                            ac      0.37       43176      5.63      10  0.0329
                        aermod     12.36     1089648      6.00      13  0.0959
                           air      2.85      152536      2.03      28  0.0959
                      capacita      1.20      106472     11.74      10  0.0814
                      channel2      0.19       31216     58.04      19  0.0874
                         doduc      2.02      165240      7.25      10  0.0826
                      fatigue2      0.82       77928     48.53      10  0.0676
                      gas_dyn2      1.03       95584     51.91      16  0.0811
                       induct2      1.65      175232     19.39      10  0.0596
                         linpk      0.25       34928      2.18      17  0.0831
                          mdbx      0.80       90856      3.82      10  0.0600
                  mp_prop_desi      0.39       47944     50.11      18  0.0953
                            nf      0.40       47600      4.25      14  0.0919
                       protein      1.04       95112     13.75      14  0.0574
                        rnflow      1.64      115416     14.54      12  0.0781
                     test_fpu2      1.22       93608     19.13      14  0.0860
                         tfft2      0.45       47520     24.55      18  0.0910
                  
                  Geometric Mean Execution Time =      12.16 seconds
                  
                  ================================================================================
                  For all I know, most of these executables may have been exactly the same. Note that the actual error is much higher than the listed figures. This is because the error is computed on an average of 10 runs: when the number of repeats is greater than 10, it drops the (Number Repeats - 10) runs furthest from the mean. It also drops them one by one, so if you had 5 low times and 6 high times, it will drop a low time, pushing the average up and making it more likely to drop low times moving forward.
                  If instead it had 6 low times and 5 high times, it will do the opposite. This makes the average it converges on rather noisy -- much noisier than the stated "Estim Err %".
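
                  To make that bias concrete, here is a minimal C sketch of the dropping procedure as I understand it (my own reconstruction, not the suite's actual code):
                  Code:
                  #include <math.h>

                  /* Repeatedly drop the time furthest from the current mean until
                     only `keep` (10 in the suite) remain, then report the mean of
                     the survivors. Which side gets dropped first depends on where
                     the current mean sits, which is what makes the result noisy. */
                  double trimmed_mean(double *t, int n, int keep)
                  {
                      while (n > keep) {
                          double mean = 0.0;
                          for (int i = 0; i < n; i++) mean += t[i];
                          mean /= n;

                          int worst = 0;                 /* index furthest from mean */
                          for (int i = 1; i < n; i++)
                              if (fabs(t[i] - mean) > fabs(t[worst] - mean))
                                  worst = i;

                          t[worst] = t[--n];             /* drop it and shrink */
                      }
                      double mean = 0.0;
                      for (int i = 0; i < n; i++) mean += t[i];
                      return mean / n;
                  }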

                  They should really take the approach of only dropping high times, or not dropping times at all and loosening the requirement on reported Estim Err before completing.

                  That said, I'd say all those results seem to be identical.



                  • #10
                    Link to the benchmarks.
                    One last comment on the above: I had to set `ulimit -s unlimited`, otherwise a few of those benchmarks would segfault due to wanting to allocate a lot of stack space.

                    For fun, I also tested ifort 19.0.3.199 and Flang built with LLVM 7.

                    For these benchmarks, ifort was the clear winner:
                    Code:
                    $ cat ifort.sum
                    ================================================================================
                    Date & Time     :  4 May 2019 10:05:14
                    Test Name       : ifort
                    Compile Command : ifort -fast -qopt-zmm-usage=high %n.f90 -o %n
                    Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
                    Maximum Times   :    10000.0
                    Target Error %  :      0.100
                    Minimum Repeats :    10
                    Maximum Repeats :   100
                     
                       Benchmark   Compile  Executable   Ave Run  Number   Estim
                            Name    (secs)     (bytes)    (secs) Repeats   Err %
                       ---------   -------  ----------   ------- -------  ------
                              ac      1.41     8129584      4.06      12  0.0891
                          aermod     17.15    10420728      6.47      10  0.0844
                             air      4.03     8406208      1.51      26  0.0962
                        capacita      1.70     8233712     10.07      14  0.0813
                        channel2      0.46     8156672     55.34      16  0.0894
                           doduc      2.38     8329416      7.24      12  0.0888
                        fatigue2      2.08     8444576     47.62      12  0.0260
                        gas_dyn2      1.49     8216560     24.10      42  0.0991
                         induct2      4.86     8760336     21.19      13  0.0910
                           linpk      0.45     8071840      2.15      15  0.0937
                            mdbx      1.66     8186544      2.66      15  0.0841
                    mp_prop_desi      1.06     8529808     41.07      23  0.0936
                              nf      0.86     8232200      4.34      15  0.0855
                         protein      5.06     8411944     15.66      10  0.0887
                          rnflow     23.57     8418128      7.65      10  0.0525
                       test_fpu2      3.02     8396336     16.11      10  0.0650
                           tfft2      0.62     8139256     41.51      18  0.0989
                    
                    Geometric Mean Execution Time =      10.71 seconds
                    
                    ================================================================================

                    Flang, meanwhile, was clearly well behind gfortran. It failed the second test and exited immediately, hence the exceptionally fast time there.

                    Code:
                    $ cat flang.sum
                    ================================================================================
                    Date & Time     :  4 May 2019 13:47:07
                    Test Name       : flang
                    Compile Command : flang -Ofast -march=native %n.f90 -o %n
                    Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
                    Maximum Times   :    10000.0
                    Target Error %  :      0.100
                    Minimum Repeats :    10
                    Maximum Repeats :   100
                     
                       Benchmark   Compile  Executable   Ave Run  Number   Estim
                            Name    (secs)     (bytes)    (secs) Repeats   Err %
                       ---------   -------  ----------   ------- -------  ------
                              ac      0.35       54776      5.90      13  0.0632
                          aermod     29.88     1387176      0.01     100 12.3405
                             air      1.91      132240      2.25      15  0.0812
                        capacita      1.13       92592      9.81      13  0.0814
                        channel2      0.34       44952     62.17      12  0.0870
                           doduc      2.28      156560      7.15      12  0.0966
                        fatigue2      0.77      113128     75.23      17  0.0995
                        gas_dyn2      0.80      100312     40.00      17  0.0784
                         induct2      1.74      258128     49.57      12  0.0635
                           linpk      0.33       42344      3.29      18  0.0914
                            mdbx      0.95      111240      4.42      10  0.0399
                    mp_prop_desi      0.30       49928     87.90      13  0.0580
                              nf      1.02       71872      6.65      13  0.0764
                         protein      2.06      154880     14.70      10  0.0289
                          rnflow      3.19      184816     13.41      10  0.0413
                       test_fpu2      3.59      154432     23.60      13  0.0870
                           tfft2      0.23       34896     40.01      17  0.0974
                    
                    Geometric Mean Execution Time =      10.18 seconds
                    
                    ================================================================================

