GCC 9 Compiler Tuning Benchmarks On Intel Skylake AVX-512

  • GCC 9 Compiler Tuning Benchmarks On Intel Skylake AVX-512

    Phoronix: GCC 9 Compiler Tuning Benchmarks On Intel Skylake AVX-512

    Recently I carried out a number of GCC 9 compiler benchmarks on AMD EPYC looking at the performance benefits of "znver1" compiler tuning and varying optimization levels to see when this level of compiler tuning pays off. There was interest from that in seeing some fresh Intel Skylake-X / AVX-512 figures, so here are those benchmarks of GCC 9 with various tuning options and their impact on the performance of the generated binaries.

  • #2
    FYI, GCC will not use 512-bit vector instructions when auto-vectorizing unless you pass `-mprefer-vector-width=512`; with `-march=skylake-avx512` alone it prefers 256-bit vectors. This is where much of the potential for big gains from skylake-avx512 performance tuning comes from.

    EDIT:
    I guess I'm disappointed by these results, because in my own workloads I tend to see close to a 2x performance gain out of avx512.
    Part of that is that I made sure those workloads (which are numerical) will be vectorized, and then it delivers as promised.
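
For context on the "close to 2x" figure, here is a back-of-the-envelope sketch, not a measurement. Going from 256-bit to 512-bit vectors doubles the number of elements handled per instruction, so 2x is roughly the ceiling for a fully vectorized, compute-bound loop. The helper name `lanes` is made up for this illustration.

```python
def lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


for name, element_bits in (("float", 32), ("double", 64)):
    l256 = lanes(256, element_bits)
    l512 = lanes(512, element_bits)
    # Ideal speedup assumes the loop is compute-bound and fully vectorized.
    print(f"{name}: {l256} lanes at 256-bit, {l512} lanes at 512-bit, "
          f"ideal speedup {l512 / l256:.1f}x")
```

Real gains can land below this ceiling, for example when the loop is limited by memory traffic rather than arithmetic.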

    So my impression is that, if all you're doing is running other people's code, and that code does not include BLAS/LAPACK, don't bother with avx512, and don't bother to enable the instruction set while compiling.

    But, if you're actually writing the code you will run, that code can be vectorized, and you put in the effort to enable that, you will see the expected benefit.
    Last edited by celrod; 08 March 2019, 01:56 AM.

    • #3
      What about "-Ofast", "-Ofast -flto" and "-Ofast -march=skylake-avx512 -flto"?
      I have a Haswell i7 processor and got the best performance with "-Ofast -flto".

      • #4
        Originally posted by celrod
        So my impression is that, if all you're doing is running other people's code, and that code does not include BLAS/LAPACK, don't bother with avx512, and don't bother to enable the instruction set while compiling.

        But, if you're actually writing the code you will run, that code can be vectorized, and you put in the effort to enable that, you will see the expected benefit.
        This is a very good observation. Let me add another one: wide vectorization only makes sense if your data fits into the CPU caches and/or your access pattern is very cache friendly. If it does not, the available memory bandwidth is not enough to feed the vector units and you see marginal, if any, performance improvement.
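
The bandwidth argument can be sketched with a simple roofline-style bound. This is only an illustration: the machine numbers below are hypothetical round figures, not measurements of any Skylake part, and `attainable_gflops` is a name invented for the example.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline bound: limited by either compute or memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Hypothetical machine, chosen only to make the arithmetic easy to follow.
BANDWIDTH_GBS = 100.0   # sustained memory bandwidth in GB/s
PEAK_256 = 1000.0       # peak GFLOP/s with 256-bit vectors
PEAK_512 = 2000.0       # peak GFLOP/s with 512-bit vectors

for flops_per_byte in (0.125, 15.0, 50.0):
    narrow = attainable_gflops(PEAK_256, BANDWIDTH_GBS, flops_per_byte)
    wide = attainable_gflops(PEAK_512, BANDWIDTH_GBS, flops_per_byte)
    print(f"{flops_per_byte:6.3f} flop/byte: {narrow:7.1f} -> {wide:7.1f} "
          f"GFLOP/s ({wide / narrow:.2f}x)")
```

At low arithmetic intensity both vector widths hit the same bandwidth ceiling, so doubling the vector width buys nothing. The full 2x only shows up once the kernel does enough work per byte fetched.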

        If anyone out there has code that can really benefit from vectorization, I recommend taking a look at the NEC SX-Aurora TSUBASA cards. They're the latest iteration of NEC's long line of vector supercomputers and come with 6 HBM2 stacks per chip in order to feed their vector units.

        • #5
          It would be great to see a test that reports both the time to compile and the performance of the resulting binary, giving a performance-per-compile-second metric that shows where the sweet spot is.
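
One possible way to express such a metric, as a sketch only: divide a higher-is-better benchmark score by the compile time in seconds. The function name and the numbers below are placeholders invented for the example, not results from the article.

```python
def perf_per_compile_second(score: float, compile_seconds: float) -> float:
    """Higher-is-better benchmark score divided by compile time in seconds."""
    if compile_seconds <= 0:
        raise ValueError("compile_seconds must be positive")
    return score / compile_seconds


# Placeholder values (score, compile seconds); NOT measured data.
placeholder_runs = {
    "-O2": (100.0, 10.0),
    "-O3": (104.0, 13.0),
}

for flags, (score, seconds) in placeholder_runs.items():
    print(f"{flags}: {perf_per_compile_second(score, seconds):.2f} score per compile second")
```

With these made-up inputs the slower-to-compile build scores lower on the metric even though its binary is faster, which is the kind of trade-off the metric is meant to expose.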

          • #6
            Michael, I was wondering: do the geometric mean results include the compilation time results, or only the application performance results?
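
Why the question matters, as a rough sketch: a geometric mean over mixed results only makes sense if every entry is first normalized in the same direction. The code below is not how the article computed its figures (that is exactly what is being asked); `normalize` and the sample numbers are assumptions made up for illustration.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive numbers."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one value, all strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalize(result: float, baseline: float, higher_is_better: bool) -> float:
    """Ratio against a baseline, oriented so that > 1.0 always means better."""
    return result / baseline if higher_is_better else baseline / result


# Placeholder inputs, not figures from the article.
ratios = [
    normalize(110.0, 100.0, higher_is_better=True),   # e.g. frames per second
    normalize(8.0, 10.0, higher_is_better=False),     # e.g. seconds to compile
]
print(f"combined score: {geometric_mean(ratios):.3f}")
```

If compile times were folded in, a flag set that builds slowly would drag the mean down even when the generated binaries are faster.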

            • #7
              Originally posted by Herem
              Michael, I was wondering: do the geometric mean results include the compilation time results, or only the application performance results?
              I was wondering too.

              • #8
                Originally posted by Setif
                What about "-Ofast", "-Ofast -flto" and "-Ofast -march=skylake-avx512 -flto"?
                I have a Haswell i7 processor and got the best performance with "-Ofast -flto".
                -Ofast enables -ffast-math, which lets the compiler perform unsafe floating-point transformations that can break applications and change results. It is not a good candidate for comparison at all.
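
A small illustration of why reordering floating-point operations is unsafe. This does not invoke GCC; it only shows, in plain Python doubles, that floating-point addition is not associative, which is the property such optimizations ignore when they reassociate a sum.

```python
values = [1e16, 1.0, -1e16]

# Source order: the 1.0 is lost when added to 1e16, because adjacent
# doubles near 1e16 are 2 apart.
left_to_right = (values[0] + values[1]) + values[2]

# Same numbers, different grouping: the large terms cancel first.
reassociated = (values[0] + values[2]) + values[1]

print(left_to_right, reassociated)  # prints "0.0 1.0"
```

Both groupings are mathematically identical, yet they return different answers, so a program that depends on a particular evaluation order can change behaviour once the compiler is allowed to regroup.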

                • #9
                  dav1d:

                  -O0: 72.83fps
                  -O1: 131.83fps
                  -O2: 132.56fps
                  -O2 -march=skylake-avx512: 135.35fps
                  -O3: 133.94fps
                  -O3 -march=x86-64: 133.69fps
                  -O3 -march=skylake: 136.53fps
                  -O3 -march=skylake-avx512: 134.69fps
                  -Ofast -march=skylake-avx512: 134.04fps
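
To put the dav1d numbers above in perspective, a short sketch that expresses each result relative to plain -O2. The fps values are copied from the list; the choice of -O2 as the baseline is arbitrary.

```python
# dav1d frames per second, copied from the list above.
fps = {
    "-O0": 72.83,
    "-O1": 131.83,
    "-O2": 132.56,
    "-O2 -march=skylake-avx512": 135.35,
    "-O3": 133.94,
    "-O3 -march=x86-64": 133.69,
    "-O3 -march=skylake": 136.53,
    "-O3 -march=skylake-avx512": 134.69,
    "-Ofast -march=skylake-avx512": 134.04,
}

baseline = fps["-O2"]  # arbitrary reference point
for flags, value in sorted(fps.items(), key=lambda kv: kv[1], reverse=True):
    change = 100.0 * (value / baseline - 1.0)
    print(f"{flags:30s} {value:7.2f} fps  {change:+6.1f}% vs -O2")

best = max(fps, key=fps.get)
print("fastest:", best)
```

Everything from -O1 upward lands within about 3% of -O2, and the fastest entry is -O3 -march=skylake rather than any of the skylake-avx512 builds.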

                  • #10
                    -flto shows poor results!
