A Look At The GCC 9 Performance On Intel Skylake Against GCC 8, LLVM Clang 7/8


  • nanonyme
    replied
Is there some benchmark I should look at to determine actual compile-speed differences, such as the LLVM/WebKit/Chromium compile tests?



  • celrod
    replied
I can set `-finline-limit` to some arbitrarily big number to get everything to inline, given that I'm also using `-fno-semantic-interposition`.
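For reference, a minimal sketch of such an invocation, based on the gfortran flags from my earlier post below (the file names here are placeholders, not the exact ones I used):
Code:
gfortran -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC \
    -fno-semantic-interposition -finline-limit=100000 \
    vpdbacksolve.f90 -o libgfortran_version.so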

gfortran is still struggling compared to ifort, but this helped versions 2 and 3 compiled with gfortran match Julia.

Version 1 is still about as slow, even though it is now inlined. Rather than a for loop that makes sense, there is a sequence of jumps weaving through a bunch of save instructions. I am not sure what it is trying to do, but version 1 still takes 35 microseconds despite getting inlined, while the manually inlined version takes 1.4 microseconds -- a 25x improvement from inlining by hand.

Version 0 of the function was the most obvious and natural to write -- just defining the function to operate on scalars, without worrying about vectorization beyond how the data is stored in memory. Julia took 10 microseconds vs 1.4 for the babysat vectorized version.
I then decided to test that version in Fortran. gfortran took about 97 microseconds!
Ouch, that is slow.
ifort took around 850 nanoseconds, well over 100x faster.

Too bad the Intel compilers are neither libre nor gratis. =(
But at least on this problem, Intel's vectorizer is doing a marvellous job -- including correctly optimizing the clearest and easiest way to write the function, where gfortran failed catastrophically (and Julia did pretty badly, too).



  • celrod
    replied
    Adding `-mveclibabi=svml` may be a good idea in general, but it doesn't help for square roots:

    GCC currently emits calls to vmldExp2, vmldLn2, vmldLog102, vmldPow2, vmldTanh2, vmldTan2, vmldAtan2, vmldAtanh2, vmldCbrt2, vmldSinh2, vmldSin2, vmldAsinh2, vmldAsin2, vmldCosh2, vmldCos2, vmldAcosh2, vmldAcos2, vmlsExp4, vmlsLn4, vmlsLog104, vmlsPow4, vmlsTanh4, vmlsTan4, vmlsAtan4, vmlsAtanh4, vmlsCbrt4, vmlsSinh4, vmlsSin4, vmlsAsinh4, vmlsAsin4, vmlsCosh4, vmlsCos4, vmlsAcosh4 and vmlsAcos4 for corresponding function type when -mveclibabi=svml is used, and __vrd2_sin, __vrd2_cos, __vrd2_exp, __vrd2_log, __vrd2_log2, __vrd2_log10, __vrs4_sinf, __vrs4_cosf, __vrs4_expf, __vrs4_logf, __vrs4_log2f, __vrs4_log10f and __vrs4_powf for the corresponding function type when -mveclibabi=acml is used.
    Thanks, I did not know that `-fPIC` prevents inlining!
    I saw `-fno-semantic-interposition` on StackOverflow, and tested to confirm that using it allows inlining.

However, gfortran still would not inline my example -- I wish it had something like the C/C++ `inline` attribute (which still allows inlining despite `-fPIC`).
That, vector intrinsics, and the standard library often make C++ easier to work with than Fortran, despite C++'s less convenient native array support.



  • Grinch
    replied
    Originally posted by Michael View Post
    but it's not 100% perfect with some pesky programs in always being able to catch the flags they are passing (not sure if there is any better or more uniform method today, this is just based upon this compiler masking code I wrote some years ago).
    Ah, that makes sense, thanks for the explanation.



  • hubicka
    replied
    Originally posted by celrod View Post

Code:
gfortran -Ofast -fdisable-tree-cunrolli -march=native -mprefer-vector-width=512 -shared -fPIC $FILE -o $GCCSHAREDLIBNAME
The reason why GCC did not inline could be -fPIC. It makes the transformation invalid, since you may overwrite the symbol at runtime (with LD_PRELOAD); ICC seems to ignore this.



  • hubicka
    replied
    Originally posted by celrod View Post


    ...yet somehow Intel was still more than 60% faster. I don't know how much of a role this played:
    Code:
    call *__svml_invsqrtf16_z0@GOTPCREL(%rip) #226.9
    There are a total of 192 inverse square roots needed, and it takes around 5 ns to use something like this instead ((%rax) would be 1f0):
    Code:
    vsqrtps %zmm0, %zmm0
    vbroadcastss (%rax), %zmm1
    vdivps %zmm0, %zmm1, %zmm0
You may try to use -mveclibabi=svml with GCC and link with SVML too.
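A rough sketch of what that could look like, assuming Intel's libsvml is installed and visible to the linker (the file names are placeholders):
Code:
gfortran -Ofast -march=native -mprefer-vector-width=512 -mveclibabi=svml -shared -fPIC \
    vpdbacksolve.f90 -o libgfortran_svml.so -lsvml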



  • celrod
    replied
    Originally posted by thebear View Post
How do gcc/gfortran and clang/flang compare to icc/ifort these days? (A coworker claimed Intel's compilers are still "superior".)
It isn't worth much, because it is only one example, but I tried three versions of a function in Fortran with gfortran 8.2 and ifort 19.0.1 on a computer with AVX-512.

    All three versions are doing the exact same numerical calculations on the exact same input (calculating the product of a vector and the Cholesky decomposition of the inverse of a 3x3 matrix for each of 1024 matrices and vectors).
    All versions loop over the data to do these calculations.
    Version 1 calls another function that works on blocks from the inputs:
    Code:
    subroutine vpdbacksolve(Uix, x, S)
        real, dimension(16,3), intent(out) ::  Uix
        real, dimension(16,3), intent(in)  ::  x
        real, dimension(16,6), intent(in)  ::  S
        real, dimension(16)                ::  U12,  U13,  U23, &
                                            Ui11, Ui12, Ui22, Ui13, Ui23, Ui33
    
        Ui33 = 1 / sqrt(S(:,6))
        U13 = S(:,4) * Ui33
        U23 = S(:,5) * Ui33
        Ui22 = 1 / sqrt(S(:,3) - U23**2)
        U12 = (S(:,2) - U13*U23) * Ui22
    
        Ui11 = 1 / sqrt(S(:,1) - U12**2 - U13**2) ! u11
        Ui12 = - U12 * Ui11 * Ui22 ! u12
        Ui13 = - (U13 * Ui11 + U23 * Ui12) * Ui33
        Ui23 = - U23 * Ui22 * Ui33
    
        Uix(:,1) = Ui11*x(:,1) + Ui12*x(:,2) + Ui13*x(:,3)
        Uix(:,2) = Ui22*x(:,2) + Ui23*x(:,3)
        Uix(:,3) = Ui33*x(:,3)
    
    end subroutine vpdbacksolve
    Version 2 does the exact same thing, except it splits each input matrix up into the individual columns. For example, instead of the 16x3 matrix X, it takes in x1, x2, and x3 as vectors of length 16.

    Version 3 manually inlined the function from version 1.

    Because the function operates on 16 matrices/vectors at a time, the batch size of 1024 would call a function like the above 64 times.
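A hypothetical sketch of that outer loop (my reconstruction for illustration; the names, and the assumption that vpdbacksolve is visible, e.g. from the same module, are mine):
Code:
subroutine version1(Uix, x, S)
    real, dimension(1024,3), intent(out) :: Uix
    real, dimension(1024,3), intent(in)  :: x
    real, dimension(1024,6), intent(in)  :: S
    integer :: i

    ! 1024 / 16 = 64 calls; the question is whether the compiler inlines them
    do i = 1, 1024, 16
        call vpdbacksolve(Uix(i:i+15,:), x(i:i+15,:), S(i:i+15,:))
    end do
end subroutine version1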

I also wrote version 2 in Julia, and an even older version (version 0), where vpdbacksolve is called on one element at a time in the for loop. Theoretically a compiler should be able to figure out that it can still vectorize the loop (it did not); a sketch of that scalar form follows.
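For illustration, a hypothetical scalar analogue of vpdbacksolve (my reconstruction of the version 0 idea, not the code that was actually benchmarked):
Code:
subroutine pdbacksolve_scalar(Uix, x, S)
    real, dimension(3), intent(out) :: Uix
    real, dimension(3), intent(in)  :: x
    real, dimension(6), intent(in)  :: S
    real :: U12, U13, U23, Ui11, Ui12, Ui22, Ui13, Ui23, Ui33

    Ui33 = 1 / sqrt(S(6))
    U13  = S(4) * Ui33
    U23  = S(5) * Ui33
    Ui22 = 1 / sqrt(S(3) - U23**2)
    U12  = (S(2) - U13*U23) * Ui22

    Ui11 = 1 / sqrt(S(1) - U12**2 - U13**2) ! u11
    Ui12 = - U12 * Ui11 * Ui22               ! u12
    Ui13 = - (U13 * Ui11 + U23 * Ui12) * Ui33
    Ui23 = - U23 * Ui22 * Ui33

    Uix(1) = Ui11*x(1) + Ui12*x(2) + Ui13*x(3)
    Uix(2) = Ui22*x(2) + Ui23*x(3)
    Uix(3) = Ui33*x(3)
end subroutine pdbacksolve_scalar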

    I compiled with
    Code:
    ifort -fast -qopt-zmm-usage=high -ansi-alias -shared -fPIC $FILE -o $INTELSHAREDLIBNAME
    gfortran -Ofast -fdisable-tree-cunrolli -march=native -mprefer-vector-width=512 -shared -fPIC $FILE -o $GCCSHAREDLIBNAME

    Benchmarking everything from Julia:
    Code:
    julia> @benchmark julia_version_0!($X32, $BPP32)
    BenchmarkTools.Trial:
      memory estimate:  0 bytes
      allocs estimate:  0
      --------------
      minimum time:     10.512 μs (0.00% GC)
      median time:      10.955 μs (0.00% GC)
      mean time:        11.196 μs (0.00% GC)
      maximum time:     43.002 μs (0.00% GC)
      --------------
      samples:          10000
      evals/sample:     1
    
    julia> @benchmark julia_version_2!($X32, $BPP32)
    BenchmarkTools.Trial:
      memory estimate:  0 bytes
      allocs estimate:  0
      --------------
      minimum time:     1.401 μs (0.00% GC)
      median time:      1.408 μs (0.00% GC)
      mean time:        1.467 μs (0.00% GC)
      maximum time:     3.543 μs (0.00% GC)
      --------------
      samples:          10000
      evals/sample:     10
    
    julia> @benchmark gfortran_version_1!($X32, $BPP32, $N)
    BenchmarkTools.Trial:
      memory estimate:  0 bytes
      allocs estimate:  0
      --------------
      minimum time:     35.203 μs (0.00% GC)
      median time:      35.475 μs (0.00% GC)
      mean time:        36.543 μs (0.00% GC)
      maximum time:     68.729 μs (0.00% GC)
      --------------
      samples:          10000
      evals/sample:     1
    
    julia> @benchmark gfortran_version_2!($X32, $BPP32, $N)
    BenchmarkTools.Trial:
      memory estimate:  0 bytes
      allocs estimate:  0
      --------------
      minimum time:     1.866 μs (0.00% GC)
      median time:      1.875 μs (0.00% GC)
      mean time:        1.943 μs (0.00% GC)
      maximum time:     5.220 μs (0.00% GC)
      --------------
      samples:          10000
      evals/sample:     10
    
    julia> @benchmark gfortran_version_3!($X32, $BPP32, $N)
    BenchmarkTools.Trial:
      memory estimate:  0 bytes
      allocs estimate:  0
      --------------
      minimum time:     1.423 μs (0.00% GC)
      median time:      1.435 μs (0.00% GC)
      mean time:        1.483 μs (0.00% GC)
      maximum time:     4.720 μs (0.00% GC)
      --------------
      samples:          10000
      evals/sample:     10
    
    julia> @benchmark ifort_version_1!($X32, $BPP32, $N)
    BenchmarkTools.Trial:
      memory estimate:  0 bytes
      allocs estimate:  0
      --------------
      minimum time:     1.523 μs (0.00% GC)
      median time:      1.538 μs (0.00% GC)
      mean time:        1.571 μs (0.00% GC)
      maximum time:     3.683 μs (0.00% GC)
      --------------
      samples:          10000
      evals/sample:     10
    
    julia> @benchmark ifort_version_2!($X32, $BPP32, $N)
    BenchmarkTools.Trial:
      memory estimate:  0 bytes
      allocs estimate:  0
      --------------
      minimum time:     925.719 ns (0.00% GC)
      median time:      954.156 ns (0.00% GC)
      mean time:        986.308 ns (0.00% GC)
      maximum time:     2.030 μs (0.00% GC)
      --------------
      samples:          10000
      evals/sample:     32
    
    julia> @benchmark ifort_version_3!($X32, $BPP32, $N)
    BenchmarkTools.Trial:
      memory estimate:  0 bytes
      allocs estimate:  0
      --------------
      minimum time:     866.052 ns (0.00% GC)
      median time:      870.172 ns (0.00% GC)
      mean time:        898.465 ns (0.00% GC)
      maximum time:     1.527 μs (0.00% GC)
      --------------
      samples:          10000
      evals/sample:     58

    gcc fails to see how profitable it is to just inline the function getting called in the for loop.
    It wastes a whole lot of time needlessly copying things around in version 1.

    I used the `@inline` macro in Julia's version 2 to get it to inline the called function. Without it, it still did better than gcc:
    Code:
    julia> @benchmark julia_version_2_noforcedinline!($X32, $BPP32)
    BenchmarkTools.Trial:
      memory estimate:  0 bytes
      allocs estimate:  0
      --------------
      minimum time:     1.484 μs (0.00% GC)
      median time:      1.496 μs (0.00% GC)
      mean time:        1.548 μs (0.00% GC)
      maximum time:     3.301 μs (0.00% GC)
      --------------
      samples:          10000
      evals/sample:     10
    but with forced inlining, I think gfortran and Julia produced rather similar code, and the assembly looks optimal (to me).


    ...yet somehow Intel was still more than 60% faster. I don't know how much of a role this played:
    Code:
            call      *__svml_invsqrtf16_z0@GOTPCREL(%rip)          #226.9
    There are a total of 192 inverse square roots needed, and it takes around 5 ns to use something like this instead ((%rax) would be 1f0):
    Code:
        vsqrtps    %zmm0, %zmm0
        vbroadcastss    (%rax), %zmm1
        vdivps    %zmm0, %zmm1, %zmm0
5 ns is not a long time, so I doubt Intel can shave off that much with a faster version.
I am definitely impressed by how much ifort was able to improve what I had thought was nearly optimal. I'll have to look closer to get some idea of where all that performance came from.

It's definitely highly subject to how vectorizable the input code is, and probably a bunch of other factors. In the code I write, I tend to put some effort into making sure it can be vectorized, e.g. by ensuring good data layouts.

    EDIT:
    gcc is actually using
    Code:
        vrsqrt14ps    %zmm0, %zmm1{%k1}{z}
    I am not sure how to get Julia to do that. It is instead taking the square root and doing a division.
    Last edited by celrod; 13 November 2018, 08:06 PM.



  • AsuMagic
    replied
    Originally posted by thebear View Post
How do gcc/gfortran and clang/flang compare to icc/ifort these days? (A coworker claimed Intel's compilers are still "superior".)
AFAIK icc is usually better when it comes to vectorization. And IIRC it can do function multi-versioning like GCC and now Clang can, but automatically.



  • thebear
    replied
How do gcc/gfortran and clang/flang compare to icc/ifort these days? (A coworker claimed Intel's compilers are still "superior".)



  • celrod
    replied
Could you try comparing "-O3 -march=native" with "-O3 -march=native -mprefer-vector-width=512"? I'd like to see the difference that actually using AVX-512 makes.
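Concretely, something along these lines (the source and output names are just placeholders):
Code:
# with AVX-512-capable -march targets, GCC's Skylake tuning prefers 256-bit vectors unless told otherwise
gcc -O3 -march=native -o bench_256 bench.c
gcc -O3 -march=native -mprefer-vector-width=512 -o bench_512 bench.c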

