A Look At The GCC 9 Performance On Intel Skylake Against GCC 8, LLVM Clang 7/8
-
Is there some benchmark I should look at to determine actual compile-speed differences, e.g. the LLVM/WebKit/Chromium compile tests?
-
I can use `-finline-limit` set to some arbitrarily big number to get everything to inline, given that I'm also using `-fno-semantic-interposition`.
gfortran is still struggling hard compared to ifort. Those flags helped gfortran's versions 2 and 3 match Julia.
Version 1 is still about as slow, even though it is inlined. Rather than a for loop that makes sense, there is a sequence of jumps that weave through a bunch of save instructions. I am not sure what it is trying to do, but version 1 still takes 35 microseconds despite getting inlined, while the manually inlined version takes 1.4 microseconds -- a 25x improvement from manual inlining.
Version 0 of the function was the most obvious and natural to write -- just defining the function to operate on scalars, without worrying about vectorization, other than how the data is stored in memory. Julia took 10 microseconds vs 1.4 for the babysat vectorized version.
I now decided to test that version in Fortran. gfortran took about 97 microseconds!
ouch, that is slow.
ifort took around 850 nanoseconds, well over 100x faster.
Too bad the Intel compilers are neither libre nor gratis. =(
But at least on this problem, ifort's vectorizer is doing a marvellous job -- including correctly optimizing the clearest and easiest way to write the function, while gfortran failed catastrophically (and Julia did pretty badly, too).
-
Adding `-mveclibabi=svml` may be a good idea in general, but it doesn't help for square roots:
GCC currently emits calls to vmldExp2, vmldLn2, vmldLog102, vmldPow2, vmldTanh2, vmldTan2, vmldAtan2, vmldAtanh2, vmldCbrt2, vmldSinh2, vmldSin2, vmldAsinh2, vmldAsin2, vmldCosh2, vmldCos2, vmldAcosh2, vmldAcos2, vmlsExp4, vmlsLn4, vmlsLog104, vmlsPow4, vmlsTanh4, vmlsTan4, vmlsAtan4, vmlsAtanh4, vmlsCbrt4, vmlsSinh4, vmlsSin4, vmlsAsinh4, vmlsAsin4, vmlsCosh4, vmlsCos4, vmlsAcosh4 and vmlsAcos4 for corresponding function type when -mveclibabi=svml is used, and __vrd2_sin, __vrd2_cos, __vrd2_exp, __vrd2_log, __vrd2_log2, __vrd2_log10, __vrs4_sinf, __vrs4_cosf, __vrs4_expf, __vrs4_logf, __vrs4_log2f, __vrs4_log10f and __vrs4_powf for the corresponding function type when -mveclibabi=acml is used.
I saw `-fno-semantic-interposition` on StackOverflow, and tested to confirm that using it allows inlining.
However, gfortran still would not inline my example -- I wish it had something like the C/C++ `inline` attribute (which also inlines despite `-fPIC`).
That, vector intrinsics, and the standard library often make C++ easier than Fortran, despite less convenient native array support.
-
Originally posted by Michael View Post
but it's not 100% perfect with some pesky programs in always being able to catch the flags they are passing (not sure if there is any better or more uniform method today, this is just based upon this compiler masking code I wrote some years ago).
-
Originally posted by thebear View Post
How does gcc/gfortran clang/flang compare to icc/ifort these days? (a coworker claimed Intel's compilers are still "superior")

It isn't worth much, because it is only one example, but I tried three versions of a function in Fortran with gfortran 8.2 and ifort 19.0.1 on a computer with AVX-512.
All three versions are doing the exact same numerical calculations on the exact same input (calculating the product of a vector and the Cholesky decomposition of the inverse of a 3x3 matrix for each of 1024 matrices and vectors).
All versions loop over the data to do these calculations.
Version 1 calls another function that works on blocks from the inputs:
Code:
subroutine vpdbacksolve(Uix, x, S)
    real, dimension(16,3), intent(out) :: Uix
    real, dimension(16,3), intent(in)  :: x
    real, dimension(16,6), intent(in)  :: S

    real, dimension(16) :: U12, U13, U23, &
                           Ui11, Ui12, Ui22, Ui13, Ui23, Ui33

    Ui33 = 1 / sqrt(S(:,6))
    U13  = S(:,4) * Ui33
    U23  = S(:,5) * Ui33
    Ui22 = 1 / sqrt(S(:,3) - U23**2)
    U12  = (S(:,2) - U13*U23) * Ui22
    Ui11 = 1 / sqrt(S(:,1) - U12**2 - U13**2) ! u11
    Ui12 = - U12 * Ui11 * Ui22                ! u12
    Ui13 = - (U13 * Ui11 + U23 * Ui12) * Ui33
    Ui23 = - U23 * Ui22 * Ui33

    Uix(:,1) = Ui11*x(:,1) + Ui12*x(:,2) + Ui13*x(:,3)
    Uix(:,2) = Ui22*x(:,2) + Ui23*x(:,3)
    Uix(:,3) = Ui33*x(:,3)
end subroutine vpdbacksolve
Version 3 manually inlined the function from version 1.
Because the function operates on 16 matrices/vectors at a time, the batch size of 1024 would call a function like the above 64 times.
I also wrote a version 2 in Julia, and an even older version (version 0) where vpdbacksolve is called on one element at a time in the for loop. Theoretically, a compiler should be able to figure out that it can still vectorize that loop (it did not).
I compiled with
Code:
ifort -fast -qopt-zmm-usage=high -ansi-alias -shared -fPIC $FILE -o $INTELSHAREDLIBNAME
gfortran -Ofast -fdisable-tree-cunrolli -march=native -mprefer-vector-width=512 -shared -fPIC $FILE -o $GCCSHAREDLIBNAME
Benchmarking everything from Julia (every run: 0 bytes / 0 allocs estimated, 10000 samples, 0.00% GC):
Code:
benchmark              minimum      median       mean         maximum      evals/sample
julia_version_0!       10.512 μs    10.955 μs    11.196 μs    43.002 μs     1
julia_version_2!        1.401 μs     1.408 μs     1.467 μs     3.543 μs    10
gfortran_version_1!    35.203 μs    35.475 μs    36.543 μs    68.729 μs     1
gfortran_version_2!     1.866 μs     1.875 μs     1.943 μs     5.220 μs    10
gfortran_version_3!     1.423 μs     1.435 μs     1.483 μs     4.720 μs    10
ifort_version_1!        1.523 μs     1.538 μs     1.571 μs     3.683 μs    10
ifort_version_2!      925.719 ns   954.156 ns   986.308 ns     2.030 μs    32
ifort_version_3!      866.052 ns   870.172 ns   898.465 ns     1.527 μs    58
gcc fails to see how profitable it is to inline the function called in the for loop.
It wastes a whole lot of time needlessly copying things around in version 1.
I used the `@inline` macro in Julia's version 2 to get it to inline the called function. Even without it, Julia still did better than gcc:
Code:
julia> @benchmark julia_version_2_noforcedinline!($X32, $BPP32)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.484 μs (0.00% GC)
  median time:      1.496 μs (0.00% GC)
  mean time:        1.548 μs (0.00% GC)
  maximum time:     3.301 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10
...yet somehow Intel was still more than 60% faster. I don't know how much of a role this played:
Code:
call *__svml_invsqrtf16_z0@GOTPCREL(%rip)  #226.9
versus:
Code:
vsqrtps      %zmm0, %zmm0
vbroadcastss (%rax), %zmm1
vdivps       %zmm0, %zmm1, %zmm0
I am definitely impressed by how much ifort was able to improve what I had thought was nearly optimal. I'll have to look closer to get some idea of where all that performance came from.
It's definitely highly subject to how vectorizable the input code is, and probably a bunch of other factors. In the code I write, I tend to put some effort into making sure it can be vectorized, e.g. ensuring good data layouts.
EDIT:
gcc is actually using
Code:
vrsqrt14ps %zmm0, %zmm1{%k1}{z}
Last edited by celrod; 13 November 2018, 08:06 PM.
-
How does gcc/gfortran clang/flang compare to icc/ifort these days? (a coworker claimed Intel's compilers are still "superior")
-
Could you try comparing "-O3 -march=native" with "-O3 -march=native -mprefer-vector-width=512"? I'd like to see the difference actually using AVX-512 makes.