All three versions are doing the exact same numerical calculations on the exact same input (calculating the product of a vector and the Cholesky decomposition of the inverse of a 3x3 matrix for each of 1024 matrices and vectors).
All versions loop over the data to do these calculations.
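To make the calculation concrete, here is a rough scalar reference in Julia for a single matrix/vector pair (an illustrative helper of mine, not one of the benchmarked versions):
Code:
using LinearAlgebra

# Illustrative scalar reference for one 3x3 SPD matrix S and vector x
# (name and layout are mine, not taken from any of the benchmarked versions).
# Ui is the upper triangular factor with inv(S) == Ui' * Ui; the result is Ui * x.
# The vectorized kernels below compute the same thing without ever forming inv(S).
function pdbacksolve_reference(x::AbstractVector, S::AbstractMatrix)
    Ui = cholesky(Symmetric(inv(S))).U
    return Ui * x
end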
Version 1 calls another function that works on blocks from the inputs:
Code:
subroutine vpdbacksolve(Uix, x, S)
    real, dimension(16,3), intent(out) :: Uix
    real, dimension(16,3), intent(in)  :: x
    real, dimension(16,6), intent(in)  :: S

    real, dimension(16) :: U12, U13, U23, &
                           Ui11, Ui12, Ui22, Ui13, Ui23, Ui33

    Ui33 = 1 / sqrt(S(:,6))
    U13  = S(:,4) * Ui33
    U23  = S(:,5) * Ui33
    Ui22 = 1 / sqrt(S(:,3) - U23**2)
    U12  = (S(:,2) - U13*U23) * Ui22
    Ui11 = 1 / sqrt(S(:,1) - U12**2 - U13**2) ! u11
    Ui12 = - U12 * Ui11 * Ui22                ! u12
    Ui13 = - (U13 * Ui11 + U23 * Ui12) * Ui33
    Ui23 = - U23 * Ui22 * Ui33

    Uix(:,1) = Ui11*x(:,1) + Ui12*x(:,2) + Ui13*x(:,3)
    Uix(:,2) = Ui22*x(:,2) + Ui23*x(:,3)
    Uix(:,3) = Ui33*x(:,3)

end subroutine vpdbacksolve
Version 3 manually inlined the function from version 1.
Because the function operates on 16 matrices/vectors at a time, the batch size of 1024 would call a function like the above 64 times.
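Schematically, the blocked loop in version 1 does something like this (a Julia sketch of the call pattern, not the actual Fortran driver; vpdbacksolve! stands in for a port of the subroutine above):
Code:
# Sketch of the version 1 call pattern: 1024 rows processed as 64 blocks of 16.
# Uix and x are assumed to be 1024x3, S to be 1024x6; vpdbacksolve! is the
# 16-wide kernel from above (hypothetical Julia port).
function blocked_version1!(Uix, x, S)
    for b in 0:63
        rows = (1:16) .+ 16b
        vpdbacksolve!(view(Uix, rows, :), view(x, rows, :), view(S, rows, :))
    end
    return Uix
end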
I also wrote version 2 in Julia, as well as an even older Julia version (version 0) in which vpdbacksolve is called on one element at a time inside the for loop. In theory, a compiler should be able to figure out that it can still vectorize that loop (it did not).
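The version 0 pattern is just a scalar loop, roughly like this (again a sketch, using the illustrative scalar helper from above; the real code uses flat arrays, but the call pattern is the point):
Code:
# Version-0-style loop (sketch): one 3x3 matrix and one length-3 vector per iteration.
# xs and Ss are assumed to be length-1024 vectors of 3-vectors and 3x3 matrices.
function scalar_version0!(out, xs, Ss)
    for i in eachindex(xs, Ss)
        out[i] = pdbacksolve_reference(xs[i], Ss[i])   # nothing here is explicitly 16-wide
    end
    return out
end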
I compiled with
Code:
ifort -fast -qopt-zmm-usage=high -ansi-alias -shared -fPIC $FILE -o $INTELSHAREDLIBNAME
gfortran -Ofast -fdisable-tree-cunrolli -march=native -mprefer-vector-width=512 -shared -fPIC $FILE -o $GCCSHAREDLIBNAME
Benchmarking everything from Julia:
Code:
julia> @benchmark julia_version_0!($X32, $BPP32)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.512 μs (0.00% GC)
  median time:      10.955 μs (0.00% GC)
  mean time:        11.196 μs (0.00% GC)
  maximum time:     43.002 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark julia_version_2!($X32, $BPP32)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.401 μs (0.00% GC)
  median time:      1.408 μs (0.00% GC)
  mean time:        1.467 μs (0.00% GC)
  maximum time:     3.543 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark gfortran_version_1!($X32, $BPP32, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     35.203 μs (0.00% GC)
  median time:      35.475 μs (0.00% GC)
  mean time:        36.543 μs (0.00% GC)
  maximum time:     68.729 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gfortran_version_2!($X32, $BPP32, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.866 μs (0.00% GC)
  median time:      1.875 μs (0.00% GC)
  mean time:        1.943 μs (0.00% GC)
  maximum time:     5.220 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark gfortran_version_3!($X32, $BPP32, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.423 μs (0.00% GC)
  median time:      1.435 μs (0.00% GC)
  mean time:        1.483 μs (0.00% GC)
  maximum time:     4.720 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark ifort_version_1!($X32, $BPP32, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.523 μs (0.00% GC)
  median time:      1.538 μs (0.00% GC)
  mean time:        1.571 μs (0.00% GC)
  maximum time:     3.683 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark ifort_version_2!($X32, $BPP32, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     925.719 ns (0.00% GC)
  median time:      954.156 ns (0.00% GC)
  mean time:        986.308 ns (0.00% GC)
  maximum time:     2.030 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     32

julia> @benchmark ifort_version_3!($X32, $BPP32, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     866.052 ns (0.00% GC)
  median time:      870.172 ns (0.00% GC)
  mean time:        898.465 ns (0.00% GC)
  maximum time:     1.527 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     58
gfortran's version 1 comes in around 35 μs, roughly 25x slower than its own manually inlined version 3 (ifort handles the same version 1 call pattern in about 1.5 μs): gcc fails to see how profitable it is to just inline the function being called in the for loop, and wastes a whole lot of time needlessly copying data around in version 1.
I used the `@inline` macro in Julia's version 2 to get it to inline the called function. Even without forcing the inline, Julia still did better than gcc:
Code:
julia> @benchmark julia_version_2_noforcedinline!($X32, $BPP32)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.484 μs (0.00% GC)
  median time:      1.496 μs (0.00% GC)
  mean time:        1.548 μs (0.00% GC)
  maximum time:     3.301 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10
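For completeness, the forced inlining in the Julia version is just an annotation on the kernel definition (a sketch only; the actual Julia source isn't reproduced here):
Code:
# Sketch: @inline asks Julia to inline the 16-wide kernel at its call sites.
# As the timings above show, Julia vectorized reasonably well even without it,
# but the annotation guarantees the call is inlined.
@inline function vpdbacksolve!(Uix, x, S)
    # ... same arithmetic as the Fortran subroutine above, on 16-row slices ...
    return Uix
end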
...yet somehow Intel was still more than 60% faster. I don't know how much of a role this played — ifort calls SVML's vectorized inverse square root on 16 floats at a time:
Code:
call *__svml_invsqrtf16_z0@GOTPCREL(%rip) #226.9
while gcc's output appeared to compute a full square root and then divide:
Code:
vsqrtps       %zmm0, %zmm0
vbroadcastss  (%rax), %zmm1
vdivps        %zmm0, %zmm1, %zmm0
I am definitely impressed by how much ifort was able to improve on what I had thought was nearly optimal code. I'll have to look closer to get some idea of where all that extra performance came from.
The size of the difference is definitely highly dependent on how vectorizable the input code is, and probably on a bunch of other factors as well. In the code I write, I tend to put some effort into making sure it can be vectorized, e.g. by ensuring good data layouts.
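The kernel above illustrates one such layout choice: rather than storing 1024 separate 3x3 matrices, the six unique entries of each symmetric matrix are laid out as columns across the whole batch, so each column is contiguous and maps directly onto SIMD loads. A rough Julia illustration of the two layouts (my example; the packed column order is read off the kernel, and I'm assuming BPP32 holds something like the second form):
Code:
# Array-of-structs: a vector of 1024 individual 3x3 SPD matrices. The entries
# belonging to one position of the batch are scattered in memory, which is
# awkward to vectorize across matrices.
aos = [let A = rand(Float32, 3, 3); A'A; end for _ in 1:1024]

# Struct-of-arrays (the layout the kernel above expects): column j stores the
# j-th packed entry (S11, S12, S22, S13, S23, S33) for all 1024 matrices,
# so S[:, j] is contiguous and SIMD-friendly.
soa = Matrix{Float32}(undef, 1024, 6)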
EDIT:
gcc is actually using
Code:
vrsqrt14ps %zmm0, %zmm1{%k1}{z}
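For context: vrsqrt14ps is the AVX-512 approximate reciprocal square root, with relative error below 2^-14, so a compiler that uses it will normally tack on a Newton-Raphson refinement step rather than emitting a full vsqrtps plus vdivps. The refinement looks roughly like this (illustrative, not gcc's actual output):
Code:
# Illustrative Newton-Raphson refinement of a hardware reciprocal-square-root estimate.
# r0 ≈ 1/sqrt(x) with ~14 good bits (a vrsqrt14ps-style estimate); one step roughly
# doubles the number of correct bits, which is enough for Float32.
refine_rsqrt(x::Float32, r0::Float32) = r0 * (1.5f0 - 0.5f0 * x * r0 * r0)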