All three versions are doing the exact same numerical calculations on the exact same input (calculating the product of a vector and the Cholesky decomposition of the inverse of a 3x3 matrix for each of 1024 matrices and vectors).
All versions loop over the data to do these calculations.
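To make the calculation concrete, here is a rough scalar reference in Julia for a single matrix/vector pair (an illustrative helper of mine, not one of the benchmarked versions):
Code:
using LinearAlgebra

# Illustrative scalar reference for one 3x3 SPD matrix S and vector x
# (name and layout are mine, not taken from any of the benchmarked versions).
# Ui is the upper triangular factor with inv(S) == Ui' * Ui; the result is Ui * x.
# The vectorized kernels below compute the same thing without ever forming inv(S).
function pdbacksolve_reference(x::AbstractVector, S::AbstractMatrix)
    Ui = cholesky(Symmetric(inv(S))).U
    return Ui * x
end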
Version 1 calls another function that works on blocks from the inputs:
Code:
subroutine vpdbacksolve(Uix, x, S)
    real, dimension(16,3), intent(out) :: Uix
    real, dimension(16,3), intent(in)  :: x
    real, dimension(16,6), intent(in)  :: S

    real, dimension(16) :: U12, U13, U23, &
                           Ui11, Ui12, Ui22, Ui13, Ui23, Ui33

    Ui33 = 1 / sqrt(S(:,6))
    U13  = S(:,4) * Ui33
    U23  = S(:,5) * Ui33
    Ui22 = 1 / sqrt(S(:,3) - U23**2)
    U12  = (S(:,2) - U13*U23) * Ui22
    Ui11 = 1 / sqrt(S(:,1) - U12**2 - U13**2) ! u11
    Ui12 = - U12 * Ui11 * Ui22                ! u12
    Ui13 = - (U13 * Ui11 + U23 * Ui12) * Ui33
    Ui23 = - U23 * Ui22 * Ui33

    Uix(:,1) = Ui11*x(:,1) + Ui12*x(:,2) + Ui13*x(:,3)
    Uix(:,2) = Ui22*x(:,2) + Ui23*x(:,3)
    Uix(:,3) = Ui33*x(:,3)

end subroutine vpdbacksolve
Version 3 manually inlined the function from version 1.
Because the function operates on 16 matrices/vectors at a time, the batch size of 1024 would call a function like the above 64 times.
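Schematically, the blocked loop in version 1 does something like this (a Julia sketch of the call pattern, not the actual Fortran driver; vpdbacksolve! stands in for a port of the subroutine above):
Code:
# Sketch of the version 1 call pattern: 1024 rows processed as 64 blocks of 16.
# Uix and x are assumed to be 1024x3, S to be 1024x6; vpdbacksolve! is the
# 16-wide kernel from above (hypothetical Julia port).
function blocked_version1!(Uix, x, S)
    for b in 0:63
        rows = (1:16) .+ 16b
        vpdbacksolve!(view(Uix, rows, :), view(x, rows, :), view(S, rows, :))
    end
    return Uix
end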
I also wrote version 2 in Julia, as well as an even older Julia version (version 0) in which vpdbacksolve is called on one element at a time inside the for loop. In theory, a compiler should be able to figure out that it can still vectorize that loop (it did not).
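The version 0 pattern is just a scalar loop, roughly like this (again a sketch, using the illustrative scalar helper from above; the real code uses flat arrays, but the call pattern is the point):
Code:
# Version-0-style loop (sketch): one 3x3 matrix and one length-3 vector per iteration.
# xs and Ss are assumed to be length-1024 vectors of 3-vectors and 3x3 matrices.
function scalar_version0!(out, xs, Ss)
    for i in eachindex(xs, Ss)
        out[i] = pdbacksolve_reference(xs[i], Ss[i])   # nothing here is explicitly 16-wide
    end
    return out
end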
I compiled with
Code:
ifort -fast -qopt-zmm-usage=high -ansi-alias -shared -fPIC $FILE -o $INTELSHAREDLIBNAME
gfortran -Ofast -fdisable-tree-cunrolli -march=native -mprefer-vector-width=512 -shared -fPIC $FILE -o $GCCSHAREDLIBNAME
Benchmarking everything from Julia:
Code:
julia> @benchmark julia_version_0!($X32, $BPP32)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.512 μs (0.00% GC)
  median time:      10.955 μs (0.00% GC)
  mean time:        11.196 μs (0.00% GC)
  maximum time:     43.002 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark julia_version_2!($X32, $BPP32)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.401 μs (0.00% GC)
  median time:      1.408 μs (0.00% GC)
  mean time:        1.467 μs (0.00% GC)
  maximum time:     3.543 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark gfortran_version_1!($X32, $BPP32, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     35.203 μs (0.00% GC)
  median time:      35.475 μs (0.00% GC)
  mean time:        36.543 μs (0.00% GC)
  maximum time:     68.729 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gfortran_version_2!($X32, $BPP32, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.866 μs (0.00% GC)
  median time:      1.875 μs (0.00% GC)
  mean time:        1.943 μs (0.00% GC)
  maximum time:     5.220 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark gfortran_version_3!($X32, $BPP32, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.423 μs (0.00% GC)
  median time:      1.435 μs (0.00% GC)
  mean time:        1.483 μs (0.00% GC)
  maximum time:     4.720 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark ifort_version_1!($X32, $BPP32, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.523 μs (0.00% GC)
  median time:      1.538 μs (0.00% GC)
  mean time:        1.571 μs (0.00% GC)
  maximum time:     3.683 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark ifort_version_2!($X32, $BPP32, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     925.719 ns (0.00% GC)
  median time:      954.156 ns (0.00% GC)
  mean time:        986.308 ns (0.00% GC)
  maximum time:     2.030 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     32

julia> @benchmark ifort_version_3!($X32, $BPP32, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     866.052 ns (0.00% GC)
  median time:      870.172 ns (0.00% GC)
  mean time:        898.465 ns (0.00% GC)
  maximum time:     1.527 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     58
gfortran's version 1 comes in around 35 μs, roughly 25x slower than its own manually inlined version 3 (ifort handles the same version 1 call pattern in about 1.5 μs): gcc fails to see how profitable it is to just inline the function being called in the for loop, and wastes a whole lot of time needlessly copying data around in version 1.
I used the `@inline` macro in Julia's version 2 to get it to inline the called function. Even without forcing the inline, Julia still did better than gcc:
Code:
julia> @benchmark julia_version_2_noforcedinline!($X32, $BPP32)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.484 μs (0.00% GC)
  median time:      1.496 μs (0.00% GC)
  mean time:        1.548 μs (0.00% GC)
  maximum time:     3.301 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10
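For completeness, the forced inlining in the Julia version is just an annotation on the kernel definition (a sketch only; the actual Julia source isn't reproduced here):
Code:
# Sketch: @inline asks Julia to inline the 16-wide kernel at its call sites.
# As the timings above show, Julia vectorized reasonably well even without it,
# but the annotation guarantees the call is inlined.
@inline function vpdbacksolve!(Uix, x, S)
    # ... same arithmetic as the Fortran subroutine above, on 16-row slices ...
    return Uix
end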
...yet somehow Intel was still more than 60% faster. I don't know how much of a role this played — ifort calls SVML's vectorized inverse square root on 16 floats at a time:
Code:
call *__svml_invsqrtf16_z0@GOTPCREL(%rip) #226.9
while gcc's output appeared to compute a full square root and then divide:
Code:
vsqrtps       %zmm0, %zmm0
vbroadcastss  (%rax), %zmm1
vdivps        %zmm0, %zmm1, %zmm0
I am definitely impressed by how much ifort was able to improve on what I had thought was nearly optimal code. I'll have to look closer to get some idea of where all that extra performance came from.
The size of the difference is definitely highly dependent on how vectorizable the input code is, and probably on a bunch of other factors as well. In the code I write, I tend to put some effort into making sure it can be vectorized, e.g. by ensuring good data layouts.
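The kernel above illustrates one such layout choice: rather than storing 1024 separate 3x3 matrices, the six unique entries of each symmetric matrix are laid out as columns across the whole batch, so each column is contiguous and maps directly onto SIMD loads. A rough Julia illustration of the two layouts (my example; the packed column order is read off the kernel, and I'm assuming BPP32 holds something like the second form):
Code:
# Array-of-structs: a vector of 1024 individual 3x3 SPD matrices. The entries
# belonging to one position of the batch are scattered in memory, which is
# awkward to vectorize across matrices.
aos = [let A = rand(Float32, 3, 3); A'A; end for _ in 1:1024]

# Struct-of-arrays (the layout the kernel above expects): column j stores the
# j-th packed entry (S11, S12, S22, S13, S23, S33) for all 1024 matrices,
# so S[:, j] is contiguous and SIMD-friendly.
soa = Matrix{Float32}(undef, 1024, 6)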
EDIT:
gcc is actually using
Code:
vrsqrt14ps %zmm0, %zmm1{%k1}{z}
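For context: vrsqrt14ps is the AVX-512 approximate reciprocal square root, with relative error below 2^-14, so a compiler that uses it will normally tack on a Newton-Raphson refinement step rather than emitting a full vsqrtps plus vdivps. The refinement looks roughly like this (illustrative, not gcc's actual output):
Code:
# Illustrative Newton-Raphson refinement of a hardware reciprocal-square-root estimate.
# r0 ≈ 1/sqrt(x) with ~14 good bits (a vrsqrt14ps-style estimate); one step roughly
# doubles the number of correct bits, which is enough for Float32.
refine_rsqrt(x::Float32, r0::Float32) = r0 * (1.5f0 - 0.5f0 * x * r0 * r0)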