Originally posted by celrod
Intel Cascade Lake Xeon Benchmarks With GCC 8 vs. GCC 9 Compilers
Originally posted by hubicka
It would be useful to report this to GCC bugzilla (along with ICC numbers). The vectorizer's cost metrics are far from ideal, and one of the things we plan to work on is making it better at choosing the proper vector width (and of course improving codegen).
But I have to issue a correction on my earlier tests. I was using Clear Linux, and it seems Clear Linux's gcc was patched to behave differently from Godbolt's.
Code:
$ g++ -S -Ofast -march=skylake-avx512 UpdateGroupProbs.cpp -o UpdateGroupProbs.s
$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-generic-linux/9/lto-wrapper
Target: x86_64-generic-linux
Configured with: ../gcc-9.1.0/configure --prefix=/usr --with-pkgversion='Clear Linux OS for Intel Architecture' --libdir=/usr/lib64 --enable-libstdcxx-pch --libexecdir=/usr/lib64 --with-system-zlib --enable-shared --enable-gnu-indirect-function --disable-vtable-verify --enable-threads=posix --enable-__cxa_atexit --enable-plugin --enable-ld=default --enable-clocale=gnu --disable-multiarch --enable-multilib --enable-lto --disable-werror --enable-linker-build-id --build=x86_64-generic-linux --target=x86_64-generic-linux --enable-languages=c,c++,fortran,go --enable-bootstrap --with-ppl=yes --with-isl --includedir=/usr/include --exec-prefix=/usr --with-glibc-version=2.19 --disable-libunwind-exceptions --with-gnu-ld --with-tune=haswell --with-arch=westmere --enable-cet --disable-libmpx --with-gcc-major-version-only --enable-default-pie
Thread model: posix
gcc version 9.1.1 20190503 gcc-9-branch@270849 (Clear Linux OS for Intel Architecture)
The function in the Godbolt link was the slowest part of a Gibbs sampler I wrote for my dissertation. It was called billions of times (4000 samples/fit * 36 fits / dataset * >17000 datasets), with N ranging from the hundreds to the thousands, and "K" (outer loop iterations, hard-coded to 6 in the above example) ranged from 2 to 16.
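To put "billions of times" into a concrete number, here is a small arithmetic sketch. It assumes one call to the function per sample, and it uses 17000 as the dataset count even though the post says ">17000", so the result is a lower bound.

```python
# Lower-bound estimate of how often the function was called.
# Assumption: one call per sample; the post gives ">17000" datasets,
# so 17_000 is used here as a floor.
samples_per_fit = 4000
fits_per_dataset = 36
datasets = 17_000

total_calls = samples_per_fit * fits_per_dataset * datasets
print(f"at least {total_calls:,} calls")
```

With these figures the product is a bit under 2.5 billion calls, before accounting for the inner work scaling with N and K.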
Benchmarking it from Julia with -mprefer-vector-width=256 vs. -mprefer-vector-width=512, with N=2827 (I could provide all the code to reproduce this example):
Code:
julia> @benchmark update_group_probs_cxx256!($indiv_probs2, $base_p, $revcholwisharts, $X, $Nv)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.150 μs (0.00% GC)
  median time:      21.249 μs (0.00% GC)
  mean time:        21.476 μs (0.00% GC)
  maximum time:     43.230 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark update_group_probs_cxx512!($indiv_probs2, $base_p, $revcholwisharts, $X, $Nv)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.216 μs (0.00% GC)
  median time:      14.313 μs (0.00% GC)
  mean time:        14.400 μs (0.00% GC)
  maximum time:     43.261 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
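For reference, the ratio between the two timings can be worked out directly from the numbers above. This is just arithmetic on the reported minimum, median, and mean times; the dictionary names are my own.

```python
# Times in microseconds, copied from the two BenchmarkTools trials above.
t256 = {"minimum": 21.150, "median": 21.249, "mean": 21.476}
t512 = {"minimum": 14.216, "median": 14.313, "mean": 14.400}

# Speedup of the 512-bit build relative to the 256-bit build.
speedup = {stat: t256[stat] / t512[stat] for stat in t256}

for stat, ratio in speedup.items():
    print(f"{stat:>7}: 512-bit build is {ratio:.2f}x faster")
```

All three statistics agree closely, so for this function at N=2827 the 512-bit build is a little under 1.5x faster.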
After the Fortran benchmark runs, I'll file an issue on gcc's bugzilla.
-mprefer-vector-width=256 seems to outperform -mprefer-vector-width=512 on the Polyhedron benchmarks.
I reran the benchmarks, now with exactly 10 runs each time. I believe the first is discarded, so the averages are of 9 runs. I believe these should be more accurate than the earlier results, besides taking less time to run.
gfortran with 512 bit vectors:
Code:
$ cat gfortran512.sum
================================================================================
Date & Time     : 5 May 2019 13:04:09
Test Name       : prefer512
Compile Command : gfortran -Ofast -march=native -mprefer-vector-width=512 %n.f90 -o %n
Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
Maximum Times   : 10000.0
Target Error %  : 0.100
Minimum Repeats : 10
Maximum Repeats : 10

Benchmark    Compile Executable Ave Run  Number   Estim
Name          (secs)    (bytes)  (secs) Repeats   Err %
---------    ------- ---------- ------- -------  ------
ac              0.37      43176    5.66      10  0.0342
aermod         12.73    1089648    6.01      10  0.1469
air             2.87     152536    2.11      10  0.7105
capacita        1.21     106472   11.80      10  0.0697
channel2        0.20      31216   57.66      10  0.0790
doduc           2.03     165240    7.31      10  0.1554
fatigue2        0.84      77928   46.47      10  0.2654
gas_dyn2        1.03      95584   52.11      10  0.1648
induct2         1.67     175232   19.54      10  0.1429
linpk           0.25      34928    2.18      10  0.1739
mdbx            0.82      90856    3.83      10  0.0872
mp_prop_desi    0.39      47944   53.32      10  1.3054
nf              0.40      47600    4.32      10  0.2134
protein         1.04      95112   13.84      10  0.1219
rnflow          1.67     115416   14.64      10  0.1113
test_fpu2       1.24      93608   19.18      10  0.1698
tfft2           0.47      47520   24.78      10  0.1479

Geometric Mean Execution Time = 12.24 seconds
Code:
$ cat gfortran256.sum
================================================================================
Date & Time     : 5 May 2019 14:02:05
Test Name       : prefer256
Compile Command : gfortran -Ofast -march=native -mprefer-vector-width=256 %n.f90 -o %n
Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
Maximum Times   : 10000.0
Target Error %  : 0.100
Minimum Repeats : 10
Maximum Repeats : 10

Benchmark    Compile Executable Ave Run  Number   Estim
Name          (secs)    (bytes)  (secs) Repeats   Err %
---------    ------- ---------- ------- -------  ------
ac              0.37      43176    4.92      10  0.0495
aermod         12.19    1069064    5.50      10  0.0918
air             2.02     119768    1.96      10  0.3457
capacita        1.02      85992   11.31      10  0.1991
channel2        0.19      31216   60.59      10  0.1206
doduc           2.00     157048    7.13      10  0.1736
fatigue2        0.78      73832   46.40      10  0.8044
gas_dyn2        0.90      87392   52.02      10  0.1442
induct2         1.70     175232   19.79      10  0.1299
linpk           0.23      30832    2.40      10  0.1288
mdbx            0.75      82664    3.81      10  0.1003
mp_prop_desi    0.35      43848   53.21      10  1.3351
nf              0.34      43504    4.30      10  0.1648
protein         0.92      86920   13.85      10  0.1292
rnflow          1.51      99032   13.49      10  0.1446
test_fpu2       1.07      77224   20.80      10  0.0451
tfft2           0.37      39328   24.69      10  0.2454

Geometric Mean Execution Time = 12.08 seconds
================================================================================
Code:
$ cat ifort.sum
================================================================================
Date & Time     : 5 May 2019 11:10:44
Test Name       : ifort
Compile Command : ifort -fast -qopt-zmm-usage=high %n.f90 -o %n
Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
Maximum Times   : 10000.0
Target Error %  : 0.100
Minimum Repeats : 10
Maximum Repeats : 10

Benchmark    Compile Executable Ave Run  Number   Estim
Name          (secs)    (bytes)  (secs) Repeats   Err %
---------    ------- ---------- ------- -------  ------
ac              0.72    8129584    4.08      10  0.0418
aermod         17.07   10420728    6.51      10  0.1651
air             4.02    8406208    1.52      10  0.1871
capacita        1.70    8233712   10.19      10  0.3839
channel2        0.46    8156672   55.36      10  0.0736
doduc           2.39    8329416    7.32      10  0.2329
fatigue2        2.08    8444576   47.82      10  0.0126
gas_dyn2        1.49    8216560   24.54      10  0.2795
induct2         4.84    8760336   21.59      10  0.2213
linpk           0.46    8071840    2.15      10  0.2217
mdbx            1.67    8186544    2.66      10  0.0806
mp_prop_desi    1.06    8529808   44.78      10  1.7498
nf              0.86    8232200    4.37      10  0.1005
protein         5.07    8411944   15.78      10  0.1087
rnflow         24.07    8418128    7.69      10  0.0717
test_fpu2       3.00    8396336   16.20      10  0.1125
tfft2           0.61    8139256   41.81      10  0.1581

Geometric Mean Execution Time = 10.83 seconds
================================================================================
Code:
$ cat ifortzmmlow.sum
================================================================================
Date & Time     : 5 May 2019 12:11:32
Test Name       : ifortzmmlow
Compile Command : ifort -fast -qopt-zmm-usage=low %n.f90 -o %n
Benchmarks      : ac aermod air capacita channel2 doduc fatigue2 gas_dyn2 induct2 linpk mdbx mp_prop_design nf protein rnflow test_fpu2 tfft2
Maximum Times   : 10000.0
Target Error %  : 0.100
Minimum Repeats : 10
Maximum Repeats : 10

Benchmark    Compile Executable Ave Run  Number   Estim
Name          (secs)    (bytes)  (secs) Repeats   Err %
---------    ------- ---------- ------- -------  ------
ac              0.85    8125488    3.37      10  0.1828
aermod         15.86   10319112    5.93      10  0.1404
air             3.97    8416280    1.44      10  0.3119
capacita        1.56    8226744    9.49      10  0.1397
channel2        0.44    8160784   57.27      10  0.0895
doduc           2.20    8316928    6.37      10  0.1760
fatigue2        2.06    8440376   40.10      10  0.0994
gas_dyn2        1.47    8204072   24.29      10  0.1562
induct2         4.81    8760336   24.66      10  0.8962
linpk           0.42    8067544    2.28      10  0.4775
mdbx            1.17    8153576    3.49      10  0.0862
mp_prop_desi    1.15    8549000   44.21      10  1.5834
nf              0.87    8227904    4.22      10  0.0494
protein         5.15    8412664   14.37      10  0.0801
rnflow         23.86    8401544    7.49      10  0.3231
test_fpu2       3.01    8383952   17.74      10  0.1773
tfft2           0.60    8135840   41.03      10  0.2156

Geometric Mean Execution Time = 10.62 seconds
================================================================================
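As a sanity check on the two gfortran summaries, the geometric means can be recomputed from the "Ave Run" columns. The sketch below does that and also lists which individual benchmarks ran faster with 512-bit vectors. The variable and function names are my own, and because the inputs are rounded to two decimals, the recomputed means should only be expected to match the reported 12.24 and 12.08 seconds approximately.

```python
import math

# Benchmark names and average run times (seconds), copied from the
# prefer512 and prefer256 gfortran summaries above, in the same order.
names = ["ac", "aermod", "air", "capacita", "channel2", "doduc",
         "fatigue2", "gas_dyn2", "induct2", "linpk", "mdbx",
         "mp_prop_design", "nf", "protein", "rnflow", "test_fpu2", "tfft2"]
prefer512 = [5.66, 6.01, 2.11, 11.80, 57.66, 7.31, 46.47, 52.11, 19.54,
             2.18, 3.83, 53.32, 4.32, 13.84, 14.64, 19.18, 24.78]
prefer256 = [4.92, 5.50, 1.96, 11.31, 60.59, 7.13, 46.40, 52.02, 19.79,
             2.40, 3.81, 53.21, 4.30, 13.85, 13.49, 20.80, 24.69]


def geometric_mean(times):
    """Geometric mean via the mean of logs (avoids a huge product)."""
    return math.exp(sum(math.log(t) for t in times) / len(times))


gm512 = geometric_mean(prefer512)
gm256 = geometric_mean(prefer256)
print(f"prefer512 geometric mean: {gm512:.2f} s")
print(f"prefer256 geometric mean: {gm256:.2f} s")
print(f"256-bit / 512-bit ratio:  {gm256 / gm512:.3f}")

# Benchmarks where the 512-bit build had the lower average run time.
faster_with_512 = [n for n, t512, t256 in zip(names, prefer512, prefer256)
                   if t512 < t256]
print("faster with 512-bit vectors:", ", ".join(faster_with_512))
```

The overall gap is small (on the order of one percent), and it is not uniform: a handful of benchmarks still favor 512-bit vectors, while several others differ by less than their estimated error.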
I'd expect llvm to be much more vulnerable to that, since it seems to unroll much more than gcc (based on looking at asm of functions like dot products).
Flang performed exactly the same with `-mprefer-vector-width=256` and `-mprefer-vector-width=512`. A brief test suggests it is ignoring the argument (same asm produced).
Last edited by celrod; 06 May 2019, 01:20 PM.