GCC 7.0 vs. LLVM Clang 4.0 Performance (January 2017)
-
Michael Larabel
https://www.michaellarabel.com/
-
Originally posted by Steffo
It seems GCC will soon be irrelevant. All new tools are based on LLVM, and now it can beat GCC in many benchmarks.
-
One thing that could explain why Clang performs so much better than GCC in some cases (by factors of two or three) is dependency chains and loop unrolling. One example of a dependency chain is a reduction sum:
for(int i=0; i<n; i++) sum += x;
Clang will unroll the loop into four independent sums (for float you need -Ofast), whereas GCC does not unroll at all. The problem is that without unrolling the loop is latency-bound: each addition must wait for the previous one to finish. You have to unroll the loop into independent operations to get the full throughput. ICC, incidentally, unrolls twice from what I can tell.
The option `-funroll-loops` unrolls the loop eight times but does not break the dependency chain, which is really stupid. This is one of the main reasons GCC's auto-vectorization is not good enough and why you still need intrinsics with GCC.
-
Originally posted by atomsymbol
In my opinion, from a certain viewpoint both gcc and clang are flawed.
Code:
$ cat a.c
#include <stddef.h>
#include <stdlib.h>

int main() {
    float sum = random();
    const float x = random();
    const size_t n = random();
    for (size_t i = 0; i < n; i++)
        sum += x;
    return sum;
}

$ clang --version
clang version 3.9.1 (tags/RELEASE_391/final)
Target: x86_64-pc-linux-gnu

$ clang -O3 -S a.c
$ cat a.s
.LBB0_9:
        addss   %xmm0, %xmm1
        addss   %xmm0, %xmm1
        addss   %xmm0, %xmm1
        addss   %xmm0, %xmm1
        addss   %xmm0, %xmm1
        addss   %xmm0, %xmm1
        addss   %xmm0, %xmm1
        addss   %xmm0, %xmm1
        addq    $8, %rdx
        jne     .LBB0_9

$ clang -O3 -S a.c -march=native    (or: -march=bdver3)
$ cat a.s
.LBB0_4:
        vaddss  %xmm1, %xmm0, %xmm1
        decq    %rax
        jne     .LBB0_4
-
Originally posted by Michael
With PTS7 git: `phoronix-test-suite winners-and-losers <result file>`. If there are any other metrics/stats you would find interesting, I can easily add them. If it ends up adding more stuff, I will probably rename it from winners-and-losers.
Last edited by JakubJelinek; 01 February 2017, 09:31 AM.
-
Originally posted by JakubJelinek
Michael, it seems e.g. EP-DGEMM spends pretty much all of its runtime in the dgemm_ function from libblas. Do you use different libblas versions (built by different compilers) for this testing?
-
Originally posted by zboson
One thing that could explain why Clang performs so much better than GCC in some cases (by factors of two or three) is dependency chains and loop unrolling. One example of a dependency chain is a reduction sum:
for(int i=0; i<n; i++) sum += x;
Clang will unroll the loop into four independent sums (for float you need -Ofast), whereas GCC does not unroll at all. The problem is that without unrolling the loop is latency-bound. You have to unroll the loop into independent operations to get the full throughput. ICC, incidentally, unrolls twice from what I can tell.
The option `-funroll-loops` unrolls the loop eight times but does not break the dependency chain, which is really stupid. This is one of the main reasons GCC's auto-vectorization is not good enough and why you still need intrinsics with GCC.
(It would be even nicer if the testing infrastructure also supported PGO. Apple is clearly moving towards making this more-or-less automatic in Xcode, though only the outline is there right now, not the full picture. Maybe at this year's WWDC? Basically, along with your standard unit tests and performance tests run after each compile, you'll be able to provide "training" tests that create profiling data for subsequent PGO. I'd guess Dev Studio already has a workflow like this. Eclipse probably does too, but I have no idea.)
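For reference, the manual PGO cycle the automation would wrap looks roughly like this with clang (the flag names are clang's documented ones; the file names and training input are placeholders; GCC's equivalent flags are -fprofile-generate / -fprofile-use):

```shell
# 1. Build an instrumented binary that records execution counts.
clang -O2 -fprofile-instr-generate prog.c -o prog

# 2. Run the "training" workload; this writes default.profraw.
./prog < training-input

# 3. Merge the raw profile into an indexed profile.
llvm-profdata merge -output=prog.profdata default.profraw

# 4. Rebuild, letting the optimizer use the measured profile.
clang -O2 -fprofile-instr-use=prog.profdata prog.c -o prog
```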