AMD Ryzen AOCC 1.0 Compiler Tuning Benchmarks


  • #11
    bridgman the performance of the compiler is not bad for a 1.0 release (even if it is an LLVM derivative), but as of now, going by the numbers, the compiler doesn't seem advanced enough to warrant switching. Is there a benchmark it excels in at this point? Where should we expect it to make a strong difference against vanilla GCC/Clang, and is there example code for that available? How does its OpenMP runtime compare to libgomp / LLVM's, by the way? From the Ryzen/Naples marketing, it seems threading overhead is among the lowest yet on general-purpose processors (I imagine this will be the same or better on Naples), so I'm wondering if the compiler can exploit this better than the state of the art in GCC/Clang.

    "Also highly optimized libraries, which extracts the optimal performance from each x86 processor core, are used." - I only see libclang, llvm, BugpointPasses, omp, amdlibm, and LTO. Is more than this planned? Anything specific?

    Speaking of LTO, how much of a difference should we really expect to see with LTO and the gold linker plugin supplied in a separate zip? "It depends," I'm sure, but is there a benchmark available?

    Also, what's the release schedule, and what features might be coming "soon"?
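
    As for the OpenMP question, a quick way to compare runtimes yourself is to build the same benchmark against each one. A minimal sketch, assuming a small test program (omp_bench.c is just a placeholder name):

      # libgomp is GCC's runtime; libomp is LLVM's, and presumably what
      # AOCC's bundled omp library is derived from
      gcc   -O2 -fopenmp        omp_bench.c -o bench-gomp
      clang -O2 -fopenmp=libomp omp_bench.c -o bench-libomp
      time ./bench-gomp
      time ./bench-libomp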
    Last edited by nevion; 22 May 2017, 03:31 AM.



    • #12
      -O2, -O3 and the other optimization switches look like a lottery: the compiler can't know the best solution for each piece of a program without profiling it first. You need to build with -fprofile-generate, run the program, then rebuild with -fprofile-use. The compiler should pick the best solution automatically from real measurements instead of from prediction heuristics.
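
      A minimal sketch of that PGO workflow with GCC (app.c and the training input are placeholders):

        gcc -O2 -fprofile-generate app.c -o app   # instrumented build
        ./app typical-input                       # training run writes .gcda profile data
        gcc -O2 -fprofile-use app.c -o app        # rebuild guided by the profile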



      • #13
        Originally posted by indepe View Post
        While in previous comparisons including -O2 and -O3 they often were very close, there used to also be a few tests where -O3 was more than 10% faster. At the same time, there now seem to be fewer cases where -O3 is a bit slower. What happened?

        (Of course, it is still nice to get 1% - 4% improvements by the mere flip of a switch.)
        -O3 is not guaranteed to be better than -O2; anything that is universally better gets moved to -O2. The problem with -O3 is that it increases binary size, so the code might not fit the caches as well anymore, and if you don't get any of the improvements to compensate, it can end up a few percent slower.

        Edit: It would actually be interesting to see if -Os is even faster in those rare cases.
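
        One rough way to check the size side of that, as a sketch (app.c standing in for any real source):

          gcc -O2 app.c -o app-O2
          gcc -O3 app.c -o app-O3
          gcc -Os app.c -o app-Os
          size app-O2 app-O3 app-Os   # compare the .text column; -O3 is typically the largest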
        Last edited by carewolf; 22 May 2017, 05:03 AM.



        • #14
          -Os can indeed be faster in some cases, and as you said, -O2 has all the best optimizations. -O3 is more experimental optimizations that may turn out to be worse or unstable. Only profiling can really make a difference. And I wonder whether, for a large codebase, profiles should be shipped as well.



          • #15
            Originally posted by Marc.2377 View Post
            User mlau has said twice ([1], [2]) that using the -mtune flag together with -march (e.g. -march=znver1 -mtune=haswell) improves performance noticeably. It would be nice if Michael tested this. Anyway, thanks for the updated benchmark.
            Don't expect miracles, though. It's noticeable in benchmarks, but in no way earth-shattering.
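
            For reference, the suggested combination looks like this on the command line (app.c is just a placeholder). -march picks the instruction set to emit, while -mtune only steers the scheduling and cost-model heuristics:

              # Zen instructions, but scheduled with Haswell's tuning tables
              gcc -O3 -march=znver1 -mtune=haswell app.c -o app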



            • #16
              Originally posted by cj.wijtmans View Post
              -Os can indeed be faster in some cases, and as you said, -O2 has all the best optimizations. -O3 is more experimental optimizations that may turn out to be worse or unstable. Only profiling can really make a difference. And I wonder whether, for a large codebase, profiles should be shipped as well.
              There shouldn't be anything unstable or unsafe in -O3; that is reserved for individual flags and -Ofast. -O3 should just increase binary size and compile time.
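
              You can verify that with GCC's option dump; a quick sketch (the exact flag list varies by GCC version):

                gcc -O2 -Q --help=optimizers > O2.txt
                gcc -O3 -Q --help=optimizers > O3.txt
                diff O2.txt O3.txt   # e.g. -ftree-loop-vectorize shows up as enabled only at -O3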



              • #17
                Originally posted by nevion View Post
                bridgman the performance of the compiler is not bad for a 1.0 release (even if it is an LLVM derivative), but as of now, going by the numbers, the compiler doesn't seem advanced enough to warrant switching. Is there a benchmark it excels in at this point? Where should we expect it to make a strong difference against vanilla GCC/Clang, and is there example code for that available? How does its OpenMP runtime compare to libgomp / LLVM's, by the way? From the Ryzen/Naples marketing, it seems threading overhead is among the lowest yet on general-purpose processors (I imagine this will be the same or better on Naples), so I'm wondering if the compiler can exploit this better than the state of the art in GCC/Clang.

                "Also highly optimized libraries, which extracts the optimal performance from each x86 processor core, are used." - I only see libclang, llvm, BugpointPasses, omp, amdlibm, and LTO. Is more than this planned? Anything specific?

                Speaking of LTO, how much of a difference should we really expect to see with LTO and the gold linker plugin supplied in a separate zip? "It depends," I'm sure, but is there a benchmark available?

                Also, what's the release schedule, and what features might be coming "soon"?
                First off, you're right to pay attention to optimized libraries. IMHO, we should consider where the strengths of the Intel C++ Compiler come from, and try to mirror them. Mainly:

                - Hand-crafted math libraries, resulting in highly efficient machine code for numerical algorithms (and thus shining in benchmarks)
                - Performing floating-point optimizations by default which are not allowed by the standards, resulting in even faster numerical code out of the box. In other compilers we have switches to apply such optimizations, while in Intel's we have to explicitly disable them.
                - Hand-optimized standard library implementations - in many cases, this makes more of a difference than code generation itself
                - A rather aggressive CPU dispatcher, which runs the best routines according to which instructions the CPU supports (detected at runtime)
                - And of course, highly efficient machine code generation, although I really feel GCC, Clang/LLVM and especially AOCC have caught up on this point, except perhaps for some particular "less-than-optimal" source code, which I honestly don't care much about.

                Now, about LTO: I always advocate enabling it. At least with GCC it doesn't help much in performance (for my usage scenarios), but it always helps reduce binary size significantly, which is welcome. It used to be tricky under Windows, however; I should try it again one of these days.
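
                For anyone wanting to try it, a minimal sketch of an LTO build with a clang-based compiler such as AOCC (a.c/b.c are placeholders; -fuse-ld=gold is what makes use of the gold plugin shipped in the separate zip, as far as I can tell):

                  clang -O3 -flto -c a.c b.c                      # objects carry LLVM bitcode
                  clang -O3 -flto -fuse-ld=gold a.o b.o -o app    # whole-program optimization at link time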



                Originally posted by carewolf View Post

                -O3 is not guaranteed to be better than -O2; anything that is universally better gets moved to -O2. The problem with -O3 is that it increases binary size, so the code might not fit the caches as well anymore, and if you don't get any of the improvements to compensate, it can end up a few percent slower.

                Edit: It would actually be interesting to see if -Os is even faster in those rare cases.
                I've found that this cache-fit problem hasn't really been the case anymore since GCC 5.3, at least in my limited usage. -O3 enables many time-expensive optimizations which very often result in binaries that are at least as fast as with -O2, and in some cases a lot faster, as you can see in answers to this question on Stack Overflow (that particular case regards the -ftree-vectorize switch). These optimizations end up making more difference in the real world than some localized issues with cache-fitting - and by "localized" I mean "it happens, but it's rare and shouldn't be assumed unless profiling says otherwise". So I regard -O3 as the best option, always, if you can bear the extra compile time, which is rarely a problem.
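
                If anyone wants to check whether the vectorizer actually kicks in on their own code, GCC can report it; a small sketch (app.c is a placeholder):

                  gcc -O3 -fopt-info-vec-optimized app.c -o app   # prints a note for each loop it vectorized
                  # with plain -O2 you'd have to add -ftree-vectorize explicitly to get the same passes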



                Originally posted by cj.wijtmans View Post
                -Os can indeed be faster in some cases. and as you said -O2 has all the best optimizations. -O3 is more experimental optimizations that may turn out to be worse and unstable. Only profiling can really make a difference. And i wonder if a long source code, profiles should be shipped as well.
                Experimental optimizations and instability: in my experience (and not only mine), these are also old news.
                Last edited by Marc.2377; 22 May 2017, 05:10 PM.



                • #18
                  Originally posted by Marc.2377 View Post
                  So I regard -O3 as the best option, always, if you can bear the extra compile time, which is rarely a problem. [...] Experimental optimizations and instability: in my experience (and not only mine), these are also old news.
                  Try compiling a Gentoo desktop with -O3 and come back here after a few days of uptime.



                  • #19
                    Originally posted by cj.wijtmans View Post

                    Try compiling a Gentoo desktop with -O3 and come back here after a few days of uptime.
                    Which one? I've compiled Qt and KDE with -O3 for years.



                    • #20
                      Originally posted by cj.wijtmans View Post

                      Try compiling a Gentoo desktop with -O3 and come back here after a few days of uptime.
                      Don't know about Gentoo, wouldn't use it, but I always compile my kernels (and everything else, including GCC itself and the libraries) with -O3. Everything's as smooth as it should be.

                      Edit: x86_64 only. It may well still be "dangerous" for other archs.
                      Last edited by Marc.2377; 23 May 2017, 04:39 PM.

