GCC 9 Compiler Tuning Benchmarks On Intel Skylake AVX-512
If you are an application developer and want to use LTO without suffering from the gigantic link times, you can help it a bit by using clang with thinlto + caching, fyi.
Michael, I was wondering: do the geometric mean results include the compilation time results, or only the application performance results?
It includes all benchmark results. (Of course, for lower-is-better results the numbers are inverted, so that all data points are higher-is-better (HIB) and the geometric mean makes sense.)
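For readers who want to see what that inversion amounts to, here is a minimal sketch. The function names and the sample numbers are made up for illustration, and this is not necessarily how the Phoronix Test Suite computes its chart; it only shows the idea of turning lower-is-better values into higher-is-better ones before taking a geometric mean.

```python
import math


def normalize_hib(value, lower_is_better):
    """Return a higher-is-better score; lower-is-better values are inverted."""
    if value <= 0:
        raise ValueError("benchmark results must be positive")
    return 1.0 / value if lower_is_better else value


def geometric_mean(scores):
    """Geometric mean computed through logarithms to avoid overflow."""
    scores = list(scores)
    if not scores:
        raise ValueError("need at least one score")
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Hypothetical results: (name, value, lower_is_better). Not real benchmark data.
results = [
    ("compile_time_s", 50.0, True),
    ("runtime_s", 2.0, True),
    ("throughput_ops", 400.0, False),
]

overall = geometric_mean(normalize_hib(v, lib) for _, v, lib in results)
print(f"geometric mean (all HIB): {overall:.4f}")
```

Because every score enters the same product, a slower compile pulls the mean down just as much as a proportionally slower runtime.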
It is to be expected that more in-depth optimisation takes longer to perform. Including the compilation time results in the mean disguises the benefits being realised from these optimisations.
Could the compilation timing results be excluded from the geometric mean, or could there be two mean charts, one including compilation time and one without?
Michael,
I think it would be more meaningful to do compile time and performance separately. Those are not really metrics that can be combined easily.
I also wonder, does the compile time benchmark include parallel make? With LTO it would make sense to use -flto=<parallelism> so that the link-time stage is parallel as well. Otherwise you are comparing single-threaded optimization against multi-threaded. The number passed to -flto has no effect on the generated code.
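As a rough illustration of the -flto=<parallelism> suggestion, a build script could derive the job count from the machine and pass it to GCC. This is only a sketch: the helper names, the -O2 level, and the file names are assumptions for the example, not what the benchmark actually uses.

```python
import os


def lto_flags(jobs=None):
    """Build GCC flags that enable LTO with a parallel link-time stage."""
    if jobs is None:
        # Assumption: use one LTO job per logical CPU reported by the OS.
        jobs = os.cpu_count() or 1
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    # -O2 is an arbitrary choice for this example.
    return ["-O2", f"-flto={jobs}"]


def link_command(objects, output, jobs=None):
    """Return an argv list for linking the given objects with LTO."""
    return ["gcc", *lto_flags(jobs), "-o", output, *objects]


# Hypothetical object files and output name.
cmd = link_command(["a.o", "b.o"], "app", jobs=16)
print(" ".join(cmd))
```

The job count only controls how many partitions are optimised at once during the link, so changing it should alter build time and not the resulting binary.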
GCC has -flto=n that performs similarly to clang (it seems to be faster for large programs)
Missed the reference to caching. Indeed, on the GCC side caching is not implemented yet, so with LTO one needs to rebuild the whole binary each time. That takes about 8 minutes for Firefox with -flto=16 on my 16-thread Bulldozer (compared to about 50 minutes with plain -flto). Caching is something I plan to look into for GCC 10. GCC already streams the translation units for parallel compilation, so adding a cache there is just a matter of tooling plus a bit of work on stabilizing the partitioning algorithm so that it produces similar results after small source changes. That should get Firefox re-linking down to about a minute.
Clang's ThinLTO has a somewhat different design from GCC's LTO, and it trades more code quality for re-linking time improvements. Something similar is also possible on the GCC side (a two-level LTO), but I am not sure how useful it would be.
I will check why FFTW seems to run faster without LTO. Generally, small stand-alone benchmarks like scimark do not test LTO very realistically: the whole hot loop is in one translation unit anyway, so LTO is not very useful.