x86_64 microarchitecture levels benchmarked

Mat2


15 March 2021, 03:31 PM

Hello,

I have benchmarked the performance impact of compiling code for various x86_64 microarchitecture levels:

Uarchlevels Performance [2103142-HA-UARCHLEVE55] - OpenBenchmarking.org

https://openbenchmarking.org/result/2103142-HA-UARCHLEVE55


TL;DR:
there is little or no performance benefit from -march=nehalem, which corresponds to x86_64-v2,

there is a moderate benefit from -march=haswell (x86_64-v3): around 10%-20% over the baseline for the tests performed.

Code:

Geometric Mean Of All Test Results
Result Composite
Geometric Mean > Higher Is Better
O1_generic ....... 367.99 |==============================================================================
O3_generic ....... 459.84 |==================================================================================================
O3_march_nehalem . 462.89 |==================================================================================================
O3_march_haswell . 531.99 |=================================================================================================================

x86_64-v2: There were only two tests in which -march=nehalem was meaningfully faster than -march=x86-64 (the baseline architecture): "graphicsmagick/Swirl" and "FLAC audio encoding". The FLAC results were quite noisy (click the "Result confidence" button above the pie chart to show the data), so the benefit may not be statistically significant. Swirl was only around 4% faster. This surprised me, as I had expected gains of around 5-10%. It looks like GCC's autovectorisation does not make much use of the instructions added in SSE3/SSSE3/SSE4.
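For readers who want to know which level their own CPU falls into, here is a rough sketch that maps a list of CPU flags (as reported in /proc/cpuinfo on Linux) to a level. The helper names has_all and uarch_level are made up for this example, the flag lists are my reading of the level definitions and may not be exhaustive, and the baseline is reported as level 1 by assumption.

```shell
#!/bin/sh
# has_all FLAGS NAME...: succeed only if every NAME occurs as a whole word in FLAGS.
has_all() {
    # Normalise whitespace and pad with spaces so that "avx" does not match "avx2".
    flags=" $(printf '%s' "$1" | tr -s '[:space:]' ' ') "
    shift
    for name in "$@"; do
        case "$flags" in
            *" $name "*) ;;
            *) return 1 ;;
        esac
    done
    return 0
}

# uarch_level FLAGS: print 1, 2 or 3 (hypothetical helper, not part of any tool).
# Assumption: anything that runs x86_64 code at all counts as level 1.
uarch_level() {
    level=1
    # x86_64-v2 additions; Linux reports SSE3 as "pni".
    if has_all "$1" cx16 lahf_lm popcnt pni ssse3 sse4_1 sse4_2; then
        level=2
        # x86_64-v3 additions; Linux reports LZCNT as "abm".
        if has_all "$1" avx avx2 bmi1 bmi2 f16c fma abm movbe xsave; then
            level=3
        fi
    fi
    echo "$level"
}

# Usage on Linux (left commented out so the sketch has no side effects):
# uarch_level "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
```

This only looks at what the kernel reports, so it says nothing about whether a given binary was actually built for that level.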

x86_64-v3: The geometric mean of the test results was around 15% higher with -march=haswell than with baseline x86_64. Apart from john-the-ripper/md5, the tests were up to 36% faster, with a median performance increase of around 10%. [1]
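The percentages quoted for the composite scores can be checked directly from the geometric means in the table above. A minimal sketch using awk; pct_gain is a name invented for this example.

```shell
#!/bin/sh
# pct_gain BASE NEW: percentage by which NEW exceeds BASE, to one decimal place.
pct_gain() {
    awk -v b="$1" -v n="$2" 'BEGIN { printf "%.1f\n", (n / b - 1) * 100 }'
}

pct_gain 367.99 459.84   # O1_generic -> O3_generic:        prints 25.0
pct_gain 459.84 462.89   # O3_generic -> O3_march_nehalem:  prints 0.7
pct_gain 459.84 531.99   # O3_generic -> O3_march_haswell:  prints 15.7
```

So going from -O1 to -O3 on the baseline architecture gains more on this composite than going from the baseline to -march=haswell, and -march=nehalem is under 1%.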

As described in a previous email to the Arch mailing list, I have excluded tests that use dedicated code paths for processors supporting AVX/AVX2/etc., as I saw little point in benchmarking them. I have also excluded some tests with little difference between the -O1 and -O3 optimization levels, as it appears that the compiler has little work to do there. So the real-world performance benefit of compiling a whole distribution for x86_64-v3 would probably be smaller.

I think that many workloads of a "typical user" are I/O bound. The limiting factor is likely to be the HDD/SSD, network throughput or latency, or memory speed. Many programs that would benefit the most from compiling for x86_64-v3 already have dedicated code paths that use AVX/AVX2, perhaps written in assembly.

Limitations:
GCC 9.3.0 was used, which is not the most recent compiler available.

Further research:
benchmarking web browser performance, as this is what matters most for many users,

comparing battery usage (Phoronix Test Suite has support for this), though I do not expect it to differ much from the performance data.

How to reproduce:

Code:

export CFLAGS="-O1 -mtune=generic -march=x86-64"
export CXXFLAGS="-O1 -mtune=generic -march=x86-64"
phoronix-test-suite benchmark 2103142-HA-UARCHLEVE55
export CFLAGS="-O3 -mtune=generic -march=x86-64"
export CXXFLAGS="-O3 -mtune=generic -march=x86-64"
phoronix-test-suite benchmark $name_of_test_identifier_specified_before
#etc.
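The same steps could be wrapped in a small loop over all four configurations. This is only a sketch: flags_for and run_all are names made up here, the post shows just the two generic flag strings, so the nehalem and haswell strings are guesses based on the result labels, and the loop passes the result ID on every run where the post used the saved test identifier for later runs. It only prints what it would do unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# flags_for LABEL: print the compiler flags for one configuration.
flags_for() {
    case "$1" in
        O1_generic)       echo "-O1 -mtune=generic -march=x86-64" ;;
        O3_generic)       echo "-O3 -mtune=generic -march=x86-64" ;;
        O3_march_nehalem) echo "-O3 -march=nehalem" ;;   # assumed, not shown in the post
        O3_march_haswell) echo "-O3 -march=haswell" ;;   # assumed, not shown in the post
        *) return 1 ;;
    esac
}

# run_all: export the flags for each configuration in turn and, unless this is
# a dry run (the default), start the benchmark with those flags in effect.
run_all() {
    for label in O1_generic O3_generic O3_march_nehalem O3_march_haswell; do
        CFLAGS=$(flags_for "$label")
        CXXFLAGS=$CFLAGS
        export CFLAGS CXXFLAGS
        echo "== $label: CFLAGS=$CFLAGS"
        if [ "${DRY_RUN:-1}" = 0 ]; then
            phoronix-test-suite benchmark 2103142-HA-UARCHLEVE55
        fi
    done
}
```

Calling run_all prints the four flag sets; DRY_RUN=0 run_all would actually start the benchmarks, which still prompt interactively for the result name.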

Conflict of interest:
I would prefer that general-purpose distributions do not raise the baseline x86_64 requirements.

[1] Visit https://openbenchmarking.org/result/..._march_nehalem and scroll slightly lower.