x86_64 microarchitecture levels benchmarked


    I have benchmarked the performance impact of compiling code for various x86_64 microarchitecture levels:

    • there is little or no performance benefit from -march=nehalem, which corresponds to x86_64-v2,
    • there is a moderate benefit from -march=haswell (x86_64-v3): around 10%-20% over the baseline for the tests performed.
    Geometric mean of all test results (composite, higher is better):
    O1_generic ....... 367.99
    O3_generic ....... 459.84
    O3_march_nehalem . 462.89
    O3_march_haswell . 531.99
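As a quick check of the percentages quoted in this post, the relative gains can be recomputed from the four composite scores above. This is only a small sketch: the helper name `gain_pct` is made up for illustration and is not part of the Phoronix Test Suite.

```shell
#!/bin/sh
# Hypothetical helper: percentage gain of SCORE over BASELINE.
# Usage: gain_pct BASELINE SCORE
gain_pct() {
    awk -v base="$1" -v score="$2" \
        'BEGIN { printf "%.1f\n", (score / base - 1) * 100 }'
}

# Composite geometric means taken from the table in this post.
gain_pct 459.84 462.89   # O3_march_nehalem vs O3_generic -> 0.7
gain_pct 459.84 531.99   # O3_march_haswell vs O3_generic -> 15.7
gain_pct 367.99 459.84   # O3_generic vs O1_generic       -> 25.0
```

On these composite scores, going from -O1 to -O3 on the baseline architecture gives a larger gain than going from the baseline to x86_64-v3.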
    x86_64-v2: There were only two tests in which -march=nehalem was meaningfully faster than -march=x86-64 (the baseline architecture): "graphicsmagick/Swirl" and "FLAC audio encoding". The FLAC results were quite noisy (click the "Result confidence" button above the pie chart to show the data), so the benefit may not be statistically significant. Swirl was only around 4% faster. This surprised me, as I had expected benefits of around 5-10%. It looks like GCC's autovectorisation does not make much use of the instructions added in SSE3/SSSE3/SSE4.

    x86_64-v3: The geometric mean of the test results was around 15% higher with -march=haswell than with baseline x86_64. Apart from john-the-ripper/md5, the tests were up to 36% faster, with a median performance increase of around 10%. [1]

    As described in a previous email to the Arch mailing list, I excluded tests that use dedicated code paths for processors supporting AVX/AVX2/etc., as I saw little point in benchmarking them. I also excluded some tests that showed little difference between the -O1 and -O3 optimization levels, as the compiler appears to have little work to do there. The real-world performance benefit of compiling a whole distribution for x86_64-v3 would therefore probably be smaller.

    I think that many workloads of a "typical user" are I/O bound. The limiting factor is likely to be the HDD/SSD, network throughput or latency, or memory speed. Many of the programs that would benefit most from compiling for x86_64-v3 already have dedicated code paths that use AVX/AVX2, perhaps written in assembly.

    • GCC 9.3.0 was used, which is not the most recent compiler available.
    Further research:
    • benchmarking web browser performance, as this is what matters most for many users,
    • comparing battery usage (Phoronix Test Suite has support for this), though I do not expect it to differ much from the performance data.
    How to reproduce:
    export CFLAGS="-O1 -mtune=generic -march=x86-64"
    export CXXFLAGS="-O1 -mtune=generic -march=x86-64"
    phoronix-test-suite benchmark 2103142-HA-UARCHLEVE55
    export CFLAGS="-O3 -mtune=generic -march=x86-64"
    export CXXFLAGS="-O3 -mtune=generic -march=x86-64"
    phoronix-test-suite benchmark $name_of_test_identifier_specified_before
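The commands above cover only the two generic configurations. A sketch of how all four configurations named in the results table could be driven from one loop follows. The flags for the nehalem and haswell runs are an assumption (only -march changed, -mtune=generic kept), since the post does not list them, and `set_flags` is a hypothetical helper. The benchmark call is left commented out so the snippet only prints the flags it would use.

```shell
#!/bin/sh
# Hypothetical helper: export CFLAGS/CXXFLAGS for one named configuration.
set_flags() {
    case "$1" in
        O1_generic)       flags="-O1 -mtune=generic -march=x86-64" ;;
        O3_generic)       flags="-O3 -mtune=generic -march=x86-64" ;;
        # Assumption: the post does not show the flags for these two runs.
        O3_march_nehalem) flags="-O3 -mtune=generic -march=nehalem" ;;
        O3_march_haswell) flags="-O3 -mtune=generic -march=haswell" ;;
        *) echo "unknown configuration: $1" >&2; return 1 ;;
    esac
    export CFLAGS="$flags" CXXFLAGS="$flags"
}

for cfg in O1_generic O3_generic O3_march_nehalem O3_march_haswell; do
    set_flags "$cfg" || exit 1
    echo "$cfg: CFLAGS=$CFLAGS"
    # Uncomment to run the benchmark with the flags exported above:
    # phoronix-test-suite benchmark 2103142-HA-UARCHLEVE55
done
```

Keeping -mtune=generic fixed in every run would isolate the effect of the extra instructions from the effect of per-CPU tuning. If the original runs dropped it, the flags here need adjusting.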
    Conflict of interest:
    • I would prefer that general-purpose distributions not raise the baseline x86_64 requirements.

    [1] Visit and scroll slightly lower.