GCC 13 "-O2" Performance Sped Up By Enabling Small Loop Unrolling
Intel engineer Hongyu Wang led the change that was merged today into GCC 13. Small loop unrolling is now enabled at the -O2 optimization level due to its benefit on modern AMD and Intel CPUs. In one particular SPEC test case, enabling small loop unrolling at -O2 improved Ice Lake server performance by 9% while also helping a Zen 3 system by 7.4%. The trade-off was a 0.9% code size increase. For the other benchmark cases run there was little to no measurable impact. Of course, it will be interesting to test GCC 13 with a more diverse range of benchmarks to see how it all goes.
Hongyu Wang explained in the commit:
"Modern processors have multi-way instruction decoders. For x86, Icelake/Zen3 can decode 5 uops per cycle, so for a small loop with <= 4 instructions (usually 3 uops, with a cmp/jmp pair that can be macro-fused), the decoder would have a 2-uop bubble for each iteration and the pipeline could not be fully utilized.
Therefore, this patch enables loop unrolling for small loops at O2 to fill the decoder as much as possible. It turns on RTL loop unrolling when targetm.loop_unroll_adjust exists, and only at O2 plus speed. In the x86 backend the default behavior is to unroll small loops with fewer than 4 insns once.
This improves 548.exchange2 by 9% on Icelake and 7.4% on Zen3 with a 0.9% codesize increase. For other benchmarks the variations are minor, and overall codesize increased by 0.2%.
The kernel image size increased by 0.06%, and there is no impact on EEMBC."
This change is now merged for GCC 13. The GCC 13.1 stable release should be out in its usual March/April timeframe. As that release nears I'll be running my usual GCC compiler comparison benchmarks.