AMD Zen 3 Tuning Backported To The GCC 10 Compiler

  • AMD Zen 3 Tuning Backported To The GCC 10 Compiler

    Phoronix: AMD Zen 3 Tuning Backported To The GCC 10 Compiler

    In the past few weeks since the introduction of the EPYC 7003 "Milan" processors there has finally been AMD Zen 3 "Znver3" tuning work that's been hurried into the GCC 11 compiler code-base ahead of its stable release in the coming weeks. That initial Zen 3 tuning work has also now been back-ported to the GCC 10 branch ahead of its next point release...

    https://www.phoronix.com/scan.php?pa...CC-10-Backport

  • #2
    While I am happy that GCC development gets some visibility, I think I should point out that, from a code generation point of view, zen2 and zen3 are very close to each other (which is a good thing, since it means the core is well designed to run existing software). The important difference between zen1 and zen2 was the implementation of avx256, but for zen3 there is little to improve at this point.

    - Zen3 has better parallelism (perhaps it is better to say a lot better, judging from the increased IPC), but this hardly shows to the compiler, since the scheduling regions are generally too small, there are too many unknowns (like the state of the CPU at the start of a scheduling region), and the scheduler model is not precise enough. Scheduling is still an important win. Just like for zen2 and other current-generation out-of-order CPUs, it makes sense to identify the critical path, start it early, and interleave it with other instructions to help the out-of-order core fill bubbles in the pipeline.

    - Some instruction latencies (generally of expensive instructions like division or square root) changed. This may affect the critical path choice, but only in the very rare cases where there are two candidates for the hot path involving those expensive operations. Another similarly rare case where this can make a difference is vectorization, since the latencies may affect the logic judging the benefit of a vectorized loop relative to its prologue/epilogue costs. The TSVC benchmark tests a lot of common vectorizable kernels, and there is a 1.5-2.5% improvement in the geometric average when comparing zen2 and zen3 tuning.

    - Gather instructions are faster, but still microcoded. However, testing shows that, unlike on zen1, there are now kernels where they make sense. Because it is not 100% clear how to set up the heuristics (for example, aocc does not use gathers), this is not backported to gcc10. On general benchmarks like SPEC2k17 the difference is within noise, but on a few meaningful kernels in tsvc one gets quite a nice improvement, affecting about 0.5% of the geometric average.

    - The multiply-and-add instruction has reduced latency, but when placed on the critical path of a matrix multiply it is still slower than separate multiplication and addition. So the heuristic here is still the same: use multiply-and-add if there is no critical chain around the loop involving the accumulator register.
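
The loop shapes mentioned in the list above can be sketched in C. This is only an illustration: the function names are hypothetical, and whether GCC actually emits fused multiply-add or gather instructions for them depends on the compiler version, the options used, and its cost model, none of which is shown here.

```c
#include <stddef.h>

/* Loop-carried accumulator: every addition depends on the previous
   iteration, so a fused multiply-add here sits on the critical chain. */
double dot(const double *a, const double *b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* No chain across iterations: each y[i] is independent, so a fused
   multiply-add does not lengthen any loop-carried dependency. */
void axpy(double alpha, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

/* Indexed loads: the kind of loop a vectorizer may choose to
   implement with gather instructions. */
void gather_copy(double *dst, const double *src, const int *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}
```

In `dot` the accumulator `acc` forms the chain around the loop that the heuristic looks for; in `axpy` there is no such chain.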

    In short, one should not expect huge differences between zen2 and zen3 tuning. However, -march=native will now correctly detect the core, and one can use #ifdef __znver3__ and #ifdef __tune_znver3__ with the expected outcome.
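
A minimal sketch of how those two macros might be checked at compile time. The function name and the returned strings are made up for illustration. The assumption is that `__znver3__` follows `-march` and `__tune_znver3__` follows `-mtune`, as with the macros for the earlier znver targets.

```c
/* Report which Zen 3 related macro, if any, the compiler defined
   for this translation unit (hypothetical helper for illustration). */
const char *zen3_build_kind(void)
{
#if defined(__znver3__)
    return "march=znver3";   /* code generation targets Zen 3 */
#elif defined(__tune_znver3__)
    return "mtune=znver3";   /* only the tuning targets Zen 3 */
#else
    return "generic";        /* neither macro is defined */
#endif
}
```

Building with `-march=znver3` should select the first branch, and building with only `-mtune=znver3` the second.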
    Last edited by hubicka; 01 April 2021, 09:10 AM.


    • #3
      Is AOCC the same as aocl-gcc?


      • #4
        Originally posted by AndyChow
        Is AOCC the same as aocl-gcc?
        AOCC is self-compiled and aocl-gcc is compiled with GCC.


        • #5
          Michael

          Typo:

          Originally posted by phoronix
          I'm glad this ended up turning like I had asked:
          Originally posted by geearf
          Can't znver3 be added to a minor release of GCC if it's not too much of a change? That way it shouldn't take that long to go.
          Last edited by geearf; 01 April 2021, 09:08 AM.


          • #6
            hubicka thx for your additional notes. Good read!


            • #7
              Originally posted by hubicka
              [...] there are too many unknowns (like state of CPU at the start of scheduling region) [...]
              Does AMD not talk to you or other compiler engineers about that?


              • #8
                Originally posted by mlau
                Does AMD not talk to you or other compiler engineers about that?
                Not everything about the out-of-order core is documented (but communication with AMD is good), and a lot of things are not known at compile time (memory latencies, cache status, branch prediction).

                The main problem is that the compiler can analyse just a relatively small region of code (basic block, function, etc.) and thus it cannot have a very clear idea about the state of the pipeline at the beginning of the analysed region. The scheduler in GCC is mostly basic-block based, and the average basic block is just a few instructions long, which does not compare well to the parallelism of Zen.
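
To give a feel for how short basic blocks are, here is a hypothetical C function with approximate block boundaries marked in comments. The split shown assumes a straightforward translation before transformations such as if-conversion or vectorization, and the real block layout depends on the compiler and options.

```c
#include <stddef.h>

/* Sum of the values in v, each clamped to the range [0, limit]. */
long clamped_sum(const int *v, size_t n, int limit)
{
    long sum = 0;                    /* block 1: function entry */
    for (size_t i = 0; i < n; i++) { /* block 2: loop test */
        int x = v[i];                /* block 3: load, compare with 0 */
        if (x < 0)
            x = 0;                   /* block 4: a single move */
        else if (x > limit)          /* block 5: compare with limit */
            x = limit;               /* block 6: a single move */
        sum += x;                    /* block 7: add, increment, jump back */
    }
    return sum;                      /* block 8: function exit */
}
```

Each of these blocks holds only a handful of instructions, so a scheduler working one block at a time has very little to reorder.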
