AMD AOCC 4.0 Arrives For Squeezing More Performance Out Of Zen 4


  • #11
    Originally posted by PerformanceExpert View Post
    AOCC and ICC only exist for SPEC scores. So new compilers get released at the same time as new server cores. Releasing the compiler earlier means SPEC scores of older cores will increase as well, something you definitely don't want when promoting your next generation server! Many optimization tricks apply just to SPEC and are kept secret, hence the relatively low-key effort to tune open source GCC and LLVM.
    This is somewhat self-defeating logic, because I think fewer people are basing their purchasing decisions on SPEC or using AOCC/ICC, as time goes on. Furthermore, regardless of what you say, it's really in their interest to get basic stuff, like instruction cost models, updated for new/upcoming CPUs. Once there are engineering samples in the wild, that stuff is no longer secret.

    I wonder if AMD might not have prioritized it, due to the lack of serious competition for Genoa. That could change, when their 3D cache model faces off against Sapphire Rapids Xeon Max (HBM).

    Comment


    • #12
      Originally posted by baryluk View Post
      not sure why it needs libncurses5-dev either.
      For pretty formatting of compiler errors and warnings, I'm sure.

      Originally posted by baryluk View Post
      More interesting is the release of AMD μProf 4.0.0 for Linux, FreeBSD and Windows. It works with code compiled by any compiler, and it has both a command-line interface and a GUI. It extends normal profiling methods with extra library support, especially MPI and OpenMP, and it integrates with GPU profiling (ROCm and gcc offload too I think). For CPU it uses perf + eBPF, which is a really cool, flexible and lowish overhead technique. It also integrates with power measurement tracing. So it is a nice package, again mostly for HPC people. Documentation and features are also pretty decent, so it might be a good alternative to VTune, perf, and similar.

      Unfortunately the deb package does not work well on Debian. There are install script bugs, and it might not work with more recent kernels. :/
      Thanks!

      Comment


      • #13
        Originally posted by coder View Post
        This is somewhat self-defeating logic, because I think fewer people are basing their purchasing decisions on SPEC or using AOCC/ICC, as time goes on. Furthermore, regardless of what you say, it's really in their interest to get basic stuff, like instruction cost models, updated for new/upcoming CPUs. Once there are engineering samples in the wild, that stuff is no longer secret.

        I wonder if AMD might not have prioritized it, due to the lack of serious competition for Genoa. That could change, when their 3D cache model faces off against Sapphire Rapids Xeon Max (HBM).
        You're right that SPEC as a benchmark is becoming less important, but closed-source compilers are still popular for selling servers. They allow for workload-specific optimizations that cannot be used by your competitors (Intel's ICC is infamous for this). I much prefer comparisons using GCC/LLVM (including for SPEC) since then you compare real CPU performance rather than who wrote the best compiler tricks!

        Generally the most important aspect of supporting new CPUs is ensuring you can generate code for the correct ISA. For Zen 4 that would be use of AVX-512 as that gives the largest gains by far. This is not as easy as it seems given the rather complex ISA and many extensions - there was a lot of work in GLIBC recently to ensure that AVX-512 string functions correctly check for the ISA extensions they require (this may be an issue in other AVX-512 code out there since it has been written for and tested only on Intel CPUs).
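To make the point about extension checks concrete, here is a minimal sketch. It is not glibc's actual dispatch logic. It assumes a Linux-style /proc/cpuinfo with a "flags" line, and the list of flags the string routine needs is a hypothetical example, as are the function names. The idea is only that a routine using several AVX-512 sub-extensions has to check each one, not just the base avx512f flag.

```python
# Sketch: check that every AVX-512 sub-extension a routine needs is present,
# not just the base "avx512f" flag. Flag names follow Linux /proc/cpuinfo.

def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. not an x86 Linux system)


def missing_flags(cpuinfo_text, required):
    """Return the required flags that the CPU does not report, sorted."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))


# Hypothetical requirement list for a byte-wise AVX-512 string routine.
STRING_ROUTINE_NEEDS = ("avx512f", "avx512bw", "avx512vl")

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # /proc/cpuinfo unavailable on this system
    print("missing:", missing_flags(text, STRING_ROUTINE_NEEDS))
```

On a CPU that reports avx512f but not avx512bw, a check against avx512f alone would pass and the routine would then fault on its first byte-granular instruction, which is the kind of bug described above.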

        Fine tuning cost models and schedulers gives far less gain. Micro-architectures have converged and are fairly similar nowadays, so the tuning from previous generation(s) works just fine. Instruction scheduling on OoO cores has been pretty much abandoned - there isn't much a compiler can do there besides perhaps reducing register pressure and spills. So that's why this gets low priority.

        Comment


        • #14
          Originally posted by PerformanceExpert View Post
          Instruction scheduling on OoO cores has been pretty much abandoned - there isn't much a compiler can do there besides perhaps reducing register pressure and spills. So that's why this gets low priority.
          I've found the -mtune option in GCC 10 to be worth a few %, when comparing against baseline x86-64 -- a significant amount, in today's competitive market. That said, I didn't try comparing between a recent vs. current micro-architecture.

          I think cost models are useful for more than just scheduling. I believe they can influence which instructions the compiler generates. Given differences in the number of ports and their restrictions & limitations, it stands to reason that could still be a relevant factor.

          Also, I believe name99 indicated that recent Apple cores can perform instruction fusion. However, because this happens early, the instructions to be fused must be consecutive. Therefore, LLVM contains a list of such instructions so the compiler can emit them in pairs, which Apple does take care to update.
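A toy model of why adjacency matters, as I understand it. The opcode names and the fusible-pair table are invented for illustration and do not come from LLVM or from any real core. It just counts how many pairs a decoder that only looks at neighbouring instructions could fuse.

```python
# Toy model only: opcode names and the fusible-pair table are invented.
FUSIBLE_PAIRS = {("cmp", "branch"), ("add", "branch")}


def count_fused(opcodes):
    """Count adjacent pairs a decoder could fuse, scanning left to right.

    Each instruction takes part in at most one fused pair.
    """
    fused = 0
    i = 0
    while i < len(opcodes) - 1:
        if (opcodes[i], opcodes[i + 1]) in FUSIBLE_PAIRS:
            fused += 1
            i += 2  # both instructions are consumed by the fused op
        else:
            i += 1
    return fused


adjacent = ["load", "cmp", "branch", "store"]
separated = ["cmp", "load", "branch", "store"]  # scheduler moved the load in between

print(count_fused(adjacent))   # 1
print(count_fused(separated))  # 0
```

Same instructions in both cases, but once something is scheduled between the compare and the branch, the pair no longer fuses. That is why the compiler would need a list of pairs to keep together.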

          I know Intel does micro-op fusion, though I'm not sure about AMD. Perhaps that's subject to similar restrictions?
          Last edited by coder; 13 November 2022, 07:13 PM.

          Comment


          • #15
            I'm not really in the loop about this, why does AOCC exist in the first place? Are the AMD guys having trouble getting their patches accepted upstream or is this a manifestation of NIH syndrome?

            Comment


            • #16
              Originally posted by david-nk View Post
              I'm not really in the loop about this, why does AOCC exist in the first place? Are the AMD guys having trouble getting their patches accepted upstream or is this a manifestation of NIH syndrome?
              According to PerformanceExpert, it has some proprietary optimizations that AMD doesn't want to share.

              I think the article (or another commenter?) said it's based on LLVM, so I wouldn't say it's NIH-syndrome.

              Comment


              • #17
                Originally posted by coder View Post
                According to PerformanceExpert, it has some proprietary optimizations that AMD doesn't want to share.

                I think the article (or another commenter?) said it's based on LLVM, so I wouldn't say it's NIH-syndrome.
                So they want their own shiny product just for the sake of it, that 100% falls under NIH syndrome for me. It's not really feasible to develop a competitive compiler from scratch, especially not for AMD, so it makes sense that it's based on LLVM. And of course it's not in any official distro repos, forcing users to jump through hoops if they want the full performance from their CPU. Most of AMD's decisions are really questionable lately.

                Comment


                • #18
                  Originally posted by david-nk View Post

                  So they want their own shiny product just for the sake of it, that 100% falls under NIH syndrome for me. It's not really feasible to develop a competitive compiler from scratch, especially not for AMD, so it makes sense that it's based on LLVM. And of course it's not in any official distro repos, forcing users to jump through hoops if they want the full performance from their CPU. Most of AMD's decisions are really questionable lately.
                  No, you can't blame this on AMD at all - they were forced to do it by Intel. AMD sued Intel for anti-competitive practices and the way Intel used ICC to get ahead of AMD on SPEC. AMD won and got major payouts. The end result was AMD developing AOCC to counter ICC. Both compilers only exist to give good SPEC scores. Both do it by adding very questionable optimizations that would never be accepted by the open source community. Both compilers have nasty EULA clauses that only allow you to use them or report results on Intel or AMD hardware, respectively.

                  Comment


                  • #19
                    Originally posted by ptr1337 View Post
                    Interesting. I was a bit excited about this release, but it's really sad that it is based on LLVM 14.
                    Anyway, I just tested compiling the Linux kernel with AMD AOCC Clang + full LTO, and it was not successful.
                    With the default installed clang, also based on the latest clang 14 release, the kernel compiled fine.

                    Besides that:

                    The compiler is slow... I usually build a full LTO kernel in 15-20 min on my 5900X.
                    When AOCC failed at the linking of vmlinuz, the compilation had already taken more than half an hour.
                    Hi! I work at AMD on AOCC!
                    Can you please let us know what error you are facing and your settings for the compilation? We would like to track this issue down and resolve it.
                    Thank you!

                    Comment


                    • #20
                      Originally posted by gganesh View Post

                      Hi! I work at AMD on AOCC!
                      Can you please let us know what error you are facing and your settings for the compilation? We would like to track this issue down and resolve it.
                      Thank you!
                      Oh, sorry, I did not notice the message. Sorry for the late response.
                      I did not save the log. I will run the compilation again with AOCC ThinLTO and kernel 6.0.10.
                      Mainly it fails at the linking step because of missing modules.
                      I will let you know as soon as I have the log.

                      If you are a developer on AOCC, did you test the performance of the compiler itself? Why is it so much slower than clang 15 or AOCC 3.0 (which was also a bit slow)?
                      I think AMD could easily build AOCC with ThinLTO and PGO; that improves the compiler's performance a lot.
                      I ran some benchmarks on speeding up Clang by building it with ThinLTO + PGO and then also applying BOLT to the clang binary. You can find them here:
                      https://github.com/ptr1337/llvm-bolt...-15-pgothinlto

                      Please consider doing this for a final release of the compiler.
                      Last edited by ptr1337; 20 November 2022, 04:48 PM.

                      Comment
