Initial AMD Zen 3 Support Successfully Lands In GCC 11


  • Initial AMD Zen 3 Support Successfully Lands In GCC 11

    Phoronix: Initial AMD Zen 3 Support Successfully Lands In GCC 11

    A few days ago AMD finally sent out the initial AMD Zen 3 "znver3" support to the GCC compiler with the LLVM Clang support to follow. That initial "-march=znver3" targeting support has now been merged for GCC 11...


  • #2
    Compared to znver2 it adds support for PTA_VAES | PTA_VPCLMULQDQ | PTA_PKU; almost everything else reuses the znver2 code paths and scheduling.

    The exception is idiv, i.e. integer division, which does have new scheduling compared to znver2 and znver1. Looking at the new code, the new divider appears to be either better pipelined or simply faster, because it now reserves some ports for fewer cycles: a bit over two times fewer. Nice. It will affect scheduling of the code a bit, and probably improve register allocation a little as well.
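    A quick way to see these feature additions from the command line is to dump the preprocessor macros GCC defines per -march target; znver3 should additionally define __VAES__, __VPCLMULQDQ__ and __PKU__. A sketch, assuming a GCC new enough to know znver3 (GCC 11+); swap in znver2 on older compilers:
    Code:
    ```shell
    # Pick out the macros for the extensions znver3 is said to add
    # over znver2 (VAES, VPCLMULQDQ, PKU):
    gcc -march=znver3 -dM -E - </dev/null | grep -E '__(VAES|VPCLMULQDQ|PKU)__'

    # Diffing the full macro sets of the two targets shows everything
    # that changed between them (diff exits non-zero when they differ):
    diff <(gcc -march=znver2 -dM -E - </dev/null | sort) \
         <(gcc -march=znver3 -dM -E - </dev/null | sort) || true
    ```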



    • #3
      Failing to have compiler optimisations in place before the product launch seems like a massive oversight, especially since Zen 3 has some rather profound changes that I'm sure optimisation could take big advantage of.



      • #4
        Originally posted by scottishduck View Post
        Failing to have compiler optimisations in place before the product launch seems like a massive oversight, especially since Zen 3 has some rather profound changes that I'm sure optimisation could take big advantage of.
        This is nonsense. CPU-microarchitecture-specific optimisations are extremely rarely used in practice. Not only that: building for znver2 already provides 98% of the performance gains possible on znver3. The "profound" optimizations are in hardware, and generated code doesn't really need to change to benefit from them.

        The gcc optimisations for znver3 will provide very little difference, and only in some very specific workloads.


        Also, AMD has its own compiler suite, AMD AOCC ( https://developer.amd.com/amd-aocc/ ), which has been available for months with good optimisations.
        Last edited by baryluk; 05 December 2020, 11:15 AM.



        • #5
          Originally posted by baryluk View Post
          This is nonsense. CPU-microarchitecture-specific optimisations are extremely rarely used in practice. Not only that: building for znver2 already provides 98% of the performance gains possible on znver3. The "profound" optimizations are in hardware, and generated code doesn't really need to change to benefit from them.

          The gcc optimisations for znver3 will provide very little difference, and only in some very specific workloads.


          Also, AMD has its own compiler suite, AMD AOCC ( https://developer.amd.com/amd-aocc/ ), which has been available for months with good optimisations.
          The major gains appear to come from the changes to the L3 cache, while L1/L2 have not been changed afaik. GCC, however, doesn't yet model the L3 cache, only L1 and L2, for which it also provides command-line parameters in case the correct values haven't been detected.

          How to check on these values:
          Code:
          $ gcc -march=znver2 --help=params -Q|fgrep cache
          --param=l1-cache-line-size= 32
          --param=l1-cache-size= 64
          --param=l2-cache-size= 512
          $ gcc -march=znver2 --help=target -Q|fgrep march
          -march= znver2
          That said, I've noticed an unexpected behaviour while I was checking on these values again today on my Ryzen 7 3800X. The L1/L2 cache parameters changed when I used -march=native versus -march=znver2.
          Code:
          $ gcc -march=native --help=params -Q|fgrep cache
          --param=l1-cache-line-size= 64
          --param=l1-cache-size= 32
          --param=l2-cache-size= 512
          $ gcc -march=native --help=target -Q|fgrep march
          -march= znver2
          The latter values as shown here are in fact the correct ones for the Ryzen 7 3800X.

          So now I'm wondering what I'm missing here. Apparently it's not enough to just specify -march=znver2; it has to be -march=native before GCC uses the correct values for the L1/L2 cache on an actual Zen 2 CPU.

          Edit:
          I've now also tested it with -march=znver2 -mtune=znver2 and it keeps using the wrong values; only when -mtune=native is used, and GCC performs a run-time detection, does it use the correct values!
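          For anyone hitting the same behaviour, GCC also accepts the cache characteristics as explicit --param switches, so as a workaround a build can pin the correct values rather than rely on the -march=znver2 defaults. A sketch (the file name is made up, and the values assume the 3800X's real geometry: 64-byte lines, 32 KB L1d, 512 KB L2):
          Code:
          ```shell
          # Compare what -march=znver2 assumes against what run-time
          # detection (-march=native) finds on this machine:
          diff <(gcc -march=znver2 --help=params -Q | grep cache) \
               <(gcc -march=native --help=params -Q | grep cache) || true

          # Workaround: pin the correct values explicitly.
          echo 'int main(void){return 0;}' > /tmp/cache_params_demo.c
          gcc -O2 -march=znver2 \
              --param l1-cache-line-size=64 \
              --param l1-cache-size=32 \
              --param l2-cache-size=512 \
              /tmp/cache_params_demo.c -o /tmp/cache_params_demo
          ```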
          Last edited by sdack; 05 December 2020, 12:54 PM.



          • #6
            Originally posted by sdack View Post
            The major gains appear to be coming from the changes to the L3 cache, while L1/L2 have not been changed afaik. GCC then doesn't yet recognise L3 caches but only L1 and L2, for which it also provides command line switches in case the correct values haven't been recognised.

            How to check on these values:
            Code:
            $ gcc -march=znver2 --help=params -Q|fgrep cache
            --param=l1-cache-line-size= 32
            --param=l1-cache-size= 64
            --param=l2-cache-size= 512
            $ gcc -march=znver2 --help=target -Q|fgrep march
            -march= znver2
            That said, I've noticed an unexpected behaviour while I was checking on these values again today on my Ryzen 7 3800X. The L1/L2 cache parameters changed when I used -march=native versus -march=znver2.
            Code:
            $ gcc -march=native --help=params -Q|fgrep cache
            --param=l1-cache-line-size= 64
            --param=l1-cache-size= 32
            --param=l2-cache-size= 512
            $ gcc -march=native --help=target -Q|fgrep march
            -march= znver2
            The latter values as shown here are in fact the correct ones for the Ryzen 7 3800X. So now I'm wondering what I'm missing here. Apparently it's not enough to just specify -march=znver2; it has to be -march=native before GCC uses the correct values for the L1/L2 cache on an actual Zen 2 CPU.
            Compilers don't take the L3 cache into account, because it is both hard and extremely workload-dependent, including on what the other cores are doing. The compiler simply doesn't know. Such an option will not be in GCC or Clang in the next 10 years. Easy as that.



            • #7
              Originally posted by baryluk View Post
              Compilers don't take the L3 cache into account, because it is both hard and extremely workload-dependent, including on what the other cores are doing. The compiler simply doesn't know. Such an option will not be in GCC or Clang in the next 10 years. Easy as that.
              It's not about that. It's that knowledge of the cache characteristics gives an optimisation the opportunity to produce a slightly better result than having no knowledge at all. The L1 and L2 caches are also affected by multi-threaded workloads, not only the L3, as you will know, and that hasn't stopped developers from using this knowledge in their optimisations.

              I've mentioned the L3 cache because it is the one cache that has changed with Zen 3, yet isn't taken into account by GCC. This is relevant here because the article is about Zen 3 support in GCC.

              I wasn't trying to debate whether it makes sense to include L3 characteristics or not. I'm sure some dev will find it interesting, and it may have a few edge cases where it can already make a small difference. I'm not working on GCC and so won't get involved in those decisions. I also don't see the point in guessing whether we'll see it happen in GCC within the next 10 years or not.
              Last edited by sdack; 05 December 2020, 01:34 PM.



              • #8
                Originally posted by sdack View Post
                So now I'm wondering as to what I'm missing here. It's apparently not enough to just specify -march=znver2 but it has to be -march=native before GCC uses the correct values for the L1/L2 cache on an actual Zen 2 CPU.

                Edit:
                I've now also tested it with -march=znver2 -mtune=znver2 and it keeps using the wrong values and only when -mtune=native is used and GCC performs a run-time detection does it use the correct values!
                I would guess it's because you cannot assume that all Zen 2 cores have the same cache configuration. While AMD seems to use a fixed L1/L2 cache amount for Zen 2 regardless of the product segment, the same has not always been true. For example, back in the Athlon 64 days, a K8 CPU could have anything from 128KB to 1MB of L2 cache per core.



                • #9
                  Originally posted by AmericanLocomotive View Post
                  I would guess it's because you cannot assume that all Zen 2 cores will have the same cache configuration. While AMD seems to be using a fixed L1/L2 cache amount for Zen 2 regardless of application, the same has not always been true. For example, back in the Athlon 64 days, a K8 core could have anything from 128KB to 1MB of L2 cache per core.
                  From the looks of it, it gets the false values from the generic or x86-64 defaults, i.e. -mtune=generic or -mtune=x86-64.

                  However, at least for the cacheline size, 64 bytes is the standard these days, if not more. A cacheline size of 32 bytes goes back to the AMD K5 and K6, iirc; the K8 and later already had 64 bytes. It basically changed from 32 to 64 bytes when x86 went from 32-bit to 64-bit. It would therefore surprise me if there were a Zen CPU out there with a cacheline size of only 32 bytes. And even if such a CPU existed, it would be a rare exception, not the norm, and you would want to stick with the far more common 64 bytes.

                  Just as a reminder, the cacheline size determines how many bytes are transferred between the caches at once. Going back to 32 bytes would mean there was a Zen CPU where AMD had cut the path width between the caches in half. I don't see what major technical gain this would give. It's not like an external bus requiring more pins and more complicated board layouts. Cutting the width in half would also affect how the instruction decoder works ...

                  Frankly, I don't believe there is a single Zen CPU out there with a 32-byte cacheline size. And there probably hasn't been an AMD CPU since the Athlon 64 with less than 64 bytes. This looks like a simple bug, a twist of numbers: a 32 got swapped for a 64 and nobody noticed, because the effect is likely only a mild regression.

                  Anyhow, I've asked for an explanation on the gcc-help mailing list and will open a PR if necessary. If it's a bug then it can get fixed.
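                  One way to double-check the hardware side of this, independently of GCC's tables, is to ask the libc and the kernel for the real cacheline size; on any Zen part both should report 64 bytes. The paths below are the usual Linux ones (sysfs may be absent in some environments):
                  Code:
                  ```shell
                  # Ask glibc for the L1 data-cache line size in bytes:
                  getconf LEVEL1_DCACHE_LINESIZE

                  # The same value as the kernel reports it via sysfs:
                  cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size \
                      2>/dev/null || echo "sysfs cache info not available"
                  ```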
                  Last edited by sdack; 06 December 2020, 02:30 PM.



                  • #10
                    Hmmm, good point here. I'll try prefetching into L3 manually in some of my loops and see if there's any difference. Didn't think of that one, thanks.
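                    For reference, GCC's hook for this is __builtin_prefetch(addr, rw, locality), where a lower locality hint asks for the data to be kept only in outer cache levels; locality 1 roughly means "low temporal locality, L2/L3-ish" (prefetcht2 on x86). A minimal sketch, compiled and run from the shell to match the snippets above; the file name, array size, and prefetch distance are made up for illustration:
                    Code:
                    ```shell
                    # Minimal prefetch demo: sum a large array, prefetching a
                    # few cachelines ahead with low temporal locality.
                    cat > /tmp/prefetch_demo.c <<'EOF'
                    #include <stdio.h>
                    #include <stdlib.h>

                    int main(void) {
                        enum { N = 1 << 20 };
                        int *a = malloc(N * sizeof *a);
                        if (!a) return 1;
                        for (int i = 0; i < N; i++) a[i] = 1;

                        long sum = 0;
                        for (int i = 0; i < N; i++) {
                            /* rw=0 (read), locality=1: hint that the data need
                               not be kept close to the core. */
                            if (i + 64 < N) __builtin_prefetch(&a[i + 64], 0, 1);
                            sum += a[i];
                        }
                        printf("%ld\n", sum);  /* prints 1048576 */
                        free(a);
                        return 0;
                    }
                    EOF
                    gcc -O2 /tmp/prefetch_demo.c -o /tmp/prefetch_demo
                    /tmp/prefetch_demo
                    ```
                    Whether it actually helps depends entirely on whether the access pattern outruns the hardware prefetchers, so measure before keeping it.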
