AMD Zen 5 Compiler Support Posted For GCC - Confirms New AVX Features & More


  • #11
    Originally posted by [deXter] View Post
    Zen 4 user here. Does anyone know if there's a difference (instruction set wise and real-world impact) in compiling using march=x86-64-v4 vs march=znver4? I've only recently switched my (Arch, btw) packages to x86-64-v4, but now I wonder whether I should be using znver4 instead - I haven't come across any mentions of this on the interwebs.
    I think the only userspace instruction additions are AVX-512.


    However, Zen 4 includes more than the Skylake-level AVX-512 functionality found in v4.


    That said, I'm not sure how much benefit you'll get from those other extensions, aside from in software packages specifically optimized for them.

    Also, using -mtune=znver4 could enable instruction cost tables specific to zen 4, depending on whether AMD ever got around to contributing them. Note that -mtune is implied by -march (for x86, at least; I think it's not true for ARM).
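
    One quick way to check what a given -march actually selects, assuming a GCC recent enough to know these names:
    Code:
    gcc -Q --help=target -march=znver4 | grep -E 'march=|mtune='
    gcc -Q --help=target -march=x86-64-v4 | grep -E 'march=|mtune='
    The first should report -mtune set to znver4, while the generic x86-64-v4 level falls back to generic tuning.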
    Last edited by coder; 11 February 2024, 02:50 AM.

    Comment


    • #12
      Originally posted by [deXter] View Post
      Zen 4 user here. Does anyone know if there's a difference (instruction set wise and real-world impact) in compiling using march=x86-64-v4 vs march=znver4? I've only recently switched my (Arch, btw) packages to x86-64-v4, but now I wonder whether I should be using znver4 instead - I haven't come across any mentions of this on the interwebs.
      [Embedded link, quoting: "I'm compiling my C++ app using GCC 4.3. Instead of manually selecting the optimization flags I'm using -march=native, which in theory should add all optimization flags applicable to the hardware..." A litany of different methods to see what GCC does when using different microarchitecture options.]

      Realistically there won't be a big performance difference between znver4 and x86-64-v4: nothing you would notice in everyday use, though it would show up in benchmarks.
      With a specific microarchitecture the compiler can choose tuning values and limits for that CPU instead of safe generic defaults. You should also just use native if you don't need to move or package the software for a different machine. On my system znver3 doesn't enable Shadow Stack but native does, so check whether your compiler turns on any extra flags on top of znver4 when you use native.
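
      For example, on a Zen 4 box you can diff the two yourself, assuming a GCC new enough to know znver4:
      Code:
      gcc -march=native -Q --help=target > native.flags
      gcc -march=znver4 -Q --help=target > znver4.flags
      diff -u znver4.flags native.flags
      Anything that only shows up in native.flags (such as -mshstk) is a flag that native adds on top of znver4.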

      Comment


      • #13
        Originally posted by [deXter] View Post
        Zen 4 user here. Does anyone know if there's a difference (instruction set wise and real-world impact) in compiling using march=x86-64-v4 vs march=znver4? I've only recently switched my (Arch, btw) packages to x86-64-v4, but now I wonder whether I should be using znver4 instead - I haven't come across any mentions of this on the interwebs.
        If you read the gcc docs, you'll see that setting it to x86-64-v4 enables the use of instructions, but doesn't tune to a specific CPU. Tuning will take into account pipeline widths, instruction timings, and so on, to produce more optimal assembly for a specific CPU.
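
        If you want a binary that stays portable across AVX-512 machines but is still scheduled for Zen 4, the two can be combined; a minimal sketch, with foo.c as a stand-in:
        Code:
        gcc -O2 -march=x86-64-v4 -mtune=znver4 -c foo.c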

        Comment


        • #14
          Originally posted by [deXter] View Post
          Zen 4 user here. Does anyone know if there's a difference (instruction set wise and real-world impact) in compiling using march=x86-64-v4 vs march=znver4? I've only recently switched my (Arch, btw) packages to x86-64-v4, but now I wonder whether I should be using znver4 instead - I haven't come across any mentions of this on the interwebs.


          Here is a diff of the flags which are enabled with -march=x86-64-v4 and -march=znver4

          Edit:
          This is done via:
          Code:
          gcc -march=x86-64-v4 -Q --help=target > v4.flags
          gcc -march=znver4 -Q --help=target > znver4.flags
          diff -u v4.flags znver4.flags
          Last edited by ptr1337; 10 February 2024, 04:53 PM.

          Comment


          • #15
            Originally posted by [deXter] View Post
            Zen 4 user here. Does anyone know if there's a difference (instruction set wise and real-world impact) in compiling using march=x86-64-v4 vs march=znver4? I've only recently switched my (Arch, btw) packages to x86-64-v4, but now I wonder whether I should be using znver4 instead - I haven't come across any mentions of this on the interwebs.
            Why not simply use -march=native if you are compiling yourself?
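
            If you're curious what native resolves to on a given machine, this should print the expanded option list that cc1 actually receives:
            Code:
            gcc -march=native -E -v - </dev/null 2>&1 | grep cc1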

            Comment


            • #16
              Originally posted by [deXter] View Post
              Zen 4 user here. Does anyone know if there's a difference (instruction set wise and real-world impact) in compiling using march=x86-64-v4 vs march=znver4? I've only recently switched my (Arch, btw) packages to x86-64-v4, but now I wonder whether I should be using znver4 instead - I haven't come across any mentions of this on the interwebs.


              As others have said, one advantage of -march=znver4 over -march=x86-64-v4 is that the compiler does optimization tuned for the Zen 4 microarchitecture.

              The option -march=x86-64-v4 only enables the instructions supported by Skylake Server, which is now about 7 years old.

              Zen 4 also supports many other instructions introduced by Cascade Lake, Cooper Lake, Cannon Lake and Ice Lake.

              These instructions can increase speed many times over, not just by a few percent, for various algorithms used in big-number arithmetic, cryptography and machine learning.

              Nevertheless, the greatest impact of these instructions is for various libraries, like OpenSSL, which usually include functions written in assembly language or with compiler intrinsics, which may be selected at run time.

              The impact on the code that does not use compiler intrinsics is likely to be less.

              In any case it never makes sense to compile with -march=x86-64-v4, unless you cannot predict what kind of computers the code will run on.

              Whenever you know that a program will be executed, e.g. on Zen 4 or on Alder Lake, the appropriate compiler option should be used.

              I am not a fan of "-march=native", for two reasons: when you use an older compiler on a newer CPU it may silently fall back to the worst (but safe) options, and I frequently compile a program on one computer for use on another, where writing an included makefile with separate make definitions for tool options usable only in special circumstances would be a waste of my time.
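
              Concretely, that just means one explicit -march per known target machine, e.g. (myprog.c is a placeholder, and the CPU names need a reasonably new GCC):
              Code:
              gcc -O2 -march=znver4 -o myprog.zen4 myprog.c
              gcc -O2 -march=alderlake -o myprog.adl myprog.c
              gcc -O2 -march=x86-64-v3 -o myprog.generic myprog.c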


              Comment


              • #17
                Originally posted by Namelesswonder View Post
                VP2INTERSECT was flawed on Tiger Lake, which meant it was faster to emulate it than to actually use it.

                Although that could just be typical Intel underbaking their first implementation, with it only working well in their second attempt or in AMD's implementation.

                Also does AVX-VNNI even have any use over AVX512-VNNI? From Intel's own documentation it seems that they have the same CPI for their implementations, so AVX512-VNNI would have double the throughput.
                Is AMD just making a microcode implementation that uses the same AVX512-VNNI instructions to offer AVX-VNNI?

                Edit: It appears AVX-VNNI-INT* does add some more intrinsics over AVX512-VNNI, so it has use there, but AVX-VNNI standalone doesn't have any use over AVX512-VNNI.

                Intel has announced that in Granite Rapids VP2INTERSECT will be reintroduced.

                I assume that AMD has aimed in Zen 5 to provide compatibility with most of the Granite Rapids instruction set, with the exception of the features that would have been expensive to implement, namely AMX and AVX512-FP16, for which it might be better to use a GPU anyway.


                AVX-VNNI has been introduced by Intel for their CPUs that do not support AVX-512.

                There are many software developers who did not bother to provide AVX-512 support in their programs, because so many Intel CPUs lack support, but who might have included AVX-VNNI support.

                It is natural for AMD to add AVX-VNNI support so that their CPUs are not handicapped when running such Intel-oriented programs, especially since the addition is cheap: it mainly requires changes in the instruction decoder, because the corresponding execution units already exist.
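
                Which VNNI variants a given option actually turns on can be checked from GCC's predefined macros; a quick sketch, assuming a GCC that knows these flags:
                Code:
                gcc -march=znver4 -dM -E - </dev/null | grep -i vnni
                gcc -mavxvnni -dM -E - </dev/null | grep -i vnni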



                Comment


                • #18
                  Does anyone know what PREFETCHI does? Is that for prefetching instructions, as opposed to data?

                  Also, I looked at a description of the MOVDIRI and I'm a little unclear on the use case for it, to the extent that it differs from that of non-temporal stores. If anyone has insight into what use cases it addresses, please share.

                  Comment


                  • #19
                    Originally posted by coder View Post
                    Does anyone know what PREFETCHI does? Is that for prefetching instructions, as opposed to data?

                    Also, I looked at a description of the MOVDIRI and I'm a little unclear on the use case for it, to the extent that it differs from that of non-temporal stores. If anyone has insight into what use cases it addresses, please share.

                    As you have guessed, unlike older prefetch instructions that were for data, PREFETCHI is for prefetching instructions.

                    Unlike the non-temporal stores for vector registers, MOVDIRI is for storing the general-purpose registers. Moreover, there are differences in behavior as written in the Intel manual:

                    "Unlike stores with non-temporal hint that allow uncached (UC) and write-protected (WP) memory-type for the destination to override the non-temporal hint, direct-stores always follow WC memory type protocol irrespective of the destination address memory type (including UC and WP types).

                    Unlike WC stores and stores with non-temporal hint, direct-stores are eligible for immediate eviction from the write-combining buffer, and thus not combined with younger stores (including direct-stores) to the same address. Older WC and non-temporal stores held in the write-combining buffer may be combined with younger direct stores to the same address."


                    MOVDIRI is not intended for use in normal software. It is meant for communication with the hardware accelerators included in various Intel SoCs, currently mostly the Sapphire Rapids and Emerald Rapids Xeon CPUs.
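
                    If you want to experiment with them, GCC exposes both behind explicit flags (-mprefetchi needs GCC 13 or newer, if I remember correctly), and the predefined macros confirm what got enabled:
                    Code:
                    gcc -mmovdiri -mprefetchi -dM -E - </dev/null | grep -Ei 'movdiri|prefetchi'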






                    Comment


                    • #20
                      Originally posted by AdrianBc View Post
                      I am not a fan of "-march=native", because when you use an older compiler on a newer CPU it may fall back silently to the worst (but safe) options and also because I frequently compile a program on one computer for using it on another computer and it would be a waste of time for me to write an included makefile with appropriate make definitions for tool options that could be used only in special circumstances.

                      Processor-specific tunings are a pox and cancer and should never exist on anything except purpose-built appliances that exist to do only a very specific function.

                      Everything else should be on -march=x86-64 -mtune=generic

                      Comment
