GCC 12 Compiler Squaring Away Its AVX-512 FP16 Support


  • GCC 12 Compiler Squaring Away Its AVX-512 FP16 Support

    Phoronix: GCC 12 Compiler Squaring Away Its AVX-512 FP16 Support

    In recent weeks the AVX-512 FP16 support has been landing within the GNU Compiler Collection codebase for next year's GCC 12 release...

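    For reference, a minimal sketch of what the new support looks like from C, assuming a GCC 12 snapshot with the FP16 patches, the -mavx512fp16 option, and the __m512h / _Float16 intrinsics from <immintrin.h> (the function below is just an illustrative example, not from the article):

```c
/* Sketch: c[i] += a[i] * b[i] on half-precision floats, 32 lanes per
 * 512-bit vector.  Assumes GCC 12 with AVX-512 FP16 support, e.g.:
 *   gcc -O2 -mavx512fp16 -c fma_fp16.c
 */
#include <immintrin.h>
#include <stddef.h>

void fma_fp16(_Float16 *c, const _Float16 *a, const _Float16 *b, size_t n)
{
    /* n is assumed to be a multiple of 32 to keep the sketch short. */
    for (size_t i = 0; i < n; i += 32) {
        __m512h va = _mm512_loadu_ph(a + i);
        __m512h vb = _mm512_loadu_ph(b + i);
        __m512h vc = _mm512_loadu_ph(c + i);
        vc = _mm512_fmadd_ph(va, vb, vc);   /* 32 half-precision FMAs */
        _mm512_storeu_ph(c + i, vc);
    }
}
```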

  • #2
    Sapphire Rapids might be adding AVX-512 FP16, but it's worth noting that AVX-512 is gone entirely from Alder Lake. It will be interesting to see what Zen 4 does.


    • #3
      Would be nice to have all the new AVX-512 instructions, even if only running at 256bit at a time. Though it might not be that easy to implement :/
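      For what it's worth, part of that already exists at the ISA level: AVX-512VL exposes the new instruction forms (EVEX encoding, masking, 32 registers) at 256-bit and 128-bit operand width. A minimal sketch, assuming a compiler with AVX-512VL enabled (the function name is just illustrative):

```c
/* Sketch: EVEX-encoded, masked 256-bit add via AVX-512VL.
 * Build with e.g.:  gcc -O2 -march=skylake-avx512 -c vl_demo.c
 */
#include <immintrin.h>

/* Adds b into a only in the lanes selected by the 8-bit mask k;
 * the unselected lanes keep their original value from a. */
__m256 masked_add(__m256 a, __m256 b, __mmask8 k)
{
    return _mm256_mask_add_ps(a, k, a, b);
}
```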


      • #4
        Originally posted by carewolf View Post
        Would be nice to have all the new AVX-512 instructions, even if only running at 256bit at a time. Though it might not be that easy to implement :/
        Indeed. What if Zen 4, Gracemont, Golden Cove and so on accepted the "full" AVX512 instruction set and just broke down wider instructions into micro-ops? Would it really add that much overhead?

        Zen 1 did that, and it seems fine. ARMv9 sort of does that, and it scales down to tiny, low power cores. What makes it so hard for AMD/Intel to do in newer cores?


        • #5
          Originally posted by brucethemoose View Post

          Indeed. What if Zen 4, Gracemont, Golden Cove and so on accepted the "full" AVX512 instruction set and just broke down wider instructions into micro-ops? Would it really add that much overhead?

          Zen 1 did that, and it seems fine. ARMv9 sort of does that, and it scales down to tiny, low power cores. What makes it so hard for AMD/Intel to do in newer cores?
          Well, 256-bit AVX was specifically designed to make that easy. AVX-512 is not similarly designed, so breaking the instructions in half is not trivial. It could be done, but it might create situations where it is slower than using 256-bit AVX directly.
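          On the compiler side there is at least a knob for that trade-off already: GCC's -mprefer-vector-width=256 keeps the auto-vectorizer at 256-bit vectors even when AVX-512 is enabled. A sketch, assuming current GCC (the function is just an example):

```c
/* Sketch: target an AVX-512 machine but prefer 256-bit vectors,
 * e.g. to sidestep 512-bit width/frequency penalties:
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=256 -c saxpy.c
 */
#include <stddef.h>

void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];   /* auto-vectorized with 256-bit vectors */
}
```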


          • #6
            Originally posted by carewolf View Post

            Well, 256-bit AVX was specifically designed to make that easy. AVX-512 is not similarly designed, so breaking the instructions in half is not trivial. It could be done, but it might create situations where it is slower than using 256-bit AVX directly.
            Ah.

            Assuming it's still possible, I think the standardization would be worth the extra decoding complexity, but I guess it depends on how big that penalty is.

            Full AVX512 would probably take up too much die space on the atom-like cores... or maybe not? The Xeon Phi cores weren't that big IIRC, and the Centaur cores that support AVX512 aren't particularly huge either.


            • #7
              Originally posted by brucethemoose View Post

              Ah.

              Assuming it's still possible, I think the standardization would be worth the extra decoding complexity, but I guess it depends on how big that penalty is.

              Full AVX512 would probably take up too much die space on the atom-like cores... or maybe not? The Xeon Phi cores weren't that big IIRC, and the Centaur cores that support AVX512 aren't particularly huge either.
              Sure. I think for Intel it is just a matter of market differentiation. For AMD, it is a matter of whether it is worth it.


              • #8
                Originally posted by carewolf View Post
                Would be nice to have all the new AVX-512 instructions, even if only running at 256bit at a time. Though it might not be that easy to implement :/
                It's not only about vector pipeline width. AVX-512 at least quadruples the size of the vector register file, by doubling both the number of registers and their width (16 × 256-bit = 512 bytes versus 32 × 512-bit = 2 KiB, before any renaming). On Intel's "little" cores, even that increase in area might've been deemed too much.


                • #9
                  Originally posted by brucethemoose View Post
                  ARMv9 sort of does that, and it scales down to tiny, low power cores. What makes it so hard for AMD/Intel to do in newer cores?
                  ARM SVE is a different animal. Even though the architecture lets an implementation scale vectors up to 2048 bits, it doesn't require 2048-bit registers on all implementations. The minimum width is 128 bits, at which point the entire vector footprint, registers included, is just 128 bits wide.

                  What SVE does differently is to expose the implementation width in a way that makes it easy for software to adapt to its vector size. This stands in contrast to the x86 approach of requiring distinct opcodes for different vector widths.
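                  A rough illustration of that programming model, using the ACLE intrinsics from <arm_sve.h> (the loop is the standard vector-length-agnostic idiom; the function name is just an example):

```c
/* Sketch: the same binary runs on 128-bit and 2048-bit SVE hardware;
 * svcntw() and the whilelt predicate pick up the vector length at run time.
 * Build with e.g.:  gcc -O2 -march=armv8.2-a+sve -c vla_add.c
 */
#include <arm_sve.h>
#include <stdint.h>

void add_f32(float *dst, const float *a, const float *b, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {   /* elements per vector */
        svbool_t pg = svwhilelt_b32(i, n);        /* predicate covers the tail */
        svfloat32_t va = svld1(pg, a + i);
        svfloat32_t vb = svld1(pg, b + i);
        svst1(pg, dst + i, svadd_m(pg, va, vb));
    }
}
```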

                  BTW, ARM's new mid-power cores (the A510) share a single, 128-bit vector pipeline between two of them, in the default configuration. So, that would suggest SVE has more area cost than conventional ARMv8 128-bit SIMD.

                  I haven't read about a new A3x-series core for ARMv9, which would be the "tiny, low power" cores. Surely, it's only a matter of time. ...and maybe a new process node, for them to be viable.
                  Last edited by coder; 05 October 2021, 05:15 AM.


                  • #10
                    Originally posted by brucethemoose View Post
                    Full AVX512 would probably take up too much die space on the atom-like cores... or maybe not? The Xeon Phi cores weren't that big IIRC, and the Centaur cores that support AVX512 aren't particularly huge either.
                    The KNL dies were huge and the cores were otherwise very simple. This enabled them to have more GPU-like allocation of vector compute vs. scalar & control logic, I think.

                    With modern "little" cores, there's a lot more OoO overhead, much wider support for scalar ops, and they have to be cost-competitive even in low-end 4-core implementations. I'm guessing that's the issue.
                    Last edited by coder; 05 October 2021, 05:17 AM.
