AMD Sends Out Patches Adding "Znver3" Support To GNU Binutils With New Instructions


  • #11
    Since Ryzen had such great AES benchmark results, I wonder what the new VAES instructions will deliver here. Is there a way for Michael to test that?
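Before benchmarking, it helps to confirm what the CPU actually advertises. The following is a minimal sketch, assuming a Linux system where `/proc/cpuinfo` exposes a `flags` line; the helper names `parse_cpu_flags` and `report` are made up for illustration. It only reports feature flags and does not measure performance.

```python
from pathlib import Path

# Flag names as they appear on the "flags" line of /proc/cpuinfo on Linux.
FLAGS_OF_INTEREST = ("aes", "vaes", "pclmulqdq", "vpclmulqdq", "avx2", "avx512f")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line found."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no flags line (e.g. non-x86 or empty input)


def report(cpuinfo_text):
    """Map each flag of interest to True/False depending on its presence."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in FLAGS_OF_INTEREST}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only; an assumption of this sketch
    if path.exists():
        for name, present in report(path.read_text()).items():
            print(f"{name:12s} {'yes' if present else 'no'}")
    else:
        print("no /proc/cpuinfo on this system")
```

A CPU can report `vaes` and `vpclmulqdq` without `avx512f`, since each has its own flag. For actual numbers, something like `openssl speed -evp aes-128-gcm` on hardware with and without VAES would be one starting point, provided the OpenSSL build in use has a VAES code path.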


    • #12
      Originally posted by phoronix View Post
      Phoronix: AMD Sends Out Patches Adding "Znver3" Support To GNU Binutils With New Instructions

      One of AMD's compiler experts this week sent out a patch wiring up Zen 3 support in the important GNU Binutils collection for Linux systems...

      http://www.phoronix.com/scan.php?pag...nutils-Support
      I believe the AVX-512 part of the following sentence should be deleted from the article because it is misleading: "VPCLMULQDQ - Another instruction part of AVX-512".


      • #13
        Originally posted by ms178 View Post
        I guess you meant Bulldozer? As far as I know, Zen 1's AVX2 implementation was on par with Intel's. If they implemented AVX-512 like that, it wouldn't be that beneficial after all, would it? I am not an ISA expert, but doesn't the performance come from having wider vector units and needing fewer cycles per instruction? Looking back, that approach left them lagging quite a bit in AVX performance because of their implementation. Not that it mattered much then, as AVX2 wasn't that important at the time, but it might matter now if they want to go after Intel in AI and HPC workloads where AVX-512 is fully utilized. And with the x86-64-v4 target, it will probably be used more widely soon, at least on Linux. (Does anyone know if these new baselines will carry over into the Windows world? I'd love to see such a Windows version.)
        Just a note: some future x86 CPUs with heterogeneous cores might feature 128-bit-wide implementations of 256-bit AVX and AVX-512, because that would allow cores with higher 64-bit integer/floating-point performance at lower power than is possible with 512-bit-wide cores on the same chip.

        PS: Thanks for mentioning the new x86-64-v4 target.


        • #14
          Originally posted by rubdos View Post
          Since Ryzen had such great AES benchmark results, I wonder what the new VAES instructions will deliver here. Is there a way for Michael to test that?
          Ideally twice the number of bits processed per clock cycle.


          • #15
            Originally posted by carewolf View Post

            Ideally twice the number of bits processed per clock cycle.
            I mean, that could be a HUGE deal in data centres, no? "Ryzen 3 twice the speed of already record holder Ryzen 2 on OpenSSL AES, buy EPYC now!"

            I suppose it won't be twice the bandwidth, though, but it would be interesting to see some numbers on it. Also, since vector instructions could temporarily disable SMP locally (is that still true?), it might only be truly useful in non-server applications. Multicore benchmarks warranted!


            • #16
              Originally posted by ms178 View Post
              AVX-512 is supposed to come with Zen 4, hopefully with a better implementation than Intel's.
              Sadly, AVX-512 is broken by design.

              If just one library call executes just one AVX-512 instruction, suddenly every SSE and AVX operation burns more power by virtue of always having to copy the upper 256 bits of each vector register. Of course, you could always terminate AVX-512 code blocks with VZEROUPPER, but that potentially limits its use in smaller functions.

              ARM's SVE is a much better approach, if you really must have larger vectors. Better still would be to use a GPU or purpose-built AI accelerator.


              • #17
                Originally posted by zxy_thf View Post
                Zen 4's AVX-512 support might be like Zen 1's AVX2 support, i.e., emulating 512-bit operations with 256-bit ALUs.
                However, due to the tremendous die-area cost of AVX-512, this approach might also be another "worse is better" solution.
                Aside from the penalty you incur in mixed 128-bit or 256-bit + 512-bit workloads, there's the unavoidable downside of bigger registers and larger context. So, even a half-width implementation isn't going to be an entirely positive development.


                • #18
                  Originally posted by carewolf View Post
                  Ideally twice the number of bits processed per clock cycle.
                  At lower clock speeds, though! One developer found the impact on clock speed was so dramatic that using AVX-512 for crypto resulted in a net decrease of server throughput!

                  https://blog.cloudflare.com/on-the-d...uency-scaling/

                  If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.
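The net-decrease effect can be illustrated with a rough Amdahl-style model. This is only a sketch under simplifying assumptions: the clock reduction is assumed to apply to all code on the core for the whole run, and every number used with it is hypothetical rather than measured. The function name `relative_throughput` is made up for illustration.

```python
def relative_throughput(crypto_fraction, crypto_speedup, clock_ratio):
    """Throughput relative to a baseline without wide vectors (1.0 = same).

    crypto_fraction: share of baseline run time spent in crypto (0..1)
    crypto_speedup:  per-clock speedup of the crypto part from wider vectors
    clock_ratio:     throttled clock / normal clock (assumed to hit all code)
    """
    if not 0.0 <= crypto_fraction <= 1.0:
        raise ValueError("crypto_fraction must be between 0 and 1")
    if crypto_speedup <= 0 or clock_ratio <= 0:
        raise ValueError("crypto_speedup and clock_ratio must be positive")
    # Work in clock cycles shrinks only for the crypto part ...
    cycles = crypto_fraction / crypto_speedup + (1.0 - crypto_fraction)
    # ... but every cycle takes longer once the core is throttled.
    new_time = cycles / clock_ratio
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical numbers: 2x faster crypto, clock reduced to 85%.
    for fraction in (0.05, 0.25, 0.80):
        result = relative_throughput(fraction, 2.0, 0.85)
        print(f"crypto share {fraction:.0%}: {result:.2f}x baseline")
```

In this model, a workload that spends only a small share of its time in crypto ends up slower overall, because the lower clock penalizes the rest of the code more than the wider vectors help.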


                  • #19
                    Originally posted by coder View Post
                    At lower clock speeds, though! One developer found the impact on clock speed was so dramatic that using AVX-512 for crypto resulted in a net decrease of server throughput!

                    https://blog.cloudflare.com/on-the-d...uency-scaling/
                    I had that in mind too, indeed, but that's on Intel. So now I'm really wondering: there have been three years between that Cloudflare post and now, so maybe AMD has a much better implementation here. If AMD manages to keep the clock high, that means it's for sure useful for non-server workloads (e.g., browsers). If they manage not to disable SMP while processing these instructions (I doubt it, but it might be possible), it could have a huge impact for servers too.

                    AMD managed to make AES extremely fast in Zen 1 already; who knows what they will pull off here?


                    • #20
                      Originally posted by coder View Post
                      At lower clock speeds, though! One developer found the impact on clock speed was so dramatic that using AVX-512 for crypto resulted in a net decrease of server throughput!

                      https://blog.cloudflare.com/on-the-d...uency-scaling/
                      I was talking about the AVX version, which only has a moderate decrease in frequency. Also, the AVX-512 version should have 4 times the bandwidth.
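The "twice" and "4 times" figures follow from how many 128-bit AES blocks fit into one vector register. A minimal sketch of that arithmetic, assuming the same number of AES instructions issue per cycle at every width (an idealization) and using hypothetical clock ratios; the function names are made up for illustration:

```python
AES_BLOCK_BITS = 128  # AES always operates on 128-bit blocks


def blocks_per_instruction(vector_bits):
    """Number of AES blocks one vector AES instruction can process."""
    if vector_bits <= 0 or vector_bits % AES_BLOCK_BITS != 0:
        raise ValueError("vector width must be a positive multiple of 128")
    return vector_bits // AES_BLOCK_BITS


def peak_ratio(vector_bits, clock_ratio=1.0):
    """Ideal throughput relative to 128-bit AES-NI at full clock.

    clock_ratio is throttled clock / normal clock (hypothetical input).
    """
    return blocks_per_instruction(vector_bits) * clock_ratio


if __name__ == "__main__":
    for width in (128, 256, 512):
        print(f"{width}-bit: {blocks_per_instruction(width)} block(s) per instruction")
```

With a clock ratio below 1.0, the ideal gain shrinks proportionally, which is why the frequency behaviour of each width matters as much as the lane count.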
