AMD Sends Out Patches Adding "Znver3" Support To GNU Binutils With New Instructions


  • #11
    Since Ryzen had such great benchmarks on AES, I wonder what the new VAES set gives here? Is there a way for Michael to test that?



    • #12
      Originally posted by rubdos
      Since Ryzen had such great benchmarks on AES, I wonder what the new VAES set gives here? Is there a way for Michael to test that?
      Ideally twice the number of bits processed per clock cycle.
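
As a back-of-the-envelope illustration of that estimate: an AES block is always 128 bits, so one 256-bit VAES instruction covers two blocks where one 128-bit AES-NI instruction covers one. The sketch below only counts blocks per instruction. It assumes the same instruction throughput and clock speed at every width, which real hardware may not deliver, and `blocks_per_instruction` is a made-up helper name.

```python
AES_BLOCK_BITS = 128  # AES operates on 128-bit blocks regardless of key size


def blocks_per_instruction(vector_bits: int) -> int:
    """Return how many AES blocks one AES round instruction covers at this vector width."""
    if vector_bits <= 0 or vector_bits % AES_BLOCK_BITS != 0:
        raise ValueError("vector width must be a positive multiple of 128 bits")
    return vector_bits // AES_BLOCK_BITS


if __name__ == "__main__":
    # 128-bit AES-NI, 256-bit VAES, 512-bit VAES (the AVX-512 form)
    for width in (128, 256, 512):
        print(f"{width}-bit vector: {blocks_per_instruction(width)} AES block(s) per instruction")
```

Whether that ratio shows up in practice depends on how many such instructions the core can issue per cycle and on the clock it sustains while doing so.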



      • #13
        Originally posted by carewolf

        Ideally twice the number of bits processed per clock cycle.
        I mean, that could be a HUGE deal in data centres, no? "Ryzen 3 twice the speed of already record holder Ryzen 2 on OpenSSL AES, buy EPYC now!"

        I suppose it won't be twice the bandwidth, but it would be interesting to see some numbers on it. Also, since vector instructions could temporarily disable SMP locally (is that still true?), it might only be truly useful in non-server applications. Multicore benchmarks warranted!



        • #14
          Originally posted by ms178
          AVX-512 is supposed to come with Zen 4, hopefully with a better implementation than Intel's.
          Sadly, AVX-512 is broken by design.

          If just one library call executes just one AVX-512 instruction, every SSE and AVX operation afterwards burns more power, because it always has to copy the upper 256 bits of each vector register. Of course, you could always terminate AVX-512 code blocks with VZEROUPPER, but that potentially limits its use in smaller functions.

          ARM's SVE is a much better approach, if you really must have larger vectors. Better still would be to use a GPU or purpose-built AI accelerator.



          • #15
            Originally posted by zxy_thf
            Zen 4's AVX-512 support might be like Zen 1's AVX2 support, i.e., emulating 512-bit operations with 256-bit ALUs.
            However, due to the tremendous cost of AVX-512 in die area, this approach might also be another "worse is better" solution.
            Aside from the penalty you incur in mixed 128-bit or 256-bit + 512-bit workloads, there's the unavoidable downside of bigger registers and larger context. So, even a half-width implementation isn't going to be an entirely positive development.



            • #16
              Originally posted by carewolf
              Ideally twice the number of bits processed per clock cycle.
              At lower clock speeds, though! One developer found the impact on clock speed was so dramatic that using AVX-512 for crypto resulted in a net decrease of server throughput!

              While I was writing the post comparing the new Qualcomm server chip, Centriq, to our current stock of Intel Skylake-based Xeons, I noticed a disturbing phenomena.


              If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.
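
To make the "net decrease" effect concrete, here is a rough Amdahl-style model. It is only a sketch: it assumes the lowered clock applies to all code on the core, not just the vectorized crypto, and the function name and the example numbers (5% of time in crypto, 2x crypto speedup, clock at 85% of baseline) are invented for illustration, not taken from the linked post.

```python
def relative_throughput(crypto_fraction: float, crypto_speedup: float, clock_ratio: float) -> float:
    """Throughput relative to a baseline of 1.0.

    crypto_fraction: share of baseline run time spent in crypto code (0..1)
    crypto_speedup:  per-clock speedup of the crypto code from wider vectors
    clock_ratio:     throttled clock / baseline clock, assumed to hit ALL code
    """
    if not 0.0 <= crypto_fraction <= 1.0:
        raise ValueError("crypto_fraction must be between 0 and 1")
    if crypto_speedup <= 0.0 or clock_ratio <= 0.0:
        raise ValueError("crypto_speedup and clock_ratio must be positive")
    # Work measured in baseline-clock time units: crypto shrinks, the rest does not.
    work = (1.0 - crypto_fraction) + crypto_fraction / crypto_speedup
    # A lower clock stretches all of that work.
    new_time = work / clock_ratio
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the shape of the trade-off.
    print("with throttling:   ", relative_throughput(0.05, 2.0, 0.85))
    print("without throttling:", relative_throughput(0.05, 2.0, 1.0))
```

With these made-up inputs the throttled case comes out below 1.0 and the unthrottled case slightly above it: when crypto is a small share of the work, a modest clock drop can outweigh a large crypto speedup.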



              • #17
                Originally posted by coder
                At lower clock speeds, though! One developer found the impact on clock speed was so dramatic that using AVX-512 for crypto resulted in a net decrease of server throughput!

                While I was writing the post comparing the new Qualcomm server chip, Centriq, to our current stock of Intel Skylake-based Xeons, I noticed a disturbing phenomena.

                I had that in mind too, indeed, but that's on Intel. So now I'm really wondering: there have been three years between that Cloudflare post and now, so maybe AMD has a much better implementation here. If AMD manages to keep the clock high, it's for sure useful for non-server workloads (e.g., browsers). If they manage not to disable SMP while processing these instructions (I doubt it, but it might be possible), it could have a huge impact for servers too.

                AMD already managed to make AES extremely fast in Zen 1, so who knows what they'll pull off here?



                • #18
                  Originally posted by coder
                  At lower clock speeds, though! One developer found the impact on clock speed was so dramatic that using AVX-512 for crypto resulted in a net decrease of server throughput!

                  While I was writing the post comparing the new Qualcomm server chip, Centriq, to our current stock of Intel Skylake-based Xeons, I noticed a disturbing phenomena.

                  I was talking about the AVX version. It only has a moderate decrease in frequency. Also, the AVX-512 version should have 4 times the bandwidth.



                  • #19
                    Several AVX-512 frequency improvements were described in the Hot Chips 2020 Ice Lake Server presentation.


