**rubdos** · 19 October 2020, 06:39 AM

Since Ryzen had such great benchmarks on AES, I wonder what the new VAES set gives here? Is there a way to test that for Michael?
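For anyone who wants to check whether a given machine even exposes VAES before benchmarking, here is a minimal sketch. It assumes Linux and relies on the feature flags the kernel prints in `/proc/cpuinfo` (`aes`, `vaes`, `vpclmulqdq`, `avx2`, `avx512f`). The helper names `parse_cpu_flags` and `report` are made up for this example.

```python
from pathlib import Path

# Flag names as Linux prints them in /proc/cpuinfo.
# The selection below is just what seems relevant to this thread.
FLAGS_OF_INTEREST = ("aes", "vaes", "vpclmulqdq", "avx2", "avx512f")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. non-x86 or empty input)


def report(cpuinfo_text):
    """Map each flag of interest to True/False depending on presence."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in FLAGS_OF_INTEREST}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else ""
    for name, present in report(text).items():
        print(f"{name:12s} {'yes' if present else 'no'}")
```

If `vaes` shows up, something like `openssl speed -evp aes-256-gcm` would be one way to get numbers, though whether a given OpenSSL build actually takes the VAES code path is a separate question.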

**carewolf** · 19 October 2020, 07:06 AM

Originally posted by rubdos

Since Ryzen had such great benchmarks on AES, I wonder what the new VAES set gives here? Is there a way to test that for Michael?

Ideally twice the number of bits processed per clock cycle.

**rubdos** · 19 October 2020, 07:51 AM

Originally posted by carewolf

Ideally twice the number of bits processed per clock cycle.

I mean, that could be a HUGE deal in data centres, no? "Ryzen 3 twice the speed of already record holder Ryzen 2 on OpenSSL AES, buy EPYC now!"

I suppose it won't be twice the bandwidth though, but it should be interesting to see some numbers on it. Also, since vector instructions could temporarily disable SMP locally (is that still true?), it might only be truly useful in non-server applications. Multicore benchmarks warranted!

**coder** · 19 October 2020, 10:26 AM

Originally posted by ms178

AVX-512 is supposed to come with Zen 4, hopefully with a better implementation than Intel's.

Sadly, AVX-512 is broken by design.

If just one library call executes just one AVX-512 instruction, suddenly every SSE and AVX operation burns more power by virtue of always having to copy the upper 256 bits of each vector register. Of course, you could always terminate AVX-512 code blocks with VZEROUPPER, but that potentially limits its use in smaller functions.

ARM's SVE is a much better approach, if you really must have larger vectors. Better still would be to use a GPU or purpose-built AI accelerator.

**coder** · 19 October 2020, 10:30 AM

Originally posted by zxy_thf

Zen 4's AVX-512 support might be like Zen 1's AVX2 support, i.e., emulating 512-bit operations with 256-bit ALUs.
However, due to the tremendous cost of AVX-512 in die area, this approach might also be another "worse is better" solution.

Aside from the penalty you incur in mixed 128-bit or 256-bit + 512-bit workloads, there's the unavoidable downside of bigger registers and larger context. So, even a half-width implementation isn't going to be an entirely positive development.

**coder** · 19 October 2020, 10:37 AM

Originally posted by carewolf

Ideally twice the number of bits processed per clock cycle.

At lower clock speeds, though! One developer found the impact on clock speed was so dramatic that using AVX-512 for crypto resulted in a net decrease of server throughput!

On the dangers of Intel's frequency scaling

https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

While I was writing the post comparing the new Qualcomm server chip, Centriq, to our current stock of Intel Skylake-based Xeons, I noticed a disturbing phenomena.

If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.
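To make the "net decrease" argument concrete, here is a back-of-the-envelope sketch. It assumes a deliberately pessimistic toy model in which the lower clock applies to the whole workload, not only to the vectorized part. The function name and all numbers are hypothetical and not taken from the Cloudflare post.

```python
def relative_throughput(vector_fraction, vector_speedup, clock_ratio):
    """Throughput relative to the baseline (1.0 means no change).

    vector_fraction: share of baseline run time spent in code that gets vectorized
    vector_speedup:  per-clock speedup of that code (2.0 = twice the bits per cycle)
    clock_ratio:     new clock / old clock once the wide units are in use
    Assumption: the reduced clock affects ALL code on the core.
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if vector_speedup <= 0 or clock_ratio <= 0:
        raise ValueError("vector_speedup and clock_ratio must be positive")
    # Time for the untouched code, slowed down only by the lower clock.
    scalar_time = (1.0 - vector_fraction) / clock_ratio
    # Time for the vectorized code: faster per clock, but at the lower clock.
    vector_time = vector_fraction / (vector_speedup * clock_ratio)
    return 1.0 / (scalar_time + vector_time)


# Hypothetical web-server-like case: little crypto, noticeable clock drop.
print(relative_throughput(0.025, 2.0, 0.85))
# Hypothetical crypto-heavy case: same clock drop, mostly vectorized work.
print(relative_throughput(0.90, 2.0, 0.85))
```

Under these made-up inputs the first case comes out below 1.0 (a net loss) and the second above 1.0, which is the shape of the trade-off: the wider units only pay off when enough of the run time actually uses them.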

**rubdos** · 19 October 2020, 10:45 AM

Originally posted by coder

At lower clock speeds, though! One developer found the impact on clock speed was so dramatic that using AVX-512 for crypto resulted in a net decrease of server throughput!

On the dangers of Intel's frequency scaling

https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

While I was writing the post comparing the new Qualcomm server chip, Centriq, to our current stock of Intel Skylake-based Xeons, I noticed a disturbing phenomena.

I had that in mind too, indeed, but that's on Intel. So now I'm really wondering: there have been three years between that Cloudflare post and now, so maybe AMD has a way better implementation here. If AMD manages to keep the clock high, that means it's for sure useful for non-server workloads (e.g., browsers). If they manage not to disable SMP while processing these instructions (I doubt it, but it might be possible), it could have a huge impact for servers too.

AMD managed to make AES extremely fast in Zen 1 already, so who knows what they'll pull off here?

**carewolf** · 19 October 2020, 10:59 AM

Originally posted by coder

At lower clock speeds, though! One developer found the impact on clock speed was so dramatic that using AVX-512 for crypto resulted in a net decrease of server throughput!

On the dangers of Intel's frequency scaling

https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

While I was writing the post comparing the new Qualcomm server chip, Centriq, to our current stock of Intel Skylake-based Xeons, I noticed a disturbing phenomena.

I was talking about the AVX version. It only has a moderate decrease in frequency. Also, the AVX-512 version should have 4 times the bandwidth.
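The 2x and 4x figures follow from how many 128-bit AES blocks fit into one vector register. A tiny sketch of that arithmetic, as a peak per-instruction figure only, ignoring clock speed and how many execution units a given core has (the function name is made up for illustration):

```python
AES_BLOCK_BITS = 128  # AES always works on 128-bit blocks, whatever the key size


def aes_blocks_per_instruction(register_bits):
    """Number of independent AES blocks one vector AES instruction can process."""
    if register_bits <= 0 or register_bits % AES_BLOCK_BITS != 0:
        raise ValueError("register width must be a positive multiple of 128 bits")
    return register_bits // AES_BLOCK_BITS


for width in (128, 256, 512):
    # 128-bit: classic AES-NI, 256-bit: VAES on AVX registers, 512-bit: VAES on AVX-512
    print(width, aes_blocks_per_instruction(width))
# prints:
# 128 1
# 256 2
# 512 4
```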

**jayN** · 19 October 2020, 11:39 AM

Several AVX-512 frequency improvements were described in the Hot Chips 2020 Ice Lake Server presentation.

Hot Chips 2020 Live Blog: Next Gen Intel Xeon, Ice Lake-SP (9:30am PT)

https://www.anandtech.com/show/15984/hot-chips-2020-live-blog-next-gen-intel-xeon-ice-lakesp-930am-pt

AMD Sends Out Patches Adding "Znver3" Support To GNU Binutils With New Instructions
