GCC 12 Compiler Squaring Away Its AVX-512 FP16 Support


  • GCC 12 Compiler Squaring Away Its AVX-512 FP16 Support

    Phoronix: GCC 12 Compiler Squaring Away Its AVX-512 FP16 Support

    In recent weeks the AVX-512 FP16 support has been landing within the GNU Compiler Collection codebase for next year's GCC 12 release...

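    For reference, a minimal sketch of what the new support looks like from C, assuming a GCC 12 snapshot with the FP16 patches, the -mavx512fp16 option, and the __m512h / _Float16 intrinsics from <immintrin.h> (the function below is just an illustrative example, not from the article):

```c
/* Sketch: c[i] += a[i] * b[i] on half-precision floats, 32 lanes per
 * 512-bit vector.  Assumes GCC 12 with AVX-512 FP16 support, e.g.:
 *   gcc -O2 -mavx512fp16 -c fma_fp16.c
 */
#include <immintrin.h>
#include <stddef.h>

void fma_fp16(_Float16 *c, const _Float16 *a, const _Float16 *b, size_t n)
{
    /* n is assumed to be a multiple of 32 to keep the sketch short. */
    for (size_t i = 0; i < n; i += 32) {
        __m512h va = _mm512_loadu_ph(a + i);
        __m512h vb = _mm512_loadu_ph(b + i);
        __m512h vc = _mm512_loadu_ph(c + i);
        vc = _mm512_fmadd_ph(va, vb, vc);   /* 32 half-precision FMAs */
        _mm512_storeu_ph(c + i, vc);
    }
}
```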

  • #2
    Sapphire Rapids might be adding AVX-512 FP16, but it's worth noting that AVX-512 is gone entirely from Alder Lake. It will be interesting to see what Zen 4 does.


    • #3
      Would be nice to have all the new AVX-512 instructions, even if only running at 256bit at a time. Though it might not be that easy to implement :/
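      For what it's worth, part of that already exists at the ISA level: AVX-512VL exposes the new instruction forms (EVEX encoding, masking, 32 registers) at 256-bit and 128-bit operand width. A minimal sketch, assuming a compiler with AVX-512VL enabled (the function name is just illustrative):

```c
/* Sketch: EVEX-encoded, masked 256-bit add via AVX-512VL.
 * Build with e.g.:  gcc -O2 -march=skylake-avx512 -c vl_demo.c
 */
#include <immintrin.h>

/* Adds b into a only in the lanes selected by the 8-bit mask k;
 * the unselected lanes keep their original value from a. */
__m256 masked_add(__m256 a, __m256 b, __mmask8 k)
{
    return _mm256_mask_add_ps(a, k, a, b);
}
```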


      • #4
        Originally posted by carewolf View Post
        Would be nice to have all the new AVX-512 instructions, even if only running at 256bit at a time. Though it might not be that easy to implement :/
        Indeed. What if Zen 4, Gracemont, Golden Cove and so on accepted the "full" AVX512 instruction set and just broke down wider instructions into micro-ops? Would it really add that much overhead?

        Zen 1 did that, and it seems fine. ARMv9 sort of does that, and it scales down to tiny, low power cores. What makes it so hard for AMD/Intel to do in newer cores?


        • #5
          Originally posted by brucethemoose View Post

          Indeed. What if Zen 4, Gracemont, Golden Cove and so on accepted the "full" AVX512 instruction set and just broke down wider instructions into micro-ops? Would it really add that much overhead?

          Zen 1 did that, and it seems fine. ARMv9 sort of does that, and it scales down to tiny, low power cores. What makes it so hard for AMD/Intel to do in newer cores?
          Well, 256-bit AVX was specifically designed to make that easy. AVX-512 is not similarly designed, so breaking the instructions in half is not trivial. It could be done, but it might create situations where it is slower than using 256-bit AVX directly.
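          On the compiler side there is at least a knob for that trade-off already: GCC's -mprefer-vector-width=256 keeps the auto-vectorizer at 256-bit vectors even when AVX-512 is enabled. A sketch, assuming current GCC (the function is just an example):

```c
/* Sketch: target an AVX-512 machine but prefer 256-bit vectors,
 * e.g. to sidestep 512-bit width/frequency penalties:
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=256 -c saxpy.c
 */
#include <stddef.h>

void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];   /* auto-vectorized with 256-bit vectors */
}
```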


          • #6
            Originally posted by carewolf View Post

            Well, 256-bit AVX was specifically designed to make that easy. AVX-512 is not similarly designed, so breaking the instructions in half is not trivial. It could be done, but it might create situations where it is slower than using 256-bit AVX directly.
            Ah.

            Assuming it's still possible, I think the standardization would be worth the extra decoding complexity, but I guess it depends on how big that penalty is.

            Full AVX512 would probably take up too much die space on the atom-like cores... or maybe not? The Xeon Phi cores weren't that big IIRC, and the Centaur cores that support AVX512 aren't particularly huge either.


            • #7
              Originally posted by brucethemoose View Post

              Ah.

              Assuming it's still possible, I think the standardization would be worth the extra decoding complexity, but I guess it depends on how big that penalty is.

              Full AVX512 would probably take up too much die space on the atom-like cores... or maybe not? The Xeon Phi cores weren't that big IIRC, and the Centaur cores that support AVX512 aren't particularly huge either.
              Sure. I think for Intel it is just a matter of market differentiation. For AMD, it is a matter of whether it is worth it.


              • #8
                Originally posted by carewolf View Post
                Would be nice to have all the new AVX-512 instructions, even if only running at 256bit at a time. Though it might not be that easy to implement :/
                It's not only about vector pipeline width. AVX-512 at least quadruples the size of the vector register file, by doubling both the number of registers and their width (16 × 256-bit = 512 bytes versus 32 × 512-bit = 2 KiB, before any renaming). On Intel's "little" cores, even that increase in area might've been deemed too much.


                • #9
                  Originally posted by brucethemoose View Post
                  ARMv9 sort of does that, and it scales down to tiny, low power cores. What makes it so hard for AMD/Intel to do in newer cores?
                  ARM SVE is a different animal. Even though the architecture lets an implementation scale vectors up to 2048 bits, it doesn't require 2048-bit registers on all implementations. The minimum width is 128 bits, at which point the entire vector footprint, registers included, is just 128 bits wide.

                  What SVE does differently is to expose the implementation width in a way that makes it easy for software to adapt to its vector size. This stands in contrast to the x86 approach of requiring distinct opcodes for different vector widths.
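                  A rough illustration of that programming model, using the ACLE intrinsics from <arm_sve.h> (the loop is the standard vector-length-agnostic idiom; the function name is just an example):

```c
/* Sketch: the same binary runs on 128-bit and 2048-bit SVE hardware;
 * svcntw() and the whilelt predicate pick up the vector length at run time.
 * Build with e.g.:  gcc -O2 -march=armv8.2-a+sve -c vla_add.c
 */
#include <arm_sve.h>
#include <stdint.h>

void add_f32(float *dst, const float *a, const float *b, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {   /* elements per vector */
        svbool_t pg = svwhilelt_b32(i, n);        /* predicate covers the tail */
        svfloat32_t va = svld1(pg, a + i);
        svfloat32_t vb = svld1(pg, b + i);
        svst1(pg, dst + i, svadd_m(pg, va, vb));
    }
}
```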

                  BTW, ARM's new mid-power cores (the A510) share a single, 128-bit vector pipeline between two of them, in the default configuration. So, that would suggest SVE has more area cost than conventional ARMv8 128-bit SIMD.

                  I haven't read about a new A3x-series core for ARMv9, which would be the "tiny, low power" cores. Surely, it's only a matter of time. ...and maybe a new process node, for them to be viable.
                  Last edited by coder; 05 October 2021, 05:15 AM.


                  • #10
                    Originally posted by brucethemoose View Post
                    Full AVX512 would probably take up too much die space on the atom-like cores... or maybe not? The Xeon Phi cores weren't that big IIRC, and the Centaur cores that support AVX512 aren't particularly huge either.
                    The KNL dies were huge and the cores were otherwise very simple. This enabled them to have more GPU-like allocation of vector compute vs. scalar & control logic, I think.

                    With modern "little" cores, there's a lot more OoO overhead, much wider support for scalar ops, and they have to be cost-competitive even in low-end 4-core implementations. I'm guessing that's the issue.
                    Last edited by coder; 05 October 2021, 05:17 AM.
