Originally posted by Developer12
One clue is this point he made, in that post:
"I want my power limits to be reached with regular integer code, not with some AVX512 power virus that takes away top frequency (because people ended up using it for memcpy!)"
A significant downside of AVX-512 on Intel's 14nm CPUs (Skylake-X and Cascade Lake) was the sometimes severe clock-throttling it triggered, even from modest use. The frequency penalty was shown to be large enough that, unless you were using AVX-512 heavily, you were often better off avoiding it entirely.
When you also consider the Balkanization of AVX-512 subsets, it seems clear that Intel jumped the gun on implementing it. Deploying it on 14nm technology just came with too many caveats: power/clockspeed penalties and die space.
However, Michael's tests of Zen 4 and Sapphire Rapids have shown virtually no clock-throttling or excessive power utilization, and Zen 4 even managed to implement it in a way that seems quite area-efficient, largely by double-pumping its existing 256-bit datapaths rather than adding native 512-bit ones. Not only that, but Zen 4 came right out of the gate with virtually all subsets implemented (except for a couple of the most recent ones).
That said, Zen 5 is rumored to have gone for native 512-bit execution paths. So, we'll have to see how that pans out, but I think the current generation of AVX-512 is relatively harmless. We can perhaps turn our criticism to other die space hogs, like Intel's AMX, which also bloats thread context by a whopping 8 kB, and so far isn't useful for anything but AI inferencing.