Originally posted by pWe00Iri3e7Z9lHOX2Qx
Not to mention issues like the need to vzeroupper:
AVX2 got some bad press when it launched in Haswell CPUs, due to clock-throttling issues. AVX-512 was like that, only much worse. There's nothing inconsistent here.
Another legit complaint about AVX-512 was the degree of fragmentation between Intel's different product lines. By the time AMD implemented it in Zen 4, they were able to support virtually all facets in one go.
BTW, I think Linus way overestimated how much die space AVX-512 uses.
It did noticeably bloat the thread context: the vector registers occupy 4x the footprint of AVX2, since Intel doubled both the number and the size of the architectural vector registers (32 * 64 B = 2 KiB, versus 16 * 32 B = 512 B for AVX2). As cache sizes and memory bandwidth both increase, that's less of an issue.
I think he had a legit point about people using it for simple things like string ops: the resulting clock-throttling can make it a net loss for application-level performance. That's not true of Sapphire Rapids or Zen 4, but it was certainly a legit concern on 14 nm Xeons, as demonstrated here:
"the AVX-512 workload consumes only 2.5% of the CPU time when all the requests are ChaCha20-Poly1305, and less then 0.3% when doing ChaCha20-Poly1305 for 10% of the requests. Irregardless the CPU throttles down, because that what it does when it sees AVX-512 running on all cores."
https://blog.cloudflare.com/on-the-d...uency-scaling/
Using just a tiny bit of AVX-512 was enough to trigger heavy clock-throttling, which slowed the rest of the application code down by more than the AVX-512 sped up the portion where it was used.
So, yeah. If you introduce a feature before its time, and the implementation comes with so many caveats and pitfalls, it's only natural that you catch some blowback!