I think the real problems with AVX-512 are:
a) the fragmentation into a lot of AVX-512 instruction subsets, which makes this thing ridiculous. I mean, instruction sets like SSE, AVX and AVX2 are already too fragmented, and then you get 10 flavors of AVX-512 on top of that. WTF.
b) the powering-down (downclocking) issue that makes the rest of the CPU crawl, so mixing AVX-512 instructions in with other code, or having one instance on a server use AVX-512, creates problems for the other instances on that server. I think it was done due to the 14nm issue.
c) the various workarounds that need to be implemented to make AVX perform correctly... like vzeroupper in the code... or having the OS check whether the upper (ymm/zmm) register state was used, so it can do things differently and avoid slowdowns. These arise out of Intel's "hacks" and then end up in OS code as workarounds.
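To make point (a) concrete, here is a rough sketch of the kind of dispatch table you end up writing. The tier names and the grouping of subsets are made up for illustration; only the flag spellings are real (they follow what Linux prints in /proc/cpuinfo). The point is that "has AVX-512" is not one check, it is a set of checks per code path.

```python
# Hypothetical code-path tiers, best first. Each tier lists the CPU flags it
# needs, spelled the way Linux reports them in /proc/cpuinfo.
KERNEL_TIERS = [
    ("avx512-vnni", {"avx512f", "avx512bw", "avx512vl", "avx512_vnni"}),
    ("avx512-base", {"avx512f", "avx512bw", "avx512dq", "avx512vl"}),
    ("avx2", {"avx2", "fma"}),
    ("sse2", {"sse2"}),
]


def pick_kernel(cpu_flags):
    """Return the first tier whose required flags are all in cpu_flags.

    cpu_flags is a whitespace-separated flag string, e.g. the "flags" line
    of /proc/cpuinfo. Falls back to "scalar" if nothing matches.
    """
    available = set(cpu_flags.split())
    for name, required in KERNEL_TIERS:
        if required <= available:  # subset test: every required flag present
            return name
    return "scalar"


# Abbreviated, illustrative flag strings (not full cpuinfo dumps).
print(pick_kernel("sse2 avx avx2 fma"))
print(pick_kernel("sse2 avx avx2 fma avx512f avx512cd"))
print(pick_kernel("sse2 avx avx2 fma avx512f avx512dq avx512cd avx512bw avx512vl"))
```

Note the middle case: a chip that reports avx512f but not the bw/dq/vl subsets still falls back to the AVX2 path here, because none of the AVX-512 tiers are fully satisfied. Every extra subset adds another row like this to test and maintain.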
In theory AVX-512 should be much more power efficient than AVX2. It doesn't matter whether, say, a CPU that used to burn 100W now burns 140W, or whether it is underclocked; what matters is how many results you get out of a given number of watts. If, for example, AVX-512 can double the throughput of AVX2, that's a 100% gain. If consumption goes from 100W to 140W to get it, that's still a fantastic perf/watt gain. And if you underclock to, say, 70% of the speed and only get +70% throughput for, say, similar AVX2 watts, then that's good too. The "AVX2 should be enough" argument that Linus made cannot and does not increase perf/watt. Redditors say "ohhh but it burns so much". Yeah, but it puts out 60-80% more throughput in terms of results. They are not factoring in perf/watt, only watts. Watts can be adjusted downwards, and then you keep the throughput gains vs the inferior vectors (AVX2) at similar watts.
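The arithmetic behind those two examples, as a quick sketch. The numbers are the hypothetical ones from the paragraph above (2x throughput at 140W vs 100W, and +70% throughput at the same watts), not measurements, and the function name is just something I made up.

```python
def perf_per_watt_gain(throughput_ratio, power_ratio):
    """Relative perf/watt change vs the baseline, as a fraction.

    Both arguments are ratios against the AVX2 baseline, e.g. 2.0 means
    double the throughput and 1.4 means 140W instead of 100W.
    """
    return throughput_ratio / power_ratio - 1.0


# Case 1: double the throughput while power goes from 100W to 140W.
case1 = perf_per_watt_gain(2.0, 140 / 100)

# Case 2: underclocked, +70% throughput at the same watts as AVX2.
case2 = perf_per_watt_gain(1.7, 1.0)

print(f"case 1: {case1:+.0%} perf/watt")  # prints "case 1: +43% perf/watt"
print(f"case 2: {case2:+.0%} perf/watt")  # prints "case 2: +70% perf/watt"
```

So even in the "it burns 40% more" case you come out well ahead per watt, as long as the throughput gain is bigger than the power increase.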
Personally I would like wide AVX2 and AVX-512 adoption in chips, because without it code lags in adopting the speed enhancements that the CPU can actually give. But I would also like the chips to be less "hacked", without problems like the split 128-bit lanes and needing vzeroupper, or irrelevant slowdown issues with zmm register usage, etc. The moment Intel gives the set to the people, it should be good to go, not need hacks in code or the OS to play well. I think it creates needless irritation for everyone. And please, no 500 flavors of AVX-512. This is a fricking nightmare for coders to support.
In terms of chips I would also like to see larger L1 instruction caches, because the encodings for these instructions have been getting larger and larger and yet L1 instruction cache size has been almost the same for 15 years. Sure, we have bigger μop caches, but it's not the same.