Well, just to fix some wrong assumptions here:
AVX-512 is roughly AVX2 + AVX with wider registers (512 bits instead of 256), while AVX2 + AVX is roughly SSE4.2 with wider registers (256 bits instead of 128), plus a few extra extensions here and there. All of them handle both FP and INT operations.
The main problem here is the actual Intel implementation. Intel went with a full per-core implementation on their 14nm++++++++ process, but with so many wide registers the power envelope goes to hell, simply because that process is not efficient enough anymore. And since AMD put them through the wringer, they also packed in an idiotic number of cores, so now those CPUs need a mini nuclear plant to run.
Also, Intel for some freaking reason decided to split AVX-512 into something like 12 different subsets on the software side. This could lead to implementation hell when other x86 licensees implement them, and even on Intel's own parts the number of combinations you have to check for when testing AVX-512 extension support is already kinda sick. Check the list here: https://en.wikipedia.org/wiki/AVX-512.
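To make the "combinations" problem concrete, here is a minimal sketch of what a check can look like on Linux. It assumes the kernel exposes the subsets as `avx512*` entries on the `flags` line of `/proc/cpuinfo`; the helper names (`avx512_subsets`, `supports`, `read_cpu_flags`) are mine, not from any library, and the required set in the usage part is just an example.

```python
from pathlib import Path


def avx512_subsets(flags_line):
    """Return the sorted AVX-512 flags found in a cpuinfo-style flag string."""
    return sorted({f for f in flags_line.split() if f.startswith("avx512")})


def supports(flags_line, required):
    """True only if every flag in `required` is present in the flag string."""
    present = set(flags_line.split())
    return all(flag in present for flag in required)


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the first 'flags' line of /proc/cpuinfo, or '' if unavailable.

    Assumption: x86 Linux, where the line looks like 'flags : fpu vme ...'.
    """
    try:
        for line in Path(path).read_text().splitlines():
            if line.startswith("flags") and ":" in line:
                return line.split(":", 1)[1]
    except OSError:
        pass
    return ""


if __name__ == "__main__":
    flags = read_cpu_flags()
    print("AVX-512 subsets on this CPU:", avx512_subsets(flags) or "none")
    # A code path using byte/word ops on 256-bit vectors would need all three:
    needed = ["avx512f", "avx512bw", "avx512vl"]
    print("usable for that path:", supports(flags, needed))
```

The point is that checking for "AVX-512" alone is never enough: every code path has to name the exact subsets it uses and fall back to AVX2 or SSE when any one of them is missing.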
Also note, THIS IS IMPORTANT: the kernel itself barely uses any of these, because outside a few modules (crypto mostly) it doesn't do operations at a high enough level to need them in any way, shape or form. What it does have to do is enable them for user space and save/restore those registers on context switches (I'm guessing this is where Linus got pissed with the code). BUT in user space they are immensely useful, because these extensions don't only handle arithmetic vector operations (which you can do on a GPU or a CPU) but memory and cache operations as well (huge performance win).
Also, YOU CANNOT DO ALL VECTOR OPERATIONS ON A GPU!!! STOP SPREADING FALSE ASSERTIONS!!! Why? Because getting work to a GPU and back is horribly slow and a latency nightmare. So you ONLY use the GPU when the dataset is MASSIVE (gigabytes massive, not a couple thousand multiplications) AND you don't care about latency. iGPUs are even worse because their bandwidth is even more limited. For smaller jobs with a few thousand operations, OR when latency is important, that is when you use SIMD (SSE/AVX).
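A toy cost model shows why, as a sketch only. It assumes the GPU pays a fixed overhead per job (launch plus transfers) and then a lower per-operation cost than CPU SIMD. The function names are made up, and the numbers in the usage part are placeholders chosen to make the arithmetic easy, NOT measurements of any real hardware.

```python
import math


def simd_time_us(n_ops, cpu_us_per_op):
    """CPU SIMD: no fixed dispatch cost, time scales with the work."""
    return n_ops * cpu_us_per_op


def gpu_time_us(n_ops, gpu_fixed_us, gpu_us_per_op):
    """GPU: fixed overhead (launch + transfers) before any work gets done."""
    return gpu_fixed_us + n_ops * gpu_us_per_op


def break_even_ops(gpu_fixed_us, cpu_us_per_op, gpu_us_per_op):
    """Job size above which the GPU starts to win in this model."""
    if gpu_us_per_op >= cpu_us_per_op:
        return math.inf  # GPU never catches up
    return gpu_fixed_us / (cpu_us_per_op - gpu_us_per_op)


def pick_device(n_ops, gpu_fixed_us, cpu_us_per_op, gpu_us_per_op):
    """Pick whichever path is faster for this job size under the model."""
    cpu = simd_time_us(n_ops, cpu_us_per_op)
    gpu = gpu_time_us(n_ops, gpu_fixed_us, gpu_us_per_op)
    return "simd" if cpu <= gpu else "gpu"


if __name__ == "__main__":
    # Placeholder parameters, purely illustrative.
    params = dict(gpu_fixed_us=10.0, cpu_us_per_op=0.5, gpu_us_per_op=0.25)
    print("break-even job size:", break_even_ops(**params))
    for n in (10, 1000):
        print(n, "ops ->", pick_device(n, **params))
```

Below the break-even size the fixed overhead dominates and SIMD wins no matter how fast the GPU is per operation. And if you have a latency budget smaller than the fixed overhead, the GPU is out regardless of dataset size.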
@sophisticles Intel iGPUs have DEDICATED HARDWARE SILICON for video encoding, AKA Intel Quick Sync, and it is very well supported in video editors, BUT it is not usable for PRO video. That is where the CPU/CUDA encoders shine. Hence it is not relevant in this case, because it uses neither SIMD nor GPGPU. But yes, for non-pro video it is quite good, so enjoy.