Linus Torvalds: "I Hope AVX512 Dies A Painful Death"
-
Originally posted by dxin
Same goes for Intel GPUs. Why waste transistors on something nobody cares about?
- Likes 11
-
Originally posted by xpue
No, that depends on the number representation. It can be 3141592653589793238 in fixed point.
Floating-point numbers are critical for doing accurate computations, not just for printing random numbers on the screen.
- Likes 4
-
Originally posted by sophisticles
Not to mention that sometimes an iGPU is faster than a dGPU. For instance, I have a number of PCs; the fastest is an R5 1600 with 16 GB DDR4 and a GTX 1050, and the slowest is an i3-7100 with 16 GB DDR4 and no dGPU (it uses the iGPU).
I do a lot of video editing and routinely need to render out a file with a bunch of filters applied (I use Shotcut). With the first PC and a 50-minute source, the encode takes over 9 hours if I use software filters. If I enable GPU filters, the R5 + 1050 combo cuts that down to 5.5 to 6 hours. If I do the same encode on the i3 with GPU filters on the iGPU, the time drops to just over 3 hours.
This is repeatable with other test files. As near as I can tell, the iGPU cuts the time so much because it doesn't pay the memory-copy penalty (system RAM to GPU RAM) that the other system does.
I'm looking forward to Rocket Lake; that Gen 12 Xe iGPU should be awesome for the work I do.

Of course iGPUs are crucial when it comes to office work and laptop computing, and I for one wouldn't buy a laptop with a discrete GPU, let alone an Optimus piece of crap. But even in that use case, I think I prefer AMD's Vega graphics.
Last edited by omer666; 12 July 2020, 06:14 AM.
- Likes 7
-
TLDR: the cost of using AVX-512 is too high.
Some remarks after using and testing AVX-512 on various codes and platforms for years:
- Compilers are not able to efficiently vectorize code. Yes, sometimes you can see part of your code vectorized, but it is always faster when you use intrinsics, and lots of code, even in HPC, is not efficiently vectorized; there, it is (almost) all about scaling and optimizing inter-node communication/synchronization. So the cost of vectorizing code is really high, and the number of AVX-512 flavours does not help (and new ones keep coming!). There are also still no AoS-to-SoA loads like NEON's on ARM, which are really useful since most libraries (image processing, ...) use AoS layouts. Many people in the HPC community tend to believe that the Intel compiler can perform some kind of "black magic" autovectorization and that AVX-512 is a must-have (for its theoretical peak performance...).
- For memory-bound codes (stencils, SEM, ...), the performance increase of AVX-512 over AVX2 is around 20%, which is not bad but far from the 2x speedup you might expect. And that was measured on a Skylake Gold, which has two AVX-512 ports (a fusion of two AVX2 ports plus port 5) and runs at only 1.9 GHz when executing AVX-512 instructions; on the cheaper Silver parts, AVX-512 instructions are split across the two AVX2 ports, so you may not see any speedup at all.
- AVX-512 units are basically four SSE units glued together (an AVX unit is two SSE units). If you want to permute or shift values between 128-bit lanes, it comes at a cost (a permute2f128 instruction costs 3 cycles), so you cannot expect to scale from SSE to AVX-512 when your code requires moving values across lanes (as stencils do).
- I think one way to take advantage of such large units is to pair them with HBM, as the Fujitsu A64FX does, so that the vector units can actually be fed (though the A64FX SVE unit seems to be 4x128-bit if you look closely at the FMA and sqrt latencies in the documentation).
- I teach HPC programming, and the way students react the first time they see vectorized code using intrinsics (I use an RGB-to-grayscale example) indicates that something is wrong with the API. In fact, they prefer the CUDA programming model, which is more readable and looks like scalar code; ispc makes something similar possible for SIMD instruction sets.
- Regarding AVX-512 and games: just read the UnrealEngine code... Spoiler: you will only see a few SSE intrinsics operating on AoS 3D vectors.
- Likes 14
-
Originally posted by Archprogrammer
I really hope the RISC-V vector extension shows the way to a better programming model than SIMD. Then we could skip AVX-512 and feature levels for CPUs in the long run.
More generally, you can implement and expose SIMD in different ways. GPUs, for example, are essentially built on SIMD-style execution units, but a combination of clever hardware features and a well-thought-out programming model makes them much easier to program for vector-processing tasks than CPUs. It largely feels like scalar code to the programmer, yet if you help the hardware a bit by coalescing your loads/stores and keeping your branches convergent, it will run as fast as SIMD code on the CPU. Best of both worlds, IMO; I wish CPU vectorization worked like that too.
- Likes 1
-
SVE and RISC-V use the VLA (Vector Length Agnostic) paradigm, and I have played a little with it using intrinsics (GCC and armclang). I agree with HadrienG: it is a really huge constraint for code optimization when your optimization (unrolling, for example) depends on the register width (the FFT, for example). Of course, I have tested the autovectorization provided by both compilers, and it is not really good; IMO it will not improve dramatically.
- Likes 2
-
Originally posted by curfew
...
Originally posted by curfew
In this form you cannot even multiply it by ten
- Likes 6