Linus Torvalds: "I Hope AVX512 Dies A Painful Death"

  • #41
    Originally posted by Marnfeldt
    No, that depends on number representation. It can be 3141592653589793238 in fixed point.
    Last edited by xpue; 07-12-2020, 08:08 AM.


    • #42
      Originally posted by dxin
      Same goes with Intel GPUs. Why waste transistor on something nobody cares?
      Well, I care so deeply that I will go out of my way to buy a laptop that doesn't have a discrete GPU. I see it as a power-sucking monster and a waste of money. All I do is web browsing and development, and for anybody who doesn't play games or need acceleration for their workload, an Intel GPU is perfect: it has all the power required to accelerate desktop composition and show the occasional 3D thing.


      • #43
        I really hope the RISC-V vector extension shows the way to a better programming model than SIMD. Then we could skip AVX-512 and, in the long run, feature levels for CPUs.


        • #44
          Originally posted by xpue
          No, that depends on number representation. It can be 3141592653589793238 in fixed point.
          In this form you cannot even multiply it by ten without overflowing and losing the slightest bit of accuracy, idiot. 🤦‍♀️

          Floating point numbers are critical for doing accurate computations, not just printing random numbers on the screen.
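To make the overflow point concrete, here is a minimal Python sketch. It assumes the quoted value is stored in a signed 64-bit integer holding pi scaled by 10**18; the post does not say which fixed-point format was meant, so both the width and the scale are assumptions, and the helpers `fits_int64` and `wrap_int64` are hypothetical names used only for illustration.

```python
# Assumption: a signed 64-bit fixed-point value with 18 decimal fractional digits.
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)

PI_FIXED = 3141592653589793238  # pi * 10**18, the value quoted in the thread


def fits_int64(value: int) -> bool:
    """Return True if value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


def wrap_int64(value: int) -> int:
    """Emulate two's-complement wraparound of a signed 64-bit register."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


# Python integers never overflow, so this is the mathematically exact product.
times_ten = PI_FIXED * 10

print("pi fits in int64:      ", fits_int64(PI_FIXED))
print("pi * 10 fits in int64: ", fits_int64(times_ten))
print("wrapped 64-bit result: ", wrap_int64(times_ten))
```

The stored value fits, but the product does not, so a fixed 64-bit register would wrap around to an unrelated number. A fixed-point format with fewer fractional digits would leave more headroom at the cost of precision.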


          • #45
            Originally posted by sophisticles

            Not to mention sometimes the iGPU is faster than a dGPU. For instance, I have a number of PCs: the fastest is an R5 1600 with 16 GB DDR4 and a GTX 1050, and the slowest is an i3 7100 with 16 GB DDR4 and no dGPU (it uses the iGPU).

            I do a lot of video editing and routinely need to render out a file that has a bunch of filters applied (I use Shotcut). If I use the first PC with a 50 minute source, it takes over 9 hours to finish the encode if I'm using software filters. If I enable GPU filters, the R5 + 1050 combo cuts that time down to 5.5 to 6 hours. If I do the same encode on the i3 and use GPU filters with the iGPU, the time is down to just over 3 hours.

            This is repeatable with other test files. Near as I can tell, the iGPU cuts the time so much because it doesn't suffer the memory-copy penalty (from system RAM to GPU RAM) that the other system incurs.

            I'm looking forward to Rocket Lake, that Gen 12 Xe iGPU should be awesome for the work I do.
            There is also the fact that Nvidia sucks at 2D and computation on their consumer-level cards; that's why crypto-mining led to a shortage of AMD Radeon RX cards but didn't affect Nvidia as much.

            Of course iGPUs are crucial when it comes to office work and laptop computing, and I for one wouldn't buy a laptop with a discrete GPU, let alone an Optimus piece of crap. But even in that use case, I think I prefer AMD's Vega graphics.
            Last edited by omer666; 07-12-2020, 06:14 AM.


            • #46
              Yes, this manpower should be used to fix the long list of errata.


              • #47
                TLDR: the cost of using AVX-512 is too high.

                Some remarks after using and testing AVX-512 on various codes and platforms for years:
                - compilers are not able to vectorize code efficiently. Yes, sometimes you can see part of your code vectorized, but it is always faster when you use intrinsics. Lots of codes, even in HPC, are not efficiently vectorized; it is (almost) all about scaling and optimizing inter-node communication/synchronization. So the cost of vectorizing code is really high, and the number of AVX-512 flavours does not help (and new ones keep coming!). There are also still no AOS-to-SOA loads like NEON has on ARM, which are really useful since most libraries (image processing, ...) use AOS layouts. Lots of people in the HPC community tend to believe that the Intel compiler can perform a kind of “black magic” autovectorization and that AVX-512 is a must-have (for its theoretical peak performance…).
                - for memory-bound codes (stencils, SEM, ...), the performance increase when using AVX-512 over AVX2 is around 20%, which is not bad but far from the 2x speedup you might expect. And that was obtained on a Skylake Gold, which has 2 AVX-512 ports (fusion of the 2 AVX2 ports, plus port 5) that operate at only 1.9 GHz when executing AVX-512 instructions. On the cheaper Silver ones, AVX-512 instructions are split across the 2 AVX2 ports, so you may not see any speedup.
                - AVX-512 units are basically 4 SSE units glued together (AVX is 2 SSE). If you want to permute or shift values between 128-bit lanes, it comes at a cost (permute2f128 instructions cost 3 cycles), so you cannot expect to scale from SSE to AVX-512 when your code requires moving values across lanes (like stencils).
                - I think one way to take advantage of such large units is to combine it with HBM memory like the Fujitsu A64FX, to be able to feed AVX-512 units (but the A64FX SVE unit seems to be 4x128-bit if you look closely at the latencies of fma and sqrt in the documentation).
                - I teach HPC programming, and the way students react when they look at vectorized code using intrinsics for the first time (I use an RGB to grayscale example) just indicates that something is wrong with the API. In fact, they prefer the CUDA programming model, which is more readable and looks like scalar code, but it is possible to use ispc to get something similar for SIMD instruction sets.
                - regarding AVX-512 and games: just parse the code of the UnrealEngine… Spoiler: you will only see a few SSE intrinsics operating on AOS 3D vectors.
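For readers who have not met the AOS/SOA distinction mentioned in the list above, here is a rough Python sketch built around the same RGB to grayscale example. It is not SIMD code and not the teaching example from the post; it only shows the layout change that a de-interleaving load performs in hardware. The function names and the integer luma weights (77, 150, 29, roughly 0.299, 0.587 and 0.114 scaled by 256) are assumptions chosen for illustration.

```python
def aos_to_soa(pixels):
    """Split an interleaved [r0, g0, b0, r1, g1, b1, ...] buffer into three planes.

    AOS (array of structures) keeps each pixel's channels together;
    SOA (structure of arrays) keeps each channel contiguous, which is the
    layout a vector unit wants when it processes many pixels at once.
    """
    if len(pixels) % 3 != 0:
        raise ValueError("expected a multiple of 3 channel values")
    return pixels[0::3], pixels[1::3], pixels[2::3]


def grayscale_soa(r, g, b):
    """Compute 8-bit luma per pixel from separate channel planes.

    Assumed weights: 77/256, 150/256 and 29/256. They sum to exactly 256/256,
    so white maps to 255.
    """
    return [(77 * rv + 150 * gv + 29 * bv) >> 8 for rv, gv, bv in zip(r, g, b)]


# Three pixels in AOS order: white, pure red, black.
aos = [255, 255, 255, 255, 0, 0, 0, 0, 0]
r, g, b = aos_to_soa(aos)
gray = grayscale_soa(r, g, b)
print(gray)
```

Once the data is in SOA form, each step of the grayscale loop is the same operation applied to neighbouring elements, which is what makes it a natural fit for vector registers. With AOS input and no de-interleaving load, that shuffling has to be done by hand with permutes.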


                • #48
                  Originally posted by Archprogrammer
                  I really hope the RISC-V vector extension shows the way to a better programming model than SIMD. Then we could skip AVX-512 and, in the long run, feature levels for CPUs.
                  It's been a while since I looked at it, but IIRC the currently proposed RISC-V vectorization model uses Cray vectors. Still SIMD-style execution under the hood, but with a vector width that's only known at runtime instead of being architecturally defined. Which is bad for the compiler's ability to optimize vector code (less compile-time information about what's going on), but good for binary compatibility (a binary built for generation N can automatically use the extra vector width of generation N+1 without recompilation).
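As a rough illustration of the runtime-width idea, here is a small Python simulation of a strip-mined loop. It is only a sketch: `hw_vector_len` stands in for whatever width the hardware reports at runtime (in the RISC-V vector extension the `vsetvli` instruction plays that role), and the function name `vla_add` is hypothetical.

```python
def vla_add(a, b, hw_vector_len):
    """Element-wise add, processed in chunks of at most hw_vector_len elements.

    The loop body never hard-codes a vector width. On every iteration it asks
    how many elements can be handled this time, which is why the same code
    works unchanged on narrow and wide implementations.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if hw_vector_len < 1:
        raise ValueError("vector length must be at least 1")

    out = []
    i = 0
    n = len(a)
    while i < n:
        # Stand-in for the "set vector length" step: take a full vector,
        # or whatever is left at the tail of the array.
        vl = min(hw_vector_len, n - i)
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out


a = list(range(10))
b = [1] * 10
narrow = vla_add(a, b, 4)   # a "generation N" machine with 4 lanes
wide = vla_add(a, b, 16)    # a "generation N+1" machine with 16 lanes
print(narrow == wide)
```

The result is the same for both widths, which is the binary-compatibility benefit. The cost is also visible: nothing in the loop can be unrolled or scheduled around a known width, because `vl` is only known while the loop runs.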

                  More generally, you can implement and expose SIMD in different ways. GPUs, for example, are essentially built on SIMD-style execution units, but a combination of clever hardware features and a well-thought-out programming model ensures that they are much easier to program than CPUs for vector processing tasks. It largely feels like scalar code to the programmer, but if you help the hardware a bit by coalescing your loads/stores and ensuring that your branches are convergent, it will run as fast as SIMD code on the CPU. Best of both worlds IMO, I wish CPU vectorization worked like that too.


                  • #49
                    SVE and RISC-V use the VLA (Vector Length Agnostic) paradigm. I have played a little with it using intrinsics (GCC and Armclang), and I agree with HadrienG: it is really a huge constraint for code optimization when your optimization (unrolling, for example) depends on the width of the registers (as in the FFT). Of course I have tested the autovectorization provided by both compilers; it is not really good, and IMO it will not improve dramatically.


                    • #50
                      Originally posted by curfew
                      Floating point hardware has limited precision and is unsuited "for doing accurate computations". This has nothing to do with printing anything on the screen.

                      Originally posted by curfew
                      In this form you cannot even multiply it by ten
                      There are libraries for arbitrary precision math.
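As one example of what such libraries look like, here is a short sketch using only Python's standard library (`fractions` and `decimal`). The 50-digit working precision is an arbitrary choice for illustration, not something taken from the thread.

```python
from decimal import Decimal, getcontext
from fractions import Fraction

# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so the sum is not exactly 0.3.
float_sum = 0.1 + 0.2
print("float 0.1 + 0.2 == 0.3:", float_sum == 0.3)

# Exact rational arithmetic has no such rounding.
exact_sum = Fraction(1, 10) + Fraction(2, 10)
print("Fraction sum == 3/10:  ", exact_sum == Fraction(3, 10))

# Decimal arithmetic with a user-chosen precision (assumed: 50 digits).
getcontext().prec = 50
pi_digits = Decimal("3.141592653589793238")
pi_times_ten = pi_digits * 10
print("pi * 10 as Decimal:    ", pi_times_ten)

# Python's built-in int is itself arbitrary precision, so the fixed-point
# value from the thread can be multiplied by ten without overflow.
pi_fixed_times_ten = 3141592653589793238 * 10
print("fixed-point pi * 10:   ", pi_fixed_times_ten)
```

The trade-off is speed: these types are implemented in software, so they are far slower than hardware floating point or fixed-width integer arithmetic.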