Linus Torvalds: "I Hope AVX512 Dies A Painful Death"


  • #51
    Does the RISC-V vector instruction set allow you to do cheap chaining, the way Cray vectors did? That seems like a really neat trick, though I'm sure automatically handling it in a compiler is horrid.

    Just to clarify, this is how I understand the difference between Cray-style variable-length vectors plus chaining and Intel-style fixed-width vectors. I may be wrong.
    On a Cray, you point a vector unit at a list of values. It iterates through it, and produces one output per unit of time. That output can be fed directly into another vector unit, and the output from that further into another unit, and so on until you run out of vector hardware or things you want to do. This can then be left to churn through all your data - or at least up to the max value of the vector length register. I don't know how many vector units you had to play with, though - I would guess "a few"?

    Intel-style, you set up a fixed-length chunk of data, and do one instruction on it. A while later, all the output appears in a (very long) register. You can then do another instruction on it - and you can also load up and start another copy of the first instruction while waiting for the results; I believe most AVX-512 CPUs have two execution units?

    edit: Right. I'd forgotten that Cray also used vector registers, though long ones - a Cray-1 had 64-element registers, a T-90 had 128. They all had 6-8 processing units (which is what limits how deep you can chain). The later ones also had dedicated "move data between memory and registers" hardware to smooth out the "load/store data" part.
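    A rough sketch of the two models in C, purely for illustration (assuming AVX-512F and, for simplicity, an array length that is a multiple of 16; the "Cray side" is only scalar pseudocode for the chained data flow, not real Cray code):

    #include <immintrin.h>
    #include <stddef.h>

    /* Intel-style: operate on whole 512-bit registers, 16 floats at a time.
       Each iteration loads fixed-width chunks, does one fused multiply-add and
       stores the result; independent iterations can overlap on the CPU's
       (typically two) AVX-512 FMA ports. */
    void fmadd_avx512(const float *a, const float *b, const float *c,
                      float *d, size_t n)   /* n assumed to be a multiple of 16 */
    {
        for (size_t i = 0; i < n; i += 16) {
            __m512 va = _mm512_loadu_ps(a + i);
            __m512 vb = _mm512_loadu_ps(b + i);
            __m512 vc = _mm512_loadu_ps(c + i);
            _mm512_storeu_ps(d + i, _mm512_fmadd_ps(va, vb, vc));
        }
    }

    /* Cray-style chaining, expressed as scalar pseudocode: the multiply unit
       streams one element per clock straight into the add unit, so the whole
       expression flows through the chained functional units instead of waiting
       for a full-width register result at each step. */
    void fmadd_chained_model(const float *a, const float *b, const float *c,
                             float *d, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            d[i] = a[i] * b[i] + c[i];   /* load -> multiply -> add -> store, chained */
    }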
    Last edited by dnebdal; 07-12-2020, 09:34 AM.

    Comment


    • #52
      Originally posted by dragorth View Post
      Games do care about AVX512, though, and 3D tools like Blender will care. Big-budget studios that buy Intel by the truckload will care, which is why Intel is going to do it. Also, cloud companies care, as a feature for their competitive advantage.
      I thought studios dig the Threadripper series for its sheer computing power?

      Comment


      • #53
        Originally posted by Virtus View Post
        TLDR: the cost of using AVX-512 is too high.

        Some remarks after using and testing AVX-512 on various codes and platforms for years:
        - compilers are not able to vectorize code efficiently. Yes, sometimes you can see part of your code vectorized, but it is always faster when you use intrinsics, and lots of code, even in HPC, is not efficiently vectorized; there it is (almost) all about scaling and optimizing inter-node communication/synchronization. So the cost of vectorizing code is really high, and the number of AVX-512 flavours does not help (and new ones keep coming!). And there are still no AOS-to-SOA loads like with NEON on ARM, which are really useful since most libraries (image processing, ...) use AOS layouts. Lots of people in the HPC community tend to believe that the Intel compiler can perform some kind of “black magic” autovectorization and that AVX-512 is a must-have (for its theoretical peak performance…).
        - for memory-bound codes (stencils, SEM, ...), the performance increase of AVX-512 over AVX2 is around 20%, which is not bad but far from the 2x speedup you might expect. And that was measured on a Skylake Gold, which has 2 AVX-512 ports (a fusion of the 2 AVX2 ports plus port 5) and runs at only 1.9 GHz when executing AVX-512 instructions; on the cheaper Silver parts, AVX-512 instructions are split across the 2 AVX2 ports, so you may not see any speedup at all.
        - AVX-512 units are basically 4 SSE units glued together (AVX units are 2), so permuting or shifting values across 128-bit lanes comes at a cost (permute2f128 instructions cost 3 cycles); you therefore cannot expect to scale from SSE to AVX-512 when your code requires moving values across lanes (like stencils).
        - I think one way to take advantage of such large units is to pair them with HBM memory, as the Fujitsu A64FX does, so that the wide units can actually be fed (though the A64FX SVE unit seems to be 4x128-bit if you look closely at the fma and sqrt latencies in the documentation).
        - I teach HPC programming, and the way students react the first time they see vectorized code written with intrinsics (I use an RGB-to-grayscale example; see the sketch after this list) just indicates that something is wrong with the API. In fact, they prefer the CUDA programming model, which is more readable and looks like scalar code, though it is possible to use ispc to get something similar for SIMD instruction sets.
        - regarding AVX-512 and games: just parse the code of the Unreal Engine… Spoiler: you will only see a few SSE intrinsics operating on AOS 3D vectors.
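        A minimal sketch of what such an RGB-to-grayscale kernel can look like with SSE intrinsics (assuming SSE2, planar/already-deinterleaved R, G, B inputs, fixed-point BT.601-style weights and a pixel count that is a multiple of 8 - not the exact code used in the course):

        #include <emmintrin.h>   /* SSE2 */
        #include <stddef.h>
        #include <stdint.h>

        /* gray = 0.299 R + 0.587 G + 0.114 B, approximated in fixed point as
           (77 R + 150 G + 29 B) >> 8.  The inputs are assumed planar (SOA);
           with the usual interleaved RGB (AOS) you would first have to
           deinterleave, which SSE gives you no cheap instruction for. */
        void rgb_to_gray_sse2(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                              uint8_t *gray, size_t n)
        {
            const __m128i wr = _mm_set1_epi16(77);
            const __m128i wg = _mm_set1_epi16(150);
            const __m128i wb = _mm_set1_epi16(29);
            const __m128i zero = _mm_setzero_si128();

            for (size_t i = 0; i < n; i += 8) {
                /* widen 8 bytes of each plane to 16-bit lanes */
                __m128i vr = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)(r + i)), zero);
                __m128i vg = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)(g + i)), zero);
                __m128i vb = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)(b + i)), zero);

                /* weighted sum, then shift back down to the 0..255 range */
                __m128i sum = _mm_add_epi16(_mm_add_epi16(_mm_mullo_epi16(vr, wr),
                                                          _mm_mullo_epi16(vg, wg)),
                                            _mm_mullo_epi16(vb, wb));
                __m128i y = _mm_srli_epi16(sum, 8);

                /* narrow 16-bit -> 8-bit and store 8 grayscale pixels */
                _mm_storel_epi64((__m128i *)(gray + i), _mm_packus_epi16(y, y));
            }
        }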
        Wouldn't game engines take advantage of at least basic AVX?

        Comment


        • #54
          Originally posted by sireangelus View Post

          Wouldn't game engines take advantage of at least basic AVX?
          Lots of code could take advantage of SSE/AVX/NEON/..., but are developers ready to learn all these instruction sets, spend the time to optimize their code, and then update it for each new one? If you disassemble some binaries you will see that the answer is almost always no. Note that to really take advantage of SIMD units you also need SOA layouts, and that is mostly not the case for games (a vec3 struct is exactly the kind of thing you have to avoid) or for image/signal processing (RGB, YUV, ...), so you end up rewriting a lot of code.
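          A minimal sketch of the layout difference, with made-up struct names:

          /* AOS: the "natural" game/graphics layout.  One vec3 per object means
             x, y and z are interleaved in memory, so a SIMD load grabs a mix of
             components and you have to shuffle before doing any useful math. */
          struct vec3_aos { float x, y, z; };
          struct particles_aos { struct vec3_aos pos[1024]; };

          /* SOA: each component in its own contiguous array.  A single vector
             load now fills a register with 4/8/16 x values, so the arithmetic
             maps directly onto SIMD lanes with no shuffling. */
          struct particles_soa {
              float x[1024];
              float y[1024];
              float z[1024];
          };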

          Comment


          • #55
            It would be better if he rallied for Posit support (https://www.nextplatform.com/2019/07...t-computation/ https://www.intel.com/content/www/us...-you-need.html https://www.nextplatform.com/2019/07...to-processors/), which Intel is considering, than ranted about AVX-512 support on processors that people want to pay extra for (because it saves them money).

            I do feel that math extensions have got out of hand (https://en.wikipedia.org/wiki/SIMD#See_also), and I object to needing just one more feature and having to pay an enormous additional expense to go 'up a color' - time for a Grand Unification.

            Comment


            • #56
              So, most games aren't using AVX512 because it is not currently on consumer hardware. If you look at the Steam hardware survey, you will see that AVX2 is still missing from roughly a quarter of Steam's user base. For purely monetary reasons, you don't want to lock 25% of your users out of playing your game. So most games are still written for SSE.

              Now, SSE isn't all that great an API; it was first to market and lacks many of the conveniences of more modern designs such as NEON on ARM. It is missing features that make programming easier and more powerful, such as the more efficient masking available in the newer AVX instructions. It's these conveniences that matter for games programming: they make more advanced techniques such as branchless paths much easier to code and more efficient for the CPU. The ability to do one masked operation across 8 or 16 data points means avoiding up to 8-16 mispredicted branches, each of which in the worst case could mean going all the way out to memory, stalling for far longer than the AVX-induced downclocking ever costs. You trade a possible peak-speed increase for reliability, which matters much more when trying to hold a set FPS.
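              A minimal sketch of that kind of branchless masking with AVX2 intrinsics (a simple clamp is used for illustration, and it assumes a length that is a multiple of 8):

              #include <immintrin.h>
              #include <stddef.h>

              /* Branchy scalar version: one potentially mispredicted branch per element. */
              void clamp_scalar(float *v, size_t n, float lo)
              {
                  for (size_t i = 0; i < n; i++)
                      if (v[i] < lo)
                          v[i] = lo;
              }

              /* Branchless AVX2 version: one compare produces a mask for 8 elements,
                 and a blend picks per-lane results, so no branch prediction is involved. */
              void clamp_avx2(float *v, size_t n, float lo)
              {
                  const __m256 vlo = _mm256_set1_ps(lo);
                  for (size_t i = 0; i < n; i += 8) {
                      __m256 x    = _mm256_loadu_ps(v + i);
                      __m256 mask = _mm256_cmp_ps(x, vlo, _CMP_LT_OQ);  /* lanes where x < lo */
                      __m256 res  = _mm256_blendv_ps(x, vlo, mask);     /* take lo where mask is set */
                      _mm256_storeu_ps(v + i, res);
                  }
              }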

              Gamers like Ryzen and so do developers, due to its performance. Ryzen provides many of the benefits of the AVX family thanks to its AVX2 support plus a few of the other extensions, and it has accelerated the adoption of AVX2 compared with how slowly earlier instruction sets were adopted.

              Comment


              • #57
                Originally posted by fguerraz View Post

                Well, I care so deeply that I will go out of my way to buy a laptop that doesn't have a discrete GPU. I see it as a power-sucking monster and a waste of money. All I do is web browsing and development, and for anybody who doesn't play games or need acceleration for their workload, an Intel GPU is perfect: it has all the power required to accelerate desktop composition and show the occasional 3D thing.
                He was probably referring to Intel's recent work on creating a discrete GPU, so erh... you might actually agree with the post you are arguing against.

                Comment


                • #58
                  Originally posted by sireangelus View Post

                  Wouldn't game engines take advantage of at least basic AVX?
                  Yes, AVX/AVX2 is heavily used in game engines.

                  The big problem with AVX512 is the power envelope. It consumes an enormous amount of power, so when the CPU has to light up that silicon, it has to reduce power everywhere else, which means reduced clock speeds. AVX was similar initially, but to a significantly lesser extent, and you could remove that drop in UEFI if you wanted to.

                  Then AVX512 is segmented into several subsets that a given CPU may or may not support. This adds to the number of code paths you need if you try to use it. It would have been much better as one complete set, so you could simply say "this processor is AVX512-level."
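                  A minimal sketch of what that segmentation means in practice with GCC or Clang; the kernel names and scalar stub bodies are purely illustrative:

                  #include <stddef.h>

                  /* Stubs standing in for real per-ISA kernels. */
                  static float sum_avx512(const float *a, size_t n) { float s = 0; for (size_t i = 0; i < n; i++) s += a[i]; return s; }
                  static float sum_avx2  (const float *a, size_t n) { float s = 0; for (size_t i = 0; i < n; i++) s += a[i]; return s; }
                  static float sum_sse2  (const float *a, size_t n) { float s = 0; for (size_t i = 0; i < n; i++) s += a[i]; return s; }

                  float sum_dispatch(const float *a, size_t n)
                  {
                      /* Each AVX-512 subset (F, VL, BW, DQ, VNNI, ...) is a separate CPUID
                         feature bit, so a real dispatcher has to test every subset it uses. */
                      if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512vl"))
                          return sum_avx512(a, n);
                      else if (__builtin_cpu_supports("avx2"))
                          return sum_avx2(a, n);
                      else
                          return sum_sse2(a, n);
                  }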

                  There's nothing wrong with 512-bit-wide instructions as a concept, or even with AVX512 itself, if it were bundled as one complete API and weren't so power-hungry. It's the implementation that's lackluster.

                  Comment


                  • #59
                    Originally posted by brent View Post
                    I think Linus does have a point. Just look at the AVX-512 specifications. It's a broad selection of different special-purpose instruction sets. Different CPUs support vastly different feature sets; it's a complete mess. It's pretty horrible from an application developer's POV. Even if you ignore all that, AVX-512 is just too wide and has too little benefit outside of a few special niche use-cases and benchmarks.
                    The problem is not that it's "too wide". The problem is the way it is implemented compared to AVX or AVX2. Generally, if you can split your data into 256-bit chunks for AVX2, you can probably split it into 512-bit chunks for AVX512 too (see the sketch after this list). The problem is the mess AVX512 introduced by:
                    - being supported by very few CPUs (so widespread usage is simply pointless),
                    - running AVX512 code at a significantly lower clock than the rest of the CPU, so if your code runs normal instructions, then AVX512 for a few cycles, then normal instructions again, you pay for the clock ramp-down and ramp-up,
                    - shipping some AVX-512 CPUs with only 1 AVX-512 unit, like this one: https://community.intel.com/t5/Softw...8/td-p/1135951 . So you have a CPU that supports AVX512 but would run the same work faster over AVX2.
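                    A minimal sketch of the chunk-width point: the loop structure is the same at either width, only the register type, the stride and the intrinsic prefix change (assuming AVX2 and AVX-512F respectively, and a length that is a multiple of the vector width):

                    #include <immintrin.h>
                    #include <stddef.h>

                    /* 256-bit chunks (AVX2): 8 floats per iteration. */
                    void scale_avx2(float *v, size_t n, float k)
                    {
                        const __m256 vk = _mm256_set1_ps(k);
                        for (size_t i = 0; i < n; i += 8)
                            _mm256_storeu_ps(v + i, _mm256_mul_ps(_mm256_loadu_ps(v + i), vk));
                    }

                    /* 512-bit chunks (AVX-512F): the same pattern, 16 floats per iteration. */
                    void scale_avx512(float *v, size_t n, float k)
                    {
                        const __m512 vk = _mm512_set1_ps(k);
                        for (size_t i = 0; i < n; i += 16)
                            _mm512_storeu_ps(v + i, _mm512_mul_ps(_mm512_loadu_ps(v + i), vk));
                    }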

                    Comment


                    • #60
                      Intel is a real virus. A cancer.

                      Comment
