
Linus Torvalds: "I Hope AVX512 Dies A Painful Death"


  • #21
    Well, just to fix some wrong assumptions here:

    AVX-512 is essentially AVX/AVX2 with wider registers (512-bit instead of 256-bit), just as AVX/AVX2 are essentially SSE4.2 with wider registers (256-bit instead of 128-bit) plus a few extra extensions here and there, and all of them handle both FP and integer operations.
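    To make the "same operations, wider registers" point concrete, here is a minimal sketch (my illustration, not from the thread) of the same element-wise float addition at the three widths using the standard x86 intrinsics; it assumes a compiler with SSE/AVX/AVX-512F enabled and a CPU that actually supports them:

    #include <immintrin.h>

    /* 4 floats per operation, 128-bit XMM registers (SSE) */
    void add_sse(const float *a, const float *b, float *out)
    {
        _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }

    /* 8 floats per operation, 256-bit YMM registers (AVX) */
    void add_avx(const float *a, const float *b, float *out)
    {
        _mm256_storeu_ps(out, _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b)));
    }

    /* 16 floats per operation, 512-bit ZMM registers (AVX-512F) */
    void add_avx512(const float *a, const float *b, float *out)
    {
        _mm512_storeu_ps(out, _mm512_add_ps(_mm512_loadu_ps(a), _mm512_loadu_ps(b)));
    }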

    The main problem here is Intel's actual implementation. Intel put full AVX-512 units in every core on their 14++++++++ process, but with that many wide registers the power envelope goes to hell, simply because that process is not efficient enough anymore, and since AMD put them through the wringer they also packed in an idiotic number of cores, so now those CPUs need a mini nuclear plant to run.

    Also, for some freaking reason Intel decided to split AVX-512 into something like 12 different subsets on the software side, which could turn into implementation hell once other x86 licensees implement them; even on Intel's side alone, the number of combinations you have to test for when checking AVX-512 extension support is already kind of sick. Check the list here: https://en.wikipedia.org/wiki/AVX-512.
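    As a rough illustration of that fragmentation (again my example, not from the thread), this is roughly what "checking for AVX-512" already looks like in user space with the GCC/Clang builtin; one baseline flag is not enough, each subset has to be probed separately:

    #include <stdio.h>

    int main(void)
    {
        /* each AVX-512 subset gets its own CPUID bit and its own probe */
        printf("avx512f   : %d\n", __builtin_cpu_supports("avx512f"));    /* foundation */
        printf("avx512vl  : %d\n", __builtin_cpu_supports("avx512vl"));   /* 128/256-bit forms */
        printf("avx512bw  : %d\n", __builtin_cpu_supports("avx512bw"));   /* byte/word ops */
        printf("avx512dq  : %d\n", __builtin_cpu_supports("avx512dq"));   /* dword/qword ops */
        printf("avx512vbmi: %d\n", __builtin_cpu_supports("avx512vbmi")); /* byte permutes */
        return 0;
    }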

    Also note, THIS IS IMPORTANT: the kernel itself barely uses any of this, because outside of a few modules (mostly crypto) the kernel doesn't do operations at a high enough level to need them in any way, shape or form; its job is to let user space switch contexts and use those registers (I'm guessing this is where Linus got pissed off with the code). BUT in user space they are immensely useful, because these extensions don't only handle arithmetic vector operations (which you can do on a GPU or CPU) but cache and memory operations as well (a huge performance win).
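    For what that kernel-side burden looks like, here is a hedged sketch of the pattern the kernel has to follow in the few places (mostly crypto) where it does touch SIMD registers; the helper name is made up, but kernel_fpu_begin()/kernel_fpu_end() and irq_fpu_usable() are the real x86 API, and the save/restore of the large AVX-512 state they imply is exactly the context-management work described above:

    #include <linux/types.h>
    #include <asm/fpu/api.h>

    /* hypothetical helper, simplified from what real crypto modules do */
    static void xor_blocks_simd(u8 *dst, const u8 *src, size_t len)
    {
        if (!irq_fpu_usable()) {
            /* SIMD not usable in this context: plain scalar fallback */
            while (len--)
                *dst++ ^= *src++;
            return;
        }

        kernel_fpu_begin();     /* save the interrupted task's FPU/AVX state */
        /* ... vectorized XOR over dst/src would go here ... */
        kernel_fpu_end();       /* restore it before returning to user space */
    }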

    Also, YOU CANNOT DO ALL VECTOR OPERATIONS ON A GPU, so stop spreading false assertions! Why? Because reaching the GPU is horribly slow and a latency nightmare, so you only offload to the GPU when the dataset is massive (gigabytes massive, not a couple thousand multiplications; iGPUs are even worse because their bandwidth is even more limited) and you don't care about latency. For smaller jobs of a few thousand operations, or when latency matters, you use SIMD (SSE/AVX).
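    To put a rough, assumed order-of-magnitude number on that threshold: a single PCIe round trip to a discrete GPU costs on the order of microseconds before any work is done, while a few thousand multiply-adds with CPU SIMD finish in well under a microsecond straight out of cache. A minimal sketch of that kind of small job (mine, for illustration):

    #include <immintrin.h>

    /* 4096-element dot product with AVX2 FMA; at this size the whole thing
     * fits in cache and completes faster than a GPU offload could even start */
    float dot4096(const float *a, const float *b)
    {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < 4096; i += 8)
            acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                                  _mm256_loadu_ps(b + i), acc);

        /* horizontal sum of the 8 partial sums */
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                              _mm256_extractf128_ps(acc, 1));
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        return _mm_cvtss_f32(s);
    }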

    @sophisticles Intel iGPUs have dedicated silicon for video encoding (Intel Quick Sync), and it is very well supported in video editors, BUT it is not usable for pro video; that is where the CPU/CUDA encoders shine. Hence it is not relevant in this case, since neither uses SIMD or GPGPU. But yes, for non-pro video it is quite good, so enjoy.



    • #22
      Originally posted by sophisticles View Post
      This rant right here tells you everything you need to know about why Linux is an also-ran on the desktop. Intel concentrates on the HPC market because that's where the big bucks are; just ask NVIDIA.

      And AVX-512 is a godsend for HPC workloads; in some cases using AVX-512 is vastly faster than using GPU acceleration.

      The reality is that Intel doesn't care about the desktop market, that's not their bread and butter. People buying a $200-$300 CPU are not going to make or break Intel; customers that buy $10,000 CPUs by the hundreds to thousands to build huge supercomputers for AI, simulations, inference, etc. are what make or break Intel.

      Lastly, floating point is very important in many applications, like scientific and video work; if you want the highest quality you don't use integer, you use floating point. The only reason to use int is that it's much faster, due to the lower precision of the calculations.

      It's really a shame to see a guy like this jackass making statements like this one.
      I find it confusing that Linus states that HPC is a pointless special case, while at the same time nVidia has grown HPC and datacenter contracts from 1-2% of its total annual revenue 5 years ago to almost 30% last year (and growing at an 80% rate!). I don't know the figures for Intel, but it's undeniable that more and more silicon is going into number crunching in some way or another. Wider FP registers don't benefit all HPC/datacenter use cases, but in a considerable number of cases they do.

      Reference:
      https://www.hpcwire.com/2020/05/22/n...enue-jumps-80/



      • #23
        Originally posted by sabian2008 View Post

        I find it confusing that Linus states that HPC is a pointless special case, while at the same time nVidia has grown HPC and datacenter contracts from 1-2% of its total annual revenue 5 years ago to almost 30% last year (and growing at an 80% rate!). I don't know the figures for Intel, but it's undeniable that more and more silicon is going into number crunching in some way or another. Wider FP registers don't benefit all HPC/datacenter use cases, but in a considerable number of cases they do.

        Reference:
        https://www.hpcwire.com/2020/05/22/n...enue-jumps-80/
        Because the kernel doesn't do HPC, user space does; so from his POV it is a special case.

        Remember, the kernel's job is to enable access to those hardware features and user space's job is to use them. So inside the kernel they just handle context switching and exposing those ZMM registers, but never actually use them (outside of a few cache/memory-related cases), and this work can be horribly complicated DEPENDING ON THE HARDWARE IMPLEMENTATION IN SILICON <-- here is where I guess Linus went nuclear.



        • #24
          Originally posted by zxy_thf View Post
          2. Fragmented product line: by far not all chips on sale have AVX2, let alone AVX-512.
          they don't even all have AVX.

          Not all CPUs from the listed families support AVX. Generally, CPUs sold under the "Core i3/i5/i7" branding support it, whereas "Pentium" and "Celeron" CPUs don't.
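          Which is why portable code can't even assume plain AVX and has to dispatch at runtime. A minimal sketch (my illustration, using the GCC/Clang builtin; raw CPUID works just as well):

          #include <stdio.h>

          int main(void)
          {
              if (__builtin_cpu_supports("avx"))
                  puts("AVX available: take the vectorized code path");
              else
                  puts("no AVX (e.g. many Pentium/Celeron parts): take the scalar fallback");
              return 0;
          }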



          • #25
            AVX-512 is horrible. The primary reason is how fragmented Intel makes its product stack. To get AVX-512 you need either a modern Xeon or one of their HEDT chips. The new 10900K? Nope, doesn't have it. It is essentially impossible to put together a system for testing the utility of AVX-512 without putting down quite a chunk of change. The clock-speed effect is another headache to think about. Worse is the fact that there are multiple variations of AVX-512 - not all AVX-512-capable CPUs are actually capable of all the functions - which is a ridiculous state of affairs. Do you have a Skylake EP? Skylake X? Cannon Lake? Knights Mill? Ice Lake? Tiger Lake?

            In another thread I joked about the alphabet soup that Intel's CPU product suffixes have become... but AVX-512 is also a random grab-bag of confusion - and it is "one" set of instruction set extensions!

            Originally posted by sophisticles View Post

            Not to mention that sometimes an iGPU is faster than a dGPU. For instance, I have a number of PCs; the fastest is an R5 1600 with 16 GB DDR4 and a GTX 1050, and the slowest is an i3 7100 with 16 GB DDR4 and no dGPU (it uses the iGPU).

            I do a lot of video editing and routinely need to render out a file that has a bunch of filters applied (I use Shotcut). If I use the first PC with a 50-minute source, it takes over 9 hours to finish the encode if I'm using software filters. If I enable GPU filters, the R5 + 1050 combo cuts that time down to 5.5 to 6 hours. If I do the same encode on the i3 and use GPU filters with the iGPU, the time is down to just over 3 hours.

            This is repeatable with other test files. As near as I can tell, the iGPU cuts the time so much because it doesn't suffer from the memory-copy performance penalties (from system RAM to GPU RAM) that the other system has to pay.

            I'm looking forward to Rocket Lake, that Gen 12 Xe iGPU should be awesome for the work I do.
            While jrch2k8 has already responded, I'd like to point out that QuickSync - the Intel-proprietary encoding block that lets Intel CPUs with an iGPU encode video so fast (compared to AMD CPUs or AMD/nVidia GPUs) - produces, bitrate for bitrate, lower-quality output than, say, x264 or x265 on the CPU.

            It's been a long time since I tested this myself, but I ended up eating the increased encode time of using x264/x265 on the CPU, as the resulting output was light-years ahead of the QuickSync or NVENC output in quality. It depended heavily on the input stream: modern Western animation, which tends toward simpler styles and solid blocks of colour, did comparably well with all encoders, while anime and older Western animation (classic Tom & Jerry, with the ultra-detailed hand-painted backgrounds) looked terrible with QuickSync.

            From what I've seen of on-the-fly encoding with nVidia GPUs for streaming (for example), nVidia have dramatically improved their encoder. Intel may have done the same... as I said, it's been a long time since I checked.

            But I'd definitely recommend doing a bitrate-matched encode and closely examining the output for artefacts.



            • #26
              Originally posted by bearoso View Post
              Games are definitely a bad use case for AVX-512, probably the worst thing to use it with. The reduced clock speed would have a huge impact: the time it takes to ramp back up is more than a few frames, so if AVX-512 is touched even once per frame the processor would effectively never run at its normal clock speed.

              Blender is arguable. The only benefit would be during rendering on the CPU, but it’s already better to use a GPU there.
              Blender falls perfectly into the "more cores beat wider cores" category, because path-tracing performance depends heavily on traversing a BVH (or some other acceleration structure).
              This is not something AVX-512 can help with, unless you employ complicated techniques like https://dl.acm.org/doi/abs/10.1145/2504459.2504515
              Even so, spending the limited silicon area on additional cores is more beneficial than making each core wider and hoping software can squeeze more out of a limited number of cores.
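              To show why that traversal resists wide SIMD, here is a hedged sketch of a scalar, stack-based BVH walk (the types and field names are made up for illustration, not Blender's actual structures); each step is a data-dependent branch and a pointer chase, which is exactly what 512-bit vectors don't speed up:

              #include <stddef.h>

              struct bvh_node {
                  float bmin[3], bmax[3];   /* axis-aligned bounding box */
                  int   left, right;        /* child node indices, -1 if absent */
                  int   first_prim, nprims; /* leaf payload; nprims == 0 for inner nodes */
              };

              /* slab test: does the ray (origin o, inverse direction inv_d) hit the box? */
              static int hit_box(const struct bvh_node *n, const float o[3], const float inv_d[3])
              {
                  float tmin = 0.0f, tmax = 1e30f;
                  for (int i = 0; i < 3; i++) {
                      float t0 = (n->bmin[i] - o[i]) * inv_d[i];
                      float t1 = (n->bmax[i] - o[i]) * inv_d[i];
                      if (t0 > t1) { float tmp = t0; t0 = t1; t1 = tmp; }
                      if (t0 > tmin) tmin = t0;
                      if (t1 < tmax) tmax = t1;
                  }
                  return tmin <= tmax;
              }

              void traverse(const struct bvh_node *nodes, const float o[3], const float inv_d[3])
              {
                  int stack[64], sp = 0;
                  stack[sp++] = 0;                  /* start at the root node */
                  while (sp) {
                      const struct bvh_node *n = &nodes[stack[--sp]];
                      if (!hit_box(n, o, inv_d))    /* unpredictable, per-node branch */
                          continue;
                      if (n->nprims) {
                          /* intersect the n->nprims primitives at n->first_prim */
                      } else {
                          if (n->left  >= 0) stack[sp++] = n->left;
                          if (n->right >= 0) stack[sp++] = n->right;
                      }
                  }
              }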



              • #27
                Originally posted by zxy_thf View Post
                2. Fragmented product line: by far not all chips on sale have AVX2, let alone AVX-512.
                And to make matters worse, you are implicitly treating AVX-512 here as one unified ISA extension like AVX2. It is not. There is a myriad of "AVX-512" extensions out there...



                • #28
                  Looks like a special series of HPC CPUs should get this kind of instruction and be used for compute work where latency doesn't matter. A special kernel variant could be maintained for them and released at a different pace from the desktop Linux kernel. I don't need it.



                  • #29
                    Originally posted by jrch2k8 View Post

                    Also, YOU CANNOT DO ALL VECTOR OPERATIONS ON A GPU, so stop spreading false assertions! Why? Because reaching the GPU is horribly slow and a latency nightmare, so you only offload to the GPU when the dataset is massive (gigabytes massive, not a couple thousand multiplications; iGPUs are even worse because their bandwidth is even more limited) and you don't care about latency. For smaller jobs of a few thousand operations, or when latency matters, you use SIMD (SSE/AVX).
                    You can do all vector operations on a GPU. Actually, all of them, and all integer operations too. The GPU is not somehow magically constrained to be unable to calculate certain things. Yes, it might be slower than the CPU, but that is a different thing from saying "you cannot do all of them".

                    What you wanted to say is that the GPU is better suited to work that can be arranged in large enough autonomous batches, unlike general jumpy/branchy code, which will naturally suffer from the PCIe latency. That is correct and has been known forever; you are not telling us anything new. AMD's Fusion/HSA was supposed to fix that, but your favourite company Intel didn't follow and tried its best to destroy the endeavour (as did Nvidia); AMD didn't have the market clout to push it through, and so it failed.

                    I loved HSA when I first read about it. It seemed AMD had a real revolutionary thing on their hands. Imagine if AMD had been able to add some cheap RAM on the APU package to eliminate bandwidth bottlenecks, raised APUs' typical TDPs to 200W or even more, and released mainstream mobos that, instead of moar PCIe slots, just used those PCIe slots for one or more extra APUs. If the APU idea had been commercially successful and seen developer support, we would have cheap on-package RAM and cheap dual-socket mobos and the like simply due to economies of scale. They might look expensive to do right now, but that is because they are a different paradigm.

                    This would also have allowed the CPU side of the equation to get rid of all the crappy SIMD instructions no one really needs, which take a huge amount of die area and thermal budget for no reason and are ineffective relative to the GPU. AMD's Fusion idea would have been awesome for computing, such an efficient use of the transistor budget. Sadly, it couldn't happen, because Intel was only competitive in CPUs and Nvidia was only competitive in GPUs, and both were well established on their own turf. Poor AMD didn't stand a chance. Even the Bulldozer architecture was called a failure, despite being a clear step in the direction of Fusion. It did everything Linus wrote he wanted in his post: it traded FP budget for more integer budget. But Bulldozer was not really a failure, it was just way ahead of its time. I have a feeling we are going to see Bulldozer-like architectures again at some point, even from Intel.



                    • #30
                      hear, hear!

