Linus Torvalds: "I Hope AVX512 Dies A Painful Death"


  • #91
    I think the real problems of avx-512 are:

    a) heavy fragmentation of the avx-512 instruction subsets, which makes the whole thing ridiculous. Instruction sets like sse, avx and avx2 are already fragmented enough, and then you get ten flavors of avx-512 on top of that. WTF.
    b) the downclocking issue that makes the rest of the cpu crawl, so mixing avx-512 instructions into other code, or having one server instance use avx-512, creates problems for everything else running on the chip. I suspect this was a consequence of being stuck on 14nm.
    c) the various workarounds needed to make avx perform correctly... like issuing vzeroupper in the code, or having the OS track whether the upper (z/y) registers were used so it can handle state differently and avoid slowdowns. These arise out of intel "hacks" and then end up as workarounds in OS code (a minimal illustration of the vzeroupper case follows this list).
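
    To make c) concrete, here is a minimal sketch (mine, not from the post) of the classic AVX-to-SSE transition penalty that vzeroupper works around; compile with something like gcc -O2 -mavx:

      /* Mixing 256-bit AVX code with legacy 128-bit SSE code: on many
         Intel cores, leaving the upper halves of the ymm registers
         "dirty" makes the first SSE instruction afterwards pay a
         heavy state-transition penalty. */
      #include <immintrin.h>
      #include <stdio.h>

      /* Imagine this is legacy SSE code in some library we call into. */
      static float sse_sum4(const float *v) {
          __m128 s = _mm_loadu_ps(v);          /* 128-bit SSE load */
          float out[4];
          _mm_storeu_ps(out, s);
          return out[0] + out[1] + out[2] + out[3];
      }

      int main(void) {
          float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
          __m256 wide = _mm256_loadu_ps(data); /* dirties upper ymm bits */
          wide = _mm256_add_ps(wide, wide);
          _mm256_storeu_ps(data, wide);

          _mm256_zeroupper();  /* the workaround: clear the upper halves
                                  before any legacy SSE code runs */

          printf("%f\n", sse_sum4(data));
          return 0;
      }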


    In theory avx512 should be much more power efficient than avx2. What matters is not whether a cpu that used to burn 100w now draws 140w, or whether it has to be underclocked, but how many results you get out of a given number of watts. If, for example, avx512 doubles the throughput of avx2, that's a 100% gain; if consumption goes from 100w to 140w, that's still a fantastic perf/watt improvement. And if you underclock to, say, 70% of the speed and still get +70% throughput at roughly avx2-level watts, that's good too. Linus's rationale that "avx2 should be enough" cannot and does not increase perf/watt. Redditors say "ohhh but it burns so much" -- yeah, but it also puts out 60-80% more throughput in terms of results. They are only looking at watts, not perf/watt. Watts can be dialed downwards, and then you keep the throughput gains over narrower vectors (avx2) at similar watts.
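
    To put numbers on that, a back-of-the-envelope sketch using only the hypothetical figures from the paragraph above (nothing here is measured):

      /* Perf/watt comparison with the post's hypothetical numbers. */
      #include <stdio.h>

      int main(void) {
          double avx2_tput    = 1.0;    /* normalized results/second  */
          double avx2_watts   = 100.0;
          double avx512_tput  = 2.0;    /* "doubles the throughput"   */
          double avx512_watts = 140.0;  /* "100w now does 140w"       */

          double pw2   = avx2_tput / avx2_watts;
          double pw512 = avx512_tput / avx512_watts;

          printf("avx2:   %.4f results/watt\n", pw2);
          printf("avx512: %.4f results/watt (+%.0f%%)\n",
                 pw512, (pw512 / pw2 - 1.0) * 100.0);  /* about +43% */
          return 0;
      }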

    Personally I would like wide avx2 and avx512 adoption in chips, because without it software lags behind the speed enhancements the cpu can actually deliver. But I would also like the chips to be less "hacked": no splitting of operations across 128-bit lanes, no vzeroupper, no odd slowdowns from merely touching zmm registers, and so on. The moment intel gives the instruction set to the people, it should be good to go, with no hacks needed in application or OS code to make it play well. The current mess creates needless irritation for everyone. And please, no 500 flavors of avx512 -- that is a fricking nightmare for coders to support.

    In terms of chips I would also like to see larger L1 instruction caches, because the opcodes for these instructions keep getting longer while L1 instruction cache sizes have barely changed in 15 years. Sure, we have bigger μop caches, but it's not the same.



    • #92
      Originally posted by _Alex_ View Post
      I think the real problems of avx-512 are ...
      In theory avx512 should be much more power efficient than avx2. What matters is not whether a cpu that used to burn 100w now draws 140w ...
      Well, in fact it does matter, because your 100W-TDP cpu will burn.
      To keep it from burning, you have to slow it down as far as you can just to sustain the simd operations (arm had plenty of trouble with simd power consumption and TDP limits too, to the point that the kernel leaves simd disabled by default, enabling it only when a simd instruction is actually needed and switching it off again afterwards... crazy).

      Now you start questioning your hardware: you bought the CPU to be performant, yet it cannot run at high frequency (because it would burn), so you end up with a nice simd "coprocessor" or "compute module" attached to a terribly performing CPU, all of it consuming the maximum power the system can sustain.

      Users will then make a choice: they will drop the "coprocessor" part and buy a CPU that is genuinely fast at all general-purpose operations, draws less power and is more responsive.
      If they want an "accelerator", "compute module" or "simd coprocessor", they will buy one separately.



      • #93
        Let me put it another way.

        If you spend 100w on either AVX2 or underclocked AVX-512, you will get far more results back from AVX-512.

        The reason is that power consumption falls non-linearly as you reduce frequency and voltage. AVX-512 at 50% clock can theoretically match the throughput of AVX2 at 100% clock, but at 50% clock you can drop well below 50% power consumption, because the lower frequency also permits a lower voltage -- drastically increasing perf/watt.
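
        As a back-of-the-envelope sketch of that argument, assume the classic dynamic-power model P = C * V^2 * f with illustrative numbers (none of them come from the post):

          /* Dynamic power scales roughly as P = C * V^2 * f.  Halving
             the clock and dropping the voltage shows why wider vectors
             at lower clocks can win on perf/watt. */
          #include <stdio.h>

          static double power(double c, double volts, double freq) {
              return c * volts * volts * freq;
          }

          int main(void) {
              double c = 1.0;  /* normalized switched capacitance */

              /* AVX2 at full clock and voltage: throughput 1.0. */
              double p_avx2 = power(c, 1.00, 1.0);

              /* AVX-512 at half clock: twice the width at half the rate
                 gives the same throughput 1.0, but the lower clock
                 allows a lower voltage, here assumed to be 0.8. */
              double p_avx512 = power(c, 0.80, 0.5);

              printf("AVX2   : power %.2f -> perf/watt %.2f\n",
                     p_avx2, 1.0 / p_avx2);      /* 1.00 -> 1.00 */
              printf("AVX-512: power %.2f -> perf/watt %.2f\n",
                     p_avx512, 1.0 / p_avx512);  /* 0.32 -> 3.13 */
              return 0;
          }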

        Perf/watt is what counts, not TDP.

        MMX -> SSE -> SSE2/4 -> AVX2 -> AVX512 keep going wider precisely because width increases perf/watt. Otherwise we would never have had vector instructions, let alone ever-widening ones.



        • #94
          Originally posted by sdack View Post
          Torvalds speaking his mind, and somebody actually took the time to write it down and make news of it.

          On a scale from 1.0 to 10.0, with 10.0 being hilarious, I rate his rant a 2.1.

          Speaking of 512-bit vector units ... Fujitsu's supercomputer Fugaku, with its 160,000 Arm A64FX CPUs, sure benefits from its 512-bit vector units. I wouldn't mind having PCs with those CPUs inside instead of Intel or AMD ones.
          They're 48-core SoCs with 32GB of HBM2 memory on each SoC. That's not impressive. Give El Capitan that many nodes of custom 48-core Zen 4 with a shared 32GB HBM2 memory space, plus their RDNA 2.x CUs, and Fugaku won't come near it.
          Last edited by Marc Driftmeyer; 12 July 2020, 10:40 PM.



          • #95
            Originally posted by tuxd3v View Post

            Now you start questioning your hardware: you bought the CPU to be performant, yet it cannot run at high frequency (because it would burn), so you end up with a nice simd "coprocessor" or "compute module" attached to a terribly performing CPU, all of it consuming the maximum power the system can sustain.

            Users will then make a choice: they will drop the "coprocessor" part and buy a CPU that is genuinely fast at all general-purpose operations, draws less power and is more responsive.
            If they want an "accelerator", "compute module" or "simd coprocessor", they will buy one separately.
            That is the ideal case, but real problems are more complicated...
            For example, most matrix factorization algorithms are still single-threaded. Not because researchers and developers don't want parallelism, but because the factorization itself is hard to parallelize.
            In such cases wider SIMD really does make the code faster -- well, given AVX-512's current state, it's safer to add a "properly implemented" prefix. (A sketch of the kind of inner loop that benefits follows below.)
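
            As an illustration (mine, not the poster's): the hot inner loop of many factorizations reduces to dot products and rank-1 updates, which vectorize naturally. A minimal AVX-512F dot-product sketch, compiled with something like gcc -O2 -mavx512f:

              /* One AVX-512 instruction processes 8 doubles, which is
                 where the single-thread speedup comes from. */
              #include <immintrin.h>
              #include <stddef.h>

              double dot_avx512(const double *a, const double *b, size_t n) {
                  __m512d acc = _mm512_setzero_pd();
                  size_t i = 0;
                  for (; i + 8 <= n; i += 8) {
                      __m512d va = _mm512_loadu_pd(a + i);
                      __m512d vb = _mm512_loadu_pd(b + i);
                      acc = _mm512_fmadd_pd(va, vb, acc); /* acc += a*b */
                  }
                  double sum = _mm512_reduce_add_pd(acc); /* sum 8 lanes */
                  for (; i < n; i++)                      /* scalar tail */
                      sum += a[i] * b[i];
                  return sum;
              }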

            HSA with cache-coherent accelerators is a reasonable alternative, but it likely requires significant architectural changes (e.g. what should the CPU do while it waits a mere 1000 cycles for the accelerator's output?).
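
            To make that dilemma concrete, a conceptual sketch (the accelerator and its completion flag are hypothetical): a 1000-cycle wait is too short to justify a context switch, yet too long to stall a wide out-of-order core.

              #include <stdatomic.h>

              typedef struct {
                  _Atomic int done;  /* set coherently by the accelerator */
                  double result;
              } accel_job;

              double wait_for_accel(accel_job *job) {
                  /* Simplest answer: spin.  This burns the big core for
                     the whole wait; the alternatives (SMT, switching to
                     independent work) all complicate the architecture. */
                  while (!atomic_load_explicit(&job->done,
                                               memory_order_acquire))
                      ;
                  return job->result;
              }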



            • #96
              I like PCs; I've tinkered with them all my life. That being said, I wonder what advantages Torvalds, and the rest of us, would gain from Intel focusing on non-HPC uses of x86. Call me drastic, but I think x86 is moving away from the mainstream:
              1· Android mostly uses ARM, and Google is going to switch to Fuchsia anyway
              2· the consumer PC market is dying, apart from gaming and office use...
              3· ... and in any case the trend may be a switch to ARM (Apple, and maybe Chromebooks and their clones)...
              4· ... also, software is moving to web services, so gaming and office PCs may disappear
              5· in the end (in 10 years?) the only x86 markets where the money goes may be HPC and servers...
              6· ... and servers may find AVX512 helpful for cryptography



              • #97
                Originally posted by curfew View Post
                Floating point numbers are critical for doing accurate computations.


                Take ten minutes. Read that. Then we'll talk.
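
                The point being gestured at is presumably the classic caveat that binary floating point rounds rather than computing exactly. A short C demonstration (my illustration, not the poster's):

                  #include <stdio.h>

                  int main(void) {
                      double a = 0.1 + 0.2;  /* 0.1 and 0.2 have no exact
                                                binary representation */
                      printf("%.17f\n", a);  /* 0.30000000000000004 */
                      printf("%s\n", a == 0.3 ? "equal" : "not equal");
                      return 0;  /* prints "not equal" */
                  }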



                • #98
                  Originally posted by AJenbo View Post

                  He was probably referring to Intel's recent work on creating a discrete GPU, so erh... you might actually agree with the post you are arguing against.
                  Ha! Well, in that case I'm neutral :-)



                  • #99
                    Originally posted by dxin View Post
                    The same goes for Intel GPUs. Why waste transistors on something nobody cares about?
                    Nooooo! Don't take away my Intel GPUs!



                    • #100
                      Originally posted by Marc Driftmeyer View Post
                      They're 48-core SoCs with 32GB of HBM2 memory on each SoC. That's not impressive. Give El Capitan that many nodes of custom 48-core Zen 4 with a shared 32GB HBM2 memory space, plus their RDNA 2.x CUs, and Fugaku won't come near it.
                      Nonsense. The HBM2 memory is certainly not on the same die. Only the memory controller is. The A64FX uses 4 channels of HBM2 to move 1TB/sec of data.

                      Not even AMD's EPYC with 8 channels of DDR4 gets close to this. It only does about 145GB/sec.
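
                      For the arithmetic behind those figures (my sketch; 4 stacks at 256 GB/s each is the commonly quoted A64FX spec, and the DDR4 number is a theoretical peak for 8 channels of DDR4-3200, which is why measured figures like the ~145GB/sec above come in lower):

                        #include <stdio.h>

                        int main(void) {
                            double hbm2 = 4 * 256.0; /* 4 stacks x 256 GB/s */
                            double ddr4 = 8 * 25.6;  /* 8 ch x 25.6 GB/s    */
                            printf("A64FX HBM2 peak: %.0f GB/s\n", hbm2);
                            printf("EPYC DDR4 peak : %.1f GB/s (~%.0fx less)\n",
                                   ddr4, hbm2 / ddr4);
                            return 0;
                        }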

                      Then look at the transistor count: AMD's EPYC has about 40B transistors while Fujitsu's A64FX has only about 10B. That's a quarter!

                      And the moment you start tearing Zen apart to make it look like the A64FX, you might as well just take the Armv8 cores out of Fujitsu's design and swap in any x86 core.

                      The point should then become clear to you: whether it's x86 or Arm at the core doesn't matter, because we now have Linux. x86 is just a long-dead CISC design, now with RISC under the hood, kept alive by Microsoft's quasi-monopoly binding the entire IT industry to x86 for decades, while every other CISC design died a long time ago and for very good reasons.

                      What matters in a CPU is how it interconnects. x86 CPUs are tumours of overgrown, monstrous caches, because DDR4 cannot keep up. Everything about an x86 CPU these days caters to this tumour. We're going to see level-4 caches at some point if this doesn't change radically.

                      So keep dreaming of your Zen 4, CISC with RISC inside and shitloads of caches to work around creepy-slow DDR4; to make me happy, you would need to give me a pure RISC design without that cancer.
                      Last edited by sdack; 13 July 2020, 09:08 AM.

