Linus Torvalds: "I Hope AVX512 Dies A Painful Death"


  • #31
    Originally posted by jrch2k8:

    Because the kernel doesn't do HPC, user space does, so from his POV it is a special case.

    Remember, the kernel's job is to enable access to those hardware features, and user space's job is to use them. Inside the kernel they just handle context switching and exposure of those ZMM registers but never actually use them (outside a few paths that are CACHE/MEMORY related), and this work can be horribly complicated DEPENDING ON THE HARDWARE IMPLEMENTATION IN SILICON <-- here is where I guess Linus went nuclear.
    Of course I get that part. I was referring to this idea
    Originally posted by Linus
    The same is largely true of AVX512 now - and in the future. Yes, you can find things that care. No, those things don't sell machines in the big picture.
    Even though Linus probably knows more about the CPU market than me, I think Intel and AMD might know a little bit more than Linus in this regard. Every CPU maker is investing in wide FP registers, some more aggressively, some less, but to say that those things are irrelevant is counterfactual.

    He might have a point about AVX-512 being a rushed technology that is a PITA to maintain, and about all the mess behind the different implementations. However, the arguments he put forward are a little bit lacking. Not surprised, though; he must have been really biting his tongue not to get more verbose on the Rust topic.



    • #32
      Originally posted by TemplarGR:

      You can do all vector operations on a GPU. Actually, all of them. All of the integer operations too. The GPU is not somehow magically constrained so that it cannot calculate some stuff. Yes, it might be slower than the CPU, but that is a different thing to say than "you cannot do it all".

      What you wanted to say is that the GPU is better suited to work that can be arranged into large enough autonomous batches, unlike general jumpy/branchy code, which will naturally suffer from the PCIe latency. That is correct and has been known since forever; you are not telling us something new. AMD's Fusion/HSA was supposed to fix that, but your favourite company Intel didn't follow and tried their best to destroy the endeavor (as did Nvidia); AMD didn't have the market clout to push it through, and so it failed.

      I loved HSA when I first read about it. It seemed AMD had a real revolutionary thing on their hands. Imagine if AMD had been able to add some cheap RAM on the APU to eliminate bandwidth bottlenecks, enlarge APUs' typical TDPs to 200W or even more, and release mainstream mobos that, instead of more PCIe slots, used those slots for a second APU or more. If the APU idea had been commercially successful and seen developer support, we would have seen cheap on-package RAM and cheap dual-socket mobos and such, simply due to economies of scale. They might look expensive to do right now, but that is because they are a different paradigm.

      This would also have allowed the CPU side of the equation to get rid of all the crappy SIMD instructions no one really needs, which take a huge amount of die area and thermals for no reason and are ineffective relative to the GPU. AMD's Fusion idea would have been awesome for computing, such an efficient use of the transistor budget. Sadly, it couldn't happen, because Intel was only competitive in CPUs and Nvidia was only competitive in GPUs, and both were pretty much established in their respective areas. Poor AMD didn't stand a chance. Even the Bulldozer architecture was called a failure, despite being a clear advancement in the direction of Fusion. It did everything Linus wrote he wanted in his post: it traded FP budget for more integer budget. Bulldozer was not a failure, it was just way ahead of its time. I have a feeling we are going to see Bulldozer-like architectures at some point in the future, even from Intel.
      1.) Ok, fair enough; change "cannot" to "should not".

      2.) HSA was meant to solve 50% of that problem, but yeah, I agree it was really nice.

      3.) HSA was never meant to eliminate SIMD (at best maybe offload some of the math operations), because unless the cache lines and FP/ALU units can be used by both CPU and GPU, latency will murder performance for exactly the things you need SIMD for. If HSA had been successful, maybe we could have seen something along those lines later on, though.

      Remember, SIMD has two sides. One is comparable to the kind of operations you can do on a GPU, like:
      • matrix calculations, IDCT, etc., i.e. huge math problems that can be massively parallelized; SIMD and GPGPU are great here, and HSA would have improved things enormously here.
      • AI (in some form a variant of the previous, though)
      • etc.
      The other side, used everywhere all the time whether you realize it or not because the compiler hides it (check an assembly dump of any binary compiled with -O2 or higher and search for x(y)mm registers), is cache/logic/arithmetic operations on several variables in one cycle (in theory, though; some may take more depending on the CPU). It provides huge speedups and is one of the backbones of modern IPC gains, and here latency is crucial: this is where you spend a lot of time fine-tuning code to hit L1 as much as possible, because even RAM is dog slow at this level.
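
      As a minimal illustration (the function is made up for this example, not from the thread; compile and grep the assembly for xmm/ymm to see the compiler's hidden SIMD at work):

      /* saxpy-style loop: gcc/clang at -O2/-O3 will typically
         auto-vectorize this into SSE/AVX (xmm/ymm) instructions.
         Inspect with: gcc -O3 -S saxpy.c && grep -E 'xmm|ymm' saxpy.s */
      void saxpy(float *restrict y, const float *restrict x, float a, int n)
      {
          for (int i = 0; i < n; i++)
              y[i] += a * x[i];
      }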

      4.) "Get rid of all the crappy SIMD instructions no one really needs" <-- I don't think you can find more than 3 or 4 of those, unless you hate performance and wanna go back to the x87 days.

      The AVX-512 problem is not a matter of SIMD vs. GPU, or of SIMD being good or bad (it is definitely not bad); it is about the horrible way it got implemented this time around (software- and hardware-wise).



      • #33
        I think Linus does have a point. Just look at the AVX-512 specifications: it's a broad selection of different special-purpose instruction sets, and different CPUs support vastly different feature sets. It's a complete mess, and it's pretty horrible from an application developer's POV. Even if you ignore all that, AVX-512 is just too wide and has too little benefit in most cases outside of a few special niche use-cases and benchmarks.
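
        A small sketch of what that fragmentation looks like in practice, using GCC/Clang's __builtin_cpu_supports (the subsets probed here are just a sample, not an exhaustive list):

        #include <stdio.h>

        /* Each AVX-512 subset has to be probed individually, because no
           CPU is guaranteed to implement all of them. */
        int main(void)
        {
            printf("avx512f:  %d\n", __builtin_cpu_supports("avx512f"));
            printf("avx512cd: %d\n", __builtin_cpu_supports("avx512cd"));
            printf("avx512bw: %d\n", __builtin_cpu_supports("avx512bw"));
            printf("avx512dq: %d\n", __builtin_cpu_supports("avx512dq"));
            printf("avx512vl: %d\n", __builtin_cpu_supports("avx512vl"));
            return 0;
        }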



        • #34
          Originally posted by sabian2008:

          Of course I get that part. I was referring to this idea


          Even though Linus probably knows more about the CPU market than me, I think Intel and AMD might know a little bit more than Linus in this regard. Every CPU maker is investing in wide FP registers, some more aggressively, some less, but to say that those things are irrelevant is counterfactual.

          He might have a point about AVX-512 being a rushed technology that is a PITA to maintain, and about all the mess behind the different implementations. However, the arguments he put forward are a little bit lacking. Not surprised, though; he must have been really biting his tongue not to get more verbose on the Rust topic.
          The reason for this movement in the HPC sector is that GPGPU is great, but upload/download latencies are huge: you basically upload a couple of gigs of data, wait five minutes, and download the whole result, because interrupting the GPU to ingest partial results would kill performance even further. This methodology works great for a big set of problems, but with AI and other business-logic processes you lose the (quasi-)real-time ability, or at least the ability to do partial downloads with tolerable latencies (imagine asking Siri something and getting the answer 20 minutes later while you're in the shower; not a great product showcase).

          So a CPU with very wide SIMD can do the same job a lot slower, but the latency for a partial result is negligible, since you operate with huge bandwidths all around (octa-channel RAM, huge L2/L3 caches, etc.), and since you are already on the CPU you don't have to deal with uploads/downloads (or secondary compilations) either. Hence some techniques have emerged that let the GPU do a sort of pre-computation over these massive data sets and use a very wide, powerful CPU to do the rest, reaching quasi-realtime.

          Also, fixed-function hardware accelerators are competing here for the same thing, not just wide CPUs and GPUs.



          • #35
            Originally posted by sabian2008:
            Even though Linus probably knows more about the CPU market than me, I think Intel and AMD might know a little bit more than Linus in this regard. Every CPU maker is investing in wide FP registers, some more aggressively, some less, but to say that those things are irrelevant is counterfactual.
            Intel knows that bullshit sells when they have nothing else to show. Linus was calling this behavior out for what it is... well... bullshit. I don't understand what's so hard to grasp in this situation.



            • #36
              Originally posted by jrch2k8:
              The AVX-512 problem is not a matter of SIMD vs. GPU, or of SIMD being good or bad (it is definitely not bad); it is about the horrible way it got implemented this time around (software- and hardware-wise).
              I agree; Linus's rant shouldn't blame the AVX-512 ISA per se, but rather Intel's product segmentation and implementation details.

              By the way, speaking of GPGPU computing and HSA, I wonder if on-package HBM and CXL will be game changers in this area, as both technologies mitigate the penalties you described.



              • #37
                Originally posted by sophisticles:
                if you want the highest quality you don't use integer, you use floating point; the only reason to use int is that it's much faster due to the lower precision of the calculations
                Wrong. Floating point has limited precision; integer arithmetic is exact.
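
                A minimal illustration of the difference (plain C; 0.1 + 0.2 is the classic binary floating-point rounding example):

                #include <stdio.h>

                int main(void)
                {
                    /* 0.1, 0.2 and 0.3 have no exact binary representation,
                       so the sum picks up rounding error... */
                    printf("%.17g\n", 0.1 + 0.2);      /* 0.30000000000000004 */
                    printf("%d\n", 0.1 + 0.2 == 0.3);  /* 0 (false) */
                    /* ...while the same values scaled to integers stay exact. */
                    printf("%d\n", 1 + 2 == 3);        /* 1 (true) */
                    return 0;
                }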



                • #38
                  Originally posted by xpue:
                  Wrong. Floating point has limited precision; integer arithmetic is exact.
                  With integers, pi is precisely 3.



                  • #39
                    Originally posted by Anarchy:
                    a noob question: how is AVX-512 implemented in the cores? Is it per core, or one unit shared by all?
                    another noob question: is AVX-512 something that can be "emulated" by some kind of tensor coprocessor?
                    AVX-512 is implemented per core, but like everything else with this extension, it gets complicated. Each processor that implements AVX-512 has, in each core, one or two units that can handle some subset of the AVX-512 instructions. So different processors that "implement" AVX-512 will implement it at different speeds and with different levels of parallelism.

                    AVX-512 could also technically be emulated, but it would be extremely slow: you'd have to trap each AVX-512 opcode, reroute the data to some off-CPU unit, and then send the result back to the application making the call. As such, programs don't really do that; instead they are either compiled with options targeting the CPU they'll run on, or, in fancier programs, they include logic that figures out at runtime which operations are supported/fastest and dispatches to separate functions accordingly.
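
                    A minimal sketch of that runtime-dispatch pattern, assuming GCC or Clang (__builtin_cpu_supports is their CPU-feature probe; the function names are made up, and the AVX-512 variant is a stand-in rather than real intrinsics):

                    #include <stdio.h>

                    /* Baseline implementation: plain scalar loop. */
                    static void sum_scalar(const float *x, int n, float *out)
                    {
                        float s = 0.0f;
                        for (int i = 0; i < n; i++)
                            s += x[i];
                        *out = s;
                    }

                    /* Stand-in for an AVX-512 path; a real build would use
                       512-bit intrinsics guarded by
                       __attribute__((target("avx512f"))) here. */
                    static void sum_avx512(const float *x, int n, float *out)
                    {
                        sum_scalar(x, n, out);
                    }

                    /* The implementation is picked once, at startup, so the
                       hot path never re-probes the CPU. */
                    static void (*sum_impl)(const float *, int, float *) = sum_scalar;

                    int main(void)
                    {
                        if (__builtin_cpu_supports("avx512f"))
                            sum_impl = sum_avx512;

                        float data[4] = {1, 2, 3, 4}, result;
                        sum_impl(data, 4, &result);
                        printf("sum = %g\n", result);
                        return 0;
                    }

                    GCC can also automate this pattern with function multiversioning (the target_clones attribute), which generates the variants and the dispatcher for you.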



                    • #40
                      That's surely not an inclusive statement.

