Intel Prepares Linux Kernel Support For Advanced Matrix Extensions (AMX)

  • Intel Prepares Linux Kernel Support For Advanced Matrix Extensions (AMX)

    Phoronix: Intel Prepares Linux Kernel Support For Advanced Matrix Extensions (AMX)

    Following the announcement this summer of Intel Advanced Matrix Extensions (AMX) as an exciting feature coming to Sapphire Rapids Xeon CPUs next year, Intel's open-source engineers quickly began posting patches for AMX support in the LLVM and GNU toolchains. Now Intel engineers have sent out their patches preparing the Linux kernel for AMX...

    http://www.phoronix.com/scan.php?pag...Kernel-Patches
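For context on what these instructions compute: AMX adds 2D "tile" registers and a TMUL unit that performs matrix multiply-accumulate on them, with int8 inputs accumulating into int32 (and BF16 into float32). A plain-Python sketch of that core operation — an illustration of the semantics only, using toy 2x2 shapes rather than the real tile limits (tiles go up to 16 rows of 64 bytes):

```python
# Reference sketch of the tile multiply-accumulate that AMX's TMUL
# unit performs in hardware: C += A @ B, with small-integer inputs
# and a wider (int32) accumulator. Shapes here are illustrative,
# not the architectural limits.
def tile_dpb(C, A, B):
    rows, inner, cols = len(A), len(A[0]), len(B[0])
    for m in range(rows):
        for n in range(cols):
            acc = C[m][n]                    # int32 accumulator tile
            for k in range(inner):
                acc += A[m][k] * B[k][n]     # int8 * int8 products
            C[m][n] = acc
    return C

A = [[1, 2], [3, 4]]      # 2x2 "int8" tile
B = [[5, 6], [7, 8]]      # 2x2 "int8" tile
C = [[0, 0], [0, 0]]      # int32 accumulator tile
tile_dpb(C, A, B)
# C is now [[19, 22], [43, 50]]
```

Each output element is a dot product of a row of A with a column of B, which is why the instruction mnemonics discussed below are all "tile dot product" (TDP*) variants.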

  • #2
    Meanwhile @ Intel HQ.
    -"This Leenux guy. The one complaining about AVX being shit...?"
    "Yes sir?"
    -"Let's piss him off!"
    -"Do something equally stupid and call it ... AMX! Yes! AMX! That'll teach him!"



    • #3
      Originally posted by milkylainen View Post
      Meanwhile @ Intel HQ.
      -"This Leenux guy. The one complaining about AVX being shit...?"
      "Yes sir?"
      -"Let's piss him off!"
      -"Do something equally stupid and call it ... AMX! Yes! AMX! That'll teach him!"
      It'll be hard to beat AVX-512, though.



      • #4
        Wow, an entire new instruction set and registers, just to do fast dot product multiplications... How... CISC.



        • #5
          Intel will never do Cray vectors because that would thwart their ability to segment through ISA extensions.

          By the way expect this (and AVX-512) to become an "industry standard" when Zen 3 puts the final nail in their coffin.



          • #6
            Seems like yet another almost-never-used instruction set, like AVX-512.
            The kernel would probably have to save another set of registers on each context switch?



            • #7
              Originally posted by carewolf View Post
              Wow, an entire new instruction set and registers, just to do fast dot product multiplications... How... CISC.
              If you think it's about computing dot-products, you're missing the point. This is really about optimizing data-movement, and quite obviously for deep learning, as the initial data types (int8 and BFloat16) aren't good for much else (e.g. HPC) that will run on Sapphire Rapids CPUs.

              You can read more about it, here:
              My hunch is that they do some further optimizations to reuse register contents when loading a tile from an overlapping position, as one does in convolutions.
              Last edited by coder; 05 October 2020, 03:07 AM.



              • #8
                Originally posted by Alex/AT View Post
                Seems like yet another almost-never-used instruction set, like AVX-512.
                Not to cast myself as an AMX proponent, but it's more constructive to think of it like crypto-acceleration extensions. In both cases, they're very specialized instructions that need only be supported by a few key libraries, in order to reap the benefits. AVX is far more general than at least AMX's initial incarnation.

                Originally posted by Alex/AT View Post
                The kernel would probably have to save another set of registers on each context switch?
                Yes, on CPUs with the feature enabled, the context would grow by a bit over 8 KiB (8 registers * 1024 bytes, plus the 64-byte tile configuration). However, I think TILERELEASE might allow saving/restoring of the AMX state to be skipped? It'd be nice if you only had to pay that penalty for threads currently using AMX, which might be the case.
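To put a number on that: the tile state is eight 1 KiB tile registers (TMM0–TMM7) plus a 64-byte tile configuration register, so the arithmetic behind "over 8 KiB" works out as:

```python
# Back-of-the-envelope size of the extra per-thread state AMX adds
# to the context that the kernel must save and restore.
TILE_REGS = 8          # TMM0..TMM7
TILE_BYTES = 1024      # each tile register holds up to 1 KiB
TILECFG_BYTES = 64     # the tile configuration register

amx_state = TILE_REGS * TILE_BYTES + TILECFG_BYTES
print(amx_state)  # 8256 bytes
```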

                From what I can see, this really could've been a separate functional unit. Its registers are inaccessible by the CPU's other instructions, so it doesn't much benefit from sharing the CPU's execution pipeline. I think they'd have probably done better to extend their GPU with this functionality and added an iGPU block to some of their server CPUs.
                Last edited by coder; 05 October 2020, 03:16 AM.



                • #9
                  Originally posted by coder View Post
                  If you think it's about computing dot-products, you're missing the point. This is really about optimizing data-movement, and quite obviously for deep learning, as the initial data types (int8 and BFloat16) aren't good for much else (e.g. HPC) that will run on Sapphire Rapids CPUs.

                  You can read more about it, here:
                  My hunch is that they do some further optimizations to reuse register contents when loading a tile from an overlapping position, as one does in convolutions.
                  Note, however, that it only has two operations: TDPBF16PS and TDPB[XX]D, both dot products. You don't do data movement faster by moving it through tiles; it is already limited by memory speed. The only data "movement" that benefits from being done in tiles are rotations, and it doesn't do those.
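For reference, a sketch of what one TDPBF16PS accumulation step does: BF16 is just a float32 with the low 16 mantissa bits dropped, and each step adds a pair of BF16 products into a float32 accumulator. This illustrates the per-element arithmetic only — the real instruction operates across whole tiles with a VNNI-style interleaved layout, which is glossed over here, and the helper names are hypothetical:

```python
import struct

def to_bf16(x):
    """Truncate a float32 to BF16 (keep the top 16 bits), returned as a float."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

def bf16_pair_dp(acc, a_pair, b_pair):
    """One TDPBF16PS-style step: acc += a0*b0 + a1*b1, with BF16 inputs
    and float32 accumulation. Real tile layout details are omitted."""
    return (acc
            + to_bf16(a_pair[0]) * to_bf16(b_pair[0])
            + to_bf16(a_pair[1]) * to_bf16(b_pair[1]))

print(bf16_pair_dp(0.0, (1.0, 2.0), (3.0, 4.0)))  # 11.0
```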



                  • #10
                    Originally posted by carewolf View Post
                    Note, however, that it only has two operations: TDPBF16PS and TDPB[XX]D, both dot products.
                    I didn't say it didn't do dot products, just that the raw computation isn't the key point of it.

                    Originally posted by carewolf View Post
                    You don't do data movement faster by moving it through tiles,
                    Unless what you actually need is a tile arrangement!

                    Originally posted by carewolf View Post
                    it is already limited by memory speed.
                    And why do you think that's not one of the problems they're trying to solve? One thing you're missing is the optimizations they can do behind the scenes. The hardware can track which memory region was loaded into a tile, and they can potentially shift the tile in-place, if you just offset it by one, so they only need to load the leading row or column.

                    It seems like you don't understand the specific problem they're trying to solve. Without that, I don't see how we can hope to have a meaningful discussion about their solution.

                    Look at the types of computations it accelerates, boil those down to AVX-512 operations, and you'll see my point. AVX2/AVX-512 sucks for small, non-separable 2D convolutions, and it's all the data-movement overhead that really hurts. However, data movement is cheap to do in hardware.

                    Although I take issue with these being implemented as CPU instructions, I get what Intel is trying to do here. It's somewhat analogous to what Nvidia did with the tensor "cores" in their GPUs, with similar potential benefits.
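To make the convolution point concrete: the usual way to feed a 2D convolution to a matrix engine is to first gather overlapping patches into rows (the classic "im2col" rearrangement), after which the convolution is just dot products. It's precisely that patch-gathering data movement — note how adjacent patches share most of their elements — that tile hardware can amortize. A minimal sketch, with hypothetical helper names:

```python
# Minimal im2col-style convolution: a KxK convolution over an image
# becomes one dot product per output pixel, between a flattened
# patch and the flattened kernel.
def conv2d_via_im2col(img, ker):
    H, W, K = len(img), len(img[0]), len(ker)
    flat_ker = [ker[di][dj] for di in range(K) for dj in range(K)]
    out = []
    for i in range(H - K + 1):
        row = []
        for j in range(W - K + 1):
            # data movement: gather the overlapping KxK patch
            patch = [img[i + di][j + dj] for di in range(K) for dj in range(K)]
            # compute: one dot product per output pixel
            row.append(sum(p * k for p, k in zip(patch, flat_ker)))
        out.append(row)
    return out

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ker = [[1, 0],
       [0, 1]]
print(conv2d_via_im2col(img, ker))  # [[6, 8], [12, 14]]
```

Moving one output column to the right, the gathered patch shares all but one column with the previous patch — the kind of overlap a tile engine could exploit by shifting a loaded tile in place instead of re-reading memory.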

