Intel Prepares Linux Kernel Support For Advanced Matrix Extensions (AMX)


  • #11

    Originally posted by coder
    And why do you think that's not one of the problems they're trying to solve? One thing you're missing is the optimizations they can do behind the scenes. The hardware can track which memory region was loaded into a tile, and they can potentially shift the tile in-place, if you just offset it by one, so they only need to load the leading row or column
    Moving blocks of memory around is old-school 2D acceleration and called a blitter. You wouldn't need special registers for that. You only need new registers if you intend to operate on the values, not just move them. Note there is nothing in the spec that allows interaction with other CPU extensions or registers; it is load tile, save tile, and dot-product-multiply tile. All that behind-the-scenes magic on the tiles doesn't help if you can't actually do anything on the tiles.

    Originally posted by coder
    It seems like you don't understand the specific problem they're trying to solve. Without that, I don't see how we can hope to have a meaningful discussion about their solution.
    Dot-product multiplication, especially on BF16 values, is a key component of neural network implementations. Intel has already introduced extensions for this particular field in the past; see AVX512-VNNI. Ask yourself why this operation on this particular value type is the only active component so far?
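
For readers who have not met the operation, here is a rough pure-Python sketch of what a "dot-product-multiply tile" step computes: inputs reduced to BF16 precision, products accumulated into a wider result. The names `to_bf16` and `tile_dot_accumulate` are made up for illustration. The sketch truncates instead of rounding, accumulates in Python floats rather than FP32, and ignores the pair-interleaved memory layout the real instruction expects, so treat it as a model of the arithmetic only.

```python
import struct


def to_bf16(x: float) -> float:
    """Truncate a value to bfloat16 precision (keep the top 16 bits of its float32 form)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def tile_dot_accumulate(c, a, b):
    """Update c in place: c[m][n] += sum over k of bf16(a[m][k]) * bf16(b[k][n])."""
    rows, inner, cols = len(a), len(b), len(b[0])
    for m in range(rows):
        for n in range(cols):
            acc = c[m][n]  # accumulator keeps more precision than the inputs
            for k in range(inner):
                acc += to_bf16(a[m][k]) * to_bf16(b[k][n])
            c[m][n] = acc
    return c
```

The point of the narrow input type is visible in `to_bf16`: only 7 mantissa bits survive, which neural network workloads usually tolerate, while the accumulator stays wide.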

    Originally posted by coder
    Although I take issue with these being implemented as CPU instructions, I get what Intel is trying to do, here. It's somewhat analogous to what Nvidia did with the tensor "cores", in their GPUs, with similar potential benefits.
    I can agree on that.


    • #12
      8 KB of context just for two tensor matrix dot-product instructions looks like overkill and a cache thrasher. They don't recommend using flags to lazy load / lazy save the tile registers, so we are in for a fun ride where disabling this by default for every thread, besides the ones explicitly requesting it, should be a must. I did not check whether the patches do this, though.


      • #13
        Originally posted by carewolf
        Moving blocks of memory around is old-school 2D acceleration and called a blitter. You wouldn't need special registers for that.
        It's obviously not for that use case.

        Originally posted by carewolf
        You only need new registers if you intend to operate on the values, not just move them.
        A generalized matrix multiply requires repeated access to each row and column, which would be one justification for stashing them in registers. Presumably, that's where they are headed, in a subsequent iteration of the feature.
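
To put a number on the "repeated access" argument, here is a small sketch that does a blocked matrix multiply and counts how many elements get fetched from the source matrices. The function name `blocked_matmul` and the load counter are illustrative assumptions, not anything from the AMX spec; a block size of 1 stands in for "no reuse", and larger blocks stand in for holding a tile in registers.

```python
def blocked_matmul(a, b, t):
    """Multiply square matrices a and b in t-by-t blocks; return (product, elements loaded)."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, t):
        for j0 in range(0, n, t):
            for k0 in range(0, n, t):
                # "Load" one block of A and one block of B; each element is fetched once here
                # and then reused for every product inside the block.
                ta = [row[k0:k0 + t] for row in a[i0:i0 + t]]
                tb = [row[j0:j0 + t] for row in b[k0:k0 + t]]
                loads += len(ta) * len(ta[0]) + len(tb) * len(tb[0])
                for i in range(len(ta)):
                    for j in range(len(tb[0])):
                        c[i0 + i][j0 + j] += sum(ta[i][k] * tb[k][j] for k in range(len(tb)))
    return c, loads
```

For a 4x4 multiply this counts 128 element loads with `t=1` and 64 with `t=2`; the saving grows in proportion to the block size, which is the usual case for keeping operands in large registers.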

        Originally posted by carewolf
        Note there is nothing in the spec that allows interaction with other CPU extensions or registers,
        I already made that exact observation, in post #8! Do you not read any posts that aren't replies to yours?

        Originally posted by carewolf
        All that behind-the-scenes magic on the tiles doesn't help if you can't actually do anything on the tiles.
        I already gave you a perfect example which exactly fits their primary use case - small 2D non-separable convolutions.
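
A software-only sketch of that convolution case, for illustration: a direct 2D convolution (strictly, a correlation) with a non-separable kernel, where the window is shifted by one column and only the new leading column is loaded. Whether AMX hardware does anything like this internally is speculation in this thread; `conv2d_sliding` and its load counter are hypothetical names used only to show the data-movement saving.

```python
def conv2d_sliding(image, kernel):
    """'Valid' 2D correlation; returns (output, number of image elements loaded)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    loads = 0
    for y in range(h - kh + 1):
        # Load the first full window of this output row.
        window = [list(image[y + r][0:kw]) for r in range(kh)]
        loads += kh * kw
        row = []
        for x in range(w - kw + 1):
            if x > 0:
                # Shift the window by one column and load only the new leading column.
                for r in range(kh):
                    window[r] = window[r][1:] + [image[y + r][x + kw - 1]]
                loads += kh
            row.append(sum(window[r][c] * kernel[r][c]
                           for r in range(kh) for c in range(kw)))
        out.append(row)
    return out, loads
```

On a 3x4 image with a 2x2 kernel this loads 16 elements, against 24 if every output reloaded its whole window.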

        Originally posted by carewolf
        Dot-product multiplication, especially on BF16 values, is a key component of neural network implementations. Intel has already introduced extensions for this particular field in the past; see AVX512-VNNI. Ask yourself why this operation on this particular value type is the only active component so far?
        You're missing 8-bit (also in VNNI, I think), but what they're trying to do is build a better engine for convolutional neural networks than they had with AVX-512 VNNI. So, they put on their engineer's hat and looked at how to optimize it further. They not only had to add more compute - they also had to do something about all the shuffling and data movement that's involved in using an AVX-style approach. That said, someone must have grander ambitions for AMX.


        • #14
          Originally posted by Alex/AT
          8 KB of context just for two tensor matrix dot-product instructions looks like overkill and a cache thrasher. They don't recommend using flags to lazy load / lazy save the tile registers, so we are in for a fun ride where disabling this by default for every thread, besides the ones explicitly requesting it, should be a must. I did not check whether the patches do this, though.
          Again, look at the TILERELEASE instruction: its description talks about putting AMX in some sort of "INIT" state, which presumably could disable saving and loading of the tile register contents. If true, the only time you would actually incur the penalty of loading/saving them is when a thread is in the midst of using them.
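
A toy model of that hypothesis, assuming (and it is only an assumption here) that tile state in the "INIT" state is skipped on save and restore. `Thread`, `configure_tiles`, `release_tiles`, and `context_switch_bytes` are invented names, not kernel or hardware interfaces.

```python
TILE_BYTES = 8 * 1024  # the 8 KB of tile data discussed in this thread


class Thread:
    def __init__(self, name):
        self.name = name
        self.tiles_in_use = False  # False models the "INIT" state

    def configure_tiles(self):
        """Stands in for LDTILECFG followed by real tile work."""
        self.tiles_in_use = True

    def release_tiles(self):
        """Stands in for TILERELEASE."""
        self.tiles_in_use = False


def context_switch_bytes(outgoing, incoming):
    """Bytes of tile state moved on a switch, if INIT-state tiles are skipped (assumed)."""
    saved = TILE_BYTES if outgoing.tiles_in_use else 0
    restored = TILE_BYTES if incoming.tiles_in_use else 0
    return saved + restored
```

Under that assumption, two threads that never touch AMX move zero tile bytes per switch, and a thread that releases its tiles before blocking goes back to costing nothing.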


          • #15
            Originally posted by coder
            Again, look at the TILERELEASE instruction: its description talks about putting AMX in some sort of "INIT" state, which presumably could disable saving and loading of the tile register contents. If true, the only time you would actually incur the penalty of loading/saving them is when a thread is in the midst of using them.
            With XSAVE, maybe. But even with the feature disabled, they list a catch in the spec:

            System software should not use XFD to implement a “lazy restore” approach to management of the XTILEDATA
            state component. This approach will not operate correctly for a variety of reasons. One is that the LDTILECFG and
            TILERELEASE instructions initialize XTILEDATA and do not cause an #NM exception. Another is that an execution of
            XSAVE by a user thread will save XTILEDATA as initialized instead of the data expected by the user thread.
            So, given the above, I wonder whether the INIT state really avoids the save/load.
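
The second reason in that quote can be shown with a toy model. Everything here is a simplification built from the quoted paragraph: `LazyRestoreModel` and its methods are hypothetical, tile data is four integers, and the #NM fault is modelled as the OS restoring inline.

```python
INIT = [0, 0, 0, 0]  # stand-in for initialized (all-zero) tile data


class LazyRestoreModel:
    """Toy model of an OS that defers restoring tile data by arming XFD."""

    def __init__(self, saved_tiles):
        self.kernel_copy = list(saved_tiles)  # the thread's real data, still parked in memory
        self.registers = list(INIT)           # hardware registers hold nothing useful yet
        self.xfd_armed = True                 # restore deferred until a fault arrives

    def tile_compute(self):
        """A tile arithmetic instruction: faults while XFD is armed, so the OS can restore."""
        if self.xfd_armed:
            self.registers = list(self.kernel_copy)
            self.xfd_armed = False
        return list(self.registers)

    def user_xsave(self):
        """XSAVE from user code: no fault, so with XFD armed it records initialized state."""
        return list(INIT) if self.xfd_armed else list(self.registers)
```

If the thread calls `user_xsave()` before any faulting tile instruction, it gets zeros back instead of its own data, which is the silent corruption the spec text warns about.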
