Intel Prepares Linux Kernel Support For Advanced Matrix Extensions (AMX)


  • Alex/AT
    replied
    Originally posted by coder View Post
    Again, if you look at the TILERELEASE instruction, it talks about putting AMX into some sort of "INIT" state, which presumably could disable saving & loading of the TILE register contents. If true, then the only time you'd actually incur the penalty of loading/saving them is when a thread is actually in the midst of using them.
    With XSAVE, maybe. But even with the feature disabled, they list a catch in the spec...

    System software should not use XFD to implement a “lazy restore” approach to management of the XTILEDATA state component. This approach will not operate correctly for a variety of reasons. One is that the LDTILECFG and TILERELEASE instructions initialize XTILEDATA and do not cause an #NM exception. Another is that an execution of XSAVE by a user thread will save XTILEDATA as initialized instead of the data expected by the user thread.
    So, given the above, I wonder whether the INIT state really does avoid the save/load.
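
    As a purely illustrative sketch of how software could observe that state (nothing here is taken from the actual patches): XGETBV with ECX=1 returns the XINUSE bitmap, and XTILEDATA is state component 18, so one could check whether the tiles are live before deciding how expensive the XSAVE will be.

    /* Sketch, not from the patches: read the XINUSE bitmap via XGETBV
     * with ECX=1 (valid only if CPUID.(EAX=0DH,ECX=1):EAX bit 2 is set;
     * a real program would check that first).  XSAVE's init optimization
     * can skip a component whose XINUSE bit is 0, which is what would
     * make a TILERELEASE'd thread cheap to context-switch. */
    #include <stdint.h>
    #include <stdio.h>

    #define XFEATURE_XTILEDATA 18

    static uint64_t xgetbv(uint32_t ecx)
    {
        uint32_t eax, edx;
        __asm__ volatile ("xgetbv" : "=a"(eax), "=d"(edx) : "c"(ecx));
        return ((uint64_t)edx << 32) | eax;
    }

    int main(void)
    {
        uint64_t xinuse = xgetbv(1);

        if (xinuse & (1ULL << XFEATURE_XTILEDATA))
            puts("XTILEDATA live: XSAVE must write the full 8 KiB");
        else
            puts("XTILEDATA in INIT state: XSAVE can skip it");
        return 0;
    }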



  • coder
    replied
    Originally posted by Alex/AT View Post
    8 KB of context just for two tensor/matrix dot-product instructions looks like overkill and a cache thrasher. They recommend against using flags to lazily load/save the tile registers, so we're in for a fun ride where disabling this by default for every thread except those that explicitly request it should be a must. I didn't check whether the patches do it that way, though.
    Again, if you look at the TILERELEASE instruction, it talks about putting AMX into some sort of "INIT" state, which presumably could disable saving & loading of the TILE register contents. If true, then the only time you'd actually incur the penalty of loading/saving them is when a thread is actually in the midst of using them.



  • coder
    replied
    Originally posted by carewolf View Post
    Moving blocks of memory around is old-school 2D acceleration; the hardware for it is called a blitter. You wouldn't need special registers for that.
    It's obviously not for that use case.

    Originally posted by carewolf View Post
    You only need new registers if you intend to operate on the values, not just move them.
    A generalized matrix multiply requires repeated access to each row and column, which would be one justification for stashing them in registers. Presumably, that's where they are headed, in a subsequent iteration of the feature.
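
    For illustration only (plain C, nothing AMX-specific): in a naive matrix multiply, each row of A and each column of B is re-read once per output element it contributes to, and that repeated access is exactly what register-resident tiles would capture.

    /* Naive GEMM: C[i][j] += sum_k A[i][k] * B[k][j].  Row i of A is
     * re-read for every j, and column j of B for every i; holding a
     * tile of each in registers turns those repeated loads into one. */
    void gemm_naive(int n, const float *A, const float *B, float *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                float acc = C[i * n + j];
                for (int k = 0; k < n; k++)
                    acc += A[i * n + k] * B[k * n + j];
                C[i * n + j] = acc;
            }
    }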

    Originally posted by carewolf View Post
    Note there is nothing in the spec that allows interaction with other CPU extensions or registers,
    I already made that exact observation, in post #8! Do you not read any posts that aren't replies to yours?

    Originally posted by carewolf View Post
    All that behind-the-scenes magic on the tiles doesn't help if you can't actually do anything with the tiles.
    I already gave you a perfect example that exactly fits their primary use case: small, non-separable 2D convolutions.
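
    For reference, here's the shape of that workload in plain C. Adjacent outputs share almost their whole input window, which is why an AVX implementation drowns in shuffles while a resident tile would not. Illustrative only:

    /* Direct (non-separable) k x k convolution over one channel.
     * Neighboring outputs share k*k - k inputs, so a vector (AVX)
     * version spends most of its effort shuffling overlapping windows
     * into lanes rather than multiplying. */
    void conv2d_direct(int h, int w, int k,
                       const float *in, const float *kern, float *out)
    {
        for (int y = 0; y + k <= h; y++)
            for (int x = 0; x + k <= w; x++) {
                float acc = 0.0f;
                for (int ky = 0; ky < k; ky++)
                    for (int kx = 0; kx < k; kx++)
                        acc += in[(y + ky) * w + (x + kx)] * kern[ky * k + kx];
                out[y * (w - k + 1) + x] = acc;
            }
    }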

    Originally posted by carewolf View Post
    Dot-product multiplication, especially on BF16 values, is a key component of neural-network implementations. Intel has already introduced extensions for this particular field in the past; see AVX512-VNNI. Ask yourself why this operation on this particular value type is the only active component so far.
    You're missing 8-bit (also in VNNI, I think), but what they're trying to do is build a better engine for convolutional neural networks than they had with AVX-512 VNNI. So, they put on their engineer's hat and looked at how to optimize it further. They not only had to add more compute; they also had to do something about all the shuffling and data movement that's involved in using an AVX-style approach. That said, someone must have grander ambitions for AMX.
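
    For context on the AVX-512 VNNI baseline being discussed (a sketch assuming a VNNI-capable CPU and a compiler flag like -mavx512vnni): _mm512_dpbusd_epi32 is the fused u8 x s8 multiply-accumulate that AMX's TDPB[XX]D instructions lift from one vector register to a whole tile.

    /* AVX512-VNNI: each 32-bit lane of the result accumulates four
     * u8*s8 products from a and b.  AMX's TDPBUSD is the same basic
     * operation applied tile-wide instead of per ZMM register. */
    #include <immintrin.h>

    __m512i dot_accumulate(__m512i acc, __m512i a_u8, __m512i b_s8)
    {
        return _mm512_dpbusd_epi32(acc, a_u8, b_s8);
    }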



  • Alex/AT
    replied
    8 KB of context just for two tensor/matrix dot-product instructions looks like overkill and a cache thrasher. They recommend against using flags to lazily load/save the tile registers, so we're in for a fun ride where disabling this by default for every thread except those that explicitly request it should be a must. I didn't check whether the patches do it that way, though.



  • carewolf
    replied

    Originally posted by coder View Post
    And why do you think that's not one of the problems they're trying to solve? One thing you're missing is the optimizations they can do behind the scenes. The hardware can track which memory region was loaded into a tile, and if the next load is just offset by one, it could potentially shift the tile in place so that only the leading row or column needs to be loaded.
    Moving blocks of memory around is old-school 2D acceleration; the hardware for it is called a blitter. You wouldn't need special registers for that. You only need new registers if you intend to operate on the values, not just move them. Note there is nothing in the spec that allows interaction with other CPU extensions or registers: it is load tile, store tile, and dot-product-multiply tile. All that behind-the-scenes magic on the tiles doesn't help if you can't actually do anything with the tiles.

    Originally posted by coder View Post
    It seems like you don't understand the specific problem they're trying to solve. Without that, I don't see how we can hope to have a meaningful discussion about their solution.
    Dot-product multiplication, especially on BF16 values, is a key component of neural-network implementations. Intel has already introduced extensions for this particular field in the past; see AVX512-VNNI. Ask yourself why this operation on this particular value type is the only active component so far.

    Originally posted by coder View Post
    Although I take issue with these being implemented as CPU instructions, I get what Intel is trying to do, here. It's somewhat analogous to what Nvidia did with the tensor "cores", in their GPUs, with similar potential benefits.
    I can agree on that.



  • coder
    replied
    Originally posted by carewolf View Post
    Note, however, that it only has two operations, TDPBF16PS and TDPB[XX]D, both dot products.
    I didn't say it didn't do dot products, just that the raw computation isn't the key point of it.

    Originally posted by carewolf View Post
    You don't do data movement faster by moving it through tiles,
    Unless what you actually need is a tile arrangement!

    Originally posted by carewolf View Post
    it is already limited by memory speed.
    And why do you think that's not one of the problems they're trying to solve? One thing you're missing is the optimizations they can do behind the scenes. The hardware can track which memory region was loaded into a tile, and if the next load is just offset by one, it could potentially shift the tile in place so that only the leading row or column needs to be loaded.
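
    To make that hypothetical concrete (pure speculation on my part; nothing in the spec exposes this): sliding a 16-row tile window down by one row would only need the new leading row fetched from memory, something like:

    /* Hypothetical illustration of the "shift in place" idea.  The
     * other 15 rows are already present and just move up one slot;
     * only one new row is fetched.  In hardware this would be
     * invisible micro-architectural reuse, not a visible instruction. */
    #include <string.h>

    #define TILE_ROWS 16
    #define TILE_ROW_BYTES 64

    void tile_slide_down(unsigned char tile[TILE_ROWS][TILE_ROW_BYTES],
                         const unsigned char *new_row)
    {
        /* shift rows 1..15 up into rows 0..14 */
        memmove(tile[0], tile[1], (TILE_ROWS - 1) * TILE_ROW_BYTES);
        /* fetch only the one new leading row from memory */
        memcpy(tile[TILE_ROWS - 1], new_row, TILE_ROW_BYTES);
    }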

    It seems like you don't understand the specific problem they're trying to solve. Without that, I don't see how we can hope to have a meaningful discussion about their solution.

    Look at the types of computations it accelerates and boil those down to AVX-512 operations and you'll see my point. AVX2/AVX-512 sucks for small, non-separable 2D convolutions. And it's all the data-movement overhead that really hurts. However, data-movement is cheap to do, in hardware.

    Although I take issue with these being implemented as CPU instructions, I get what Intel is trying to do, here. It's somewhat analogous to what Nvidia did with the tensor "cores", in their GPUs, with similar potential benefits.



  • carewolf
    replied
    Originally posted by coder View Post
    If you think it's about computing dot-products, you're missing the point. This is really about optimizing data-movement, and quite obviously for deep learning, as the initial data types (int8 and BFloat16) aren't good for much else that will run on Sapphire Rapids CPUs (e.g. HPC).

    You can read more about it here:
    My hunch is that they do some further optimizations to reuse register contents when loading a tile from an overlapping position, as one does in convolutions.
    Note, however, that it only has two operations, TDPBF16PS and TDPB[XX]D, both dot products. You don't do data movement faster by moving it through tiles; it is already limited by memory speed. The only data "movement" that benefits from being done in tiles is rotations, and it doesn't do those.
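
    For reference, the entire programming surface really is that small. A minimal sketch using the GCC/Clang AMX intrinsics (assumes -mamx-tile -mamx-bf16, an OS that has enabled the tile state, and a B operand already in the interleaved bf16-pair layout the instruction expects):

    /* Minimal AMX flow: configure tiles, load, one TDPBF16PS, store,
     * release.  tmm0 accumulates fp32; tmm1/tmm2 hold bf16 operands. */
    #include <immintrin.h>
    #include <stdint.h>
    #include <string.h>

    struct tile_config {
        uint8_t  palette_id;    /* palette 1 is the only one defined */
        uint8_t  start_row;
        uint8_t  reserved[14];
        uint16_t colsb[16];     /* bytes per row, per tile */
        uint8_t  rows[16];      /* rows, per tile */
    };

    void amx_dpbf16ps_16x16(float *c, const void *a, const void *b)
    {
        struct tile_config cfg;
        memset(&cfg, 0, sizeof(cfg));
        cfg.palette_id = 1;
        cfg.rows[0] = 16; cfg.colsb[0] = 64;  /* tmm0: 16x16 fp32 */
        cfg.rows[1] = 16; cfg.colsb[1] = 64;  /* tmm1: 16x32 bf16 */
        cfg.rows[2] = 16; cfg.colsb[2] = 64;  /* tmm2: 16x32 bf16 */
        _tile_loadconfig(&cfg);

        _tile_zero(0);
        _tile_loadd(1, a, 64);      /* 64-byte row stride */
        _tile_loadd(2, b, 64);
        _tile_dpbf16ps(0, 1, 2);    /* tmm0 += tmm1 . tmm2 (bf16 pairs -> fp32) */
        _tile_stored(0, c, 64);

        _tile_release();            /* back to INIT: cheap to switch out */
    }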



  • coder
    replied
    Originally posted by Alex/AT View Post
    Seems like yet another instruction set that will almost never be used, like AVX512.
    Not to cast myself as an AMX proponent, but it's more constructive to think of it like the crypto-acceleration extensions. In both cases, they're very specialized instructions that only need to be supported by a few key libraries in order to reap the benefits. AVX is far more general than at least AMX's initial incarnation.

    Originally posted by Alex/AT View Post
    The kernel would probably have to save another set of registers on each context switch?
    Yes, on CPUs with the feature enabled, the context would bloat by over 8 KB (8 registers * 1024 bytes, plus configuration). However, I think TILERELEASE might allow saving/restoring of the AMX state to be skipped? It'd be nice if you only had to pay that penalty for threads currently using AMX, which might be the case.
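
    The sizes are queryable, for what it's worth: CPUID leaf 0xD reports each XSAVE state component, with XTILECFG as component 17 and XTILEDATA as component 18. A small sketch:

    /* Print the XSAVE size and offset of the AMX state components.
     * Component 17 = XTILECFG (64 bytes), 18 = XTILEDATA (8192 bytes);
     * this is what a context switch carries once a thread uses tiles. */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;

        for (unsigned comp = 17; comp <= 18; comp++) {
            if (!__get_cpuid_count(0xD, comp, &eax, &ebx, &ecx, &edx) || !eax) {
                printf("component %u not reported (no AMX?)\n", comp);
                continue;
            }
            /* EAX = size in bytes, EBX = offset into the XSAVE area */
            printf("component %u: %u bytes at offset %u\n", comp, eax, ebx);
        }
        return 0;
    }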

    From what I can see, this really could've been a separate functional unit. Its registers are inaccessible to the CPU's other instructions, so it doesn't much benefit from sharing the CPU's execution pipeline. I think they'd probably have done better to extend their GPU with this functionality and add an iGPU block to some of their server CPUs.
    Last edited by coder; 05 October 2020, 03:16 AM.



  • coder
    replied
    Originally posted by carewolf View Post
    Wow, an entire new instruction set and registers, just to do fast dot-product multiplications... How... CISC.
    If you think it's about computing dot-products, you're missing the point. This is really about optimizing data-movement, and quite obviously for deep learning, as the initial data types (int8 and BFloat16) aren't good for much else that will run on Sapphire Rapids CPUs (e.g. HPC).

    You can read more about it here:
    My hunch is that they do some further optimizations to reuse register contents when loading a tile from an overlapping position, as one does in convolutions.
    Last edited by coder; 05 October 2020, 03:07 AM.



  • Alex/AT
    replied
    Seems like yet another instruction set that will almost never be used, like AVX512.
    The kernel would probably have to save another set of registers on each context switch?

