Intel Posts Newest Advanced Matrix Extensions Patches For Linux (AMX Patches v7)

  • Intel Posts Newest Advanced Matrix Extensions Patches For Linux (AMX Patches v7)

    Phoronix: Intel Posts Newest Advanced Matrix Extensions Patches For Linux (AMX Patches v7)

    For over a year now, since Advanced Matrix Extensions (AMX) was first disclosed as a future feature of Xeon "Sapphire Rapids", Intel engineers have been posting patches to enable AMX across the software stack, from the kernel to compiler toolchains. The Linux kernel support for AMX hasn't yet landed but has now been revised for the seventh time for public review...

    https://www.phoronix.com/scan.php?pa...X-For-Linux-v7

  • #2
    ensuring AMX isn't running simultaneously on SMT siblings, and a new system call is introduced so applications can request access to AMX usage. The system call (an arch_prctl flag) for requesting AMX access is done to signal the application is responsible for using an alternative signal stack and that the stack is large enough, which can be easily accomplished using the modern Glibc ABI. Trying to make use of AMX on Linux without proper permissions from the system call will result in the process exiting.
    Holy cow! At this point, why not just write a device driver for it, too? It sounds like usage is going to be somewhat limited to libraries, anyhow.

    I'd like to understand more about why you can't use it from > 1 hyperthread per core. I know it's usually better to limit AVX to one thread per core, but it imposes an extra burden on software if we have to actually design around that limitation! Could it be the result of a hardware bug?
    Last edited by coder; 10 July 2021, 05:10 PM.


    • #3
      Originally posted by coder View Post
      Holy cow! At this point, why not just write a device driver for it, too? It sounds like usage is going to be somewhat limited to libraries, anyhow.

      I'd like to understand more about why you can't use it from > 1 hyperthread per core. I know it's usually better to limit AVX to one thread per core, but it imposes an extra burden on software if we have to actually design around that limitation! Could it be the result of a hardware bug?
      I suspect it is just that restoring state for multiple threads would kill the performance. They also suggest using thread affinity settings when running benchmarks.

      See


      • #4
        Originally posted by jayN View Post
        I suspect it is just that restoring state for multiple threads would kill the performance.
        Based on what I quoted, it sounds like a hard requirement that AMX is only running on one hyperthread per core, at a time!


        • #5
          Originally posted by coder View Post
          Based on what I quoted, it sounds like a hard requirement that AMX is only running on one hyperthread per core, at a time!
          The Linux implementation may make it a hard requirement not to schedule on a shared resource, but the comment from the developer (at Intel) stated that the underlying cause was a very significant performance issue. Basically, at this point in the hardware and Linux implementation, schedule processes that will use AMX on different cores.
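
One way to act on the "different cores" advice is to pin AMX workers only to the first hyperthread of each core. The sketch below assumes the standard Linux sysfs topology file `thread_siblings_list`; the helper names `first_cpu_in_list` and `is_primary_thread` are made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parses a sysfs CPU list such as "0,64" or "2-3" and returns the first
 * CPU number in it, or -1 if the string does not start with a number. */
int first_cpu_in_list(const char *list)
{
    char *end;
    long cpu = strtol(list, &end, 10);
    return (end == list || cpu < 0) ? -1 : (int)cpu;
}

/* Returns 1 if `cpu` is the first hyperthread listed for its core,
 * 0 if it is a later sibling, and -1 if the topology file is unreadable. */
int is_primary_thread(int cpu)
{
    char path[128], buf[256];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", cpu);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *line = fgets(buf, sizeof buf, f);
    fclose(f);
    if (line == NULL)
        return -1;

    return first_cpu_in_list(buf) == cpu;
}
```

A benchmark harness could then restrict its AMX threads to CPUs for which `is_primary_thread()` returns 1, for example with `taskset -c` or `sched_setaffinity()`.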


          • #6
            I find this very unpleasant; it looks like even more x86 hacks to keep the thing proper. These AMX instructions are the ones Linus Torvalds was complaining hard about, aren't they?


            • #7
              Sent out on Saturday by Intel was their latest set of 26 patches for supporting Advanced Matrix Extensions in the kernel. Kernel changes for AMX are needed around the software stack management with on-demand expansion of per-task context switch buffers using XSAVE, ensuring AMX isn't running simultaneously on SMT siblings, and a new system call is introduced so applications can request access to AMX usage.
              The bold statement above (that the kernel changes ensure AMX isn't running simultaneously on SMT siblings) is false. The system call is used for ABI compatibility and has nothing to do with HT.

              All CPUs support the exact same instructions, and each HT sibling has its own AMX TMM registers.

              As we tried to describe in the LKML message referenced, we expect AMX scalability to be analogous to AVX scalability across HT siblings.
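
For reference, requesting AMX access through the system call might look like the sketch below. It uses the `arch_prctl` constants from mainline Linux (5.16 and later), which are an assumption here: the v7 series discussed in this thread may have used different names or values. The wrapper name `request_amx_permission` is hypothetical.

```c
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Assumed values, taken from the mainline x86 uapi headers (Linux 5.16+);
 * the v7 patch series may differ. */
#ifndef ARCH_REQ_XCOMP_PERM
#define ARCH_REQ_XCOMP_PERM 0x1023
#endif
#define XFEATURE_XTILEDATA 18 /* AMX tile data state component */

/* Hypothetical wrapper: asks the kernel for permission to use AMX tile data.
 * Returns 0 if permission was granted, -1 with errno set otherwise
 * (for example on CPUs or kernels without AMX support). */
int request_amx_permission(void)
{
#if defined(__x86_64__) && defined(SYS_arch_prctl)
    return (int)syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);
#else
    errno = ENOSYS; /* arch_prctl is x86-specific */
    return -1;
#endif
}
```

A program would call this once before executing its first AMX instruction and fall back to a non-AMX code path if it returns -1.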
