Tensor LLVM Extensions Proposed For Targeting AI Accelerators, Emerging Hardware

  • Tensor LLVM Extensions Proposed For Targeting AI Accelerators, Emerging Hardware

    Phoronix: Tensor LLVM Extensions Proposed For Targeting AI Accelerators, Emerging Hardware

    Intel, Amazon AWS, IBM, Qualcomm, and UIUC researchers have been collaborating over a proposed "Tensor LLVM Extensions" (TLX) to make this open-source compiler infrastructure more suitable for targeting AI accelerators and other emerging classes of hardware...

  • #2
    This is a big deal, right? The dream is to just download a project built for Nvidia or whatever, and have it run on other hardware with minimal tweaks.

    • #3
      Yup, indeed. As long as the high-level libraries (PyTorch, etc.) are written to use that as a backend, it should help make such code more portable across platforms.
      Now we still need to persuade those devs who insist on only writing proprietary CUDA code.
      (Though even on that front, AMD has been developing portable compiler tooling that can ingest CUDA; rough sketch below.)
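      For the curious, that's presumably AMD's HIP / hipify tooling. The reason it can work at all is that the CUDA runtime API maps almost one-to-one onto HIP, so most host and kernel code ports mechanically. A toy sketch (real HIP calls, but the kernel and sizes are just illustrative):

      ```cpp
      // Toy HIP example: porting from CUDA is mostly a mechanical rename
      // of the runtime API (cudaMalloc -> hipMalloc, etc.).
      #include <hip/hip_runtime.h>

      __global__ void scale(float* x, float a, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) x[i] *= a;
      }

      int main() {
          const int n = 1024;
          float* d_x = nullptr;
          hipMalloc((void**)&d_x, n * sizeof(float));      // was: cudaMalloc
          hipMemset(d_x, 0, n * sizeof(float));            // was: cudaMemset
          scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);   // same launch syntax
          hipDeviceSynchronize();                          // was: cudaDeviceSynchronize
          hipFree(d_x);                                    // was: cudaFree
          return 0;
      }
      ```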

      • #4
        Isn't that what OpenCL was supposed to do?

        • #5
          Originally posted by brucethemoose View Post
          This is a big deal, right? The dream is to just download a project built for Nvidia or whatever, and have it run on other hardware with minimal tweaks.
          Except that almost nobody uses stock LLVM directly to target compute workloads on Nvidia hardware. The normal way to develop for their stack is their CUDA API and compiler; nvcc's device compiler is itself built on a modified LLVM (via NVVM), but that's almost beside the point.

          For a portable solution, you'd first need to be using a hardware-agnostic host API, like OpenCL, Vulkan, or oneAPI. Then your device code would be compiled to something like SPIR-V that supports the new operations, the backend would have to be LLVM-based, and you'd be limited to whatever operators LLVM supports, which isn't going to be a 1:1 match to what the hardware actually provides. That mismatch will cost some efficiency, and how much depends on the hardware target and which operators you actually need.

          I don't mean to pour cold water on this news, but it's merely a building block in a grander scheme. It takes us mostly back to the picture we had of device-portability before tensor instructions came onto the scene, some ~4 years ago.

          What impresses me is that there's enough similarity between the devices & their native operations that we can even seriously talk about something like this. Compared with vector operations, I think there's greater variety in how vendors have approached tensors.
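          Just to make that portable path concrete, here's roughly what it looks like from the host side with OpenCL: a vendor-neutral host API, device code shipped as SPIR-V, and the driver's (typically LLVM-based) compiler doing the final lowering to the native ISA. That last step is where common tensor operations would either hit real tensor instructions or fall back to something slower. Error handling is stripped and the file/kernel names are made up:

          ```cpp
          // Sketch of the host side of the "portable" pipeline: OpenCL host API +
          // device code precompiled to SPIR-V (e.g. with clang + llvm-spirv).
          // Error handling omitted; "kernels.spv" / "my_tensor_kernel" are made up.
          #define CL_TARGET_OPENCL_VERSION 300
          #include <CL/cl.h>
          #include <fstream>
          #include <iterator>
          #include <vector>

          int main() {
              cl_platform_id platform;
              clGetPlatformIDs(1, &platform, nullptr);

              cl_device_id device;
              clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

              cl_int err;
              cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);

              // Load the SPIR-V module produced offline.
              std::ifstream f("kernels.spv", std::ios::binary);
              std::vector<char> spirv((std::istreambuf_iterator<char>(f)),
                                      std::istreambuf_iterator<char>());

              // The driver's compiler lowers the SPIR-V to the device's native ISA here
              // (requires OpenCL 2.1+ or the cl_khr_il_program extension).
              cl_program prog = clCreateProgramWithIL(ctx, spirv.data(), spirv.size(), &err);
              clBuildProgram(prog, 1, &device, "", nullptr, nullptr);

              cl_kernel kernel = clCreateKernel(prog, "my_tensor_kernel", &err);
              // ...set args, enqueue, read back results as usual...
              return 0;
          }
          ```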
          Last edited by coder; 14 November 2021, 06:12 PM.

          • #6
            Originally posted by DrYak View Post
            Yup, indeed. As long as the high-level libraries (PyTorch, etc.) are written to use that as a backend, it should help make such code more portable across platforms.
            Now we still need to persuade those devs who insist on only writing proprietary CUDA code.
            This is only the backend support. We'll still need SPIR-V support, and then (for those not directly targeting SPIR-V) extensions added to something like OpenCL.

            This is an important first step, but only the first.

            • #7
              Originally posted by MadeUpName View Post
              Isn't that what OpenCL was supposed to do?
              OpenCL* created a portable host API, a common device language, and a common intermediate representation; without those pieces, it wouldn't even matter whether the device compiler supported a common set of tensor operations.

              To round out the picture, oneAPI builds on OpenCL's foundations, though I honestly can't (yet) say much else about it. WebGPU sits a level further up the stack, but (I think) is more agnostic about whatever sits between it and the hardware.

              * OpenCL was itself influenced by OpenGL and prior GPU compute languages & toolchains (including CUDA). Vulkan borrowed and extended OpenCL's SPIR intermediate representation.

              • #8
                Originally posted by MadeUpName View Post
                Isn't that what OpenCL was supposed to do?
                This should open up tensor extensions, natively, to any language that uses LLVM. Think Swift, Rust, Haskell, Fortran, Julia, and many others, without going through C.

                • #9
                  Originally posted by vegabook View Post
                  This should open up tensor extensions, natively, to any language that uses LLVM. Think Swift, Rust, Haskell, Fortran, Julia, and many others, without going through C.
                  Whatever language you use to write the device code, you'll probably be limited to some form of intrinsics, library functions, or a narrow range of idioms to get decent utilization out of the hardware. The most sensible approach is probably to build a set of library primitives on top of whatever common instructions get defined, and use those to implement more useful, high-level operations.
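                  To illustrate (with a completely made-up builtin name; this is not anything TLX or Clang actually defines): the compiler exposes some fixed-size tile operation, and the useful high-level stuff gets built as library primitives on top of it:

                  ```cpp
                  // Hypothetical sketch only: "__hypothetical_tensor_mma_16x16" is an
                  // invented stand-in for whatever tensor intrinsic the compiler exposes.
                  #include <cstddef>

                  // Pretend intrinsic: C[16x16] += A[16x16] * B[16x16], row-major,
                  // with explicit leading dimensions (lda/ldb/ldc).
                  extern "C" void __hypothetical_tensor_mma_16x16(
                      const float* a, std::size_t lda,
                      const float* b, std::size_t ldb,
                      float* c, std::size_t ldc);

                  // The kind of library primitive a frontend (Rust, Julia, Fortran, ...)
                  // would actually call: a blocked matmul where each inner call ideally
                  // maps onto one native tensor instruction. Assumes m, n, k are
                  // multiples of the tile size.
                  void matmul_tiled(const float* a, const float* b, float* c,
                                    std::size_t m, std::size_t n, std::size_t k) {
                      constexpr std::size_t T = 16;  // tile size dictated by the hardware op
                      for (std::size_t i = 0; i < m; i += T)
                          for (std::size_t j = 0; j < n; j += T)
                              for (std::size_t p = 0; p < k; p += T)
                                  __hypothetical_tensor_mma_16x16(a + i * k + p, k,
                                                                  b + p * n + j, n,
                                                                  c + i * n + j, n);
                  }
                  ```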

                  • #10
                    Originally posted by coder View Post
                    Except that almost nobody uses stock LLVM directly to target compute workloads on Nvidia hardware. The normal way to develop for their stack is their CUDA API and compiler; nvcc's device compiler is itself built on a modified LLVM (via NVVM), but that's almost beside the point.

                    For a portable solution, you'd first need to be using a hardware-agnostic host API, like OpenCL, Vulkan, or oneAPI. Then your device code would be compiled to something like SPIR-V that supports the new operations, the backend would have to be LLVM-based, and you'd be limited to whatever operators LLVM supports, which isn't going to be a 1:1 match to what the hardware actually provides. That mismatch will cost some efficiency, and how much depends on the hardware target and which operators you actually need.

                    I don't mean to pour cold water on this news, but it's merely a building block in a grander scheme. It takes us mostly back to the picture we had of device-portability before tensor instructions came onto the scene, some ~4 years ago.

                    What impresses me is that there's enough similarity between the devices & their native operations that we can even seriously talk about something like this. Compared with vector operations, I think there's greater variety in how vendors have approached tensors.
                    I was thinking this is something PyTorch, TensorFlow and so on could use on their end to help make projects more portable. If LLVM is targeting tensor operations (and not generic compute like the other APIs), wouldn't that have less overhead than, say, the Vulkan-based ncnn backend lots of projects use now?
