AMD Unified Inference Frontend 1.1 Released


    Phoronix: AMD Unified Inference Frontend 1.1 Released

    AMD in February quietly released version 1.1 of their in-development Unified Inference Front-end "UIF" that aims to be their catch-all solution for AI inference from CPUs to GPUs to FPGAs and other IP from their recent Xilinx acquisition...

  • #2
    Hope they will add Versal AI cores to Desktop CPUs or GPUs some time soon.
    Really looking forward to the 7040 Series Phoenix APUs, as they got Versal Cores.


    • #3
      Originally posted by Spacefish
      Hope they will add Versal AI cores to Desktop CPUs or GPUs some time soon.
      Really looking forward to the 7040 Series Phoenix APUs, as they got Versal Cores.


      • #4
        There's just one sad thing about this. We had a technology for over a decade now that did this unification: it's called OpenCL. It's a shame how such a data-parallel workload got fused off into 10 different APIs and technologies only to be unified again under a Python umbrella.


        • #5
          Originally posted by Meteorhead
          There's just one sad thing about this. We had a technology for over a decade now that did this unification: it's called OpenCL. It's a shame how such a data-parallel workload got fused off into 10 different APIs and technologies only to be unified again under a Python umbrella.
          CUDA might be the culprit here.


          • #6
            How far away can we be from a less monolithic / vertically integrated stack / ecosystem and something more granular?
            At the bottom end the CPU/GPU implements some registers and instructions and various cache / memory access capabilities and then various levels of ALU / CORE / SIMD / thread / thread group / work group / ... parallelism for the most part.
            At the next higher end there's the overall resource scheduling, data and code loading from the host CPU / memory.
            Then there's a usual bit of shared memory between host / GPU or within the execution groups within the GPU.

            A lot of that GENERAL sort of stuff already has to be considered for even CPUs these days with NUMA and threading and multi-core and multi-chiplet
            and localized caches and backplane connected multiprocessor systems etc. to say nothing of MPI / distributed computing etc.

            So obviously stuff like LLVM and general compilers already deal with ISA / architecture level optimization for caches, instruction ordering,
            instruction translation / lowering from higher level generic IR / pseudocode inputs, resource contention for registers / memory / cache / ALUs etc.
            Vectorization optimizers deal with SIMD generation where possible, and loop unrolling, loop-invariant code optimization and all that happen too.
            Then layers of scheduling stuff deal with thread / process scheduling and even job scheduling across distributed systems, message passing, MPI, etc. etc.
            There are even performance analyzers, plus run-time and post-run profiling and optimization, to improve efficiency / adapt to complex workloads and
            architectures.

            And more and more languages, ecosystems and development models are just adopting concurrency models independent of the actual machine level implementation of
            cores / threads e.g. goroutines, CSP, XMOS, Erlang, event driven / message passing systems, shared-nothing, etc.
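
As a rough illustration of that point, here is a minimal CSP-style sketch in Python (threads and queues standing in for goroutines and channels). The function name `run_pipeline` and its parameters are made up for the example. The calling code never mentions cores or hardware threads, and the worker count is just a tuning knob:

```python
import queue
import threading


def run_pipeline(items, workers=3):
    """Square each item using workers that share nothing but two channels."""
    jobs = queue.Queue()      # channel carrying work to the workers
    results = queue.Queue()   # channel carrying answers back

    def worker():
        while True:
            item = jobs.get()
            if item is None:  # sentinel: no more work for this worker
                break
            results.put(item * item)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for item in items:
        jobs.put(item)
    for _ in threads:
        jobs.put(None)        # one sentinel per worker
    for t in threads:
        t.join()
    # Completion order depends on scheduling, so sort for a stable result.
    return sorted(results.get() for _ in range(len(items)))
```

Calling `run_pipeline([1, 2, 3, 4], workers=1)` and `run_pipeline([1, 2, 3, 4], workers=8)` gives the same answer; only the mapping onto the machine changes.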

            So CUDA and PyTorch and TensorFlow and whatever are great, and so is C++ or DPC++ or OpenMP or whatever, but surely with the right tools,
            separation of concerns and intermediate layers of optimizers / translators etc. we can get to where the CPU / GPU vendor we're targeting at
            run time doesn't dictate how we code things at a high level, and open compiler / optimizer / execution environment management runtime tools
            just make your code work fairly efficiently on whatever ISA / architecture / CPU / GPU you may have, with a minimal amount of
            vendor specific "lock in" of the entire tool chain / compiler / optimizer / scheduler / runtime / driver / ... etc.

            So whether you start with OCL or OMP or DPC++ or Rust or Go or C++ or whatever, you'd want it to build / run efficiently once you've
            indicated, manually or by automatic analysis, where the parallelism / data flow should be focused across the various RAM / IO / distributed resources in
            your mesh of computing and storage resources.

            Originally posted by Meteorhead
            There's just one sad thing about this. We had a technology for over a decade now that did this unification: it's called OpenCL. It's a shame how such a data-parallel workload got fused off into 10 different APIs and technologies only to be unified again under a Python umbrella.


            • #7
              Originally posted by Meteorhead
              There's just one sad thing about this. We had a technology for over a decade now that did this unification: it's called OpenCL. It's a shame how such a data-parallel workload got fused off into 10 different APIs and technologies only to be unified again under a Python umbrella.
              I blame the big software vendors, but Google most of all. After Apple withdrew support for OpenCL, it was only Google that had the clout to make it stick. Instead, they chose to push RenderScript as their preferred GPU compute solution for Android and literally banned Android devices from shipping with OpenCL support installed. Microsoft was always going to go its own way, with things like DirectCompute, C++AMP, etc. So, it really came down to Google... and they failed us.

              Among vendors, Intel is the lone OpenCL holdout. For that, my next GPU will be Intel. But, the damage was already done. Without a vibrant community, OpenCL got off track and fell too far behind CUDA.

              AMD got distracted by HSA, which went nowhere, and then threw in the towel and made its CUDA clone called HIP. That left a window of at least 5 years that were crucial for deep learning, where CUDA was basically the only viable API. Between that and Nvidia's support for educational institutions & deep learning researchers, they easily sewed up that market.

              Originally posted by aviallon
              CUDA might be the culprit here.
              Eh, I don't blame Nvidia as much as Google. Vendors -- especially dominant ones -- are always inclined to push their own proprietary APIs and frameworks. In many ways, CUDA did pioneer GPU compute programmability, not entirely unlike how Mantle prototyped a new GPU rendering API. I think OpenCL generally benefited from that.

              It's only the biggest software vendors and platform companies who have the clout to push standards on hardware vendors (e.g. Vulkan).
              Last edited by coder; 07 March 2023, 03:07 AM.


              • #8
                Originally posted by pong
                A lot of that GENERAL sort of stuff already has to be considered for even CPUs these days with NUMA and threading and multi-core and multi-chiplet
                and localized caches and backplane connected multiprocessor systems etc. to say nothing of MPI / distributed computing etc.
                I like how OpenCL exposes memory hierarchies and workgroup parallelism. We need more of those ideas to permeate userspace APIs and OS schedulers.
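
For anyone who hasn't seen those two ideas, here is a rough sketch in plain Python (not OpenCL C, and not any real API). The name `workgroup_sum` and the `local_size` parameter are invented for illustration, and `local_size` is assumed to be a power of two. The input is split into work-groups, each group reduces its slice in a group-private scratch buffer standing in for local memory, and only one partial result per group goes back to the "global" buffer:

```python
def workgroup_sum(data, local_size=4):
    """Two-level sum that mimics a work-group reduction (local_size: power of two)."""
    partials = []  # stands in for a global-memory buffer of per-group results
    for group_start in range(0, len(data), local_size):
        # "Local memory": scratch that only this work-group can see.
        local = list(data[group_start:group_start + local_size])
        # Pad the last group with zeros so the tree reduction sees a full group.
        local += [0] * (local_size - len(local))
        stride = local_size // 2
        while stride > 0:
            # On a real device each lid would be a separate work-item,
            # with a barrier between one stride and the next.
            for lid in range(stride):
                local[lid] += local[lid + stride]
            stride //= 2
        partials.append(local[0])  # one write back per group
    return sum(partials), partials
```

The point of the structure is that most of the traffic stays inside the group-local buffer, which is the part of the hierarchy that flat threading APIs give you no way to express.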

                Originally posted by pong
                Then layers of scheduling stuff deal with thread / process scheduling and even job scheduling across distributed systems, message passing, MPI, etc. etc.
                Not well. The threading APIs we have are ancient, and OS kernels are organized around those. This area has been slow to adapt, because you need simultaneous changes in both threading APIs and at the OS level. I think Apple has done some interesting things, here.
