Intel Developing "oneAPI" For Optimized Code Across CPUs, GPUs, FPGAs & More


  • Intel Developing "oneAPI" For Optimized Code Across CPUs, GPUs, FPGAs & More

    Phoronix: Intel Developing "oneAPI" For Optimized Code Across CPUs, GPUs, FPGAs & More

    Intel's 2018 Architecture Day was primarily focused on the company's hardware architecture road-map, but one of the software (pre)announcements was their oneAPI software stack...


  • #2
    Well, AMD showed some benchmarks a while ago comparing a workload on HSA vs. OpenCL, with a performance advantage for their HSA stack. Hence I'm not sold on the idea that OpenCL or SYCL is the best way forward for achieving maximum performance. Sure, three competing ecosystems with limited code portability wouldn't be great either. If oneAPI is going to be open sourced, and considering that Intel hired the former ROCm lead engineer, maybe there is a way ROCm and oneAPI could converge into a single open standard used by both AMD and Intel?



    • #3
      Having written high-performance code for heavily multi-threaded CPUs and GPUs, and having done some compute work on FPGAs, I'm really not convinced that trying to put them all under one API is such a good idea...

      GPUs, being the massively parallel SIMD devices that they are, require a very different programming style than even the most heavily parallel CPUs. Sure, there are APIs like OpenACC that aim to make it easy to port code written for highly parallel CPUs, and to write code that targets both, but in my experience the results those produce tend to be less than impressive.

      One job I had while still at university was to size up OpenACC by doing one of those low-effort ports of some well-written OpenMP code for multi-threaded CPUs (OpenACC being a directive-based GPU offloading model in the same vein as OpenMP) and then comparing it to an almost from-the-ground-up CUDA implementation that I put together. Because CUDA is made for GPUs first and foremost and allows a whole bunch of optimizations that work really well on them, the CUDA implementation ended up close to 4x faster: it finished a particular job in about 2.7 seconds on average, while the OpenACC implementation took a bit over 10 seconds on average to do the same work.

      As for FPGAs, those things are practically alien compared to CPUs and GPUs in terms of what you have to do to get good performance. There's a reason why, despite a lot of effort to get FPGAs into high-performance computing, the only real successes have been in applications that demand good energy efficiency but where ASICs aren't an option, or where protecting the IP is considered of the utmost importance. In the latter case performance and efficiency can be downright miserable, as such designs are often just running C/C++ code on a standard soft core like a Nios (Altera/Intel) or MicroBlaze (Xilinx).
      Last edited by L_A_G; 12 December 2018, 11:24 AM.



      • #4
        Presentations from Nvidia and AMD at least suggest that their goal is better GPU integration in the C++ standard, and in my limited understanding their current vendor-specific efforts are some sort of precursor to that. Nvidia's C++ guru Olivier Giroux said in an interview that Turing would also be great at generic C++ code (with C++17); I guess future GPUs and C++ standards will only get better at that.

        EA's experimental engine (SEED) uses Intel's ISPC on the CPU side, which also employs an SPMD model to better extract parallelism and vectorization from modern CPUs.

        At least HSA also targets FPGAs, and there was some recent work to make that more efficient: https://github.com/HSA-on-FPGA/HSA-on-FPGA (based on: https://link.springer.com/article/10...265-018-1382-7).



        • #5
          threeAPIs



          • #6
            Yeah, hopefully they use OpenCL or Vulkan Compute instead of creating yet another API. I read the article and I think "OpenCL already does this... and is cross-platform, cross-vendor, etc.".

            ms178 - It doesn't matter which API is used; ultimately they all run on the same GPU hardware and can therefore all achieve the same level of performance. The API design does influence performance to some extent, but that can always be improved. The only real differences between OpenCL and CUDA are the APIs they expose to developers, how easy those APIs are to use, and the tools provided to work with them.



            • #7
              Originally posted by sandy8925 View Post
              Yeah, hopefully they use OpenCL or Vulkan Compute instead of creating yet another API. I read the article and I think "OpenCL already does this... and is cross-platform, cross-vendor, etc.".

              ms178 - It doesn't matter which API is used; ultimately they all run on the same GPU hardware and can therefore all achieve the same level of performance. The API design does influence performance to some extent, but that can always be improved. The only real differences between OpenCL and CUDA are the APIs they expose to developers, how easy those APIs are to use, and the tools provided to work with them.
              The discussion over here points out some differences in scope of these compute APIs and their own set of limitations: https://forums.khronos.org/showthrea...Vulkan-Compute

              And as Michael pointed out to me recently in another discussion, the convergence of OpenCL and Vulkan was put on hold. These APIs also sit at a higher level, which limits the chance to extract maximum performance from each device (consider that the API must accommodate both FPGAs and GPUs, which are very different in nature). Programmers have also complained about the OpenCL memory model and other design choices, which contributed to its slow adoption; OpenCL 2.0+ is still not widely established, and SYCL builds on top of all of this. I still don't see a real end-user benefit on this path yet.

              I'd rather see Intel joining the HSA effort and extending and improving the ROCm/HSA stack. I am also keen to know whether Intel will engage with other industry standards bodies like CCIX, OpenCAPI or Gen-Z, or cook up their own proprietary solutions. The latter is most likely, but maybe the former AMD staff can convince them that collaboration could be beneficial for them too.



              • #8
                I blame Khronos backing down from their plan to make Vulkan the one API needed (it has been proven fit to run general-purpose languages such as Rust, by the way) for all this nonsense. It is getting pretty bad, but I think we can still end it: we have to demand that they make the necessary changes to Vulkan (e.g. make rasterization optional) so it can run everywhere and be used as their sole API.



                • #9
                  By the way, does anyone know how Vulkan on the web is doing? Last time I checked, Google was contributing to Apple's gl and gave invalid reasons for doing so. E.g., the claim that Vulkan can't practically run on all platforms has long since been proven wrong; as for it being too complex, language-specific abstractions such as vulkano can make it easier to use while maintaining speed, and with easier metaprogramming in upcoming C++ this will be practical in that language too, as it already is in many others.
                  Last edited by GunpowaderGuy; 13 December 2018, 11:22 AM.



                  • #10
                    Originally posted by L_A_G View Post
                    Sure, there are APIs like OpenACC that aim to make it very easy to port code written for highly parallel CPUs and to write code for both, but in my experience the results produced by those tend to be less than impressive.
                    I always felt OpenMP and OpenACC were just there to deliver easy wins for novice/intermediate programmers, or for cases where it just isn't worth the time and effort to re-architect something for better GPU performance.

                    IMO, the main win of CUDA and OpenCL is that they expose enough of the hardware's performance bottlenecks and strengths that you can architect your code around them. As such, if Intel's oneAPI is any good, I'd expect it to look structurally similar.

