
Intel Posts Big Set of Patches For AVX-512 FP16 Compiler Support For Sapphire Rapids


  • Intel Posts Big Set of Patches For AVX-512 FP16 Compiler Support For Sapphire Rapids

    Phoronix: Intel Posts Big Set of Patches For AVX-512 FP16 Compiler Support For Sapphire Rapids

    Besides Sapphire Rapids introducing Advanced Matrix Extensions (AMX), new developer documentation has detailed AFX-512 FP16 capabilities coming with the next-generation Xeon processors. Intel has posted initial developer documentation around AVX512FP16 as well as a big set of GCC and LLVM Clang compiler patches for handling the new intrinsics...

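    For a taste of what these patches enable, here is a minimal sketch using the new _ph intrinsics (names taken from Intel's AVX512FP16 documentation; the -mavx512fp16 target flag is what the GCC/Clang patches wire up):

        #include <immintrin.h>

        /* Adds two buffers of IEEE half-precision values, 32 lanes per
         * 512-bit operation. Assumes n is a multiple of 32.
         * Build with: gcc -O2 -mavx512fp16 add_ph.c */
        void add_fp16(const _Float16 *a, const _Float16 *b, _Float16 *out, int n)
        {
            for (int i = 0; i < n; i += 32) {
                __m512h va = _mm512_loadu_ph(a + i);
                __m512h vb = _mm512_loadu_ph(b + i);
                _mm512_storeu_ph(out + i, _mm512_add_ph(va, vb));
            }
        }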

  • #2
    Michael, typo: "AFX" instead of "AVX".



    • #3
      This AVX-512 half-precision floating point support should help for training and inference with deep learning models where FP32 isn't needed, among other use-cases.
      Not when they already have BF16! That's much better-suited to deep learning, which was the whole point of it.

      FP16 is more about graphics and other cases where you want a better balance between precision and range. TBH, I'm surprised they even went back and added FP16, after they already had BF16. Not many people seem to care much about FP16, any more.
      Last edited by coder; 01 July 2021, 02:34 PM.
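      For context on the layouts behind that trade-off (standard formats, nothing article-specific): IEEE FP16 is 1 sign / 5 exponent / 10 mantissa bits, while BF16 is 1 sign / 8 exponent / 7 mantissa, i.e. FP32's full range with coarser precision. That layout is also why BF16 conversion is nearly free; a truncating sketch:

          #include <stdint.h>
          #include <string.h>

          /* BF16 is just the top half of an FP32 bit pattern; this truncates
           * (round-to-nearest-even omitted for brevity). */
          static uint16_t float_to_bf16(float f)
          {
              uint32_t bits;
              memcpy(&bits, &f, sizeof bits);  /* bit-reinterpret without UB */
              return (uint16_t)(bits >> 16);   /* sign + exponent + top 7 mantissa bits */
          }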



      • #4
        I don't mind them adding new stuff, but this is just another "does it or doesn't it support it?" question for AVX-512... I've now got access to several boxes which all support AVX-512... and all have different levels of support. It is really annoying trying to explain to some users why system X won't do something while system Y will... when they're both "AVX-512".

        That said, the AVX-512 implementation in that (Comet Lake?) Core i3 was funny too - "I want to do XYZ on this little i3 because AVX-512!" "OK, come back to me when you've found a way of getting that little i3 to support a terabyte of RAM."
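        The fragmentation is easy to see at runtime; a quick check with GCC/Clang's __builtin_cpu_supports (using the feature names GCC accepts) prints a different mix on each of those boxes:

            #include <stdio.h>

            /* Each AVX-512 subset has its own CPUID bit, so two "AVX-512"
             * CPUs can disagree on half of these. */
            int main(void)
            {
                printf("avx512f:    %d\n", __builtin_cpu_supports("avx512f"));
                printf("avx512vl:   %d\n", __builtin_cpu_supports("avx512vl"));
                printf("avx512bw:   %d\n", __builtin_cpu_supports("avx512bw"));
                printf("avx512dq:   %d\n", __builtin_cpu_supports("avx512dq"));
                printf("avx512vnni: %d\n", __builtin_cpu_supports("avx512vnni"));
                return 0;
            }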



        • #5
          Originally posted by Paradigm Shifter View Post
          the AVX-512 implementation in that (Comet Lake?) Core i3 was funny too
          Comet Lake was still built on the Skylake microarchitecture, and therefore didn't support AVX-512. I assume you're thinking of Rocket Lake (the Gen11 desktop CPUs).

          My team had to disable AVX-512 in a deep learning workload we needed to run on a CPU for... reasons. It was a Skylake server CPU, and we got up to a 50% improvement (IIRC) by disabling it, since it killed our clock speeds so badly. So, even if a CPU supports the instructions you want to use, it still doesn't mean you necessarily want to use them. Ice Lake SP should be better, in that respect.
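          (For anyone hitting the same downclocking: a gentler option than disabling AVX-512 outright is GCC/Clang's -mprefer-vector-width=256 flag, which keeps the new instructions available but avoids the heavily throttled 512-bit ops. A sketch, assuming the hot loop is auto-vectorized:)

              /* Build with: gcc -O3 -march=skylake-avx512 -mprefer-vector-width=256
               * The loop still vectorizes, but with 256-bit ymm ops instead of
               * zmm, staying out of the worst AVX-512 frequency license. */
              void scale(float *restrict y, const float *restrict x, float a, int n)
              {
                  for (int i = 0; i < n; i++)
                      y[i] = a * x[i];
              }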



          • #6
            Originally posted by coder View Post
            Comet Lake was still built on the Skylake microarchitecture, and therefore didn't support AVX-512. I assume you're thinking of Rocket Lake (the Gen11 desktop CPUs).
            There is no i3 based on RL.



            • #7
              Originally posted by drakonas777 View Post
              There is no i3 based on RL.
              Well, the Comet Lake i3's don't support AVX-512. So, was this a laptop or maybe a NUC? Because then you're talking about probably Ice Lake or Tiger Lake.



              • #8
                Originally posted by coder View Post
                Comet Lake was still built on the Skylake microarchitecture, and therefore didn't support AVX-512. I assume you're thinking of Rocket Lake (the Gen11 desktop CPUs).

                My team had to disable AVX-512 in a deep learning workload we needed to run on a CPU for... reasons. It was a Skylake server CPU, and we got up to a 50% improvement (IIRC) by disabling it, since it killed our clock speeds so badly. So, even if a CPU supports the instructions you want to use, it still doesn't mean you necessarily want to use them. Ice Lake SP should be better, in that respect.
                Found it! (And it was in a NUC).

                Cannon Lake. 10nm, dual core, 32GB max RAM (although maybe 64GB would work? My (different) NUC says max 32GB but people report 64GB working OK... not tried it myself yet, though) and... AVX-512.

                Why, Intel? Why? On a dual core? Why?

                For the software I use, enabling the AVX-512 instructions (and using the Intel compiler, MPI, TBB, etc) I see anything from a 20x to 120x speedup (yes, it varies that wildly, but the larger the dataset the bigger the improvement, at least until I run out of memory) over the normally compiled code. It trades blows with GPU performance, while using a lot less power. But the GPU code is older and much better optimised, so... it's often a case of try it and see which is better. Getting the correct balance of MPI processes to threads is trickier with the AVX-512 code, as well.



                • #9
                  Originally posted by Paradigm Shifter View Post
                  Found it! (And it was in a NUC).

                  Cannon Lake. 10nm, dual core, 32GB max RAM (although maybe 64GB would work? My (different) NUC says max 32GB but people report 64GB working OK... not tried it myself yet, though) and... AVX-512.
                  Wow, that's rare. Cannon Lake is almost nonexistent. It was made on Intel's original 10 nm process that sucked in almost every way possible (yield, power, frequency, etc.).

                  Originally posted by Paradigm Shifter View Post
                  Why, Intel? Why? On a dual core? Why?
                  It's not the number of cores, but rather their generation. After the Skylake variants, all their mainstream cores have it.

                  Originally posted by Paradigm Shifter View Post
                  For the software I use, enabling the AVX-512 instructions (and using the Intel compiler, MPI, TBB, etc) I see anything from a 20x to 120x speedup
                  Well, that sounds like it's just enabling an optimized/vectorized code path vs. using a scalar/unoptimized one. There's no way you should get more than a few x speedup over competently-written AVX2: 512-bit registers only double AVX2's lane count (16 floats vs. 8), so vector width alone caps the gain around 2x.

                  Originally posted by Paradigm Shifter View Post
                  It trades blows with GPU performance, while using a lot less power. But the GPU code is older and much better optimised, so... it's often a case of try it and see which is better. Getting the correct balance of MPI processes to threads is trickier with the AVX-512 code, as well.
                  Generally speaking, GPUs have a lot more memory bandwidth and compute. However, maybe some of the algorithm can't run on the GPU (or doesn't use it efficiently), so that's what's holding it back. It's hard to say without getting into specifics.

