
Intel Posts Big Set of Patches For AVX-512 FP16 Compiler Support For Sapphire Rapids


  • Intel Posts Big Set of Patches For AVX-512 FP16 Compiler Support For Sapphire Rapids

    Phoronix: Intel Posts Big Set of Patches For AVX-512 FP16 Compiler Support For Sapphire Rapids

    Besides Sapphire Rapids introducing Advanced Matrix Extensions (AMX), new developer documentation has detailed AFX-512 FP16 capabilities coming with the next-generation Xeon processors. Intel has posted initial developer documentation around AVX512FP16 as well as a big set of GCC and LLVM Clang compiler patches for handling the new intrinsics...

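    For a taste of what these patches enable, here is a minimal sketch using the new _ph intrinsics (names taken from Intel's AVX512FP16 documentation; the -mavx512fp16 target flag is what the GCC/Clang patches wire up):

        #include <immintrin.h>

        /* Adds two buffers of IEEE half-precision values, 32 lanes per
         * 512-bit operation. Assumes n is a multiple of 32.
         * Build with: gcc -O2 -mavx512fp16 add_ph.c */
        void add_fp16(const _Float16 *a, const _Float16 *b, _Float16 *out, int n)
        {
            for (int i = 0; i < n; i += 32) {
                __m512h va = _mm512_loadu_ph(a + i);
                __m512h vb = _mm512_loadu_ph(b + i);
                _mm512_storeu_ph(out + i, _mm512_add_ph(va, vb));
            }
        }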

  • #2
    Michael, typo: "AFX" instead of "AVX".



    • #3
      This AVX-512 half-precision floating point support should help for training and inference with deep learning models where FP32 isn't needed, among other use-cases.
      Not when they already have BF16! That's much better-suited to deep learning, which was the whole point of it.

      FP16 is more about graphics and other cases where you want a better balance between precision and range. TBH, I'm surprised they even went back and added FP16, after they already had BF16. Not many people seem to care much about FP16, any more.
      Last edited by coder; 01 July 2021, 02:34 PM.
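      For context on the layouts behind that trade-off (standard formats, nothing article-specific): IEEE FP16 is 1 sign / 5 exponent / 10 mantissa bits, while BF16 is 1 sign / 8 exponent / 7 mantissa, i.e. FP32's full range with coarser precision. That layout is also why BF16 conversion is nearly free; a truncating sketch:

          #include <stdint.h>
          #include <string.h>

          /* BF16 is just the top half of an FP32 bit pattern; this truncates
           * (round-to-nearest-even omitted for brevity). */
          static uint16_t float_to_bf16(float f)
          {
              uint32_t bits;
              memcpy(&bits, &f, sizeof bits);  /* bit-reinterpret without UB */
              return (uint16_t)(bits >> 16);   /* sign + exponent + top 7 mantissa bits */
          }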



      • #4
        I don't mind them adding new stuff, but this is just another "does it or doesn't it support it?" question for AVX-512... I've now got access to several boxes which all support AVX-512... and all have different levels of support. It is really annoying trying to explain to some users why system X won't do something while system Y will... when they're both "AVX-512".

        That said, the AVX-512 implementation in that (Comet Lake?) Core i3 was funny too - "I want to do XYZ on this little i3 because AVX-512!" "OK, come back to me when you've found a way of getting that little i3 to support a terabyte of RAM."
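        The fragmentation is easy to see at runtime; a quick check with GCC/Clang's __builtin_cpu_supports (using the feature names GCC accepts) prints a different mix on each of those boxes:

            #include <stdio.h>

            /* Each AVX-512 subset has its own CPUID bit, so two "AVX-512"
             * CPUs can disagree on half of these. */
            int main(void)
            {
                printf("avx512f:    %d\n", __builtin_cpu_supports("avx512f"));
                printf("avx512vl:   %d\n", __builtin_cpu_supports("avx512vl"));
                printf("avx512bw:   %d\n", __builtin_cpu_supports("avx512bw"));
                printf("avx512dq:   %d\n", __builtin_cpu_supports("avx512dq"));
                printf("avx512vnni: %d\n", __builtin_cpu_supports("avx512vnni"));
                return 0;
            }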



        • #5
          Originally posted by Paradigm Shifter View Post
          the AVX-512 implementation in that (Comet Lake?) Core i3 was funny too
          Comet Lake was still built on the Skylake microarchitecture, and therefore didn't support AVX-512. I assume you're thinking of Rocket Lake (the Gen11 desktop CPUs).

          My team had to disable AVX-512 in a deep learning workload we needed to run on a CPU for... reasons. It was a Skylake server CPU, and we got up to a 50% improvement (IIRC) by disabling it, since it killed our clock speeds so badly. So, even if a CPU supports the instructions you want to use, it still doesn't mean you necessarily want to use them. Ice Lake SP should be better, in that respect.
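          (For anyone hitting the same downclocking: a gentler option than disabling AVX-512 outright is GCC/Clang's -mprefer-vector-width=256 flag, which keeps the new instructions available but avoids the heavily throttled 512-bit ops. A sketch, assuming the hot loop is auto-vectorized:)

              /* Build with: gcc -O3 -march=skylake-avx512 -mprefer-vector-width=256
               * The loop still vectorizes, but with 256-bit ymm ops instead of
               * zmm, staying out of the worst AVX-512 frequency license. */
              void scale(float *restrict y, const float *restrict x, float a, int n)
              {
                  for (int i = 0; i < n; i++)
                      y[i] = a * x[i];
              }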



          • #6
            Originally posted by coder View Post
            Comet Lake was still built on the Skylake microarchitecture, and therefore didn't support AVX-512. I assume you're thinking of Rocket Lake (the Gen11 desktop CPUs).
            There is no i3 based on RL.



            • #7
              Originally posted by drakonas777 View Post
              There is no i3 based on RL.
              Well, the Comet Lake i3's don't support AVX-512. So, was this a laptop or maybe a NUC? Because then you're talking about probably Ice Lake or Tiger Lake.



              • #8
                Originally posted by coder View Post
                Comet Lake was still built on the Skylake microarchitecture, and therefore didn't support AVX-512. I assume you're thinking of Rocket Lake (the Gen11 desktop CPUs).

                My team had to disable AVX-512 in a deep learning workload we needed to run on a CPU for... reasons. It was a Skylake server CPU, and we got up to a 50% improvement (IIRC) by disabling it, since it killed our clock speeds so badly. So, even if a CPU supports the instructions you want to use, it still doesn't mean you necessarily want to use them. Ice Lake SP should be better, in that respect.
                Found it! (And it was in a NUC).

                Cannon Lake. 10nm, dual core, 32GB max RAM (although maybe 64GB would work? My (different) NUC says max 32GB but people report 64GB working OK... not tried it myself yet, though) and... AVX-512.

                Why, Intel? Why? On a dual core? Why?

                For the software I use, enabling the AVX-512 instructions (and using the Intel compiler, MPI, TBB, etc) I see anything from a 20x to 120x speedup (yes, it varies that wildly, but the larger the dataset the bigger the improvement, at least until I run out of memory) over the normally compiled code. It trades blows with GPU performance, while using a lot less power. But the GPU code is older and much better optimised, so... it's often a case of try it and see which is better. Getting the correct balance of MPI processes to threads is trickier with the AVX-512 code, as well.



                • #9
                  Originally posted by Paradigm Shifter View Post
                  Found it! (And it was in a NUC).

                  Cannon Lake. 10nm, dual core, 32GB max RAM (although maybe 64GB would work? My (different) NUC says max 32GB but people report 64GB working OK... not tried it myself yet, though) and... AVX-512.
                  Wow, that's rare. Cannon Lake is almost nonexistent. It was made on Intel's original 10 nm process that sucked in almost every way possible (yield, power, frequency, etc.).

                  Originally posted by Paradigm Shifter View Post
                  Why, Intel? Why? On a dual core? Why?
                  It's not the number of cores, but rather their generation. After the Skylake variants, all their mainstream cores have it.

                  Originally posted by Paradigm Shifter View Post
                  For the software I use, enabling the AVX-512 instructions (and using the Intel compiler, MPI, TBB, etc) I see anything from a 20x to 120x speedup
                  Well, that sounds like it's just enabling an optimized/vectorized code path vs. using a scalar/unoptimized one. There's no way you should get more than a few x speedup over competently-written AVX2: 512-bit registers only double AVX2's lane count (16 floats vs. 8), so vector width alone caps the gain around 2x.

                  Originally posted by Paradigm Shifter View Post
                  It trades blows with GPU performance, while using a lot less power. But the GPU code is older and much better optimised, so... it's often a case of try it and see which is better. Getting the correct balance of MPI processes to threads is trickier with the AVX-512 code, as well.
                  Generally speaking, GPUs have a lot more memory bandwidth and compute. However, maybe some of the algorithm can't run on the GPU (or doesn't use it efficiently), so that's what's holding it back. It's hard to say without getting into specifics.

