AVX / AVX2 / AVX-512 Performance + Power On Intel Rocket Lake

  • #21
    Originally posted by torsionbar28 View Post
    This is the oft-cited answer for what AVX-512 is useful for, yet nobody has any examples of this usage benefiting from AVX-512. Do you have any real-world examples? Actual software products that use it, and specific workloads where there is a demonstrated benefit? These seem to always be missing from these AVX-512 discussions. IMO the benefit of AVX-512 seems far more theoretical than practical at this point.


    What is the intended way? Can you quantify the benefits? Where can I go to see this benefit demonstrated?

    Edit: AVX-512 feels like the CPU instruction equivalent of an herbal supplement, with promises of increased vitality, improved clarity, and stronger constitution. Not FDA approved. Not intended to treat or cure any disease. Consult your doctor before taking. Results not guaranteed. Advertisement contains paid actors. Batteries not included. Void where prohibited. Not for sale in ME, TX, CA, NY, or NJ.
    Doesn't CPUminer show you exactly that? And keep in mind NCNN also saw some improvements. Also keep in mind that desktop chips have only one AVX-512 unit per core; server chips mostly have two.

    With AVX2 you mostly have two 256-bit units in desktop chips, so you often end up in a situation where two AVX2 instructions complete in one cycle versus one AVX-512 instruction. A gain mostly happens only when two 256-bit instructions can't issue at the same time, or when AVX-512 can reduce the complexity of the algorithm.

    The biggest problem of AVX-512 is that you need a hand-crafted program for it, and preferably a processor with at least two AVX-512 units.
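    The port math in this post can be sketched with back-of-the-envelope arithmetic. This is a minimal illustration of the claim only, assuming the typical unit counts the post gives (two 256-bit units on desktop cores, one 512-bit unit), not a statement about any specific SKU:

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Two 256-bit SIMD units retiring two AVX2 instructions per cycle
     * match the per-cycle element throughput of one 512-bit unit
     * retiring a single AVX-512 instruction. */
    int main(void) {
        int floats_per_avx2_op   = 256 / 32;  /* 8 single-precision lanes  */
        int floats_per_avx512_op = 512 / 32;  /* 16 single-precision lanes */

        int avx2_units   = 2; /* typical desktop core (assumption from the post) */
        int avx512_units = 1; /* typical desktop core; many server cores have 2 */

        int avx2_throughput   = avx2_units * floats_per_avx2_op;     /* 16 floats/cycle */
        int avx512_throughput = avx512_units * floats_per_avx512_op; /* 16 floats/cycle */

        /* Equal peak throughput: AVX-512 only wins when the second
         * 256-bit unit can't be kept busy, or when the algorithm
         * itself gets simpler with 512-bit operations. */
        assert(avx2_throughput == avx512_throughput);
        printf("AVX2: %d floats/cycle, AVX-512: %d floats/cycle\n",
               avx2_throughput, avx512_throughput);
        return 0;
    }
    ```

    With two AVX-512 units per core, as on many server parts, the 512-bit side of this comparison doubles, which is why the post says at least two units is preferable.
    
    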
    Last edited by piotrj3; 07 April 2021, 05:48 PM.

    Comment


    • #22
      Originally posted by GPSnoopy View Post

      Intel AVX is not faster. It's really not. Intel MKL by default does not use AVX on AMD CPUs, it falls back to something like SSE2 or SSE4. Quite slow.

      You need to binary patch Intel MKL binaries to be able to bypass this rather arbitrary limitation. See:
      - https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html
      - https://www.extremetech.com/computin...eadripper-cpus
      - http://www.swallowtail.org/naughty-intel.shtml
    yeah 'cos on Haswell I had exactly the same code running in under a second. On a 4-core. Identical code. So now I have to figure out how to do this patching for my numpy version.
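    For reference, the first of the links above describes a non-patching workaround that predates the binary-patch approach: the undocumented `MKL_DEBUG_CPU_TYPE=5` environment variable, which (up to roughly MKL 2020.0, after which Intel removed it) forced the AVX2 code paths on AMD Zen CPUs. A sketch of setting it from C, before MKL gets a chance to initialize; the variable name is from the linked posts, and whether it still works depends on the MKL version numpy was built against:

    ```c
    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        /* Must be set before MKL initializes, so do it at the very top
         * of main() -- or in the shell: export MKL_DEBUG_CPU_TYPE=5.
         * On MKL versions that removed this variable, binary patching
         * (per the links above) is the remaining option. */
        setenv("MKL_DEBUG_CPU_TYPE", "5", 1 /* overwrite */);

        /* ... load numpy / call MKL routines here ... */

        assert(strcmp(getenv("MKL_DEBUG_CPU_TYPE"), "5") == 0);
        return 0;
    }
    ```
    
    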
      Last edited by vegabook; 07 April 2021, 06:49 PM.

      Comment


      • #23
        Originally posted by piotrj3 View Post
        The biggest problem of AVX-512 is that you need a hand-crafted program for it, and preferably a processor with at least two AVX-512 units.
        So it's completely useless for 99.999% of today's workloads, client or server, and the few pieces of software that theoretically could use it, haven't been written yet. Got it.

        This reminds me of another Intel innovation: Itanium. Crap performance, poor software ecosystem, if only we had better compilers and better software to take advantage of it, then it would be super duper awesome!!111 Mmmkay.
        Last edited by torsionbar28; 07 April 2021, 10:47 PM.

        Comment


        • #24
          While it does drive up power consumption and in some cases can be detrimental to the performance due to the clock speed differences when engaging AVX-512
          Michael, has this been observed specifically on Rocket Lake?

          I have seen it on older 14 nm CPUs, for sure, but I don't see any of your benchmarks where AVX-512 actually hurts raw performance (only perf/W).

          Comment


          • #25
            Originally posted by mle86pho View Post
            At least for dav1d, I'm suspicious whether the benchmark measured something meaningful.

            If you look into the source code, you'll notice that there's a ton of handwritten assembler code (including AVX-512):
            dav1d is the fastest AV1 decoder on all platforms :) Targeted to be small, portable and very fast.

            And there is code to directly decode the CPUID bits to determine the available vector instructions. I guess setting the usual -march/... compiler flags is pointless; the hand-written paths are used anyway.
            Yeah, they usually have a build-time and/or runtime option to override utilization of certain CPU features. So, PTS should really tie into that, for the results to be meaningful (although we don't necessarily know if the hand-coded stuff tries to use only 256-bit or goes for the full 512).
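    The CPUID-based dispatch being described here can be sketched in a few lines. This is not dav1d's actual code, just an illustration of the same pattern using the GCC/Clang builtins that wrap CPUID, which is why -march at build time doesn't constrain which path runs:

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Pick a SIMD code path at runtime, independently of what -march
     * the binary was compiled with.  __builtin_cpu_supports() is a
     * GCC/Clang x86 builtin that reads the CPUID feature bits. */
    static const char *pick_simd_path(void) {
    #if defined(__x86_64__) || defined(__i386__)
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx512f")) return "avx512";
        if (__builtin_cpu_supports("avx2"))    return "avx2";
        if (__builtin_cpu_supports("sse4.1"))  return "sse41";
    #endif
        return "scalar"; /* portable fallback */
    }

    int main(void) {
        const char *path = pick_simd_path();
        printf("dispatching to: %s\n", path);
        assert(path != NULL);
        return 0;
    }
    ```

    A project doing this usually also exposes a runtime override (dav1d has a `--cpumask` style option) so a benchmark harness can pin the path, which is the hook PTS would need to use for the results to be meaningful.
    
    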

            Comment


            • #26
              Originally posted by ddriver View Post
              I don't think it is that. The thing is, so far SIMD units have been fairly general-purpose. Intel is cramming a bunch of highly purpose-specific stuff into AVX-512. This is not a problem with the width of execution or power efficiency, but with the support hell of continually introducing new niche-use instructions and having no instruction-set and feature uniformity between platforms.
              No, he's definitely concerned with the amount of die space it's using up and the potential for clock-throttling due to "some AVX512 power virus that takes away top frequency (because people ended up using it for memcpy!)"

              Here's his full statement: https://www.phoronix.com/scan.php?pa...lds-On-AVX-512

              Originally posted by ddriver View Post
              Intel appears to have given up on improving general purpose performance
              No, they made plenty of general-purpose improvements in both Sunny Cove and Willow Cove. We probably don't see it so much on Rocket Lake, because it's a backport to 14 nm. Anandtech confirmed IPC increases in both Ice Lake (laptop) and Tiger Lake, though the latter still has slightly lower IPC than Zen3.

              Comment


              • #27
                Originally posted by carewolf View Post
                Though the -march flags aren't ignored, they are overridden for the files with the special code. You can't use the intrinsics without the right arch flags.
                Depends on whether it's full-blown assembler or just intrinsics (or inline asm). For some reason, a lot of these multimedia codecs still like to use bare-metal assembly language, in which case it doesn't matter what -march you tell your C compiler.

                I got sick of doing register allocation like 2 decades ago, however. Intrinsics are fine for me. I sometimes even check the compiler output (with intrinsics) and usually find it's as good or better than the asm I'd write by hand.
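    A tiny example of the intrinsics style being advocated here, with the scalar reference it can be checked against; this is an illustrative sketch, not code from any of the projects being discussed. The compiler does the register allocation, and its output for something like this is typically as good as hand-written asm:

    ```c
    #include <assert.h>
    #include <stdio.h>

    #if defined(__SSE__)
    #include <xmmintrin.h>
    #endif

    /* Add two 4-float vectors: one 128-bit SSE add when available,
     * scalar fallback otherwise.  No manual register allocation. */
    static void add4(const float *a, const float *b, float *out) {
    #if defined(__SSE__)
        _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    #else
        for (int i = 0; i < 4; i++) out[i] = a[i] + b[i];
    #endif
    }

    int main(void) {
        float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, out[4];
        add4(a, b, out);
        for (int i = 0; i < 4; i++)        /* check against scalar reference */
            assert(out[i] == a[i] + b[i]);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }
    ```

    Checking the compiler's output for a function like this (e.g. with `objdump -d` or on godbolt) is exactly the "inspect the intrinsics codegen" workflow described above.
    
    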

                Comment


                • #28
                  Originally posted by Etherman View Post
                  I wonder how much silicon avx 512 uses. Is it comparable to an extra core or two with avx 2?
                  RealWorldTech's David Kanter looked at this and estimated that AVX-512 adds a little over 11% to the base Skylake core (but only 5% of the entire tile area). He estimates that removing it would only free enough area on a 28-core Skylake SP die to add just 2 more tiles.


                  However, more instructions have been added since then, so I expect the overhead has gone up from that, somewhat. Still, I think die size is one of the lesser issues with it.
                  Last edited by coder; 08 April 2021, 08:49 AM.

                  Comment


                  • #29
                    Originally posted by coder View Post
                    Depends on whether it's full-blown assembler or just intrinsics (or inline asm). For some reason, a lot of these multimedia codecs still like to use bare-metal assembly language, in which case it doesn't matter what -march you tell your C compiler.

                    I got sick of doing register allocation like 2 decades ago, however. Intrinsics are fine for me. I sometimes even check the compiler output (with intrinsics) and usually find it's as good or better than the asm I'd write by hand.
                    You can also do inline assembler with clang/gcc. That way you avoid having to worry about registers. I find intrinsics better for SSE/AVX, though. For NEON, however, inline assembler is sometimes better, because NEON has some weird instructions that operate on groups of 2-4 adjacent registers, and the intrinsics have a tendency not to end up in an optimal form in machine code (you can end up with a move for each register before the instruction and a move for each after, instead of the allocated registers just being fixed to adjacent ones).
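    For readers unfamiliar with the NEON instructions being referred to: the `vld2`/`vst2` family loads into a *pair* of adjacent registers (surfaced in intrinsics as `float32x4x2_t`), which is where the extra moves can appear. A hedged sketch of a de-interleave using those intrinsics, with a scalar fallback so it compiles off ARM too:

    ```c
    #include <assert.h>
    #include <stdio.h>

    #if defined(__ARM_NEON)
    #include <arm_neon.h>
    #endif

    /* Split interleaved {re, im, re, im, ...} data into separate
     * re[] and im[] arrays (4 complex values). */
    static void deinterleave4(const float *in, float *re, float *im) {
    #if defined(__ARM_NEON)
        /* One vld2q_f32 fills two adjacent q registers, de-interleaving
         * on the fly -- the "groups of registers" case from the post. */
        float32x4x2_t v = vld2q_f32(in);
        vst1q_f32(re, v.val[0]);
        vst1q_f32(im, v.val[1]);
    #else
        for (int i = 0; i < 4; i++) { /* scalar fallback for non-ARM builds */
            re[i] = in[2 * i];
            im[i] = in[2 * i + 1];
        }
    #endif
    }

    int main(void) {
        float in[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float re[4], im[4];
        deinterleave4(in, re, im);
        assert(re[0] == 1 && im[0] == 2 && re[3] == 7 && im[3] == 8);
        printf("re[0]=%g im[0]=%g\n", re[0], im[0]);
        return 0;
    }
    ```

    Whether the compiler places `v.val[0]` and `v.val[1]` in adjacent registers without extra moves is exactly the codegen quality question raised above.
    
    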
                    Last edited by carewolf; 08 April 2021, 02:57 AM.

                    Comment


                    • #30
                      Originally posted by lucasbekker View Post
                      AVX-512 is primarily aimed at software that has to perform a LOT of similar mathematical operations on large amounts of data. These kind of programs mostly fall into two categories:
                      You forgot crypto.

                      Originally posted by lucasbekker View Post
                      It is unfortunate that AVX-512 is getting a bad reputation because of these kinds of benchmarks, because if you are making use of AVX-512 in the intended way, the performance benefits can be HUGE.
                      AVX-512 can be a disaster for performance! Note what Michael said about compilers now defaulting the vector width to 256 bits, to try to limit clock-throttling.
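    The compiler default being referred to is reproducible by hand with the GCC/Clang flag `-mprefer-vector-width=256`. A minimal sketch: the C source below is ordinary auto-vectorizable code and is unchanged either way; only the flags decide whether the compiler emits 256-bit ymm or 512-bit zmm instructions (and thus whether the heavier AVX-512 clock offsets can be triggered):

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Build with e.g.:
     *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=256 sum.c
     * to get 256-bit auto-vectorization on an AVX-512 target; drop the
     * -mprefer-vector-width flag to let the compiler use zmm. */
    static float sum(const float *x, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++) /* auto-vectorized loop */
            s += x[i];
        return s;
    }

    int main(void) {
        float x[16];
        for (int i = 0; i < 16; i++) x[i] = 1.0f;
        assert(sum(x, 16) == 16.0f);
        printf("sum = %g\n", sum(x, 16));
        return 0;
    }
    ```

    The trade is a bit of peak throughput for avoiding the frequency drop described in the Cloudflare link below, which is a win for workloads that only execute wide instructions a small fraction of the time.
    
    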

                      Here's the worst-case scenario, for AVX-512: https://blog.cloudflare.com/on-the-d...uency-scaling/

                      Intel screwed themselves by getting ahead of what the process technology could support. Just like they did with AVX2, except worse. Dropping the CPU from a base clock of 2.1 GHz to 1.4 is just not forgivable! Especially when you're just executing 512-bit instructions for a small % of the time!

                      The only time AVX-512 is a net win (performance-wise, to say nothing of perf/W) is when your workload uses it very heavily. That's why Torvalds was complaining about the possibility of some idiot using it for memcpy().

                      Now, my hope and expectation is that Ice Lake SP (and maybe even Rocket Lake) exercises more care and lower latency around clock-speed adjustments, so that it isn't such a liability. However, I have yet to see good data on whether Intel managed to effectively mitigate the performance pitfalls of moderate AVX-512 usage in Ice Lake (or Rocket Lake).

                      Comment
