AVX / AVX2 / AVX-512 Performance + Power On Intel Rocket Lake


  • carewolf
    replied
    Originally posted by coder View Post
    Depends on whether it's full-blown assembler or just intrinsics (or inline asm). For some reason, a lot of these multimedia codecs still like to use bare-metal assembly language, in which case it doesn't matter what -march you tell your C compiler.

    I got sick of doing register allocation like 2 decades ago, however. Intrinsics are fine for me. I sometimes even check the compiler output (with intrinsics) and usually find it's as good or better than the asm I'd write by hand.
    You can also do inline assembly with clang/gcc; that way you avoid having to worry about registers. I find intrinsics better for SSE/AVX, though. For NEON, however, inline assembly is sometimes better, because NEON has some weird instructions that operate on groups of 2-4 adjacent registers, and the intrinsics have a tendency not to end up in an optimal form in the machine code (you can end up with a move for each register before the instruction and a move for each after, instead of the compiler just fixing the allocated registers to adjacent ones).
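    To make the two styles concrete, here is a minimal sketch, on x86/SSE rather than NEON (since NEON's multi-register instructions don't fit a short example); the function names are made up for illustration. With GCC/Clang extended inline asm, the "x" constraints still let the compiler do the register allocation:

```c
#include <immintrin.h>

/* Intrinsics version: the compiler picks the XMM registers. */
static __m128 add_intrin(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);
}

/* Extended inline asm version: still no manual register choice --
   the "x" constraints tell the compiler to allocate XMM registers
   for us, so only the instruction itself is hand-written. */
static __m128 add_asm(__m128 a, __m128 b) {
    __asm__("addps %1, %0" : "+x"(a) : "x"(b));
    return a;
}
```

    Both compile to a single addps; the difference only starts to matter for instructions the compiler handles poorly through intrinsics.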
    Last edited by carewolf; 08 April 2021, 02:57 AM.



  • coder
    replied
    Originally posted by Etherman View Post
    I wonder how much silicon avx 512 uses. Is it comparable to an extra core or two with avx 2?
    RealWorldTech's David Kanter looked at this and estimated that AVX-512 adds a little over 11% to the base Skylake core (but only 5% of the entire tile area). He estimates that removing it would only free enough area on a 28-core Skylake SP die to add just 2 more tiles.


    However, more instructions have been added since then, so I expect the overhead has gone up from that, somewhat. Still, I think die size is one of the lesser issues with it.
    Last edited by coder; 08 April 2021, 08:49 AM.



  • coder
    replied
    Originally posted by carewolf View Post
    Though the -march flags aren't ignored, they are overridden for the files with the special code. You can't use the intrinsics without the right arch flags.
    Depends on whether it's full-blown assembler or just intrinsics (or inline asm). For some reason, a lot of these multimedia codecs still like to use bare-metal assembly language, in which case it doesn't matter what -march you tell your C compiler.

    I got sick of doing register allocation like 2 decades ago, however. Intrinsics are fine for me. I sometimes even check the compiler output (with intrinsics) and usually find it's as good or better than the asm I'd write by hand.



  • coder
    replied
    Originally posted by ddriver View Post
    I don't think it is that. The thing is, so far SIMD units have been fairly general-purpose. Intel is cramming a bunch of highly purpose-specific stuff into AVX-512. This is not a problem with the width of execution or power efficiency, but with the support hell of continually introducing new niche-use instructions while having no instruction-set and feature uniformity between platforms.
    No, he's definitely concerned with the amount of die space it's using up and the potential for clock-throttling due to "some AVX512 power virus that takes away top frequency (because people ended up using it for memcpy!)"

    Here's his full statement: https://www.phoronix.com/scan.php?pa...lds-On-AVX-512

    Originally posted by ddriver View Post
    Intel appears to have given up on improving general purpose performance
    No, they made plenty of general-purpose improvements in both Sunny Cove and Willow Cove. We probably don't see it so much on Rocket Lake, because it's a backport to 14 nm. Anandtech confirmed IPC increases in both Ice Lake (laptop) and Tiger Lake, though the latter still has slightly lower IPC than Zen3.



  • coder
    replied
    Originally posted by mle86pho View Post
    At least for dav1d, I'm suspicious whether the benchmark measured something meaningful.

    If you look into the source code, you'll notice that there's a ton of hand-written assembler code (including AVX-512):
    dav1d is the fastest AV1 decoder on all platforms :) Targeted to be small, portable and very fast.

    And there is code to directly decode the CPUID to determine the available vector instructions. I guess setting the usual -march/.. compiler settings is pointless; the hand-written code is used anyway.
    Yeah, they usually have a build-time and/or runtime option to override utilization of certain CPU features. So, PTS should really tie into that, for the results to be meaningful (although we don't necessarily know if the hand-coded stuff tries to use only 256-bit or goes for the full 512).



  • coder
    replied
    While it does drive up power consumption and in some cases can be detrimental to the performance due to the clock speed differences when engaging AVX-512
    Michael, has this been observed specifically on Rocket Lake?

    I have seen it on older 14 nm CPUs, for sure, but I don't see any of your benchmarks where AVX-512 actually hurts raw performance (only perf/W).



  • torsionbar28
    replied
    Originally posted by piotrj3 View Post
    The biggest problem with AVX-512 is that you need a hand-crafted program for it, and preferably a processor with at least 2 AVX-512 units.
    So it's completely useless for 99.999% of today's workloads, client or server, and the few pieces of software that theoretically could use it haven't been written yet. Got it.

    This reminds me of another Intel innovation: Itanium. Crap performance, poor software ecosystem, if only we had better compilers and better software to take advantage of it, then it would be super duper awesome!!111 Mmmkay.
    Last edited by torsionbar28; 07 April 2021, 10:47 PM.



  • vegabook
    replied
    Originally posted by GPSnoopy View Post

    Intel AVX is not faster. It's really not. Intel MKL by default does not use AVX on AMD CPUs, it falls back to something like SSE2 or SSE4. Quite slow.

    You need to binary patch Intel MKL binaries to be able to bypass this rather arbitrary limitation. See:
    - https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html
    - https://www.extremetech.com/computin...eadripper-cpus
    - http://www.swallowtail.org/naughty-intel.shtml
    yeah 'cos on Haswell I had exactly the same code running in under a second. On a 4-core. Identical code. So now I have to figure out how to do this patching for my numpy version.
    Last edited by vegabook; 07 April 2021, 06:49 PM.



  • piotrj3
    replied
    Originally posted by torsionbar28 View Post
    This is the oft cited answer for what AVX-512 is useful for, yet nobody has any examples of this usage benefiting from AVX-512. Do you have any real world examples? Actual software products that use it, and specific workloads where there is a demonstrated benefit? These seem to always be missing from these AVX 512 discussions. IMO the benefit of AVX 512 seems far more theoretical than practical at this point.


    What is the intended way? Can you quantify the benefits? Where can I go to see this benefit demonstrated?

    Edit: AVX-512 feels like the CPU instruction equivalent of an herbal supplement, with promises of increased vitality, improved clarity, and stronger constitution. Not FDA approved. Not intended to treat or cure any disease. Consult your doctor before taking. Results not guaranteed. Advertisement contains paid actors. Batteries not included. Void where prohibited. Not for sale in ME, TX, CA, NY, or NJ.
    Didn't CPUminer show you exactly that? And keep in mind NCNN also had some improvements. Also keep in mind that, as far as I know, desktop chips have only 1 AVX-512 unit per core (or thread); server chips mostly have 2.

    With AVX2 (256-bit) you mostly have 2 units in desktop chips, so you often end up in a situation where two AVX2 instructions complete in one cycle versus one AVX-512 instruction. A gain mostly happens only if two AVX2 instructions can't be used at the same time, or if AVX-512 can reduce the complexity of the algorithm.

    The biggest problem with AVX-512 is that you need a hand-crafted program for it, and preferably a processor with at least 2 AVX-512 units.
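    One concrete way AVX-512 can "reduce the complexity of the algorithm" is its mask registers: a single masked loop can also handle the tail elements, where SSE/AVX2 code needs a separate scalar cleanup loop. A hedged sketch (hypothetical function; requires a GCC/Clang-style target attribute to build, and an AVX-512F CPU to run):

```c
#include <immintrin.h>
#include <stddef.h>

/* Add 1.0 to every element, including a partial final vector,
   in one loop -- no scalar tail loop needed. */
__attribute__((target("avx512f")))
void add1_avx512(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i += 16) {
        size_t left = n - i;
        /* Full mask for whole vectors, partial mask for the tail. */
        __mmask16 m = (left >= 16) ? (__mmask16)0xFFFF
                                   : (__mmask16)((1u << left) - 1);
        __m512 v = _mm512_maskz_loadu_ps(m, src + i);
        _mm512_mask_storeu_ps(dst + i, m,
                              _mm512_add_ps(v, _mm512_set1_ps(1.0f)));
    }
}
```

    Whether that outweighs running two 256-bit instructions per cycle instead depends on the workload, as noted above.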
    Last edited by piotrj3; 07 April 2021, 05:48 PM.



  • jrch2k8
    replied
    Originally posted by TemplarGR View Post
    Seems the benefit of AVX in general is too small. Really, really small. Unless specifically written for it.
    The performance gains are huge, but they require you to understand certain CPU basics that most developers simply ignore or assume the compiler should handle. So, effectively speaking, 99% of the code you find around will see no benefit, or very little, from simply recompiling, because the code itself is in a non-vectorizable state and there is nothing the compiler can do without introducing very nasty undefined behavior.

    Now, if your code is in a vectorizable state, a recompile will show some nice gains, but for maximum performance, yes, specific implementations are best, simply because the developer is the one who understands the code and can go further than the compiler's safe approach.

    Think of it this way: regular C/C++ and co. are akin to OpenGL/DX11, while SIMD C/C++ and co. are akin to Vulkan/DX12.

    Also, SIMD, regardless of platform/acronym, does a hell of a lot more than simple math, as claimed in some other posts. Sure, BLAS and other software make use of the "math" side of SIMD, but it also greatly speeds up memory, cache, shifting, comparison, crypto (not only AES-NI does this, btw), etc. operations, and those can be used in any app/library as long as you understand what you are doing.
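    A tiny example of what a "vectorizable state" can mean in practice (hypothetical function, not from any of the benchmarked code): without restrict, the compiler must assume the pointers may alias, which usually blocks auto-vectorization of even a trivial loop.

```c
#include <stddef.h>

/* With restrict (and optimization enabled, e.g. -O2 -march=native),
   the compiler knows dst and src don't overlap and is free to emit
   SIMD loads/stores for this loop. Drop the restrict qualifiers and
   it typically has to fall back to scalar code or versioned checks. */
void scale(float *restrict dst, const float *restrict src,
           size_t n, float k) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
```

    Most real code fails auto-vectorization for reasons like this (aliasing, loop-carried dependencies, function calls in the loop body), which is why a plain recompile with -march rarely shows big gains.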

