**numacross** · 28 June 2020, 03:07 AM

Originally posted by jabl

But thanks to Intel's market segmentation games, AVX-512 is available only in high-end parts, and thus any software with AVX-512 support will have fallback paths.

Not only that, but AVX-512 itself is very fragmented.

**coder** · 28 June 2020, 07:27 AM

Originally posted by zxy_thf

An alternative way to implement AVX-512 is to implement only the decoder part and keep the core ALUs at 256 bits. AMD did this for AVX2 in Zen 1/Zen+.

Not quite. There are big downsides to even that half-measure:

- You still need to double the register sizes.
- You still need to implement some subset of AVX-512 instructions, and even Intel doesn't have a single CPU which has all of it. Just look at this mess: https://en.wikipedia.org/wiki/AVX-512#Instruction_set
- You expose yourself to the pitfall of apps entering 512-bit mode (which comes with extra power/clock penalties), if they include even minor use of AVX-512 instructions (which some libraries will do, automatically).

And you still can't even deliver the full benefits of AVX-512, for heavy usage. So, it's most of the pain with almost none of the gain.
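To make the fragmentation and fallback-path points a bit more concrete, here's a rough sketch of the kind of dispatch logic software ends up carrying. It's Python purely for illustration; real code would do this in C/C++ via CPUID. The function names, the choice of required subsets, and the flag strings are all made up for the example (the strings are abbreviated, in the style of a Linux `/proc/cpuinfo` flags line, not real dumps).

```python
def avx512_subsets(flags_line):
    """Return the sorted AVX-512 feature flags present in a flags string."""
    return sorted(f for f in flags_line.split() if f.startswith("avx512"))


def pick_code_path(flags_line, required=("avx512f", "avx512dq")):
    """Choose a code path; `required` is a hypothetical set of subsets a kernel needs."""
    flags = set(flags_line.split())
    if all(name in flags for name in required):
        return "avx512"
    if "avx2" in flags:
        return "avx2"      # fallback path
    return "scalar"        # last-resort fallback


# Illustrative, abbreviated flag strings (assumptions, not captured from hardware).
SKYLAKE_X = "sse2 avx avx2 avx512f avx512dq avx512cd avx512bw avx512vl"
KNIGHTS_LANDING = "sse2 avx avx2 avx512f avx512pf avx512er avx512cd"
NO_AVX512 = "sse2 avx avx2"

for name, flags in [("Skylake-X", SKYLAKE_X),
                    ("Knights Landing", KNIGHTS_LANDING),
                    ("no AVX-512", NO_AVX512)]:
    print(name, avx512_subsets(flags), "->", pick_code_path(flags))
```

Note how a part that does have AVX-512 can still land on the fallback path, simply because it lacks one subset the kernel was written against.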

Originally posted by zxy_thf

Considering the benefit/cost, and currently even real AVX-512 modules are not running at full clock, this might be the most realistic way to adopt AVX-512.

It's a minefield AMD would be best to stay out of, IMO.

Intel saw that vector instructions were a good thing and people liked them, so they kept adding more and making the vectors bigger. With AVX-512, they finally went too far.

Fortunately, ARM had a better idea, when they added SVE. It's still debatable whether such vector-processing horsepower really belongs in a CPU, but that at least has fewer downsides.

**coder** · 28 June 2020, 07:34 AM

BTW, the few details on AMX sound intriguing, at least. I plan to read up on it.

My first thought is to question the wisdom of bloating the thread execution context by another 8 kB, if that's indeed how it works. From the bit quoted in the article, it almost sounds as if it could be a separate device, like the iGPU.

**pipe13** · 28 June 2020, 01:04 PM

Originally posted by coder

Intel saw that vector instructions were a good thing and people liked them, so they kept adding more and making the vectors bigger. With AVX-512, they finally went too far.

Fortunately, ARM had a better idea, when they added SVE. It's still debatable whether such vector-processing horsepower really belongs in a CPU, but that at least has fewer downsides.

Of course it does -- for sufficiently constrained values of "vector" and "such". I'm thinking back on a robotics contract. A critical part of the initialization was workspace evaluation -- determining for a given mobile setup whether the robot arm had clearance to do its assigned task. 7-DOF arm control involves a lot of 4x4 rot-trans matrix multiplies. In fp64 that's only 1024 bits (128 bytes) per matrix, and it wasn't clear the pipeline was well suited to a GPU even if we were budgeted to code one.
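For anyone who hasn't done arm kinematics, this is roughly what a chain of 4x4 rot-trans (homogeneous transform) multiplies looks like. It's a minimal pure-Python sketch, not the project's code: the helper names are invented, and a planar two-link arm stands in for the 7-DOF one.

```python
import math


def rot_z(theta):
    """4x4 homogeneous rotation about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, 0.0],
            [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


def trans(x, y, z):
    """4x4 homogeneous translation."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def matmul4(a, b):
    """Plain 4x4 matrix product: 64 multiplies and 48 adds per call."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def chain(transforms):
    """Compose transforms left to right, base first."""
    result = trans(0.0, 0.0, 0.0)  # identity
    for t in transforms:
        result = matmul4(result, t)
    return result


def transform_point(m, point):
    """Apply a 4x4 transform to a 3D point."""
    v = (point[0], point[1], point[2], 1.0)
    return tuple(sum(m[i][k] * v[k] for k in range(4)) for i in range(3))


# Two unit-length links: first joint at +90 degrees, second at -90 degrees.
arm = chain([rot_z(math.pi / 2), trans(1.0, 0.0, 0.0),
             rot_z(-math.pi / 2), trans(1.0, 0.0, 0.0)])
tip = transform_point(arm, (0.0, 0.0, 0.0))
print(tip)  # approximately (1, 1, 0)
```

Every candidate pose costs one such chain per link, which is where the "lot of matrix multiplies" comes from.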

So they assigned me to fix the required memory alignment issues for AVX-512, which wasn't hard and bought us about 5% better performance. Since workspace evaluation is a trivially parallelized exhaustive search, we'd be much better off with more cores. This was just pre-Zen; I assume that's where they've gone, as I've moved on to other things.
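A toy version of why this kind of search scales with core count: every pose is checked independently, so the work maps straight onto a process pool. Everything here is an assumption made for illustration (planar two-link arm, a single circular obstacle, checking only the arm tip rather than the whole arm); the names are hypothetical.

```python
import math
from itertools import product


def tip_position(theta1, theta2, l1=1.0, l2=1.0):
    """Planar two-link forward kinematics (stand-in for the real arm)."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y


def has_clearance(pose, obstacle=(1.5, 0.0), radius=0.25):
    """True if the arm tip stays outside a circular obstacle (simplified check)."""
    x, y = tip_position(*pose)
    return math.hypot(x - obstacle[0], y - obstacle[1]) > radius


def pose_grid(steps):
    """Exhaustive grid over both joint angles."""
    angles = [2.0 * math.pi * i / steps for i in range(steps)]
    return product(angles, angles)


def evaluate_workspace(poses, map_fn=map):
    """Check every pose independently; map_fn can be a pool's map."""
    poses = list(poses)
    return list(zip(poses, map_fn(has_clearance, poses)))


results = evaluate_workspace(pose_grid(16))
blocked = [pose for pose, ok in results if not ok]
print(len(results), "poses checked,", len(blocked), "blocked")
```

Passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` as `map_fn` spreads the same loop across cores with no other changes, since no pose depends on another.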

**coder** · 02 July 2020, 10:27 PM

Originally posted by pipe13

Of course it does -- for sufficiently constrained values of "vector" and "such". I'm thinking back on a robotics contract.

It's certainly more convenient to have enough compute power for your needs, right in the CPU cores. No argument there.

The matter really comes down to cost, power, and scalability. The numbers don't lie -- GPUs are just way more efficient at raw compute, both in terms of GFLOPS/W and GFLOPS/$. Also, in peak GFLOPS, period.

It's great that you were able to do everything you needed in a CPU. There are always going to be some applications which are right in that sweet spot. However, the main target for AVX-512 is servers (it ships in Intel's server CPUs, yet it's nowhere to be seen on the desktop after 5 years!), and that's a market with a lot more flexibility (not to mention a wealth of software support) for doing the compute on GPUs.

Speaking from the other side of the fence, I've shipped products that use GPU compute and didn't include any hand-written GPU code by us. Not that we couldn't, but we simply didn't need to, since the libraries for doing what we needed were already out there and of more than adequate quality.

Robotics isn't my field, so I can't speak to the level of GPU software support for it, but I know Nvidia has been pushing their embedded SoCs for robotics applications for a while.

**coder** · 02 July 2020, 11:24 PM

Had a few more thoughts about this.

Originally posted by pipe13

7-DOF arm control involves a lot of 4x4 rot-trans matrix multiplies. In fp64 that's only 1024 bits (128 bytes) per matrix, and it wasn't clear the pipeline was well suited to a GPU even if we were budgeted to code one.

Did you use any inter-lane operations, or just use it to process 8 separate matrices in parallel?

Now, what would be cool is if one were happy with fp32 (meaning you could fit a whole 4x4 matrix in a single register) and AVX-512 actually had proper support for things like 4x4 matrix-multiply. I don't think it tips the overall balance in favor of CPUs, but it'd still be nice and would make a slightly more compelling case for AVX-512.
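The register arithmetic behind that, as a quick sanity check (helper names are just for this sketch): a 4x4 fp64 matrix is 16 x 8 = 128 bytes, i.e. 1024 bits, so it spans two 512-bit registers, while the fp32 version is 64 bytes and fits in exactly one.

```python
def matrix_bytes(rows, cols, elem_bytes):
    """Storage for a dense matrix, in bytes."""
    return rows * cols * elem_bytes


def registers_per_matrix(nbytes, register_bits=512):
    """Vector registers needed to hold the whole matrix (ceiling division)."""
    return -(-(nbytes * 8) // register_bits)


def lanes_per_register(elem_bytes, register_bits=512):
    """Elements of a given size that fit in one vector register."""
    return register_bits // (elem_bytes * 8)


for label, elem_bytes in [("fp64", 8), ("fp32", 4)]:
    size = matrix_bytes(4, 4, elem_bytes)
    print(label, size, "bytes,",
          registers_per_matrix(size), "register(s) per matrix,",
          lanes_per_register(elem_bytes), "lanes per register")
```

The 8 fp64 lanes per register are also why "8 separate matrices in parallel" is the natural way to use it without inter-lane operations: one element from each of 8 matrices per register.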

Originally posted by pipe13

So they assigned me to fix the required memory alignment issues for AVX-512, which wasn't hard and bought us about 5% better performance.

Was that just from fixing the alignment, or was it the net improvement from AVX-512 overall? In the first case, that'd be pretty good. If you meant the latter, that sounds like a pretty disappointing benefit for a full doubling of the vector width.

**pipe13** · 12 July 2020, 12:08 PM

^^^ Sorry to have missed this, coder. The alignments needed to be fixed for it to compile at all, so the 5% was net. As mentioned, the entire workspace evaluation (WSE) process involved simulating all arm motions that might be required in the upcoming procedure, so it was trivially parallelizable, and that part had been done. But WSE was the main time-critical path and the project engineer wished to see if there were any easy gains to be had from the hardware we had at the time: 8-core Skylake.

FP32 might or might not have been "good enough" for WSE. But it certainly wasn't for the actual procedure and if your suggestion had occurred to me I would have recommended against it, just on the additional coding time and complexity needed to integrate and maintain the two precision paths. We ended up disabling AVX-512 for a similar reason: the rest of the code was still in a high degree of flux, and the engineers working on it were tripping over the alignment requirement.
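For context on what that requirement amounts to: assuming the code used aligned 512-bit loads and stores, those want addresses that are multiples of 64 bytes, which ordinary allocations don't guarantee. A small sketch of the usual over-allocate-and-offset trick, using Python's `ctypes` only so it can be run anywhere (the helper names are invented; C code would more likely use `aligned_alloc` or `alignas`):

```python
import ctypes


def align_up(address, alignment=64):
    """Round an address up to the next multiple of a power-of-two alignment."""
    return (address + alignment - 1) & ~(alignment - 1)


def is_aligned(address, alignment=64):
    """True if the address is a multiple of the alignment."""
    return address % alignment == 0


def aligned_doubles(count, alignment=64):
    """Return a ctypes array of `count` doubles whose storage is aligned."""
    nbytes = count * ctypes.sizeof(ctypes.c_double)
    # Over-allocate so an aligned start always exists inside the buffer.
    raw = ctypes.create_string_buffer(nbytes + alignment - 1)
    base = ctypes.addressof(raw)
    offset = align_up(base, alignment) - base
    # The returned array shares (and keeps alive) the raw buffer.
    return (ctypes.c_double * count).from_buffer(raw, offset)


matrix = aligned_doubles(16)  # one 4x4 fp64 matrix
print(hex(ctypes.addressof(matrix)), is_aligned(ctypes.addressof(matrix)))
```

The catch in a fast-moving codebase is that every new buffer, struct member, and stack array feeding the vector code has to follow the same rule, which is easy to trip over.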

32-core Zen was on the horizon but not there yet. I'm no longer on that project, but it had some very intelligent people and I assume that's where they've gone.

**coder** · 17 July 2020, 09:27 PM

Originally posted by pipe13

FP32 might or might not have been "good enough" for WSE. But it certainly wasn't for the actual procedure and if your suggestion had occurred to me I would have recommended against it, just on the additional coding time and complexity needed to integrate and maintain the two precision paths.

That was just a musing that's truly irrelevant, because AVX-512 has no 4x4 matrix-multiply instruction. For some things, it'd be great. Game engines, for instance, use almost exclusively fp32 on the GPU. 4x4 matrices abound, in graphics programming.

Intel Begins Volleying Open-Source Patches Around Intel AMX
