Intel Begins Volleying Open-Source Patches Around Intel AMX
-
Originally posted by zxy_thf View Post
An alternative way to implement AVX-512 is to implement only the decoder and keep the core ALUs at 256 bits. AMD did this with AVX2 in Zen 1/Zen+.

- You still need to double the register sizes.
- You still need to implement some subset of AVX-512 instructions, and even Intel doesn't have a single CPU which has all of it. Just look at this mess: https://en.wikipedia.org/wiki/AVX-512#Instruction_set
- You expose yourself to the pitfall of apps entering 512-bit mode (which comes with extra power/clock penalties), if they include even minor use of AVX-512 instructions (which some libraries will do, automatically).
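To put a number on that "mess": here's a rough sketch in Python, with subset lists abridged (not exhaustive) from the Wikipedia table linked above, showing how little of AVX-512 a binary can actually rely on across Intel's own parts.

```python
# Abridged AVX-512 subset support per CPU family -- illustrative only,
# loosely taken from the Wikipedia AVX-512 table linked above.
avx512_subsets = {
    "Knights Landing": {"F", "CD", "ER", "PF"},
    "Skylake-SP":      {"F", "CD", "VL", "DQ", "BW"},
    "Ice Lake":        {"F", "CD", "VL", "DQ", "BW", "IFMA", "VBMI",
                        "VBMI2", "VNNI", "BITALG", "VPOPCNTDQ"},
}

# The union is everything software might want to use; the intersection
# is all that a single portable binary can assume across these parts.
union = set().union(*avx512_subsets.values())
common = set.intersection(*avx512_subsets.values())
print(sorted(union))
print(sorted(common))   # only the F and CD subsets are common to all three
```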
Originally posted by zxy_thf View Post
Considering the benefit/cost, and that currently even real AVX-512 modules are not running at full clock, this might be the most realistic way to adopt AVX-512.
Intel saw that vector instructions were a good thing and people liked them, so they kept adding more and making the vectors bigger. With AVX-512, they finally went too far.
Fortunately, ARM had a better idea when they added SVE. It's still debatable whether such vector-processing horsepower really belongs in a CPU, but that at least has fewer downsides.
-
BTW, the few details on AMX sound intriguing, at least. I plan to read up on it.
My first thought is to question the wisdom of bloating the thread execution context by another 8 kB, if that's indeed how it works. From the bit quoted in the article, it almost sounds as if it could be a separate device, like the iGPU.
-
Originally posted by coder View Post
Intel saw that vector instructions were a good thing and people liked them, so they kept adding more and making the vectors bigger. With AVX-512, they finally went too far.
Fortunately, ARM had a better idea when they added SVE. It's still debatable whether such vector-processing horsepower really belongs in a CPU, but that at least has fewer downsides.
So they assigned me to fix the required memory alignment issues for AVX-512, which wasn't hard and bought us about 5% better performance. Since workspace evaluation is a trivially parallelized exhaustive search, we'd be much better off with more cores. This was just pre-Zen; I assume that's where they've gone, as I've moved on to other things.
-
Originally posted by pipe13 View Post
Of course it does -- for sufficiently constrained values of "vector" and "such". I'm thinking back on a robotics contract.
The matter really comes down to cost, power, and scalability. The numbers don't lie -- GPUs are just way more efficient at raw compute, both in terms of GFLOPS/W and GFLOPS/$. Also, in peak GFLOPS, period.
It's great that you were able to do everything you needed in a CPU. There are always going to be some applications which are right in that sweet spot. However, the main target for AVX-512 (as indicated by its inclusion in Intel's server CPUs, yet it's nowhere to be seen on the desktop after 5 years!) is one where there's a lot more flexibility (not to mention a wealth of software support) for doing the compute on GPUs.
Speaking from the other side of the fence, I've shipped products that use GPU compute and didn't include any hand-written GPU code by us. Not that we couldn't, but we simply didn't need to, since the libraries for doing what we needed were already out there and of more than adequate quality.
Robotics isn't my field, so I can't speak to the level of GPU software support for it, but I know Nvidia has been pushing their embedded SoCs for robotics applications for a while.
-
Had a few more thoughts about this.
Originally posted by pipe13 View Post
7-DOF arm control involves a lot of 4x4 rot-trans matrix multiplies. In fp64 that's only 1024 bits (128 bytes) per matrix, and it wasn't clear the pipeline was well suited to a GPU even if we were budgeted to code one.
Now, what would be cool is if one were happy with fp32 (meaning you could fit a whole 4x4 matrix in a single register) and AVX-512 actually had proper support for things like 4x4 matrix-multiply. I don't think it tips the overall balance in favor of CPUs, but it'd still be nice and would make a slightly more compelling case for AVX-512.
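To make that register arithmetic concrete, here's a plain-Python sketch (no intrinsics): the size math showing why an fp32 4x4 matrix fits in one 512-bit register while fp64 needs two, plus a scalar reference multiply for what a hypothetical single-register AVX-512 kernel would have to compute.

```python
import struct

# Size arithmetic: a 4x4 matrix has 16 elements.
fp32_bytes = 16 * struct.calcsize("f")   # 64 bytes  -> one 512-bit register
fp64_bytes = 16 * struct.calcsize("d")   # 128 bytes -> two 512-bit registers
print(fp32_bytes * 8, fp64_bytes * 8)    # bit widths: 512 and 1024

# Scalar reference 4x4 multiply (row-major, flat 16-element lists) --
# the result a hypothetical one-register AVX-512 kernel would replicate.
def mat4_mul(a, b):
    return [sum(a[4*i + k] * b[4*k + j] for k in range(4))
            for i in range(4) for j in range(4)]

identity = [1.0 if i == j else 0.0 for i in range(4) for j in range(4)]
m = [float(x) for x in range(16)]
assert mat4_mul(identity, m) == m        # I * M == M sanity check
```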
Originally posted by pipe13 View Post
So they assigned me to fix the required memory alignment issues for AVX-512, which wasn't hard and bought us about 5% better performance.
-
^^^ Sorry to have missed this, coder. The alignments needed to be fixed for it to compile at all, so the 5% was net. As mentioned, the entire workspace evaluation process involved simulating all arm motions that might be required in the upcoming procedure, so it was trivially parallelizable, and that part had been done. But WSE was the main time-critical path, and the project engineer wished to see if there were any easy gains to be had from the hardware we had at the time: 8-core Skylake.
FP32 might or might not have been "good enough" for WSE. But it certainly wasn't for the actual procedure, and if your suggestion had occurred to me I would have recommended against it, just on the additional coding time and complexity needed to integrate and maintain the two precision paths. We ended up disabling AVX-512 for a similar reason: the rest of the code was still in a high degree of flux, and the engineers working on it were tripping over the alignment requirement.
32-core Zen was on the horizon but not there yet. I'm no longer on that project, but it had some very intelligent people and I assume that's where they've gone.
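The "trivially parallelized exhaustive search" shape described above can be sketched as follows -- `evaluate_pose` is a hypothetical placeholder for the real per-pose arm-motion simulation, and the point is only that each candidate is scored independently, so the parallel and serial results are identical with zero coordination.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the per-pose simulation: each candidate arm
# pose is scored on its own, so the search is embarrassingly parallel.
def evaluate_pose(pose_id):
    # Placeholder deterministic cost; real WSE would simulate the motion here.
    return pose_id, (pose_id * 2654435761) % 1000

poses = range(10_000)

# Serial baseline.
serial = dict(evaluate_pose(p) for p in poses)

# Parallel version: same results, no shared state between tasks.
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = dict(pool.map(evaluate_pose, poses))

assert parallel == serial
```

(With a CPU-bound simulation you'd reach for processes or more cores rather than threads, which is exactly the "more cores" argument above.)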
-
Originally posted by pipe13 View Post
FP32 might or might not have been "good enough" for WSE. But it certainly wasn't for the actual procedure, and if your suggestion had occurred to me I would have recommended against it, just on the additional coding time and complexity needed to integrate and maintain the two precision paths.