
David Airlie Tries DOOM On CPU-Based Lavapipe Vulkan


  • Guest
    Guest replied
    Originally posted by airlied View Post

    Ryzen 7 1800x

    Dave.
    Hm, it would be better to test on Zen 3 or Rocket Lake, I think. I know there won't be a huge boost in performance or anything, but the Zen 2 and Zen 3 improvements were pretty big compared to Zen and Zen+.


  • bridgman
    replied
    Originally posted by c0d1f1ed View Post
    It's a common misconception that a CPU core can be compared to a GPU core. What most GPUs call a core is actually a 32-bit SIMD lane. A CPU with 64 cores with two AVX-512 SIMD units each actually has 2K of those, and would be capable of 8 SP FMA TFLOPS of throughput @ 2 GHz.
    Right... the closest equivalent to a CPU core in GCN would be one of our compute units - each of the four SIMDs in a CU is effectively a 10-thread scalar processor with a 2048-bit vector unit ("AVX-2048") running at 1/4 of the engine clock.

    You could either say "the vector unit is 4 times as wide and that compensates for running at 1/4 the engine clock and so a single SIMD = a CPU core" or say that "putting all four SIMDs together compensates for running at 1/4 the engine clock and so a CU = a CPU core". We take the more conservative approach in our marketing blurb and talk about CPU cores and CUs.

    For RDNA each CU has two SIMDs, each with a scalar processor and 1024-bit vector unit running at full engine clock, so it's probably easiest to say SIMD = CPU core. By that logic a 6900XT would have 160 cores each with "AVX-1024".
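
    As a rough back-of-the-envelope check of that framing (just restating the numbers above in a short, hypothetical sketch; nothing here is an official spec), the per-engine-clock fp32 lane counts work out as:

        # Sketch of the "SIMD vs. CPU core" framing; clocks deliberately left out,
        # since the comparison is per engine clock.
        def fp32_lanes_per_clock(vector_bits, clock_ratio=1.0):
            # 32-bit lanes issued per engine clock by one vector unit
            return (vector_bits // 32) * clock_ratio

        gcn_simd    = fp32_lanes_per_clock(2048, clock_ratio=0.25)  # "AVX-2048" at 1/4 clock  -> 16.0
        avx512_pipe = fp32_lanes_per_clock(512)                     # one AVX-512 unit         -> 16.0
        gcn_cu      = 4 * gcn_simd                                  # four SIMDs per CU        -> 64.0
        rdna_simd   = fp32_lanes_per_clock(1024)                    # "AVX-1024" at full clock -> 32.0

        print(gcn_simd, avx512_pipe, gcn_cu, rdna_simd)   # 16.0 16.0 64.0 32.0
        print(80 * 2)   # 6900 XT: 80 CUs x 2 SIMDs = the 160 "cores" mentioned above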

    Originally posted by coder View Post
    and GPUs have famously loose memory consistency guarantees.
    Yep... the big register files and loose memory models both have a big impact on performance. IIRC a modern GPU uses almost as much area for registers as it does for ALUs.
    Last edited by bridgman; 21 April 2021, 10:18 AM.


  • coder
    replied
    Originally posted by c0d1f1ed View Post
    It's a common misconception that a CPU core can be compared to a GPU core. What most GPUs call a core is actually a 32-bit SIMD lane. A CPU with 64 cores with two AVX-512 SIMD units each actually has 2K of those, and would be capable of 8 SP FMA TFLOPS of throughput @ 2 GHz.
    A 64-core CPU with AVX-512? Doesn't exist!


    Intel's latest Ice Lake CPUs feature up to 40 cores with 2x AVX-512 FMA units per core. So that'd be the equivalent of 1280 GPU "cores" or "shaders". Though its base frequency is 2.3 GHz, 2.0 GHz probably isn't a bad estimate, since even Ice Lake still clock-throttles under heavy AVX-512 utilization. Assuming 2x 16 fp32 FMAs per core per cycle, that amounts to 5.1 TFLOPS @ 2.0 GHz. By comparison, an RTX 3090 is rated at 29.4 TFLOPS and an RX 6900 XT advertises 18.7 TFLOPS.
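
    To make the peak-throughput arithmetic explicit (a trivial sketch using only the assumptions above, nothing measured):

        cores, fma_units, lanes = 40, 2, 16   # Ice Lake-SP: 40 cores, 2x 512-bit FMA, 16 fp32 lanes per unit
        clock_ghz = 2.0                       # assumed sustained AVX-512 clock
        shader_equiv = cores * fma_units * lanes          # 1280 "GPU cores"
        tflops = shader_equiv * 2 * clock_ghz / 1000      # an FMA counts as 2 FLOPs
        print(shader_equiv, tflops)                       # 1280 5.12
        print(round(29.4 / tflops, 1), round(18.7 / tflops, 1))   # ~5.7x (3090), ~3.7x (6900 XT)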

    Memory bandwidth is another area in GPUs' favor. The Ice Lake 8380 has a nominal bandwidth of about 205 GB/s, whereas the RTX 3090 advertises 936 GB/s and the RX 6900 XT has a nominal GDDR6 bandwidth of 512 GB/s (if we counted Infinity Cache, then we'd have to compare it with the CPU's L3 bandwidth).

    However, as I mentioned, GPU performance isn't only about the shaders, or else they'd all look like AMD's new CDNA -- with no ROPs, texture samplers, tessellators, RT cores, etc. So, we'd really need to look beyond the TFLOPS. I know you, of all people, are well aware of this. I'm just mentioning it for arQon or anyone else who might not be paying attention to that stuff.

    GPUs also have other advantages, like much greater SMT (Ampere is 64-way?) and many more registers (Ampere allows up to 255 SIMD registers per thread). By comparison, x86 is just 2-way SMT (but has OoO) and AVX-512 has just 32 architectural registers per thread. Ampere also has other SIMD refinements you won't find in AVX-512, and GPUs have famously loose memory consistency guarantees.

    CPUs are just no match for GPUs at their own game. Intel didn't believe this until two generations of Xeon Phi accelerators couldn't even catch the previous generation of GPUs in compute performance!
    Last edited by coder; 20 April 2021, 09:20 PM.


  • c0d1f1ed
    replied
    Originally posted by arQon View Post
    cool to see, but hardly a surprising outcome. 16 - or even 64 - general purpose cores are SO far from the 2K-3K specialised EUs/etc in even 7900/700-series GPUs that it's not even funny. The CPU may be 2x faster, but it's still facing a 100+x difference in throughput ability even before you factor in just how much faster those EUs are AT this kind of work.
    It's a common misconception that a CPU core can be compared to a GPU core. What most GPUs call a core is actually a 32-bit SIMD lane. A CPU with 64 cores with two AVX-512 SIMD units each actually has 2K of those, and would be capable of 8 SP FMA TFLOPS of throughput @ 2 GHz.
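
    As a quick sketch of that claim (hypothetical part, peak numbers only):

        lanes  = 64 * 2 * 16              # 64 cores x 2 AVX-512 units x 16 fp32 lanes = 2048 "GPU cores"
        tflops = lanes * 2 * 2.0 / 1000   # FMA = 2 FLOPs, at 2 GHz
        print(lanes, tflops)              # 2048 8.192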


  • arQon
    replied
    cool to see, but hardly a surprising outcome. 16 - or even 64 - general purpose cores are SO far from the 2K-3K specialised EUs/etc in even 7900/700-series GPUs that it's not even funny. The CPU may be 2x faster, but it's still facing a 100+x difference in throughput ability even before you factor in just how much faster those EUs are AT this kind of work.


  • reba
    replied
    Congratulations to all involved! I mean - wow! Having something like DOOM run on a software rendering stack is pretty much totally insane, whatever the frame rates are. And for software-based rendering they are pretty good in my book, especially if some low-ish hanging fruit can still be identified.

    Kudos to all of you


  • tildearrow
    replied
    Hey, at least it runs. I've heard that normally a software implementation of a graphics API would crash on a very intensive game...

    Originally posted by torsionbar28 View Post
    8266752-core Fugaku is the solution.
    Fixed...

    Originally posted by Etherman View Post
    Cool, up to 10 fpm of pure performance.
    I remember Xonotic has an "spf" (seconds per frame) metric if your frame rate drops below 1 FPS...
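
    For anyone wondering how that works out, a trivial sketch using the 10 fpm figure from the quote:

        fpm = 10
        fps = fpm / 60    # ~0.167 fps
        spf = 1 / fps     # 6.0 seconds per frame
        print(fps, spf)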


  • airlied
    replied
    Originally posted by coder View Post
    Last time Michael tried this on a high-core-count machine, I think he found that performance maxed out at like 16 cores.

    airlied's blog entry doesn't specify what CPU this latest test used. I didn't check to see if his earlier entries mention what HW platform he's using.
    Ryzen 7 1800x

    Dave.


  • airlied
    replied
    Originally posted by coder View Post
    It's interesting to hear how this is progressing.


    I wonder if there's a good, generic way to profile JIT code. operf certainly hasn't done me much good, but then I haven't really looked into it, either.


    An order of magnitude less than real GPU performance probably isn't unreasonable to hope for, though it probably depends somewhat on how much the app leans on HW features vs. generic shader code.
    Our BPTC decompression isn't JITed yet; it's also very naive and done at runtime. I'm considering up-front decompression, but that trades a large memory-usage + bandwidth increase for a large CPU-usage saving. It might be possible to JIT the BPTC decompressor and get things a bit better that way.
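
    A rough sketch of the footprint side of that trade-off, assuming BC7-style BPTC blocks (16 bytes per 4x4 texel block) decompressed up front to RGBA8; the texture size is just a made-up example:

        def texture_mib(width, height, bytes_per_texel):
            return width * height * bytes_per_texel / 2**20

        w, h = 2048, 2048                 # hypothetical texture
        print(texture_mib(w, h, 1))       # BPTC compressed (1 byte/texel):  4.0 MiB
        print(texture_mib(w, h, 4))       # RGBA8 decompressed:             16.0 MiB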

    As for profiling, llvm has perf integration now, so I can at least see in perf report which assembly is eating up CPU, though mapping that back to fragment shader source is always tricky.

    Dave.


  • Laughing1
    replied
    Originally posted by commodore256 View Post
    If we can get graphene CPUs and they could go 1,000x faster (like they say), that's 166 fps.
    How about graphene GPUs?
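
    For what it's worth, the quoted arithmetic checks out:

        print(10 * 1000 / 60)   # 10 fpm x 1000 = ~166.7 fps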
