David Airlie Tries DOOM On CPU-Based Lavapipe Vulkan


  • #11
    Congratulations to all involved! I mean - wow! Having something like DOOM run on a software stack is pretty much totally insane - whatever the frame rates are. And for software-based rendering they are pretty good in my book, especially if some low-ish hanging fruit can still be identified.

    Kudos to all of you



    • #12
      cool to see, but hardly a surprising outcome. 16 - or even 64 - general-purpose cores are SO far from the 2K-3K specialised EUs/etc in even 7900/700-series GPUs that it's not even funny. The CPU may be 2x faster, but it's still facing a 100+x difference in throughput ability even before you factor in just how much faster those EUs are AT this kind of work.



      • #13
        Originally posted by arQon View Post
        cool to see, but hardly a surprising outcome. 16 - or even 64 - general-purpose cores are SO far from the 2K-3K specialised EUs/etc in even 7900/700-series GPUs that it's not even funny. The CPU may be 2x faster, but it's still facing a 100+x difference in throughput ability even before you factor in just how much faster those EUs are AT this kind of work.
        It's a common misconception that a CPU core can be compared to a GPU core. What most GPUs call a core is actually a 32-bit SIMD lane. A CPU with 64 cores with two AVX-512 SIMD units each actually has 2K of those, and would be capable of 8 SP FMA TFLOPS of throughput @ 2 GHz.



        • #14
          Originally posted by c0d1f1ed View Post
          It's a common misconception that a CPU core can be compared to a GPU core. What most GPUs call a core is actually a 32-bit SIMD lane. A CPU with 64 cores with two AVX-512 SIMD units each actually has 2K of those, and would be capable of 8 SP FMA TFLOPS of throughput @ 2 GHz.
          A 64-core CPU with AVX-512? Doesn't exist!


          Intel's latest Ice Lake CPUs feature up to 40 cores and 2x AVX-512 FMA per core. So, that'd be the equivalent of 1280 GPU "cores" or "shaders". Though its base frequency is 2.3 GHz, 2.0 GHz probably isn't a bad estimate, since even Ice Lake still clock-throttles under heavy AVX-512 utilization. Assuming 2x 16-wide fp32 FMAs per core per cycle, that amounts to 5.1 TFLOPS @ 2.0 GHz. By comparison, an RTX 3090 is rated at 29.4 TFLOPS and an RX 6900 XT advertises 18.7 TFLOPS.
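
          For anyone who wants to sanity-check those numbers, here's a trivial sketch of the standard peak-throughput arithmetic (cores x FMA units x fp32 lanes x 2 FLOPs per FMA x clock). The formula is generic; the core counts and the 2 GHz clock are just the figures quoted in this thread, and peak_sp_tflops is a name I made up for the helper.

          Code:
#include <stdio.h>

/* Peak single-precision throughput: each FMA counts as 2 FLOPs per lane per cycle. */
static double peak_sp_tflops(int cores, int fma_units, int lanes, double ghz)
{
    return cores * fma_units * lanes * 2.0 * ghz / 1000.0;
}

int main(void)
{
    /* hypothetical 64-core CPU, 2x AVX-512 (16 fp32 lanes) per core, 2 GHz */
    printf("64-core AVX-512:  %.1f TFLOPS\n", peak_sp_tflops(64, 2, 16, 2.0)); /* ~8.2 */

    /* 40-core Ice Lake, 2x AVX-512 FMA per core, assuming ~2 GHz under AVX-512 load */
    printf("40-core Ice Lake: %.1f TFLOPS\n", peak_sp_tflops(40, 2, 16, 2.0)); /* ~5.1 */
    return 0;
}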

          Memory bandwidth is another area in GPUs' favor. The Ice Lake 8380 has a nominal bandwidth of about 205 GB/s, whereas the RTX 3090 advertises 936 GB/s and the RX 6900 XT has a nominal GDDR6 bandwidth of 512 GB/s (if we counted Infinity Cache, then we'd have to compare it with the CPU's L3 bandwidth).
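
          The 205 GB/s figure is just the usual channels x transfer rate x bus width arithmetic for the 8380's eight DDR4-3200 channels; a quick sketch below, with the helper name again made up.

          Code:
#include <stdio.h>

/* Nominal DRAM bandwidth: channels x MT/s x bytes per transfer. */
static double dram_gb_per_s(int channels, double mega_transfers, int bus_bytes)
{
    return channels * mega_transfers * bus_bytes / 1000.0;
}

int main(void)
{
    /* Ice Lake-SP 8380: eight channels of DDR4-3200, 8-byte (64-bit) bus each */
    printf("Xeon 8380: %.1f GB/s\n", dram_gb_per_s(8, 3200.0, 8)); /* ~204.8 */
    return 0;
}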

          However, as I mentioned, GPU performance isn't only about the shaders, or else they'd all look like AMD's new CDNA -- with no ROPs, texture samplers, tessellators, RT cores, etc. So, we'd really need to look beyond the TFLOPS. I know you, of all people, are well aware of this. I'm just mentioning it for arQon or anyone else who might not be paying attention to that stuff.

          GPUs also have other advantages, like much greater SMT (Ampere is 64-way?) and many more registers (Ampere is up to 255 SIMD registers per warp). By comparison, x86 is just 2-way SMT (but has OoO) and AVX-512 has just 32 architectural registers per thread. Ampere also has other SIMD refinements you won't find in AVX-512, and GPUs have famously loose memory consistency guarantees.

          CPUs are just no match for GPUs at their own game. Intel didn't believe this until two generations of Xeon Phi accelerators couldn't even catch up with the previous generation of GPUs' compute performance!
          Last edited by coder; 20 April 2021, 09:20 PM.



          • #15
            Originally posted by c0d1f1ed View Post
            It's a common misconception that a CPU core can be compared to a GPU core. What most GPUs call a core is actually a 32-bit SIMD lane. A CPU with 64 cores with two AVX-512 SIMD units each actually has 2K of those, and would be capable of 8 SP FMA TFLOPS of throughput @ 2 GHz.
            Right... the closest equivalent to a CPU core in GCN would be one of our compute units - each of the four SIMDs in a CU is effectively a 10-thread scalar processor with a 2048-bit vector unit ("AVX-2048") running at 1/4 of the engine clock.

            You could either say "the vector unit is 4 times as wide and that compensates for running at 1/4 the engine clock and so a single SIMD = a CPU core" or say that "putting all four SIMDs together compensates for running at 1/4 the engine clock and so a CU = a CPU core". We take the more conservative approach in our marketing blurb and talk about CPU cores and CUs.

            For RDNA each CU has two SIMDs, each with a scalar processor and 1024-bit vector unit running at full engine clock, so it's probably easiest to say SIMD = CPU core. By that logic a 6900XT would have 160 cores each with "AVX-1024".
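
            Putting rough numbers on that equivalence - this is just the back-of-the-envelope lane accounting described above, not any official definition, and lanes_per_engine_clock is a made-up helper:

            Code:
#include <stdio.h>

/* fp32 lanes of work issued per engine clock: vector width in bits / 32,
 * scaled by the clock ratio the unit runs at relative to the engine clock. */
static double lanes_per_engine_clock(int vector_bits, double clock_ratio)
{
    return (vector_bits / 32.0) * clock_ratio;
}

int main(void)
{
    double gcn_simd  = lanes_per_engine_clock(2048, 0.25); /* GCN SIMD: 16  */
    double rdna_simd = lanes_per_engine_clock(1024, 1.0);  /* RDNA SIMD: 32 */
    double avx512    = lanes_per_engine_clock(512,  1.0);  /* one AVX-512 FMA unit: 16 */

    printf("GCN SIMD: %.0f, GCN CU (4 SIMDs): %.0f\n", gcn_simd, 4.0 * gcn_simd);
    printf("RDNA SIMD: %.0f\n", rdna_simd);
    printf("AVX-512 unit: %.0f (typically two per core)\n", avx512);
    return 0;
}

            By that accounting, the 160 "cores" of a 6900XT line up with its 80 CUs x 2 SIMDs, and a single SIMD lands in the same ballpark as a CPU core with one or two AVX-512 FMA pipes.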

            Originally posted by coder View Post
            and GPUs have famously loose memory consistency guarantees.
            Yep... the big register files and loose memory models both have a big impact on performance. IIRC a modern GPU uses almost as much area for registers as it does for ALUs.
            Last edited by bridgman; 21 April 2021, 10:18 AM.



            • #16
              Originally posted by airlied View Post

              Ryzen 7 1800x

              Dave.
               Hm, it would be better to test on Zen 3 or Rocket Lake, I think. I know there won't be a huge boost in performance or anything, but the Zen 2 and Zen 3 improvements were pretty big compared to Zen and Zen+.



              • #17
                Originally posted by sandy8925 View Post
                 Hm, it would be better to test on Zen 3 or Rocket Lake, I think. I know there won't be a huge boost in performance or anything, but the Zen 2 and Zen 3 improvements were pretty big compared to Zen and Zen+.
                So, Rocket Lake's AVX-512 could be interesting, except Lavapipe doesn't support beyond AVX2, IIRC.

                Otherwise, I'm not sure it'd be a very informative test, as merely tens of % speed improvement isn't going to make these games playable. What could make a big difference is a significant boost in core count and/or memory bandwidth (and more from the perspective of looking at the scaling data). Perhaps a giant chunk of L3 cache would be another potential game changer (if you'll excuse the pun). I'm guessing he'd have used something newer/bigger, if it were readily available.



                • #18
                  Originally posted by coder View Post
                  So, Rocket Lake's AVX-512 could be interesting, except Lavapipe doesn't support beyond AVX2, IIRC.

                  Otherwise, I'm not sure it'd be a very informative test, as merely tens of % speed improvement isn't going to make these games playable. What could make a big difference is a significant boost in core count and/or memory bandwidth (and more from the perspective of looking at the scaling data). Perhaps a giant chunk of L3 cache would be another potential game changer (if you'll excuse the pun). I'm guessing he'd have used something newer/bigger, if it were readily available.
                  Yeah well, I don't think trying to actually make the game playable on CPUs is a goal to pursue. I look at it as something useful for debugging/verification/testing. Faster performance is always nicer, but I doubt it will ever match even iGPUs.



                  • #19
                    Originally posted by sandy8925 View Post
                    Yeah well, I don't think trying to actually make the game playable on CPUs is a goal to pursue. I look at it as something useful for debugging/verification/testing.
                     I get that, but I also look at testing in terms of what I think it could tell me. If a faster CPU is architecturally similar and just faster than the one I've got, then it's unlikely to highlight any new performance or scaling problems that are limiting lavapipe. So, what'd be the point, besides just having marginally better numbers to publish?

                     If the goal is really to enable higher throughput from lavapipe, then improving multi-core scaling should probably be the priority. Though not necessarily a higher priority than addressing already-known bottlenecks.



                    • #20
                      Originally posted by coder View Post
                       Assuming 2x 16-wide fp32 FMAs per core per cycle, that amounts to 5.1 TFLOPS @ 2.0 GHz. By comparison, an RTX 3090 is rated at 29.4 TFLOPS and an RX 6900 XT advertises 18.7 TFLOPS.
                      While that's still a sizeable gap, my point was that CPUs have plenty of processing power to theoretically run DOOM 2016 in real-time. I think it's also worth noting that this gap has been getting smaller in recent years. CPUs are bound by the same laws of semiconductor physics as GPUs, and that goes both ways.

                      Memory bandwidth is another area in GPUs' favor.
                       The CPU has plenty of bandwidth for running DOOM 2016. Integrated GPUs, which are also popular in gaming consoles, illustrate that this is not a significant bottleneck.

                      However, as I mentioned, GPU performance isn't only about the shaders, or else they'd all look like AMD's new CDNA -- with no ROPs, texture samplers, tessellators, RT cores, etc. So, we'd really need to look beyond the TFLOPS. I know you, of all people, are well aware of this.
                       Now we're getting closer to the true reasons. Texture sampling takes close to 100 instructions on a CPU (which can be vectorized and run on all cores), and it's worse when doing on-the-fly decompression as in Lavapipe's case, while on a GPU it's often just one pipelined operation executed on dedicated units.
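
                       As a very rough illustration of where those instructions go, here's a bare-bones bilinear RGBA8 fetch in plain C - a hypothetical standalone sketch, not Lavapipe's actual sampler, and it skips mipmapping, sRGB, border colours and block decompression entirely. Everything below is a single pipelined operation on a GPU's texture unit.

                       Code:
#include <math.h>
#include <stdint.h>

/* One bilinearly filtered RGBA8 sample from a w x h texture with repeat
 * wrapping. Coordinate wrapping, address math, four loads, unpacking and
 * three lerps per channel - all of it fixed-function hardware on a GPU. */
void sample_bilinear_rgba8(const uint8_t *texels, int w, int h,
                           float u, float v, float out[4])
{
    float x = u * (float)w - 0.5f, y = v * (float)h - 0.5f;
    int x0 = (int)floorf(x), y0 = (int)floorf(y);
    float fx = x - (float)x0, fy = y - (float)y0;

    int x1 = x0 + 1, y1 = y0 + 1;
    x0 = ((x0 % w) + w) % w;  x1 = ((x1 % w) + w) % w;   /* repeat wrap */
    y0 = ((y0 % h) + h) % h;  y1 = ((y1 % h) + h) % h;

    const uint8_t *t00 = texels + 4 * (y0 * w + x0);     /* four texel fetches */
    const uint8_t *t10 = texels + 4 * (y0 * w + x1);
    const uint8_t *t01 = texels + 4 * (y1 * w + x0);
    const uint8_t *t11 = texels + 4 * (y1 * w + x1);

    for (int c = 0; c < 4; c++) {
        float top = t00[c] + fx * (t10[c] - t00[c]);         /* lerp along x   */
        float bot = t01[c] + fx * (t11[c] - t01[c]);
        out[c] = (top + fy * (bot - top)) * (1.0f / 255.0f); /* lerp y, scale  */
    }
}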

                       There's a lot that can be done to make things faster on the CPU though, both on the software and the hardware side. Gather instructions still don't do coalescing, last I checked. I got DOOM to run 4x faster with one evening of optimization work, but that's not my day job nor anyone else's, so neither Lavapipe nor SwiftShader is representative of what is technically feasible on a CPU.
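
                       For reference, the kind of gather meant here is a single AVX-512 instruction that loads 16 independent 32-bit texels; the point above is that current implementations still service those 16 loads one element at a time rather than coalescing neighbouring addresses into fewer cache transactions. The function name below is just for illustration.

                       Code:
#include <immintrin.h>
#include <stdint.h>

/* Fetch 16 RGBA8 texels (one 32-bit word each) given 16 linear texel indices.
 * One vpgatherdd issues all 16 loads, but today's hardware still walks them
 * through the cache individually rather than merging adjacent accesses. */
__m512i gather_16_texels(const uint32_t *texels, __m512i indices)
{
    return _mm512_i32gather_epi32(indices, texels, 4);   /* scale = sizeof(uint32_t) */
}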

                      GPUs also have other advantages, like much greater SMT (Ampere is 64-way?) and many more registers (Ampere is up to 255 SIMD registers per warp). By comparison, x86 is just 2-way SMT (but has OoO) and AVX-512 has just 32 architectural registers per thread.
                       Those are other, not-so-significant factors in my view. Indeed, out-of-order scheduling makes up for low SMT. And it's not a hard limit: IBM has 8-way SMT-capable CPUs. The low number of registers isn't an issue either, thanks to fairly efficient stack memory access, which isn't an option on GPUs largely because of the very high SMT. While that's still a successful formula for GPUs, I don't think low SMT is what's keeping CPUs from running graphics efficiently.

                      GPUs have famously loose memory consistency guarantees.
                      I'm curious what the net effect of that really is. GPUs don't have a lot of memory consistency because they haven't really needed it, but that doesn't necessarily mean that this is what's holding back CPUs. I do believe it's a major hurdle during design and validation though, which might be why we haven't seen much faster gather instructions yet.

                       CPUs are just no match for GPUs at their own game. Intel didn't believe this until two generations of Xeon Phi accelerators couldn't even catch up with the previous generation of GPUs' compute performance!
                       History is on your side, but I'm not sure this conclusively shows CPUs can't catch up with GPUs. The gap seems to be closing, albeit very slowly. I think we have to take business dynamics into account here as well. A company like NVIDIA pours all of its investment into the next GPU, while for Intel, Larrabee and Xeon Phi were small experiments, and AVX-512 still receives just a fraction of their attention since it's not the be-all and end-all of their bottom line. Although seeing it appear in consumer CPUs is promising, it's going to take many more years for it to be used more widely. But the trend is there. Meanwhile, we also see new players challenging the x86 architecture itself. So a company that missed the boat on efficient mobile CPUs might not be the one we should look toward for challenging GPU architectures.

