**coder** · 22 April 2021, 06:36 PM

Thanks for the reply. It's nice to hear your views.
: )

Originally posted by c0d1f1ed View Post

I think it's also worth noting that this gap has been getting smaller in recent years. CPUs are bound by the same laws of semiconductor physics as GPUs, and that goes both ways.

Depends, in part, on how much more you think CPUs can scale up SIMD width. Intel got badly burned by embracing AVX-512 in 14 nm CPUs, if you look at the horrible clock throttling it triggers. My team just got burned by some AVX-512 code in a library we're using, that caused the clock speed of a 14 nm Xeon to drop down to 1.2 or 1.3 GHz, when its base clock is 2.1 GHz! And because the AVX-512 code only got run for a small fraction of the time, the performance boost it provided wasn't nearly enough to offset the loss in clock speed. Compiling out the AVX-512 code was a pure win, for us. After that, the CPU stayed at or above 2.1 GHz (often boosting to 2.4 GHz) and delivered much more throughput on our workload!
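
A back-of-the-envelope model shows why compiling out the AVX-512 path can be a net win. This is only a sketch: the fraction `f` and speedup `s` below are hypothetical, the throttled clock of 1.25 GHz is the midpoint of the 1.2 to 1.3 GHz range mentioned above, and the model pessimistically assumes the reduced clock applies to the whole run.

```python
def relative_throughput(f, s, clock_with, clock_without):
    """Throughput of the AVX-512 build relative to the build without it.

    f: fraction of the original run time spent in code AVX-512 accelerates
    s: speedup of that fraction at equal clock speed
    Assumes the lower clock applies to the entire run (a simplification).
    """
    time_without = 1.0 / clock_without
    time_with = ((1.0 - f) + f / s) / clock_with
    return time_without / time_with

# Hypothetical workload: 10% of the time is vectorizable, 2x faster with AVX-512.
ratio = relative_throughput(f=0.10, s=2.0, clock_with=1.25, clock_without=2.1)
print(f"AVX-512 build delivers {ratio:.2f}x the throughput of the plain build")
```

With numbers like these the ratio comes out well below 1, so the vectorized fraction would have to be much larger before the wider instructions pay for the clock drop.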

At 10 nm, it supposedly throttles a lot less, and therefore is less of a pitfall. But, that means we shouldn't look for AVX-1024 or a 1024-bit (or 2x 512-bit) implementation of ARM SVE in any CPU with a decent clock speed, any time soon. Fujitsu's 7 nm A64FX didn't even go that wide, and it clocked around 2 GHz, which is actually GPU territory, interestingly enough.

The other question is how much CPU core counts can continue to meaningfully scale. Already, we're starting to see increasing popularity around running them in NUMA mode.

Originally posted by c0d1f1ed View Post

Integrated GPUs, popular also in gaming consoles, illustrate this is not a significant bottleneck.

Oh dear. I'd encourage you to spend some quality time looking over the specs of PS4, PS5, XBox One (original and X flavors), and XBox Series X (yeah, MS is unparalleled in their knack for bad/confusing product names). It should make for some interesting reading. You can find a lot of it on Wikipedia, if you're allergic to gamer sites.

To cut to the heart of the matter, the only one of these to use bog-standard DDR memory (in this case, DDR3) was the original XBox One. However, it compensated by including a hunk of embedded DRAM for its GPU, as Microsoft has done in prior consoles. All of the rest of these consoles & iterations used GDDR-type memory, and most with a much wider data bus than mainstream desktop PCs employ.

Originally posted by c0d1f1ed View Post

Texture sampling takes close to 100 instructions on a CPU (which can be vectorized

Thanks for the data point. Is that worst-case? What sort of filtering & texture format are you assuming?

Originally posted by c0d1f1ed View Post

Gather instructions still don't do coalescing, last I checked.

Are you talking about AVX2 gather instructions? Do you mean coalescing of the memory operations, I guess?

Originally posted by c0d1f1ed View Post

I got DOOM to run 4x faster with one evening of optimization work, but that's not my day job nor anyone else's,

Sounds fun, though. I recently vectorized some area-sampling code, which was a hoot! ...once I'd finally worked out a good approach (no scatter/gather!).

A long time ago, I learned that jobs usually don't pay enough for this sort of thing. You've got to just do it for the love of the work, sometimes. I like to think of it in terms of craftsmanship.

Originally posted by c0d1f1ed View Post

out-of-order scheduling makes up for low SMT.

SMT is more energy-efficient, if not also more area-efficient. The main reason why CPUs don't do more of it is probably lack of concurrency in their workloads, and so much emphasis on single-thread performance.

Originally posted by c0d1f1ed View Post

IBM has 8-way SMT capable CPUs.

...because they're designed for server workloads. However, I question whether that's a good move for high core-count CPUs, since integer benchmarks (which benefit most from SMT) seem to show the benefits of Hyperthreading on both Intel and AMD tapering off as core counts continue to rise. And SMT is basically a wash for floating-point performance, on recent CPUs. Maybe not if you're doing lots of random-access, though.

It'll be interesting to see if ARM utilizes SMT in their own server core designs. So far, their server chips seem to be fairly competitive without it, and that's also with a few-years-old micro-architecture (Neoverse N1 is just an amped-up A76 core).

Originally posted by c0d1f1ed View Post

The low number of registers isn't an issue either thanks to fairly efficient stack memory access,

That's dangerous thinking. AFAIK, the CPU can't optimize away those spills (though Zen 2 can short-circuit the reload), since I think it can't rule out that the spilled value might either be used to communicate with another thread or be referenced later (unless the CPU sees the slot get overwritten, inside the reorder window?). All of that burns power. Touching cache burns power. Spilling adds contention to the memory port, which can block other operations. No matter how much you can optimize a spill, it's never as good as not having to spill, in the first place.

Originally posted by c0d1f1ed View Post

which isn't an option on GPUs largely because of the very high SMT.

Huh? I'm sure GPUs use stack, in some form and to some degree.

Originally posted by c0d1f1ed View Post

I don't think low SMT is what's keeping CPUs from running graphics efficiently.

GPUs use SMT because it's simply a more efficient way of hiding latency than out-of-order, speculation, and prediction. If your workload has enough concurrency, then SMT lets you make simple in-order cores, which you can then scale up in greater numbers. It's a simple formula. Of the big three, Intel is straying from it the most.
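
The "simple formula" can be illustrated with a toy latency-hiding model. It assumes each thread alternates a burst of compute with a memory stall, and that the in-order core switches threads for free; the cycle counts are hypothetical, not measurements of any real GPU.

```python
import math

def utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles the ALUs stay busy with `threads` resident threads."""
    return min(1.0, threads * compute_cycles / (compute_cycles + stall_cycles))

def threads_to_saturate(compute_cycles, stall_cycles):
    """Smallest thread count that keeps the ALUs busy during every stall."""
    return math.ceil((compute_cycles + stall_cycles) / compute_cycles)

# Hypothetical: 10 cycles of math per 400-cycle memory access.
print(utilization(1, 10, 400))        # one thread leaves the core mostly idle
print(utilization(8, 10, 400))        # 8-way SMT helps, but not enough
print(threads_to_saturate(10, 400))   # threads needed to fully hide the stall
```

The point of the model is only that the required thread count grows with the ratio of stall time to compute time, which is why this approach depends on the workload having plenty of concurrency.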

Originally posted by c0d1f1ed View Post

I'm curious what the net effect of that really is. GPUs don't have a lot of memory consistency because they haven't really needed it,

Well, GPUs have a lot of leverage over us poor graphics programmers! Because we're addicted to the performance, we have to jump through whatever hoops they place in our way!

I think memory consistency hurts CPUs in the following ways:

constrains instruction re-ordering
forces more memory bus transactions, which has cache-coherency overhead and burns additional power

Originally posted by c0d1f1ed View Post

History is on your side, but I'm not sure this conclusively shows CPUs can't catch up with GPUs.

GPUs have another advantage, which is the ability to make arbitrary, incompatible changes to their ISA! This means you don't need an internal micro-op format, further shortening pipelines and simplifying their cores.

CPUs could do this, and some have even tried. Probably the latest example is Nvidia's Denver cores, in their X2 SoC.

Originally posted by c0d1f1ed View Post

The gap seems to be closing, albeit very slowly.

Not sure about that. I need to look for something more current, but here's some food for thought:

https://www.karlrupp.net/2018/02/42-...or-trend-data/

He did a nice comparison of GPUs, but that's even more out-of-date:

https://www.karlrupp.net/2013/06/cpu...ics-over-time/

Originally posted by c0d1f1ed View Post

a company that missed the boat on efficient mobile CPUs might not be the one we should look toward for challenging GPU architectures.

I'm not. I'm trying to think openly about the benefits and drawbacks of each architectural approach, as well as the best kind of architecture for each class of problem.

Computer architecture has been a casual interest of mine, since I started reading about supercomputers as a kid. That was before anyone coined the term "GPU", but SGIs & OpenGL were all the rage. I got my first job writing graphics code for floating point DSPs connected in a mesh. "SIMD" meant one of them actually broadcast an instruction stream over a special bus, and all of the other DSPs literally ran in lockstep!

**c0d1f1ed** · 23 April 2021, 12:44 PM

Originally posted by coder View Post

Depends, in part, on how much more you think CPUs can scale up SIMD width.

As much as GPUs can. AMD stuck with 512-bit until 2019 though, so I don't think CPUs desperately need to go wider any time soon.

Originally posted by coder View Post

Intel got badly burned by embracing AVX-512 in 14 nm CPUs, if you look at the horrible clock throttling it triggers.

That is sadly true, but it's more of a consequence of how they chose to implement it, than an inherent limitation of wide vectors in CPU architectures. They wanted to make sure the AVX-512 instructions had the same low latencies as AVX-256 instructions, and SSE before it. I believe they should have instead created half-frequency execution units, and have four 512-bit ones per core, feeding them 1024-bit instructions. This way the front-end and scalar units can still run at ~4 GHz while the SIMD units run at ~2 GHz. Easier said than done of course, but it would combine the strengths of CPUs with those of GPUs.

Originally posted by coder View Post

To cut to the heart of the matter, the only one of these to use bog-standard DDR memory (in this case, DDR3) was the original XBox One. However, it compensated by including a hunk of embedded DRAM for its GPU, as Microsoft has done in prior consoles. All of the rest of these consoles & iterations used GDDR-type memory, and most with a much wider data bus than mainstream desktop PCs employ.

My point was that higher bandwidth memory isn't the exclusive domain of GPUs. The A64FX has 1 TB/s of bandwidth to its HBM2 memory. The high bandwidth of GPUs is a necessary consequence of their lopsided cache hierarchy more so than a specific advantage. Mainstream CPUs have plenty of headroom for more bandwidth, should they need it, while GPUs are now forced to bring more cache on-package/chip.

Originally posted by coder View Post

Thanks for the data point. Is that worst-case? What sort of filtering & texture format are you assuming?

It's just an order-of-magnitude average. It ranges all the way from 1D single-level point sampling, to anisotropic trilinear filtering of a compressed format, to things like cube array lookups with shadow compare. Every small feature costs additional instructions.

When people wonder why GPUs are faster at 3D graphics than CPUs, you don't have to look much further than texture sampling. The compute throughput or bandwidth differences are negligible in comparison. The other big factor is multi-core scaling. Beyond 16 threads the lock contention and inter-core communication become problematic when the scheduling is not locality aware. It's solvable through better software design, but it's a lot of effort and nobody has put in the work yet. OpenSWR has good scaling characteristics apparently, but lacks many features. But once we have similar scaling behavior in Lavapipe and/or SwiftShader, there's still the high texture sampling cost, and every instruction saved will be hard-fought.

Originally posted by coder View Post

Are you talking about AVX2 gather instructions? Do you mean coalescing of the memory operations, I guess?

Yes, when sampling a texture for neighboring pixels, there's a lot of overlap in the texel fetches (e.g. for 4x4 pixels you need 64 texels for bilinear, but the actual footprint is closer to 5x5 texels). A gather instruction which can take multiple elements from the same cache line per cycle, or even from multiple cache lines, would speed up texture sampling on the CPU immensely.
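
The overlap is easy to count. The sketch below assumes a roughly 1:1 texel-to-pixel mapping with a hypothetical sub-texel offset of 0.3, and texel centers at integer + 0.5; other mappings would give a different footprint.

```python
import math

def bilinear_texels(u, v):
    """The 2x2 texel coordinates that bilinear filtering reads around (u, v).

    Texel centers are assumed to sit at integer + 0.5.
    """
    x0 = math.floor(u - 0.5)
    y0 = math.floor(v - 0.5)
    return [(x0 + dx, y0 + dy) for dy in (0, 1) for dx in (0, 1)]

fetches = []
for py in range(4):
    for px in range(4):
        # Hypothetical 1:1 mapping: pixel center plus a 0.3 texel offset.
        u = px + 0.5 + 0.3
        v = py + 0.5 + 0.3
        fetches.extend(bilinear_texels(u, v))

total_fetches = len(fetches)        # one fetch per pixel per filter tap
unique_texels = len(set(fetches))   # distinct texels actually touched
print(total_fetches, unique_texels)
```

Under these assumptions the 4x4 block issues 64 fetches that land on only 25 distinct texels, which is the redundancy a coalescing gather could exploit.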

Originally posted by coder View Post

SMT is more energy-efficient, if not also more area-efficient. The main reason why CPUs don't do more of it is probably lack of concurrency in their workloads, and so much emphasis on single-thread performance.

There appears to be a slow but steady trend towards lower SMT in GPUs though. It has the advantage of keeping higher data locality, which becomes essential as bandwidth doesn't increase as fast as compute. GPUs have started to embrace some speculative work and pre-fetching, which used to be the exclusive domain of CPUs.

Originally posted by coder View Post

Touching cache burns power. Spilling adds contention to the memory port, which can block other operations. No matter how much you can optimize a spill, it's never as good as not having to spill, in the first place.

Of course. All I'm saying is most CPU architectures are quite content with 32 registers. You rarely need more, and when you do the L1 accesses offer graceful backup. GPUs on the other hand can't really afford spills. For example a 32 KiB L1 cache is only 512 x 64-byte cache lines. If you have 64 threads in flight that's a mere 8 stack slots for each thread before every access becomes an L2 access. So you need as many registers as the biggest shader you want to be able to execute without performance degradation requires.
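
The cache arithmetic in that example works out as follows; the constants are the ones quoted in the post, and treating one cache line as one "stack slot" per thread is the post's own simplification.

```python
L1_BYTES = 32 * 1024        # 32 KiB L1 cache
LINE_BYTES = 64             # cache line size
THREADS_IN_FLIGHT = 64      # threads sharing that L1

cache_lines = L1_BYTES // LINE_BYTES
lines_per_thread = cache_lines // THREADS_IN_FLIGHT

print(cache_lines)       # 512
print(lines_per_thread)  # 8
```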

Originally posted by coder View Post

Huh? I'm sure GPUs use stack, in some form and to some degree.

Yes, but as far as I'm aware, not without serious performance impact. In contrast on the CPU stack usage is quite efficient. So, similar to high memory bandwidth, the GPU's large register file is more out of necessity than a specific advantage.

I'm not saying that's a bad thing either. But simply stating that GPUs are superior due to an N times larger register file is pretty meaningless since the actual advantage is not proportional to N at all.

Originally posted by coder View Post

GPUs have another advantage, which is the ability to make arbitrary, incompatible changes to their ISA! This means you don't need an internal micro-op format, further shortening pipelines and simplifying their cores.

CPUs could do this, and some have even tried. Probably the latest example is Nvidia's Denver cores, in their X2 SoC.

Yes, this is definitely true. x86 has a ton of baggage and still maintains compatibility with software written ~40 years ago.

I'm not sure run-time binary translation is the answer though, as it still costs performance, both in the form of startup time and optimization losses. I don't think the ISA needs to be fundamentally changed with every generation to stay competitive. Something like Apple's approach where every few decades they switch architectures and drag the whole software ecosystem with it, has a lot of merit. Most software gets recompiled, and legacy stuff is supported through an emulation layer.

Originally posted by coder View Post

Computer architecture has been a casual interest of mine, since I started reading about supercomputers as a kid. That was before anyone coined the term "GPU", but SGIs & OpenGL were all the rage. I got my first job writing graphics code for floating point DSPs connected in a mesh. "SIMD" meant one of them actually broadcast an instruction stream over a special bus, and all of the other DSPs literally ran in lockstep!

That sounds like a lot of fun. I've always had a soft spot for PC software though. I got frustrated as a teenager when games started to demand custom hardware, because when I finally spent all my money on a Voodoo graphics card, it didn't support the OpenGL and Direct3D games that were coming out in the years that followed. My hobby project that explored implementing those APIs on the CPU ended up being valued for testing purposes, but I hope one day it also proves useful for running graphics on efficient universal computing devices that support new features with a simple software update.

**coder** · 24 April 2021, 12:38 AM

Originally posted by c0d1f1ed View Post

As much as GPUs can. AMD stuck with 512-bit until 2019 though, so I don't think CPUs desperately need to go wider any time soon.

No... bridgman just explained this quite well. GCN uses 64-element vectors (i.e. 2048-bit) and has 4 quarter-wide pipelines per CU, in a barrel-type architecture. So, the net-throughput is actually 2048-bits per cycle. In fairness, we can call Intel's Skylake SP & Cascade Lake Gold & Platinum cores 1024-bit, since those higher-end models have 2x AVX-512 FMAs per core.
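
Spelled out, the width arithmetic looks like this. It assumes 32-bit lanes throughout, which is how the 2048-bit figure for a 64-element vector is reached.

```python
LANE_BITS = 32  # assumption: single-precision (32-bit) lanes

def bits_per_cycle(pipelines, lanes_per_pipeline, lane_bits=LANE_BITS):
    """Total SIMD bits processed per cycle across all pipelines."""
    return pipelines * lanes_per_pipeline * lane_bits

gcn_vector_bits = 64 * LANE_BITS                        # 64-element vector
gcn_cu_bits = bits_per_cycle(4, 64 // 4)                # 4 quarter-wide pipelines
skylake_sp_bits = bits_per_cycle(2, 512 // LANE_BITS)   # 2x AVX-512 FMA units

print(gcn_vector_bits, gcn_cu_bits, skylake_sp_bits)    # 2048 2048 1024
```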

Originally posted by c0d1f1ed View Post

I believe they should have instead created half-frequency execution units, and have four 512-bit ones per core, feeding them 1024-bit instructions. This way the front-end and scalar units can still run at ~4 GHz while the SIMD units run at ~2 GHz. Easier said than done of course, but it would combine the strengths of CPUs with those of GPUs.

Well, that would eat up even more silicon. And you'd probably want even more registers, in order to keep 4 pipelines fed. Double-length pipelines would also increase the branch mispredict penalty, when running AVX-512 code.

Originally posted by c0d1f1ed View Post

My point was that higher bandwidth memory isn't the exclusive domain of GPUs. The A64FX has 1 TB/s of bandwidth to its HBM2 memory.

That's not really a general-purpose CPU, though. Like Knights Landing, it uses general-purpose cores, but the HBM2 isn't enough for general-purpose workloads. That's why KNL supported 6-channel DDR4 and had a mode to use its HMC as L4 cache, BTW.

Originally posted by c0d1f1ed View Post

The high bandwidth of GPUs is a necessary consequence of their lopsided cache hierarchy more so than a specific advantage.

Before I proffer my objection, please elaborate on what you mean by "lopsided cache hierarchy".

Originally posted by c0d1f1ed View Post

Mainstream CPUs have plenty of headroom for more bandwidth, should they need it,

I'm not sure about that. HBM2 is expensive and desktop CPU packages don't have room for it. You can't put GDDR memory on a DIMM, so that's out.

Originally posted by c0d1f1ed View Post

while GPUs are now forced to bring more cache on-package/chip.

I guess you mean AMD's Infinity Cache? I need to read more about it, but I had long wondered why having local framebuffer & Z-buffer memory seemed to have fallen out-of-favor. To that end, I wonder how much it's helping, besides specific cases like those. In other words, I think there's probably a fairly steep drop-off in the benefits of increasing it a whole lot further.

One of the more interesting plots in the second of the pages I linked above is of the amount of bandwidth relative to compute performance.

Originally posted by c0d1f1ed View Post

When people wonder why GPUs are faster at 3D graphics than CPUs, you don't have to look much further than texture sampling.

This aligns with Intel's reported decision to equip Larrabee with hardware texture engines.

Originally posted by c0d1f1ed View Post

But once we have similar scaling behavior in Lavapipe and/or SwiftShader, there's still the high texture sampling cost, and every instruction saved will be hard-fought.

Say, do you use any tricks like constant-Z scan-conversion? Back when people were first implementing texture-mapping renderers on PCs, that was one of the more frequently-discussed tricks for avoiding a divide-per-pixel.
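
For context, the per-pixel divide in question comes from perspective-correct interpolation: `1/w` and `u/w` vary linearly in screen space, and the texture coordinate is recovered by dividing one by the other. The function name and parameters below are illustrative only, not taken from any renderer discussed here.

```python
def perspective_correct_u(u0, w0, u1, w1, t):
    """Texture coordinate at screen-space parameter t in [0, 1].

    (u0, w0) and (u1, w1) are the texture coordinate and clip-space w
    of the two projected endpoints.
    """
    inv_w = (1.0 - t) / w0 + t / w1                # 1/w is linear in screen space
    u_over_w = (1.0 - t) * u0 / w0 + t * u1 / w1   # so is u/w
    return u_over_w / inv_w                        # the per-pixel divide

# Halfway across the screen is not halfway across the texture
# when the far endpoint is three times as deep.
print(perspective_correct_u(0.0, 1.0, 1.0, 3.0, 0.5))
```

Along a line of constant depth, `w` does not change, so the divide can be hoisted out of the inner loop; that is the saving the constant-Z approach was after.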

Originally posted by c0d1f1ed View Post

A gather instruction which can take multiple elements from the same cache line per cycle, or even from multiple cache lines, would speed up texture sampling on the CPU immensely.

I thought you said you're still not using AVX2, right?

Originally posted by c0d1f1ed View Post

There appears to be a slow but steady trend towards lower SMT in GPUs though.

Really? Can you cite any examples? Intel's GPUs have always been on the low side, BTW.

Originally posted by c0d1f1ed View Post

It has the advantage of keeping higher data locality, which becomes essential as bandwidth doesn't increase as fast as compute.

I see that, but also GDDR tends to have narrow channel width, I think with GDDR6 halving it again (to just 16 bits).

Originally posted by c0d1f1ed View Post

GPUs have started to embrace some speculative work and pre-fetching, which used to be the exclusive domain of CPUs.

I don't know why GPUs would use hardware prefetching, when they can just do it in software. Any idea which ones are doing speculative execution?

Originally posted by c0d1f1ed View Post

GPUs on the other hand can't really afford spills. For example a 32 KiB L1 cache is only 512 x 64-byte cache lines. If you have 64 threads in flight that's a mere 8 stack slots for each thread before every access becomes an L2 access.

Yes, I see your point. However, just because GPUs have lots of registers by necessity doesn't mean it's not also a benefit that they keep more context in registers. It has the side-effect of leaving the memory subsystem free for streaming data in & out.

I really do think GPUs get an efficiency benefit from directly-addressable, local memory. IMO, CPUs & their software really need to work out how to do something similar.

Originally posted by c0d1f1ed View Post

I'm not sure run-time binary translation is the answer though, as it still costs performance, both in the form of startup time and optimization losses.

Huh? Why should it have any losses? I'd just take the existing frontend decoder and either move that into software, or maybe just let the CPU use system memory to buffer its decoded micro-ops. If that were made software-visible, the host OS could even snapshot the decoded program and reload it on subsequent execution. This would eliminate a big bottleneck from x86, and markedly improve performance on branch-heavy code.

The biggest argument I've heard against it is that perhaps micro-ops are less dense than an x86 instruction stream, although I doubt that's true for some of the longer instruction opcodes. If it's true, maybe they could implement some efficient compression, with the decompressor sitting between memory and the L1 i-cache.

Originally posted by c0d1f1ed View Post

I don't think the ISA needs to be fundamentally changed with every generation to stay competitive. Something like Apple's approach where every few decades they switch architectures and drag the whole software ecosystem with it, has a lot of merit.

AFAIK, ARM also has the equivalent of micro-ops. So, even that isn't really solving the problem.

Originally posted by c0d1f1ed View Post

My hobby project that explored implementing those APIs on the CPU ended up being valued for testing purposes, but I hope one day it also proves useful for running graphics on efficient universal computing devices that support new features with a simple software update.

So, which project was that? I thought you pushed for Google to buy TransGaming, so I assume you're not talking about SwiftShader?

**coder** · 29 April 2021, 02:11 PM

Originally posted by c0d1f1ed View Post

As much as GPUs can.

In case you didn't see this, ARM's new HPC-oriented V1 cores went with 2x 256-bit SVE, with a projected clock speed of 2.7 GHz @ 7 nm.

Originally posted by c0d1f1ed View Post

I believe they should have instead created half-frequency execution units, and have four 512-bit ones per core,

I won't repeat my response from my first reply, but I wanted to point out that it sounds like ARM might have another way to skin this cat. Their N2 cores have 2x 128-bit SVE2 and what sounds like an interesting mechanism for power-throttling. When the core is asked to reduce its power consumption, it does things like reducing speculation and shrinking the size of some buffers and queues. I think this could potentially include temporarily disabling one of its vector pipelines.

Source: https://www.anandtech.com/show/16640...-cmn700-mesh/4

It doesn't solve quite the same problem as you were targeting, which I suppose is trying to improve energy-efficiency of continuous, vector-intensive computation. However, I think it could be a better solution than Intel's clock-throttling (at least, when they had to reduce clocks by nearly half, in their 14 nm CPUs).

**c0d1f1ed** · 06 May 2021, 09:40 AM

Originally posted by coder View Post

No... bridgman just explained this quite well. GCN uses 64-element vectors (i.e. 2048-bit) and has 4 quarter-wide pipelines per CU, in a barrel-type architecture. So, the net-throughput is actually 2048-bits per cycle. In fairness, we can call Intel's Skylake SP & Cascade Lake Gold & Platinum cores 1024-bit, since those higher-end models have 2x AVX-512 FMAs per core.

Ah, you meant SIMD width per "core". Same answer though, CPUs can follow suit. The architecture I described earlier would have 4 x 512-bit so on par with both GCN and RDNA CUs.

Originally posted by coder View Post

Well, that would eat up even more silicon.

These SIMD units can be much smaller, because unlike Intel's designs today that have to support AVX2 at ~4 GHz, designing them from the ground up to only run at half frequency reduces the current the transistors have to carry and therefore how wide they have to be. It also reduces the heat and interference issues. This is why GPUs are able to have higher ALU densities while clocking at around 2 GHz. CPUs too can play that game.

Originally posted by coder View Post

And you'd probably want even more registers, in order to keep 4 pipelines fed.

With four 512-bit units at half frequency, the total throughput would remain the same. Also, by feeding them 1024-bit instructions it would help hide latency and thus reduce how deep the out-of-order execution has to be, allowing the physical register file to be smaller.
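
The throughput claim can be checked with simple arithmetic. The baseline of two 512-bit units at 4 GHz is an assumption pieced together from the figures mentioned earlier in the thread (2x AVX-512 FMAs per core, ~4 GHz); real parts do not sustain that clock under AVX-512 load.

```python
def simd_bits_per_second(units, width_bits, clock_hz):
    """Peak SIMD bits processed per second by one core."""
    return units * width_bits * clock_hz

baseline = simd_bits_per_second(units=2, width_bits=512, clock_hz=4e9)
proposed = simd_bits_per_second(units=4, width_bits=512, clock_hz=2e9)

print(baseline == proposed)  # True: twice the units at half the clock
```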

Originally posted by coder View Post

double-length pipelines would also increase the branch mispredict penalty, when running AVX-512 code.

Running at half frequency keeps the misprediction penalty the same. Plus there isn't much branching going on in throughput-oriented workloads anyway.

Originally posted by coder View Post

Before I proffer my objection, please elaborate on what you mean by "lopsided cache hierarchy".

This article sums it up nicely: https://rastergrid.com/blog/gpu-tech...ng-gpu-caches/. Unless the GPU's threads are accessing the same data at the same time (e.g. shader uniforms and overlapping texel footprints), they have to fetch data off-chip. Even if two objects have e.g. the same geometry, there's zero chance the data is still in the caches by the time the second object gets rendered, and it will have to be accessed from RAM a second time. Considering that the caches have to be made larger and the SMT lowered to get more temporal reuse, both of which have their own costs, it makes sense for GPUs to try to get the highest possible bandwidth. So this is by design. But there are limits, and so we're slowly starting to see GPUs do those other things anyway. Meanwhile CPUs already have a balanced cache hierarchy for a wide range of workloads, but they can still increase bandwidth should they need it. So it seems to me that the high bandwidth of GPUs isn't a unique advantage, and that further convergence is inevitable.

Originally posted by coder View Post

Say, do you use any tricks like constant-Z scan-conversion? Back when people were first implementing texture-mapping renderers on PCs, that was one of the more frequently-discussed tricks for avoiding a divide-per-pixel.

No. First of all, there is no such thing as constant-Z rasterization. The pixels are off-center from the actual lines of constant-Z. It wouldn't pass conformance requirements.

Division is cheap these days. Intel has 5-cycle reciprocal throughput for 256-bit vectors. For a modest 4-core CPU at 2 GHz, that's enough to do 500 divisions per pixel at 1080p @ 60 Hz. In SwiftShader we recently abandoned reciprocal approximation with Newton-Raphson refinement in favor of regular division, resulting in lower total latency and freeing up some execution ports.
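
A rough sketch of both ideas follows. The budget calculation assumes 32-bit lanes (8 per 256-bit vector) and one vector divide issued every 5 cycles per core; under those assumptions it lands near 100 divisions per pixel, and a figure around 500 would correspond to roughly one vector divide per cycle per core. Either way the budget is far above one divide per pixel. The refinement function is the textbook Newton-Raphson step for a reciprocal, not SwiftShader's actual code.

```python
def divisions_per_pixel(cores, clock_hz, cycles_per_vector_div, lanes,
                        width, height, fps):
    """Peak scalar divisions available per rendered pixel."""
    divisions_per_second = cores * clock_hz / cycles_per_vector_div * lanes
    pixels_per_second = width * height * fps
    return divisions_per_second / pixels_per_second

budget = divisions_per_pixel(cores=4, clock_hz=2e9, cycles_per_vector_div=5,
                             lanes=256 // 32, width=1920, height=1080, fps=60)
print(f"{budget:.0f} divisions per pixel")

def refine_reciprocal(a, x0, steps=1):
    """Newton-Raphson refinement of an estimate x0 of 1/a."""
    x = x0
    for _ in range(steps):
        x = x * (2.0 - a * x)   # error roughly squares with each step
    return x

# Start from a crude estimate of 1/3 and refine it twice.
print(refine_reciprocal(3.0, 0.3, steps=2))
```

Each refinement step costs a multiply and a fused multiply-subtract on top of the initial approximation, which is the extra latency and port pressure that a plain division avoids.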

Originally posted by coder View Post

I thought you said you're still not using AVX2, right?

Actually, the LLVM JIT will emit 128-bit variants of the AVX-512 instruction set no problem.

Originally posted by coder View Post

Yes, I see your point. However, just because GPUs have lots of registers by necessity doesn't mean it's not also a benefit that they keep more context in registers. It has the side-effect of leaving the memory subsystem free for streaming data in & out.

Most CPU micro-architectures have two load ports now, and judging from the NEON code that I've seen being generated, 32 registers is enough to keep the number of spills low enough to not make it a bottleneck. Note that out-of-order execution can even things out a lot. It's very rare to need two loads per cycle for long. Texel fetch being the notable exception, which is why I'd love to see gather instructions that can read multiple elements per cycle.

Originally posted by coder View Post

I really do think GPUs get an efficiency benefit from directly-addressable, local memory. IMO, CPUs & their software really need to work out how to do something similar.

Cell BE SPEs had exclusive local memory, but I'm not sure how much of a net win that was. Larrabee didn't have local memory, and I don't think that was identified as a major problem. Of course there are always workloads for which it is a perfect fit (usually after considerable tuning), but when it's not, shoehorning software into it tends to incur overhead that cancels out the benefits. Also it ports very poorly between different micro-architectures. Meanwhile I believe there have been great advances in L1 cache efficiency, including hardware-based optimizations that exploit locality near the stack frame pointer.

Originally posted by coder View Post

Huh? Why should it have any losses? I'd just take the existing frontend decoder and either move that into software, or maybe just let the CPU use system memory to buffer its decoded micro-ops. If that were made software-visible, the host OS could even snapshot the decoded program and reload it on subsequent execution. This would eliminate a big bottleneck from x86, and markedly improve performance on branch-heavy code.

The biggest argument I've heard against it is that perhaps micro-ops are less dense than an x86 instruction stream, although I doubt that's true for some of the longer instruction opcodes. If it's true, maybe they could implement some efficient compression, with the decompressor sitting between memory and the L1 i-cache.

Indeed just moving the decoder to software isn't generally effective due to the size of micro-ops. The EVEX encoding of AVX-512 is stupidly large though, which results in poor performance when an inner loop doesn't fit in the uop cache (which would often be the case for graphics shaders). So there's definitely substantial room for improvement by using a different instruction set architecture. I'm not convinced run-time binary translation is the answer though. The days of variable-length encoding are numbered, but beyond fixed-length encoding the savings from making incompatible ISA changes are limited.

Originally posted by coder View Post

AFAIK, ARM also has the equivalent of micro-ops. So, even that isn't really solving the problem

The main reason for that is to support microcode routines for complex operations. Their use is relatively rare and mostly for OS-related functionality, so I think this is distinct from the encoding efficiency problem (in fact it seems more like a solution than a problem).

Note that there are various GPU architectures which are competitive with each other, despite what are likely significant instruction encoding differences. I don't disagree that GPUs have more flexibility for inter-generational adjustments than CPUs, but I don't think it's a factor that will eternally keep CPUs from being competitive with GPUs.

Originally posted by coder View Post

So, which project was that? I thought you pushed for Google to buy TransGaming, so I assume you're not talking about SwiftShader?

swShader became SwiftShader.

**coder** · 07 May 2021, 01:31 AM

Thanks for the reply. I do enjoy geeking out about this stuff. Always lots of good info, in your posts.

Originally posted by c0d1f1ed View Post

Ah, you meant SIMD width per "core". Same answer though, CPUs can follow suit. The architecture I described earlier would have 4 x 512-bit so on par with both GCN and RDNA CUs.

These SIMD units can be much smaller, because unlike Intel's designs today that have to support AVX2 at ~4 GHz, designing them from the ground up to only run at half frequency reduces the current the transistors have to carry and therefore how wide they have to be.

I'm not an ASIC designer, but I think you still end up with a significant net increase in area. Since a lot of workloads don't utilize SIMD, this ends up being rather a lot of overhead for the sake of the few that do.

As I mentioned in that follow-up post, ARM used just 2x 256-bit SVE, in their HPC-oriented 5nm V1 core that runs at just 2.7 GHz. And yet, they're targeting implementations of up to 96 cores, which sits in between the CU count of AMD's 64-CU Vega 20 and 128-CU Arcturus (from MI100), both at 7nm and with 2048-bit wide CUs. And Vega has the overhead of fixed-function graphics, video codecs, and display controllers. While Arcturus dispenses with these (except the decoder), it adds Matrix cores. And yet, from the sound of it, the V1's power dissipation is projected to be well within the same ballpark (i.e. probably at least ~300 W).

Arm Announces Neoverse V1, N2 Platforms & CPUs, CMN-700 Mesh: More Performance, More Cores, More Flexibility

https://www.anandtech.com/show/16640/arm-announces-neoverse-v1-n2-platforms-cpus-cmn700-mesh/6
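
To put rough numbers on that comparison, here is a back-of-the-envelope sketch of aggregate SIMD datapath width, using only the core/CU counts and widths quoted above. It deliberately ignores clock speed, issue rate, and matrix units, so treat it as an order-of-magnitude illustration rather than a performance estimate. The helper name `aggregate_simd_bits` is made up for this example.

```python
def aggregate_simd_bits(units: int, bits_per_unit: int) -> int:
    """Total SIMD datapath width across all cores or CUs, in bits."""
    return units * bits_per_unit


# Counts and widths are the ones quoted in the post, nothing more.
chips = {
    "Neoverse V1 (96 cores, 2x 256-bit SVE)": aggregate_simd_bits(96, 2 * 256),
    "Vega 20 (64 CUs, 2048-bit)": aggregate_simd_bits(64, 2048),
    "Arcturus (128 CUs, 2048-bit)": aggregate_simd_bits(128, 2048),
}

for name, bits in chips.items():
    # 32-bit lanes are used only as a convenient unit of comparison.
    print(f"{name}: {bits} bits, {bits // 32} 32-bit lanes")
```

On these assumptions, the 96-core V1 configuration has well under half the aggregate vector width of Vega 20, which is what makes the similar power ballpark notable.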

Originally posted by c0d1f1ed View Post

This is why GPUs are able to have higher ALU densities while clocking at around 2 GHz. CPUs too can play that game.

There are lots of reasons GPUs have higher ALU densities. The compute cores of GPUs are basically nothing but ALU, registers, SRAM, and datapath. They lack all of the tricks CPUs use to get higher single-thread performance, because they don't care about it. They just rely on SMT for latency-hiding. Everything is simple, pipelined, in-order, and asynchronous.

As the ISA is easy to decode, no need for uop-caches; because they're in-order, no need for register-renaming, etc. They don't even have the ability to trap SNaNs, because that requires a notion of consistent ISA state that doesn't even exist! A couple years ago, someone I know at a GPU company told me that just supporting instruction traps adds a lot of overhead to CPUs.

It's amazing how much work CPUs are doing to make themselves easy to program. GPUs dispense with most of that, although some has crept back in (mostly for the sake of compute workloads).

Originally posted by c0d1f1ed View Post

Running at half frequency keeps the misprediction penalty the same. Plus there isn't much branching going on in throughput-oriented workloads anyway.

If you're looking at it from the perspective of vector-intensive code, then I agree that the misprediction penalty looks similar. However, from the perspective of code making lighter usage of vector instructions, a half-rate pipeline looks twice as deep and thus takes twice as long to flush.

As for the issue of branch-density, I'd say someone running vectorizable code with low branch-density should just use a GPU. The main reasons I'd run such code on a CPU are for small hotspots not worth the dispatch overhead and cases that GPUs don't handle well (such as narrow vectors and branchy code).

Originally posted by c0d1f1ed View Post

Unless the GPU's threads are accessing the same data at the same time (e.g. shader uniforms and overlapping texel footprints), they have to fetch data off-chip. Even if two objects have e.g. the same geometry, there's zero chance the data is still in the caches by the time the second object gets rendered, and it will have to be accessed from RAM a second time.

Yeah, that's why I don't really believe in GPU caches. The main reason they even added L2 was for GPGPU compute. If you're going to share data between warps or wavefronts, you can use local memory, though it's obviously very limited.

Originally posted by c0d1f1ed View Post

it seems to me that the high bandwidth of GPUs isn't a unique advantage,

It is, if you're mostly processing data streams and have low temporal locality. This tends to be common, in graphics workloads, and it's a key reason GPUs excel at deep learning.

Originally posted by c0d1f1ed View Post

No. First of all, there is no such thing as constant-Z rasterization. The pixels are off-center from the actual lines of constant-Z. It wouldn't pass conformance requirements.

I can believe it's not worthwhile, but I wonder if what you mean is that the pixels aren't usually centered at lines of constant-Z? I can definitely see that.

Sorry, I hadn't thought about that very hard -- just trying to reach back into my cobweb-filled attic of software rasterization tricks.

Originally posted by c0d1f1ed View Post

Division is cheap these days. Intel has 5-cycle reciprocal throughput for 256-bit vectors. For a modest 4-core CPU at 2 GHz, that's enough to do 500 divisions per pixel at 1080p @ 60 Hz. In SwiftShader we recently abandoned reciprocal approximation with Newton-Raphson refinement in favor of regular division, resulting in lower total latency and freeing up some execution ports.
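
For readers unfamiliar with the technique being abandoned, here is a minimal scalar sketch of reciprocal approximation with Newton-Raphson refinement. It is not SwiftShader's code: the function names are hypothetical, and the starting estimate `x0` is passed in by hand as a stand-in for the low-precision estimate a hardware reciprocal instruction would supply.

```python
def refine_reciprocal(d: float, x0: float, steps: int = 2) -> float:
    """Refine an estimate x0 of 1/d with Newton-Raphson on f(x) = 1/x - d.

    Each step computes x <- x * (2 - d * x), which roughly squares the
    relative error, so a coarse estimate converges in a few steps.
    """
    x = x0
    for _ in range(steps):
        x = x * (2.0 - d * x)
    return x


def approx_divide(n: float, d: float, x0: float, steps: int = 2) -> float:
    """Compute n / d as n * (1/d), using the refined reciprocal."""
    return n * refine_reciprocal(d, x0, steps)


# A crude 1% guess at 1/3 is refined over two steps.
print(approx_divide(1.0, 3.0, x0=0.33))
```

Each refinement step costs two multiplies and a subtract on top of the initial estimate, which is where the extra latency and execution-port pressure mentioned above come from.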

That's amazing. I once coded integer division, in assembly language, for a little programmable core that didn't even have hardware multiplication. It sure gave me proper respect for hardwired integer dividers -- especially low-latency implementations.
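
For anyone curious, the usual approach on a core like that is shift-and-subtract (restoring) division, which needs only shifts, compares, and subtracts. What follows is a minimal sketch in Python rather than assembly, and not the original routine; the name `divide_unsigned` and the 16-bit default width are assumptions for illustration.

```python
def divide_unsigned(dividend: int, divisor: int, bits: int = 16):
    """Restoring division: returns (quotient, remainder).

    Assumes 0 <= dividend < 2**bits and divisor > 0. Produces one
    quotient bit per iteration, most significant bit first.
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    for i in range(bits - 1, -1, -1):
        # Shift the next dividend bit into the running remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        # If the divisor fits, subtract it and record a 1 bit.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


print(divide_unsigned(1000, 7))
```

One iteration per result bit is exactly why a low-latency hardwired divider is impressive by comparison.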

Originally posted by c0d1f1ed View Post

Most CPU micro-architectures have two load ports now,

True. Although, it's more energy-efficient, the closer you can keep your data. Hitting cache also carries a latency-penalty, as well as the energy overhead of tag-RAM lookups.

Any time you can save energy, that can potentially be re-invested in some other way that nets higher performance, even if it's just higher clock speed.

Originally posted by c0d1f1ed View Post

Cell BE SPEs had exclusive local memory, but I'm not sure how much of a net win that was.

I'm sure it wasn't enough to compensate for the headache that they had no direct access to RAM! From what I've read, they had to fetch everything they needed using DMA engines!

However, if you had some sort of signal-processing application, like convolutions, small FFTs, or transforming a bunch of geometry, the raw compute performance of Cell was off-the-charts, for its day. It had to use tricks like local SRAM.

Originally posted by c0d1f1ed View Post

Larrabee didn't have local memory, and I don't think that was identified as a major problem.

Given that it failed at graphics and its successors underwhelmed as compute accelerators, that's a weird example.

Originally posted by c0d1f1ed View Post

I believe there have been great advances in L1 cache efficiency, including hardware-based optimizations that exploit locality near the stack frame pointer.

One of my favorite tricks is to use small data buffers on the stack. Not only is allocation free, but top-of-stack is virtually always in L1 cache. I've even done things like dynamically building small image-processing pipelines on the stack, with the leaf function in the call-chain then pulling the data through it, via virtual functions on objects instantiated by its callers.

Originally posted by c0d1f1ed View Post

The EVEX encoding of AVX-512 is stupidly large though, which results in poor performance when an inner loop doesn't fit in the uop cache (which would often be the case for graphics shaders).

A while back, I did a bit of hacking on x264, and I seem to recall they also had issues with some of the larger loops that didn't fit in I-cache.

David Airlie Tries DOOM On CPU-Based Lavapipe Vulkan