GCC 12 Profile Guided Optimization Benchmarks With The AMD Threadripper 3990X


  • #31
    Originally posted by coder View Post
    Well, 32-way SIMD is 1024-bit. You can compare that to CPU cores with AVX-512.

    When comparing GPU cores to CPU cores, note that GPUs still fit a lot more cores per area.
    A GPU is much simpler than a CPU, e.g. no context switching, no out-of-order execution, no branch prediction, no I/O, no interrupts.

    Originally posted by coder View Post
    That's not too different from CPUs gaining SIMD instruction sets. Compute-wise, they've never been very competitive with GPUs, however.

    Intel is moving a little further in this direction, with AMX, which has 8x 8192-bit tile registers. I think it's a little more loosely coupled to the core than usual.
    IMHO AMX is mostly for data mining, scientific computing and running inference for ML models, so that customers don't need to buy expensive GPUs.

    Originally posted by coder View Post
    No, it's not. Its GPU is just like other integrated GPUs, in that it's a distinct block and has its own programmable cores that are completely independent of the CPU cores.
    I must have been carried away by the crowd saying how amazing the M1 is.

    Originally posted by coder View Post
    I'm sure it reduces latency, but I'm not sure by how much. A significant part of DRAM's latency is due to the protocol, which is intrinsic to the way it works.
    I remember that it is because DRAM requires refreshing all its cells periodically.

    Originally posted by coder View Post
    Not the only way. As I said before, hardware prefetching is a key latency-hiding technique, in modern CPUs. Almost as important as Out-of-Order execution.

    And just like branch prediction, the prefetchers get smarter and smarter in each successive generation of CPUs.
    Yeah, I missed that.
    It isn't mentioned as much as caches in uni lectures, but it's very important for performance.



    • #32
      Originally posted by NobodyXu View Post
      A GPU is much simpler than a CPU, e.g. no context switching, no out-of-order execution, no branch prediction, no I/O, no interrupts.
      The fact that they're thread-based means there's definitely some form of context-switching.

      However, the main area savings comes from no OoO or branch-prediction.

      Originally posted by NobodyXu View Post
      IMHO AMX is mostly for data mining, scientific computing and running inference for ML models, so that customers don't need to buy expensive GPUs.
      In the context of this discussion, what's interesting about it is that it seems less closely-coupled to the CPU core than other recent extensions. I think it acts a lot like a bolt-on accelerator block, except that it reads/writes directly from/to the thread's registers.

      Originally posted by NobodyXu View Post
      I must have been carried away by the crowd saying how amazing the M1 is.
      Awesome doesn't mean it's revolutionary in every respect. It's good to steer clear of the hype and look for a good description of the actual hardware.

      Originally posted by NobodyXu View Post
      I remember that it is because DRAM requires refreshing all its cells periodically.
      I tried to comment further, but I'm really getting out of my depth. However, I think the higher latency of DRAM isn't due to the periodic refresh operations, themselves.

      Originally posted by NobodyXu View Post
      Yeah, I missed that.
      I've seen some interesting micro-benchmarks of prefetchers, to try to determine what sorts of patterns they can detect. Like branch predictors, they tend not to get discussed much, by chip makers. For a long time, they've been able to detect strided and even tile access patterns.
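
      For anyone curious, here is roughly what such a micro-benchmark looks like (a minimal toy sketch of my own, not one of the published ones): touch exactly one element per cache line, first in address order and then in a shuffled order. Both runs have the same spatial locality, so most of the timing gap comes from the stride prefetcher (plus some TLB effects).
      Code:
      #include <algorithm>
      #include <chrono>
      #include <cstddef>
      #include <cstdint>
      #include <cstdio>
      #include <numeric>
      #include <random>
      #include <vector>

      int main() {
          constexpr size_t kLine  = 8;         // 8 x uint64_t = 64 bytes, one cache line
          constexpr size_t kLines = 1 << 21;   // ~128 MiB total, far larger than any cache
          std::vector<uint64_t> data(kLines * kLine, 1);

          // One index per cache line, visited either in address order
          // (a pattern a stride prefetcher can follow) or shuffled (which it can't follow).
          std::vector<size_t> order(kLines);
          std::iota(order.begin(), order.end(), 0);

          for (bool shuffled : {false, true}) {
              if (shuffled)
                  std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
              uint64_t sum = 0;
              auto t0 = std::chrono::steady_clock::now();
              for (size_t line : order)
                  sum += data[line * kLine];
              auto t1 = std::chrono::steady_clock::now();
              auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
              std::printf("%s: %lld ms (sum=%llu)\n", shuffled ? "shuffled" : "strided ",
                          (long long)ms, (unsigned long long)sum);
          }
      }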

      One place this can be important is in the design of memory allocators. For instance, how much work should you do (and up to what size) to ensure that successive allocations tend to occur at contiguous memory addresses? This also has to be balanced against preferring allocations occur in "hot" memory addresses that are likely nearer in the cache hierarchy.
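
      As a toy illustration of the first half of that trade-off (my own sketch, not how any real allocator is written): a bump/arena allocator hands out successive allocations at contiguous, increasing addresses, which is about the friendliest pattern a prefetcher can hope for. It ignores freeing and "hot"-block reuse entirely, and that is exactly the part a real allocator has to balance.
      Code:
      #include <cstddef>
      #include <cstdio>
      #include <new>
      #include <vector>

      // Hands out memory by bumping an offset through one contiguous buffer.
      class BumpArena {
      public:
          explicit BumpArena(size_t capacity) : buffer_(capacity), offset_(0) {}

          void* allocate(size_t size, size_t align) {
              size_t aligned = (offset_ + align - 1) & ~(align - 1);  // align must be a power of two
              if (aligned + size > buffer_.size()) return nullptr;    // arena exhausted
              offset_ = aligned + size;
              return buffer_.data() + aligned;
          }

      private:
          std::vector<std::byte> buffer_;
          size_t offset_;
      };

      int main() {
          BumpArena arena(1 << 20);
          // Successive allocations land at adjacent addresses, so walking the
          // objects in allocation order is a simple forward stride.
          int* a = new (arena.allocate(sizeof(int), alignof(int))) int{1};
          int* b = new (arena.allocate(sizeof(int), alignof(int))) int{2};
          std::printf("a=%p b=%p\n", (void*)a, (void*)b);
      }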



      • #33
        Originally posted by coder View Post
        The fact that they're thread-based means there's definitely some form of context-switching.

        However, the main area savings comes from no OoO or branch-prediction.
        Yeah, but they still don't need things like ring 0/1/2/3, hypervisor support, paging, virtual memory, etc.

        Originally posted by coder View Post
        In the context of this discussion, what's interesting about it is that it seems less closely-coupled to the CPU core than other recent extensions. I think it acts a lot like a bolt-on accelerator block, except that it reads/writes directly from/to the thread's registers.
        That's definitely interesting; it sounds like a step towards better integration of the CPU and accelerators like the GPU.

        Originally posted by coder View Post
        I tried to comment further, but I'm really getting out of my depth. However, I think the higher latency of DRAM isn't due to the periodic refresh operations, themselves.
        Googling seems to tell me that every time DRAM is accessed, its content is erased and has to be rewritten.

        Originally posted by coder View Post
        I've seen some interesting micro-benchmarks of prefetchers, to try to determine what sorts of patterns they can detect. Like branch predictors, they tend not to get discussed much, by chip makers. For a long time, they've been able to detect strided and even tile access patterns.
        Yes, they are not as well studied as caches and thus are often missed when discussing performance.

        I think Linus or somebody said that modern prefetchers can even recognize list iteration and prefetch it to a certain degree.
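
        Something like the sketch below is roughly what I mean, though it only shows the software-side analogue (an explicit prefetch hint in the loop); how well the hardware recognizes pointer chasing on its own is much harder to verify. The node type and the tiny list here are made up purely for illustration.
        Code:
        #include <cstddef>
        #include <vector>

        struct Node {
            long  value;
            Node* next;
        };

        long sum_list(const Node* head) {
            long total = 0;
            for (const Node* n = head; n != nullptr; n = n->next) {
                if (n->next != nullptr)
                    __builtin_prefetch(n->next);  // GCC/Clang hint; only pays off if there is
                total += n->value;                // enough work per node to hide the fetch
            }
            return total;
        }

        int main() {
            // Build a small list just so the function has something to walk.
            std::vector<Node> nodes(1000);
            for (size_t i = 0; i < nodes.size(); ++i) {
                nodes[i].value = static_cast<long>(i);
                nodes[i].next  = (i + 1 < nodes.size()) ? &nodes[i + 1] : nullptr;
            }
            return sum_list(&nodes[0]) == 499500 ? 0 : 1;  // sum of 0..999
        }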

        Originally posted by coder View Post
        One place this can be important is in the design of memory allocators. For instance, how much work should you do (and up to what size) to ensure that successive allocations tend to occur at contiguous memory addresses? This also has to be balanced against preferring allocations occur in "hot" memory addresses that are likely nearer in the cache hierarchy.
        That will definitely be important to them, though they have to stay generic and avoid making too many assumptions.



        • #34
          Originally posted by NobodyXu View Post
          Yeah, but they still don't need things like ring 0/1/2/3, hypervisor support, paging, virtual memory, etc.
          They increasingly do need security features to keep them from accessing the wrong userspace memory, or even possibly data in GPU memory belonging to another process.

          Starting with Vega, AMD introduced the notion of GPU memory simply acting as a cache. IIRC, they added a hardware block called the HBCC (High-Bandwidth Cache Controller), which sounds to me like it has a lot in common with the MMU of a CPU. I don't know as much about other GPUs, in this respect. But, for some cloud use cases, I believe they support similar.

          I think the equivalent of ring-0 for GPUs basically employs special on-die micro-controllers that don't run user code.

          BTW, I'm also told GPUs don't support instruction traps.

          Originally posted by NobodyXu View Post
          That's definitely interesting; it sounds like a step towards better integration of the CPU and accelerators like the GPU.
          Interesting, yeah. Hard to extrapolate a trend from a couple data points, though. My own take on AMX is that it's a heavy-weight feature that not many of their server CPU users need or want. So, it's definitely an open question how much Intel is going to embrace this approach. Especially given the schedule delays it's probably contributed to.

          Sapphire Rapids is going to launch so late that its successor will begin ramping in the same year (2023) and such delays have no doubt had a significant impact on Intel's recent and anticipated financial under-performance. I think Intel hasn't done this poorly since the days of the Pentium 4.

          Originally posted by NobodyXu View Post
          Googling seems to tell me that every time DRAM is accessed, its content is erased and has to be rewritten.
          That's the impression I got, as well. I don't know quite how much that slows down accesses, but it sounds like it can't help!

          The main thing to keep in mind about DRAM is that it's designed to prioritize density above all else. It would be interesting to know if there's any meaningful difference between regular DDR and GDDR memory, other than the ability for the former to sit on DIMMs.



          • #35
            Originally posted by coder View Post
            They increasingly do need security features to keep them from accessing the wrong userspace memory, or even possibly data in GPU memory belonging to another process.

            Starting with Vega, AMD introduced the notion of GPU memory simply acting as a cache. IIRC, they added a hardware block called the HBCC (High-Bandwidth Cache Controller), which sounds to me like it has a lot in common with the MMU of a CPU. I don't know as much about other GPUs, in this respect. But, for some cloud use cases, I believe they support similar.

            I think the equivalent of ring-0 for GPUs basically employs special on-die micro-controllers that don't run user code.

            BTW, I'm also told GPUs don't support instruction traps.
            For cloud use cases, they do need separation, and that is particularly hard for a GPU, because it doesn't have a scheduler like a CPU does, one that can force a task to give up its time slice and switch to another task.

            With a scheduler, they can run arbitrary code without worrying about the halting problem (resource exhaustion).

            An alternative way to fix it is to use something like eBPF, which avoids the infinite-loop problem but limits certain features (I heard it also disallows loops for now), while still being able to compile eBPF to regular CPU instructions.

            But even with something like that, the code can still take longer than expected (overuse its budget), and it is just way more robust to use a scheduler model as on a CPU.

            Having an MMU is also necessary for that abstraction, but IMHO having the ability to control a task's GPU time slice and terminate it whenever you want is probably even more urgent and important.

            Maybe the best way to solve this is to integrate the GPU into the CPU, making it a subcomponent and letting the CPU do the hard work of context switching and scheduling.

            Originally posted by coder View Post
            I don't know quite how much that slows down accesses, but it sounds like it can't help!

            The main thing to keep in mind about DRAM is that it's designed to prioritize density above all else. It would be interesting to know if there's any meaningful difference between regular DDR and GDDR memory, other than the ability for the former to sit on DIMMs.
            From what I could find by googling, GDDR has higher bandwidth and lower power consumption.



            • #36
              Originally posted by NobodyXu View Post
              For cloud use cases, they do need separation, and that is particularly hard for a GPU, because it doesn't have a scheduler like a CPU does, one that can force a task to give up its time slice and switch to another task.
              In fact, they do.

              https://mynameismjp.wordpress.com/20...pu-preemption/

              Originally posted by NobodyXu View Post
              From what I could find by googling, GDDR has higher bandwidth and lower power consumption.
              I could've told you that much. What I don't know is the entire set of underlying differences which support those advantages. At a fundamental level, does it work the same as regular DDRx DRAM, or did they perhaps change its operation in some way that sacrifices density in the interest of performance?



              • #37
                Originally posted by coder View Post
                Thanks for the link!

                So GPUs have thread-level preemption because it is too hard for them to save and reload all of the register files.
                They also have multiple queues with different priorities, and multiple cores, to mitigate the issue of one thread taking too long.

                This might be enough for desktop/laptop users, though for cloud providers I suspect it is far from enough, as the number of queues is very limited and the latency is still determined by the length of an individual thread, which can be very long in the cloud.

                Originally posted by coder View Post
                I could've told you that much. What I don't know is the entire set of underlying differences which support those advantages. At a fundamental level, does it work the same as regular DDRx DRAM, or did they perhaps change its operation in some way that sacrifices density in the interest of performance?
                I also have no idea, as I am not a hardware engineer, and honestly the best I can do is just googling.



                • #38
                  Originally posted by NobodyXu View Post
                  So GPUs have thread-level preemption because it is too hard for them to save and reload all of the register files.
                  They also have multiple queues with different priorities, and multiple cores, to mitigate the issue of one thread taking too long.

                  This might be enough for desktop/laptop users, though for cloud providers I suspect it is far from enough, as the number of queues is very limited and the latency is still determined by the length of an individual thread, which can be very long in the cloud.
                  Their cloud GPUs support a priori segmentation of the GPU resources to support multiple clients. I think I recall hearing AMD claim support for something like a max of 32 clients, in some recent generation of server GPU cards.

                  Originally posted by NobodyXu View Post
                  I also have no idea, as I am not a hardware engineer, and honestly the best I can do is just googling.
                  I've found descriptions that might even have enough detail to answer some of my questions, but I lack the working knowledge of microelectronics to grasp the explanations in sufficient depth.

                  Anyway, the "why" isn't as important as the "what", for us. We just need to know how the system behaves, and then design our code around that behavior. With that said, it's worth having a good understanding of CPU caches. Most software developers don't properly understand the effects of cache sets, nor things like "write-miss" penalties of copy-back caches. Knowing about such details can enable you to write code that's just a little (or very occasionally, a lot) better.
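
                  Here's a quick sketch of the cache-set point, since it tends to surprise people the first time (my own toy example, nothing to do with this article's benchmarks): walking a "column" with a large power-of-two stride keeps hitting the same few sets of a set-associative cache, so it starts thrashing long before the cache is actually full. Padding each row by one cache line spreads the accesses across sets.
                  Code:
                  #include <chrono>
                  #include <cstddef>
                  #include <cstdio>
                  #include <vector>

                  // Sum one "column" of a rows x row_floats matrix stored row-major,
                  // i.e. read with a stride of row_floats floats, many times over.
                  static double walk_column(const std::vector<float>& m, size_t rows, size_t row_floats) {
                      double sum = 0.0;
                      for (int repeat = 0; repeat < 2000; ++repeat)
                          for (size_t r = 0; r < rows; ++r)
                              sum += m[r * row_floats];
                      return sum;
                  }

                  int main() {
                      constexpr size_t rows = 4096;
                      // A 4096-byte row stride aliases onto the same cache sets;
                      // 4096 + 64 bytes (one extra cache line per row) does not.
                      for (size_t row_floats : {size_t{1024}, size_t{1024 + 16}}) {
                          std::vector<float> m(rows * row_floats, 1.0f);
                          auto t0 = std::chrono::steady_clock::now();
                          volatile double s = walk_column(m, rows, row_floats);
                          auto t1 = std::chrono::steady_clock::now();
                          (void)s;
                          auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
                          std::printf("row stride %4zu floats: %lld us\n", row_floats, (long long)us);
                      }
                  }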



                  • #39
                    Originally posted by coder View Post
                    Their cloud GPUs support a priori segmentation of the GPU resources to support multiple clients. I think I recall hearing AMD claim support for something like a max of 32 clients, in some recent generation of server GPU cards.
                    Interesting.

                    Originally posted by coder View Post
                    I've found descriptions that might even have enough detail to answer some of my questions, but I lack the working knowledge of microelectronics to grasp the explanations in sufficient depth.

                    Anyway, the "why" isn't as important as the "what", for us. We just need to know how the system behaves, and then design our code around that behavior. With that said, it's worth having a good understanding of CPU caches. Most software developers don't properly understand the effects of cache sets, nor things like "write-miss" penalties of copy-back caches. Knowing about such details can enable you to write code that's just a little (or very occasionally, a lot) better.
                    Yeah, just knowing about the CPU cache and spatial/temporal locality is already enough for most tasks.



                    • #40
                      Originally posted by NobodyXu View Post
                      Yeah, just knowing about the CPU cache and spatial/temporal locality is already enough for most tasks.
                      I think it's good for people to understand caches at a functional level. They're not magic. A CPU with 64 kB of L1D cache is not necessarily going to contain the last 64 kB of stuff you touched. In fact, it probably won't, and sometimes even much less than that!

                      Here's the short list I'd recommend working programmers know about:
                      1. Cache sets & eviction
                      2. What's a copy-back cache and why it matters
                      3. Cache coherency and its implications
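
                      And since item 3 is the one that bites people most often in practice, here's a tiny sketch of its classic symptom, false sharing (my own example, not anything from this article's benchmarks): two independent counters that happen to share a cache line ping-pong between the cores' caches via the coherency protocol, and simply padding them apart fixes it.
                      Code:
                      #include <atomic>
                      #include <functional>
                      #include <thread>

                      struct Counters {
                          std::atomic<long> a{0};
                          // char pad[64];            // uncomment to give `b` its own 64-byte cache line
                          std::atomic<long> b{0};     // (alignas(64) on each member also works)
                      };

                      int main() {
                          Counters c;
                          auto bump = [](std::atomic<long>& counter) {
                              for (int i = 0; i < 50'000'000; ++i)
                                  counter.fetch_add(1, std::memory_order_relaxed);
                          };
                          // Each thread only touches its own counter, yet without the padding they
                          // contend on the shared cache line and run several times slower.
                          std::thread t1(bump, std::ref(c.a));
                          std::thread t2(bump, std::ref(c.b));
                          t1.join();
                          t2.join();
                          return (c.a.load() == c.b.load()) ? 0 : 1;
                      }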
