GCC 12 Profile Guided Optimization Benchmarks With The AMD Threadripper 3990X
-
Originally posted by NobodyXu: Yeah, just knowing about the CPU cache and spatial/temporal locality is already enough for most tasks.
Here's the short list I'd recommend working programmers know about:
- Cache sets & eviction
- What's a copy-back cache and why it matters
- Cache coherency and its implications
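To make the spatial-locality point concrete, here's a minimal sketch (nothing fancy, just the classic example): summing a big matrix along rows walks memory sequentially and reuses each cache line, while summing along columns touches a new line on almost every access once the matrix outgrows the caches.

    #include <stdio.h>
    #include <time.h>

    #define N 4096

    /* Sum a large matrix twice: once along rows (cache-friendly, consecutive
     * addresses share cache lines) and once along columns (each access lands
     * on a different cache line, so spatial locality is lost). */
    int main(void)
    {
        static int m[N][N];            /* ~64 MiB, far larger than any cache */
        long sum = 0;
        clock_t t;

        t = clock();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];        /* row-major: sequential memory walk */
        printf("row-major:    %.3fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        t = clock();
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];        /* column-major: 16 KiB stride per access */
        printf("column-major: %.3fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        return (int)(sum & 1);         /* keep the sums from being optimized away */
    }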
-
Originally posted by coder: Their cloud GPUs support a priori segmentation of the GPU resources for multiple clients. I think I recall hearing AMD claim support for something like a max of 32 clients in some recent generation of server GPU cards.
Originally posted by coder: I've found descriptions that might even have enough detail to answer some of my questions, but I lack the working knowledge of microelectronics to grasp the explanations in sufficient depth.
Anyway, the "why" isn't as important as the "what", for us. We just need to know how the system behaves, and then design our code around that behavior. With that said, it's worth having a good understanding of CPU caches. Most software developers don't properly understand the effects of cache sets, nor things like "write-miss" penalties of copy-back caches. Knowing about such details can enable you to write code that's just a little (or very occasionally, a lot) better.
-
Originally posted by NobodyXu: So GPUs have thread-level preemption because it is too hard for them to save and reload all the register files.
They also have multiple queues with different priorities, and multiple cores, to mitigate the issue of one thread taking too long.
This might be enough for desktop/laptop users, though for cloud providers I suspect it is far from enough, as the number of queues is very limited and latency is still determined by the length of an individual thread, which can be very long in the cloud.
Originally posted by NobodyXu: I also have no idea, as I am not a hardware engineer and honestly the best I can do is just googling.
Anyway, the "why" isn't as important as the "what", for us. We just need to know how the system behaves, and then design our code around that behavior. With that said, it's worth having a good understanding of CPU caches. Most software developers don't properly understand the effects of cache sets, nor things like "write-miss" penalties of copy-back caches. Knowing about such details can enable you to write code that's just a little (or very occasionally, a lot) better.
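As a rough illustration of the write-miss point (my own sketch, assuming an x86 machine with SSE2; it's not from any particular codebase): on a copy-back cache, an ordinary store to an uncached line first reads the whole line in, so filling a large buffer costs roughly double the memory traffic. Non-temporal stores write around the cache when you know you won't read the data back soon.

    #include <emmintrin.h>   /* SSE2 intrinsics: _mm_stream_si128, _mm_sfence */
    #include <stdlib.h>

    /* Fill a large buffer two ways. The plain loop suffers "write misses":
     * each store to an uncached line first pulls that line into the
     * copy-back cache, doubling the memory traffic. The streaming version
     * writes around the cache, so nothing is read and the cache isn't
     * polluted with data we'll never load again. */
    static void fill_cached(char *dst, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = 0;                              /* ordinary cached stores */
    }

    static void fill_streaming(char *dst, size_t n)  /* dst must be 16-byte aligned */
    {
        __m128i zero = _mm_setzero_si128();
        for (size_t i = 0; i + 16 <= n; i += 16)
            _mm_stream_si128((__m128i *)(dst + i), zero);  /* non-temporal store */
        _mm_sfence();                                /* order streamed stores */
    }

    int main(void)
    {
        size_t n = (size_t)256 << 20;                /* 256 MiB */
        char *buf = aligned_alloc(16, n);
        if (!buf)
            return 1;
        fill_cached(buf, n);                         /* lines read in, then overwritten */
        fill_streaming(buf, n);                      /* write-only, no read traffic */
        free(buf);
        return 0;
    }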
-
So GPUs have thread-level preemption because it is too hard for them to save and reload all the register files.
They also have multiple queues with different priorities, and multiple cores, to mitigate the issue of one thread taking too long.
This might be enough for desktop/laptop users, though for cloud providers I suspect it is far from enough, as the number of queues is very limited and latency is still determined by the length of an individual thread, which can be very long in the cloud.
Originally posted by coder: I could've told you that much. What I don't know is the entire set of underlying differences which support those advantages. At a fundamental level, does it work the same as regular DDRx DRAM, or did they perhaps change its operation in some way that sacrifices density in the interest of performance?
-
Originally posted by NobodyXu: For cloud use cases, they do need separation, and that is particularly hard for a GPU, because it doesn't support a scheduler like the CPU's, which can force a task to give up its time slice and switch to another task.
https://mynameismjp.wordpress.com/20...pu-preemption/
Originally posted by NobodyXu: From the results of googling, GDDR has higher bandwidth and lower power consumption.
-
Originally posted by coder: They increasingly do need security features to keep them from accessing the wrong userspace memory, or even possibly data in GPU memory belonging to another process.
Starting with Vega, AMD introduced the notion of GPU memory simply acting as a cache. IIRC, they added a hardware block called the HBCC (High-Bandwidth Cache Controller), which sounds to me like it has a lot in common with the MMU of a CPU. I don't know as much about other GPUs in this respect, but for some cloud use cases, I believe they support something similar.
I think the equivalent of ring-0 for GPUs basically employs special on-die micro-controllers that don't run user code.
BTW, I'm also told GPUs don't support instruction traps.
This means they can't run arbitrary code without worrying about the halting problem (resource exhaustion).
An alternative way to fix that is to use something like eBPF, whose verifier avoids the infinite-loop problem but limits certain features (I heard it also disallows loops for now), while still being able to compile eBPF down to regular CPU instructions.
But even with something like that, the code can still take longer than expected (overrun its budget), and it is just far more robust to use a scheduler model like the CPU's.
Having an MMU is also necessary for that abstraction, but IMHO having the ability to control a task's GPU timeslice and terminate it whenever you want is probably even more urgent and important.
Maybe the best way to solve this is to integrate the GPU into the CPU, making it a subcomponent and letting the CPU do the hard work of context switching and scheduling.
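To illustrate why the budget approach is so fragile compared with a real scheduler, here's a hypothetical sketch (pure CPU-side C, nothing GPU-specific): without hardware preemption, the best a runtime can do is ask each kernel to poll a flag and yield voluntarily. If the kernel forgets to poll, or the interval between checks is too long, the time slice is overrun and nothing can be done about it.

    #include <stdatomic.h>
    #include <stddef.h>

    /* A long-running "kernel" that can only be stopped if it cooperates:
     * it must poll should_yield and bail out on its own. A CPU scheduler
     * needs no such cooperation -- a timer interrupt preempts the task
     * whether or not it ever checks anything. */
    static atomic_bool should_yield;     /* set by a watchdog/scheduler thread */

    static void long_kernel(float *data, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            data[i] = data[i] * 2.0f + 1.0f;         /* the actual work */

            /* Cooperative check: cheap, but entirely voluntary. If this check
             * were missing, or the loop body were one huge matrix multiply,
             * the time budget would be overrun with no recourse. */
            if ((i & 0xFFFF) == 0 &&
                atomic_load_explicit(&should_yield, memory_order_relaxed))
                return;                              /* record i and resume later */
        }
    }

    int main(void)
    {
        static float data[1 << 20];
        long_kernel(data, 1 << 20);      /* runs to completion here; a watchdog
                                            thread would set should_yield to stop it */
        return 0;
    }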
Originally posted by coder: I don't know quite how much that slows down accesses, but it sounds like it can't help!
The main thing to keep in mind about DRAM is that it's designed to prioritize density above all else. It would be interesting to know if there's any meaningful difference between regular DDR and GDDR memory, other than the ability for the former to sit on DIMMs.
-
Originally posted by NobodyXu: Yeah, but they still don't need something like rings 0/1/2/3 and hypervisor support, or paging, virtual memory, etc.
Starting with Vega, AMD introduced the notion of GPU memory simply acting as a cache. IIRC, they added a hardware block called the HBCC (High-Bandwidth Cache Controller), which sounds to me like it has a lot in common with the MMU of a CPU. I don't know as much about other GPUs in this respect, but for some cloud use cases, I believe they support something similar.
I think the equivalent of ring-0 for GPUs basically employs special on-die micro-controllers that don't run user code.
BTW, I'm also told GPUs don't support instruction traps.
Originally posted by NobodyXu: That's definitely interesting; it sounds like a step towards better integration of the CPU and accelerators like the GPU.
Sapphire Rapids is going to launch so late that its successor will begin ramping in the same year (2023), and such delays have no doubt had a significant impact on Intel's recent and anticipated financial under-performance. I think Intel hasn't done this poorly since the days of the Pentium 4.
Originally posted by NobodyXu: Googling seems to tell me that every time DRAM is accessed, its contents are destroyed and it has to rewrite them.
The main thing to keep in mind about DRAM is that it's designed to prioritize density above all else. It would be interesting to know if there's any meaningful difference between regular DDR and GDDR memory, other than the ability for the former to sit on DIMMs.
-
Originally posted by coder: The fact that they're thread-based means there's definitely some form of context-switching.
However, the main area savings come from having no OoO execution or branch prediction.
Originally posted by coder: In the context of this discussion, what's interesting about it is that it seems less closely coupled to the CPU core than other recent extensions. I think it acts a lot like a bolt-on accelerator block, except that it reads/writes directly from/to the thread's registers.
Originally posted by coder: I tried to comment further, but I'm really getting out of my depth. However, I think the higher latency of DRAM isn't due to the periodic refresh operations themselves.
Originally posted by coder: I've seen some interesting micro-benchmarks of prefetchers that try to determine what sorts of patterns they can detect. Like branch predictors, they tend not to get discussed much by chip makers. For a long time, they've been able to detect strided and even tiled access patterns.
I think Linus or somebody said that modern prefetchers can even recognize list iteration and prefetch it to a certain degree.
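Here's a crude sketch of the kind of micro-benchmark coder is talking about (my own toy version, not one of the ones he's seen): walking the same memory as a sequential array is prefetch-friendly, while chasing a shuffled permutation of indices gives the prefetcher nothing to latch onto, since the next address isn't known until the current load completes.

    #include <stdio.h>
    #include <stddef.h>
    #include <time.h>

    #define N (1 << 24)                  /* 16M entries, ~128 MiB on 64-bit */

    /* Walk the same memory two ways: as a sequential array (a pattern hardware
     * prefetchers detect easily) and as a pointer chase through a shuffled
     * single-cycle permutation (dependent loads, so prefetching barely helps). */
    static size_t next_idx[N];

    int main(void)
    {
        /* Build a random single-cycle permutation (Sattolo's algorithm),
         * using a small xorshift PRNG so we don't depend on RAND_MAX. */
        for (size_t i = 0; i < N; i++)
            next_idx[i] = i;
        unsigned long long rng = 88172645463325252ull;
        for (size_t i = N - 1; i > 0; i--) {
            rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
            size_t j = (size_t)(rng % i);
            size_t t = next_idx[i]; next_idx[i] = next_idx[j]; next_idx[j] = t;
        }

        clock_t t0 = clock();
        size_t sum = 0;
        for (size_t i = 0; i < N; i++)
            sum += next_idx[i];          /* sequential walk: prefetch-friendly */
        printf("sequential walk: %.3fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

        t0 = clock();
        size_t p = 0;
        for (size_t i = 0; i < N; i++)
            p = next_idx[p];             /* dependent loads: each miss waits on the last */
        printf("pointer chase:   %.3fs (checksum %zu %zu)\n",
               (double)(clock() - t0) / CLOCKS_PER_SEC, sum, p);
        return 0;
    }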
Originally posted by coder: One place this can be important is in the design of memory allocators. For instance, how much work should you do (and up to what size) to ensure that successive allocations tend to occur at contiguous memory addresses? This also has to be balanced against preferring that allocations occur in "hot" memory addresses that are likely nearer in the cache hierarchy.
-
Originally posted by NobodyXu: The GPU is much simpler than the CPU, e.g. no context switching, no out-of-order execution, no branch prediction, no I/O, no interrupts.
However, the main area savings come from having no OoO execution or branch prediction.
Originally posted by NobodyXu: IMHO AMX is mostly for data mining, scientific computing, and running inference for ML models, so that the client does not need to buy those expensive GPUs.
Originally posted by NobodyXu: I must have been carried away by the crowd saying how amazing the M1 is.
Originally posted by NobodyXu: I remember that it is because DRAM requires refreshing all its cells periodically.
Originally posted by NobodyXu: Yeah, I missed that.
One place this can be important is in the design of memory allocators. For instance, how much work should you do (and up to what size) to ensure that successive allocations tend to occur at contiguous memory addresses? This also has to be balanced against preferring that allocations occur in "hot" memory addresses that are likely nearer in the cache hierarchy.
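A tiny sketch of the kind of allocator decision I mean (a hypothetical arena/bump allocator, not any real implementation): it trivially guarantees that successive allocations are contiguous, which is exactly the locality property above, at the cost of only being able to free everything at once.

    #include <stddef.h>
    #include <stdlib.h>

    /* A bump ("arena") allocator: successive allocations come from one
     * contiguous block, so objects allocated together end up packed next
     * to each other -- adjacent cache lines, adjacent pages. The trade-off
     * is that individual objects can't be freed, only the whole arena. */
    typedef struct {
        char  *base;
        size_t used;
        size_t cap;
    } arena_t;

    static int arena_init(arena_t *a, size_t cap)
    {
        a->base = malloc(cap);
        a->used = 0;
        a->cap  = cap;
        return a->base ? 0 : -1;
    }

    static void *arena_alloc(arena_t *a, size_t size)
    {
        size_t aligned = (a->used + 15) & ~(size_t)15;   /* 16-byte align the offset */
        if (aligned + size > a->cap)
            return NULL;                                 /* arena exhausted */
        a->used = aligned + size;
        return a->base + aligned;
    }

    static void arena_free_all(arena_t *a)
    {
        free(a->base);
        a->base = NULL;
        a->used = a->cap = 0;
    }

    int main(void)
    {
        arena_t a;
        if (arena_init(&a, 1 << 20) != 0)
            return 1;
        /* Objects built in one pass land at consecutive, cache-friendly
         * addresses instead of wherever a general-purpose heap puts them. */
        int *x = arena_alloc(&a, sizeof *x);
        int *y = arena_alloc(&a, sizeof *y);
        (void)x; (void)y;
        arena_free_all(&a);
        return 0;
    }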