Linux Kernel Orphans Itanium Support, Linus Torvalds Acknowledges Its Death
Originally posted by coder:
I'm no hardware engineer, but I gather the things they're most concerned with optimizing are:
- reducing critical path
- reducing number of pipeline stages
- reducing energy consumption
- reducing latency
- reducing bugs
- reducing time-to-market
#5 and #6 argue for simplicity, since each additional feature has to be tested and can interact with other features to increase test cases and failure modes in a nonlinear fashion.
So, their challenge is to do a cost-benefit tradeoff and ensure that each bit of additional logic that implements an optimization "pulls its weight, and then some".
Also, while transistors are cheap, they're still not free. If they took the view that they're free, they would end up with bigger, more expensive cores (like Intel). So, optimizations that are only a slight net-win should probably be cut.
I'd just say that if you end up eliminating enough instructions from OoO execution, you can reduce energy consumption (and no, I'm not without bias here).
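To make that concrete, here is a toy Python sketch of load elimination via memory renaming. The trace, addresses, and register names are hypothetical, and this is a simplification of the idea rather than any shipping implementation:

```python
# Toy model of memory renaming: a store records which physical
# register holds the value for an address; a later load from the same
# address is satisfied by a register rename instead of being issued
# through the out-of-order engine, saving execution energy.

def simulate(ops):
    """ops: list of ('store', addr, reg) or ('load', addr).
    Returns (issued_loads, eliminated_loads)."""
    rename_table = {}          # addr -> physical register holding the value
    issued = eliminated = 0
    for op in ops:
        if op[0] == 'store':
            _, addr, reg = op
            rename_table[addr] = reg   # remember the producing register
        else:
            _, addr = op
            if addr in rename_table:
                eliminated += 1        # load becomes a register rename
            else:
                issued += 1            # must go through the OoO engine
    return issued, eliminated

# A register spill (store) followed by its reload is the classic case:
trace = [('store', 0x100, 'p7'),   # spill r7 to the stack
         ('load', 0x200),          # unrelated load: must issue
         ('load', 0x100)]          # reload of the spill: eliminated
print(simulate(trace))             # -> (1, 1)
```

Every eliminated load in this model is an instruction that never occupies an issue slot or wakes up the scheduler, which is where the energy saving comes from.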
Originally posted by vladpetric:
B. With all due respect to Dr. Fog, his reverse engineering is very scientific, but his guesses as to the reasons are not. Yes, if they didn't put it in, it's because they didn't think it was worth it. But we're really not in the 1990s when it comes to transistor budgets.

- reducing critical path
- reducing number of pipeline stages
- reducing energy consumption
- reducing latency
- reducing bugs
- reducing time-to-market
#5 and #6 argue for simplicity, since each additional feature has to be tested and can interact with other features to increase test cases and failure modes in a nonlinear fashion.
So, their challenge is to do a cost-benefit tradeoff and ensure that each bit of additional logic that implements an optimization "pulls its weight, and then some".
Also, while transistors are cheap, they're still not free. If they took the view that they're free, they would end up with bigger, more expensive cores (like Intel). So, optimizations that are only a slight net-win should probably be cut.
Last edited by coder; 05 February 2021, 01:02 AM.
Originally posted by vladpetric:
A. That's too bad.
B. With all due respect to Dr. Fog, his reverse engineering is very scientific, but his guesses as to the reasons are not. Yes, if they didn't put it in, it's because they didn't think it was worth it. But we're really not in the 1990s when it comes to transistor budgets.
See this for instance: https://citeseerx.ist.psu.edu/viewdo...=rep1&type=pdf
"With one billion transistors, 60 million transistors will be allocated to the execution core, 240 million to the trace cache, 48 million to the branch predictor, 32 million to the data caches, and 640 million to the second-level caches"
These are ballpark figures, yes. But a symbolic table to track renames is several orders of magnitude smaller in terms of required transistors.
I've seen electronic scans of such CPUs where the cache is clearly visible, and 90% sounds about right.
Originally posted by jabl:
Agner just published an update to his uarch document with Zen 3 details https://www.agner.org/optimize/microarchitecture.pdf, and it seems the memory mirroring feature has been removed:
"The advanced feature of mirroring memory operands inside the CPU that we saw in the Zen 2 (page 220) is no longer implemented in the Zen 3. This feature was probably very expensive in terms of hardware complexity and temporary registers. The feature was mostly useful in 32-bit mode where memory operands are more common. It is less valuable in 64-bit mode where temporary variables and function parameters are mostly stored in registers. Therefore, it makes sense to prioritize the hardware budget for other improvements instead."
B. With all due respect to Dr. Fog, his reverse engineering is very scientific, but his guesses as to the reasons are not. Yes, if they didn't put it in, it's because they didn't think it was worth it. But we're really not in the 1990s when it comes to transistor budgets.
See this for instance: https://citeseerx.ist.psu.edu/viewdo...=rep1&type=pdf
"With one billion transistors, 60 million transistors will be allocated to the execution core, 240 million to the trace cache, 48 million to the branch predictor, 32 million to the data caches, and 640 million to the second-level caches"
These are ballpark figures, yes. But a symbolic table to track renames is several orders of magnitude smaller in terms of required transistors.
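The "several orders of magnitude" claim is easy to sanity-check with back-of-envelope arithmetic. The table dimensions below are assumed, illustrative values (a 64-entry table of address tags plus physical register IDs, built from standard 6-transistor SRAM cells), not a real design:

```python
# Back-of-envelope comparison (assumed, illustrative sizes): a small
# rename-tracking table versus the 640M transistors the cited paper
# allocates to second-level caches.

SRAM_CELL_TRANSISTORS = 6   # standard 6T SRAM cell
entries = 64                # assumed number of table entries
bits_per_entry = 64 + 8     # assumed: address tag + physical register id

table_transistors = entries * bits_per_entry * SRAM_CELL_TRANSISTORS
l2_transistors = 640_000_000

print(table_transistors)                    # -> 27648
print(l2_transistors // table_transistors)  # -> 23148, i.e. 4+ orders of magnitude
```

Even if the assumed table were ten times larger, it would still be a rounding error next to the cache budget.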
Originally posted by vladpetric:
Thankfully, stores for the most part don't get in the way (or, to use a more formal term, they're not on the critical path). The reason is that, for the most part, nobody waits for them (there are cases where that's not true, e.g. if a structure such as the store queue gets full and there's backpressure). Loads, however, are different: in the vast majority of cases, another instruction waits for them to complete.
Anyway, found this and I'm excited https://www.agner.org/forum/viewtopic.php?t=41
Thanks for mentioning this.
"The advanced feature of mirroring memory operands inside the CPU that we saw in the Zen 2 (page 220) is no longer implemented in the Zen 3. This feature was probably very expensive in terms of hardware complexity and temporary registers. The feature was mostly useful in 32-bit mode where memory operands are more common. It is less valuable in 64-bit mode where temporary variables and function parameters are mostly stored in registers. Therefore, it makes sense to prioritize the hardware budget for other improvements instead."
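The load-versus-store asymmetry described above can be sketched as a critical-path computation over a tiny dataflow graph. The latencies and the four-instruction program are illustrative assumptions, not measurements of any real core:

```python
# Critical path of a toy dependence graph (assumed latencies): a
# load's latency extends the path of everything that consumes it,
# while a store that nothing waits on sits off the critical path.

LAT = {'load': 4, 'add': 1, 'store': 1}

def finish_times(instrs):
    """instrs: list of (name, op, deps). Returns name -> finish time,
    assuming each instruction starts when its last dependency finishes."""
    finish = {}
    for name, op, deps in instrs:
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + LAT[op]
    return finish

prog = [
    ('i0', 'load',  []),       # consumers must wait 4 cycles for this
    ('i1', 'add',   ['i0']),   # stalls on the load
    ('i2', 'add',   ['i1']),
    ('i3', 'store', ['i2']),   # nobody reads i3's result
]
f = finish_times(prog)
print(f['i2'])   # -> 6: the load's 4 cycles dominate the chain
print(f['i3'])   # -> 7, but no instruction depends on it
```

Shaving a cycle off the store changes nothing downstream; shaving a cycle off the load shortens every consumer's start time, which is exactly why loads get the architectural attention.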
Originally posted by coder:
That's cool, but on the other hand, the overhead of tracking that stuff burns power and silicon.
So, to paraphrase your earlier statement, the best way to optimize a register spill is to avoid the need for it, in the first place.
Then there's the aspect of energy saving: if you don't send an instruction through the out-of-order engine and just handle it via register renaming, you actually end up saving energy.
Of course, it's best if you don't have the spills to begin with, but with 15 GPRs you oftentimes don't have a choice. The sweet spot is at around ~31 GPRs, as some other posts have said.
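The 15-versus-31-GPR point can be illustrated with a toy register-pressure model. The live ranges below are made up for the example; the model just counts how many values are simultaneously live and assumes anything beyond the register count must be spilled:

```python
# Toy register-pressure model (illustrative): given live ranges as
# (start, end) intervals, any point where more values are live than
# there are architectural registers forces spills to memory.

def spills_needed(intervals, num_regs):
    events = []
    for s, e in intervals:
        events.append((s, 1))    # value becomes live
        events.append((e, -1))   # value dies
    live = peak = 0
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return max(0, peak - num_regs)

# 20 simultaneously-live values, e.g. an unrolled loop body:
ranges = [(0, 100)] * 20
print(spills_needed(ranges, 15))  # -> 5: five values must live in memory
print(spills_needed(ranges, 31))  # -> 0: everything fits in registers
```

Real allocators are smarter (splitting, rematerialization), but the basic shape holds: pressure above the register count turns directly into spill stores and reload loads.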
Originally posted by vladpetric:
"It can even track an address on the stack while compensating for changes in the stack pointer across push, pop, call, and return instructions"
I do not know what kind of implementation they use, but my proposal (RENO) allows specifically for the above ^^^ (tracking on the stack).
So, to paraphrase your earlier statement, the best way to optimize a register spill is to avoid the need for it, in the first place.
Originally posted by jabl:
It's an interesting idea, sure, but getting back to Itanium, wasn't the software speculation a big reason why IA-64 code density was so poor, necessitating large caches and consuming a lot of BW?
Originally posted by jabl:
That could be interesting, yes. Something like this is supported by Nvidia GPUs; in CUDA you can partition the local SRAM between L1 cache and scratchpad memory.
Originally posted by jabl:
I believe many CPUs also have some mode for using cache as memory, used by the bootup code to initialize the RAM controllers. But I guess that mode is hidden from usage later on as the system starts.
I think stack frames probably provide a good conceptual model for using and managing scratchpad RAM. As necessary, new frames can be allocated and freed, resulting in earlier frames being spilled to RAM. That way, you don't have to do a full context-switch on every interrupt or call to a library function.
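That frame-based model can be sketched in a few lines. Everything here is a hypothetical design (names, sizes, and the oldest-frame spill policy are assumptions, not an existing API): the scratchpad holds a stack of frames, and allocating past capacity spills the oldest frames to backing RAM, so an interrupt or library call only costs what it actually uses:

```python
# Sketch of a frame-based scratchpad (hypothetical design): resident
# frames live in fast SRAM; when capacity runs out, the oldest frames
# are spilled to backing RAM instead of doing a full context switch.

from collections import deque

class Scratchpad:
    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = deque()    # resident frames, oldest on the left
        self.ram = []            # names of frames spilled to RAM
        self.used = 0

    def alloc(self, name, size):
        # Spill oldest resident frames until the new one fits.
        while self.used + size > self.capacity:
            old_name, old_size = self.frames.popleft()
            self.ram.append(old_name)
            self.used -= old_size
        self.frames.append((name, size))
        self.used += size

    def free(self):
        # Newest frame is freed in place (reloading spilled parents
        # on demand is omitted from this sketch).
        name, size = self.frames.pop()
        self.used -= size
        return name

sp = Scratchpad(capacity=8)
sp.alloc('main', 4)
sp.alloc('lib_call', 3)      # fits alongside main
sp.alloc('irq_handler', 4)   # forces 'main' to spill to RAM
print(sp.ram)                # -> ['main']
print(sp.free())             # -> 'irq_handler'
```

Note the interrupt handler only displaced as much resident state as it needed; the library-call frame stayed put, which is the advantage over saving and restoring everything.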
Originally posted by coder:
I just did a quick web search and literally just looked at the first page of results (there weren't many more). My search query was something like "CPU memory-renaming".
Good explanation. That term sounds familiar. Agreed that it's harder to optimize away the stores, since you'd need to be certain that the stored value has no consumers other than the load that was optimized away, and removing the store could mess up debugging.
"It can even track an address on the stack while compensating for changes in the stack pointer across push, pop, call, and return instructions"
I do not know what kind of implementation they use, but my proposal (RENO) allows specifically for the above ^^^ (tracking on the stack).