Linux Kernel Orphans Itanium Support, Linus Torvalds Acknowledges Its Death

  • uxmkt
    replied
    Originally posted by coder View Post
    The problem is that it scales very poorly as you try to decode more instructions in parallel. This is because the start of each instruction depends on the lengths of the previous ones.
    Good thing RISC-V has fix^W^W oh fuck it, now they have to deal with 2- and 4-byte instructions too!
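    To spell out the serial dependency coder is describing, here's a toy sketch in C (purely illustrative; the length rule shown is the RISC-V one, where the two low bits pick 2- vs 4-byte encodings, and real x86 length decoding is far messier):

        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Toy length decoder using the RISC-V rule: if the two low bits of the
         * first 16-bit parcel are 11, it's a 32-bit instruction, otherwise a
         * 16-bit compressed one. */
        static size_t insn_length(const uint8_t *p)
        {
            return ((p[0] & 0x3) == 0x3) ? 4 : 2;
        }

        /* Variable-length: boundary i+1 depends on boundary i -- a serial chain,
         * which is exactly what makes wide parallel decode hard. */
        static void find_boundaries_variable(const uint8_t *code, size_t n, size_t *start)
        {
            size_t off = 0;
            for (size_t i = 0; i < n; i++) {
                start[i] = off;
                off += insn_length(code + off);  /* must finish before the next iteration */
            }
        }

        /* Fixed-length (e.g. 4 bytes): every boundary is known up front, so a
         * wide decoder can look at all slots independently. */
        static void find_boundaries_fixed(size_t n, size_t *start)
        {
            for (size_t i = 0; i < n; i++)
                start[i] = 4 * i;
        }

        int main(void)
        {
            /* Two 2-byte (compressed) instructions followed by one 4-byte one. */
            uint8_t code[] = { 0x01, 0x00, 0x01, 0x00, 0x13, 0x00, 0x00, 0x00 };
            size_t var_start[3], fix_start[3];
            find_boundaries_variable(code, 3, var_start);
            find_boundaries_fixed(3, fix_start);
            for (int i = 0; i < 3; i++)
                printf("insn %d: variable offset %zu, fixed offset %zu\n",
                       i, var_start[i], fix_start[i]);
            return 0;
        }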



  • vladpetric
    replied
    Originally posted by coder View Post
    I'm no hardware engineer, but I gather the things they're most concerned with optimizing are:
    1. reducing critical path
    2. reducing number of pipeline stages
    3. reducing energy consumption
    4. reducing latency
    5. reducing bugs
    6. reducing time-to-market
    #3 isn't just about raw transistor count; I think it also has to do with how often they switch. Consequently, a transistor that's often on the critical path works against more of those goals than a transistor sitting in cache memory. Also, cache lines with defects can probably be disabled, unlike logic elsewhere in a core.

    #5 and #6 argue for simplicity, since each additional feature has to be tested and can interact with other features to increase test cases and failure modes in a nonlinear fashion.

    So, their challenge is to do a cost-benefit tradeoff and ensure that each bit of additional logic that implements an optimization "pulls its weight, and then some".

    Also, while transistors are cheap, they're still not free. If they took the view that they're free, they would end up with bigger, more expensive cores (like Intel). So, optimizations that are only a slight net-win should probably be cut.
    Yeap.

    I'd just say that if you end up eliminating enough instructions from OoO execution, you can reduce energy consumption (and no, I'm not without bias here).



  • coder
    replied
    Originally posted by vladpetric View Post
    B. With all due respect to Dr. Fog, his reverse engineering is super scientific, but his guesses as to the reasons are not. Yes, if they didn't put it in, it's because they didn't think it was worth it. But we're really not in the 1990s when it comes to transistor budgets.
    I'm no hardware engineer, but I gather the things they're most concerned with optimizing are:
    1. reducing critical path
    2. reducing number of pipeline stages
    3. reducing energy consumption
    4. reducing latency
    5. reducing bugs
    6. reducing time-to-market
    #3 isn't just about raw transistor count; I think it also has to do with how often they switch. Consequently, a transistor that's often on the critical path works against more of those goals than a transistor sitting in cache memory. Also, cache lines with defects can probably be disabled, unlike logic elsewhere in a core.

    #5 and #6 argue for simplicity, since each additional feature has to be tested and can interact with other features to increase test cases and failure modes in a nonlinear fashion.

    So, their challenge is to do a cost-benefit tradeoff and ensure that each bit of additional logic that implements an optimization "pulls its weight, and then some".

    Also, while transistors are cheap, they're still not free. If they took the view that they're free, they would end up with bigger, more expensive cores (like Intel). So, optimizations that are only a slight net-win should probably be cut.
    Last edited by coder; 05 February 2021, 01:02 AM.



  • Weasel
    replied
    Originally posted by vladpetric View Post
    A. That's too bad.

    B. With all due respect to Dr. Fog, his reverse engineering is super scientific, but his guesses as to the reasons are not. Yes, if they didn't put it in, it's because they didn't think it was worth it. But we're really not in the 1990s when it comes to transistor budgets.

    See this for instance: https://citeseerx.ist.psu.edu/viewdo...=rep1&type=pdf

    "With one billion transistors, 60 million transistors will be allocated to the execution core, 240 million to the trace cache, 48 million to the branch predictor, 32 million to the data caches, and 640 million to the second-level caches"

    These are ballpark figures, yes. But a symbolic table to track renames is several orders of magnitude smaller in terms of required transistors.
    Sounds about right. 90% of transistors are in the cache. This is why I cringe when I see people trying to optimize the transistor budget for the 5-10% (especially people who parrot RISC bullshit).

    I've seen electronic scans of such CPUs where the cache is clearly visible, and 90% sounds about right.
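    Doing the arithmetic on the quoted budget: 240M (trace cache) + 32M (data caches) + 640M (L2) = 912M out of 1,000M transistors, i.e. roughly 91% sitting in caches, which lines up with the ~90% figure.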



  • vladpetric
    replied
    Originally posted by jabl View Post

    Agner just published an update to his uarch document with Zen 3 details (https://www.agner.org/optimize/microarchitecture.pdf), and it seems the memory mirroring feature has been removed:

    "The advanced feature of mirroring memory operands inside the CPU that we saw in the Zen 2 (page 220) is no longer implemented in the Zen 3. This feature was probably very expensive in terms of hardware complexity and temporary registers. The feature was mostly useful in 32-bit mode where memory operands are more common. It is less valuable in 64-bit mode where temporary variables and function parameters are mostly stored in registers. Therefore, it makes sense to prioritize the hardware budget for other improvements instead."
    A. That's too bad.

    B. With all due respect to Dr. Fog, his reverse engineering is super scientific, but his guesses as to the reasons are not. Yes, if they didn't put it in, it's because they didn't think it was worth it. But we're really not in the 1990s when it comes to transistor budgets.

    See this for instance: https://citeseerx.ist.psu.edu/viewdo...=rep1&type=pdf

    "With one billion transistors, 60 million transistors will be allocated to the execution core, 240 million to the trace cache, 48 million to the branch predictor, 32 million to the data caches, and 640 million to the second-level caches"

    These are ballpark figures, yes. But a symbolic table to track renames is several orders of magnitude smaller in terms of required transistors.



  • jabl
    replied
    Originally posted by vladpetric View Post

    Thankfully, stores for the most part don't get in the way (or, to use a more formal term, they're not on the critical path). The reason is that, for the most part, nobody waits for them (there are cases where that's not true, e.g. if a structure such as the store queue gets full and there's backpressure). Loads, however, are different: in the vast majority of cases, another instruction waits for them to complete.

    Anyway, found this and I'm excited https://www.agner.org/forum/viewtopic.php?t=41

    Thanks for mentioning this.
    Agner just published an update to his uarch document with Zen 3 details (https://www.agner.org/optimize/microarchitecture.pdf), and it seems the memory mirroring feature has been removed:

    "The advanced feature of mirroring memory operands inside the CPU that we saw in the Zen 2 (page 220) is no longer implemented in the Zen 3. This feature was probably very expensive in terms of hardware complexity and temporary registers. The feature was mostly useful in 32-bit mode where memory operands are more common. It is less valuable in 64-bit mode where temporary variables and function parameters are mostly stored in registers. Therefore, it makes sense to prioritize the hardware budget for other improvements instead."



  • vladpetric
    replied
    Originally posted by coder View Post
    That's cool, but on the other hand, the overhead of tracking that stuff burns power and silicon.

    So, to paraphrase your earlier statement, the best way to optimize a register spill is to avoid the need for it, in the first place.
    A symbolic table is not that expensive from a power perspective. What do I mean by symbolic? It only needs to track register names (if you have 256 physical registers, that's 8 bits) and offsets (how large ... well, I would guess another maybe 6-8 bits). That's really small (it really doesn't cost that much, compared to a single value).

    Then there's the aspect of energy saving. If you don't send an instruction through the out-of-order engine, and just handle it via a register renaming, you end up saving energy actually.

    Of course, it's best if you don't have the spills to begin with, but with 15 GPRs you often don't have a choice. The sweet spot is around ~31 GPRs, as some other posts have said.
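    Rough sketch in C of what I mean by symbolic (purely illustrative; this is not AMD's or RENO's actual structure, and the field widths are just the estimates above):

        #include <stdint.h>
        #include <stdio.h>

        /* One entry of a hypothetical symbolic memory-renaming table: it names
         * which physical register currently mirrors a stack slot, plus a small
         * offset -- it never stores the 64-bit value itself. */
        struct mem_rename_entry {
            uint8_t phys_reg;    /* 8 bits: index into 256 physical registers      */
            int8_t  sp_offset;   /* ~6-8 bits: small offset from the stack pointer */
            uint8_t valid;       /* conceptually a single bit                      */
        };

        int main(void)
        {
            /* The payload is ~16 bits of bookkeeping per tracked slot,
             * versus 64+ bits if you had to hold the value itself. */
            printf("entry size: %zu bytes\n", sizeof(struct mem_rename_entry));
            return 0;
        }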



  • coder
    replied
    Originally posted by vladpetric View Post
    "It can even track an address on the stack while compensating for changes in the stack pointer across push, pop, call, and return instructions"

    I do not know what kind of implementation they use, but my proposal (RENO) allows specifically for the above ^^^ (tracking on the stack).
    That's cool, but on the other hand, the overhead of tracking that stuff burns power and silicon.

    So, to paraphrase your earlier statement, the best way to optimize a register spill is to avoid the need for it, in the first place.



  • coder
    replied
    Originally posted by jabl View Post
    It's an interesting idea, sure, but getting back to Itanium, wasn't the software speculation a big reason why IA-64 code density was so poor, necessitating large caches and consuming a lot of BW?
    Maybe compress the instruction stream and decompress in the memory controller or on I-Cache fills?

    Originally posted by jabl View Post
    That could be interesting, yes. Something like this is supported by Nvidia GPUs; in CUDA you can partition the local SRAM between L1 cache and scratchpad memory.
    I have seen that, but I still think there might be some interesting opportunities if software had more control over the cache. But mainly, what I'm after is for CPUs to have some directly-addressable scratchpad memory, yes.

    Originally posted by jabl View Post
    I believe many CPUs also have some mode for using cache as memory, used by the boot code to initialize the RAM controllers. But I guess that mode is hidden from use later on as the system starts.
    Well, not something that can be used in normal operation, as you say. Or, I've at least never heard of it, if so. And can they really bypass the tag RAM lookups, entirely? The point is to avoid both the latency hit and energy utilization of tag lookups (plus other cache management overhead).

    I think stack frames probably provide a good conceptual model for using and managing scratchpad RAM. As necessary, new frames can be allocated and freed, resulting in earlier frames being spilled to RAM. That way, you don't have to do a full context-switch on every interrupt or call to a library function.
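    Very rough sketch in C of what I have in mind (the names, sizes, and spill policy are all made up, and a real design would presumably do the spilling in hardware): frames are carved out of a small scratchpad like a stack, and when it fills up, the oldest resident frame is spilled to ordinary RAM to make room.

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        #define SCRATCH_SIZE 4096                /* assumed on-chip scratchpad size */

        static uint8_t scratchpad[SCRATCH_SIZE]; /* stand-in for directly addressed SRAM */

        struct frame {
            size_t offset;   /* where this frame lives in the scratchpad      */
            size_t size;
            void  *spill;    /* backing copy in RAM once the frame is evicted */
        };

        static struct frame frames[32];          /* assumed maximum nesting depth */
        static size_t n_frames = 0, top = 0;     /* 'top' = next free scratchpad byte */

        /* Spill the oldest still-resident frame to RAM and compact the rest. */
        static void spill_oldest(void)
        {
            size_t i, victim = n_frames;
            for (i = 0; i < n_frames; i++)
                if (!frames[i].spill) { victim = i; break; }
            if (victim == n_frames)
                return;                          /* everything is already spilled */

            struct frame *f = &frames[victim];
            f->spill = malloc(f->size);          /* error handling omitted */
            memcpy(f->spill, scratchpad + f->offset, f->size);

            /* Slide the newer frames down so the freed bytes become usable. */
            memmove(scratchpad + f->offset,
                    scratchpad + f->offset + f->size,
                    top - (f->offset + f->size));
            for (size_t j = victim + 1; j < n_frames; j++)
                if (!frames[j].spill)
                    frames[j].offset -= f->size;
            top -= f->size;
        }

        /* Allocate a new frame, spilling older ones if the scratchpad is full. */
        static struct frame *frame_alloc(size_t size)
        {
            if (size > SCRATCH_SIZE || n_frames == 32)
                return NULL;
            while (top + size > SCRATCH_SIZE) {
                size_t before = top;
                spill_oldest();
                if (top == before)
                    return NULL;                 /* nothing left to evict */
            }
            struct frame *f = &frames[n_frames++];
            f->offset = top;
            f->size   = size;
            f->spill  = NULL;
            top      += size;
            return f;
        }

        /* Free the most recently allocated frame (LIFO, like stack frames). */
        static void frame_free(struct frame *f)
        {
            if (f->spill)
                free(f->spill);
            else
                top = f->offset;                 /* pop it off the scratchpad */
            n_frames--;
        }

        int main(void)
        {
            struct frame *a = frame_alloc(3000); /* fits                       */
            struct frame *b = frame_alloc(3000); /* forces 'a' to spill to RAM */
            printf("a spilled: %s, b offset: %zu\n", a->spill ? "yes" : "no", b->offset);
            frame_free(b);
            frame_free(a);
            return 0;
        }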



  • vladpetric
    replied
    Originally posted by coder View Post
    I just did a quick web search and literally just looked at the first page of results (there weren't many more). My search query was something like "CPU memory-renaming".


    Good explanation. That term sounds familiar. Agreed that it's harder to optimize away the stores, since you'd need to be certain that a store has no consumers other than the load that was optimized away, and removing it could mess up debugging.
    Sorry, I'm super excited and I barely have an audience :-p :

    "It can even track an address on the stack while compensating for changes in the stack pointer across push, pop, call, and return instructions"

    I do not know what kind of implementation they use, but my proposal (RENO) allows specifically for the above ^^^ (tracking on the stack).
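    Toy C model of the SP-compensation idea (my own illustration of the general mechanism, not how the real hardware does it): the table keys stack slots by an offset relative to a fixed reference stack pointer, and push/pop/call/return only adjust a running bias, so a slot written before a push is still matched by a load after it.

        #include <stdint.h>
        #include <stdio.h>

        #define TABLE_SLOTS 8

        /* One entry of a toy memory-renaming table: a canonical stack offset
         * (relative to a fixed reference SP) plus the physical register that
         * currently holds the last value stored there. */
        struct entry {
            int     valid;
            int64_t canon_off;   /* offset from the reference stack pointer        */
            int     phys_reg;    /* symbolic payload: a register name, not a value */
        };

        static struct entry table[TABLE_SLOTS];
        static int64_t sp_bias = 0;   /* how far SP has moved since the reference */

        /* push/pop/call/ret just adjust the bias; no table entry changes. */
        static void adjust_sp(int64_t delta) { sp_bias += delta; }

        /* A store of phys_reg to [SP + off] records the slot under its canonical key. */
        static void rename_store(int64_t off, int phys_reg)
        {
            struct entry *e = &table[(uint64_t)(sp_bias + off) % TABLE_SLOTS];
            e->valid = 1;
            e->canon_off = sp_bias + off;
            e->phys_reg = phys_reg;
        }

        /* A load from [SP + off] can be satisfied by a rename if the canonical
         * offset matches -- even if SP has moved in between. */
        static int rename_load(int64_t off)
        {
            struct entry *e = &table[(uint64_t)(sp_bias + off) % TABLE_SLOTS];
            if (e->valid && e->canon_off == sp_bias + off)
                return e->phys_reg;   /* hit: no trip through the load pipeline */
            return -1;                /* miss: do a normal load */
        }

        int main(void)
        {
            rename_store(0, 42);      /* mov [rsp], rX  (rX = physical reg 42) */
            adjust_sp(-8);            /* push something: SP moves down by 8    */
            /* The same slot is now at [rsp+8]; the rename still matches. */
            printf("load [sp+8] -> phys reg %d\n", rename_load(8));
            adjust_sp(+8);            /* pop: SP moves back up */
            printf("load [sp+0] -> phys reg %d\n", rename_load(0));
            return 0;
        }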

