Intel Details APX - Advanced Performance Extensions

  • #61
    Originally posted by coder View Post
    You're funny.

    Before I dig into the meat of that, let's pause for a moment and contemplate how you're choosing to argue that ISA is irrelevant in a discussion of Intel's biggest revision to the x86 ISA since x86-64, and which is being done for no other reason than efficiency!

    In case you didn't read this article, here are Intel's claims:
    • APX-compiled code contains 10% fewer loads and more than 20% fewer stores than the same code compiled for an Intel 64 baseline.
    • Register accesses are not only faster, but they also consume significantly less dynamic power than complex load and store operations.
    • legacy integer instructions now can also use EVEX to encode a dedicated destination register operand – turning them into three-operand instructions and reducing the need for extra register move instructions.
    • there are 10% fewer instructions in APX-compiled code
    • data-dependent branches are fundamentally hard to predict. To address this growing performance issue, we significantly expand the conditional instruction set of x86​

    So, the mighty Intel is telling us that x86-64 ISA is suffering real deficits for the lack of features found in ISAs like ARM's AArch64! While I have a lot of respect for ChipsAndCheese, there's simply no comparison between what they do and what Intel's own CPU architecture and performance modelling group does.
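
    For reference, here is roughly what the three-operand claim means in practice. The C below is ordinary code; the assembly in the comments is illustrative only, and the APX mnemonic syntax shown is an assumption rather than Intel's official spelling:

    /* Sketch: what a non-destructive, three-operand add buys.
     * If both a and b must stay live, a legacy destructive add needs a copy first:
     *     mov  rcx, rax        ; rcx = a   (the extra register move)
     *     add  rcx, rbx        ; rcx = a + b
     * An APX-style three-operand form (syntax assumed) does it in one:
     *     add  rcx, rax, rbx   ; rcx = a + b, with a and b left untouched
     */
    long sum_and_first(long a, long b, long *sum)
    {
        *sum = a + b;   /* the sum is needed...                      */
        return a;       /* ...and so is a, so a must survive the add */
    }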


    You just don't understand that a new, much better ISA than old x86 (maybe RISC-V) gives only a few percent of a speed increase on the same microarchitecture.

    * This increase is too small to claim that x86 is fundamentally bad.
    * This increase is too small to abandon the entire software base.

    * 40 years ago, RISC was 3 times faster than CISC for the same number of transistors. Today, the gap is only a few percent. Therefore, this dispute is now obsolete.
    * With today's manufacturing technology, microarchitecture is much more important than ISA.

    data-dependent branches are fundamentally hard to predict. To address this growing performance issue, we significantly expand the conditional instruction set of x86
    The same problem exists with conditional instructions. If you don't know the state of the condition, which instructions do you execute? All of them? Then you waste energy (just like when the CPU executes both paths of a jump). Or maybe you can predict it, but then you have the same problem as with the jump predictor.

    According to this, it looks as if Golden Cove's decoder is at least
    And??? Is it too small?? Too big???

    And how much space does it occupy in a comparable ARM?​​

    Compare it to ARM, not to the INT execution units.
    Last edited by HEL88; 29 July 2023, 06:52 PM.

    Comment


    • #62
      Originally posted by HEL88 View Post
      You just don't understand that a new, much better ISA than old x86 (maybe RISC-V) gives only a few percent of a speed increase on the same microarchitecture.

      * This increase is too small to claim that x86 is fundamentally bad.
      First, a numerical claim without data doesn't interest me.

      Second, a point I've touched upon again and again is that scalability matters. You can't argue about ISA in the abstract. It's one thing if we're talking about a rather narrow microarchitecture running branchy code. It's another thing to talk about ultra-wide dispatch cores and code with large loops that exerts some real register pressure on a register-poor ISA like x86.

      Not only that, but don't forget about SMT. It helps wide x86 CPUs get past the narrow bottleneck of their decoders. If you take SMT off the table, x86 looks a lot less appealing. But, SMT isn't applicable everywhere. It's a powerful tool, but not a panacea for all of x86's woes.

      Originally posted by HEL88 View Post
      * This increase is too small to abandon the entire software base.
      People often come back to this, but it's less true than ever. We already live in a world where ARM dominates mobile, while x86 dominates laptops and desktops and is losing ground in the cloud. When you still need x86-64, we have high-quality "emulators" that are more than fast enough for most purposes.

      Originally posted by HEL88 View Post
      * 40 years ago, RISC was 3 times faster than CISC for the same number of transistors. Today, the gap is only a few percent. Therefore, this dispute is now obsolete.
      RISC vs. CISC is a debate that has truly outlived its usefulness.

      Originally posted by HEL88 View Post
      * With today's manufacturing technology, microarchitecture is much more important than ISA.
      There's no doubt that manufacturing technology counts for a lot, but each new process node is more expensive than the last. Transistors just aren't getting cheaper like they used to, nor do we have Dennard Scaling to fall back on. Two very good reasons why we can't afford to be so wasteful with them, and that brings us back to the point about ISA.

      A register spill is a register spill*. It generates extra load/store instructions, extra memory subsystem transactions, and extra cache operations without doing anything useful. Past a certain point, an optimizing compiler needs to decide whether to push past a shortage of ISA registers and generate spills, or leave some static optimizations on the table. Having more ISA registers lets compilers do more static optimizations and expose more ILP to help occupy the many pipelines of today's wide CPUs.

      * Zen 2 implements "memory renaming", but they dropped it in Zen 3. There's only so much a microarchitecture can do to make up for what's lacking in the code it runs, and each new optimization creates new opportunities for security vulnerabilities and other bugs. Not to mention wasting die area and power.
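
      To make the register-pressure point concrete, here's a minimal sketch (hypothetical code; how aggressively a given compiler actually spills here will vary). The more values that are live at once, the sooner a 16-register ISA like baseline x86-64 runs out of architectural registers compared to a 32-register one (AArch64, RISC-V, or x86 with APX's r16-r31), and every value that can't stay in a register becomes extra stores and reloads:

      /* Hypothetical illustration of register pressure: eight accumulators,
       * two pointers, the index, and the bound are all live across every
       * iteration. Unroll further and a 16-GPR target starts spilling
       * accumulators to the stack; a 32-GPR target has more headroom. */
      long dot_unrolled(const long *a, const long *b, long n)
      {
          long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
          long s4 = 0, s5 = 0, s6 = 0, s7 = 0;
          long i = 0;

          for (; i + 8 <= n; i += 8) {
              s0 += a[i + 0] * b[i + 0];
              s1 += a[i + 1] * b[i + 1];
              s2 += a[i + 2] * b[i + 2];
              s3 += a[i + 3] * b[i + 3];
              s4 += a[i + 4] * b[i + 4];
              s5 += a[i + 5] * b[i + 5];
              s6 += a[i + 6] * b[i + 6];
              s7 += a[i + 7] * b[i + 7];
          }
          for (; i < n; i++)      /* tail for n not divisible by 8 */
              s0 += a[i] * b[i];

          return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
      }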

      Originally posted by HEL88 View Post
      The same problem exists with conditional instructions. If you don't know the state of the condition, which instructions do you execute? All of them? Then you waste energy (just like when the CPU executes both paths of a jump). Or maybe you can predict it, but then you have the same problem as with the jump predictor.
      What they're saying is that certain branches are fundamentally hard to predict. Also, not all branches are instrumental to predicting the behavior of subsequent code. So, these are good examples where converting conditional branches to conditional instructions makes sense. It avoids wasting space and energy in the branch predictor for something that ultimately stands to gain little or nothing from it.
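
      A minimal sketch of that trade-off (plain C; compilers already do this with CMOV when they judge it profitable, and APX simply widens the set of instructions that can work this way):

      /* Branchy form: the CPU has to predict the comparison. If the data is
       * effectively random, it mispredicts roughly half the time and pays a
       * pipeline flush for each miss. */
      long clamp_branchy(long x, long limit)
      {
          if (x > limit)
              return limit;
          return x;
      }

      /* Branchless form: the ternary can be lowered to a conditional move
       * (CMOVcc today, or one of the expanded conditional instructions under
       * APX). Both inputs are computed, but nothing needs to be predicted and
       * no branch-predictor state is consumed. */
      long clamp_branchless(long x, long limit)
      {
          return (x > limit) ? limit : x;
      }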

      Originally posted by HEL88 View Post
      And??? Is it too small?? Too big???
      Smaller is better. Transistors not only cost money, but they also burn power.

      Furthermore, the smaller you make your cores, the more of them you can have. 128-core Bergamo? 144-core Sierra Forest? ARM talking about 196-core N2 CPUs? We're not done scaling core counts, yet!

      Originally posted by HEL88 View Post
      And how much space does it occupy in a comparable ARM?​​
      It's an excellent question. I've searched for comparable die-shot analysis of modern ARM cores, but have yet to find anything. Let us know if you do.

      Originally posted by HEL88 View Post
      Compare it to ARM, not to the INT execution units.
      The point was to put it into some kind of perspective. We could look at it as a fraction of overall core area, but the decoder is a lot more opaque to us than the integer execution part of the architecture. So, when you compare it to a part of the chip we know more about, it suggests certain things about the decoder's complexity.


      FWIW, I actually think APX is basically just Intel picking some low-hanging fruit. It's worth doing, but it's probably going to rank pretty low among the kinds of ISA changes we might see over the next couple decades. Compared to GPUs, CPUs burn an awful lot of power relative to the amount of real work they do, which has ushered us into the E-core era. However, E-cores also aren't the ultimate solution to this problem.
      Last edited by coder; 29 July 2023, 11:02 PM.

      Comment


      • #63
        Originally posted by coder View Post
        Interestingly, ARM introduced their own equivalent. It's called Transactional Memory Extension (TME) and is included in ARMv8.5-A and ARMv9.0-A. I never heard of that getting disabled.
        Well someone has to try to implement it before it can be disabled for being buggy.

        Comment


        • #64
          Originally posted by coder View Post
          People often come back to this, but it's less true than ever. We already live in a world where ARM dominates mobile, while x86 dominates laptops and desktops and is losing ground in the cloud. When you still need x86-64, we have high-quality "emulators" that are more than fast enough for most purposes.
          1. I still spec out my CPUs based on "highest single-threaded x86 ISA performance available for 65W TDP within my budget". (And, next time I upgrade my GPU, I'm going to start by coming here and looking up one of Michael's performance-per-watt benchmarks, because I don't have air conditioning.)
          2. Again, for use-cases where I want RISC-V or ARM, I already buy RISC-V or ARM.
          3. I run stuff like "Win16 applications such as BrickLayer inside Wine 1.2", so any given solution also needs to handle edge-case compatibility well. (If I were running Windows, you betcha I'd have WineVDM installed, but it doesn't let me have multiple versions of Wine installed to work around how Wine's "rip out the hacks and do it right" transition broke many Win16 apps.)
          4. Citation needed. The only emulator I've heard tell of being "good enough" is Rosetta 2, and that's closed-source and macOS-only.​
          ARM dominates mobile because the application ecosystem made a clean break to support touch properly. ARM in servers makes sense because it's all either open-source (eg. Apache, nginx, etc.) or ISA-independent source/byte code (eg. Python, Ruby, PHP, Node.js, etc.). ARM in Chromebooks makes sense for the same reason.

          Unseating x86 in the desktop market is like bringing about the Year of Desktop Linux. Not to be underestimated. There's a reason Microsoft has been successfully banking on backwards compatibility with the existing software the user bought (and, to a lesser extent, hardware drivers) since the MS-DOS to Windows transition.
          Last edited by ssokolow; 30 July 2023, 03:54 AM.

          Comment


          • #65
            Originally posted by ssokolow View Post
            1. I still spec out my CPUs based on "highest single-threaded x86 ISA performance available for 65W TDP within my budget". (And, next time I upgrade my GPU, I'm going to start by coming here and looking up one of Michael's performance-per-watt benchmarks, because I don't have air conditioning.)
            2. Again, for use-cases where I want RISC-V or ARM, I already buy RISC-V or ARM.
            And you expect that to be true for the rest of time? As it always was? Because you've always used RISC-V and ARM machines the exact same way you do now? For your entire life?

            Originally posted by ssokolow View Post
            4. Citation needed. The only emulator I've heard tell of being "good enough" is Rosetta 2, and that's closed-source and macOS-only.​
            I was also referring to Microsoft's x86-64 emulation in Windows 11/ARM. I'd imagine gaming is the main thing it can't do "well enough".

            Originally posted by ssokolow View Post
            ​Unseating x86 in the desktop market is like bringing about the Year of Desktop Linux. Not to be underestimated.
            I'm not making any predictions, other than that Windows/ARM and mini-PCs running ChromeOS could both chip away at x86's hold on the desktop market. Gaming is probably going to be the last big market x86 dominates.

            Originally posted by ssokolow View Post
            There's a reason Microsoft has been successfully banking on backwards compatibility with the existing software the user bought (and, to a lesser extent, hardware drivers) since the MS-DOS to Windows transition.
            I just don't see people buying software like they used to. So much of what people do is now web & cloud-based. The few, big commercial apps can easily port over, as Apple has shown. Yes, there will always be niches, but if we're talking about mainstream, then I don't see why most businesses couldn't switch to Windows/ARM today.
            Last edited by coder; 30 July 2023, 04:18 AM.

            Comment


            • #66
              Originally posted by coder View Post
              And you expect that to be true for the rest of time? As it always was? Because you've always used RISC-V and ARM machines the exact same way you do now? For your entire life?
              Addressing me, specifically, is a bad argumentative position, because I'm one of those people who actively selects his tech stack with the intent to make his software purchases (both games and retro-hobby non-games) have a longevity approaching that of the print books I own.

              Originally posted by coder View Post
              I just don't see people buying software like they used to. So much of what people do is now web & cloud-based. The few, big commercial apps can easily port over, as Apple has shown. Yes, there will always be niches, but if we're talking about mainstream, then I don't see why most business couldn't switch to Windows/ARM today.
              "Mainstream" is bifurcated. If it's businesses, then sure. Business uses are more Chromebook-like and have been for ages. Non-business, on the other hand, means games are a not-insignificant factor and games are the main situation where people still buy non-web-based software.

              Comment


              • #67

                Compared to GPUs, CPUs burn an awful lot of power relative to the amount of real work they do, which has ushered us into the E-core era. However, E-cores also aren't the ultimate solution to this problem.
                Try to run a browser or compile the Linux kernel on a GPU and compare the time and power consumption.

                Even shaders are compiled on the CPU, because the CPU is faster and more energy-efficient at that than the GPU.

                If you take SMT off the table, x86 looks
                The king of SMT is RISC. SPARC and POWER go as far as 8-way SMT. So maybe RISC is so much worse that it needs that many threads to compete with x86, which has only 2-way SMT.

                also burn power.
                How often do the decoders actually run? Only a small percentage of the time, because of the uOP cache.


                Last edited by HEL88; 30 July 2023, 02:14 PM.

                Comment


                • #68
                  Originally posted by HEL88 View Post
                  Try to run a browser or compile the Linux kernel on a GPU and compare the time and power consumption.
                  The point I was trying to make is that CPUs burn most of their power on things other than the actual computation dictated by the instruction stream (e.g. decoding, scheduling, caching, prediction, prefetching, etc.). A significant part of the reason CPUs have to do so much work is due to the conceit of the ISA, which presents the CPU core as a simple, serially-executing state machine.

                  If you re-negotiate the hardware/software interface to explicitly expose more of the concurrency in both the software and hardware, you can potentially save the hardware a significant amount of work (which translates into better area-efficiency and energy-efficiency). This was tried with schemes like VLIW, but that went to the completely opposite extreme. I don't believe that's the final word in alternate ISA approaches. There must be a better middle-ground that cues the hardware to do the things better done dynamically, and saves it from trying to redo work that's adequately done statically.

                  Originally posted by HEL88 View Post
                  The king of SMT is RISC. SPARC and POWER go as far as 8-way SMT. So maybe RISC is so much worse that it needs that many threads to compete with x86, which has only 2-way SMT.
                  SMT is simply a technique that can be employed for different reasons and to solve different problems. Just because it was used by one CPU to solve a certain set of problems doesn't mean its value is limited only to those types of CPUs or to those problems.

                  The best way to look at SMT is in a specific context. In the context of modern, wide x86 CPUs, one benefit it can potentially provide is mitigating decoder bottlenecks, since the peer thread is likely to have already-decoded instructions that can be dispatched from the uOP cache.

                  Originally posted by HEL88 View Post
                  How often do the decoders actually run? Only a small percentage of the time, because of the uOP cache.
                  Depends on the code. In code with lots of smaller loops, you'd expect a very good hit rate from the uOP cache. In highly-branchy code, like compilers, parsers, databases, etc., I would expect to see the decoder become a substantial bottleneck.
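
                  As a toy illustration of the two regimes (hypothetical code; real compilers and parsers have far larger instruction footprints, which is exactly what defeats the uOP cache):

                  /* Small, hot loop: a handful of instructions run millions of
                   * times. After the first pass they come from the uOP cache,
                   * so the legacy decoders sit mostly idle. */
                  long sum(const long *a, long n)
                  {
                      long s = 0;
                      for (long i = 0; i < n; i++)
                          s += a[i];
                      return s;
                  }

                  /* Branchy dispatch, the shape of parser / interpreter code:
                   * control keeps hopping between many different paths, so
                   * fetch regularly misses the uOP cache and the decoders end
                   * up on the critical path. */
                  int eval_op(int op, int x, int y)
                  {
                      switch (op) {
                      case 0:  return x + y;
                      case 1:  return x - y;
                      case 2:  return x * y;
                      case 3:  return y != 0 ? x / y : 0;
                      default: return x ^ y;
                      }
                  }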
                  Last edited by coder; 30 July 2023, 03:40 PM.

                  Comment
