Ampere Altra Max Continues To Deliver Competitive Power Efficiency To AMD EPYC & Intel Xeon


  • #81
    Originally posted by coder View Post
    Prove it. Sure, x86 can combine a load or store with computation. It can do relatively simple address arithmetic, as well. In practice, just how often do such instructions actually occur?
    Haha... go ahead and disprove me. It will be a good exercise for you. Look at the origins of CISC... it was designed from the ground up for code density, to save RAM in constrained systems before RAM became more plentiful.

    I don't disagree with you re: Zen having a 4-way front end, but there are two of them per core, so from a total throughput perspective it's 8-way. From a single-thread perspective it's 4-way. I pointed that out. No need to rub it in my face, lol.

    Originally posted by coder View Post
    They're not. VLIW is statically-scheduled. An out-of-order CPU is fundamentally not VLIW, even if it's multi-issue.
    I think you mean compiler-scheduled, which is why VLIW fails as a user-facing ISA - because the compiler can't predict operating conditions, and that leads to endless back-end stalls.

    Actually, the OOO mechanisms of modern CPUs (ROB, branch predictors, speculation, op schedulers) exist exactly for the purpose of extracting parallelism in order to construct waves of parallel operations. They're not VLIW from a user-facing ISA perspective, but they are from a microarch perspective. The waves of parallel micro-ops / fused micro-ops are fed to the schedulers, which form the basis of the VLIW instruction set driving the execution/memory units. Those units and their schedulers are the core, and everything that comes ahead of them is just getting its ducks in a row so the core can actually crunch the numbers and push the effects out.

    Now coming full circle... let's apply an empirical test to your assertions:
    • x86 ISA is by design inferior and impairs core performance vs ARM ISA
    • CISC cannot outperform RISC on code density
    • Zen only uses a 4-way front end (granted)
    If these things are collectively true, then modern 10-way ARM ISA cores should run absolute circles around a [hobbled /pathetic] 4-way Zen x86 ISA core.

    Hmm... what does real-world testing on sites like this one say? Generally, x86 still holds the single-thread and multi-thread crowns. There are some exceptions. x86 doesn't completely pwn ARM.

    I've followed a lot of your contributions here and I appreciate a lot of the information you've provided, even your occasional wry humour.

    If the empirical evidence supported your assertions, by all means, I would go conduct research to educate myself around the superiority of this new information you're providing, which disagrees with the 40 years of research I've already done. However in this instance...
    Last edited by linuxgeex; 14 January 2024, 10:02 AM.



    • #82
      Originally posted by linuxgeex View Post
      Haha... go ahead and disprove me. It will be a good exercise for you. Look at the origins of CISC... it was designed from the ground up for code density, to save RAM in constrained systems before RAM became more plentiful.
      You have unintentionally touched on a relevant point here. In the original x86 ISA, there were only like 4 general purpose registers (maybe the 2 index registers count, as well). So, a lot more memory I/O was necessary, but that's okay because clock speeds were far lower and using RAM instead of a register didn't make such a big difference.

      Then, x86 went 32-bit and doubled the number of GPRs. With 64-bit, they doubled it, again. Now, Intel announced "APX", which will move x86 up to 32 GPRs, achieving parity with AArch64. The more GPRs you have, the less you depend on loads & stores. That reduces the frequency & value of instructions with memory operands.

      You still didn't answer my question of just how often memory operands occur in real world code, but I'll give you partial credit for helping me illustrate my point.
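
      One rough way to put a number on that question is to count bracketed operands in a disassembly. The sketch below assumes Intel-syntax text as produced by `objdump -d -M intel`; the function name `memory_operand_ratio` is made up for illustration. It is only a static count: it skips `lea` and `nop`, which use address syntax without accessing memory, and it does not see the implicit stack accesses of `push`, `pop`, `call`, and `ret`.

```python
import re

# Assumed line shape from `objdump -d -M intel` (tab-separated fields):
#   "  401000:\t48 8b 45 f8          \tmov    rax,QWORD PTR [rbp-0x8]"
# Lines without an instruction field (labels, byte continuations) do not match.
_INSN = re.compile(r"^\s*[0-9a-f]+:\t[0-9a-f ]+\t(\S+)\s*(.*)$")

# Mnemonics that use [..] address syntax but perform no load or store.
_NO_ACCESS = ("lea", "nop")


def memory_operand_ratio(disassembly: str):
    """Return (total_instructions, instructions_with_explicit_memory_operand)."""
    total = 0
    with_mem = 0
    for line in disassembly.splitlines():
        match = _INSN.match(line)
        if match is None:
            continue
        mnemonic, operands = match.group(1), match.group(2)
        total += 1
        if "[" in operands and mnemonic not in _NO_ACCESS:
            with_mem += 1
    return total, with_mem
```

      Feeding it the disassembly of a real binary and dividing the second number by the first gives a static share of instructions with explicit memory operands. A dynamic count, weighted by how often each instruction executes, would need a profiler or an instrumentation tool instead.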

      Originally posted by linuxgeex View Post
      I don't disagree with you re: Zen having a 4-way front end, but there are two of them per core,
      Your posts are the first time I've ever seen such a claim. You said this before, with no evidence! Source?

      Originally posted by linuxgeex View Post
      I think you mean compiler-scheduled, which is why VLIW fails as a user-facing ISA - because the compiler can't predict operating conditions, and that leads to endless back-end stalls.
      Stalls are the reason the backend is not VLIW. VLIW is statically-scheduled, but the backend doesn't know when loads will arrive or the store port will be unblocked. That cascades through and makes everything else unpredictable, as well. OoO CPUs schedule the backend dynamically, hence they're not VLIW.

      Originally posted by linuxgeex View Post
      Now coming full circle... let's apply an empirical test to your assertions:
      • x86 ISA is by design inferior and impairs core performance vs ARM ISA
      • CISC cannot outperform RISC on code density
      • Zen only uses a 4-way front end (granted)
      I'll thank you not to put words in my mouth.

      Originally posted by linuxgeex View Post
      modern 10-way ARM ISA cores should run absolute circles around a [hobbled /pathetic] 4-way Zen x86 ISA core.
      Frontend isn't everything.

      Originally posted by linuxgeex View Post
      I've followed a lot of your contributions here and I appreciate a lot of the information you've provided, even your occasional wry humour.
      Thanks. I've seen many impressive contributions from you, too. It's annoying, because you're too smart and too detail-oriented for me to just write off, as I would some of the others. That doesn't mean you're always right.


      Originally posted by linuxgeex View Post
      If the empirical evidence supported your assertions,
      Well, early results show the two flagship phone SoCs with the Cortex-X4 reaching GeekBench 6 single-threaded CPU scores similar to those of AMD's Zen 4-based, TSMC N4 Phoenix (7840U). However, keep in mind that one is a phone chip and the other is a laptop chip. The more troubling gap is between the X4 and Apple's A17.
      I wish we had more than GB 6 to go on. However, since AnandTech stopped doing deep dives of phone SoCs, I'm not aware of anyone else running SPEC2017 on them.


      I'd love to see good data on perf/W and IPC. Since the X4 is a mobile-first core, it's not designed to clock high. However, I expect it certainly surpasses Zen 4 on IPC (except for vector-heavy workloads, where Zen 4's 6-way 256-bit AVXn backend comes into play).

      Originally posted by linuxgeex View Post
      I would go conduct research to educate myself around the superiority of this new information you're providing, which disagrees with the 40 years of research I've already done. However in this instance...
      IMO, it's a continual learning experience. As semiconductor technology evolves, so do microarchitectures, software, and the resulting bottlenecks.
