Apple M2 Support Added To Upstream LLVM Along With The A15, A16


  • name99
    replied
    Originally posted by coder
    Uh, reduces instructions by combining 2 simpler ops into a single complex one? I get that it's better to do it earlier, so you occupy less space in caches, ROB, etc.
    Fusion is generally done at Decode time. Decode is part of the In-Order front-end of the machine. For fusion to happen, two fusible instructions have to be recognized as such, which essentially means they need to be sequential in the instruction stream.
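To make the adjacency point concrete, here is a toy Python model of a decoder that fuses a CMP followed immediately by a conditional branch into a single macro-op. The mnemonics and the single fusible-pair rule are simplified illustrations, not Apple's actual fusion patterns.

```python
# Toy model of decode-time macro-op fusion (illustrative only; real
# hardware fuses specific pairs, e.g. AArch64 CMP + B.cc).
def decode_with_fusion(instructions):
    """Scan the in-order instruction stream and fuse adjacent
    CMP + B.cc pairs into one macro-op; everything else passes through."""
    macro_ops = []
    i = 0
    while i < len(instructions):
        cur = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        # Fusion requires the pair to be adjacent in the stream.
        if cur.startswith("CMP") and nxt is not None and nxt.startswith("B."):
            macro_ops.append(f"{cur} + {nxt}")  # one fused macro-op
            i += 2
        else:
            macro_ops.append(cur)
            i += 1
    return macro_ops

stream = ["ADD x0, x1, x2", "CMP x0, #0", "B.EQ label", "MOV x3, #1"]
print(decode_with_fusion(stream))
# 4 instructions decode into 3 macro-ops: the CMP/B.EQ pair is fused
```

If anything separates the CMP from the branch, no fusion happens, which is exactly why the pair has to be sequential in the instruction stream.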



  • coder
    replied
    Originally posted by name99
    I think you do not understand/appreciate what instruction fusion does.
    Uh, reduces instructions by combining 2 simpler ops into a single complex one? I get that it's better to do it earlier, so you occupy less space in caches, ROB, etc.
    Last edited by coder; 26 September 2022, 10:53 PM.



  • name99
    replied
    Originally posted by coder
    Not sure where you got the 10% figure, but it's not consistent with what ARM reported (as I quoted in comment 13).

    In any case, I just think it's interesting. I don't have a dog in this fight -- just a bemused observer.
    The comparison is not SVE vs NEON, it is SVE vs Macroscalar (or whatever Apple does as opposed to SVE), assuming code optimized around SVE's ideas.



  • name99
    replied
    Originally posted by coder
    I considered that, but I still think things like the ratio of different execution ports can have a measurable effect. Maybe in just a few compute-heavy corner cases, but I'm not convinced it's irrelevant.


    Interesting. I'd have expected their OoO would handle that, too. I guess, if you can just patch the compiler, then why bother doing it in hardware?
    I think you do not understand/appreciate what instruction fusion does.



  • coder
    replied
    Originally posted by name99
    Losing 3x from not having a SIMD ISA is a big deal. Losing 10% by having autovectorization go down one path rather than another is no big deal.
    Not sure where you got the 10% figure, but it's not consistent with what ARM reported (as I quoted in comment 13).

    In any case, I just think it's interesting. I don't have a dog in this fight -- just a bemused observer.



  • coder
    replied
    Originally posted by name99
    You don’t need a scheduling model when you’re as OoO as Apple, you really don’t!
    I considered that, but I still think things like the ratio of different execution ports can have a measurable effect. Maybe in just a few compute-heavy corner cases, but I'm not convinced it's irrelevant.

    Originally posted by name99
    All you need is hints to ensure that fused pairs are always placed adjacent in the instruction stream.
    Interesting. I'd have expected their OoO would handle that, too. I guess, if you can just patch the compiler, then why bother doing it in hardware?



  • name99
    replied
    Originally posted by coder
    Thanks for the tip, and I will check it out, but my point still stands about them missing out on SVE-optimized software. So, I think they'll eventually need to add it.
    And ARM is missing out on AMX-optimized software. These things happen and life goes on.
    Apple’s bet is that little specifically SVE optimized code will be written (as opposed to auto-vectorized code). They are probably correct.
    It’s no longer the 1990s, not even the 2010s.

    Losing 3x from not having a SIMD ISA is a big deal. Losing 10% by having autovectorization go down one path rather than another is no big deal.
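As a rough illustration of why those two losses are on different scales, here is a toy instruction-count model (not a benchmark) comparing a scalar loop against a hypothetical 4-lane SIMD unit. Real speedups depend on the workload and microarchitecture, which is where round figures like 3x come from.

```python
# Toy cost model (illustrative only): count instructions needed to add
# two N-element arrays with a scalar loop vs a hypothetical 4-lane SIMD unit.
def scalar_instruction_count(n):
    return n                      # one add per element

def simd_instruction_count(n, lanes=4):
    full = n // lanes             # one vector add covers `lanes` elements
    tail = n % lanes              # leftover elements done as scalar adds
    return full + tail

n = 1024
scalar = scalar_instruction_count(n)
vector = simd_instruction_count(n)
print(scalar, vector, scalar / vector)   # 1024 256 4.0
```

Giving up the SIMD unit entirely forfeits that whole multiple; picking a slightly worse autovectorization strategy only shaves a few percent off it.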



  • name99
    replied
    Originally posted by coder
    Okay, thanks for pointing that out. What I meant was the scheduling model. I was expecting to see a custom scheduler model for the new cores, but I now see that Apple is always just using Cyclone. I'm also noticing they didn't bother to tune the prefetch parameters since A7.

    Do you think they maintain a different scheduler model, on their internal fork? I guess a way to find out would be to compile the same code with the same version of public LLVM that Apple's tools seem sync'd with.
    You don’t need a scheduling model when you’re as OoO as Apple, you really don’t! All you need is hints to ensure that fused pairs are always placed adjacent in the instruction stream.
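A minimal sketch of what such a compiler-side hint could look like: a toy pass that hoists an unrelated instruction out from between a CMP and its conditional branch so the fusible pair ends up adjacent. The mnemonics and the independence check are simplified assumptions, not LLVM's actual macro-fusion machinery.

```python
# Toy compiler pass (illustrative): keep a fusible CMP + B.cc pair adjacent
# by hoisting the instruction between them above the CMP. A real scheduler
# must prove the moved instruction is independent; here we simply assume any
# non-CMP, non-branch instruction is safe to move.
def keep_pairs_adjacent(instructions):
    out = list(instructions)
    i = 0
    while i + 2 < len(out):
        a, b, c = out[i], out[i + 1], out[i + 2]
        if a.startswith("CMP") and c.startswith("B.") and not (
            b.startswith("CMP") or b.startswith("B.")
        ):
            out[i], out[i + 1] = b, a     # hoist the separator above CMP
        i += 1
    return out

before = ["CMP x0, #0", "MOV x3, #1", "B.EQ label"]
print(keep_pairs_adjacent(before))
# ['MOV x3, #1', 'CMP x0, #0', 'B.EQ label']  -- pair now adjacent
```

With the pair adjacent, the decoder's fusion logic (which only looks at neighboring instructions) can do the rest, so no scheduling model of per-port latencies is needed.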



  • coder
    replied
    Originally posted by name99
    Actually it does. Look at the feature list, e.g. the fuse options. You can track these through LLVM to see the exact patterns that are fused.
    Okay, thanks for pointing that out. What I meant was the scheduling model. I was expecting to see a custom scheduler model for the new cores, but I now see that Apple is always just using Cyclone. I'm also noticing they didn't bother to tune the prefetch parameters since A7.

    Do you think they maintain a different scheduler model, on their internal fork? I guess a way to find out would be to compile the same code with the same version of public LLVM that Apple's tools seem sync'd with.



  • coder
    replied
    Originally posted by name99
    Or maybe the plan is to provide an alternative to SVE…
    SVE is better than the hash Intel has made of AVX but it’s far from perfect in various ways.
    Look up the Macroscalar architecture…
    Thanks for the tip, and I will check it out, but my point still stands about them missing out on SVE-optimized software. So, I think they'll eventually need to add it.

