AMD Zen 4 AVX-512 Performance Analysis On The Ryzen 9 7950X


  • coder
    replied
    Originally posted by arQon View Post
    Zen4's subset is what will - or at least, should - define AVX for the next 5+ years. Anyone who wants to keep chasing whatever the latest random instruction Intel added to win a benchmark with is welcome to do so, and in some workloads it might even be worth the effort; but the rest of the world should have the sense to realize that now is the time to get off this particular hamster wheel.
    To the extent newer instructions are deep learning-focused, you only really need a few libraries to support them. In that sense, they're kinda like crypto extensions.

  • arQon
    replied
    Originally posted by coder View Post
    Intel would likely thank you for selling lots of their new Sapphire Rapids Xeons, because those support a superset of what Zen4 does.
    Not the point though, and you know it.

    Zen4's subset is what will - or at least, should - define AVX for the next 5+ years. Anyone who wants to keep chasing whatever the latest random instruction Intel added to win a benchmark with is welcome to do so, and in some workloads it might even be worth the effort; but the rest of the world should have the sense to realize that now is the time to get off this particular hamster wheel.

  • coder
    replied
    Originally posted by arQon View Post
    We shouldn't be wasting *untold thousands* of man hours reworking code just for the benefit of a trillion dollar company that *went out of its way* to screw everyone over. We should be beating it into Intel's sociopathic skull via the only means they care about - money - that we're tired of picking up their dogshit. The only way to do that, and to keep this from happening again and again, is to support Zen4 and any other complete AVX512 implementation, and let Intel's broken garbage ones go f**k themselves.

    If that means some Intel customers get the short end of the stick and go with AMD next time instead because of it, that's how things *should* be, ...
    Intel would likely thank you for selling lots of their new Sapphire Rapids Xeons, because those support a superset of what Zen4 does.

  • arQon
    replied
    Originally posted by coder View Post
    Considering how many >= Skylake SP Intel server CPUs there are in the wild, I think that ship has sailed. glibc's hwcaps already defined x86-64 ISA feature levels, and v4 includes the following:
    (snip)
    Not sure if that's the final list.
    It might be for glibc, but it certainly doesn't cover all the variations - and that's kind of my point...

    In financial systems, the options are now either "We only buy this one specific SKU", or "We need 3 more expert devs *just to deal with this one aspect* of the code".

    In games etc, either you put your codebase at risk of exploding on whichever of the *eight* AVX512 paths got least testing, or you drop support for anything except "Real 512", and NotActually-AVX512-Fuctardery gets the AVX2 path, the end.

    This mess reminds me of the early days of GL, and having to maintain multiple codepaths and/or hacks for drivers that were nonconformant or hardware that was too limited to actually support what we wanted to do at all, or couldn't do it performantly. At least they were acting in good faith, even if they fell short.

    This, though? No. This is not the same.

    We shouldn't be wasting *untold thousands* of man hours reworking code just for the benefit of a trillion dollar company that *went out of its way* to screw everyone over. We should be beating it into Intel's sociopathic skull via the only means they care about - money - that we're tired of picking up their dogshit. The only way to do that, and to keep this from happening again and again, is to support Zen4 and any other complete AVX512 implementation, and let Intel's broken garbage ones go f**k themselves.

    If that means some Intel customers get the short end of the stick and go with AMD next time instead because of it, that's how things *should* be, until and unless Intel starts to do better. If they won't, then let them choke to death on their own hubris.

  • coder
    replied
    Originally posted by arQon View Post
    > It's like that quirk MMX had, where you needed to reset the FPU state when switching from MMX instructions to x87, except more subtle.

    * PTSD intensifies...
    Remember you could also have the rounding/precision behavior corrupted by OS interrupt handlers. Good times.
    😅

    Originally posted by arQon View Post
    If/since AMD is maintaining something very close to the full ISA, on all its chips, I think (or at the very least, hope) it has a significant chance of defining that as the "real" AVX512 ISA, and allowing developers to treat anything less as not supporting AVX512 at all.
    Considering how many >= Skylake SP Intel server CPUs there are in the wild, I think that ship has sailed. glibc's hwcaps already defined x86-64 ISA feature levels, and v4 includes the following:
    • AVX512F
    • AVX512BW
    • AVX512CD
    • AVX512DQ
    • AVX512VL

    Not sure if that's the final list. I had a little trouble locating an authoritative source, so I just used: https://www.phoronix.com/news/Linux-...Feature-Levels
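That v4 list lends itself to a simple mechanical check. A minimal sketch (flag spellings follow Linux /proc/cpuinfo; the function names are illustrative, not any real library's API):

```python
# Sketch: check whether a CPU's flag set covers the AVX-512 portion of
# glibc's x86-64-v4 feature level, as listed above. Flag spellings are
# the Linux /proc/cpuinfo ones; names here are illustrative only.
X86_64_V4_AVX512 = {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}

def supports_v4_avx512(cpu_flags):
    """True only if every required AVX-512 extension is present."""
    return X86_64_V4_AVX512 <= set(cpu_flags)

def flags_from_cpuinfo(text):
    """Pull the flag list out of /proc/cpuinfo-style text."""
    for line in text.splitlines():
        if line.startswith("flags"):
            return line.split(":", 1)[1].split()
    return []

# Zen 4 advertises the full v4 subset; a chip with AVX-512 fused off does not.
zen4 = ["sse2", "avx2", "avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"]
adl  = ["sse2", "avx2"]
print(supports_v4_avx512(zen4))  # True
print(supports_v4_avx512(adl))   # False
```

On a real system you would feed `flags_from_cpuinfo(open("/proc/cpuinfo").read())` into the check instead of the hard-coded lists.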

  • arQon
    replied
    Originally posted by coder View Post
    That's pretty bad, IMO.
    No argument from me. I'm just saying *even that* is already not the biggest problem with AVX *in general* right now, because the whole damn thing has been bullshitted to shreds by Intel's artificial segmentation policies.
    On a case-specific basis, yeah: that part is worse - but at least those are where you have some chance (and hopefully, the time/resources) to investigate.

    > It's like that quirk MMX had, where you needed to reset the FPU state when switching from MMX instructions to x87, except more subtle.

    * PTSD intensifies...
    Remember you could also have the rounding/precision behavior corrupted by OS interrupt handlers. Good times.

    > One of the best features of SSE was no longer having to do that.

    Absolutely. SSE had its shortcomings, but at least it was sane. Now we're at ?8? versions (and counting) of AVX that were deliberately "broken" for the sake of getting a < 1% higher score on an artificial benchmark, and may or may not be missing half the instructions you actually want.
    Things were bad enough even when you "only" had to worry about 3 or 4 different tiers of support, but now we have sibling tiers rather than supersets, and so many of them it's just completely unmanageable.
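The "sibling tiers" point can be made concrete with two of the earliest AVX-512 implementations: Skylake-SP and Knights Landing each extend the AVX512F/CD core in different directions, so neither feature set contains the other. A quick sketch (feature sets abbreviated to the relevant extensions):

```python
# Two real AVX-512 implementations where neither set is a superset of the
# other - there is no single linear "tier" ordering between them.
skylake_sp      = {"avx512f", "avx512cd", "avx512bw", "avx512dq", "avx512vl"}
knights_landing = {"avx512f", "avx512cd", "avx512er", "avx512pf"}

common = skylake_sp & knights_landing   # only the F/CD core overlaps
print(sorted(common))                    # ['avx512cd', 'avx512f']
print(skylake_sp <= knights_landing)     # False
print(knights_landing <= skylake_sp)     # False
```

With supersets, one comparison answers "can I run this code path?"; with siblings, every new variant multiplies the cases to test.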

    If/since AMD is maintaining something very close to the full ISA, on all its chips, I think (or at the very least, hope) it has a significant chance of defining that as the "real" AVX512 ISA, and allowing developers to treat anything less as not supporting AVX512 at all. If AMD stumbles and decides to pull half the ISA from e.g. a "7400" low-end Zen4 chip to save die space I think that would be a significant mistake, but we'll see how it goes.

  • coder
    replied
    Originally posted by arQon View Post
    Yeah: you can pull some of the data from CPU sheets or compiler cost tables, but if it matters to you the only way to know where the tipping point really is is to bench it. (Which is something I should write and give to Michael for PTS, but I just don't have the time).
    It's a systemic problem: I might call a library function not knowing it uses AVX-512, and if neither it nor my code calls VZEROUPPER, any SSE/AVX/AVX2 code that I (perhaps also unwittingly) use will perform worse!

    That's pretty bad, IMO. It's like that quirk MMX had, where you needed to reset the FPU state when switching from MMX instructions to x87, except more subtle. One of the best features of SSE was no longer having to do that.

    More: https://stackoverflow.com/questions/...o-sse-instruct

  • arQon
    replied
    Originally posted by coder View Post
    That's one problem, and the biggest conceptual one that I've seen with it. I'm curious to know how much of a liability it is for Zen 4.
    Yeah: you can pull some of the data from CPU sheets or compiler cost tables, but if it matters to you the only way to know where the tipping point really is is to bench it. (Which is something I should write and give to Michael for PTS, but I just don't have the time).

    > Segmentation of the various extensions, which creates potential headaches for software developers.

    Yeah. Again, that's typical Intel pettiness / stupidity, and I'm really tired of it. It's barely tolerable even if you're building dedicated systems, but for software you're releasing to end users I simply wouldn't bother with it any more: either a machine has full support for the pieces I need, or it gets AVX2. I'm not going to produce eight different builds just because of Intel's incompetence, nor am I going to maintain 5 different versions of key building blocks. If that means the bulk of Intel CPUs have sub-optimal performance, tough - that's Intel's fault and Intel's problem, not mine.
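That "full support or AVX2" policy boils down to a two-way runtime dispatch. A minimal sketch, assuming glibc's v4 list as the definition of "full support" (the kernel functions are stand-ins, not a real API):

```python
# Sketch of the all-or-nothing dispatch policy described above: one
# required feature set, and any CPU missing a piece of it falls back to
# the AVX2 path. Kernel names are illustrative stand-ins.
REQUIRED_AVX512 = {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}

def kernel_avx512(data):   # stand-in for the hand-tuned AVX-512 build
    return "avx512"

def kernel_avx2(data):     # stand-in for the baseline AVX2 build
    return "avx2"

def select_kernel(cpu_flags):
    """Full AVX-512 subset -> AVX-512 path; anything less -> AVX2 path."""
    if REQUIRED_AVX512 <= set(cpu_flags):
        return kernel_avx512
    return kernel_avx2

print(select_kernel(REQUIRED_AVX512)(None))      # avx512
print(select_kernel({"avx2", "avx512f"})(None))  # avx2
```

Two code paths instead of eight: the cost is leaving some performance on the table for partial implementations, which is exactly the trade being argued for.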

  • coder
    replied
    Originally posted by arQon View Post
    There's nothing inherently "wrong with" AVX512 on a conceptual level. (Well, there is, but that's a longer topic than I have time for right now).


    That's one problem, and the biggest conceptual one that I've seen with it. I'm curious to know how much of a liability it is for Zen 4.

    I think the main implementation problems are:

  • arQon
    replied
    Originally posted by ms178 View Post
    Right, but wasn't AVX-512 particularly better suited for more widespread use than other vector ISAs before it?
    Short answer: no.

    Longer answer: no, it was the exact opposite of that. At least a majority of - and IIRC every - SIMD ISA *apart* from AVX512 ?and maybe 3DNow? had general-purpose value *as well*, cf. using MMX instructions for faster memcpy/memset. AVX512 is more like using floating point on a 486: the individual instructions are beneficial, but there's a setup / transition cost to them which wipes out that benefit unless you're executing a significant number of them.

    That is, you're solving an inequality of C + xN < xM (where M > N, obviously). Sometimes that math works out, and sometimes it doesn't, but for AVX512 it was further skewed to C + xN*S < xM, where S < 1, and potentially so by a pretty large amount.
    For MMX, C ~= 18M (IIRC, though it's been a while). That is, not fast enough to make it worthwhile for e.g. dp or xp, but a win for matrices, just (or would have been, if not for the second FP half-pipe on 586 and later). For anything larger, like operating on a nice juicy array of vectors, it was an easy win, and that was the ballpark at worst for every later variation too.
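Solving that inequality for x gives the break-even element count directly: C + xN < xM means the SIMD path wins once x > C / (M - N). A quick sketch with illustrative numbers (M and N are per-element costs in arbitrary units; only the ratios matter):

```python
import math

# The inequality above, solved for x:
#     C + x*N < x*M   =>   x > C / (M - N)     (requires M > N)
# C = setup/transition cost, M = scalar cost per element,
# N = SIMD cost per element. The numbers below are illustrative only.

def breakeven(c, m, n):
    """Smallest whole element count for which the SIMD path wins."""
    if m <= n:
        raise ValueError("SIMD per-element cost must beat scalar (M > N)")
    return math.floor(c / (m - n)) + 1

# MMX-ballpark setup cost (C ~= 18 scalar ops): pays off quickly.
print(breakeven(c=18, m=1.0, n=0.25))    # 25
# "Several thousand" setup cost: the crossover moves out by orders
# of magnitude, which is the whole complaint about Intel's AVX-512.
print(breakeven(c=5000, m=1.0, n=0.25))  # 6667
```

The S < 1 skew factor only changes n in this model; the dominant term for Intel's implementations was always the huge c.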

    For AVX512 though, *as implemented on Intel*, C was in the range of "several thousand" at best. That's the exact opposite of "better suited to widespread use", *even if* you assume near-sole ownership of the CPU, which is very much not the case in DC land, nor in a substantial portion of any systems written in the last 15 years since multicore CPUs became commonplace.

    There's nothing inherently "wrong with" AVX512 on a conceptual level. (Well, there is, but that's a longer topic than I have time for right now). However, given that 100% of the implementations of it had that massive C, and the other negatives, it would absolutely be fair to call it "a shitshow", or "a way of 'cheating' on specific benchmarks", or "a waste of die space", or pretty much any of the other derogatory phrases it's been called over the last few years, because it deserved every one of them.
