AMD Zen 4 AVX-512 Performance Analysis On The Ryzen 9 7950X

  • #51
    Originally posted by ms178 View Post
    Nah, as there was no throttling involved, it rather means that Intel finally managed to optimize AVX-512 to be more power efficient. Yeah, we all have to throw away the old wisdom about AVX-512, as the old equation "AVX-512 usage = higher power draw" is no longer true.
    There's a fundamental inconsistency, though. Igor showed a performance advantage for AVX-512 of a mere 13.0% (Y-cruncher) and 9.4% (LinX).

    Compare that to the GeoMean from this article, where AMD got a whopping 59.0% performance advantage in AVX-512 mode, with what should be a narrower implementation (1x FMA vs. 2x and higher latency for at least some instructions).

    Curious to hear your explanations of that.


    Edit: Michael's CpuMiner testing of AVX-512 on Alder Lake showed much bigger gains, often accompanied by higher power consumption (though not insane). I wish he'd tested a few more apps, so we could get a similar picture to the one we now have for Zen 4.

    Last edited by coder; 29 September 2022, 04:33 PM.

    Comment


    • #52
      Originally posted by coder View Post
      There's a fundamental inconsistency, though. Igor showed a performance advantage for AVX-512 of a mere 13.0% (Y-cruncher) and 9.4% (LinX).

      Compare that to the GeoMean from this article, where AMD got a whopping 59.0% performance advantage in AVX-512 mode, with what should be a narrower implementation (1x FMA vs. 2x and higher latency for at least some instructions).

      Curious to hear your explanations of that.
      Well, I'd say we should not mix together too many questions, such as "Is AVX-512 more power efficient on Alder Lake?" with "Is AMD's AVX-512 implementation better than Alder Lake's?"

      My statement was focused on the former. From Igor's article: "[...] the trigger for today’s article was the literally incredibly low power consumption of the P-cores with AVX-512 and the question whether the efficiency is actually higher. And indeed, with the feature enabled, the efficiency of the P-cores is significantly higher in all benchmarks than without. In fact, the results are so clear that the instruction set can and should always be safely activated – if possible, of course."

      Igor also measured his power numbers with special hardware, which should provide a more accurate picture than standard software tools. His testing was also done on Windows in December 2021, whereas Michael did his testing on Linux in September 2022 and with different tests involved. There are too many variables at play to compare numbers from Igor's and Michael's tests to each other, as the geomean of Michael's testing also includes different benchmarks. Michael could do an AVX-512-focused comparison between Alder Lake and Zen 4 with power numbers, but as Alder Lake no longer ships with AVX-512 enabled, the new HEDT platforms from both AMD and Intel should be better suited for such a comparison.

      Comment


      • #53
        Originally posted by ms178 View Post
        Well, I'd say we should not mix together too many questions, such as "Is AVX-512 more power efficient on Alder Lake?" with "Is AMD's AVX-512 implementation better than Alder Lake's?"
        As performance is directly linked to power consumption, it seems foolish to address an anomaly in one while ignoring an anomaly in the other.

        Originally posted by ms178 View Post
        "... indeed, with the feature enabled, the efficiency of the P-cores is significantly higher in all benchmarks than without."
        News flash: that was always true. AVX-512 always increased performance in vector-heavy workloads by more than it increased power consumption. Hence, improved efficiency. So, that really doesn't address the key questions.
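
        (To spell out that arithmetic with made-up numbers: if a hypothetical workload gains 40% performance from AVX-512 while drawing 25% more power, perf-per-watt changes by 1.40 / 1.25 ≈ 1.12, i.e. efficiency improves by roughly 12% even though absolute power draw went up.)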

        Originally posted by ms178 View Post
        Igor also measured his power numbers with special hardware which should provide a more accurate picture than standard software tools.
        What's your point? Other reviewers have tested other CPUs with hardware measurements, as well.

        Originally posted by ms178 View Post
        His testing was also done on Windows in December 2021, whereas Michael did his testing on Linux in September 2022 and with different tests involved.
        If you're addressing the Zen 4 performance discrepancy, I agree that we have very little data from Igor's testing. However, I think we've seen prior data which supports that Y-cruncher responds well to AVX-512. At least, better than what Igor measured. I'll try to confirm that.

        BTW, the link I added to my prior post about Michael's AVX-512 testing of Alder Lake was done in November 2021. So, every bit as legitimate as Igor's. You should check it out!

        Comment


        • #54
          Originally posted by ms178 View Post
          Right, but wasn't AVX-512 particularly better suited for more widespread use than other vector ISAs before it?
          Short answer: no.

          Longer answer: no, it was the exact opposite of that. At least a majority of - and IIRC every - SIMD ISA *apart* from AVX512 (and maybe 3DNow!) had general-purpose value *as well*, cf. using MMX instructions for faster memcpy/memset. AVX512 is more like using floating point on a 486: the individual instructions are beneficial, but there's a setup/transition cost to them which wipes out that benefit unless you're executing a significant number of them.

          That is, you're solving an inequality of C + xN < xM (where M > N, obviously). Sometimes that math works out, and sometimes it doesn't, but for AVX512 it was further skewed to C + xN/S < xM, where S < 1 is the downclock factor, and potentially by a pretty large amount.
          For MMX, C ~= 18M (IIRC, though it's been a while). That is, not fast enough to make it worthwhile for e.g. a single dot or cross product, but a win for matrices, just barely (or would have been, if not for the second FP half-pipe on 586 and later). For anything larger, like operating on a nice juicy array of vectors, it was an easy win, and that was the ballpark at worst for every later variation too.

          For AVX512 though, *as implemented on Intel*, C was in the range of "several thousand" at best. That's the exact opposite of "better suited to widespread use", *even if* you assume near-sole ownership of the CPU, which is very much not the case in DC land, nor in a substantial portion of any systems written in the last 15 years since multicore CPUs became commonplace.
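
          (Worked through, using my numbers above plus one assumed figure for the vector speedup: solving C + xN/S < xM for x gives a break-even of x > C / (M - N/S). For MMX, with no downclock (S = 1), C ~= 18M, and an assumed N = M/4, that's x > 18M / 0.75M = 24 elements. Swap in an AVX512-style C of "several thousand" M and the break-even climbs into the thousands of elements.)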

          There's nothing inherently "wrong with" AVX512 on a conceptual level. (Well, there is, but that's a longer topic than I have time for right now). However, given that 100% of the implementations of it had that massive C, and the other negatives, it would absolutely be fair to call it "a shitshow", or "a way of 'cheating' on specific benchmarks", or "a waste of die space", or pretty much any of the other derogatory phrases it's been called over the last few years, because it deserved every one of them.

          Comment


          • #55
            Originally posted by arQon View Post
            There's nothing inherently "wrong with" AVX512 on a conceptual level. (Well, there is, but that's a longer topic than I have time for right now).


            That's one problem, and the biggest conceptual one that I've seen with it. I'm curious to know how much of a liability it is for Zen 4.

            I think the main implementation problems are:
            • Segmentation of the various extensions, which creates potential headaches for software developers.

            Comment


            • #56
              Originally posted by coder View Post
              That's one problem, and the biggest conceptual one that I've seen with it. I'm curious to know how much of a liability it is for Zen 4.
              Yeah: you can pull some of the data from CPU sheets or compiler cost tables, but if it matters to you the only way to know where the tipping point really is is to bench it. (Which is something I should write and give to Michael for PTS, but I just don't have the time).
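
              Something in this vein, purely as a sketch - illustrative kernel, sizes, and flags, not the PTS test I have in mind. Build it e.g. with "gcc -O2 -fno-tree-vectorize -mavx512f bench.c" so the scalar baseline stays scalar; one-shot timings are noisy, so a real harness would repeat each size and keep the minimum:

              #include <immintrin.h>
              #include <stdio.h>
              #include <stdlib.h>
              #include <time.h>

              static double now_ns(void) {
                  struct timespec ts;
                  clock_gettime(CLOCK_MONOTONIC, &ts);
                  return ts.tv_sec * 1e9 + ts.tv_nsec;
              }

              static void add_scalar(float *a, const float *b, size_t n) {
                  for (size_t i = 0; i < n; i++) a[i] += b[i];
              }

              #ifdef __AVX512F__
              static void add_avx512(float *a, const float *b, size_t n) {
                  size_t i = 0;
                  for (; i + 16 <= n; i += 16)       /* 16 floats per 512-bit op */
                      _mm512_storeu_ps(a + i, _mm512_add_ps(_mm512_loadu_ps(a + i),
                                                            _mm512_loadu_ps(b + i)));
                  for (; i < n; i++) a[i] += b[i];   /* scalar tail */
              }
              #endif

              int main(void) {
                  for (size_t n = 16; n <= (1u << 20); n *= 4) {
                      float *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b);
                      for (size_t i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
                      double t0 = now_ns();
                      add_scalar(a, b, n);
                      double t_scalar = now_ns() - t0;
              #ifdef __AVX512F__
                      t0 = now_ns();
                      add_avx512(a, b, n);
                      printf("n=%8zu scalar=%9.0f ns avx512=%9.0f ns (a0=%g)\n",
                             n, t_scalar, now_ns() - t0, a[0]);
              #else
                      printf("n=%8zu scalar=%9.0f ns (built without AVX-512)\n", n, t_scalar);
              #endif
                      free(a); free(b);
                  }
                  return 0;
              }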

              > Segmentation of the various extensions, which creates potential headaches for software developers.

              Yeah. Again, that's typical Intel pettiness / stupidity, and I'm really tired of it. It's barely tolerable even if you're building dedicated systems, but for software you're releasing to end users I simply wouldn't bother with it any more: either a machine has full support for the pieces I need, or it gets AVX2. I'm not going to produce eight different builds just because of Intel's incompetence, nor am I going to maintain 5 different versions of key building blocks. If that means the bulk of Intel CPUs have sub-optimal performance, tough - that's Intel's fault and Intel's problem, not mine.
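
              In practice that policy is a few lines of dispatch. A minimal sketch - the kernel names are hypothetical stand-ins, and the exact subset to require is whatever your hot path needs (I'm assuming F/BW/DQ/VL here); __builtin_cpu_supports is GCC/Clang's real builtin:

              #include <stdio.h>

              static void kernel_avx2(void)   { puts("AVX2 path"); }         /* hypothetical */
              static void kernel_avx512(void) { puts("full AVX-512 path"); } /* hypothetical */

              typedef void (*kernel_fn)(void);

              static kernel_fn pick_kernel(void) {
                  /* Either the machine has everything the hot path needs,
                   * or it gets the AVX2 build - no per-SKU special tiers. */
                  if (__builtin_cpu_supports("avx512f")  &&
                      __builtin_cpu_supports("avx512bw") &&
                      __builtin_cpu_supports("avx512dq") &&
                      __builtin_cpu_supports("avx512vl"))
                      return kernel_avx512;
                  return kernel_avx2;
              }

              int main(void) { pick_kernel()(); return 0; }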

              Comment


              • #57
                Originally posted by arQon View Post
                Yeah: you can pull some of the data from CPU sheets or compiler cost tables, but if it matters to you the only way to know where the tipping point really is is to bench it. (Which is something I should write and give to Michael for PTS, but I just don't have the time).
                It's a systemic problem: I might call a library function without knowing it uses AVX-512, and if neither it nor my code calls VZEROUPPER, any SSE/AVX/AVX2 code that I (perhaps also unwittingly) use will perform worse!

                That's pretty bad, IMO. It's like that quirk MMX had, where you needed to reset the FPU state when switching from MMX instructions to x87, except more subtle. One of the best features of SSE was no longer having to do that.

                More: https://stackoverflow.com/questions/...o-sse-instruct
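
                A minimal sketch of the hygiene in question (illustrative kernel; compilers normally emit VZEROUPPER around the AVX code they generate themselves, so the trap is hand-written asm or prebuilt library code that skips it):

                #include <immintrin.h>
                #include <stddef.h>

                /* Build e.g. with: gcc -O2 -mavx2 -c scale.c */
                void scale_avx2(float *a, float s, size_t n) {
                    __m256 vs = _mm256_set1_ps(s);
                    size_t i = 0;
                    for (; i + 8 <= n; i += 8)
                        _mm256_storeu_ps(a + i,
                                         _mm256_mul_ps(_mm256_loadu_ps(a + i), vs));
                    for (; i < n; i++)
                        a[i] *= s;          /* scalar tail */
                    _mm256_zeroupper();     /* VZEROUPPER: clear dirty upper YMM state
                                             * before returning to legacy-SSE callers */
                }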

                Comment


                • #58
                  Originally posted by coder View Post
                  That's pretty bad, IMO.
                  No argument from me. I'm just saying *even that* is already not the biggest problem with AVX *in general* right now, because the whole damn thing has been bullshitted to shreds by Intel's artificial segmentation policies.
                  On a case-specific basis, yeah: that part is worse - but at least those are where you have some chance (and hopefully, the time/resources) to investigate.

                  > It's like that quirk MMX had, where you needed to reset the FPU state when switching from MMX instructions to x87, except more subtle.

                  * PTSD intensifies...
                  Remember you could also have the rounding/precision behavior corrupted by OS interrupt handlers. Good times.
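
                  For the youngsters, a minimal sketch of that dance (illustrative fill kernel; note that on x86-64 plain doubles go through SSE, so the x87 hazard mostly bit 32-bit and long-double code):

                  #include <mmintrin.h>
                  #include <stdio.h>

                  int main(void) {
                      unsigned long long buf[4];
                      __m64 fill = _mm_set1_pi8(0x5A);   /* MMX-era memset-style fill */
                      for (int i = 0; i < 4; i++)
                          ((__m64 *)buf)[i] = fill;      /* MMX regs alias the x87 stack */
                      _mm_empty();                       /* EMMS: reset the x87 tag word */
                      double x = 1.0 / 3.0;              /* FP math is safe again */
                      printf("buf[0]=%016llx x=%f\n", buf[0], x);
                      return 0;
                  }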

                  > One of the best features of SSE was no longer having to do that.

                  Absolutely. SSE had its shortcomings, but at least it was sane. Now we're at ~8 versions (and counting) of AVX that were deliberately "broken" for the sake of getting a <1% higher score on an artificial benchmark, and may or may not be missing half the instructions you actually want.
                  Things were bad enough even when you "only" had to worry about 3 or 4 different tiers of support, but now we have sibling tiers rather than supersets, and so many of them it's just completely unmanageable.

                  If/since AMD is maintaining something very close to the full ISA, on all its chips, I think (or at the very least, hope) it has a significant chance of defining that as the "real" AVX512 ISA, and allowing developers to treat anything less as not supporting AVX512 at all. If AMD stumbles and decides to pull half the ISA from e.g. a "7400" low-end Zen4 chip to save die space I think that would be a significant mistake, but we'll see how it goes.

                  Comment


                  • #59
                    Originally posted by arQon View Post
                    > It's like that quirk MMX had, where you needed to reset the FPU state when switching from MMX instructions to x87, except more subtle.

                    * PTSD intensifies...
                    Remember you could also have the rounding/precision behavior corrupted by OS interrupt handlers. Good times.
                    😅

                    Originally posted by arQon View Post
                    If/since AMD is maintaining something very close to the full ISA, on all its chips, I think (or at the very least, hope) it has a significant chance of defining that as the "real" AVX512 ISA, and allowing developers to treat anything less as not supporting AVX512 at all.
                    Considering how many >= Skylake SP Intel server CPUs there are in the wild, I think that ship has sailed. glibc's hwcaps already defined x86-64 ISA feature levels, and v4 includes the following:
                    • AVX512F
                    • AVX512BW
                    • AVX512CD
                    • AVX512DQ
                    • AVX512VL

                    Not sure if that's the final list. I had a little trouble locating an authoritative source, so I just used: https://www.phoronix.com/news/Linux-...Feature-Levels
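
                    A quick way to sanity-check a machine against that list - a sketch assuming GCC/Clang, whose __builtin_cpu_supports really does accept these feature names (and, IIRC, newer GCC also takes the level name "x86-64-v4" directly):

                    #include <stdio.h>

                    int main(void) {
                        int v4 = __builtin_cpu_supports("avx512f")
                              && __builtin_cpu_supports("avx512bw")
                              && __builtin_cpu_supports("avx512cd")
                              && __builtin_cpu_supports("avx512dq")
                              && __builtin_cpu_supports("avx512vl");
                        printf("x86-64-v4 AVX-512 subset: %s\n",
                               v4 ? "present" : "absent");
                        return 0;
                    }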

                    Comment


                    • #60
                      Originally posted by coder View Post
                      Considering how many >= Skylake SP Intel server CPUs there are in the wild, I think that ship has sailed. glibc's hwcaps already defined x86-64 ISA feature levels, and v4 includes the following:
                      (snip)
                      Not sure if that's the final list.
                      It might be for glibc, but it certainly doesn't cover all the variations - and that's kind of my point...

                      In financial systems, the options are now either "We only buy this one specific SKU", or "We need 3 more expert devs *just to deal with this one aspect* of the code".

                      In games etc, either you put your codebase at risk of exploding on whichever of the *eight* AVX512 paths got least testing, or you drop support for anything except "Real 512", and NotActually-AVX512-Fuctardery gets the AVX2 path, the end.

                      This mess reminds me of the early days of GL, and having to maintain multiple codepaths and/or hacks for drivers that were nonconformant or hardware that was too limited to actually support what we wanted to do at all, or couldn't do it performantly. At least they were acting in good faith, even if they fell short.

                      This, though? No. This is not the same.

                      We shouldn't be wasting *untold thousands* of man hours reworking code just for the benefit of a trillion dollar company that *went out of its way* to screw everyone over. We should be beating it into Intel's sociopathic skull via the only means they care about - money - that we're tired of picking up their dogshit. The only way to do that, and to keep this from happening again and again, is to support Zen4 and any other complete AVX512 implementation, and let Intel's broken garbage ones go f**k themselves.

                      If that means some Intel customers get the short end of the stick and go with AMD next time instead because of it, that's how things *should* be, until and unless Intel starts to do better. If they won't, then let them choke to death on their own hubris.

                      Comment
