AMD Ryzen 7040 Series Shows Great AVX-512 Performance For Laptops / Mobile / Edge


  • coder
    replied
    Here's a bombshell:

    I computed the GeoMean* for all benchmarks, except the OpenVINO fp16 ones that only benefited Zen 4 and the graphs which computed perf/W, and found the following relative speedups:
    • Core i7-1065G7 (Ice Lake): 49.0%
    • Core i7-1165G7 (Tiger Lake): 49.9%
    • Ryzen 7 7840U (Phoenix): 44.8%

    Wow. I figured they'd all be similar, but didn't expect an inversion. Not too surprising, if you think about it. Here's what I think explains it:
    • Zen 4 has 6 instruction issue ports for various vector operations, all usable whether or not you're using AVX-512. So, the main reason you'd expect it to benefit from AVX-512 would be more complex/sophisticated instructions that replace the work of more than a pair of AVX/AVX2 instructions.
    • The Ryzen CPU has 8 cores and is therefore already closer to being memory-bottlenecked, in the baseline benchmarks. Enabling AVX-512 just makes it that much more memory-bottlenecked.

    * I was also careful to compute the reciprocals, in the cases where lower -> better.
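
    For anyone who wants to redo the calculation, here is a minimal sketch of that kind of geomean. The function name and the sample numbers are made up for illustration; they are not values from the article.

```python
from statistics import geometric_mean

def geomean_speedup(results):
    """results: iterable of (baseline, avx512, lower_is_better) tuples."""
    ratios = []
    for baseline, avx512, lower_is_better in results:
        if lower_is_better:
            # Time-like metric: take the reciprocal so a ratio > 1 always means "faster".
            ratios.append(baseline / avx512)
        else:
            ratios.append(avx512 / baseline)
    return geometric_mean(ratios)

# Hypothetical data: one throughput test (higher is better), one timing test (lower is better).
sample = [(100.0, 150.0, False), (10.0, 5.0, True)]
print(f"relative speedup: {(geomean_speedup(sample) - 1) * 100:.1f}%")
```

    Without the reciprocal step, a timing benchmark that got twice as fast would enter the mean as 0.5 instead of 2.0 and drag the result down.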


  • coder
    replied
    Originally posted by MorrisS.:
    That's incredible how much this instruction can affect the final result.
    Especially because Zen 4 has the same overall dispatch width for AVX/AVX2 and AVX-512. 1536 bits per cycle.
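
    The arithmetic behind that figure, as a rough sketch. It takes the 6 vector issue ports mentioned elsewhere in this thread and assumes each one is 256 bits wide, with a 512-bit operation occupying two 256-bit slots. Both of those are assumptions, not something the article states.

```python
VECTOR_PORTS = 6        # vector issue ports, as quoted elsewhere in this thread
PORT_WIDTH_BITS = 256   # assumption: each port handles 256 bits per cycle

dispatch_bits = VECTOR_PORTS * PORT_WIDTH_BITS

# Assumption: a 512-bit operation takes two 256-bit slots, so the total
# number of bits dispatched per cycle is the same either way.
avx2_ops_per_cycle = dispatch_bits // 256
avx512_ops_per_cycle = dispatch_bits // 512

print(dispatch_bits, avx2_ops_per_cycle, avx512_ops_per_cycle)
```

    Under this model, wider registers alone buy nothing in raw dispatch width. Any gain has to come from instructions that do more work per bit.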

    A significant portion of the benefit comes from capabilities that are unique to AVX-512, like:

    VDPBF16PS: Calculate dot product of two Bfloat16 pairs and accumulate the result into one packed single precision number
    If you don't have that, then OpenVINO is going to do the computation using fp32. So, that's a clear case where you get about double the throughput.
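
    To make that concrete, here is a rough Python emulation of what the instruction does per lane. The helper names are hypothetical, bf16 conversion is done by plain truncation, and the accumulation runs in Python doubles, so the rounding will not match real hardware exactly.

```python
import struct

def to_bf16(x):
    """Truncate a float to bfloat16 precision (keep the top 16 bits of its fp32 encoding)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def dpbf16ps(acc, a, b):
    """acc: list of fp32 lanes; a, b: lists with two bf16 values per lane of acc."""
    out = []
    for i, lane in enumerate(acc):
        # Each fp32 lane accumulates the dot product of one pair of bf16 values.
        pair_sum = (to_bf16(a[2 * i]) * to_bf16(b[2 * i])
                    + to_bf16(a[2 * i + 1]) * to_bf16(b[2 * i + 1]))
        out.append(lane + pair_sum)
    return out

# Elements per 512-bit register: twice as many bf16 inputs as fp32 inputs.
FP32_PER_ZMM = 512 // 32
BF16_PER_ZMM = 512 // 16

print(dpbf16ps([1.0, 0.0], [1.0, 2.0, 0.5, 4.0], [3.0, 1.0, 2.0, 0.25]))
print(FP32_PER_ZMM, BF16_PER_ZMM)
```

    The element counts are where the "about double" comes from: one 512-bit register feeds 32 bf16 multiplies instead of 16 fp32 ones.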


  • MorrisS.
    replied
    That's incredible how much this instruction can affect the final result.


  • coder
    replied
    Originally posted by ddriver:
    Having "full" avx 512 and more bandwidth per core and still getting such low perf boost.... indicates intel's implementation is not very good to say the least.
    Well, it's lacking an extra FMA port that their server CPUs have. AMD Zen 4 has the same overall AVX-512 issue throughput as Intel server cores, except that AMD only has FMA on one port.

    Anyway, that might go some ways towards explaining the reason AMD got better speedup than the Intel cores. I think the other reason is that some of the benchmarks use bf16 instructions not supported on the older Intel AVX-512 implementations.


    Cases like that skew the geomean. I count about 9 of them.

    One reason it'd be interesting to see Alder Lake or Raptor Lake in these benchmarks is that Intel backported a couple instructions like that to AVX. That should narrow the deficit from its lack of AVX-512.
    Last edited by coder; 14 July 2023, 08:32 AM.


  • ddriver
    replied
    Keep in mind, a quad core will have more bandwidth per core, presuming roughly equivalent memory. But even with a gap in the bandwidth in favor of amd, the intel cpus here are less susceptible to load store bottlenecks, and avx 512 is bandwidth-heavy.

    Having "full" avx 512 and more bandwidth per core and still getting such low perf boost.... indicates intel's implementation is not very good to say the least.


  • avis
    replied
    Originally posted by cbxbiker61:

    Yeah, up until my latest notebook I was always just swapping out the wi-fi cards to Intel. My latest AMD notebook has a Realtek RTL8852AE in it that seems to work great with current Linux kernels. It's always a shame to have to open up a notebook computer to swap wi-fi cards but quite often it's worth it.
    Don't buy laptops with Mediatek M.2 Wi-Fi/Bluetooth modules. They simply do not work. Damn.


  • drakonas777
    replied
    The implementation of Intel's hybrid architecture is not elegant, to put it politely. They should either make the E cores fatter or the P cores less fat. Disabling AVX512 is a workaround. Making the P cores less fat is probably the better approach. AVX512 is not that important on client platforms, and SMT/HT is not that important when there is a bunch of E cores to handle highly parallel loads.


  • coder
    replied
    Originally posted by stormcrow:
    It doesn't matter if the efficiency cores have the extension or not.
    Sure it does. A naive program (which is most of them) spawns one thread for every hardware thread and oversubscribes the P-cores. Contention for those P-cores causes excessive context switches, cache thrashing, and more latency on thread-to-thread communication or around resources under heavy contention. Meanwhile, the E-cores (which are each more than half as fast as P-cores) twiddle their thumbs.

    Furthermore, if you compared performance of the P-cores running AVX-512 workloads vs. all cores running AVX2 workloads, the all-core scenario would average the same or better performance. Even in this article, AVX-512 benefited Ice Lake by 34.5%, Tiger Lake by 34.1%, and Phoenix by 54.2%. On floating point workloads, E-cores have been shown to be about 54% as fast as a P-core, on Alder Lake, making their decision probably either a win or a wash. However, Raptor Lake's 2x E-core count makes it a decisive win for them. And that's only considering the impact on workloads which could benefit significantly from AVX-512, which is the minority of client workloads.
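
    Here's a back-of-the-envelope version of that comparison. It assumes the top desktop configurations (8P+8E for Alder Lake, 8P+16E for Raptor Lake), a perfectly parallel workload, no SMT, and that the AVX-512 speedups measured on other CPUs would carry over to Alder Lake's P-cores. All of that is simplification on my part, so treat it as a sketch.

```python
P_CORES = 8
E_CORE_RELATIVE = 0.54  # E-core FP throughput relative to one P-core (figure quoted above)

def p_only_avx512(speedup):
    """Throughput in P-core units: only the P-cores run, with the given AVX-512 speedup."""
    return P_CORES * (1.0 + speedup)

def all_cores_avx2(e_cores):
    """Throughput in P-core units: every core runs the AVX2 code path."""
    return P_CORES + e_cores * E_CORE_RELATIVE

# AVX-512 speedups quoted above, applied hypothetically to Alder Lake P-cores.
for name, speedup in [("Ice Lake", 0.345), ("Tiger Lake", 0.341), ("Phoenix", 0.542)]:
    print(f"{name}-like speedup, P-cores only: {p_only_avx512(speedup):.2f}")

print(f"Alder Lake 8P+8E, all cores AVX2:   {all_cores_avx2(8):.2f}")
print(f"Raptor Lake 8P+16E, all cores AVX2: {all_cores_avx2(16):.2f}")
```

    With the Intel-like speedups, the all-core AVX2 case comes out ahead even with 8 E-cores. With the Phoenix-like speedup it's roughly a tie at 8 E-cores and a clear win at 16.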

    Moreover, the E-cores aren't just a cost-efficient way to add performance, they're also energy-efficient. The addition of more E-cores is a big reason for Raptor Lake's improved efficiency. That's especially important, when Intel is on a less-efficient process node than AMD.

    Originally posted by stormcrow:
    There's no technical reason the AVX-512 extensions couldn't be restricted to performance settings.
    Technically plausible, but too often would result in performance regressions, as I said.

    People only look at the plausibility aspect and don't think hard enough about the second part. There's a lot of fuzzy thinking, around this whole issue. It's mostly people who are mad because they feel like they're losing something and that feeling of aggrievement clouds their ability to think about the big picture.

    Originally posted by stormcrow:
    Anux The article in the link given was talking about a 7840 (eight not zero) model AMD CPU not a 7040 as in this article, or perhaps I misread it.
    This is a case of confusion over AMD's nomenclature. What they mean isn't 7040, but rather 7x40. They do the same thing with EPYC model numbers, talking about 7002 to mean Rome, 7003 to mean Milan, etc. In those cases, only the first and last digits count. In this case, it's the first one and the last two. I blame AMD for creating this confusion.

    Originally posted by pWe00Iri3e7Z9lHOX2Qx:
    Something like "7X40" might be more clear, but they already use X a lot.
    Well, lower-case x would be better, but they could use n, #, _, or even *. Okay, * might be too easily confused with multiplication, but what about:

    7n40 ?


    Nah, I still like 7x40 better.
    Last edited by coder; 14 July 2023, 05:35 AM.


  • pWe00Iri3e7Z9lHOX2Qx
    replied
    Originally posted by stormcrow:
    Anux The article in the link given was talking about a 7840 (eight not zero) model AMD CPU not a 7040 as in this article, or perhaps I misread it.
    This one is a 7840. It's somewhat confusing, but they call this series of processors "7040", which sounds more like a specific SKU. Something like "7X40" might be more clear, but they already use X a lot.


  • stormcrow
    replied
    Originally posted by coder:
    It's not artificial differentiation. It's simply that their E-cores don't have AVX-512. Intel decided adding it would make them too big and hurt other use cases for those cores (i.e. Sierra Forest). Going hybrid-ISA would open a can of worms and introduce plenty of performance regressions vs. simply disabling it.

    I'm sure they evaluated the hybrid-ISA option, as evidenced by the fact that they didn't fuse off the functionality in early Alder Lakes. The only reason to leave it accessible was so they could experiment with hybrid-ISA, and I think they made the right choice not to go down that rabbit hole.
    That's an artificial distinction. It doesn't matter if the efficiency cores have the extension or not. This has been hashed out ad nauseam in many places. There's no technical reason the AVX-512 extensions couldn't be restricted to performance settings. It is, after all, a performance enhancement. The reasons given are artificial. I think it was the wrong choice because they've already done so, unofficially.

    Anux The article in the link given was talking about a 7840 (eight not zero) model AMD CPU not a 7040 as in this article, or perhaps I misread it.
