AMD Ryzen 7040 Series Shows Great AVX-512 Performance For Laptops / Mobile / Edge


  • #21
    Keep in mind, a quad-core will have more bandwidth per core, presuming roughly equivalent memory. But even with a bandwidth gap in AMD's favor, the Intel CPUs here are less susceptible to load/store bottlenecks, and AVX-512 is bandwidth-heavy.

    Having "full" AVX-512 and more bandwidth per core, and still getting such a low performance boost, indicates Intel's implementation is not very good, to say the least.



    • #22
      Originally posted by ddriver View Post
      Having "full" AVX-512 and more bandwidth per core, and still getting such a low performance boost, indicates Intel's implementation is not very good, to say the least.
      Well, it's lacking an extra FMA port that their server CPUs have. AMD Zen 4 has the same overall AVX-512 issue throughput as Intel server cores, except that AMD only has FMA on one port.

      Anyway, that might go some way toward explaining why AMD got a better speedup than the Intel cores. I think the other reason is that some of the benchmarks use bf16 instructions not supported by the older Intel AVX-512 implementations.


      Cases like that skew the geomean. I count about 9 of them.

      One reason it'd be interesting to see Alder Lake or Raptor Lake in these benchmarks is that Intel backported a couple of instructions like that to AVX. That should narrow the deficit from its lack of AVX-512.
      Last edited by coder; 14 July 2023, 08:32 AM.



      • #23
        It's incredible how much this instruction can affect the final result.



        • #24
          Originally posted by MorrisS. View Post
          It's incredible how much this instruction can affect the final result.
          Especially because Zen 4 has the same overall dispatch width for AVX/AVX2 and AVX-512: 1536 bits per cycle.

          A significant portion of the benefit comes from capabilities that are unique to AVX-512, like:

          VDPBF16PS — Calculate the dot product of two BFloat16 pairs and accumulate the result into one packed single-precision number.
          If you don't have that, then OpenVINO is going to do the computation using fp32. So, that's a clear case where you get about double the throughput.
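          To make the fp32-vs-bf16 throughput point concrete, here's a rough scalar model of what one fp32 lane of VDPBF16PS computes. This is a sketch, not the exact hardware semantics: the real instruction operates on full 512-bit registers and has its own rounding/denormal behavior; the helper names are mine.

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate an fp32 value to bfloat16 precision (keep the top 16 bits)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def vdpbf16ps_lane(acc: float, a: list, b: list) -> float:
    """One fp32 lane: dot product of a bf16 pair, accumulated into fp32."""
    return acc + to_bf16(a[0]) * to_bf16(b[0]) + to_bf16(a[1]) * to_bf16(b[1])
```

          A 512-bit VDPBF16PS covers 16 such lanes, i.e. 32 bf16 multiplies per instruction, versus 16 fp32 multiplies for a plain packed-fp32 FMA — which is where the "about double the throughput" comes from when bf16 precision suffices.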



          • #25
            Here's a bombshell:

            I computed the GeoMean* for all benchmarks, excluding the OpenVINO fp16 ones (which only benefited Zen 4) and the graphs that computed perf/W, and found the following relative speedups:
            • Core i7-1065G7 (Ice Lake): 49.0%
            • Core i7-1165G7 (Tiger Lake): 49.9%
            • Ryzen 7 7840U (Phoenix): 44.8%

            Wow. I figured they'd all be similar, but didn't expect an inversion. Not too surprising, if you think about it. Here's what I think explains it:
            • Zen 4 has 6 instruction issue ports for various vector operations, all usable whether or not you're using AVX-512. So, the main reason you'd expect it to benefit from AVX-512 would be more complex/sophisticated instructions that replace the work of more than a pair of AVX/AVX2 instructions.
            • The Ryzen CPU has 8 cores and is therefore already closer to being memory-bottlenecked, in the baseline benchmarks. Enabling AVX-512 just makes it that much more memory-bottlenecked.

            * I was also careful to compute the reciprocals, in the cases where lower -> better.
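            For anyone who wants to replicate this, a minimal sketch of the calculation (the function and tuple layout are my own, not from any script behind the article): each test's speedup ratio is new/old for higher-is-better scores, and old/new when lower is better, then take the geometric mean of the ratios.

```python
from math import prod

def geomean_speedup(results):
    """results: (baseline, with_avx512, lower_is_better) per benchmark."""
    ratios = [
        (base / new) if lower_better else (new / base)
        for base, new, lower_better in results
    ]
    # Geometric mean: n-th root of the product of the per-test ratios.
    return prod(ratios) ** (1.0 / len(ratios))
```

            This lets you mix a throughput score with a runtime in one mean, e.g. geomean_speedup([(100, 150, False), (10.0, 8.0, True)]).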



            • #26
              Originally posted by coder View Post
              Especially because Zen 4 has the same overall dispatch width for AVX/AVX2 and AVX-512: 1536 bits per cycle.

              A significant portion of the benefit comes from capabilities that are unique to AVX-512, like:

              VDPBF16PS — Calculate the dot product of two BFloat16 pairs and accumulate the result into one packed single-precision number.
              If you don't have that, then OpenVINO is going to do the computation using fp32. So, that's a clear case where you get about double the throughput.
              Shouldn't it apply AVX2 instructions when AVX-512 is missing?



              • #27
                Originally posted by drakonas777 View Post
                The implementation of Intel's hybrid architecture is not elegant, to put it politely. They should either make the E-cores fatter, or the P-cores less fat. Disabling AVX-512 is a workaround. Making the P-cores less fat is probably the better approach. AVX-512 is not that important on client platforms, and SMT/HT is not that important when there is a bunch of E-cores to handle highly parallel loads.
                Personally I want Intel to release a pure E-core processor for the desktop.

                There are rumors Intel will be releasing a very high core count, E-core-only Xeon. I want a desktop version; something like 50 E-cores would be perfect for my use cases, and I suspect for a lot of people's use cases.



                • #28
                  Originally posted by sophisticles View Post
                  Personally I want Intel to release a pure E-core processor for the desktop.
                  It's not quite the same thing, but the N-series can give you a taste of what it would be like. The N300 and N305 even have 8 Gracemont cores.

                  The downsides, as you probably know:
                  • BGA, not socketed.
                  • Only 1 memory channel.
                  • I/O is only PCIe 3.0 @ 9 lanes.
                  • Smaller iGPU, in some models.

                  So, not a bad option for powering mini-PCs, but not something you can pair with a dGPU or otherwise use in an I/O-heavy configuration.

                  Originally posted by sophisticles View Post
                  There are rumors Intel will be releasing a very high core count, E-core-only Xeon.
                  Sierra Forest is more than a rumor.

                  Originally posted by sophisticles View Post
                  I want a desktop version; something like 50 E-cores would be perfect for my use cases, and I suspect for a lot of people's use cases.
                  I've heard rumors of future desktop CPUs with up to 32 E-cores. With 8 P-cores, that would give you 48 threads.

                  Maybe it'd be pointless, though. 2-channel memory is already a big enough bottleneck with just 32 threads.
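                  To put rough numbers on that (hypothetical config I picked for illustration: dual-channel DDR5-5600 with 64-bit channels; sustained bandwidth is lower than this theoretical peak):

```python
# Hypothetical config: dual-channel DDR5-5600, 8 bytes per transfer per channel.
channels = 2
transfers_per_sec_m = 5600      # MT/s
bytes_per_transfer = 8          # 64-bit channel width

peak_gb_s = channels * transfers_per_sec_m * bytes_per_transfer / 1000
print(f"peak:           {peak_gb_s:.1f} GB/s")   # 89.6 GB/s theoretical peak
print(f"per thread @32: {peak_gb_s / 32:.2f} GB/s")
print(f"per thread @48: {peak_gb_s / 48:.2f} GB/s")
```

                  Going from 32 to 48 threads drops the already-thin per-thread share by another third, which is why piling on more E-cores behind the same two channels has diminishing returns.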



                  • #29
                    Originally posted by MorrisS. View Post
                    Shouldn't it apply AVX2 instructions when AVX-512 is missing?
                    Yes, but AVX2 doesn't have all of the corresponding instructions of AVX-512.

                    To partially plug the hole left by ripping out AVX-512, Intel added AVX-VNNI, but none of the CPUs in this comparison have those instructions.



                    • #30
                      More Insights

                      Following up on my earlier post, where I found AVX-512 benefited Intel's Ice Lake and Tiger Lake more than Phoenix (excluding the OpenVINO fp16 benchmarks), here are some other points of interest.

                      I computed GeoMean for each benchmark program, to see where AVX-512 helped the most/least.
                      Program               Benches   i7-1065G7   i7-1165G7   R7 7840U
                      Embree 4.1               3        1.070       1.117      1.189
                      OpenVKL 1.3.1            1        1.293       1.302      1.239
                      OSPRay 2.12              4        1.446       1.433      1.412
                      OSPRay Studio 0.11       6        1.044       1.105      1.108
                      oneDNN 3.1               5        1.694       1.686      1.561
                      Cpuminer-Opt 3.20.3      8        1.876       1.915      1.473
                      OpenVINO 2022.3          8        1.750       1.755      1.635
                      miniBUDE 20210901        2        1.288       1.233      1.280
                      libxsmm 2-1.17-3645      1        1.065       0.986      1.082
                      TensorFlow 2.12          6        1.550       1.509      1.858

                      The Benches column indicates how many benchmarks that program had, which determines how strongly its performance is weighted in the final GeoMean. It also shows how one could tilt the test suite to swing the final results one way or the other: more Cpuminer-Opt tests would make the Intel CPUs' AVX-512 support look a lot better, while more TensorFlow tests would benefit AMD's Phoenix.
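                      As a sketch of how that weighting works (the figures are just the i7-1065G7 column from the table above; the helper function is mine, not from the original analysis), a count-weighted geometric mean of the per-program speedups recovers an overall number:

```python
from math import prod

# i7-1065G7 per-program AVX-512 speedups and benchmark counts, from the table.
i7_1065g7 = {
    "Embree 4.1":          (3, 1.070),
    "OpenVKL 1.3.1":       (1, 1.293),
    "OSPRay 2.12":         (4, 1.446),
    "OSPRay Studio 0.11":  (6, 1.044),
    "oneDNN 3.1":          (5, 1.694),
    "Cpuminer-Opt 3.20.3": (8, 1.876),
    "OpenVINO 2022.3":     (8, 1.750),
    "miniBUDE 20210901":   (2, 1.288),
    "libxsmm 2-1.17-3645": (1, 1.065),
    "TensorFlow 2.12":     (6, 1.550),
}

def weighted_geomean(data):
    """Geomean where each program counts once per benchmark it contributes."""
    total = sum(n for n, _ in data.values())
    return prod(s ** n for n, s in data.values()) ** (1.0 / total)

# A program with more benchmarks pulls the overall number proportionally
# harder -- exactly how extra Cpuminer-Opt or TensorFlow tests would swing it.
print(f"{weighted_geomean(i7_1065g7):.3f}")
```

                      This should land near the ~49% overall speedup quoted a few posts up, since it's the same per-benchmark data grouped by program.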

                      BTW, a key part of my sanity-checks, to make sure I hadn't made any data-entry errors, was to look at the per-test speedup and scan for any outliers.
                      Last edited by coder; 15 July 2023, 11:29 PM.
