AMD Ryzen 7040 Series Shows Great AVX-512 Performance For Laptops / Mobile / Edge


  • #21
    Keep in mind, a quad-core will have more bandwidth per core, presuming roughly equivalent memory. But even with a bandwidth gap in AMD's favor, the Intel CPUs here are less susceptible to load/store bottlenecks, and AVX-512 is bandwidth-heavy.

    Having "full" AVX-512 and more bandwidth per core, and still getting such a low performance boost, indicates Intel's implementation is not very good, to say the least.



    • #22
      Originally posted by ddriver View Post
      Having "full" AVX-512 and more bandwidth per core, and still getting such a low performance boost, indicates Intel's implementation is not very good, to say the least.
      Well, it's lacking an extra FMA port that their server CPUs have. AMD Zen 4 has the same overall AVX-512 issue throughput as Intel server cores, except that AMD only has FMA on one port.

      Anyway, that might go some way toward explaining why AMD got a better speedup than the Intel cores. I think the other reason is that some of the benchmarks use bf16 instructions not supported by the older Intel AVX-512 implementations.


      Cases like that skew the geomean. I count about 9 of them.

      One reason it'd be interesting to see Alder Lake or Raptor Lake in these benchmarks is that Intel backported a couple of instructions like that to AVX. That should narrow the deficit from its lack of AVX-512.
      Last edited by coder; 14 July 2023, 08:32 AM.



      • #23
        It's incredible how much this instruction can affect the final result.



        • #24
          Originally posted by MorrisS. View Post
          It's incredible how much this instruction can affect the final result.
          Especially because Zen 4 has the same overall dispatch width for AVX/AVX2 and AVX-512: 1536 bits per cycle.

          A significant portion of the benefit comes from capabilities that are unique to AVX-512, like:

          VDPBF16PS — Calculate the dot product of two BFloat16 pairs and accumulate the result into one packed single-precision number.
          If you don't have that, then OpenVINO is going to do the computation using fp32. So, that's a clear case where you get about double the throughput.
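          To make the fp32-vs-bf16 throughput point concrete, here's a rough scalar model of what one fp32 lane of VDPBF16PS computes. This is a sketch, not the exact hardware semantics: the real instruction operates on full 512-bit registers and has its own rounding/denormal behavior; the helper names are mine.

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate an fp32 value to bfloat16 precision (keep the top 16 bits)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def vdpbf16ps_lane(acc: float, a: list, b: list) -> float:
    """One fp32 lane: dot product of a bf16 pair, accumulated into fp32."""
    return acc + to_bf16(a[0]) * to_bf16(b[0]) + to_bf16(a[1]) * to_bf16(b[1])
```

          A 512-bit VDPBF16PS covers 16 such lanes, i.e. 32 bf16 multiplies per instruction, versus 16 fp32 multiplies for a plain packed-fp32 FMA — which is where the "about double the throughput" comes from when bf16 precision suffices.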



          • #25
            Here's a bombshell:

            I computed the GeoMean* for all benchmarks, excluding the OpenVINO fp16 ones (which only benefited Zen 4) and the graphs that computed perf/W, and found the following relative speedups:
            • Core i7-1065G7 (Ice Lake): 49.0%
            • Core i7-1165G7 (Tiger Lake): 49.9%
            • Ryzen 7 7840U (Phoenix): 44.8%

            Wow. I figured they'd all be similar, but didn't expect an inversion. Not too surprising, if you think about it. Here's what I think explains it:
            • Zen 4 has 6 instruction issue ports for various vector operations, all usable whether or not you're using AVX-512. So, the main reason you'd expect it to benefit from AVX-512 would be more complex/sophisticated instructions that replace the work of more than a pair of AVX/AVX2 instructions.
            • The Ryzen CPU has 8 cores and is therefore already closer to being memory-bottlenecked, in the baseline benchmarks. Enabling AVX-512 just makes it that much more memory-bottlenecked.

            * I was also careful to compute the reciprocals, in the cases where lower -> better.
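            For anyone who wants to replicate this, a minimal sketch of the calculation (the function and tuple layout are my own, not from any script behind the article): each test's speedup ratio is new/old for higher-is-better scores, and old/new when lower is better, then take the geometric mean of the ratios.

```python
from math import prod

def geomean_speedup(results):
    """results: (baseline, with_avx512, lower_is_better) per benchmark."""
    ratios = [
        (base / new) if lower_better else (new / base)
        for base, new, lower_better in results
    ]
    # Geometric mean: n-th root of the product of the per-test ratios.
    return prod(ratios) ** (1.0 / len(ratios))
```

            This lets you mix a throughput score with a runtime in one mean, e.g. geomean_speedup([(100, 150, False), (10.0, 8.0, True)]).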



            • #26
              Originally posted by coder View Post
              Especially because Zen 4 has the same overall dispatch width for AVX/AVX2 and AVX-512: 1536 bits per cycle.

              A significant portion of the benefit comes from capabilities that are unique to AVX-512, like:

              VDPBF16PS — Calculate the dot product of two BFloat16 pairs and accumulate the result into one packed single-precision number.
              If you don't have that, then OpenVINO is going to do the computation using fp32. So, that's a clear case where you get about double the throughput.
              Shouldn't it apply AVX2 instructions when AVX-512 is missing?



              • #27
                Originally posted by drakonas777 View Post
                The implementation of Intel's hybrid architecture is not elegant, to put it politely. They should either make the E-cores fatter, or the P-cores less fat. Disabling AVX-512 is a workaround. Making the P-cores less fat is probably the better approach. AVX-512 is not that important on client platforms, and SMT/HT is not that important when there is a bunch of E-cores to handle highly parallel loads.
                Personally I want Intel to release a pure E-core processor for the desktop.

                There are rumors Intel will be releasing a very high core count, E-core-only Xeon. I want a desktop version; something like 50 E-cores would be perfect for my use cases, and I suspect for a lot of people's use cases.



                • #28
                  Originally posted by sophisticles View Post
                  Personally I want Intel to release a pure E-core processor for the desktop.
                  It's not quite the same thing, but the N-series can give you a taste of what it would be like. The N300 and N305 even have 8 Gracemont cores.

                  The downsides, as you probably know:
                  • BGA, not socketed.
                  • Only 1 memory channel.
                  • I/O is only PCIe 3.0 @ 9 lanes.
                  • Smaller iGPU, in some models.

                  So, not a bad option for powering mini-PCs, but not something you can pair with a dGPU or otherwise use in an I/O-heavy configuration.

                  Originally posted by sophisticles View Post
                  There are rumors Intel will be releasing a very high core count, E-core-only Xeon.
                  Sierra Forest is more than a rumor.

                  Originally posted by sophisticles View Post
                  I want a desktop version; something like 50 E-cores would be perfect for my use cases, and I suspect for a lot of people's use cases.
                  I've heard rumors of future desktop CPUs with up to 32 E-cores. With 8 P-cores, that would give you 48 threads.

                  Maybe it'd be pointless, though. 2-channel memory is already a big enough bottleneck with just 32 threads.
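                  To put rough numbers on that (hypothetical config I picked for illustration: dual-channel DDR5-5600 with 64-bit channels; sustained bandwidth is lower than this theoretical peak):

```python
# Hypothetical config: dual-channel DDR5-5600, 8 bytes per transfer per channel.
channels = 2
transfers_per_sec_m = 5600      # MT/s
bytes_per_transfer = 8          # 64-bit channel width

peak_gb_s = channels * transfers_per_sec_m * bytes_per_transfer / 1000
print(f"peak:           {peak_gb_s:.1f} GB/s")   # 89.6 GB/s theoretical peak
print(f"per thread @32: {peak_gb_s / 32:.2f} GB/s")
print(f"per thread @48: {peak_gb_s / 48:.2f} GB/s")
```

                  Going from 32 to 48 threads drops the already-thin per-thread share by another third, which is why piling on more E-cores behind the same two channels has diminishing returns.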



                  • #29
                    Originally posted by MorrisS. View Post
                    Shouldn't it apply AVX2 instructions when AVX-512 is missing?
                    Yes, but AVX2 doesn't have all of the corresponding instructions of AVX-512.

                    To partially plug the hole left by ripping out AVX-512, Intel added AVX-VNNI, but none of the CPUs in this comparison have those instructions.



                    • #30
                      More Insights

                      Following up on my earlier post, where I found AVX-512 benefited Intel's Ice Lake and Tiger Lake more than Phoenix (excluding the OpenVINO fp16 benchmarks), here are some other points of interest.

                      I computed GeoMean for each benchmark program, to see where AVX-512 helped the most/least.
                      Program               Benches   i7-1065G7   i7-1165G7   R7 7840U
                      Embree 4.1               3        1.070       1.117      1.189
                      OpenVKL 1.3.1            1        1.293       1.302      1.239
                      OSPRay 2.12              4        1.446       1.433      1.412
                      OSPRay Studio 0.11       6        1.044       1.105      1.108
                      oneDNN 3.1               5        1.694       1.686      1.561
                      Cpuminer-Opt 3.20.3      8        1.876       1.915      1.473
                      OpenVINO 2022.3          8        1.750       1.755      1.635
                      miniBUDE 20210901        2        1.288       1.233      1.280
                      libxsmm 2-1.17-3645      1        1.065       0.986      1.082
                      TensorFlow 2.12          6        1.550       1.509      1.858

                      The Benches column indicates how many benchmarks that program had, which determines how strongly its performance is weighted in the final GeoMean. It also shows how one could tilt the test suite to swing the final results one way or the other: more Cpuminer-Opt tests would make the Intel CPUs' AVX-512 support look a lot better, while more TensorFlow tests would benefit AMD's Phoenix.
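                      As a sketch of how that weighting works (the figures are just the i7-1065G7 column from the table above; the helper function is mine, not from the original analysis), a count-weighted geometric mean of the per-program speedups recovers an overall number:

```python
from math import prod

# i7-1065G7 per-program AVX-512 speedups and benchmark counts, from the table.
i7_1065g7 = {
    "Embree 4.1":          (3, 1.070),
    "OpenVKL 1.3.1":       (1, 1.293),
    "OSPRay 2.12":         (4, 1.446),
    "OSPRay Studio 0.11":  (6, 1.044),
    "oneDNN 3.1":          (5, 1.694),
    "Cpuminer-Opt 3.20.3": (8, 1.876),
    "OpenVINO 2022.3":     (8, 1.750),
    "miniBUDE 20210901":   (2, 1.288),
    "libxsmm 2-1.17-3645": (1, 1.065),
    "TensorFlow 2.12":     (6, 1.550),
}

def weighted_geomean(data):
    """Geomean where each program counts once per benchmark it contributes."""
    total = sum(n for n, _ in data.values())
    return prod(s ** n for n, s in data.values()) ** (1.0 / total)

# A program with more benchmarks pulls the overall number proportionally
# harder -- exactly how extra Cpuminer-Opt or TensorFlow tests would swing it.
print(f"{weighted_geomean(i7_1065g7):.3f}")
```

                      This should land near the ~49% overall speedup quoted a few posts up, since it's the same per-benchmark data grouped by program.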

                      BTW, a key part of my sanity-checks, to make sure I hadn't made any data-entry errors, was to look at the per-test speedup and scan for any outliers.
                      Last edited by coder; 15 July 2023, 11:29 PM.
