AMD Zen 4 AVX-512 Performance Analysis On The Ryzen 9 7950X


  • qarium
    replied
    Originally posted by bridgman View Post
    It's not half-baked, it's half-sized and perfectly baked
    In fairness we do also have multiple FP execution ports/pipes (6 vs 3 for Golden Cove) so there can still be a lot of work happening in parallel.
    For me it's really funny that Intel failed at AVX-512 so many times while AMD just did it right.

    On other tasks, like SGX, Intel failed too...

    And the Intel Arc GPUs, in my view, failed too.

    Really, Intel would be doing better by just licensing the AMD version of AVX-512, and by licensing the RDNA3 design to ...

    Same for Apple... the Apple M1/M2 SoCs could be much better, in my view, if Apple just licensed RDNA3,...

    I read some benchmarks of the RDNA2 GPU in Samsung's ARM SoCs vs. Qualcomm's old Adreno (old ATI GPU tech),

    and the RDNA2 license pays for itself: both have the same performance and the same power consumption, but the RDNA2 Samsung chip has far fewer transistors and a much higher clock speed (1400 MHz on the Samsung SoC versus only around 750 MHz on the Qualcomm SoC). That means the RDNA2 license pays for itself on transistor count alone.

    Qualcomm could produce SoCs with the same performance using far fewer transistors. RDNA2 also has more features, like ray-tracing acceleration hardware...

    That means we have smart companies like Samsung that just get an RDNA2 license, and then we have stupid companies like Intel that fail with their own designs...

    Also, if Apple licensed the RDNA3 design, their Linux support would instantly be much better, because the open-source driver is already done.



  • ms178
    replied
    coder MadCatX

    Here you go:

    AVX-512 was a hotly discussed topic around the launch of the new Intel Alder Lake CPUs. At first it was said that the P cores supported it in principle, but in...



  • ms178
    replied
    Originally posted by coder View Post
    HSA was a nice dream, but it never gained the necessary industry momentum. I think some of its advantages still live on in the form of ROCm, which I believe was architected to support it. Perhaps bridgman can say more about that.

    BTW, OpenCL 2.0 has a feature called SVM (Shared Virtual Memory), which I believe is cache-coherent. Also, CXL supports cache-coherency at the interconnect protocol level.
    I wouldn't characterize it as a dream, as it was a key selling point for their hardware for a long time, and I am still waiting to see the vision they promised come to fruition eventually. HSA itself did not gain any industry traction, but at least some key technologies that further that vision are now standardized across the industry (e.g. CXL), and I hope the software side will also get better, with future language standards incorporating some key elements. While SVM in OpenCL supports both coarse-grained and fine-grained shared virtual memory, the implementations that matter to the market only support the former, right? I haven't checked that in a long time, though, and I was under the impression that this limitation restricts the practicality of the feature for some workloads quite a bit. I am also not aware of any commonly used software making use of it. But maybe you know some examples?
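
    For reference, OpenCL lets an application query exactly which SVM granularity a device implements, so the coarse-vs-fine question can be checked per device. A minimal sketch, assuming OpenCL 2.0 headers and runtime (error handling omitted):

    ```c
    /* Query which shared-virtual-memory granularities a device supports. */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_device_svm_capabilities caps = 0;
        clGetDeviceInfo(device, CL_DEVICE_SVM_CAPABILITIES,
                        sizeof(caps), &caps, NULL);

        printf("coarse-grain buffer: %s\n",
               (caps & CL_DEVICE_SVM_COARSE_GRAIN_BUFFER) ? "yes" : "no");
        printf("fine-grain buffer:   %s\n",
               (caps & CL_DEVICE_SVM_FINE_GRAIN_BUFFER) ? "yes" : "no");
        printf("fine-grain system:   %s\n",
               (caps & CL_DEVICE_SVM_FINE_GRAIN_SYSTEM) ? "yes" : "no");
        return 0;
    }
    ```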



  • coder
    replied
    Originally posted by MadCatX View Post
    Yes but there is nothing better to compare against.
    There were no other socketed, mainstream desktop CPUs with it. However, Intel had some HEDT (socket 2066, I think) CPUs with it, which launched back in 2017 or so.

    Better yet would be to compare it with a 65 W Tiger Lake H-based NUC Extreme. The H's are 8-core and basically look like mainstream desktop CPUs that got cancelled, either because of power/clock limitations in Intel's 10 nm SF process (now known as "Intel 10") or because of yield problems. Or maybe they simply ran too close to the launch window of Alder Lake.

    Originally posted by MadCatX View Post
    Are there any ADL benchmarks with its AVX512 haxxored enabled?
    I agree that it would be ideal to compare it against an initial-stepping Alder Lake CPU, on a motherboard which allows its AVX-512 to be enabled. I'm not sure if Michael has such a setup, however. Perhaps someone has uploaded these results to the OpenBenchmarking database, although we don't necessarily know what their OC configuration and cooling setup is like.



  • coder
    replied
    Originally posted by carewolf View Post
    Personally, I feel AVX-512 would be better if we just forced it to work only in 128-bit and 256-bit mode and only used the new, improved instructions. The last downside AMD can't get rid of is the 4x more register bits to save for the kernel (twice as many registers, twice as wide).
    There's no going back. At least, not part-way back, like what you describe. But register bits and context size will be more easily accommodated by ever-shrinking process nodes, so it's much less of an issue than it was with Intel's decision to implement it on their 14 nm node.

    AMD's implementation is interesting when you put it next to ARM's recent announcement of the Neoverse V2. The V1 had 2x 256-bit SVE, but the V2 will feature 4x 128-bit SVE2. I wonder if this represents a growing industry consensus that ultra-wide SIMD isn't a good fit for general-purpose CPUs. Or maybe it'll just turn out to be a speed bump.



  • coder
    replied
    Originally posted by ms178 View Post
    Delivering OpenCL, OpenMP and HIP is not the same performance-wise, as HSA tremendously cut the overhead of launching kernels, offered cache-coherency between CPU and GPU, and promised to ease implementation for programmers: https://ieeexplore.ieee.org/document/7482093
    HSA was a nice dream, but it never gained the necessary industry momentum. I think some of its advantages still live on in the form of ROCm, which I believe was architected to support it. Perhaps bridgman can say more about that.

    BTW, OpenCL 2.0 has a feature called SVM (Shared Virtual Memory), which I believe is cache-coherent. Also, CXL supports cache-coherency at the interconnect protocol level.
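
    To make the SVM point concrete, here's a minimal sketch of allocating fine-grained SVM memory; the helper name is mine, and it assumes a cl_context already created on an OpenCL 2.0 device:

    ```c
    /* Fine-grained SVM: host and kernels can dereference the same pointer
       without clEnqueueSVMMap/Unmap; coarse-grained SVM would need those
       map calls around every host access. Free with clSVMFree(ctx, ptr). */
    #include <CL/cl.h>

    float *make_shared_buffer(cl_context ctx, size_t n) {
        return (float *)clSVMAlloc(ctx,
                                   CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                   n * sizeof(float),
                                   0 /* 0 = default alignment */);
    }
    ```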



  • ddriver
    replied
    Those are very good results for "firmware" AVX-512, just like Zen 1 did pretty well at 256-bit with 128-bit units.

    There's no sign of throttling, unlike Intel, which throttled significantly way back with AVX2.



  • MadCatX
    replied
    Originally posted by coder View Post
    It really depends on your workload. If you're running an AVX-512 heavy workload, then it was always a performance and efficiency win! Even on 14 nm, and even in spite of the down-clocking!
    If your entire hot path consists of vectorized code that can take 512-bit-wide data, then yes. But outside of that one AnandTech benchmark and some scientific calculations (which are better run on GPUs anyway), there aren't many programs that can do that. Of the SIMD code I've written, I can think of perhaps one or two loops where AVX-512 might help.
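
    For illustration, this is the shape of loop that does benefit; a hedged sketch (the function name and the n % 16 == 0 simplification are mine, and it assumes AVX-512F support):

    ```c
    /* The kind of loop where 512-bit SIMD shines: pure streaming math. */
    #include <stddef.h>
    #include <immintrin.h>

    void saxpy_avx512(float a, const float *x, float *y, size_t n) {
        __m512 va = _mm512_set1_ps(a);           /* broadcast a into 16 lanes */
        for (size_t i = 0; i < n; i += 16) {     /* 16 floats per iteration   */
            __m512 vx = _mm512_loadu_ps(x + i);
            __m512 vy = _mm512_loadu_ps(y + i);
            _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy)); /* y=a*x+y */
        }
    }
    ```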

    Originally posted by coder View Post
    Where AVX-512 got into trouble was in workloads that used it for around 10% - 20% of the instructions, which was enough to trigger significant downclocking but not enough that it could compensate with its greater throughput. I experienced this, first hand. When we recompiled with AVX-512 completely disabled, we got higher overall throughput in my software.
    Which is very likely what a lot of commercial software will do - games being a likely candidate that everyone will benchmark - and that is why AMD's take on AVX-512 might perform better in a lot of real-world scenarios. One common middle ground is sketched below.
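
    (A sketch of that middle ground, not taken from anyone's actual codebase: compile only the hot kernel for AVX-512 and dispatch at runtime, so a modest AVX-512 instruction mix never drags the rest of the binary's clocks down. Both the target attribute and __builtin_cpu_supports are GCC/Clang features.)

    ```c
    /* Per-function ISA targeting plus runtime dispatch: only scale_avx512()
       is compiled with 512-bit instructions enabled; the rest of the
       program keeps the baseline ISA of the build. */
    #include <stddef.h>

    __attribute__((target("avx512f")))
    static void scale_avx512(float *y, float a, size_t n) {
        for (size_t i = 0; i < n; ++i)   /* auto-vectorized with zmm here */
            y[i] *= a;
    }

    static void scale_baseline(float *y, float a, size_t n) {
        for (size_t i = 0; i < n; ++i)   /* baseline (e.g. AVX2) codegen  */
            y[i] *= a;
    }

    void scale(float *y, float a, size_t n) {
        if (__builtin_cpu_supports("avx512f"))
            scale_avx512(y, a, n);
        else
            scale_baseline(y, a, n);
    }
    ```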

    Originally posted by coder View Post
    You realize you're comparing an Intel 14 nm CPU with a TSMC N5 one, right? Rocket Lake's efficiency was always a joke. A very bad joke. To make matters worse, they solved the AVX-512 clock penalty by giving it an extremely high power budget. However, I think it's also a single-FMA design (somebody correct me if I'm wrong about that). So, power consumption was atrocious and performance wasn't even all that great.
    Yes, but there is nothing better to compare against. Are there any ADL benchmarks with its AVX-512 haxxored enabled?



  • carewolf
    replied
    Originally posted by Sin2x View Post

    You've been a fan of an instruction set? What's wrong with you?

    Obligatory Linuses quote: https://www.realworldtech.com/forum/...rpostid=193190
    Most of that criticism is invalid for AMD's implementation, though. It no longer has the same performance downside or as big a transistor cost.

    Personally, I feel AVX-512 would be better if we just forced it to work only in 128-bit and 256-bit mode and only used the new, improved instructions. The last downside AMD can't get rid of is the 4x more register bits to save for the kernel (twice as many registers, twice as wide).
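
    To put rough numbers on that (my arithmetic, counting only the architectural vector registers):

    ```c
    /* Back-of-the-envelope: vector register state saved on a context switch. */
    #include <stdio.h>

    int main(void) {
        int avx2_bytes   = 16 * (256 / 8);          /* ymm0-ymm15             */
        int avx512_bytes = 32 * (512 / 8) + 8 * 8;  /* zmm0-zmm31 plus the
                                                       k0-k7 mask registers   */
        printf("AVX2 state:    %d bytes\n", avx2_bytes);    /* 512 bytes      */
        printf("AVX-512 state: %d bytes\n", avx512_bytes);  /* 2112, ~4x more */
        return 0;
    }
    ```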



  • ms178
    replied
    Originally posted by coder View Post

    Huh? They've stumbled, but it's not like they haven't delivered OpenCL, OpenMP, and now HIP. As well as C++ AMP and DirectCompute, on Windows.

    Sure, ROCm was in the wilderness for a long time, but they had legacy, proprietary drivers available for much of that time. They deserve some criticism for that, but it's not as if they weren't working on it the whole time.
    Delivering OpenCL, OpenMP and HIP is not the same performance-wise, as HSA tremendously cut the overhead of launching kernels, offered cache-coherency between CPU and GPU, and promised to ease implementation burdens for programmers: https://ieeexplore.ieee.org/document/7482093
    Last edited by ms178; 27 September 2022, 02:02 PM.

