Announcement

**ms178** · 27 September 2022, 02:00 PM

Originally posted by coder

HSA was a nice dream, but it never gained the necessary industry momentum. I think some of its advantages still live on in the form of ROCm, which I believe was architected to support it. Perhaps bridgman can say more about that.

BTW, OpenCL 2.0 has a feature called SVM (Shared Virtual Memory), which I believe is cache-coherent. Also, CXL supports cache-coherency at the interconnect protocol level.

I wouldn't characterize it as a dream, as it was a key selling point for their hardware for a long time, and I am still waiting to see the vision they promised come to fruition eventually. HSA itself did not gain any industry traction, but at least some key technologies that further that vision are now standardized across the industry (e.g. CXL), and I hope that the software side will also get better with future language standards incorporating some key elements. While SVM in OpenCL supports both coarse-grained and fine-grained shared virtual memory, the implementations that matter to the market only support the former, right? I haven't checked that in a long time, though, and I was under the impression that this restriction limits the practicality of the feature quite a bit for some workloads. I am also not aware of any commonly used software that makes use of that feature. But maybe you know some examples?

**ms178** · 27 September 2022, 02:08 PM

coder MadCatX

Here you go:

Efficiency secret AVX-512 on Alder Lake - The resurrected instruction set in a practical test

https://www.igorslab.de/en/efficiency-secret-tip-avx-512-on-alder-lake-the-returned-command-set-in-practice-test/

AVX-512 was a hotly discussed topic around the launch of the new Intel Alder Lake CPUs. At first it was said that the P cores supported it in principle, but in

**qarium** · 27 September 2022, 07:57 PM

Originally posted by bridgman

It's not half-baked, it's half-sized and perfectly baked

In fairness we do also have multiple FP execution ports/pipes (6 vs 3 for Golden Cove) so there can still be a lot of work happening in parallel.

For me it's really funny that Intel failed at AVX-512 so many times and AMD just did it right.

On other things like SGX, Intel failed too...

And the Intel Arc GPUs, in my view, failed too.

Really, man, Intel would be better off just licensing AMD's version of AVX-512 and licensing the RDNA3 design too...

Same for Apple... the Apple M1/M2 SoCs could be much better, in my view, if Apple just licensed RDNA3...

I read some benchmarks of the RDNA2 GPU in Samsung's ARM SoCs vs. Qualcomm's old Adreno (old ATI GPU tech).

And the RDNA2 license pays for itself: both have the same performance and the same power consumption, but the RDNA2 Samsung chip has far fewer transistors and a much higher clock speed, 1400 MHz on the Samsung SoC versus only about 750 MHz on the Qualcomm SoC... this means the RDNA2 license pays for itself on transistor count alone.

Qualcomm could produce SoCs with the same performance using far fewer transistors. Also, RDNA has more features, like ray-tracing acceleration hardware...

This means we have smart companies like Samsung who just get an RDNA2 license, and then we have stupid companies like Intel who fail with their own designs...

Also, if Apple licensed the RDNA3 design, their Linux support would instantly be much better, because the open-source driver is already done.

**DanglingPointer** · 27 September 2022, 09:38 PM

Would be good to get a test of explicit AVX-512 vs. AVX2 with x265 or HandBrake!

**Dukenukemx** · 28 September 2022, 04:18 PM

Originally posted by Sin2x

You've been a fan of an instruction set? What's wrong with you?

Obligatory Linus quote: https://www.realworldtech.com/forum/...rpostid=193190

Not everything Linus Torvalds says is gospel. There are lots of real-world use cases where AVX-512 has huge benefits. Emulators like RPCS3 and Yuzu both benefit greatly from the use of AVX-512. Linus sees what benefits him, and that's compiling kernel code. He can't see the forest for the trees.

RPCS3 Dev Details Huge CPU Performance Gains With AVX-512 For Beloved PS3 Emulator

https://hothardware.com/news/rpcs3-dev-details-gains-with-avx-512

RPCS3 developer WhatCookie posted a blog explaining what AVX-512 does for the PS3 emulator.

PlayStation 3 Emulator RPCS3 Adds AVX-512 Support On Zen 4 For A Huge Gaming Boost

https://hothardware.com/news/rpcs3-adds-avx512-on-zen-4

The PlayStation 3 emulator can make use of the extra-wide SIMD in AMD's new Zen 4 CPUs to see a significant speedup.

AMD Zen 4's AVX-512 Instructions To Show Major Benefits In Emulators Such As Yuzu, Citra, Xenia & Vita3K

https://wccftech.com/amd-zen-4-avx-512-major-benefits-in-emulators-such-as-yuzu-citra-xenia-vita3k/

A Graphics Engineer from Riot Games has said that AMD's Zen 4 CPUs with AVX-512 can bring major benefits to emulators such as Yuzu

**coder** · 29 September 2022, 05:23 AM

Originally posted by ms178

coder MadCatX

Here you go:

https://www.igorslab.de/en/efficienc...practice-test/

Thanks for sharing. There's not a whole lot to go on, but these observations could be largely explained by Intel simply spending little/no time on frequency-curve optimizations when AVX-512 is enabled. Because if enabling AVX-512 actually decreases average power consumption, then it must be getting clock-throttled more aggressively than in the non-AVX-512 case.
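The disagreement here hinges on average power versus energy per job. As a rough illustration (the wattages and runtimes below are made-up numbers, not figures from the linked test), energy is average power times runtime, so a lower average power only means better efficiency if the runtime doesn't grow enough to cancel it out:

```latex
E = \bar{P}\,t, \qquad
\underbrace{100\,\mathrm{W} \times 10\,\mathrm{s}}_{\text{baseline}} = 1000\,\mathrm{J}, \qquad
\underbrace{90\,\mathrm{W} \times 12\,\mathrm{s}}_{\text{throttled}} = 1080\,\mathrm{J}
```

So a throttled run can draw less power yet still spend more energy on the same work, which is why an average-power reading alone can't settle whether AVX-512 itself became more efficient.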

**coder** · 29 September 2022, 05:28 AM

Originally posted by Dukenukemx

Linus sees what benefits him and that's compiling kernel code.

Or traditional server apps, like web servers, databases, etc. Things which lean heavily on the kernel probably have disproportionate mind-share with him. Those are probably going to be multithreaded programs that use lots of memory and do extreme amounts of storage & network I/O.

**Sin2x** · 29 September 2022, 08:25 AM

Originally posted by Dukenukemx

Not everything Linus Torvalds says is gospel. There are lots of real-world use cases where AVX-512 has huge benefits. Emulators like RPCS3 and Yuzu both benefit greatly from the use of AVX-512. Linus sees what benefits him, and that's compiling kernel code. He can't see the forest for the trees.
https://hothardware.com/news/rpcs3-d...s-with-avx-512

PlayStation 3 Emulator RPCS3 Adds AVX-512 Support On Zen 4 For A Huge Gaming Boost

https://hothardware.com/news/rpcs3-adds-avx512-on-zen-4

The PlayStation 3 emulator can make use of the extra-wide SIMD in AMD's new Zen 4 CPUs to see a significant speedup.

https://wccftech.com/amd-zen-4-avx-5...-xenia-vita3k/

No, it's you who can't even understand what he wrote -- that AVX-512 is used in a disproportionately low percentage of tasks and the die area it uses could be put to better use for general computing. Which -- surprise -- Intel did by stripping this functionality from desktop processors and leaving it only on Xeons.

Don't ever presume you could be smarter than Linus, you only make yourself look like a clown.

**coder** · 29 September 2022, 12:14 PM

Originally posted by Sin2x

the die area it uses could be put to better use for general computing. Which -- surprise -- Intel did by stripping this functionality from desktop processors and leaving it only on Xeons.

Um... I think Linus is probably at least as interested in server CPUs, here.

Not only that, but Intel actually did put AVX-512 into Alder Lake desktop/mobile chips, they just disabled it because the E-cores didn't have it and they didn't want to deal with the headaches of asymmetric instruction support. That means it's still using the extra die area.

Originally posted by Sin2x

Don't ever presume you could be smarter than Linus, you only make yourself look like a clown.

It's not a matter of intelligence. He's neither omniscient nor unbiased. Furthermore, he's not a chip designer, and he doesn't actually know as much about Intel's customers as Intel does. All of this makes me take his opinions on CPU architecture with a grain of salt.

That said, I've long been critical of AVX-512, or at least the aspect of it that involves widening vectors to 512 bits. Other things, like predication and scatter/gather, are indeed nice and maybe not hugely expensive in die area.
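For readers unfamiliar with the terms, predication (masking) and gather can be modeled lane by lane. The sketch below is a plain-Python model, not real SIMD code: the function names `masked_add` and `gather` are hypothetical, and the 16-lane width assumes 32-bit floats in a 512-bit register. On actual AVX-512 hardware the corresponding C intrinsics would be along the lines of `_mm512_mask_add_ps` and `_mm512_i32gather_ps`.

```python
LANES = 16  # a 512-bit vector holds 16 x 32-bit floats


def masked_add(src, mask, a, b):
    """Merge-masking model: lanes whose mask bit is 0 keep their value from src."""
    return [a[i] + b[i] if (mask >> i) & 1 else src[i] for i in range(LANES)]


def gather(base, idx):
    """Gather model: lane i loads base[idx[i]], so elements need not be contiguous."""
    return [base[idx[i]] for i in range(LANES)]


a = [float(i) for i in range(LANES)]
b = [100.0] * LANES
src = [-1.0] * LANES

# 0x5555 sets every even bit, so only even-numbered lanes are active.
out = masked_add(src, 0x5555, a, b)
print(out[:4])  # [100.0, -1.0, 102.0, -1.0]

table = [float(i * 10) for i in range(64)]
idx = [(i * 4) % 64 for i in range(LANES)]  # stride-4 access pattern
print(gather(table, idx)[:4])  # [0.0, 40.0, 80.0, 120.0]
```

The mask lets one instruction handle loop tails or conditional updates without branching, which is the "nice" part coder refers to.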

I'm a little bit critical of scatter/gather, just because I think it lulls programmers into thinking they don't need to worry about data layout. However, even having the CPU fetch & interleave your data doesn't mean you don't have to worry about things like cache thrashing.
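A rough way to see why gather doesn't make data layout irrelevant is to count the cache lines touched when loading one field for 16 records. This sketch assumes 64-byte cache lines, 4-byte floats, arrays aligned to a cache-line boundary, and a hypothetical 8-float record; it only counts lines and says nothing about the actual gather throughput of any CPU.

```python
CACHE_LINE = 64  # bytes; assumed line size
FLOAT_SIZE = 4   # bytes per 32-bit float
LANES = 16       # 32-bit lanes in one 512-bit vector


def lines_touched(byte_offsets):
    """Number of distinct cache lines a set of 4-byte loads falls into."""
    return len({off // CACHE_LINE for off in byte_offsets})


def aos_offsets(field, fields_per_record, n=LANES):
    """Array of structs: record i's field sits at float index i * fields_per_record + field."""
    return [(i * fields_per_record + field) * FLOAT_SIZE for i in range(n)]


def soa_offsets(field, n_records, n=LANES):
    """Struct of arrays: each field is its own contiguous array of n_records floats."""
    return [(field * n_records + i) * FLOAT_SIZE for i in range(n)]


# Hypothetical record of 8 float fields (32 bytes), loading field 0 for 16 records.
aos = aos_offsets(field=0, fields_per_record=8)
soa = soa_offsets(field=0, n_records=1024)
print(lines_touched(aos), lines_touched(soa))  # 8 1
```

With the array-of-structs layout, the 16 values are spread over 8 cache lines, so a single gather still drags in 8 lines (and the unused fields in them); the struct-of-arrays layout puts them in one line that an ordinary contiguous load can fetch.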

**ms178** · 29 September 2022, 01:14 PM

Originally posted by coder

There's not a whole lot to go on, but these observations could be largely explained by Intel simply spending little/no time on frequency-curve optimizations when AVX-512 is enabled. Because if enabling AVX-512 actually decreases average power consumption, then it must be getting clock-throttled more aggressively than in the non-AVX-512 case.

Nah, as there was no throttling involved, it rather means that Intel finally managed to make AVX-512 more power efficient. Yeah, we all have to throw away the old wisdom about AVX-512, as the old equation "AVX-512 usage = higher power draw" is no longer true. Buildzoid backed that up, too, with his own data in one of his videos.


AMD Zen 4 AVX-512 Performance Analysis On The Ryzen 9 7950X
