NVIDIA vs. AMD Linux Gaming Performance For End Of May 2021 Drivers

  • #11
    The reason why the AMD 6800 performed rather oddly probably has to do with bad luck!

    No, seriously! Let me explain:

    Michael used a Ryzen 5900X, which has a rather high inter-CCX communication latency.

    On the other hand, Canonical unfortunately still ships Ubuntu with "irqbalance" enabled out of the box;
    Debian has already dropped it because it is no longer needed: the Linux kernel by itself distributes interrupts evenly across all cores, while keeping every single interrupt source tightly coupled to a specific CPU core.
    "irqbalance", however, disturbs this nice & even distribution by forcing interrupts to be handled by yet another random core!

    To get a better understanding of what I mean, check it out for yourself by observing the output of this command:
    Code:
    cat /proc/interrupts
    That's why I think a simple reboot on a Ryzen machine that isn't made out of a single CCX, combined with the randomness introduced by "irqbalance", can drastically alter these results.

    Anyway, would be kinda cool if anyone could actually test this theory out, since I'm not silly enough to buy such a duct-taped together Ryzen CPU in the first place!
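As a quick sanity check, the per-CPU totals from that output can be summed to see how even the distribution really is. A minimal sketch in POSIX sh + awk (the helper name is my own):

```shell
# Sum interrupt counts per CPU from /proc/interrupts-style input on stdin.
# Column layout assumed: first line is the "CPU0 CPU1 ..." header,
# each following line is "IRQ: count count ... description".
sum_irqs_per_cpu() {
    awk '
        NR == 1 { ncpu = NF; next }            # header row: CPU0 CPU1 ...
        {
            for (i = 2; i <= ncpu + 1; i++)    # column 1 is the IRQ label
                if ($i ~ /^[0-9]+$/) total[i-1] += $i
        }
        END { for (c = 1; c <= ncpu; c++) printf "CPU%d %d\n", c-1, total[c] }
    '
}

# Live use (and a check whether the daemon is running at all):
#   sum_irqs_per_cpu < /proc/interrupts
#   systemctl is-active irqbalance
```

Running it before and after a reboot should make any irqbalance-induced reshuffling visible.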

    (BTW, nothing against AMD, mind you!
    In fact, I bought my brother a Ryzen 3300X last year for this very reason:
    A single CCX where all the cores are created & treated equally as far as CPU cores are concerned!)



    • #12
      I find it interesting that the 6800 XT performs so much better than the 6800, given there isn't a huge difference between them.



      • #13
        Originally posted by Linuxxx
        Anyway, would be kinda cool if anyone could actually test this theory out, since I'm not silly enough to buy such a duct-taped together Ryzen CPU in the first place!

        (BTW, nothing against AMD, mind you!
        In fact, I bought my brother a Ryzen 3300X last year for this very reason:
        A single CCX where all the cores are created & treated equally as far as CPU cores are concerned!)
        Does CCX latency really matter? Either you aren’t using that many cores, in which case you never run into the latency issue, or you are using all 12, in which case the performance of the extra cores is going to outweigh the latency involved in communicating with them.

        In an ideal world, sure, you wouldn’t want the extra latency, but it’s a small price to pay if you actually need the cores.



        • #14
          RT support for radv is coming along. Apparently this partial support is enough for many demos: https://gitlab.freedesktop.org/mesa/...requests/11078



          • #15
            Originally posted by kokoko3k View Post
            could you be a little more specific?
            what hardware? what mesa version? what benchmark?
            any references?

            thanks!
            Hardware: Ryzen 3600XT + 5700XT
            Pop!_OS 20.10 - xanmod-cacule kernel 5.12.8, built with -march=native and -O3
            Mesa and libdrm: latest Git pulls
            Built with the Clear Linux spec flags** plus -march=native -flto*, or -march=znver2 -mtune=znver2 -flto; same for libdrm
            https://github.com/clearlinux-pkgs/m...ster/mesa.spec
            ...only the compiler flags, not the config flags.
            Compiler: GCC 11.1.1 + binutils 2.36.1, pulled via ppa:netext/netext73

            Benchmarks: the internal benchmarks of AC Odyssey (Lutris), Far Cry New Dawn (Lutris) and Deus Ex: Mankind Divided (Steam, DX11).

            But be aware that this can break very easily, and sometimes regressions occur. E.g. in AC:O I currently get 70 fps @ 1920x1200 (custom settings close to ultra) after building the latest Mesa + libdrm; with stock Pop!_OS Mesa it was 65-68 fps before. But the machine no longer reboots into Pop Shell - I installed some new libs as well, so I'm still bisecting the issue. Still, this approach has worked for almost a year: the occasional regression, but no "gamestopper" like now.
            I did a clean reinstall after upgrading to the Pop!_OS 21.04 beta, which had been a disappointment performance-wise, and I usually have the oibaf PPA installed too, to get the latest dependency libs. But since I upgraded on top of a modified system, that could have caused the worse performance on 21.04... a lot of possible causes; I need time to figure it out.
            However, this depends heavily on your hardware, so tinkering is the way to go.

            * -flto or -ffat-lto-objects, whichever works better
            ** BTW, "-O3 -falign-functions=32 -fno-math-errno -fno-semantic-interposition -fno-trapping-math" is quite a good flag string; it is used extensively by almost all Clear Linux packages and works well in a lot of situations. I start with it and then check whether -march=native or -march=znver2 -mtune=znver2 breaks anything... if it does, I simply fall back to -mtune=znver2 alone, cross-checking whether that hurts performance.
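Wired into a build environment, that footnoted flag combination might look like this (a sketch; the meson options in the comment are assumptions, adapt to your tree):

```shell
# Clear Linux-style base flags plus Zen 2 tuning; swap -march/-mtune
# for your CPU, and -flto vs. -ffat-lto-objects per the footnote above.
export CFLAGS="-O3 -falign-functions=32 -fno-math-errno -fno-semantic-interposition -fno-trapping-math -march=znver2 -mtune=znver2 -flto"
export CXXFLAGS="$CFLAGS"

# Then, inside the mesa checkout (options assumed):
#   meson setup build/ -Dbuildtype=release && ninja -C build/
```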

            P.S.: it is helpful to clear the shader cache - I have heard it is not necessary, but I fear it sometimes affects the outcome if not cleared.
            Last edited by CochainComplex; 31 May 2021, 03:35 AM.



            • #16
              Originally posted by CochainComplex View Post
              P.S.: it is helpful to clear the shader cache - I have heard it is not necessary, but I fear it sometimes affects the outcome if not cleared.
              Mesa's shader cache should be quite trustworthy; I don't think you'd ever need to clear it manually (unless you desperately want to free a few hundred megabytes of disk space).
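For anyone who does want to clear it: Mesa's default on-disk cache lives under ~/.cache/mesa_shader_cache and can be relocated via an environment variable (MESA_GLSL_CACHE_DIR on Mesa of this vintage; the helper name is my own):

```shell
# Resolve the shader-cache directory, honoring the override variable:
mesa_cache_dir() {
    printf '%s\n' "${MESA_GLSL_CACHE_DIR:-$HOME/.cache/mesa_shader_cache}"
}

# Inspect its size, or wipe it before a benchmark run:
#   du -sh "$(mesa_cache_dir)"
#   rm -rf "$(mesa_cache_dir)"
```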



              • #17
                Originally posted by cynical View Post

                Does CCX latency really matter? Either you aren’t using that many cores, in which case you never run into the latency issue, or you are using all 12, in which case the performance of the extra cores is going to outweigh the latency involved in communicating with them.

                In an ideal world, sure, you wouldn’t want the extra latency, but it’s a small price to pay if you actually need the cores.
                Nerf this:

                One of the first questions one may ask after seeing the graph is how a 3800X is performing better than a 3950X even though it has twice the cores and cache? The answer to that is due to increased latency from the 3950X’s multi-chiplet design. While the 3800X only has to communicate across two 4-core CCXes, the 3950X takes it a step further, and has two chiplets each with two 4-core CCXes it has to communicate across.

                Unlike other software, RPCS3’s PPU & SPU threads need to communicate constantly which results in a major bottleneck if these threads are split across multiple CCXes / chiplets. That ends up with the CPU hitting this bottleneck constantly with all the data moving around. This is why we do not recommend Ryzen CPUs unless they have a 3 or 4 core CCX design (6-8 core Ryzen CPUs, or a 4 core Ryzen APU). A 4 core CCX design is ideal as RPCS3 can fit all the PPU & SPU threads onto a single CCX, allowing users to bypass inter-CCX latency bottleneck entirely, provided the PPU & SPU threads are being scheduled properly to be placed on a single CCX.
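That last scheduling caveat can be forced by hand on Linux with taskset, assuming the cores of one CCX are numbered contiguously (verify with lscpu -e); the helper name is mine:

```shell
# Map a CCX index to a taskset-friendly core range (SMT siblings ignored):
#   usage: ccx_cores CCX_INDEX CORES_PER_CCX
ccx_cores() {
    first=$(( $1 * $2 ))
    last=$(( first + $2 - 1 ))
    printf '%d-%d\n' "$first" "$last"
}

# e.g. pin an emulator to the second 4-core CCX of a Zen 2 part:
#   taskset -c "$(ccx_cores 1 4)" ./rpcs3
```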



                • #18
                  Originally posted by Linuxxx View Post

                  Nerf this:

                  One of the first questions one may ask after seeing the graph is how a 3800X is performing better than a 3950X even though it has twice the cores and cache? The answer to that is due to increased latency from the 3950X’s multi-chiplet design. While the 3800X only has to communicate across two 4-core CCXes, the 3950X takes it a step further, and has two chiplets each with two 4-core CCXes it has to communicate across.
                  I thought we were talking about Zen 3? Sure, the above is true, because you are communicating across four different CCXs, and frequently so if you are taking advantage of the thread count. But Zen 3 only has two CCXs. From AnandTech:

                  (talking about the 3950x) Nevertheless, in the result we can clearly see the low-latencies of the four CCXs, with inter-core latencies between CPUs of differing CCXs suffering to a greater degree in the 82ns range, which remains one of the key disadvantages of AMD’s core complex and chiplet architecture.

                  On the new Zen3-based Ryzen 9 5950X, what immediately is obvious is that instead of four low-latency CPU clusters, there are now only two of them. This corresponds to AMD’s switch from four CCX’s for their 16-core predecessor, to only two such units on the new part, with the new CCX basically being the whole CCD this time around.
                  So on a 5900X, you are dealing with a configuration of eight cores on one CCX and four additional cores on the second CCX. That means you can have 16 threads on a single CCX, and from the site you are talking about...

                  The first thing that you should consider is that RPCS3 can heavily utilize up to 16 CPU threads, and once you go past that it’s very likely that you won’t see improvements. What this means is that once you have a CPU with 16 threads, you should invest in faster single-core performance instead. Keep in mind that you definitely won’t need 16 threads for all titles; RDR and a few others, for instance, won’t care if you go from 8C/8T to 8C/16T.
                  So you won't benefit from more threads anyway. You wouldn't even encounter a latency issue unless the Linux kernel decided to split the workload between CCXs for some reason. And all of this concerns a very specific use case: this one emulator. Even your quote says "unlike other software", because most software does not share this requirement of constant communication between threads.

                  I think it's nuts to take this small use case on an older generation of Zen and conclude that Zen 3 sucks and isn't worth buying lol. If you look at any benchmark comparing the 5900X to the 5800X, you will see that while in gaming the 5800X is on par or better thanks to having everything on a single CCX, in any scenario where multiple cores are valuable (compilation, rendering, etc.), the 5900X is significantly better at the task.



                  • #19
                    Originally posted by cynical View Post
                    So on a 5900X, you are dealing with a configuration of eight cores on one CCX and four additional cores on the second CCX. That means you can have 16 threads on a single CCX, and from the site you are talking about...
                    I'm fairly certain the 5900X has its cores split 6 and 6 between the CCXs, so that both are equal to each other.

                    Agree with the rest of what you said, though. Most of the latency issues Zen 2 had are solved, or at least mitigated, with Zen 3. That's one of the ways they improved gaming performance by so much this generation. And if RPCS3 was OK with 8-core Zen 2 CPUs, it will certainly be fine with Zen 3 up to 16 cores, thanks to the doubled size of each CCX.
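The split is easy to verify from userspace, since each CCX has its own L3 slice: counting the distinct L3 sharing groups gives the CCX count. A sketch (helper name mine):

```shell
# Count distinct CCXs from shared-L3 CPU lists on stdin: one line per
# cpuN, e.g. "0-5" / "6-11", as exposed in
# /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list
count_ccx() {
    sort -u | wc -l
}

# Live use:
#   cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | count_ccx
#   lscpu -e=CPU,CORE,L3    # also shows which L3 (= CCX) each core sits on
```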
                    Last edited by smitty3268; 31 May 2021, 06:28 PM.



                    • #20
                      Originally posted by cynical View Post
                      If you take a look at any benchmark comparing the 5900X to the 5800X you would see that while in gaming the 5800X is on par or better due to having everything on a single CCX, in any scenario where multiple cores are valuable (compilation, rendering, etc), the 5900X is significantly better at the task.
                      And what was the on-topic confusion all about? (Hint: 6800)

                      I only brought up the RPCS3 example because you claimed that inter-CCX latency can't be that bad, while you now stand corrected, acknowledging yourself that the 5900X can have a worsening effect on gaming results.

                      And in my original post I named what I believed to be the culprit for those unreliable benchmarks; namely, "irqbalance".

                      Anyhow, do you have any better idea for the odd results from the AMD Radeon 6800?

