Initial AMD Zen 3 Support Successfully Lands In GCC 11


  • Initial AMD Zen 3 Support Successfully Lands In GCC 11

    Phoronix: Initial AMD Zen 3 Support Successfully Lands In GCC 11

    A few days ago AMD finally sent out the initial AMD Zen 3 "znver3" support to the GCC compiler with the LLVM Clang support to follow. That initial "-march=znver3" targeting support has now been merged for GCC 11...


  • #2
    Compared to znver2 it adds support for PTA_VAES | PTA_VPCLMULQDQ | PTA_PKU; almost everything else reuses the znver2 code paths and scheduling.

    The exception is idiv, i.e. integer division, which does have new scheduling compared to znver2 and znver1. Looking at the new code, the new divider appears to be either better pipelined or simply faster, because it now reserves some ports for fewer cycles: a bit over two times fewer. Nice. It will affect scheduling of the code a bit, and probably improve register allocation a little as well.
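    A quick way to see these feature additions from the command line is to dump the preprocessor macros GCC defines per -march target; znver3 should additionally define __VAES__, __VPCLMULQDQ__ and __PKU__. A sketch, assuming a GCC new enough to know znver3 (GCC 11+); swap in znver2 on older compilers:
    Code:
    ```shell
    # Pick out the macros for the extensions znver3 is said to add
    # over znver2 (VAES, VPCLMULQDQ, PKU):
    gcc -march=znver3 -dM -E - </dev/null | grep -E '__(VAES|VPCLMULQDQ|PKU)__'

    # Diffing the full macro sets of the two targets shows everything
    # that changed between them (diff exits non-zero when they differ):
    diff <(gcc -march=znver2 -dM -E - </dev/null | sort) \
         <(gcc -march=znver3 -dM -E - </dev/null | sort) || true
    ```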



    • #3
      Failing to have compiler optimisations in place before the product launch seems like a massive oversight, especially since Zen 3 has some rather profound changes that I'm sure optimisation could take big advantage of.



      • #4
        Originally posted by scottishduck View Post
        Failing to have compiler optimisations in place before the product launch seems like a massive oversight, especially since Zen 3 has some rather profound changes that I'm sure optimisation could take big advantage of.
        This is nonsense. CPU-microarchitecture-specific optimisations are extremely rarely used in practice. Not only that: building for znver2 already provides 98% of the performance gains possible on znver3. The "profound" optimizations are in hardware, and generated code doesn't really need to change to benefit from them.

        The gcc optimisations for znver3 will provide very little difference, and only in some very specific workloads.


        Also, AMD has its own compiler suite, AMD AOCC ( https://developer.amd.com/amd-aocc/ ), which has been available for months with good optimisations.
        Last edited by baryluk; 05 December 2020, 11:15 AM.



        • #5
          Originally posted by baryluk View Post
          This is nonsense. CPU-microarchitecture-specific optimisations are extremely rarely used in practice. Not only that: building for znver2 already provides 98% of the performance gains possible on znver3. The "profound" optimizations are in hardware, and generated code doesn't really need to change to benefit from them.

          The gcc optimisations for znver3 will provide very little difference, and only in some very specific workloads.


          Also, AMD has its own compiler suite, AMD AOCC ( https://developer.amd.com/amd-aocc/ ), which has been available for months with good optimisations.
          The major gains appear to come from the changes to the L3 cache, while L1/L2 have not been changed afaik. GCC, however, doesn't yet model the L3 cache, only L1 and L2, for which it also provides command-line parameters in case the correct values haven't been detected.

          How to check on these values:
          Code:
          $ gcc -march=znver2 --help=params -Q|fgrep cache
          --param=l1-cache-line-size= 32
          --param=l1-cache-size= 64
          --param=l2-cache-size= 512
          $ gcc -march=znver2 --help=target -Q|fgrep march
          -march= znver2
          That said, I've noticed an unexpected behaviour while I was checking on these values again today on my Ryzen 7 3800X. The L1/L2 cache parameters changed when I used -march=native versus -march=znver2.
          Code:
          $ gcc -march=native --help=params -Q|fgrep cache
          --param=l1-cache-line-size= 64
          --param=l1-cache-size= 32
          --param=l2-cache-size= 512
          $ gcc -march=native --help=target -Q|fgrep march
          -march= znver2
          The latter values as shown here are in fact the correct ones for the Ryzen 7 3800X.

          So now I'm wondering what I'm missing here. Apparently it's not enough to just specify -march=znver2; it has to be -march=native before GCC uses the correct values for the L1/L2 cache on an actual Zen 2 CPU.

          Edit:
          I've now also tested it with -march=znver2 -mtune=znver2 and it keeps using the wrong values; only when -mtune=native is used, and GCC performs a run-time detection, does it use the correct values!
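          For anyone hitting the same behaviour, GCC also accepts the cache characteristics as explicit --param switches, so as a workaround a build can pin the correct values rather than rely on the -march=znver2 defaults. A sketch (the file name is made up, and the values assume the 3800X's real geometry: 64-byte lines, 32 KB L1d, 512 KB L2):
          Code:
          ```shell
          # Compare what -march=znver2 assumes against what run-time
          # detection (-march=native) finds on this machine:
          diff <(gcc -march=znver2 --help=params -Q | grep cache) \
               <(gcc -march=native --help=params -Q | grep cache) || true

          # Workaround: pin the correct values explicitly.
          echo 'int main(void){return 0;}' > /tmp/cache_params_demo.c
          gcc -O2 -march=znver2 \
              --param l1-cache-line-size=64 \
              --param l1-cache-size=32 \
              --param l2-cache-size=512 \
              /tmp/cache_params_demo.c -o /tmp/cache_params_demo
          ```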
          Last edited by sdack; 05 December 2020, 12:54 PM.



          • #6
            Originally posted by sdack View Post
            The major gains appear to be coming from the changes to the L3 cache, while L1/L2 have not been changed afaik. GCC then doesn't yet recognise L3 caches but only L1 and L2, for which it also provides command line switches in case the correct values haven't been recognised.

            How to check on these values:
            Code:
            $ gcc -march=znver2 --help=params -Q|fgrep cache
            --param=l1-cache-line-size= 32
            --param=l1-cache-size= 64
            --param=l2-cache-size= 512
            $ gcc -march=znver2 --help=target -Q|fgrep march
            -march= znver2
            That said, I've noticed an unexpected behaviour while I was checking on these values again today on my Ryzen 7 3800X. The L1/L2 cache parameters changed when I used -march=native versus -march=znver2.
            Code:
            $ gcc -march=native --help=params -Q|fgrep cache
            --param=l1-cache-line-size= 64
            --param=l1-cache-size= 32
            --param=l2-cache-size= 512
            $ gcc -march=native --help=target -Q|fgrep march
            -march= znver2
            The latter values as shown here are in fact the correct ones for the Ryzen 7 3800X. So now I'm wondering what I'm missing here. Apparently it's not enough to just specify -march=znver2; it has to be -march=native before GCC uses the correct values for the L1/L2 cache on an actual Zen 2 CPU.
            Compilers don't take the L3 cache into account, because it is both hard and extremely workload-dependent, including on what the other cores are doing. The compiler simply doesn't know. Such an option will not be in GCC or Clang in the next 10 years. Easy as that.



            • #7
              Originally posted by baryluk View Post
              Compilers don't take the L3 cache into account, because it is both hard and extremely workload-dependent, including on what the other cores are doing. The compiler simply doesn't know. Such an option will not be in GCC or Clang in the next 10 years. Easy as that.
              It's not about that. It's that knowledge of the cache characteristics gives an optimisation the opportunity to produce a slightly better result than having no knowledge at all. The L1 and L2 caches are also affected by multi-threaded workloads, not only the L3, as you will know, and that hasn't stopped developers from using this knowledge in their optimisations.

              I've mentioned the L3 cache because it is the one cache that has changed with Zen 3, yet isn't taken into account by GCC. This is relevant here because the article is about Zen 3 support in GCC.

              I wasn't trying to debate whether it makes sense to include L3 characteristics or not. I'm sure some dev will find it interesting, and it may have a few edge cases where it can already make a small difference. I'm not working on GCC and so won't get involved in those decisions. I also don't see the point in guessing whether we'll see it happen in GCC within the next 10 years or not.
              Last edited by sdack; 05 December 2020, 01:34 PM.



              • #8
                Originally posted by sdack View Post
                So now I'm wondering as to what I'm missing here. It's apparently not enough to just specify -march=znver2 but it has to be -march=native before GCC uses the correct values for the L1/L2 cache on an actual Zen 2 CPU.

                Edit:
                I've now also tested it with -march=znver2 -mtune=znver2 and it keeps using the wrong values and only when -mtune=native is used and GCC performs a run-time detection does it use the correct values!
                I would guess it's because you cannot assume that all Zen 2 cores have the same cache configuration. While AMD seems to use a fixed L1/L2 cache amount for Zen 2 regardless of the product segment, the same has not always been true. For example, back in the Athlon 64 days, a K8 CPU could have anything from 128KB to 1MB of L2 cache per core.



                • #9
                  Originally posted by AmericanLocomotive View Post
                  I would guess it's because you cannot assume that all Zen 2 cores will have the same cache configuration. While AMD seems to be using a fixed L1/L2 cache amount for Zen 2 regardless of application, the same has not always been true. For example, back in the Athlon 64 days, a K8 core could have anything from 128KB to 1MB of L2 cache per core.
                  From the looks of it, it gets the false values from the generic or x86-64 defaults, i.e. -mtune=generic or -mtune=x86-64.

                  However, at least for the cacheline size, 64 bytes is the standard these days, if not more. A cacheline size of 32 bytes goes back to the AMD K5 and K6, iirc; the K8 and later already had 64 bytes. It basically changed from 32 to 64 bytes when x86 went from 32-bit to 64-bit. It would therefore surprise me if there were a Zen CPU out there with a cacheline size of only 32 bytes. And even if such a CPU existed, it would be a rare exception, not the norm, and you would want to stick with the far more common 64 bytes.

                  Just as a reminder, the cacheline size determines how many bytes are transferred between the caches at once. Going back to 32 bytes would mean there was a Zen CPU where AMD had cut the path width between the caches in half. I don't see what major technical gain this would give. It's not like an external bus requiring more pins and more complicated board layouts. Cutting the width in half would also affect how the instruction decoder works ...

                  Frankly, I don't believe there is a single Zen CPU out there with a 32-byte cacheline size. And there probably hasn't been an AMD CPU since the Athlon 64 with less than 64 bytes. This looks like a simple bug, a twist of numbers: a 32 got swapped for a 64 and nobody noticed, because the effect is likely only a mild regression.

                  Anyhow, I've asked for an explanation on the gcc-help mailing list and will open a PR if necessary. If it's a bug then it can get fixed.
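                  One way to double-check the hardware side of this, independently of GCC's tables, is to ask the libc and the kernel for the real cacheline size; on any Zen part both should report 64 bytes. The paths below are the usual Linux ones (sysfs may be absent in some environments):
                  Code:
                  ```shell
                  # Ask glibc for the L1 data-cache line size in bytes:
                  getconf LEVEL1_DCACHE_LINESIZE

                  # The same value as the kernel reports it via sysfs:
                  cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size \
                      2>/dev/null || echo "sysfs cache info not available"
                  ```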
                  Last edited by sdack; 06 December 2020, 02:30 PM.



                  • #10
                    Hmmm, good point here. I'll try prefetching into L3 manually in some of my loops and see if there's any difference. Didn't think of that one, thanks.
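                    For reference, GCC's hook for this is __builtin_prefetch(addr, rw, locality), where a lower locality hint asks for the data to be kept only in outer cache levels; locality 1 roughly means "low temporal locality, L2/L3-ish" (prefetcht2 on x86). A minimal sketch, compiled and run from the shell to match the snippets above; the file name, array size, and prefetch distance are made up for illustration:
                    Code:
                    ```shell
                    # Minimal prefetch demo: sum a large array, prefetching a
                    # few cachelines ahead with low temporal locality.
                    cat > /tmp/prefetch_demo.c <<'EOF'
                    #include <stdio.h>
                    #include <stdlib.h>

                    int main(void) {
                        enum { N = 1 << 20 };
                        int *a = malloc(N * sizeof *a);
                        if (!a) return 1;
                        for (int i = 0; i < N; i++) a[i] = 1;

                        long sum = 0;
                        for (int i = 0; i < N; i++) {
                            /* rw=0 (read), locality=1: hint that the data need
                               not be kept close to the core. */
                            if (i + 64 < N) __builtin_prefetch(&a[i + 64], 0, 1);
                            sum += a[i];
                        }
                        printf("%ld\n", sum);  /* prints 1048576 */
                        free(a);
                        return 0;
                    }
                    EOF
                    gcc -O2 /tmp/prefetch_demo.c -o /tmp/prefetch_demo
                    /tmp/prefetch_demo
                    ```
                    Whether it actually helps depends entirely on whether the access pattern outruns the hardware prefetchers, so measure before keeping it.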
