AMD 3D V-Cache Performance Optimizer Driver Posted For Linux
Originally posted by Anux: Interesting, it looks like "GOMP_CPU_AFFINITY" is from OpenMP: https://spec.org/cpu2017/flags/aocc4...022-12-08.html
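For reference, GOMP_CPU_AFFINITY is an environment variable read by GCC's OpenMP runtime (libgomp) that pins OpenMP threads to specific CPUs, which is one way to keep a workload on a single CCD. A minimal sketch of building such a value; the contiguous per-CCD core numbering and the `./my_omp_app` binary are assumptions, not something from the thread:

```python
import os

def gomp_affinity_for_ccd(ccd_index, cores_per_ccd=8):
    """Build a GOMP_CPU_AFFINITY value confining libgomp threads to one CCD.

    Assumes CPUs are numbered contiguously per CCD (CCD0 = 0-7, CCD1 = 8-15, ...),
    which is common but should be verified with lscpu on real hardware.
    """
    first = ccd_index * cores_per_ccd
    last = first + cores_per_ccd - 1
    return f"{first}-{last}"

# Launch an OpenMP program confined to CCD 1 (binary name is hypothetical):
env = dict(os.environ, GOMP_CPU_AFFINITY=gomp_affinity_for_ccd(1))
# subprocess.run(["./my_omp_app"], env=env)
```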
-
Originally posted by habilain: Thinking about it, I believe that AMD's AOCC has some options for this as well (it can certainly build using CPU affinities), which might be useful in deciding when not to do a full "check all other CCDs for hits".
-
Originally posted by Anux: OK, we misunderstood each other: you are pointing at a single CCD that uses another CCD's cache, while I explained a scenario that uses all cores and their local caches.
But even in your case, the second CCD's cache is still faster than accessing RAM, though with much less benefit than a CCD-local cache would give. I don't think Zen's caches work this way, though: whenever a core needs data from another CCD's cache, the line gets copied over to the core's own cache and other data has to be evicted, so you can't really use all the cache for one core.
For gaming, sure; if you have perfectly multi-threadable code (no write locks between threads), then not.
Still, I don't get what a compiler can do there; can you point to a specific option one would use? As far as I know, your problem is solved either with a VM for each CCD or with core pinning for a single workload.
I did not say "compilers"; I said code would be compiled and deployed for a specific deployment. That includes compile-time options to limit the number of cores used, which I think is how some frameworks do it: OpenBLAS and OpenMP come to mind. Thinking about it, I believe that AMD's AOCC has some options for this as well (it can certainly build using CPU affinities), which might be useful in deciding when not to do a full "check all other CCDs for hits".
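For what it's worth, the frameworks mentioned here can also be capped at run time through environment variables, so a process never spawns more worker threads than one CCD has cores. A minimal sketch; the 8-core CCD size is an assumption, and the variables must be set before the libraries are loaded:

```python
import os

def limit_to_one_ccd(cores_per_ccd=8):
    """Cap common threaded runtimes to a single CCD's worth of threads.

    The CCD size is assumed, not detected; adjust for your part.
    """
    n = str(cores_per_ccd)
    os.environ["OMP_NUM_THREADS"] = n       # OpenMP runtimes (GCC, Clang, AOCC)
    os.environ["OPENBLAS_NUM_THREADS"] = n  # OpenBLAS

limit_to_one_ccd()
```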
-
Originally posted by habilain: It's not about memory bandwidth / working set size. It's about minimising cross-CCD latency to get maximum benefit. Going to the L3 cache on another CCD incurs a fairly substantial latency penalty, as it has to go through the Infinity Fabric, which, while fast, is a) a finite resource (i.e. you *can* saturate the Infinity Fabric) and b) much slower than just going to your own L3 cache.
But even in your case, the second CCD's cache is still faster than accessing RAM, though with much less benefit than a CCD-local cache would give. I don't think Zen's caches work this way, though: whenever a core needs data from another CCD's cache, the line gets copied over to the core's own cache and other data has to be evicted, so you can't really use all the cache for one core.
It's only when you actually can't run on a single chiplet that you should consider running across multiple chiplets, because doing so has a penalty.
Still, I don't get what a compiler can do there; can you point to a specific option one would use? As far as I know, your problem is solved either with a VM for each CCD or with core pinning for a single workload.
-
Originally posted by Anux: But compiling will hardly help with memory bandwidth and working set size. To profit from the large L3 cache, your workload needs to be limited by memory bandwidth or access time, and the working set must be small enough to roughly fit in L3, none of which can be influenced by compiling.
So to maximise benefit from multiple 3D V-Cache chiplets (and, to be honest, any processor with chiplets over a shared interconnect), applications should be compiled and deployed in a way that respects the underlying architecture. In the case of 12 3D V-Cache chiplets, you'd get substantially better overall performance by treating them as a system with 12 socketed CPUs. It's only when you actually can't run on a single chiplet that you should consider running across multiple chiplets, because doing so has a penalty.
And that's why, in a server setting, people are willing to make changes to how they compile and (as I should have said earlier) deploy applications to get maximum performance. On a desktop, that's just not the case. Core parking gets performance benefits partly from turning off a CCD to gain more thermal headroom, but also by forcing tasks onto a single CCD where possible, eliminating cross-CCD latency.
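The core pinning discussed here doesn't need a VM: on Linux, a process can be confined to one CCD's cores with `taskset` or programmatically via the scheduler-affinity syscall. A minimal sketch using Python's standard `os.sched_setaffinity` wrapper; the contiguous per-CCD core numbering is an assumption:

```python
import os

def pin_to_ccd(pid, ccd_index, cores_per_ccd=8):
    """Confine a process (pid 0 = the caller) to one CCD's cores.

    Assumes contiguous CPU numbering per CCD; check lscpu or hwloc on real
    hardware, since SMT siblings are often numbered in a separate block.
    """
    first = ccd_index * cores_per_ccd
    os.sched_setaffinity(pid, range(first, first + cores_per_ccd))

# Pin the current process to CCD 0 (capped at the CPUs actually present):
pin_to_ccd(0, 0, cores_per_ccd=min(8, os.cpu_count() or 1))
print(sorted(os.sched_getaffinity(0)))
```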
-
Originally posted by habilain:
Remember that there probably isn't that huge a performance benefit to 2 cores with 3D V-Cache in cache-sensitive applications, because of cross-CCD latency. I mean, we don't know that for certain, but it certainly seems a plausible explanation.
I'd guess it's something along the lines of picking between these two performance options:
- Game on one CCD, Windows background on the other CCD
- Game + Windows background on one CCD, other CCD off; higher boosting on the active CCD due to power management, and no need to worry about cross-CCD latency affecting communication between the game and Windows
On an old Ryzen with 4 MB x2 cache on split CCDs, I was able to get a nearly 50% performance boost by forcing everything that wasn't Dwarf Fortress worldgen over to the other CCD.
-
Originally posted by habilain: It's not that it can't work, but it's less likely to work on consumer platforms. In a commercial setting, people make sure that they compile their applications to take full advantage of the hardware.
Originally posted by NotMine999: I am pretty certain that 12 is larger than 2.
Does a 12-CCD (eeny-meeny-miny-moe) CPU have a better opportunity than a 2-CCD (ping-pong, either-or) CPU to fully utilize and maximize this feature?
AMD's explanation for the 59XX X3D only getting one cached CCD was that it didn't improve gaming (games suffer from inter-CCD communication), while the higher clocks of the uncached CCD helped in certain games/workloads (and price was probably also a reason).
-
Originally posted by bug77:
Tbh, it sounds like a terrible idea if the CPU can't tell whether it's running a cache-sensitive or IPC-sensitive load and needs to be told. I'll take it if it works; hopefully we won't need to modify all apps to send those hints.