AMD 3D V-Cache Performance Optimizer Driver Posted For Linux


  • npwx
    replied
    Originally posted by Weasel View Post
    No it's a terrible idea. This requires the application to actually be aware of this shit and write to it. Unreal. It's not gonna happen.
    Indeed. Just checked the mailing list and was surprised it hasn't been shot down yet.



  • habilain
    replied
    Originally posted by Anux View Post
    Interesting, it looks like "GOMP_CPU_AFFINITY" is from OpenMP: https://spec.org/cpu2017/flags/aocc4...022-12-08.html
    It is from OpenMP - mainly because OpenMP is one of the frameworks that provides the means to adequately describe, say, a 256-core CPU's cache, interconnect and memory layout. It's really not trivial to get maximum performance out of these things!
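    To make the affinity mechanism concrete, here is a minimal sketch (not from the thread) of a program whose threads can be pinned via libgomp's GOMP_CPU_AFFINITY environment variable. The program itself is an illustrative assumption; the env var and its "list of CPUs" format are real libgomp behavior when the binary is built with -fopenmp:

    ```c
    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>   /* only present when built with -fopenmp */
    #endif

    int main(void) {
        /* When built with -fopenmp and launched as, e.g.:
         *   GOMP_CPU_AFFINITY="0 2 4 6" ./a.out
         * libgomp pins its worker threads to CPUs 0, 2, 4 and 6 -
         * one way to keep a parallel region on a single CCD's L3. */
    #ifdef _OPENMP
        #pragma omp parallel
        {
            #pragma omp single
            printf("threads: %d\n", omp_get_num_threads());
        }
    #else
        /* Without -fopenmp the pragmas are ignored and we run serially. */
        printf("threads: 1 (built without OpenMP)\n");
    #endif
        return 0;
    }
    ```

    The same binary runs serially without OpenMP support, so the affinity setting is purely a deployment-time decision.
    
    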



  • Anux
    replied
    Originally posted by habilain View Post
    Thinking about it, I believe that AMD's AOCC does have some options on this as well (it certainly can build using CPU affinities), which might be useful in deciding when not to do a full "check all other CCDs for hits".
    Interesting, it looks like "GOMP_CPU_AFFINITY" is from OpenMP: https://spec.org/cpu2017/flags/aocc4...022-12-08.html



  • habilain
    replied
    Originally posted by Anux View Post
    OK, we misunderstood each other: you are pointing at a single CCD that uses another CCD's cache, while I described a scenario that uses all cores and their local caches.
    But even in your case the second CCD's cache is still faster than accessing RAM, though with much less benefit than a CCD-local cache would give. But I don't think Zen's caches work this way: whenever a core needs to access another CCD's cache, the data gets copied over to the core's own cache and other data has to be evicted, so you can't really use all the cache for one core.

    For gaming, sure; if you have perfectly multi-threadable code (no write locks between threads), then not.

    Still, I don't get what a compiler can do there - can you point to a specific option one would use? As far as I know, your problem is solved either with a VM for each CCD or with core pinning for a single workload.
    It can be faster on a cache hit than going to RAM, but it likely comes with the penalty of increased memory latency overall: on a CCD's cache miss, the system has to check the other CCDs' caches to see if a hit can be found there. So overall performance may not be a straight-up increase - it's going to vary by application.

    I did not specify "compilers"; I said code would be compiled and deployed for a specific deployment target. That includes compile-time options to limit the number of cores used, which I think is how some frameworks do it - OpenBLAS and OpenMP come to mind. Thinking about it, I believe that AMD's AOCC does have some options on this as well (it certainly can build using CPU affinities), which might be useful in deciding when not to do a full "check all other CCDs for hits".



  • Anux
    replied
    Originally posted by habilain View Post
    It's not about memory bandwidth / working set size. It's about minimising cross-CCD latency to get maximum benefit. Going to the L3 Cache on another CCD incurs a fairly substantial latency penalty, as it has to go through the infinity fabric - which, while fast, is a) a finite resource (i.e. you *can* saturate the infinity fabric) and b) is much slower than just going to your own L3 cache.
    OK, we misunderstood each other: you are pointing at a single CCD that uses another CCD's cache, while I described a scenario that uses all cores and their local caches.
    But even in your case the second CCD's cache is still faster than accessing RAM, though with much less benefit than a CCD-local cache would give. But I don't think Zen's caches work this way: whenever a core needs to access another CCD's cache, the data gets copied over to the core's own cache and other data has to be evicted, so you can't really use all the cache for one core.

    It's only when you actually can't run on a single chiplet that you should consider running across multiple chiplets, because doing so has a penalty.
    For gaming, sure; if you have perfectly multi-threadable code (no write locks between threads), then not.

    Still, I don't get what a compiler can do there - can you point to a specific option one would use? As far as I know, your problem is solved either with a VM for each CCD or with core pinning for a single workload.
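    For reference, core pinning for a single workload needs no compiler support at all - it can be done through the Linux scheduler API. A minimal sketch (pinning to CPU 0 is just a placeholder; which CPUs belong to the V-Cache CCD is machine-specific and worth checking with lscpu first):

    ```c
    #define _GNU_SOURCE        /* for CPU_SET/CPU_COUNT and sched_setaffinity */
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        /* Restrict this process (pid 0 = self) to CPU 0 only. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* Read the mask back to confirm the pinning took effect. */
        cpu_set_t got;
        CPU_ZERO(&got);
        sched_getaffinity(0, sizeof(got), &got);
        printf("pinned: cpu0=%d, cpus in mask=%d\n",
               CPU_ISSET(0, &got), CPU_COUNT(&got));
        return 0;
    }
    ```

    From the shell, `taskset -c 0-7 ./app` achieves the same thing without modifying the application.
    
    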



  • habilain
    replied
    Originally posted by Anux View Post
    But compiling will hardly help with memory bandwidth and working set size. To profit from the large L3 cache, your workload needs to be limited by memory bandwidth or access latency, and the working set must be small enough to roughly fit in L3, none of which can be influenced by compiling.
    It's not about memory bandwidth / working set size. It's about minimising cross-CCD latency to get maximum benefit. Going to the L3 Cache on another CCD incurs a fairly substantial latency penalty, as it has to go through the infinity fabric - which, while fast, is a) a finite resource (i.e. you *can* saturate the infinity fabric) and b) is much slower than just going to your own L3 cache.

    So to maximise benefit from multiple 3D-VCache chips (and, to be honest, any processor with chiplets over a shared interconnect), applications should be compiled / deployed in a way that respects the underlying architecture. So, in the case of 12 3D-VCache chiplets, you'd get substantially better overall performance by treating them as a system with 12 socketed CPUs. It's only when you actually can't run on a single chiplet that you should consider running across multiple chiplets, because doing so has a penalty.

    And that's why in a server setting, people are willing to make changes to how they compile - and I should have said this earlier - deploy applications to get maximum performance. On a desktop, that's just not the case. Core parking gets performance benefits partly from turning off a CCD to get more thermal headroom, but also by forcing tasks to be on a single CCD if possible, eliminating cross-CCD latency.
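    One way to "respect the underlying architecture" at deployment time, without compiler changes, is to read the cache topology the kernel already exports and derive the pin list from it. A hedged sketch (assumes Linux sysfs; that index3 is the L3 cache should be verified per machine via the adjacent "level" file):

    ```c
    #include <stdio.h>

    int main(void) {
        /* On Linux, the set of CPUs sharing cpu0's L3 slice - i.e. its
         * CCD on a chiplet CPU - is exported here. Pinning a workload
         * to exactly this list keeps it within one CCD's cache domain. */
        const char *path =
            "/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list";
        FILE *f = fopen(path, "r");
        if (!f) {
            printf("L3 topology not exposed on this system\n");
            return 0;
        }
        char buf[256];
        if (fgets(buf, sizeof(buf), f))
            printf("CPUs sharing cpu0's L3: %s", buf);
        fclose(f);
        return 0;
    }
    ```

    On a hypothetical 12-CCD part, repeating this per CCD would yield the twelve disjoint CPU lists that the "treat it as 12 socketed CPUs" deployment strategy needs.
    
    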



  • JasonAK
    replied
    Originally posted by habilain View Post

    Remember that there probably isn't that huge a performance benefit to 2 cores with 3DVCache in cache-sensitive applications, because of cross-CCD latency. I mean, we don't know that for certain, but it certainly seems a plausible explanation.



    I'd guess it's something along the lines of picking between these two performance options:
    1. Game on one CCD, Windows background on another CCD
    2. Game + Windows background on one CCD, other CCD off, higher boosting on active CCD due to power management + no need to worry about cross-CCD latency affecting communication between Game + Windows
    My guess is that in testing, Game + Windows did not saturate one CCD (which I can believe, as most games use at most 4 threads), and Zen 5 initially had some cross-CCD latency issues (now more-or-less fixed), so option 2 made most sense.
    Yes, but you are also polluting your cache by using a CCD for other stuff.

    On an old Ryzen with a 4MBx2 cache split across CCDs, I was able to get a nearly 50% performance boost by forcing everything that wasn't Dwarf Fortress worldgen over to the other CCD.



  • Anux
    replied
    Originally posted by habilain View Post
    It's not that it can't work, but it's less likely to work on consumer platforms. In a commercial setting, people make sure that they compile their applications to take full advantage of the hardware.
    But compiling will hardly help with memory bandwidth and working set size. To profit from the large L3 cache, your workload needs to be limited by memory bandwidth or access latency, and the working set must be small enough to roughly fit in L3, none of which can be influenced by compiling.

    Originally posted by NotMine999 View Post
    I am pretty certain that 12 is larger than 2
    Exactly - why would CCD intercommunication be no problem with 12 CCDs but a problem with 2? That doesn't make any sense to me.
    A 12 CCD (eny-meany-miney-moe) CPU has a better opportunity than a 2 CCD (ping-pong, either-or) CPU to fully utilize and maximize the application of this feature??
    ? The more CCDs you have, the more communication/data transfer has to happen between CCDs, so 2 CCDs should be much less problematic. And if you have additional cache on each CCD, the transfers can be buffered much better, so more cached CCDs should always be better than fewer.

    AMD's explanation for the 59XXXX3D only getting one cached CCD was that it didn't improve gaming (games suffer from inter-CCD communication), while the higher clocks of the uncached CCD helped in certain games/workloads (and price was probably also a reason).



  • qsmcomp
    replied
    Now "semiconductor boobs" optimizer is finally available on Linux.



  • F.Ultra
    replied
    Originally posted by bug77 View Post

    Tbh, it sounds like a terrible idea if the CPU can't tell whether it's running a cache- or IPC-sensitive load and needs to be told. I'll take it if it works; hopefully we won't need to modify all apps to send those hints.
    The CPU cannot do this because it has no info on what thread or process is currently running (it is, after all, only executing code from wherever its instruction pointer points in memory). Also, just measuring, say, cache misses does not work here: the fact that some threads have many cache misses from time to time does not mean that the process as a whole is more cache-dependent than clock-dependent. The only thing that can settle this is extensive benchmarking by the developer.

