AMD 3D V-Cache Performance Optimizer Driver Posted For Linux


  • #21
    Originally posted by Weasel View Post
    No it's a terrible idea. This requires the application to actually be aware of this shit and write to it. Unreal. It's not gonna happen.
    Script kiddies that write gamez be 2 lazy 2 optimize their kodez for the target environments of their applications. GEEZ

    They must be way too focused on their next bubble tea, bowl of pho, and freak-out BJ



    • #22
      Originally posted by Anux View Post

      I fail to see why 2 would be a special case when it clearly works with 12 CCDs (see Epyc 9684X).
      I am pretty certain that 12 is larger than 2 ... even for very large values of 2

      A 12-CCD (eeny-meeny-miny-moe) CPU has a better opportunity than a 2-CCD (ping-pong, either-or) CPU to make full use of this feature??



      • #23
        Originally posted by bug77 View Post

        Tbh, sounds like a terrible idea if the CPU can't tell whether it's running a cache- or IPC-sensitive load and needs to be told. I'll take it if it works; hopefully we won't need to modify all apps to send those hints.
        The CPU cannot do this because it has no notion of which thread or process is currently running (after all, it just executes code from wherever its instruction pointer happens to be). Simply measuring, say, cache misses doesn't work either: the fact that some threads see lots of cache misses from time to time does not mean the process as a whole gains more from extra cache than from extra clock speed. The only thing that can settle this is extensive benchmarking by the developer.
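        If the posted driver really exposes a single system-wide sysfs knob (which is what the patch series suggests), the "hint" can come from a launcher or wrapper rather than the game itself. A minimal sketch, assuming the attribute is called amd_x3d_mode and accepts the strings "frequency" and "cache"; the device path below is a guess, so locate the real one with find /sys -name amd_x3d_mode:

        /* x3d_mode.c - flip the 3D V-Cache preference before launching a workload.
         * Assumes the amd_x3d_mode attribute from the posted driver; the device
         * instance in the path is hypothetical and varies per machine. Needs root.
         * Build: cc x3d_mode.c -o x3d_mode */
        #include <stdio.h>
        #include <string.h>

        /* Assumed location of the driver's mode attribute; varies per machine. */
        #define X3D_MODE_PATH "/sys/bus/platform/devices/AMDI0101:00/amd_x3d_mode"

        int main(int argc, char **argv)
        {
                const char *mode = (argc > 1) ? argv[1] : "cache";

                if (strcmp(mode, "cache") != 0 && strcmp(mode, "frequency") != 0) {
                        fprintf(stderr, "usage: %s [cache|frequency]\n", argv[0]);
                        return 1;
                }

                FILE *f = fopen(X3D_MODE_PATH, "w");
                if (!f) {
                        perror("open amd_x3d_mode");
                        return 1;
                }
                if (fprintf(f, "%s\n", mode) < 0)
                        perror("write amd_x3d_mode");
                fclose(f);
                return 0;
        }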



        • #24
          Now "semiconductor boobs" optimizer is finally available on Linux.



          • #25
            Originally posted by habilain View Post
            It's not that it can't work, but it's less likely to work on consumer platforms. In a commercial setting, people make sure that they compile their applications to take full advantage of the hardware.
            But compiling will hardly help with memory bandwidth or working-set size. To profit from the large L3 cache, your workload needs to be limited by memory bandwidth or access latency, and its working set must be small enough to roughly fit in L3; neither of those can be influenced by compiling. (A rough way to probe this is sketched at the end of this post.)

            Originally posted by NotMine999 View Post
            I am pretty certain that 12 is larger than 2
            Exactly, why would CCD intercommunication be no problem with 12 CCDs but with 2 it would be? That doesn't make any sense to me.
            A 12-CCD (eeny-meeny-miny-moe) CPU has a better opportunity than a 2-CCD (ping-pong, either-or) CPU to make full use of this feature??
            ? The more CCDs you have, the more communication/data transfer has to happen between CCDs, so 2 CCDs should be much less problematic. And if you have additional cache on each CCD, the transfers can be buffered much better, so more cached CCDs should always be better than fewer.

            AMD's explanation for the dual-CCD X3D parts only getting one cached CCD was that a second cache die didn't improve gaming (games suffer from inter-CCD communication), while the higher clocks of the uncached CCD helped in certain games/workloads (and price was probably also a reason).
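            For illustration, a rough way to see whether a working set "roughly fits in L3" is a pointer-chasing loop over buffers of growing size: the per-access latency jumps once the buffer spills out of the cache. Plain C, no driver involved; the buffer sizes and iteration count are just illustrative choices:

            /* l3_fit.c - crude working-set probe: average latency per dependent load
             * climbs sharply once the buffer no longer fits in L3.
             * Build: cc -O2 l3_fit.c -o l3_fit */
            #include <stdio.h>
            #include <stdlib.h>
            #include <time.h>

            static double chase(size_t n, long iters)
            {
                    size_t *next = malloc(n * sizeof(*next));
                    if (!next)
                            return -1;
                    for (size_t i = 0; i < n; i++)
                            next[i] = i;
                    /* Sattolo's algorithm: one random cycle over all elements, so the
                     * chase visits everything and the prefetcher can't predict it. */
                    for (size_t i = n - 1; i > 0; i--) {
                            size_t j = (size_t)rand() % i;
                            size_t t = next[i]; next[i] = next[j]; next[j] = t;
                    }

                    struct timespec t0, t1;
                    volatile size_t p = 0;  /* volatile keeps the loop from being optimized away */
                    clock_gettime(CLOCK_MONOTONIC, &t0);
                    for (long k = 0; k < iters; k++)
                            p = next[p];
                    clock_gettime(CLOCK_MONOTONIC, &t1);
                    free(next);

                    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
                    return ns / iters;  /* average ns per dependent load */
            }

            int main(void)
            {
                    /* 1 MiB .. 256 MiB working sets; adjust for your L3 size. */
                    for (size_t mib = 1; mib <= 256; mib *= 2) {
                            size_t n = mib * 1024 * 1024 / sizeof(size_t);
                            printf("%4zu MiB: %6.1f ns/access\n", mib, chase(n, 20000000L));
                    }
                    return 0;
            }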



            • #26
              Originally posted by habilain View Post

              Remember that there probably isn't that huge performance benefit to 2 cores with 3DVCache in cache sensitive applications, because of cross-CCD latency. I mean, we don't know that for certain, but it certainly seems a plausible explanation.



              I'd guess it's something along the lines of picking between these two performance options:
              1. Game on one CCD, Windows background on another CCD
              2. Game + Windows background on one CCD, other CCD off, higher boosting on active CCD due to power management + no need to worry about cross-CCD latency affecting communication between Game + Windows
              My guess is that in testing, Game + Windows did not saturate one CCD (which I can believe, as most games use at most 4 threads), and Zen 5 initially had some cross-CCD latency issues (now more-or-less fixed), so option 2 made most sense.
              Yes, but you are also poisoning your cache by using that CCD for other stuff.

              On an old Ryzen with 2x4 MB of cache split across CCDs, I got a nearly 50% performance boost by forcing everything that wasn't Dwarf Fortress worldgen over to the other CCD.
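              That kind of pinning doesn't need anything exotic. A minimal wrapper sketch using sched_setaffinity, assuming (purely for illustration) that logical CPUs 0-7 are the CCD you want to reserve; check /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list for the real CCD boundaries, or just use taskset:

              /* pin_ccd.c - run a command pinned to one CCD's logical CPUs so its
               * working set stays in that CCD's L3. CPUs 0-7 as "one CCD" is an
               * assumption; adjust to your topology.
               * Build: cc pin_ccd.c -o pin_ccd
               * Use:   ./pin_ccd ./some_game   (and pin everything else elsewhere) */
              #define _GNU_SOURCE
              #include <sched.h>
              #include <stdio.h>
              #include <unistd.h>

              int main(int argc, char **argv)
              {
                      if (argc < 2) {
                              fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
                              return 1;
                      }

                      cpu_set_t set;
                      CPU_ZERO(&set);
                      for (int cpu = 0; cpu < 8; cpu++)  /* assumed: CPUs 0-7 = one CCD */
                              CPU_SET(cpu, &set);

                      if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 = this process */
                              perror("sched_setaffinity");
                              return 1;
                      }
                      execvp(argv[1], &argv[1]);  /* the affinity mask survives exec */
                      perror("execvp");
                      return 1;
              }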



              • #27
                Originally posted by Anux View Post
                But compiling will hardly help with memory bandwidth or working-set size. To profit from the large L3 cache, your workload needs to be limited by memory bandwidth or access latency, and its working set must be small enough to roughly fit in L3; neither of those can be influenced by compiling.
                It's not about memory bandwidth / working set size. It's about minimising cross-CCD latency to get maximum benefit. Going to the L3 Cache on another CCD incurs a fairly substantial latency penalty, as it has to go through the infinity fabric - which, while fast, is a) a finite resource (i.e. you *can* saturate the infinity fabric) and b) is much slower than just going to your own L3 cache.

                So to maximise benefit from multiple 3D-VCache chips (and, to be honest, any processor with chiplets over a shared interconnect), applications should be compiled / deployed in a way that respects the underlying architecture. So, in the case of 12 3D-VCache chiplets, you'd get substantially better overall performance by treating them as a system with 12 socketed CPUs. It's only when you actually can't run on a single chiplet that you should consider running across multiple chiplets, because doing so has a penalty.

                And that's why, in a server setting, people are willing to change how they compile and (I should have said this earlier) deploy applications to get maximum performance. On a desktop, that's just not the case. Core parking gets its performance benefit partly from turning off a CCD to gain thermal headroom, but also from forcing tasks onto a single CCD where possible, eliminating cross-CCD latency.
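                A hedged sketch of what "respecting the underlying architecture" can look like at deploy time, without touching the compiler: group logical CPUs by which L3 they share (on current Zen parts each CCD has its own L3) using the standard sysfs cacheinfo, then hand each group to its own worker, VM or cpuset. Treat index3 being the L3 as an assumption about the usual x86 layout:

                /* ccd_map.c - print which logical CPUs share an L3, i.e. sit on the
                 * same CCD, using /sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list.
                 * A deploy script can feed each group to taskset, numactl or a cpuset.
                 * Build: cc ccd_map.c -o ccd_map */
                #include <stdio.h>
                #include <string.h>

                int main(void)
                {
                        char seen[64][256];  /* distinct shared_cpu_list values already printed */
                        int nseen = 0;

                        for (int cpu = 0; cpu < 4096; cpu++) {
                                char path[128], list[256] = "";
                                snprintf(path, sizeof(path),
                                         "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list",
                                         cpu);
                                FILE *f = fopen(path, "r");
                                if (!f)
                                        break;  /* assume no more CPUs (or no L3 cacheinfo) */
                                if (!fgets(list, sizeof(list), f))
                                        list[0] = '\0';
                                fclose(f);

                                int dup = 0;
                                for (int i = 0; i < nseen; i++)
                                        if (strcmp(seen[i], list) == 0)
                                                dup = 1;
                                if (!dup && nseen < 64) {
                                        strcpy(seen[nseen++], list);
                                        /* the sysfs value already ends with a newline */
                                        printf("L3 domain %d: CPUs %s", nseen - 1, list);
                                }
                        }
                        return 0;
                }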



                • #28
                  Originally posted by habilain View Post
                  It's not about memory bandwidth / working set size. It's about minimising cross-CCD latency to get maximum benefit. Going to the L3 Cache on another CCD incurs a fairly substantial latency penalty, as it has to go through the infinity fabric - which, while fast, is a) a finite resource (i.e. you *can* saturate the infinity fabric) and b) is much slower than just going to your own L3 cache.
                  OK, we misunderstood each other: you are pointing at a single CCD that uses another CCD's cache, while I described a scenario that uses all cores and their local caches.
                  But even in your case the second CCD's cache is still faster than going to RAM, though surely with much less benefit than a CCD-local cache would give. And I don't think Zen's caches work that way anyway: whenever a core needs data from another CCD's cache, the line gets copied into the core's own cache and other data has to be evicted, so you can't really use all of the cache for one core.

                  It's only when you actually can't run on a single chiplet that you should consider running across multiple chiplets, because doing so has a penalty.
                  For gaming, sure; but if you have perfectly multi-threadable code (no write locks between threads), then no.

                  Still, I don't see what a compiler can do there; can you point to a specific option one would use? As far as I know, that problem is solved either with one VM per CCD or with core pinning for a single workload.



                  • #29
                    Originally posted by Anux View Post
                    OK, we misunderstood each other: you are pointing at a single CCD that uses another CCD's cache, while I described a scenario that uses all cores and their local caches.
                    But even in your case the second CCD's cache is still faster than going to RAM, though surely with much less benefit than a CCD-local cache would give. And I don't think Zen's caches work that way anyway: whenever a core needs data from another CCD's cache, the line gets copied into the core's own cache and other data has to be evicted, so you can't really use all of the cache for one core.

                    For gaming, sure; but if you have perfectly multi-threadable code (no write locks between threads), then no.

                    Still, I don't see what a compiler can do there; can you point to a specific option one would use? As far as I know, that problem is solved either with one VM per CCD or with core pinning for a single workload.
                    It can be faster on a cache hit than going to RAM, but it likely comes at the cost of increased memory latency overall, since on a local L3 miss the core has to probe the other CCD's cache to see if a hit can be found there. So overall performance may not be a straight-up win; it's going to vary by application.

                    I did not say "compilers"; I said code would be compiled and deployed for a specific deployment target. That includes compile-time options to limit the number of cores used, which I think is how some frameworks do it; OpenBLAS and OpenMP come to mind. Thinking about it, I believe that AMD's AOCC does have some options on this as well (it certainly can build using CPU affinities), which might be useful in deciding when not to do a full "check all other CCDs for hits".



                    • #30
                      Originally posted by habilain View Post
                      Thinking about it, I believe that AMD's AOCC does have some options on this as well (it certainly can build using CPU affinities), which might be useful in deciding when not to do a full "check all other CCDs for hits".
                      Interesting, it looks like "GOMP_CPU_AFFINITY" is from OpenMP: https://spec.org/cpu2017/flags/aocc4...022-12-08.html
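                      To be precise, GOMP_CPU_AFFINITY is a run-time environment variable of GCC's OpenMP runtime (libgomp) rather than a compile switch; OMP_PLACES/OMP_PROC_BIND are the portable equivalents. A tiny sketch of how it's typically used, with CPUs 0-7 as an assumed CCD boundary:

                      /* omp_affinity.c - report where each OpenMP thread landed.
                       * Build: cc -fopenmp omp_affinity.c -o omp_affinity
                       * Run:   GOMP_CPU_AFFINITY=0-7 ./omp_affinity               (libgomp-specific)
                       *   or:  OMP_PLACES=cores OMP_PROC_BIND=close ./omp_affinity   (portable)
                       * CPUs 0-7 as "one CCD" is an assumption; check the topology first. */
                      #define _GNU_SOURCE
                      #include <omp.h>
                      #include <sched.h>
                      #include <stdio.h>

                      int main(void)
                      {
                              #pragma omp parallel
                              {
                                      /* sched_getcpu() reports the CPU this thread is running on right now. */
                                      printf("thread %d on CPU %d\n",
                                             omp_get_thread_num(), sched_getcpu());
                              }
                              return 0;
                      }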

