Exploring The Zen 5 SMT Performance With The AMD EPYC 9755 "Turin" CPU


  • #11
    Originally posted by Kjell View Post

    there doesn't seem to be any scheduler doing it yet AFAIK
    Could it be because of a performance penalty or latency during this execution?

    I found these: https://www.man7.org/linux/man-pages...ffinity.2.html and https://docs.redhat.com/en/documenta...essor_affinity - perhaps this could work for you?
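    In case it helps, here's a minimal sketch of what the sched_setaffinity(2) route looks like (Linux-only; CPUs 0 and 1 are just placeholders for whichever cores you'd actually pin to):

    #include <sched.h>   // sched_setaffinity, cpu_set_t, CPU_SET (GNU extensions; g++ defines _GNU_SOURCE by default)
    #include <cstdio>

    int main() {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);   // allow logical CPU 0
        CPU_SET(1, &mask);   // allow logical CPU 1

        // pid 0 means "the calling thread"; pass another pid to pin a different task.
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            std::perror("sched_setaffinity");
            return 1;
        }
        // From here on, this process (and anything it exec()s) only runs on CPUs 0-1.
        return 0;
    }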

    Comment


    • #12
      Originally posted by Kjell View Post
      I mentioned this in the original edit but decided to remove it since there doesn't seem to be any scheduler doing it yet AFAIK
      BPFLand and LAVD are two such schedulers, but they're built on the sched_ext framework that's only just being merged into the latest kernel version.

      Comment


      • #13
        Originally posted by Kjell View Post
        Sim
        However, today many games/heterogeneous-tasks actually run faster with SMT disabled due to modern CPUs having much higher single-core performance and core count. In other words, SMT can actually hurt your performance if you share the resources of a core with tasks that are low-threaded (or single-threaded). Nonetheless, take this with a pinch of salt since multi-threaded/homogeneous-tasks (compilation, rendering, encoding/decoding, compression/decompression) still benefit from SMT
        Sounds like a poorly and lazily thought out concurrency model.

        The reason SMT exists is that a lot of the time the logic units only feed instructions to the SIMD units, which become the saturation point, while the scalar resources of the CPU sit underutilized. This means that, with just a bit more overhead, the core can create a virtual twin thread to better utilize the unsaturated pipeline. Make the uarch slightly wider and you can even get good returns with up to 4 SMT threads per physical core.

        Theoretically, for something where latency is a priority, a logical thread on the same CPU core is the most preferable, as it can synchronize with the other thread in the same cache pool instead of having to be synchronized over a slower interconnect across many nodes.

        That's why AMD's SMT is in practice free in most use cases - it doesn't have to use extra energy, because the scalar pipeline drains roughly the same amount of power regardless of how saturated it is. And since it can do more work with SMT on within the same power budget, it delivers a much better work-per-watt ratio.




        Comment


        • #14
          That mythology about lower gaming performance with SMT enabled was always based on a fucking couple of percent difference. Some "major impact" right there. The variation in opportunistic boosts caused by your chip's silicon quality most likely has about the same impact, if not higher.

          Comment


          • #15
            Originally posted by drakonas777 View Post
            That mythology about lower gaming performance with SMT enabled was always based on a fucking couple of percent difference. Some "major impact" right there. The variation in opportunistic boosts caused by your chip's silicon quality most likely has about the same impact, if not higher.
            Michael's benchmark would be very boring, with little to no difference between SMT on vs. off, if this wasn't an issue. However, you're right that most of the time it's not worth touching SMT, and the cases where it's beneficial are limited, mostly CPU-bound scenarios. The takeaway is that there's still room for improvement via scheduler optimizations and whatnot.

            Nonetheless, SMT off has the potential to reduce FPS jitter (0.1% LOW), presumably due to greater headroom in the core and a better cache hit ratio (L1/L2). There are some outliers like CSGO where the benefit of SMT OFF is as much as +129 FPS in the 0.1% LOW, or productivity tools like web browsing seeing an uplift.

            APP (w/ 7800X3D)    SMT ON (AVG)    SMT OFF (AVG)    SMT ON (0.1% LOW)    SMT OFF (0.1% LOW)
            CSGO 2              791 FPS         798 FPS          149 FPS              278 FPS
            Speedometer 2.0     589 points      632 points       -                    -

            I'm personally pinning my games to CCX0 and disabling sibling cores (SMT) to reduce input latency; it used to noticeably reduce microstuttering in Halo Infinite before VKD3D released optimizations.
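            For reference, roughly what the sibling-disabling half of that looks like if you script it rather than toggle it in the BIOS (a rough sketch: it assumes the common enumeration on an 8-core part where CPUs 8-15 are the SMT siblings of CPUs 0-7, so check topology/thread_siblings_list on your box first, and it needs root):

            #include <fstream>
            #include <string>

            int main() {
                // Offline the assumed SMT siblings (CPUs 8-15 in this hypothetical layout).
                for (int cpu = 8; cpu <= 15; ++cpu) {
                    std::ofstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/online");
                    f << 0;   // writing 0 offlines the logical CPU, writing 1 brings it back
                }
                return 0;
            }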
            Last edited by Kjell; 18 October 2024, 07:48 AM.

            Comment


            • #16
              Originally posted by Kjell View Post
              Simultaneous Multithreading creates 2 Virtual CPUs per Core to prevent task execution from being blocked/queued when the CPU is fully saturated.
              Huh? No, it has two main benefits. First, it helps improve backend utilization, when one thread can't fully saturate it. Second, in cases where one of the threads is stalled by memory operations (or maybe flushing after a mis-predicted branch?), it can keep the core still mostly occupied by useful work.

              The main downside is traditionally one of higher cache contention, but you also have both threads competing for physical registers. In Zen 5, there's a split 4+4 decoder that works like two independent 4-way decoders when the core is fully occupied. I think I read that it takes some 10k cycles or so to switch the core between 1-thread and 2-thread mode, which seems unfortunate.

              Anyway, when running a job that's capable of mostly saturating the backend, like some vector/FP-heavy workloads that aren't very memory-bound, the additional pressure on cache and registers tends to make SMT either a wash or sometimes even a slight negative. As recently as Zen 3 vs. Cascade Lake, Intel did a better job of mitigating SMT's downsides on this front.

              Originally posted by Kjell View Post
              However, today many games/heterogeneous-tasks actually run faster with SMT disabled due to modern CPUs having much higher single-core performance and core count. In other words, SMT can actually hurt your performance if you share the resources of a core with tasks that are low-threaded (or single-threaded).
              That's basically because threading APIs lack expressiveness to tell the OS which threads are latency-sensitive vs. bulk background work. If they knew this, OS schedulers could prioritize the former threads for exclusive assignment to a core. It would also enable them to ensure it's running on a P-core, in hybrid CPUs, which is pretty much a variation on the same theme.
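              About the closest approximation available today is demoting the bulk threads yourself with something like SCHED_IDLE and leaving the latency-sensitive ones alone. That's a crude stand-in for the expressiveness I'm talking about, not the real fix, but roughly (Linux-specific sketch):

              #include <sched.h>    // sched_setscheduler, SCHED_IDLE (exposed with _GNU_SOURCE, which g++ defines)
              #include <cstdio>
              #include <thread>

              void bulk_worker() {
                  sched_param sp{};                                  // priority must be 0 for SCHED_IDLE
                  if (sched_setscheduler(0, SCHED_IDLE, &sp) != 0)   // pid 0 = the calling thread
                      std::perror("sched_setscheduler");
                  // ... churn through background/bulk work at the lowest priority ...
              }

              int main() {
                  std::thread bg(bulk_worker);   // self-demoting background thread
                  // The main thread stays SCHED_OTHER and plays the latency-sensitive role.
                  bg.join();
                  return 0;
              }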

              As for vector/FP-heavy workloads wanting exclusive core usage, that's where Intel's Thread Director comes into play. It spies on each thread and gives the OS clues relating to where it should schedule them. AMD could do something similar, by simply having the OS check the detailed performance counters of each core, when making scheduling decisions.

              Originally posted by Kjell View Post
              Nonetheless, take this with a pinch of salt since multi-threaded/homogeneous-tasks (compilation, rendering, encoding/decoding, compression/decompression) still benefit from SMT
              🙄
              Did you even look at the benchmarks? Or did you think what you had to say was so valuable that you had to skip right to the comments section? I have to ask, because the very first benchmarks are the compilation tests, where SMT hurt 3 out of 5!! It utterly trashed Linux kernel compilation time!

              Comment


              • #17
                Originally posted by mb_q View Post
                This is somewhat confusing because it mixes SMT effect with the scalability problems
                Yes, but it's still useful as a system-level test to tell us how much benefit SMT still provides at this scale. As an end user, if you're faced with a yes/no decision about whether to enable SMT for a certain task, this is exactly what you want to know and pretty much the only thing you care about.

                What it doesn't tell us is where the various bottlenecks are, and how much of it is simply due to software scalability problems vs. hardware bottlenecks of one form or another. Having such insights would be useful, but teasing them out would generally take a more concerted investigation. All Michael's testing can really do is give us clues and indicate what might be useful directions to go in.

                Comment


                • #18
                  Originally posted by carguello2 View Post
                  I could be wrong here, but AFAIU you can use "taskset -c 0,1 ./your_program" to pin CPU cores to a specific task.
                  taskset sucks, as a general solution. In the case where you have some high-priority thread(s), all you want to do is tell the scheduler to run them on a P-core and try to avoid pairing them with any other threads. Affinity can't do that. Affinity masks force you to overspecify which vCPUs you want it to run on (i.e. you have to say which SMT sibling, when you actually don't care) and give you no way to try and keep other threads away from whichever physical core it gets scheduled on.

                  This is also addressing your subsequent point about sched_setaffinity().

                  Originally posted by carguello2 View Post
                  As far as disabling SMT goes, just as Michael said, "cat /sys/module/cpu/parameters/smt" you can "echo=on/off" to enable/disable SMT at runtime.
                  Blanket-disabling SMT is also much too blunt a hammer. A machine is often doing more than one workload, especially if we're talking about big servers. For most threads, let them go ahead and use SMT!

                  Originally posted by Kjell View Post
                  Disabling Sibling Cores (VCores / SMT) + Pinning isn't effective as you're sacrificing multi-thread performance when the pinned cores are idle
                  That's not how it works. If you pin a task to a core, it (mostly) prevents the task from being moved to a different core, but it doesn't prevent other things from being run on the core (e.g. if the pinned task is blocking on I/O).
                  Last edited by coder; 18 October 2024, 12:02 PM.

                  Comment


                  • #19
                    Originally posted by caligula View Post
                    Games have suffered from poor utilization of cores/threads as far as I can tell. The developers still don't know how to write engines. Many other tasks are not that demanding in terms of CPU load.
                    The games and their developers mostly aren't to blame, IMO. I think the real fault lies with insufficiently expressive threading APIs and overuse of userspace threading as a concurrency mechanism.

                    Also, the way a lot of programs use threads to scale multi-core performance is via work queues and thread pools. This isn't something userspace should be doing. We need the kernel to expose a work queue API, so the kernel can dynamically manage the size of the underlying thread pools, based on priority and what else is running. You never want a situation where multiple programs start more worker threads than they really need, and just end up thrashing each other. A lot of programs that use thread pools default to starting a worker thread for each core or SMT sibling, regardless of their actual need. When a bunch of work items are submitted, they can saturate all of the cores, resulting in some getting suspended for other tasks to run on those cores, and the application suffering worse performance than if it started fewer worker threads.
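                    For illustration, the default pattern I'm criticizing looks roughly like this (a generic sketch, not any particular engine's code); every process that sizes its pool this way assumes it owns the whole machine, so two of them on the same box gives 2x oversubscription:

                    #include <condition_variable>
                    #include <functional>
                    #include <mutex>
                    #include <queue>
                    #include <thread>
                    #include <vector>

                    int main() {
                        // One worker per hardware thread, regardless of actual need.
                        unsigned n = std::thread::hardware_concurrency();

                        std::queue<std::function<void()>> work;
                        std::mutex m;
                        std::condition_variable cv;
                        bool done = false;

                        std::vector<std::thread> workers;
                        for (unsigned i = 0; i < n; ++i)
                            workers.emplace_back([&] {
                                for (;;) {
                                    std::function<void()> job;
                                    {
                                        std::unique_lock<std::mutex> lk(m);
                                        cv.wait(lk, [&] { return done || !work.empty(); });
                                        if (done && work.empty()) return;
                                        job = std::move(work.front());
                                        work.pop();
                                    }
                                    job();   // a big batch of items saturates every core at once
                                }
                            });

                        // ... submit work items under the lock, cv.notify_one() per item ...
                        { std::lock_guard<std::mutex> lk(m); done = true; }
                        cv.notify_all();
                        for (auto& t : workers) t.join();
                        return 0;
                    }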

                    Comment


                    • #20
                      Originally posted by coder View Post
                      The games and their developers mostly aren't to blame, IMO. I think the real fault lies with insufficiently expressive threading APIs and overuse of userspace threading as a concurrency mechanism.

                      Also, the way a lot of programs use threads to scale multi-core performance is via work queues and thread pools. This isn't something userspace should be doing. We need the kernel to expose a work queue API, so the kernel can dynamically manage the size of the underlying thread pools, based on priority and what else is running. You never want a situation where multiple programs start more worker threads than they really need, and just end up thrashing each other. A lot of programs that use thread pools default to starting a worker thread for each core or SMT sibling, regardless of their actual need. When a bunch of work items are submitted, they can saturate all of the cores, resulting in some getting suspended for other tasks to run on those cores, and the application suffering worse performance than if it started fewer worker threads.
                      The thing is that creating fewer threads when your system is loaded isn't good either. Imagine the load drops halfway through and now your app is stuck at half its performance and 50% CPU usage.

                      How would it know to create more threads? And what happens if the load comes back later? Do you remove threads again?

                      Comment
