Exploring The Zen 5 SMT Performance With The AMD EPYC 9755 "Turin" CPU


  • coder
    replied
    Originally posted by Kjell View Post
    18.4% slower kernel compilation is observed with my 7950X when SMT is disabled
    For what it's worth, that's Zen 4 and the CPU tested in this article is based on Zen 5. So, apart from it being EPYC, that's another big difference.

    As I mentioned before, Zen 5 also has a weird (and I think somewhat unique) feature of a "split decoder" where, in SMT mode, it devotes one 4-wide block per thread.



  • coder
    replied
    Originally posted by ddriver View Post
    You are misunderstanding what I said. The scalar pipeline (ALU, FPU) is the one that's typically underutilized, as most actual "compute" on the cpu is done in SIMD. An auxiliary thread per core can saturate the scalar units, but when it comes to SIMD, the threads must contend.
    I didn't misunderstand that. The problem with that theory is that we lack supporting evidence. The threads in the benchmarks I cited won't be spending 100% of their time doing vector or floating-point operations, so it seems like there should be points where a sibling thread can harness some of those underutilized resources. However, what the evidence shows is that whatever benefits actually occur from this, they're no match for the downsides of SMT in those cases.

    In order for your theory to hold, there would first have to be the right mix of threads. Next, you'd want some mechanism to facilitate pairing up an int-heavy thread with a vector/fp-heavy thread, yet no such mechanism (until Intel's Thread Director) was ever created. Finally, the benefit of doing so should not only be clear, but also great enough to justify all the trouble, complexity, and extra die area.

    Originally posted by ddriver View Post
    Thus "int" or scalar / alu heavy tests show great improvement from SMT, and "fp" or vector heavy tests show close to no improvement.
    How does this align with what you said before? You said the goal was to have another thread utilize the integer ALUs, while one thread is tying up the SIMD. So, why should it be that many int-only workloads are showing quite a lot of benefit?

    Also, I'm at a loss that you seem unconcerned with filling stalls due to cache misses. If a thread has to go to DRAM, that can take several hundred or even a couple thousand cycles (depending on how full the request queues are)! Having another thread around to fill some of those idle cycles seems like an obvious win. Then, there's the matter of having ever-wider backends stuck behind relatively narrow decoders. In branchy code, the uop cache might not do a lot of good, which is yet another obvious case where it seems like SMT should help.

    I don't know how or why you latched onto SIMD as the main rationale, but if you've got any sources or evidence, let's see it.

    Originally posted by ddriver View Post
    As to how and why in the world the tech "industry" decided to refer to scalar as "int" and to vector as "fp" is beyond me.
    That's because the vector arithmetic & FP pipelines are usually the same thing.

    Originally posted by ddriver View Post
    The cpu can do fp in scalar, and with far greater precision than simd,
    x86 CPUs have legacy support for x87 arithmetic, which supports up to 80-bit precision, but I don't usually see microarchitecture diagrams which specifically split out that logic. In some cases, I think it's probably implemented via some of the fp ports.
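    As a quick sanity check on the precision point, here's a minimal C snippet (assuming a typical x86-64 Linux toolchain, where long double maps to the 80-bit x87 format; on other compilers/ABIs the answer can differ):

    /* Compare double vs. long double precision.
     * On typical x86-64 Linux toolchains, long double is the 80-bit x87
     * extended format, i.e. a 64-bit mantissa vs. 53 bits for double. */
    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        printf("double:      %d mantissa bits\n", DBL_MANT_DIG);   /* 53 (IEEE-754 binary64) */
        printf("long double: %d mantissa bits\n", LDBL_MANT_DIG);  /* 64 (x87 extended precision) */
        return 0;
    }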

    Here's a slide on the Lion Cove P-cores, from the Lunar Lake announcement:

    See the x87/MMX block, down at the bottom? It clearly implies that x87 is implemented using the vector/fp register file and scheduler.



  • Kjell
    replied
    Originally posted by coder View Post
    πŸ™„
    Did you even look at the benchmarks? Or did you think what you had to say was so valuable that you had to skip right to the comments section? I have to ask, because the very first benchmarks are the compilation tests, where SMT hurt 3 out of 5!! It utterly trashed Linux kernel compilation time!
    18.4% slower kernel compilation is observed with my 7950X when SMT is disabled

    That's why I suggested the same test but with end-user hardware like a Ryzen or even Threadripper CPU.

    You can't take benchmarks at face value when hardware and software configuration tends to heavily impact the results.



  • ddriver
    replied
    Originally posted by coder View Post
    LOL, wut? Seems like you're making shit up, man.
    You are misunderstanding what I said. The scalar pipeline (ALU, FPU) is the one that's typically underutilized, as most actual "compute" on the cpu is done in SIMD. An auxiliary thread per core can saturate the scalar units, but when it comes to SIMD, the threads must contend. Thus "int" or scalar/ALU-heavy tests show great improvement from SMT, and "fp" or vector-heavy tests show close to no improvement.

    As to how and why in the world the tech "industry" decided to refer to scalar as "int" and to vector as "fp" is beyond me. The cpu can do fp in scalar, and with far greater precision than simd, and simd units can do ints, which is why I consider this alt nomenclature to be outright retarded.



  • coder
    replied
    Originally posted by Weasel View Post
    The thing is that creating fewer threads if your system is loaded isn't good either. Imagine the load drops halfway and now your app is stuck at half its performance and 50% CPU usage.

    How would it know to create more threads? And what happens if you get a load back later, do you remove threads?
    That's why you need the kernel to manage it. The kernel could also have work queue threads, created and ready to go, for all apps which need them. It would simply know how many to make active at any given time, for each app, based on overall system load. When resizing thread counts, it could also try to avoid suspending work queue threads that are still actively executing work items.

    MacOS has something like this, though I don't know too much about it.
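    If it's what I'm thinking of, that would be Grand Central Dispatch (libdispatch): you submit work items to system-managed queues and the OS decides how many worker threads to back them with. A rough C sketch of the model, just to illustrate (process_item() is a made-up work function):

    /* Work items go onto a system-managed global queue; the OS picks how
     * many worker threads actually run them, based on overall system load. */
    #include <dispatch/dispatch.h>
    #include <stdio.h>

    static void process_item(void *ctx)
    {
        printf("processing item %ld\n", (long)ctx);
    }

    int main(void)
    {
        dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
        dispatch_group_t g = dispatch_group_create();

        for (long i = 0; i < 100; i++)
            dispatch_group_async_f(g, q, (void *)i, process_item);

        /* Wait for everything; note that we never chose a thread count. */
        dispatch_group_wait(g, DISPATCH_TIME_FOREVER);
        return 0;
    }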



  • coder
    replied
    Originally posted by ddriver View Post
    The reason why SMT exists is that, a lot of the time, the logic units only feed instructions to the SIMDs, which is the saturation point, and the scalar resources of the cpu become underutilized.
    LOL, wut? Seems like you're making shit up, man.

    In the last server CPU test I think Anandtech ever did, the data clearly shows better SMT scaling on int than fp workloads.
    SMT is pretty consistently helping int workloads, which wouldn't be the case if SMT were primarily about the SIMD units. The float workloads, which tend to be vector/FP-heavy, actually suffer regressions from it, implying that anything it's doing to help such cases is more than offset by the downsides.

    Note: SPEC2017 Rate-N tests involve running N copies of each program. This means they're "shared-nothing" and theoretically should exhibit perfect scaling (i.e. on the ideal machine).

    Originally posted by ddriver View Post
    Theoretically, for something where latency is a priority, a logical thread on the same cpu core is most preferable, as it can synchronize with the other thread in the same cache pool, it doesn't have to be synchronized over slower interconnect and many nodes.
    Yeah, if you have two threads exchanging a lot of data, it could be a net win to pair them on the same physical core. The best data I could find for this are some of the core-to-core latency plots in this article, but I think the only CPUs tested with SMT were Intel's. I'm not sure why EPYC had it disabled, other than perhaps the sheer core counts made the numbers too unreadable.
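    If you wanted to experiment with that pairing yourself, you can pin the two threads onto the two SMT siblings of one physical core. A minimal sketch (it assumes logical CPUs 0 and 1 are siblings, which you'd first verify in /sys/devices/system/cpu/cpu0/topology/thread_siblings_list):

    /* Pin two communicating threads onto sibling logical CPUs, so they share
     * one physical core's L1/L2. ping()/pong() are just placeholder threads. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static void pin_to_cpu(pthread_t t, int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(t, sizeof(set), &set);
    }

    static void *ping(void *arg) { (void)arg; /* ...exchange data with pong... */ return NULL; }
    static void *pong(void *arg) { (void)arg; /* ...exchange data with ping... */ return NULL; }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, ping, NULL);
        pthread_create(&b, NULL, pong, NULL);
        pin_to_cpu(a, 0);  /* logical CPU 0 */
        pin_to_cpu(b, 1);  /* assumed to be CPU 0's SMT sibling */
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }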

    Multicore CPUs have to give user programs a way to synchronize between different cores, and ensure a coherent view of memory.


    Tip: it helps to click the images and then use right-click to view them in a separate tab.
    Last edited by coder; 18 October 2024, 12:44 PM.



  • Weasel
    replied
    Originally posted by coder View Post
    The games and their developers mostly aren't to blame, IMO. I think the real fault lies with insufficiently expressive threading APIs and overuse of userspace threading as a concurrency mechanism.

    Also, the way a lot of programs use threads to scale multi-core performance is via work queues and thread pools. This isn't something userspace should be doing. We need the kernel to expose a work queue API, so the kernel can dynamically manage the size of the underlying thread pools, based on priority and what else is running. You never want a situation where multiple programs start more worker threads than they really need, and just end up thrashing each other. A lot of programs that use thread pools default to starting a worker thread for each core or SMT sibling, regardless of their actual need. When a bunch of work items are submitted, they can saturate all of the cores, resulting in some getting suspended for other tasks to run on those cores, and the application suffering worse performance than if it started fewer worker threads.
    The thing is that creating fewer threads if your system is loaded isn't good either. Imagine the load drops halfway and now your app is stuck at half its performance and 50% CPU usage.

    How would it know to create more threads? And what happens if you get a load back later, do you remove threads?



  • coder
    replied
    Originally posted by caligula View Post
    Games have suffered from poor utilization of cores/threads as far as I can tell. The developers still don't know how to write engines. Many other tasks are not that demanding in terms of CPU load.
    The games and their developers mostly aren't to blame, IMO. I think the real fault lies with insufficiently expressive threading APIs and overuse of userspace threading as a concurrency mechanism.

    Also, the way a lot of programs use threads to scale multi-core performance is via work queues and thread pools. This isn't something userspace should be doing. We need the kernel to expose a work queue API, so the kernel can dynamically manage the size of the underlying thread pools, based on priority and what else is running. You never want a situation where multiple programs start more worker threads than they really need, and just end up thrashing each other. A lot of programs that use thread pools default to starting a worker thread for each core or SMT sibling, regardless of their actual need. When a bunch of work items are submitted, they can saturate all of the cores, resulting in some getting suspended for other tasks to run on those cores, and the application suffering worse performance than if it started fewer worker threads.
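    To make the default I'm describing concrete, here's roughly what that sizing logic looks like. This is a generic sketch, not any particular library's code, and worker_main() is just a stand-in for a pool's worker loop:

    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void *worker_main(void *arg)
    {
        /* ...pull work items off a shared queue until shutdown... */
        (void)arg;
        return NULL;
    }

    int main(void)
    {
        /* Size the pool to the number of online logical CPUs, which counts
         * every SMT sibling, regardless of how much work there actually is. */
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        pthread_t *workers = malloc(n * sizeof(*workers));

        /* If several applications all do this, there end up being far more
         * runnable worker threads than cores, and they thrash each other. */
        for (long i = 0; i < n; i++)
            pthread_create(&workers[i], NULL, worker_main, NULL);

        for (long i = 0; i < n; i++)
            pthread_join(workers[i], NULL);
        free(workers);
        return 0;
    }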



  • coder
    replied
    Originally posted by carguello2 View Post
    I could be wrong here, but AFAIU you can use "taskset -c 0,1 ./your_program" to pin a specific task to CPU cores 0 and 1.
    taskset sucks, as a general solution. In the case where you have some high-priority thread(s), all you want to do is tell the scheduler to run them on a P-core and try to avoid pairing them with any other threads. Affinity can't do that. Affinity masks force you to overspecify which vCPUs you want it to run on (i.e. you have to say which SMT sibling, when you actually don't care) and give you no way to try and keep other threads away from whichever physical core it gets scheduled on.

    This is also addressing your subsequent point about sched_setaffinity().
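    To be clear about the overspecification, here's a minimal sketch of the Linux call taskset uses. You have to name specific logical CPUs; there's no way to say "any one P-core, and please keep other threads off it":

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);   /* logical CPU 0 */
        CPU_SET(1, &set);   /* logical CPU 1, possibly CPU 0's SMT sibling */

        /* pid 0 means the calling thread */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* This thread now only runs on CPUs 0/1, but nothing stops the
         * scheduler from putting other tasks on those same CPUs. */
        return 0;
    }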

    Originally posted by carguello2 View Post
    As far as disabling SMT, just as Michael said, "cat /sys/module/cpu/parameters/smt" you can "echo=on/off" to enable/disable smt at runtime.
    Blanket-disabling SMT is also much too blunt a hammer. A machine is often doing more than one workload, especially if we're talking about big servers. For most threads, let them go ahead and use SMT!

    Originally posted by Kjell View Post
    Disabling Sibling Cores (VCores / SMT) + Pinning isn't effective as you're sacrificing multi-thread performance when the pinned cores are idle
    That's not how it works. If you pin a task to a core, it (mostly) prevents the task from being moved to a different core, but it doesn't prevent other things from being run on the core (e.g. if the pinned task is blocking on I/O).
    Last edited by coder; 18 October 2024, 12:02 PM.



  • coder
    replied
    Originally posted by mb_q View Post
    This is somewhat confusing because it mixes SMT effect with the scalability problems
    Yes, but it's still useful as a system-level test to tell us how much benefit SMT still provides at this scale. As an end user, if you're faced with a yes/no decision about whether to enable SMT for a certain task, this is exactly what you want to know and pretty much the only thing you care about.

    What it doesn't tell us is where the various bottlenecks are, and how much of it is simply due to software scalability problems vs. hardware bottlenecks of one form or another. Having such insights would be useful, but teasing them out would generally take a more concerted investigation. All Michael's testing can really do is give us clues and indicate what might be useful directions to go in.

