Exploring The Zen 5 SMT Performance With The AMD EPYC 9755 "Turin" CPU


  • #21
    Originally posted by ddriver View Post
    The reason why SMT exists is that, a lot of the time, the logic units only feed instructions to the SIMD units, which is the saturation point, and the scalar resources of the CPU become underutilized.
    LOL, wut? Seems like you're making shit up, man.

    In what I think was the last server CPU test AnandTech ever did, the data clearly shows better SMT scaling on int workloads than on fp workloads.
    SMT pretty consistently helps int workloads, which wouldn't be the case if SMT were primarily about the SIMD units. The float workloads, which tend to be vector/FP-heavy, actually suffer regressions from it, implying that whatever it does to help such cases is more than offset by the downsides.

    Note: SPEC2017 Rate-N tests involve running N copies of each program. This means they're "shared-nothing" and theoretically should exhibit perfect scaling (i.e. on the ideal machine).

    Originally posted by ddriver View Post
    Theoretically, for something where latency is a priority, a logical thread on the same CPU core is most preferable, as it can synchronize with the other thread through the same cache pool; it doesn't have to be synchronized over a slower interconnect spanning many nodes.
    Yeah, if you have two threads exchanging a lot of data, it could be a net win to pair them on the same physical core. The best data I could find for this are some of the core-to-core latency plots in this article, but I think the only CPUs tested with SMT were Intel's. I'm not sure why EPYC had it disabled, other than perhaps that the sheer core counts made the numbers too unreadable.

    Multicore CPUs have to give user programs a way to synchronize between different cores, and ensure a coherent view of memory.
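
    If you want to see that effect directly, the usual trick is a ping-pong microbenchmark: pin two threads to a pair of logical CPUs and bounce a flag between them through an atomic. Here's a minimal Linux-only sketch; the CPU numbers are placeholders (check /sys/devices/system/cpu/cpuN/topology/thread_siblings_list for the real SMT-sibling mapping on your machine), so treat it as an illustration, not the methodology behind the plots in that article.

    Code:
        // Build: g++ -O2 -pthread pingpong.cpp
        #include <atomic>
        #include <chrono>
        #include <cstdio>
        #include <pthread.h>
        #include <sched.h>
        #include <thread>

        // Hypothetical CPU pair -- substitute two SMT siblings from sysfs.
        constexpr int CPU_A = 0;
        constexpr int CPU_B = 1;
        constexpr int ITERS = 1000000;

        std::atomic<int> flag{0};

        void pin(int cpu) {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        }

        int main() {
            std::thread t([] {
                pin(CPU_B);
                for (int i = 0; i < ITERS; i++) {
                    // Wait for our turn, then bounce the flag back.
                    while (flag.load(std::memory_order_acquire) != 2 * i + 1) {}
                    flag.store(2 * i + 2, std::memory_order_release);
                }
            });

            pin(CPU_A);
            auto start = std::chrono::steady_clock::now();
            for (int i = 0; i < ITERS; i++) {
                flag.store(2 * i + 1, std::memory_order_release);
                while (flag.load(std::memory_order_acquire) != 2 * i + 2) {}
            }
            auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                          std::chrono::steady_clock::now() - start).count();
            t.join();

            // One round trip = two cross-thread handoffs through the cache.
            std::printf("avg round trip: %.1f ns\n", double(ns) / ITERS);
        }

    Run it once on two SMT siblings and once on cores in different CCDs, and the gap between "same cache pool" and "over the interconnect" should show up directly in the round-trip time.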


    Tip: it helps to click the images and then use right-click to view them in a separate tab.
    Last edited by coder; 18 October 2024, 12:44 PM.



    • #22
      Originally posted by Weasel View Post
      The thing is that creating fewer threads if your system is loaded isn't good either. Imagine the load drops to half and now your app is stuck at half its performance and 50% CPU usage.

      How would it know to create more threads? And what happens if the load comes back later? Do you remove threads?
      That's why you need the kernel to manage it. The kernel could also keep work-queue threads, created and ready to go, for every app that needs them. It would simply decide how many to make active at any given time, for each app, based on overall system load. When resizing thread counts, it could also try to avoid suspending work-queue threads that are still actively executing work items.

      macOS has something like this, though I don't know too much about it.
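
      In case it helps make the idea concrete, here's a rough user-space sketch of such a pool, where an external manager (the kernel, in the scheme above) can dial the number of active workers up or down without destroying threads. ElasticPool and its members are made-up names, and a real kernel-side version would look nothing like plain C++ threads; this only demonstrates the resizing behavior.

      Code:
          #include <condition_variable>
          #include <functional>
          #include <mutex>
          #include <queue>
          #include <thread>
          #include <vector>

          class ElasticPool {
          public:
              explicit ElasticPool(unsigned max_threads) : active_(max_threads) {
                  for (unsigned i = 0; i < max_threads; i++)
                      workers_.emplace_back([this, i] { run(i); });
              }
              ~ElasticPool() {
                  { std::lock_guard<std::mutex> lk(m_); done_ = true; }
                  cv_.notify_all();
                  for (auto& w : workers_) w.join();
              }
              void submit(std::function<void()> job) {
                  { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
                  cv_.notify_all();  // keeps the sketch simple: no lost wakeups
              }
              // The "kernel" would call this as overall system load changes.
              void set_active(unsigned n) {
                  { std::lock_guard<std::mutex> lk(m_); active_ = n; }
                  cv_.notify_all();
              }
          private:
              void run(unsigned id) {
                  for (;;) {
                      std::function<void()> job;
                      {
                          std::unique_lock<std::mutex> lk(m_);
                          // Workers with id >= active_ park here instead of
                          // exiting, so ramping back up is just a notify.
                          cv_.wait(lk, [&] {
                              return done_ || (id < active_ && !jobs_.empty());
                          });
                          if (done_) return;
                          job = std::move(jobs_.front());
                          jobs_.pop();
                      }
                      job();  // runs to completion; resizing never interrupts a job
                  }
              }
              std::mutex m_;
              std::condition_variable cv_;
              std::queue<std::function<void()>> jobs_;
              std::vector<std::thread> workers_;
              unsigned active_;
              bool done_ = false;
          };

      The parked-not-destroyed trick is the whole point: scaling back up is just a wakeup, which answers the "how would it know to create more threads?" problem.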



      • #23
        Originally posted by coder View Post
        LOL, wut? Seems like you're making shit up, man.
        You are misunderstanding what I said. The scalar pipeline (ALU, FPU) is the one that's typically underutilized, as most actual "compute" on the CPU is done in SIMD. An auxiliary thread per core can saturate the scalar units, but when it comes to SIMD, the threads must contend. Thus "int" or scalar/ALU-heavy tests show great improvement from SMT, and "fp" or vector-heavy tests show close to no improvement.

        How and why in the world the tech "industry" decided to refer to scalar as "int" and to vector as "fp" is beyond me. The CPU can do fp in scalar code, and with far greater precision than SIMD, and SIMD units can do ints, which is why I consider this alt nomenclature to be outright stupid.
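
        For anyone following along, the mismatch is easy to demonstrate: the same "vector" hardware the industry files under "fp" does integer math just fine, and plain scalar code can do floating point. A tiny x86-64 sketch (SSE2 is baseline there; the 80-bit long double mapping assumes Linux with GCC/Clang):

        Code:
            #include <immintrin.h>  // SSE2 intrinsics, baseline on x86-64
            #include <cstdio>

            int main() {
                // The "fp"/vector unit doing plain integer math:
                // four 32-bit adds at once.
                __m128i a = _mm_set_epi32(4, 3, 2, 1);
                __m128i b = _mm_set_epi32(40, 30, 20, 10);
                __m128i sum = _mm_add_epi32(a, b);

                alignas(16) int out[4];
                _mm_store_si128(reinterpret_cast<__m128i*>(out), sum);
                std::printf("%d %d %d %d\n",
                            out[0], out[1], out[2], out[3]);  // 11 22 33 44

                // And "int"/scalar code doing floating point,
                // via the 80-bit x87 format.
                long double third = 1.0L / 3.0L;
                std::printf("%.20Lf\n", third);
            }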



        • #24
          Originally posted by coder View Post
          🙄
          Did you even look at the benchmarks? Or did you think what you had to say was so valuable that you had to skip right to the comments section? I have to ask, because the very first benchmarks are the compilation tests, where SMT hurt 3 out of 5!! It utterly trashed Linux kernel compilation time!
          18.4% slower kernel compilation is observed with my 7950X when SMT is disabled

          That's why I suggested the same test but with end-user hardware like a Ryzen or even Threadripper CPU.

          You can't take benchmarks at face value when the hardware and software configuration tends to heavily impact the results.



          • #25
            Originally posted by ddriver View Post
            You are misunderstanding what I said. The scalar pipeline (ALU, FPU) is the one that's typically underutilized, as most actual "compute" on the CPU is done in SIMD. An auxiliary thread per core can saturate the scalar units, but when it comes to SIMD, the threads must contend.
            I didn't misunderstand that. The problem with that theory is that we lack supporting evidence. The threads in the benchmarks I cited won't be spending 100% of their time doing vector or floating-point operations, so it seems like there should be points where a sibling thread can harness some of those underutilized resources. However, what the evidence shows is that whatever benefits of this sort actually occur, they're no match for the downsides of SMT in those cases.

            In order for your theory to hold, there would first have to be the right mix of threads. Next, you'd want some mechanism to facilitate pairing up an int-heavy thread with a vector/fp-heavy thread, yet no such mechanism was ever created (until Intel's Thread Director). Finally, the benefit of doing so would have to be not only clear, but also great enough to justify all the trouble, complexity, and extra die area.

            Originally posted by ddriver View Post
            Thus "int" or scalar / alu heavy tests show great improvement from SMT, and "fp" or vector heavy tests show close to no improvement.
            How does this align with what you said before? You said the goal was to have another thread utilize the integer ALUs while one thread is tying up the SIMD units. So why should it be that pairs of int-only threads, with no SIMD-heavy partner in sight, show quite a lot of benefit?

            Also, I'm at a loss that you seem unconcerned with filling stalls due to cache misses. If a thread has to go to DRAM, that can take several hundred or even a couple thousand cycles (depending on how full the request queues are)! Having another thread around to fill some of those idle cycles seems like an obvious win. Then there's the matter of ever-wider backends stuck behind relatively narrow decoders. In branchy code, the uop cache might not do a lot of good, which is yet another obvious case where it seems like SMT should help.
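
            To illustrate the kind of code I mean, here's a sketch of a latency-bound pointer chase. Every load depends on the previous one, so while a miss is outstanding the core's back end mostly sits idle, and a sibling SMT thread running anything independent can soak up those cycles. The array size and iteration count are arbitrary; Sattolo's shuffle just guarantees one big cycle so the walk can't collapse into a short, cache-resident loop.

            Code:
                #include <cstdint>
                #include <cstdio>
                #include <numeric>
                #include <random>
                #include <utility>
                #include <vector>

                int main() {
                    // ~512 MiB of 8-byte slots -- far larger than any L3 cache.
                    constexpr size_t N = size_t{1} << 26;
                    std::vector<uint64_t> next(N);
                    std::iota(next.begin(), next.end(), uint64_t{0});

                    // Sattolo's shuffle: produces a single N-element cycle.
                    std::mt19937_64 rng{42};
                    for (size_t k = N - 1; k > 0; k--) {
                        std::uniform_int_distribution<size_t> pick(0, k - 1);
                        std::swap(next[k], next[pick(rng)]);
                    }

                    // Serialized chase: each load's address depends on the last
                    // load's value, so only one DRAM miss is in flight at a time.
                    uint64_t i = 0;
                    for (size_t step = 0; step < (size_t{1} << 24); step++)
                        i = next[i];
                    std::printf("%llu\n", (unsigned long long)i);  // keep the loop alive
                }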

            I don't know how or why you latched onto SIMD as the main rationale, but if you've got any sources or evidence, let's see them.

            Originally posted by ddriver View Post
            How and why in the world the tech "industry" decided to refer to scalar as "int" and to vector as "fp" is beyond me.
            That's because the vector arithmetic & FP pipelines are usually the same thing.

            Originally posted by ddriver View Post
            The CPU can do fp in scalar code, and with far greater precision than SIMD,
            x86 CPUs have legacy support for x87 arithmetic, which supports up to 80-bit precision, but I don't usually see microarchitecture diagrams that specifically split out that logic. In some cases, I think it's probably implemented via some of the fp ports.
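
            You can see that extra precision from plain C++, at least on Linux/x86-64 where GCC and Clang map long double to the x87 80-bit format (a 64-bit mantissa vs. 53 bits for a SIMD-native double); other platforms map long double differently, so treat this as illustrative:

            Code:
                #include <cfloat>
                #include <cstdio>

                int main() {
                    // x87 extended precision: 64 mantissa bits vs. 53 for double.
                    std::printf("double mantissa bits:      %d\n", DBL_MANT_DIG);
                    std::printf("long double mantissa bits: %d\n", LDBL_MANT_DIG);  // 64 on x86 Linux
                }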

            Here's a slide on the Lion Cove P-cores, from the Lunar Lake announcement:

            See the x87/MMX block, down at the bottom? It clearly implies that x87 is implemented using the vector/fp register file and scheduler.



            • #26
              Originally posted by Kjell View Post
              18.4% slower kernel compilation is observed with my 7950X when SMT is disabled
              For what it's worth, that's Zen 4, and the CPU tested in this article is based on Zen 5. So, apart from it being EPYC, that's another big difference.

              As I mentioned before, Zen 5 also has a weird (and, I think, somewhat unique) "split decoder" feature: in SMT mode, it devotes one 4-wide decode cluster to each thread.

