AMD Launches EPYC 9004 "Genoa" Processors - Up To 96 Cores, AVX-512, Incredible Performance


  • coder
    replied
    Originally posted by brucethemoose View Post
    I wouldn't call ARM, especially ARMv9, "RISC" in comparison to x86 anymore.
    Well, it still has fixed-size instruction words and is a register-to-register ISA, both of which remain true to the classic RISC orthodoxy.

    I think we haven't seen a "pure" RISC CPU in a while, but there's no doubt AArch64 is a more RISCy ISA than x86-64.
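
    To make the register-to-register point concrete, here's a tiny sketch. The comments describe what mainstream compilers typically emit at -O2 (compile with -S, or paste it into Compiler Explorer, to check your own toolchain); treat it as illustrative, not a literal disassembly.

        /* load_store.c -- the same source line compiles quite differently on
         * a register-memory ISA (x86-64) vs. a load/store ISA (AArch64). */
        long accumulate(const long *p, long x)
        {
            /* x86-64: the load is typically folded into the add as a memory
             *         operand (one add instruction that reads *p directly).
             * AArch64: no memory-operand arithmetic, so the compiler emits a
             *          separate ldr followed by a register-to-register add. */
            x += *p;
            return x;
        }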



  • brucethemoose
    replied
    Originally posted by josmat View Post

    Couldn't it be because RISC architectures like ARM don't benefit as much as x86 from SMT?
    I wouldn't call ARM, especially ARMv9, "RISC" in comparison to x86 anymore.


    Anyway, Broadcom touted some significant benefits with 4-way SMT in their ARM server processors, but I suspect they canned them due to lack of interest rather than lack of performance. With Apple and such, I think it's a design choice driven by a number of factors more than a fundamental ISA issue.



  • torsionbar28
    replied
    Originally posted by anarki2 View Post

    Yes, Dell can keep sitting on their @rses while people switch to self-built workstations or other brands. We waited for them to finally release Ryzen workstations, then ended up building them ourselves from off-the-shelf parts.

    They have ridiculously priced Alienware gimmicks with bundled GPUs. No thanks.
    It is widely known that Intel engages in anti-competitive practices, like giving financial incentives to large OEMs to stay exclusively with Intel in their business, mainstream, and high-end segments and to use AMD chips only in the very lowest-tier $299 crap PCs. This gives consumers the false perception that Intel is the "premium" choice, while AMD is the "economy" choice.

    Self-built is not an option for any medium or large organization. And most consumers lack the skill or desire to build their own. DIY builds are an enthusiast niche, at best, not a solution to Intel's anti-competitive practices.
    Last edited by torsionbar28; 14 November 2022, 01:26 PM.



  • coder
    replied
    Originally posted by josmat View Post
    Couldn't it be because RISC architectures like ARM don't benefit as much as x86 from SMT?
    Partly. They have an easier time scaling their front-end. However, making your architecture wider is a game of diminishing returns, because software only has so much Instruction-Level Parallelism (ILP) to be extracted. And speculative execution risks wasting energy executing branches not taken. Furthermore, the structures used to schedule and track instructions in flight tend to scale worse than linearly. This math explains why GPUs get so much more compute performance per watt and per mm² of silicon. (There's a small sketch of the ILP point at the bottom of this post.)

    However, the wider a CPU micro-architecture, the bigger the payoff you tend to get from SMT. So, in a way, it gives you a better return on investment from making your architecture wider. Furthermore, the more SMT threads you have, the better pipeline utilization you can get for a given reorder buffer size. At the extreme, you end up with a GPU-like ~12-way SMT in-order core. Yet SMT isn't a bottomless well: each SMT thread requires additional state, and software never scales linearly with respect to the number of threads. So, if Intel and AMD have found SMT-2 adequate to get good pipeline utilization, that could explain why they haven't gone further.

    One interesting thing that's happened in the past few years is that instruction reorder buffers have nearly closed the gap vs. (best-case) DRAM access latency. That doesn't mean you can find enough useful work to do while waiting for an L3 cache miss, but it's pretty impressive that it's even theoretically possible.
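
    Here's the small ILP sketch I mentioned. Both functions do the same number of floating-point additions, but the first is one long dependency chain while the second exposes four independent chains that a wide out-of-order core can keep in flight at once. Numbers will vary by core and compiler, so take it as a sketch rather than a proper benchmark.

        /* ilp_demo.c -- one dependency chain vs. four independent chains.
         * Build: gcc -O2 ilp_demo.c -o ilp_demo */
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define N     (1 << 15)   /* 32K doubles: small enough to stay cached */
        #define REPS  20000

        /* Every add waits for the previous one: bound by add latency. */
        static double sum_chain(const double *a)
        {
            double s = 0.0;
            for (int i = 0; i < N; i++)
                s += a[i];
            return s;
        }

        /* Four independent accumulators: the core can overlap several adds. */
        static double sum_split(const double *a)
        {
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            for (int i = 0; i < N; i += 4) {
                s0 += a[i];
                s1 += a[i + 1];
                s2 += a[i + 2];
                s3 += a[i + 3];
            }
            return (s0 + s1) + (s2 + s3);
        }

        int main(void)
        {
            double *a = malloc(N * sizeof *a);
            for (int i = 0; i < N; i++)
                a[i] = 1.0;

            clock_t t0 = clock();
            double r1 = 0.0;
            for (int r = 0; r < REPS; r++) {
                r1 += sum_chain(a);
                a[0] += 1e-9;   /* tiny write so the compiler can't hoist the call */
            }
            clock_t t1 = clock();

            double r2 = 0.0;
            for (int r = 0; r < REPS; r++) {
                r2 += sum_split(a);
                a[0] += 1e-9;
            }
            clock_t t2 = clock();

            printf("chain: %.2fs   split: %.2fs   (sums: %.0f %.0f)\n",
                   (double)(t1 - t0) / CLOCKS_PER_SEC,
                   (double)(t2 - t1) / CLOCKS_PER_SEC, r1, r2);
            free(a);
            return 0;
        }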



  • josmat
    replied
    Originally posted by brucethemoose View Post
    That being said, you do see ARM, Apple and such avoiding SMT, as it really is unnecessary in those kinds of workloads.
    Couldn't it be because RISC architectures like ARM don't benefit as much as x86 from SMT?



  • coder
    replied
    Originally posted by brucethemoose View Post
    I wonder if asymmetric SMT would be practical, where you have one "main" thread in a core as usual, and one "background" thread that strictly sucks up idle resources. It would have its own tiny L1, not put anything into L2/L3, and always yield other resources so that it doesn't reduce the main thread's performance at all. And scheduling would basically be the same as big.LITTLE or Alder Lake.
    It could be interesting for the OS scheduler to have some control over SMT priority. I think it couldn't be absolute priority, though. The OS would need some assurance that the low-priority thread would always make some progress, since it might hold a lock on some shared resource needed by other threads.

    But, I think I see where you're coming from. I guess you want a thread running on a P-core to have roughly the same performance, whether or not it's sharing the core with a second thread. That way, the core can be more fully utilized while still offering a performance advantage over threads running on an E-core. Nice idea!
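
    There's no hardware knob for that today, as far as I know, but the nearest software analogue is probably Linux's SCHED_IDLE policy, which only schedules a thread when the CPU would otherwise sit idle. A minimal sketch, assuming Linux + glibc (the background_work() loop is just a placeholder for real low-priority work):

        /* idle_thread_demo.c -- a "background" thread that only soaks up
         * spare CPU time.  Build: gcc -O2 -pthread idle_thread_demo.c */
        #define _GNU_SOURCE
        #include <pthread.h>
        #include <sched.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        static volatile unsigned long ticks;   /* progress counter for the demo */

        static void *background_work(void *arg)
        {
            (void)arg;

            /* Demote this thread to SCHED_IDLE (priority must be 0 for this
             * policy).  Per the caveat above, the kernel still gives SCHED_IDLE
             * threads a trickle of CPU time, so a lock holder isn't starved
             * forever. */
            struct sched_param sp = { .sched_priority = 0 };
            int err = pthread_setschedparam(pthread_self(), SCHED_IDLE, &sp);
            if (err != 0)
                fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));

            for (;;)
                ticks++;   /* placeholder for real low-priority work */
        }

        int main(void)
        {
            pthread_t bg;
            pthread_create(&bg, NULL, background_work, NULL);

            /* The "main" work keeps the default SCHED_OTHER policy and is
             * barely slowed by the spinning background thread. */
            for (int i = 0; i < 5; i++) {
                sleep(1);
                printf("background ticks after %d s: %lu\n", i + 1, ticks);
            }
            return 0;
        }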



  • brucethemoose
    replied
    Originally posted by coder View Post
    It seems like it should have a slightly negative impact on perf/W, in some cases. For Apple, excluding it is a no-brainer, because they seem happy to spend $ on dies with greater silicon area, as long as doing so can scale performance without hurting efficiency.
    Yeah, I have heard this too. Task energy is everything in mobile, whereas absolute performance of the silicon slab is more the goal on desktop/server.


    I wonder if asymmetric SMT would be practical, where you have one "main" thread in a core as usual, and one "background" thread that strictly sucks up idle resources. It would have its own tiny L1, not put anything into L2/L3, and always yield other resources so that it doesn't reduce the main thread's performance at all. And scheduling would basically be the same as big.LITTLE or Alder Lake.
    Last edited by brucethemoose; 11 November 2022, 08:29 PM.



  • coder
    replied
    Originally posted by brucethemoose View Post
    That being said, you do see ARM, Apple and such avoiding SMT, as it really is unnecessary in those kinds of workloads.
    It seems like it should have a slightly negative impact on perf/W, in some cases. For Apple, excluding it is a no-brainer, because they seem happy to spend $ on dies with greater silicon area, as long as doing so can scale performance without hurting efficiency.



  • Rabiator
    replied
    Originally posted by anarki2 View Post

    Yes, Dell can keep sitting on their @rses while people switch to self-built workstations or other brands. We waited for them to finally release Ryzen workstations, then ended up building them ourselves from off-the-shelf parts.

    They have ridiculously priced Alienware gimmicks with bundled GPUs. No thanks.
    Getting a little off topic, but Gamers Nexus has some Alienware reviews that are not very favorable, to put it mildly. You are not the only one who is not impressed.



  • brucethemoose
    replied
    Originally posted by Espionage724

    Consumer chips don't need 96 cores; maybe 16 on average, 8 at minimum, and 32-64 on higher-end CPUs for the few games that might be able to do something with that. If 96 cores can go into that EPYC chip, surely 16 can work fine in consumer stuff at a minimum. Actual workstations and professional app users can deal with SMT and higher thread counts, where all that might be a benefit.

    I'm of the opinion that SMT/HT and even CCX/CCD are shortcuts and marketing gimmicks that cause nothing but scheduler difficulties and issues across different platforms. I have a 2700X now, and I'd rather have 16 real cores than 8 cores split into two groups of 4 that communicate over a slower path, causing performance issues for schedulers and drivers that aren't topology-aware. Sure, I could pin applications to certain cores, make sure IRQs for certain devices only run on certain cores, and deal with the manual set-up of all of that, but why does this complexity need to exist for such a small number of cores?

    Look at this nonsense: https://www.neowin.net/news/windows-...zen-7000-cpus/

    Crossing the CCD and/or having SMT on lowers performance even on 7000-series Ryzen CPUs.
    SMT doesn't take up much die space according to Intel/AMD, and it does help certain workloads significantly (with RAM-heavy database processing being a prime example, where the individual cores are largely twiddling their thumbs waiting for something from memory).

    Leaving it in the design doesn't really negatively impact users that much, especially when you can turn it off or alter core affinity yourself (there's a quick affinity sketch at the bottom of this post).

    ... And I don't really understand what that has to do with CCDs.


    That being said, you do see ARM, Apple and such avoiding SMT, as it really is unnecessary in those kinds of workloads. I wouldn't be surprised if Ampere or Nuvia introduce it, though, as we already saw Broadcom do with their SMT4 server processors.
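
    For anyone who does want to pin things manually, here's a minimal sketch of setting CPU affinity from code on Linux (roughly what taskset -c does from the shell). The CPU numbers are an assumption for illustration; check /sys/devices/system/cpu/cpu*/topology/ to see how cores, SMT siblings, and CCDs are actually numbered on your machine.

        /* pin_demo.c -- restrict the calling thread (and any threads it
         * spawns later) to a chosen set of logical CPUs.  Linux-specific.
         * Build: gcc -O2 pin_demo.c -o pin_demo */
        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            cpu_set_t set;
            CPU_ZERO(&set);

            /* Example only: allow logical CPUs 0-7, e.g. to keep a game on a
             * single CCD.  Verify the numbering on your own system first. */
            for (int cpu = 0; cpu < 8; cpu++)
                CPU_SET(cpu, &set);

            /* pid 0 = the calling thread; threads created afterwards inherit
             * the mask. */
            if (sched_setaffinity(0, sizeof set, &set) != 0) {
                perror("sched_setaffinity");
                return EXIT_FAILURE;
            }

            printf("pinned to logical CPUs 0-7\n");
            /* ... launch or exec the real workload here ... */
            return EXIT_SUCCESS;
        }

    From the shell, taskset -c 0-7 ./app does the same thing without touching the code.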

