AMD Ryzen 7 5800X Linux Performance


  • duby229
    replied
    Originally posted by atomsymbol

    Indeed. It has been that way since the i486 (released in April 1989), which is exactly the reason why the world wasn't taken over by RISC CPUs during the 1990s - I hope you are aware of this.
    Wrong, the 486 was the first x86 architecture that was pipelined, but the entire pipeline was CISC.

    486 was pipelined CISC (Scalar architecture)
    Pentium and i586 were pipelined CISC (Superscalar architecture)
    K6 and Pentium Pro were OoO, pipelined RISC (OoO Superscalar architecture)

    Pentium and Pentium Pro are two very different things

    (What I mean by "pipeline" is a Scalar architecture, and by "superpipeline" a Superscalar architecture. I just grew up using the term pipeline in place of scalar, because that's exactly what it means.)

    (In fact the world -was- taken over by RISC in the 90s; that's what K6 and Pentium Pro did.)

    EDIT: See here for clarification on the definition of Superpipeline
    https://www.phoronix.com/forums/foru...41#post1219241
    Last edited by duby229; 13 November 2020, 06:44 PM.

    Leave a comment:


  • duby229
    replied
    Originally posted by atomsymbol

    Please make sure to take a look at the following article:

    The i486 CPU: Executing Instructions in one Clock Cycle

    Table 3 "Clock cycle counts for basic instructions" on page 34 might be the most interesting to you.

    It was the NexGen K6 and then the Pentium Pro that were the first OoO Superpipelines, and they were the first to decode x86 instructions into RISC-like uops.
    Last edited by duby229; 13 November 2020, 04:53 PM.

    Leave a comment:


  • bridgman
    replied
    Originally posted by artivision View Post
    Nope. In the ARM camp they just add units without extra circuitry. They base their tech on the fact that a single game thread has a lot of individual actions, like the graphics engine and the game engine, which in turn have individual actions like character movement, damage calculation, sound synchronization and many others. They can simply compress 16-32 in-order units and still fill them with just good software calculations, destroying your solution. That will not scale everywhere; in office workloads it probably will not pass the 4-unit barrier, but when tank power is needed it will be there, because tank threads have a lot of independent information anyway, no need for heavy analysis.
    Sorry, I didn't understand your post. Are you talking about building a chip with 16-32 separate small cores (executing 16-32 threads) or a single core with 16-32 execution units working off a single thread? You mentioned "in-order units", which implies 16-32 separate cores and threads, but earlier in the post you were talking about a single thread of a game.

    The closest interpretation I can come up with is something like the early superscalar microprocessors (dual pipeline but in-order) where two adjacent instructions could sometimes be executed in a single clock, but it seems really unlikely that approach could scale to the level you are describing.
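
    For illustration, here is a minimal sketch of that pairing rule, assuming a toy two-wide in-order issue model (hypothetical Python, not from the original post): an adjacent pair issues in one cycle only when the second instruction neither reads nor writes the first one's destination register.

    # Toy in-order dual-issue model (illustrative only): adjacent instructions
    # pair up in one clock when there is no RAW/WAW hazard between them.
    from typing import NamedTuple

    class Instr(NamedTuple):
        dest: str
        srcs: tuple

    def can_pair(a: Instr, b: Instr) -> bool:
        # b must not read a's result (RAW) or write the same register (WAW)
        return a.dest not in b.srcs and a.dest != b.dest

    def cycles(prog: list) -> int:
        count, i = 0, 0
        while i < len(prog):
            if i + 1 < len(prog) and can_pair(prog[i], prog[i + 1]):
                i += 2  # both instructions issue this cycle
            else:
                i += 1  # only one issues
            count += 1
        return count

    independent = [Instr("r1", ("r2",)), Instr("r3", ("r4",)),
                   Instr("r5", ("r6",)), Instr("r7", ("r8",))]
    chain = [Instr("r1", ("r2",)), Instr("r3", ("r1",)),
             Instr("r4", ("r3",)), Instr("r5", ("r4",))]
    print(cycles(independent))  # 2 cycles: every adjacent pair dual-issues
    print(cycles(chain))        # 4 cycles: each result feeds the next instruction

    A dependent chain defeats the pairing entirely, which is why this approach tops out quickly without out-of-order machinery.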

    I believe the ARM programming model still requires sequential consistency within each core so all the out-of-order circuitry is still required - the relaxed memory model only affects the timing of when the results of one core become visible to other cores.

    AFAIK the only way to use 16-32 in-order units and keep them busy would be a VLIW approach like PA-RISC/Itanium, where the compiler identifies groups of instructions that can be executed together and packs them into instruction bundles at the ISA level. I don't think that is what you are suggesting though, is it?
    Last edited by bridgman; 13 November 2020, 05:32 PM.

    Leave a comment:


  • Guest
    Guest replied
    Originally posted by White Wolf View Post
    RIP Intel, and AMD is talking about Zen 4 already:
    https://www.guru3d.com/news-story/am...4-already.html

    really amazing how AMD came from Zen to Zen 3 and beat Intel, like the cat ate the dog.
    Willow Cove does seem like a good improvement; if they brought that, 10 nm, and SuperFin over to the desktop, they would most likely beat Zen 3. Of course, Zen 4 is a different matter.

    Leave a comment:


  • r1348
    replied
    Originally posted by birdie View Post
    In terms of performance per dollar and thermals the Ryzen 5800X is the worst CPU of this lineup.

    You can pay $100 (22%) more and get 50% more cores with the same thermal package, i.e. the 5900X and it runs significantly cooler too.

    Or you can pay $150 (33%) less and lose just 25% of cores, i.e. the 5600X.
    That is debatable; it really depends on what your workload is. The 5800X still has a single L3 cache, so single-thread performance is good (gaming), yet it also appeals to those whose workloads enjoy higher parallelism. I think it's actually the most "generic" CPU of the 5000 line so far, positioning itself in the middle between two more specialized markets, gaming and "creator". If AMD manages to push out a 5700X with 8C/16T but a 65W TDP, it'll likely be my next CPU.
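
    For concreteness, a quick sketch of the price/core arithmetic behind those percentages, assuming the US launch MSRPs of $299 / $449 / $549 (illustrative Python, not from the original post):

    # Back-of-envelope cost per core for the Zen 3 launch lineup at US MSRPs.
    lineup = {
        "5600X": {"price": 299, "cores": 6},
        "5800X": {"price": 449, "cores": 8},
        "5900X": {"price": 549, "cores": 12},
    }

    base = lineup["5800X"]
    for name, cpu in lineup.items():
        price_delta = (cpu["price"] - base["price"]) / base["price"] * 100
        core_delta = (cpu["cores"] - base["cores"]) / base["cores"] * 100
        print(f'{name}: ${cpu["price"] / cpu["cores"]:.2f}/core, '
              f'{price_delta:+.0f}% price, {core_delta:+.0f}% cores vs 5800X')

    # 5600X: $49.83/core, -33% price, -25% cores vs 5800X
    # 5800X: $56.12/core, +0% price, +0% cores vs 5800X
    # 5900X: $45.75/core, +22% price, +50% cores vs 5800X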

    Leave a comment:


  • oiaohm
    replied
    Originally posted by arQon View Post
    Not quite: it's just that as lower-margin parts they tend to trail the big earners by at least several months. Zen2 for example has a "Ryzen 3 3100" (which actually exists), and technically also has a "Ryzen 3 3300X" (which doesn't, but was a spectacular chip for the 5 seconds it existed, since all 4 cores were on a single CCX and it thus had better core-to-core latency than any other part).

    I would expect a "Ryzen 3 5400X" or somesuch to appear in late Spring 2021, once the manufacturing pressure has eased off a little.
    The merged L3 and single CCX alter the chiplet yield math a lot.

    Ryzen 3 made sense with Zen 2; there is a chance there will not be enough defective chiplets to justify making any Ryzen 3 this time around. Zen 3 is revised production at the same nm, so yields are higher just from that alone.

    The 4-core CCX pattern also meant that a Ryzen 5 in Zen 2 and before was 1 dead core per CCX. But with Zen 3 putting 8 cores onto a unified L3, a Ryzen 5 now means 2 dead cores out of 8. This is a big difference: some old Ryzen 3300X chips had 4 perfect cores in one CCX and really only 2 dead cores in the other CCX; the same defect pattern with Zen 3 becomes a Ryzen 5.

    Like it or not, there is a much higher yield of chiplets suitable for making a 6-core Ryzen 5 and a 12-core Ryzen 9 with Zen 3, and this is going to come at the price of consuming the chiplets that historically became Ryzen 3.

    It's possible that Zen 3 will not get a Ryzen 3 and we will not see one again until Zen 4.

    The thing you have missed in the Ryzen/Zen model is that the lower-margin parts are the defective parts not suitable for the higher-end models. A second run at the same nm normally gives increased yield, and the CCX change means chiplets that would have gone into the reject bin to become Ryzen 3 no longer end up there. Finally, you do want at least decent volume in that reject bin before you release a product. Lower-margin parts in the class of Ryzen 3 come after the big earners because the stockpile of rejects has to get high enough to justify making them. Remember, we are talking about reconfiguring a production line, and that is not costless; if there are not enough defective chiplets for Ryzen 3 sales to cover the cost of setting up that production line, it will not be happening.

    Time will tell whether we see a Ryzen 3 or not. The odds are fairly high that there just will not be the defective chiplets to make a Zen 3 Ryzen 3.
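
    To make the binning argument concrete, here is a minimal Monte Carlo sketch (illustrative Python with an assumed per-core defect rate, not oiaohm's numbers) of how the same random core defects bin under Zen 2's two 4-core CCXs versus Zen 3's single 8-core CCX:

    # Each of a chiplet's 8 cores is independently defective with some small
    # probability. Zen 2 bins by CCX (a part needs good cores in each 4-core
    # CCX); Zen 3 bins purely by the total good-core count on the chiplet.
    import random

    DEFECT_RATE = 0.05   # assumed per-core defect probability (illustrative)
    TRIALS = 100_000

    def bin_zen2(cores):
        good0, good1 = sum(cores[:4]), sum(cores[4:])
        if good0 == 4 and good1 == 4:
            return "8-core (Ryzen 7)"
        if good0 >= 3 and good1 >= 3:
            return "6-core (Ryzen 5, 3+3)"
        if good0 == 4 or good1 == 4:
            return "4-core single CCX (3300X-style Ryzen 3)"
        return "lower bin / reject"

    def bin_zen3(cores):
        good = sum(cores)
        if good == 8:
            return "8-core (Ryzen 7)"
        if good >= 6:
            return "6-core (Ryzen 5)"
        return "lower bin / reject"

    counts2, counts3 = {}, {}
    for _ in range(TRIALS):
        cores = [random.random() > DEFECT_RATE for _ in range(8)]
        for bins, rule in ((counts2, bin_zen2), (counts3, bin_zen3)):
            label = rule(cores)
            bins[label] = bins.get(label, 0) + 1

    for name, bins in (("Zen 2 layout:", counts2), ("Zen 3 layout:", counts3)):
        print(name)
        for label, n in sorted(bins.items(), key=lambda kv: -kv[1]):
            print(f"  {label}: {n / TRIALS:.1%}")

    The 4-good-plus-2-good pattern that made a 3300X under the Zen 2 rules bins as a plain 6-core Ryzen 5 under the Zen 3 rules, which is exactly why the old Ryzen 3 bin dries up.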

    Leave a comment:


  • arQon
    replied
    Originally posted by oiaohm View Post
    Basically 9, 7, 5 are not generations; they are targeted market segments. There used to be a Ryzen 3 as well; those were 4-core parts, and that segment has gone out of the modern AMD lines.
    Not quite: it's just that as lower-margin parts they tend to trail the big earners by at least several months. Zen2 for example has a "Ryzen 3 3100" (which actually exists), and technically also has a "Ryzen 3 3300X" (which doesn't, but was a spectacular chip for the 5 seconds it existed, since all 4 cores were on a single CCX and it thus had better core-to-core latency than any other part).

    I would expect a "Ryzen 3 5400X" or somesuch to appear in late Spring 2021, once the manufacturing pressure has eased off a little.

    Leave a comment:


  • artivision
    replied
    Originally posted by bridgman View Post

    I have to gently disagree here - it's not the most parallelism that can be extracted, but it is definitely the point of diminishing returns for typical code and typical compilers, whether the ISA is x86 or ARM.

    It's only the combination of tiny fab processes and a new arms race between CPU vendors that is prompting recent increases in both width (# ALUs, AGUs, load/store paths, etc.) and depth (reorder buffer, physical reg file, load/store queue depth, prefetcher complexity, etc.). There is additional parallelism to be exploited, but it takes a big increase in width and depth to get a fairly small increase in performance, and that just hasn't been worth doing until recently.

    The other relatively recent change is heavy use of micro-op caches, which has largely removed what used to be a bottleneck at the instruction decoder stage. Fixed-length ISAs used to have an advantage here, but even Zen2 has an 8-wide path from the micro-op cache into the execution pipeline.
    Nope. In the ARM camp they just add units without extra circuitry. They base their tech on the fact that a single game thread has a lot of individual actions, like the graphics engine and the game engine, which in turn have individual actions like character movement, damage calculation, sound synchronization and many others. They can simply compress 16-32 in-order units and still fill them with just good software calculations, destroying your solution. That will not scale everywhere; in office workloads it probably will not pass the 4-unit barrier, but when tank power is needed it will be there, because tank threads have a lot of independent information anyway, no need for heavy analysis.

    Leave a comment:


  • oiaohm
    replied
    Originally posted by AdrianBc View Post
    Like I have already said, the high power efficiency of the Apple cores could have been used to design CPUs easily beating the Intel and AMD CPUs. Apple did not attempt to do such a thing because, for maximum profit, they just needed performance better than their old models, which were already slower than most of the competition. Their customers are captive to the software environment, so they cannot make buying decisions based mainly on performance and cost.
    Not quite true. Remember the M1 chip is being produced in the settle-in stage of the TSMC 5 nm process. If you attempt to push a chip out in the early stage of a particular nm production you will get insanely low yields, like 1 in 1000. AMD has bought space at TSMC for 5 nm after the early settle-in stage is done.

    Originally posted by AdrianBc View Post
    The great advantage of the Apple design is that achieving the same performance at a lower clock frequency results in much lower power consumption, i.e. about half of the core power, compared to Intel and AMD.
    The problem is, you have to remember this is not an apples-to-apples comparison. AMD is on 7 nm and Apple is on 5 nm. If you are not increasing speed or changing the design, the move from 7 nm to 5 nm is a 30% power reduction, and TSMC documents this. That's quite a bit to start off with.

    Driving PCIe 3 and 4 is not cheap. The good part is we can see how much with AMD, because the IO die in the AMD CPU is essentially identical to the X570 chipset chip.

    So it is roughly 5 watts to drive PCIe 3 and 10 watts to drive PCIe 4 with the limited lanes the desktop versions of the AMD processors have. Of course this goes up in your server chips as your PCIe lane count increases. Same with driving RAM: being able to drive 128 GB of RAM on external modules does not come without a price tag.

    Making an SoC with everything integrated is very power effective, because you get to lose something like 20 watts off the start line. AMD does have a 15-watt line of chips with external memory; yes, that would be a 10-watt part or less with the memory placed inside the SoC.

    Then you take off 1/3 of the power to remove the nm advantage, and then you halve it to account for the AMD part having twice the cores for heavy multi-threading.

    So you start with an AMD 65-watt part.
    65 - 20 = 45 W after removing the external PCIe and RAM drive hardware.
    45 × 2/3 = 30 W after removing the nm difference.
    30 / 2 = 15 W after scaling back to the same number of cores for heavy multi-threading.
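
    The same back-of-envelope normalization as code (a sketch of the arithmetic above; every factor is a rough estimate from this post, not a measurement):

    # Rough normalization of a 65 W AMD desktop part toward an M1-like config.
    amd_tdp = 65.0            # 65 W desktop part
    io_overhead = 20.0        # external PCIe + RAM drive hardware (~20 W estimate)
    node_scaling = 2.0 / 3.0  # TSMC's quoted ~30% power drop from 7 nm to 5 nm
    core_ratio = 0.5          # halve to match the M1's core count for MT loads

    normalized = (amd_tdp - io_overhead) * node_scaling * core_ratio
    print(f"{normalized:.0f} W")  # (65 - 20) * 2/3 / 2 = 15 W, vs ~10 W for the M1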

    The Apple M1 at 10 watts is then really only winning by 5 watts at best once you remove the differences and attempt a true apples-to-apples comparison. That is small enough to disappear in generational design revisions of x86 by AMD.

    The difference is not a big advantage. It's really easy to miss how much of the Apple M1's light power budget comes from removing expandability features and from the nm of the production process.

    The M1 chip is a lot better than Intel's offerings, mostly because Intel is stuck on a really old production node, but it is not a game-changing difference vs AMD.

    Remember AMD has moved to 8 cores per CCX; this does help a lot with multi-threaded workloads.

    Leave a comment:


  • AdrianBc
    replied
    Originally posted by uid313 View Post
    The Ryzen 5800X based on the Zen 3 architecture is great, and the new Zen 4 architecture is going to be great too.

    But both Intel and AMD are going to be outperformed by Apple with their M1 processor. Its single-core performance beats anything; its IPC is so high.

    The Apple M1 has an IPC a little over 3/2 times that of Intel (Tiger Lake) and AMD (Zen 3).

    However, the Apple M1 is able to run only at about 2/3 of the clock frequency of Intel and AMD (i.e. 3.2 GHz).

    Therefore the single-thread performance is about the same for all 3 processors.

    The Apple M1 is slightly faster in single thread, but the advantage is really negligible (3% vs. Zen 3, 6% to 8% vs. Tiger Lake).
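
    As a worked check of those ratios (my arithmetic; the ~4.8 GHz x86 boost clock is an assumption, not stated in the post):

    # Single-thread performance scales roughly as IPC x clock frequency.
    x86_ipc, x86_clock = 1.0, 4.8   # normalized IPC; assumed ~4.8 GHz boost clock
    m1_ipc, m1_clock = 1.55, 3.2    # "a little over 3/2" the IPC; 3.2 GHz

    ratio = (m1_ipc * m1_clock) / (x86_ipc * x86_clock)
    print(f"M1 / x86 single-thread ratio: {ratio:.2f}")  # ~1.03, i.e. about 3% faster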

    The great advantage of the Apple design is that achieving the same performance at a lower clock frequency results in much lower power consumption, i.e. about half of the core power, compared to Intel and AMD.

    Nevertheless, Apple has chosen not to exploit this possible advantage yet, by not attempting to design a processor with many cores.

    With their much lower power per core, a 128-core server CPU should have no problems with power dissipation, which is probably what the Apple engineers who left for the Nuvia company are trying to build.

    However, for now, the Apple M1 barely matches the multi-threaded performance of cheaper laptops with AMD Renoir and bad cooling, while AMD Renoir laptops with good cooling exceed the Apple M1 by about 10% in multi-threaded performance.

    While for casual users and gamers single-thread performance may be the most important, for professional or other demanding users the multi-threaded performance is what counts, because that is the real performance of a given chip. Whenever an application does not reach the multi-threaded level of performance, that means it was not yet optimized for maximum performance.


    Like I have already said, the high power efficiency of the Apple cores could have been used to design CPUs easily beating the Intel and AMD CPUs. Apple did not attempt to do such a thing because, for maximum profit, they just needed performance better than their old models, which were already slower than most of the competition. Their customers are captive to the software environment, so they cannot make buying decisions based mainly on performance and cost.

    Leave a comment:
