Intel Core i5 14600K & Intel Core i9 14900K Linux Benchmarks
-
Curious about the results if the CPUs were run within spec, as the peak power consumption of both chips was above their rated limits. Was the BIOS using out-of-spec settings?
-
Originally posted by Anux View PostFor the same reason ARM has one, or basically anything has a cache: to reduce bottlenecks and maximize utilization of the different stages.
In ARM's case, they've been removing them from the ARMv9 cores, now that v9 allows them to drop backward compatibility with AArch32. That simplifies their decoders to the point where they're cheap enough that the mOP cache no longer pulls its weight.
Originally posted by Anux View PostThe micro-op cache is an integral part of the decoder.
Originally posted by Anux View PostAs is every other stage of the CPU depending on the workload.
-
Sorry, this thread is broken for me; I can't really quote you.
then why have the op cache?
The backend width doesn't match the frontend. That's what tells us the decoders are a potential bottleneck.
Last edited by Anux; 07 November 2023, 05:01 AM.
-
Originally posted by Anux View PostJust comparing the decoder width also doesn't give any clues, because x86 can do more with one instruction (meaning it decodes to more micro-ops than an ARM instruction).
Originally posted by Anux View PostAlso those decoders are optimized to saturate a core,
Originally posted by Anux View Postjust increasing its width might not improve overall performance, and after all, a 6-wide decoder at 5 GHz is theoretically faster than an 8-wide one at 3 GHz.
Here's data collected from Zen 2, across a variety of workloads, showing that the op cache indeed provides performance benefits. As I mentioned before, I think the power data might be unreliable.
Keep in mind that the op cache doesn't totally alleviate decoder bottlenecks, so this still underestimates the extent to which the decoder can be one.
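For anyone who wants to eyeball this on their own machine: on Intel cores, Linux perf exposes counters for µops delivered from the µop cache (idq.dsb_uops) versus the legacy decoders (idq.mite_uops). A minimal sketch, assuming an Intel CPU and a perf build that knows those events (AMD uses different event names):
Code:
# Rough uop-cache coverage check via Linux perf (Intel-only event names).
# idq.dsb_uops  = uops delivered from the uop cache (DSB)
# idq.mite_uops = uops delivered from the legacy decode path (MITE)
import subprocess
import sys

def uop_cache_coverage(cmd):
    res = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", "idq.dsb_uops,idq.mite_uops", *cmd],
        capture_output=True, text=True,
    )
    counts = {}
    for line in res.stderr.splitlines():        # perf prints counters on stderr
        fields = line.split(",")
        if len(fields) > 2 and fields[0].isdigit():
            counts[fields[2]] = int(fields[0])
    dsb, mite = counts["idq.dsb_uops"], counts["idq.mite_uops"]
    return dsb / (dsb + mite)

if __name__ == "__main__":
    # e.g.  python3 uop_coverage.py ./mybench --args
    print(f"uop-cache coverage: {uop_cache_coverage(sys.argv[1:]):.1%}")
A workload that mostly runs out of the µop cache shows coverage near 100%; heavy MITE traffic is the tell that the legacy decoders are in play.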
-
I've often heard the argument here on Phoronix that ARM decoders are faster or more efficient, but it's hard to really prove that. Just comparing the decoder width also doesn't give any clues, because x86 can do more with one instruction (meaning it decodes to more micro-ops than an ARM instruction). Also, those decoders are sized to saturate a core; just increasing their width might not improve overall performance, and after all, a 6-wide decoder at 5 GHz is theoretically faster than an 8-wide one at 3 GHz.
So a 3 GHz ARM core needs a wider decoder to match the throughput of higher-clocked designs.
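The back-of-the-envelope arithmetic behind that claim (the widths and clocks are illustrative round numbers, and real cores rarely sustain their peak decode rate):
Code:
# Theoretical peak decode throughput = decode width x clock.
designs = {
    "6-wide @ 5.0 GHz": 6 * 5.0e9,
    "8-wide @ 3.0 GHz": 8 * 3.0e9,
}
for name, peak in designs.items():
    print(f"{name}: {peak / 1e9:.0f} G instructions/s")
# -> 30 G/s for the narrow/fast design vs. 24 G/s for the wide/slow one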
-
Originally posted by Jabberwocky View PostI based some of my views on this old paper: https://www.researchgate.net/publica...uction_Decoder
Originally posted by Jabberwocky View PostThe reason why I'm bringing it up is because of section "3.2 Microbenchmark Design". Do you think it's practical to write tests like that and monitor system power in order to deduce decoder efficiency?
One thing to note about their older articles analyzing Zen 2 and earlier: a more recent article mentions that AMD's RAPL values, which they were relying on for power data, seem to be merely estimates rather than direct measurements.
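As for whether monitoring system power is practical: on Linux you can read RAPL through the powercap sysfs interface without special tooling. A minimal sketch, assuming the machine exposes /sys/class/powercap/intel-rapl:0 (usually root-readable; and note the caveat above that AMD's numbers may be estimates):
Code:
# Average package power around a workload, read from the RAPL energy counter.
import time

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_energy_uj():
    with open(RAPL) as f:
        return int(f.read())

def avg_package_watts(workload):
    e0, t0 = read_energy_uj(), time.time()
    workload()
    e1, t1 = read_energy_uj(), time.time()
    # The counter wraps at max_energy_range_uj; ignored here for brevity.
    return (e1 - e0) / 1e6 / (t1 - t0)

if __name__ == "__main__":
    # Toy stand-in for a decoder microbenchmark like the paper's.
    watts = avg_package_watts(lambda: sum(range(50_000_000)))
    print(f"{watts:.1f} W")
To isolate the decoder the way that paper does, you'd run pairs of microbenchmarks that differ only in how hard they hit the decoders (e.g. a loop that fits the µop cache vs. one that doesn't) and difference the power readings.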
Originally posted by Jabberwocky View PostI'm just wondering how we can find more data on where the single-core performance / efficiency comes
Reviews across the internet show Alder Lake getting very competitive performance with very high power consumption.
I think they're missing a key point, however. When running a highly-threaded workload, the optimal strategy should be to clock all your cores so they're delivering the same energy-efficiency. Then, once the E-cores are maxed out, you continue increasing clock speeds on the P-cores. Since the efficiency of the overall chip is an average of its cores' efficiency, having those E-cores in the mix is still a win.
That crossover on x264 looks worse than it really is. The idea is that once the E-cores reach the crossover point, you don't increase their clock speeds until the P-cores get to 4.0 GHz. Then, you bump the E-cores up to 3.5 GHz. After that, increase P-cores to 4.2 GHz. Next, max the E-cores at 3.8 GHz. Beyond that, any remaining power budget goes to pushing the P-cores to 4.5 GHz and beyond.
If someone isn't thinking very hard, you can see how they'd look at the graph and conclude: "oh, once you hit that crossover point, you just switch over to P-cores and that's it". But that's wrong (in a multithreaded workload, at least), because it's an efficiency graph and the P-cores become very inefficient towards the top of their frequency envelope.
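That scheduling intuition can be written down as a greedy algorithm: at each step, give the next frequency bump to whichever cluster currently buys the most extra throughput per extra watt. The performance and power curves below are invented placeholders chosen only to have the right shape (power rising roughly with f³), not real Alder Lake data:
Code:
# Greedy frequency allocation under a shared package power budget.
def perf(f, width):            # toy throughput model: width x clock
    return width * f

def power(f, k):               # toy dynamic-power model, roughly f^3
    return k * f ** 3

def allocate(budget_w, clusters, step=0.1):
    freqs = {n: c[0] for n, c in clusters.items()}          # start at f_min
    used = sum(power(freqs[n], c[3]) for n, c in clusters.items())
    while True:
        best, best_ratio, best_dpow = None, 0.0, 0.0
        for n, (f_min, f_max, width, k) in clusters.items():
            f = freqs[n]
            if f + step > f_max:
                continue
            dperf = perf(f + step, width) - perf(f, width)
            dpow = power(f + step, k) - power(f, k)
            if used + dpow <= budget_w and dperf / dpow > best_ratio:
                best, best_ratio, best_dpow = n, dperf / dpow, dpow
        if best is None:                                     # budget exhausted
            return freqs
        freqs[best] += step
        used += best_dpow

# (f_min, f_max, aggregate width, power coefficient) -- all numbers invented.
clusters = {"P (8 cores)": (0.8, 5.5, 48, 0.9), "E (16 cores)": (0.8, 4.0, 64, 0.4)}
print(allocate(125.0, clusters))
With a small budget everything goes to the E-cores; raise the budget and the P-cores only start taking bumps once the E-cores' marginal perf/W is no longer the best deal, which is exactly the crossover behavior described above.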
-
Originally posted by coder View Post
You'll find lots of analysis and investigation of these sorts of questions on Chips & Cheese. For instance, they did a detailed performance comparison (i.e. not just benchmarks, but also analysis) on ARM Neoverse N1 vs. Zen 2, a few years ago:
Just recently, they posted analysis of ARM's X2 cores, via "Snapdragon 8+ Gen 1" (note that it's a phone SoC - not one of their laptop SoCs).
If performance is comparable and efficiency is way better, then it could definitely have implications for the laptop and server markets.
I based some of my views on this old paper: https://www.researchgate.net/publica...uction_Decoder
The reason why I'm bringing it up is because of section "3.2 Microbenchmark Design". Do you think it's practical to write tests like that and monitor system power in order to deduce decoder efficiency?
I'm just wondering how we can find more data on where the single-core performance / efficiency comes from in ARM, Intel and AMD's latest cores (without just assuming it's the decoder width).
-
Originally posted by Jabberwocky View PostThanks for making a technical argument.
I had not read about the 80376 before; I wasn't aware these were made and put into dumb terminals. Removing the 16-bit and 32-bit instructions is an extremely challenging task, and I would be surprised if it happened anytime soon. I sometimes compare x86 to JavaScript: we have all this legacy junk and these hacks because, if either had broken backwards compatibility, the industry would likely have moved to something else.
I've looked at some benchmarks for Apple's M2 (8+10, 20 billion transistors, TSMC 5 nm N5P) vs. the AMD 6800U (Zen 3+ / RDNA 2, 13 billion transistors, TSMC 6 nm FinFET) in an Asus Zenbook S13. The results don't match the hype, IMO.
Use these benchmark results with current tech instead: https://www.computerbase.de/2023-11/...eils-deutlich/
The multi-core results are not really interesting, because, surprise surprise, the chip with more cores gets the better result...
No, the single-core results are the interesting ones.
"
Geekbench v6 – Single-Core
• 100 % – Apple M3
• 99 % – Core i9-13900K
• 97 % – Snapdragon X Elite (80 W)
• 96 % – Apple M3 Max
• 90 % – Snapdragon X Elite (23 W)
• 89 % – Apple M2 Max
• 87 % – Apple M2
"
As you can see in this comparison, the Apple M3 chip is the fastest, set at 100%; the 13900K is then at 99%.
If the Intel 14900K is maybe 3-4% faster than the 13900K, that would put it in first place, maybe 2-3% faster than the Apple M3 chip.
But keep in mind that the 14900K has a TDP of 282 watts or something like that, while the Apple M3 needs much less than this.
"I've looked at some benchmarks for Apple's M2 (8+10, 20 billion transistors, TSMC 5 nm N5P) vs. the AMD 6800U (Zen 3+ / RDNA 2, 13 billion transistors, TSMC 6 nm FinFET) in an Asus Zenbook S13. The results don't match the hype, IMO."
Your comparison is only valid if you focus on multi-core performance, because in single-core performance the AMD 6800U has no chance whatsoever against an Apple M3, Core i9-13900K, Snapdragon X Elite or a 14900K.
You talk about Zen 3+ with RDNA 2, but even with Zen 4 + RDNA 3 you either get a low-performance APU or an iGPU+dGPU combination. And if we say AI workloads are the most important use case today, then a unified memory model of course beats the VRAM of a dGPU: even an AMD PRO W7900 has only 48 GB of VRAM, while with Apple's unified memory model you have up to 128 GB available for your AI workloads...
And AMD is not yet in the game with big APUs carrying huge amounts of unified memory...
Originally posted by Jabberwocky View PostIf the headlines read low idle and hours of battery lifetime, then I would say the hype is real. Tom's Hardware claiming it's better than a Threadripper 3990X, and Apple's Senior VP Johny Srouji claiming it's faster than a GeForce 3090, is just moronic. I digress, let's look at the benchmarks...
The M2 has a better GPU hands down, but on the CPU side it's more or less the same as the 6800U. In FL Studio WAV export, the M2 is 12% slower than the 6800U (15 W) and 16% slower than the 6800U (25 W). In Blender, the M2 is only 3.2% faster than the 6800U (15 W) and is 7.1% slower than the 6800U (25 W). The M2 does well in 7-Zip compression but fails miserably at 7-Zip decompression, being 32.4% slower than the 6800U (15 W). Also, nobody is testing AVX-512 and comparing that to Apple. The M2 is also much more expensive. One thing is for certain: Intel's 1260P is struggling all round in terms of performance per watt. These stats were obtained from Hardware Unboxed's benchmarks. I'm still waiting for proper Zen 4 tests vs. Apple M2. Here's a generic comparison: https://www.notebookcheck.net/R7-784....247596.0.html
And if I had a notebook and needed performance for AI workloads, I would prefer to SSH into a workstation or server rather than carry a more powerful notebook.
Can you please stop talking about the Apple M2? No one cares about these old history lessons anymore; you can buy an Apple M3...
I do not have an Apple M3 and I will not buy one, LOL...
"The M2 has a better GPU hands down,"
The M1/M2 did not have ray-tracing support and also had no VP9/AV1 decode or encode.
This fact alone shows you that for Linux and open source an AMD GPU is better: there you have AV1 decode and ray-tracing support.
By the way, the AMD GPU has open-source drivers, while the Apple M1/M2 only has a reverse-engineered, officially unsupported driver with, right now, only OpenGL support...
The M3, they now claim, has ray-tracing support and AV1 decode, but no AV1 encode?
M2 vs. 6800U: I honestly do not care much which is slower or faster. One has good open-source drivers for the GPU, plus AV1 decode support and official ray-tracing support for the GPU... and the Apple M2 has none of this, so who cares which is faster? The Apple M2 could have double the performance and I still would not buy it.
Originally posted by Jabberwocky View Post
I have a few questions:
- What is the percentage of power that the decoder uses, and how much does that objectively contribute to improved efficiency / performance?
- Who cares? The Apple M2 does not even have VP9/AV1 decode.
Originally posted by Jabberwocky View Post- We have seen in x86 alone that there have been big improvements to IPC when optimizing between branch prediction and the micro-op cache / decode queue. What proof do we have that we have reached a limit where we cannot optimize variable-length decoding further (regardless of whether we keep or remove legacy instructions)?
Who cares that the Intel 14900K is maybe 4% faster in single-core than a 13900K?
From my point of view, multi-core is not very interesting, because you can always put in more cores to get more multi-core performance.
Can you explain to me how exactly Intel wants to go from a 283-watt TDP to only a 30-watt TDP with the same performance?
Originally posted by Jabberwocky View Post- How can we objectively measure that ARM's success is due to the "ultra" 8-wide decode step?
- Is x86 struggling to improve the decode step because of legacy instructions, perhaps it's AVX or even something else?
The end consumer does not care at all whether it is the 8-wide decode or the 10 nm vs. 3 nm part.
The end consumer only sees that Intel has a big problem.
Intel right now has a 6-wide decode design... of course they can go to 8-wide decode; the point is that they need more transistors for this step, because a variable-width design uses more transistors than a fixed decode width.
Intel cannot afford to waste more transistors, because their 10 nm node cannot handle it.
Originally posted by Jabberwocky View Post
CPU core frontends are very complicated. The smallest change to something like cache latency, or to the balance between predictions and mispredictions: every small change has a profound impact on the entire system. I enjoy studying this, but I can't honestly say I know exactly what is going on in Raptor Lake / Zen 5 / M3 / Snapdragon. It's easy to make statements like "X has better single-core performance because the CPU is ultra-wide", but actually proving that this architectural change is responsible is another story. If there's a proper study on this and we can definitively say it's too difficult to feed the execution units and single-core performance is over on x86,
And Intel's 10 nm node cannot handle more transistors...
The ARM 8-wide decode design saves transistors compared to a variable-width design, because there are fewer possibilities to handle.
In the past, Intel avoided going from a 4-wide to a 6-wide decode design and instead used Hyper-Threading to get more multi-core performance; for multi-core performance you do not need a wider decoder.
If Intel could go from 10 nm to 3 nm, then they could spend the extra transistors to handle this complexity at 8-wide decode...
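To see why variable-length decode costs extra hardware, consider just finding the instruction boundaries. A toy sketch (the opcode-to-length table is invented; real x86 length decoding also involves prefixes, ModRM and SIB bytes, which is exactly where the transistors go):
Code:
# Fixed 4-byte instructions: every decoder lane knows its offset up front.
# Variable-length: each boundary depends on the previous one (a serial chain),
# so a wide decoder must length-decode speculatively at many byte offsets.
FIXED_LEN = 4                                   # ARM-style encoding
VAR_LEN = {0x90: 1, 0xB8: 5, 0x0F: 2, 0xE8: 5}  # fake opcode -> total length

def fixed_boundaries(code):
    return list(range(0, len(code), FIXED_LEN))

def variable_boundaries(code):
    offsets, pc = [], 0
    while pc < len(code):
        offsets.append(pc)
        pc += VAR_LEN[code[pc]]
    return offsets

print(fixed_boundaries(bytes(16)))              # [0, 4, 8, 12]
print(variable_boundaries(bytes([0x90, 0xB8, 0, 0, 0, 0, 0x0F, 0, 0xE8, 0, 0, 0, 0])))
# -> [0, 1, 6, 8]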
Originally posted by Jabberwocky View Post
I really doubt that this is the case; however, I'll be happy to admit that I'm wrong. Right now, though, AFAIK the Snapdragon X Elite is only being released in June 2024, so we won't even have independent testing of it until then.
"I'll be happy to admit that I'm wrong."
You will never admit it, because you can always claim it is 10 nm vs. 3 nm and not the ISA or the CPU design.
Originally posted by Jabberwocky View Post
Regarding the announced Snapdragon X Elite single-core benchmarks: even if we trust that independent testing will give the same results over all workloads, it doesn't look like a life-changing difference. Certainly not enough to revive the dead RISC vs. CISC arguments, nor something that spells the end of x86. One thing that is impressive is the lower frequency of the device, but it still seems to draw a lot of power (80 W) at 4.3 GHz. Maybe this will improve in the near future?
Then keep in mind that the Snapdragon X Elite is only on 4 nm while the Apple M3 is on 3 nm... so a Snapdragon X Elite 2.0 on 3 nm would be even better.
Originally posted by Jabberwocky View Post
If something were to bring x86 laptops to an end, it's bad power management on the OS side in both Linux and Windows. Also the lack of innovation compared to the bells and whistles of the M2, and how many applications take advantage of the SoC at a low level. Apple does a good job of marketing their new features and making them exciting for teams (including managers) to implement; dopamine firing off the charts every time a new hardware feature is supported. On the other side you have Windows benchmarking and video-editing software that took many years to support very useful µarch improvements, like Adobe's trash software that would just crash on some CPUs or not use hardware encoding for many years. This is where the real battle is being fought, IMO.
And your argument here... I do not see any change whatsoever coming for Microsoft and closed-source software like Adobe's in general. It is technically impossible to get what you want, or what we want, with closed-source software or a closed-source operating system. As you say, Apple can only do this because they control everything, and only if you control everything like Apple can you pull this off with closed source. This means Microsoft is no real competitor to Linux; we can easily beat Microsoft, and it is already happening with Valve's Steam Deck.
Honestly, I do not understand Apple. Long ago Apple also sold servers, and they lost the server market completely. Now they could easily support Linux officially on M1/M2/M3, but they choose not to.
Also, Apple is an MPEG LA member, and for this fact alone "evil". In the M1/M2 they implemented all the closed-source, patented video codecs, but the open-source video codecs they adopt only very slowly... why?
You always talk about the 6800U, but AMD's 7000-series mobile chips honestly look good, with AV1 encode support; the 6000 series only has decode support.
For Linux, the Apple products are not ready... being faster or having better battery life does not matter in this case.
Originally posted by Jabberwocky View Post
I hope ARM and RISC-V find more ways of improving over x86, and not just non-objective hype tactics. Massive speculation: it seems like AMD is going after inter-CCD latency in the next 2 years. AMD will likely grow IPC by 10 to 15% with Zen 5 and ~10% with Zen 6 (completely new chip layout, new Infinity Fabric, CCD stacking); we might see Arrow Lake and ARM growing more efficient than that over the next 2 years, so it will be interesting. We will see some interesting laptops next year, with AMD bringing out some new things too, but we might see delays for the big stuff like Strix Halo (likely 2025); we should still see Strix Point in 2024 due to Windows 11 AI requirements. Both AMD Strix Point and Snapdragon X Elite will go for ~40 to ~45 TOPS if the rumors are right.
I love low-power, passively cooled devices, so Cortex-A5 over Cortex-A7 and Cortex-X models... never mind x86.
RISC-V is just a distraction from real free and fast solutions like OpenPOWER.
I have never seen any benchmark result for a fast RISC-V chip; they clearly do not exist.
"I hope ARM and RISC-V find more ways of improving over x86 and not in nonobjective hype tactics."
Do you remember the time when Intel and AMD CPUs were produced on a 14 nm node?
At that time, POWER9 CPUs produced at 14 nm beat all the x86 chips in single-core performance.
That's just one example of how pointless x86 is.
After that point you can no longer compare the CPUs, because one is at 10 nm, another at 3 nm, and IBM's POWER10 is at 7 nm.
And any difference in the production node would just make you cry "non-objective hype tactics".
"see Strix Point in 2024 due to Windows 11 AI requirements"
What are these requirements? Last time I checked, AMD's inference accelerators were 8-bit integer.
But minifloats like 4-bit, 6-bit and 8-bit floating point clearly beat 8-bit integer.
This means the hardware company that manages to include 4/6/8-bit floating point as well as 8-bit integer will win this.
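For context on what such a minifloat buys you, here is a toy decoder for the common FP8 E4M3 layout (1 sign, 4 exponent, 3 mantissa bits, bias 7, following the widely used OCP FP8 convention); an illustrative sketch, not any vendor's hardware behavior:
Code:
import math

def fp8_e4m3_to_float(byte):
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:   # the only NaN encoding; E4M3 has no infinities
        return math.nan
    if exp == 0:                    # subnormals: fine steps around zero
        return sign * (man / 8) * 2.0 ** -6
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)

# INT8 covers -128..127 in uniform steps of 1. FP8 E4M3 reaches about +/-448
# with fine resolution near zero and coarse steps near the top, which matches
# neural-net weight/activation distributions much better.
print(max(fp8_e4m3_to_float(b) for b in range(256)))   # 448.0
print(fp8_e4m3_to_float(0x01))                          # 0.001953125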
-
Originally posted by Jabberwocky View PostI've looked at some benchmarks for Apple's M2 (8+10, 20 billion transistors, TSMC 5 nm N5P) vs. the AMD 6800U (Zen 3+ / RDNA 2, 13 billion transistors, TSMC 6 nm FinFET) in an Asus Zenbook S13. The results don't match the hype, IMO.
Here's a review of the 7840HS. You can add the Apple MacBook Pro 14 2023 M2 Pro to the CPU Performance Rating graph, and it shows the Mac beating the Ryzen by 7% (76.3 vs. 71.8).
Originally posted by Jabberwocky View PostI have a few questions:
- What is the percentage of power that the decoder uses, and how much does that objectively contribute to improved efficiency / performance?
- We have seen in x86 alone that there have been big improvements to IPC when optimizing between branch prediction and the micro-op cache / decode queue. What proof do we have that we have reached a limit where we cannot optimize variable-length decoding further (regardless of whether we keep or remove legacy instructions)?
- How can we objectively measure that ARM's success is due to the "ultra" 8-wide decode step?
- Is x86 struggling to improve the decode step because of legacy instructions, perhaps it's AVX or even something else?
- Only Intel and AMD actually know such detailed information about their CPUs.
- I think it's probably not so much due to the size of the opcode space, as it is just the hassle of dealing with variable-length instructions, to begin with.
- Again, this is something only ARM or Apple would know about their CPUs.
- Same answer as point 2.
Originally posted by Jabberwocky View PostI enjoy studying this, but I can't honestly say I know exactly what is going on in Raptor Lake / Zen 5 / M3 / Snapdragon. It's easy to make statements like "X has better single-core performance because the CPU is ultra-wide", but actually proving that this architectural change is responsible is another story. If there's a proper study on this and we can definitively say it's too difficult to feed the execution units and single-core performance is over on x86, I really doubt that this is the case; however, I'll be happy to admit that I'm wrong.
Just recently, they posted analysis of ARM's X2 cores, via "Snapdragon 8+ Gen 1" (note that it's a phone SoC - not one of their laptop SoCs).
Originally posted by Jabberwocky View PostEven if we trust that independent testing will give the same results over all workloads, it doesn't look like a life-changing difference.
Originally posted by Jabberwocky View Postit still seems to draw a lot of power (80 W) at 4.3 GHz.
-
Originally posted by Classical View PostIn my opinion, ARM is actually not that much better than what AMD and Intel currently use. Compare e.g. the results of the iPhone 12 with the Intel 12600K that I published here:
There are other Windoze systems in the house. I have a Radeon RX 580 graphics card. I'm most certainly not planning on moving hardware components around trying to get this stupid benchmark to run. My youngest sister has an i7 8700K + RX 580 8GB in the system I built for her, so I can ask her what...
As you can see, the performance of the iPhone 12 mini (as fast as a standard iPhone) is really very weak in the 'animation & skinning', 'particles' and 'AI agents' sections.
Originally posted by TemplarGR View PostYou do not simply increase or decrease the clock rate by flipping a switch; for the clock rate to go higher, you need to change the architecture and/or the process node. x86 architectures are able to clock high and use wider execution units per core because of their architecture; ARM cores CANNOT.
Originally posted by TemplarGR View Postif ARM ever develops a desktop cpu it will be using similar levels of power.
Originally posted by TemplarGR View PostSo you can't say Intel's architecture is "bad".
Compared with AArch64, though, x86 does carry real ISA baggage:
- Variable-length instructions
- Fewer general-purpose registers
- Mostly 2-operand instructions
- More stringent memory semantics
Originally posted by TemplarGR View PostAlso Apple cores are severely overrated/overhyped and are not really better than Intel either. And Apple also benefits by compiling the OS and Software for their architecture, unlike Intel and x86 software which is more generic.
As for the OS, it's not as if AMD and Intel don't both submit plenty of kernel patches to make their CPUs run Linux efficiently.
Last edited by coder; 05 November 2023, 06:58 PM.