
Intel Core i5 14600K & Intel Core i9 14900K Linux Benchmarks


  • #41
    Originally posted by Jabberwocky View Post
    Thanks for making a technical argument.
    I have not read about the 80376 before. I wasn't aware that these were made and put into dumb terminals. Removing 16-bit and 32-bit instructions is an extremely challenging task; I would be surprised if this happened anytime soon. I sometimes compare x86 to JavaScript: we have all this legacy junk and hacks because, if either broke backwards compatibility, the industry likely would have moved to something else.
    I've looked at some benchmarks for Apple's M2 (8+10) 20 billion transistors TSMC 5nm N5P vs AMD 6800U (Zen 3+ / RDNA 2) 13 billion transistors TSMC 6nm FinFET in an Asus Zenbook S13. The results don't match the hype IMO.
    I think something like the 80376 never existed. Also, why do you make legacy comparisons with already obsolete tech?

    Use these benchmark results for current tech instead: https://www.computerbase.de/2023-11/...eils-deutlich/

    The multicore results are not really interesting because, surprise surprise, the chip with more cores gets the better result...

    No, the single-core results are the interesting ones.

    "
    Geekbench v6 – Single-Core
      • Apple M3
        • 100 %
      • Core i9-13900K
        99 %
      • Snapdragon X Elite (80 W)
        97 %
      • Apple M3 Max
        96 %
      • Snapdragon X Elite (23 W)
        90 %
      • Apple M2 Max
        89 %
      • Apple M2
        87 %
    ​"

    As you see in this comparison, the Apple M3 chip is the fastest and is set to 100%, and the 13900K is at 99%.
    If we say the Intel 14900K is maybe 3-4% faster than the 13900K, that would put it in first place, maybe 2-3% faster than the Apple M3 chip.
    But keep in mind that the 14900K has a TDP of 282 W or something like that, and the Apple M3 draws much less than that.
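    A quick back-of-the-envelope check of that chain of percentages; the 3.5% uplift used below is just the midpoint of the assumed 3-4% gap, not a measured number:

```c
/* Back-of-the-envelope check: the quoted chart puts the 13900K at 99% of
 * an M3 in Geekbench v6 single-core, and the post assumes the 14900K is
 * ~3-4% faster than the 13900K (assumption, not a measurement). */
#include <stdio.h>

int main(void)
{
    double m3      = 1.00;   /* M3 normalized to 100% (from the quoted chart) */
    double i13900k = 0.99;   /* 13900K share from the quoted chart */
    double uplift  = 0.035;  /* assumed 14900K-over-13900K gain, midpoint of 3-4% */

    double i14900k = i13900k * (1.0 + uplift);
    printf("14900K estimate: %.1f%% of the M3, i.e. roughly %.1f%% ahead\n",
           i14900k * 100.0, (i14900k / m3 - 1.0) * 100.0);
    return 0;
}
```

    With those inputs it lands at roughly 102-103% of the M3, which is where the 2-3% figure above comes from.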

    "I've looked at some benchmarks for Apple's M2 (8+10) 20 billion transistors TSMC 5nm N5P vs AMD 6800U (zen3+ rdna2) 13 billion transistors TSMC 6 nm FinFET in an Asus Zenbook S13. The results doesn't match the hype IMO.​"

    Your comparison only holds if you focus on multicore performance, because of course in single-core performance the AMD 6800U has no chance whatsoever against an Apple M3, Core i9-13900K, Snapdragon X Elite or 14900K.

    You talk about (Zen 3+ / RDNA 2), and even if we talk about Zen 4 + RDNA 3, you either have a low-performance APU or an iGPU+dGPU combination. And if we say AI workloads are the most important point today, then a unified memory model of course beats the VRAM of a dGPU: even against an AMD PRO W7900 with 48 GB of VRAM, Apple's unified memory model gives you 128 GB of memory for your AI workloads...

    And AMD is not yet in the game with big APUs carrying huge amounts of unified memory...
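    To put that capacity argument in rough numbers, here is a small sketch that only counts parameter bytes; the 70B model size and bytes-per-parameter figures are arbitrary examples, and KV cache, activations and framework overhead are ignored:

```c
/* Rough capacity check: does a model's parameter memory fit in a given
 * amount of GPU/unified memory? Only parameter bytes are counted; KV cache,
 * activations and framework overhead are ignored, and the 70B model size is
 * an arbitrary example, not a benchmark. */
#include <stdio.h>

static void check(double params_billion, double bytes_per_param,
                  double mem_gb, const char *mem_name)
{
    double need_gb = params_billion * bytes_per_param;  /* GB, using 1 GB = 1e9 B */
    printf("%3.0fB params @ %.0f B/param -> %4.0f GB needed, %s: %s\n",
           params_billion, bytes_per_param, need_gb, mem_name,
           need_gb <= mem_gb ? "fits" : "does not fit");
}

int main(void)
{
    check(70, 2,  48, "48 GB dGPU");          /* fp16 weights */
    check(70, 2, 128, "128 GB unified memory");
    check(70, 1,  48, "48 GB dGPU");          /* 8-bit quantized weights */
    check(70, 1, 128, "128 GB unified memory");
    return 0;
}
```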


    Originally posted by Jabberwocky View Post
    If the headlines read low idle and hours of battery lifetime then I would say the hype is real. Tom's Hardware claiming it's better than a Threadripper 3990X and Apple's Senior VP, Johny Srouji, claiming it's faster than a GeForce 3090 is just moronic. I digress, let's look at the benchmarks...
    The M2 has a better GPU hands down, but on the CPU side it's more or less the same as the 6800U. In FL Studio WAV export the M2 is 12% slower than the 6800U (15W) and 16% slower than the 6800U (25W). In Blender the M2 is only 3.2% faster than the 6800U (15W) and is 7.1% slower than the 6800U (25W). The M2 does well in 7zip compression but fails miserably at 7zip decompression, being 32.4% slower than the 6800U (15W). Also, nobody is testing AVX512 and comparing that to Apple. The M2 is also much more expensive. One thing is for certain: Intel's 1260P is struggling all round in terms of performance / Watt. These stats were obtained from Hardware Unboxed's benchmarks. I'm still waiting for proper Zen4 tests vs Apple M2. Here's a generic comparison: https://www.notebookcheck.net/R7-784....247596.0.html
    Honestly, I do not care about battery life or even devices on battery... Why? I use a desktop/workstation, or a smartphone, which does not need performance at all.
    And if I had a notebook and needed performance for AI workloads, I would prefer to SSH into a workstation or server rather than buy a more powerful notebook.

    Can you please stop talking about the Apple M2? No one cares about these old history lessons anymore; you can buy an Apple M3 now...
    I do not have an Apple M3 and I will not buy one, LOL...

    "The M2 has a better GPU hands down,"

    The M1/M2 did not have raytracing support and also had no VP9/AV1 decode or encode.
    This fact alone shows you that for Linux and open source an AMD GPU is better: there you have AV1 decode and raytracing support.
    By the way, the AMD GPU has open-source drivers, while the Apple M1/M2 only has a reverse-engineered, officially unsupported driver with, right now, only OpenGL support...

    For the M3 they now claim raytracing support and AV1 decode, but no AV1 encode?

    M2 vs 6800U:

    I honestly do not care that much which one is slower or faster.

    One has good open-source drivers for the GPU,
    and also AV1 decode support,
    and also official raytracing support for the GPU...

    The Apple M2 has none of this, so who cares which is faster?
    The Apple M2 could have double the performance and I still would not buy it.


    Originally posted by Jabberwocky View Post

    I have a few questions:
    1. What is the percentage of power that the decoder uses and how much does that objectively contribute to improved efficiency / performance?
    1. Who cares? The Apple M2 does not have VP9/AV1 decode.

      Originally posted by Jabberwocky View Post
    2. We have seen in x86 alone that there have been big improvements to IPC when optimizing between branch prediction and micro-op cache / decode-queue. What proof do we have that we have reached a limit where we cannot optimize variable length decoding further (regardless if we keep or remove legacy instructions) ?
    Very simple: the Apple M3 beats an Intel 13900K in single-core performance, and that is a 283 W TDP versus a 30 W TDP...
    Who cares that the Intel 14900K is maybe 4% faster in single-core than a 13900K?

    From my point of view multicore is not very interesting, because you can always add more cores to get more multicore performance.

    Can you explain to me how exactly Intel plans to go from a 283 W TDP to only a 30 W TDP with the same performance?

    Originally posted by Jabberwocky View Post
  • How can we objectively measure that ARM's success is due to the "ultra" 8-wide decode step?
  • Is x86 struggling to improve the decode step because of legacy instructions, perhaps it's AVX or even something else?
I honestly don't care. Intel is on 10nm+++++++++, I honestly don't care. Apple is on 3nm, I honestly don't care.

The end consumer does not care at all whether it is the 8-wide decode or the 10nm vs 3nm part.

The end consumer only sees that Intel has a big problem.

Intel right now has a 6-wide decode design... Of course they can go to 8-wide decode; the point is that they need more transistors for that step, because a flexible-width design takes more transistors than a fixed decode width.

Intel cannot afford to spend more transistors, because their 10nm node cannot handle it.
Originally posted by Jabberwocky View Post

CPU core frontends are very complicated. The smallest change to something like cache latency. The balance between predictions and mispredictions. Every small change has a profound impact on the entire system. I enjoy studying this, but I can't honestly say I know exactly what is going on in Raptor Lake / Zen5 / M3 / Snapdragon. it's easy to make statements like "X has better single core performance because the CPU is ultra wide", but to actually prove that this architectural change is responsible for this is another story. If there's a proper study on this and we can definitively say it's too difficult to feed the execution and single core performance is over on x86,
You clearly use the wrong words. Nothing is "too difficult"; it is simply a fact that a flexible-width 8-wide decoder needs more transistors than the 6-wide decode Intel uses right now,
and Intel's 10nm node cannot handle more transistors...
ARM's fixed-width 8-wide decode design saves transistors compared to a flexible-width design, because there are fewer possibilities to handle.

In the past Intel avoided going from a 4-wide to a 6-wide decode design and instead used hyperthreading to get more multicore performance; for multicore performance you do not need a wider decoder.

If Intel could go from 10nm to 3nm, then they could spend the extra transistors to handle this complexity at 8-wide decode...
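A toy illustration of where that transistor cost comes from (this is not real decoder logic, just made-up instruction lengths, to show why variable-length instruction boundaries are the expensive part):

```c
/* Toy illustration, not real decoder logic: why fixed-length decode is
 * cheaper to widen than variable-length decode.
 *
 * Fixed 4-byte ISA: in a 32-byte fetch window the 8 instruction start
 * offsets are known up front, so 8 decoders can work fully in parallel.
 *
 * Variable-length ISA (x86 instructions are 1 to 15 bytes): the start of
 * instruction N depends on the length of instruction N-1, so a wide decoder
 * must either find lengths serially or speculatively decode at every byte
 * offset and throw the wrong results away -- that speculation is where the
 * extra transistors go. The lengths below are invented. */
#include <stdio.h>

#define WINDOW 32

int main(void)
{
    int fixed_starts = WINDOW / 4;   /* fixed-length: starts at 0, 4, 8, ... */

    int lengths[] = { 3, 5, 2, 7, 1, 4, 6, 4 };   /* invented instruction lengths */
    int pos = 0, real_starts = 0;

    for (int i = 0; i < 8 && pos < WINDOW; i++) {
        /* each start offset only becomes known after the previous length */
        printf("variable-length insn %d starts at byte %2d\n", i, pos);
        pos += lengths[i];
        real_starts++;
    }

    printf("\nfixed-length:    %d start offsets, all known in parallel\n", fixed_starts);
    printf("variable-length: %d real starts, but a brute-force parallel decoder\n"
           "                 must consider up to %d candidate offsets\n",
           real_starts, WINDOW);
    return 0;
}
```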

Originally posted by Jabberwocky View Post

I really doubt that this is the case, however I'll be happy to admit that I'm wrong. Right now though, AFAIK Snapdragon Elite is only being released June 2024 so we won't even have independent testing of Snapdragon Elite X until then.
Is the Snapdragon X Elite even a relevant point if the Apple M3 is the fastest in this comparison?

"I'll be happy to admit that I'm wrong."

You will never admit it, because you can always claim it is 10nm vs 3nm and not the ISA or CPU design.

Originally posted by Jabberwocky View Post
​​
Regarding the announced Snapdragon Elite X single core benchmarks: Even if we trust that independent testing will give the same results over all workloads then it doesn't look like it's a life-changing difference. Certainly not enough to revive dead RISC vs CISC arguments. Nor something that spells the end of x86. One thing that is impressive is the lower frequencies of the device, but it still seems to draw a lot of power (80W) at 4.3 Ghz. Maybe this will improve in the near future?
First, an 80 W TDP is low in comparison with Intel's 283 W TDP for the 13900K/14900K.

Then keep in mind that the Snapdragon X Elite is only on 4nm while the Apple M3 is on 3nm... so a Snapdragon X Elite 2.0 on 3nm would be even better.

Originally posted by Jabberwocky View Post
​​​
If something would bring x86 laptops to an end it's bad power management on the OS side in both Linux and Windows. Also the lack of innovation compared to the bells and whistles of the M2 and how many applications take advantage of the SOC on a low level. Apple does a good job at marketing their new features and makes it exciting for teams (including managers) to implement. Dopamine firing off the charts every time a new hardware feature is supported. On the other side you have Windows benchmarking and video editing software that took many years to support very useful u.arch improvements. Like Adobe's trash software that would just crash on some CPUs or not use hardware encoding for many years. This is where the real battle is being fought IMO.
Right, as I said above, I would not buy an Apple M2/M3 even if the performance were twice as high, because there are no open-source drivers for Linux and also no AV1 decode/encode...
And your argument here... I do not see any change whatsoever for Microsoft and closed-source software like Adobe's software in general. It is technically impossible to get what you want, or what we want, with closed-source software or a closed-source operating system. As you say, Apple can only do this because they control everything, and only if you control everything like Apple does can you pull this off with closed source. This means Microsoft is no real competitor to Linux; we can easily beat Microsoft, and it is already happening with Valve's Steam Deck.

Honestly, I do not understand Apple. A long time ago Apple also had servers, and they lost the server market completely. Now they could easily support Linux officially on the M1/M2/M3, but they choose not to.

Also, Apple is an MPEG LA member and for that fact alone "evil". In the Apple M1/M2 they implemented all the closed-source and patented video codecs, but open-source video codecs they adopt only very slowly... why?

You always talk about the 6800U, but AMD's 7000-series mobile chips honestly look good with AV1 encode support, while the 6000 series only has decode support.

For Linux the Apple products are not ready... being faster or having better battery life does not matter in this case.

Originally posted by Jabberwocky View Post
​​​
I hope ARM and RISC-V find more ways of improving over x86 and not in nonobjective hype tactics. Massive speculation: It seems like AMD is going after inter-CCD latency in the next 2 years. AMD will likely grow at 10 to 15% IPC with Zen5 and ~10% with Zen6 (completely new chip layout design, new infinity fabric, CCD stacking) we might see Arrow Lake and ARM growing more efficient than that over the next 2 years so it will be interesting. We will see some interesting laptops next year with AMD bringing out some new things too but we might see delays for the big stuff like Strix Halo (likely 2025) but we still should see Strix Point in 2024 due to Windows 11 AI requirements. Both AMD Strix Point and Snapdragon X Elite will go for ~40 to ~45 TOPS if the rumors are right.
I love low-power passive cooled devices, so Cortex-A5 over Cortex-A7 and Cortex-X models... never mind x86.
I do not understand the RISC-V hype; they are fake, as the Libre-SOC people discovered.

RISC-V is just a distraction from real free and fast solutions like OpenPOWER.

I have never seen any benchmark result for a fast RISC-V chip; they clearly do not exist.

"I hope ARM and RISC-V find more ways of improving over x86 and not in nonobjective hype tactics."

Do you remember the time when Intel and AMD CPUs were produced on a 14nm node?
At that time, POWER9 CPUs produced at 14nm beat all the x86 chips in single-core performance.
Just as an example of how pointless x86 is.

After that you can no longer compare the CPUs, because one is at 10nm, the other is at 3nm, and the IBM POWER10 is at 7nm.

And any difference in the production node would just make you cry "nonobjective hype tactics".

"see Strix Point in 2024 due to Windows 11 AI requirements"

What are these requirements? Last time I checked, AMD's inference accelerators were 8-bit integer,
but minifloats like 4-bit, 6-bit and 8-bit floating point clearly beat 8-bit integer.

This means the hardware company that manages to include 4/6/8-bit floating point as well as 8-bit integer will win this.
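A rough sketch of why the minifloat argument holds at the same bit width; the format below is a generic 1-sign/4-exponent/3-mantissa layout with the top exponent reserved IEEE-style for Inf/NaN, not any particular FP8 spec (real E4M3-style formats reclaim most of those codes and reach a somewhat higher maximum):

```c
/* Generic 8-bit minifloat vs int8, just to show the dynamic-range point.
 * Layout assumed here: 1 sign bit, 4 exponent bits, 3 mantissa bits, with
 * the all-ones exponent reserved IEEE-style for Inf/NaN. This is NOT any
 * particular FP8 spec. */
#include <stdio.h>

int main(void)
{
    int exp_bits = 4, man_bits = 3;
    int bias = (1 << (exp_bits - 1)) - 1;                 /* 7 */

    double max_normal = (2.0 - 1.0 / (1 << man_bits))     /* 1.875 */
                      * (double)(1 << ((1 << exp_bits) - 2 - bias));
    double min_subnormal = 1.0 / (double)(1 << (bias - 1 + man_bits));

    printf("int8:        -128 .. 127, uniform step of 1\n");
    printf("8-bit float: +/- %.0f max, smallest positive %.6g,\n"
           "             fine steps near zero, coarse steps near the max\n",
           max_normal, min_subnormal);
    return 0;
}
```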

Comment


  • #42
    Originally posted by coder View Post

    You'll find lots of analysis and investigation of these sorts of questions on Chips & Cheese. For instance, they did a detailed performance comparison (i.e. not just benchmarks, but also analysis) on ARM Neoverse N1 vs. Zen 2, a few years ago:

    Just recently, they posted analysis of ARM's X2 cores, via "Snapdragon 8+ Gen 1" (note that it's a phone SoC - not one of their laptop SoCs).


    If performance is comparable and efficiency is way better, then it could definitely have implications for the laptop and server markets.
    Thanks for the feedback. I'm going to keep a close eye on these developments and rethink my views, specifically around variable-length decoders.

    I based some of my views on this old paper: https://www.researchgate.net/publica...uction_Decoder

    The reason why I'm bringing it up is because of section "3.2 Microbenchmark Design". Do you think it's practical to write tests like that and monitor system power in order to deduce decoder efficiency?

    I'm just wondering how we can find more data on where the single-core performance / efficiency comes from in ARM's, Intel's and AMD's latest cores (without just assuming that it's the decoder width).
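    For what it's worth, the measurement side of such a test can be sketched on Linux with the powercap/RAPL energy counter; the sysfs path below is the usual one but may differ per system, the workload is only a placeholder, and since RAPL gives package-level energy, decoder cost has to be inferred by comparing two variants of the same microbenchmark (e.g. op-cache-resident vs decoder-bound) rather than measured directly:

```c
/* Sketch of the measurement side on Linux: read the package energy counter
 * exposed by the powercap (RAPL) driver, run a workload, read it again and
 * report joules. Needs read permission on the sysfs file (usually root);
 * the path below is the common one for package 0 but may differ per system. */
#include <stdio.h>
#include <stdlib.h>

static long long read_energy_uj(void)
{
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    if (!f) { perror("energy_uj"); exit(1); }
    long long uj;
    if (fscanf(f, "%lld", &uj) != 1) { fclose(f); exit(1); }
    fclose(f);
    return uj;
}

/* Placeholder workload -- swap in the microbenchmark under test. */
static volatile double sink;
static void workload(void)
{
    double x = 1.0;
    for (long i = 0; i < 1000000000L; i++)
        x = x * 1.0000001 + 0.5;
    sink = x;
}

int main(void)
{
    long long before = read_energy_uj();
    workload();
    long long after = read_energy_uj();
    /* The counter wraps at max_energy_range_uj; a robust tool handles that. */
    printf("package energy: %.3f J\n", (after - before) / 1e6);
    return 0;
}
```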

    Comment


    • #43
      Originally posted by Jabberwocky View Post
      I based some of my views on this old paper: https://www.researchgate.net/publica...uction_Decoder​
      That's analyzing Haswell -- a 10-year-old CPU with only a 4-wide decoder.

      Originally posted by Jabberwocky View Post
      ​The reason why I'm bringing it up is because of section "3.2 Microbenchmark Design". Do you think it's practical to write tests like that and monitor system power in order to deduce decoder efficiency?​
      I've seen Chips & Cheese try to evaluate the energy cost of decoders, but it was too long ago for me to remember their exact approach or which CPU they were analyzing. I think it essentially amounted to the same thing - try to blow out the uOp cache and look at what that does to power usage and performance.

      One thing to note about their older articles analyzing Zen 2 and earlier: a more recent article mentions that AMD's RAPL values, which they were relying on for power data, seem to be merely estimates rather than direct measurements.
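      A minimal sketch of that blow-out idea, assuming x86-64 with GCC/Clang inline asm: run the same number of dynamic NOPs once from a tiny loop body that stays resident in the op cache and once as a huge straight-line body that should exceed typical op-cache capacities. The 16384-NOP and iteration counts are guesses, not vendor numbers, and single-byte NOPs are easy on the simple decoders, so any gap measured this way understates the cost for denser instruction mixes:

```c
/* Blow-out sketch (x86-64, build with "gcc -O2 blowout.c"): execute the
 * same number of dynamic NOPs twice -- once from a tiny 8-NOP loop body
 * that stays resident in the op cache, once as a 16384-NOP straight-line
 * body that should exceed typical op-cache capacities (a few thousand uops)
 * while still fitting in L1i, so the legacy decoders work every iteration.
 * Compare wall time, or wrap each phase with a RAPL read to compare energy. */
#include <stdio.h>
#include <time.h>

#define NOP1   __asm__ __volatile__("nop");
#define NOP8   NOP1 NOP1 NOP1 NOP1 NOP1 NOP1 NOP1 NOP1
#define NOP64  NOP8 NOP8 NOP8 NOP8 NOP8 NOP8 NOP8 NOP8
#define NOP512 NOP64 NOP64 NOP64 NOP64 NOP64 NOP64 NOP64 NOP64
#define NOP4K  NOP512 NOP512 NOP512 NOP512 NOP512 NOP512 NOP512 NOP512
#define NOP16K NOP4K NOP4K NOP4K NOP4K

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    const long outer = 1L << 20;   /* ~17 billion NOPs per variant */

    /* Small footprint: 16384 NOPs per outer iteration from a tight loop. */
    double t0 = now();
    for (long i = 0; i < outer; i++)
        for (long j = 0; j < 16384 / 8; j++) { NOP8 }
    double t1 = now();

    /* Large footprint: the same 16384 NOPs laid out as straight-line code. */
    for (long i = 0; i < outer; i++) { NOP16K }
    double t2 = now();

    printf("op-cache-resident loop: %.2f s\n", t1 - t0);
    printf("decoder-bound loop:     %.2f s\n", t2 - t1);
    return 0;
}
```

      If the straight-line variant is noticeably slower or hungrier at the same dynamic instruction count, that difference is the op-cache/decoder effect for this particular instruction mix.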

      Originally posted by Jabberwocky View Post
      I'm just wondering how we can find more data on where the single-core performance / efficiency comes
      ​They sometimes publish comparative analysis of core efficiency.

      "Reviews across the internet show Alder Lake getting very competitive performance with very high power consumption. For example, Anandtech measured 272 W of package power during a POV-Ray run."


      I think they're missing a key point, however. When running a highly-threaded workload, the optimal strategy should be to clock all your cores so they're delivering the same energy-efficiency. Then, once the E-cores are maxed out, you continue increasing clock speeds on the P-cores. Since the efficiency of the overall chip is an average of its cores' efficiency, having those E-cores in the mix is still a win.
      [Chips & Cheese chart: energy efficiency vs. clock speed for P-cores and E-cores in x264, showing the crossover point discussed below.]
      That crossover on x264 looks worse than it really is. The idea is that once the E-cores reach the crossover point, you don't increase their clock speeds until the P-cores get to 4.0 GHz. Then, you bump the E-cores up to 3.5 GHz. After that, increase P-cores to 4.2 GHz. Next, max the E-cores at 3.8 GHz. Beyond that, any remaining power budget goes to pushing the P-cores to 4.5 GHz and beyond.

      If someone isn't really thinking very hard, you could see how they look at the graph and think: "oh, when you hit that crossover point, you just switch over to P-cores and that's it". But it's wrong (in a multithreaded workload, at least) because that's an efficiency graph and the P-cores become very inefficient towards the top of their frequency envelope.
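      One way to make that allocation strategy concrete is a greedy sketch that hands out the power budget in steps, each time to whichever cluster currently returns the most extra throughput per extra watt. The frequency/power/throughput tables below are invented for illustration and are not Alder Lake measurements:

```c
/* Toy model of the allocation strategy described above: hand out the power
 * budget in steps, each time to whichever cluster (P or E) currently gives
 * the most additional throughput per additional watt. All operating points
 * are invented. */
#include <stdio.h>

struct cluster {
    const char   *name;
    int           step;     /* current operating point */
    int           nsteps;
    const double *power;    /* watts at each operating point */
    const double *perf;     /* throughput (arbitrary units) at each point */
};

static double marginal(const struct cluster *c)
{
    if (c->step + 1 >= c->nsteps) return 0.0;  /* already at the top */
    return (c->perf[c->step + 1] - c->perf[c->step]) /
           (c->power[c->step + 1] - c->power[c->step]);
}

int main(void)
{
    /* Invented points: P-cores scale higher but get power-hungry,
     * E-cores top out early but stay efficient. */
    static const double p_pow[]  = { 20, 35, 55, 85, 125 };
    static const double p_perf[] = { 40, 60, 75, 86,  94 };
    static const double e_pow[]  = { 10, 18, 30 };
    static const double e_perf[] = { 25, 40, 52 };

    struct cluster p = { "P-cores", 0, 5, p_pow, p_perf };
    struct cluster e = { "E-cores", 0, 3, e_pow, e_perf };
    double budget = 150.0;   /* total package power budget in watts */

    for (;;) {
        double mp = marginal(&p), me = marginal(&e);
        if (mp <= 0.0 && me <= 0.0) break;        /* both clusters maxed out */
        struct cluster *pick = (mp >= me) ? &p : &e;
        double extra = pick->power[pick->step + 1] - pick->power[pick->step];
        if (p.power[p.step] + e.power[e.step] + extra > budget) break;
        pick->step++;
        printf("bump %s to point %d  (package %.0f W, throughput %.0f)\n",
               pick->name, pick->step,
               p.power[p.step] + e.power[e.step],
               p.perf[p.step] + e.perf[e.step]);
    }
    return 0;
}
```

      With these made-up tables the E-cores get bumped first while they are the more efficient place to spend watts, then the P-cores take over, which is the pattern described above.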

      Comment


      • #44
        I have often heard the argument here on Phoronix that ARM decoders are faster or more efficient, but it's hard to really prove that. Just comparing decoder width doesn't give any clues either, because x86 can do more with one instruction (meaning one instruction decodes to more micro-ops than on ARM). Also, those decoders are optimized to saturate the core; just increasing their width might not improve overall performance. And finally, a 6-wide decoder at 5 GHz is theoretically faster than an 8-wide one at 3 GHz.

        So a 3 GHz ARM design needs a wider decoder to get the same throughput as higher-clocked designs.
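        Putting numbers on that last point (peak decode throughput is just width times clock; sustained throughput of course depends on everything downstream):

```c
/* Peak decode rate is width x clock; the raw numbers make the point. */
#include <stdio.h>

int main(void)
{
    printf("6-wide @ 5 GHz: %2.0f G instructions/s peak decode\n", 6 * 5.0);
    printf("8-wide @ 3 GHz: %2.0f G instructions/s peak decode\n", 8 * 3.0);
    return 0;
}
```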

        Comment


        • #45
          Originally posted by Anux View Post
          Just comparing decoder width doesn't give any clues either, because x86 can do more with one instruction (meaning one instruction decodes to more micro-ops than on ARM).
          Except that the frontend of an x86 CPU typically has only one "complex" decoder, while the rest are "simple".

          Originally posted by Anux View Post
          Also, those decoders are optimized to saturate the core,
          If that were true, then why have the op cache? Conversely, if most ops are served from the op cache, why not make the backend wide enough to take advantage of it?

          Originally posted by Anux View Post
          just increasing their width might not improve overall performance. And finally, a 6-wide decoder at 5 GHz is theoretically faster than an 8-wide one at 3 GHz.
          The backend width doesn't match the frontend. That's what tells us decoders are a potential bottleneck.

          Here's data collected from Zen 2, across a variety of workloads, showing that the op cache indeed provides performance benefits. As I mentioned before, I think the power data might be unreliable.
          Keep in mind that the op cache doesn't totally eliminate decoder bottlenecks, so this is an underestimate of the extent to which the decoder can be one.

          Comment


          • #46
            Sorry, this thread is broken for me; I can't really quote you properly.

            "then why have the op cache?"
            For the same reason ARM has one, or basically anything has a cache: to reduce bottlenecks and maximize utilization of the different stages. The micro-op cache is an integral part of the decoder.

            "The backend width doesn't match the frontend. That's what tells us decoders are a potential bottleneck."
            As is every other stage of the CPU, depending on the workload. It is all a careful trade-off between area, power and utilization of the CPU's stages.
            Last edited by Anux; 07 November 2023, 05:01 AM.

            Comment


            • #47
              Originally posted by Anux View Post
              For the same reason ARM has one, or basically anything has a cache: to reduce bottlenecks and maximize utilization of the different stages.
              Caches always have overhead, in both silicon area and energy usage. Therefore, you only put a cache where it provides more benefit than it costs.

              In ARM's case, they've been removing them from the ARMv9 cores, now that v9 allows them to drop backward compatibility with AArch32. That simplifies their decoders to the point where they're cheap enough that the mOP cache no longer pulls its weight.

              Originally posted by Anux View Post
              The micro-op cache is an integral part of the decoder.
              There's no reason it should be, nor is it depicted that way in block diagrams.

              Originally posted by Anux View Post
              As is every other stage of the CPU, depending on the workload.
              This is a non-answer. If x86 CPUs had sufficiently wide decoders to keep their backend fed, then the decode rate should match the issue rate, which it doesn't. Intel got a big boost when it added the uOp cache in Sandy Bridge, and a similar thing happened for AMD with Zen. All of the evidence, plus the analysis and empirical data I've cited, points to x86 decoders being a bottleneck. You can continue to deny this if you want, but that doesn't make it untrue.

              Comment


              • #48
                Curious what the results would be if the CPUs were run within spec, as the peak power consumption of both chips was above spec. Was the BIOS using out-of-spec settings?

                Comment
