
Intel Core i5 14600K & Intel Core i9 14900K Linux Benchmarks


  • #41
    Originally posted by Jabberwocky View Post
    Thanks for making a technical argument.
    I have not read about the 80376 before. I wasn't aware that these were made and put into dumb terminals. Removing 16-bit and 32-bit instructions is an extremely challenging task; I would be surprised if this happened anytime soon. I sometimes compare x86 to JavaScript: we have all this legacy junk and hacks because, if either broke backwards compatibility, the industry likely would have moved to something else.
    I've looked at some benchmarks for Apple's M2 (8+10) 20 billion transistors TSMC 5nm N5P vs AMD 6800U (Zen 3+ / RDNA 2) 13 billion transistors TSMC 6nm FinFET in an Asus Zenbook S13. The results don't match the hype IMO.
    I think something like the 80376 never existed. Also, why do you make legacy comparisons with already obsolete tech?

    Use these benchmark results for current tech instead: https://www.computerbase.de/2023-11/...eils-deutlich/

    The multicore results are not really interesting because, surprise surprise, the chip with more cores gets the better result...

    No, the single-core results are the interesting ones.

    "
    Geekbench v6 – Single-Core
      • Apple M3
        • 100 %
      • Core i9-13900K
        99 %
      • Snapdragon X Elite (80 W)
        97 %
      • Apple M3 Max
        96 %
      • Snapdragon X Elite (23 W)
        90 %
      • Apple M2 Max
        89 %
      • Apple M2
        87 %
    ​"

    As you see in this comparison, the Apple M3 chip is the fastest and is set to 100%, and the 13900K is at 99%.
    If we say the Intel 14900K is maybe 3-4% faster than the 13900K, that would put it in first place, maybe 2-3% faster than the Apple M3 chip.
    But keep in mind that the 14900K has a TDP of 282 W or something like that, and the Apple M3 draws much less than that.
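    A quick back-of-the-envelope check of that chain of percentages; the 3.5% uplift used below is just the midpoint of the assumed 3-4% gap, not a measured number:

```c
/* Back-of-the-envelope check: the quoted chart puts the 13900K at 99% of
 * an M3 in Geekbench v6 single-core, and the post assumes the 14900K is
 * ~3-4% faster than the 13900K (assumption, not a measurement). */
#include <stdio.h>

int main(void)
{
    double m3      = 1.00;   /* M3 normalized to 100% (from the quoted chart) */
    double i13900k = 0.99;   /* 13900K share from the quoted chart */
    double uplift  = 0.035;  /* assumed 14900K-over-13900K gain, midpoint of 3-4% */

    double i14900k = i13900k * (1.0 + uplift);
    printf("14900K estimate: %.1f%% of the M3, i.e. roughly %.1f%% ahead\n",
           i14900k * 100.0, (i14900k / m3 - 1.0) * 100.0);
    return 0;
}
```

    With those inputs it lands at roughly 102-103% of the M3, which is where the 2-3% figure above comes from.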

    "I've looked at some benchmarks for Apple's M2 (8+10) 20 billion transistors TSMC 5nm N5P vs AMD 6800U (zen3+ rdna2) 13 billion transistors TSMC 6 nm FinFET in an Asus Zenbook S13. The results doesn't match the hype IMO.​"

    Your comparison only holds if you focus on multicore performance, because of course in single-core performance the AMD 6800U has no chance whatsoever against an Apple M3, Core i9-13900K, Snapdragon X Elite or 14900K.

    You talk about (Zen 3+ / RDNA 2), and even if we talk about Zen 4 + RDNA 3, you either have a low-performance APU or an iGPU+dGPU combination. And if we say AI workloads are the most important point today, then a unified memory model of course beats the VRAM of a dGPU: even against an AMD PRO W7900 with 48 GB of VRAM, Apple's unified memory model gives you 128 GB of memory for your AI workloads...

    And AMD is not yet in the game with big APUs carrying huge amounts of unified memory...
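    To put that capacity argument in rough numbers, here is a small sketch that only counts parameter bytes; the 70B model size and bytes-per-parameter figures are arbitrary examples, and KV cache, activations and framework overhead are ignored:

```c
/* Rough capacity check: does a model's parameter memory fit in a given
 * amount of GPU/unified memory? Only parameter bytes are counted; KV cache,
 * activations and framework overhead are ignored, and the 70B model size is
 * an arbitrary example, not a benchmark. */
#include <stdio.h>

static void check(double params_billion, double bytes_per_param,
                  double mem_gb, const char *mem_name)
{
    double need_gb = params_billion * bytes_per_param;  /* GB, using 1 GB = 1e9 B */
    printf("%3.0fB params @ %.0f B/param -> %4.0f GB needed, %s: %s\n",
           params_billion, bytes_per_param, need_gb, mem_name,
           need_gb <= mem_gb ? "fits" : "does not fit");
}

int main(void)
{
    check(70, 2,  48, "48 GB dGPU");          /* fp16 weights */
    check(70, 2, 128, "128 GB unified memory");
    check(70, 1,  48, "48 GB dGPU");          /* 8-bit quantized weights */
    check(70, 1, 128, "128 GB unified memory");
    return 0;
}
```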


    Originally posted by Jabberwocky View Post
    If the headlines read low idle and hours of battery lifetime then I would say the hype is real. Tom's Hardware claiming it's better than a Threadripper 3990X and Apple's Senior VP, Johny Srouji, claiming it's faster than a GeForce 3090 is just moronic. I digress, let's look at the benchmarks...
    The M2 has a better GPU hands down, but on the CPU side it's more or less the same as the 6800U. In FL Studio WAV export the M2 is 12% slower than the 6800U (15W) and 16% slower than the 6800U (25W). In Blender the M2 is only 3.2% faster than the 6800U (15W) and is 7.1% slower than the 6800U (25W). The M2 does well in 7zip compression but fails miserably at 7zip decompression, being 32.4% slower than the 6800U (15W). Also, nobody is testing AVX512 and comparing that to Apple. The M2 is also much more expensive. One thing is for certain: Intel's 1260P is struggling all round in terms of performance / Watt. These stats were obtained from Hardware Unboxed's benchmarks. I'm still waiting for proper Zen4 tests vs Apple M2. Here's a generic comparison: https://www.notebookcheck.net/R7-784....247596.0.html
    Honestly, I do not care about battery life or even devices on battery... Why? I use a desktop/workstation, or a smartphone, which does not need performance at all.
    And if I had a notebook and needed performance for AI workloads, I would prefer to SSH into a workstation or server rather than buy a more powerful notebook.

    Can you please stop talking about the Apple M2? No one cares about these old history lessons anymore; you can buy an Apple M3 now...
    I do not have an Apple M3 and I will not buy one, LOL...

    "The M2 has a better GPU hands down,"

    The M1/M2 did not have raytracing support and also had no VP9/AV1 decode or encode.
    This fact alone shows you that for Linux and open source an AMD GPU is better: there you have AV1 decode and raytracing support.
    By the way, the AMD GPU has open-source drivers, while the Apple M1/M2 only has a reverse-engineered, officially unsupported driver with, right now, only OpenGL support...

    For the M3 they now claim raytracing support and AV1 decode, but no AV1 encode?

    M2 vs 6800U:

    I honestly do not care that much which one is slower or faster.

    One has good open-source drivers for the GPU,
    and also AV1 decode support,
    and also official raytracing support for the GPU...

    The Apple M2 has none of this, so who cares which is faster?
    The Apple M2 could have double the performance and I still would not buy it.


    Originally posted by Jabberwocky View Post

    I have a few questions:
    1. What is the percentage of power that the decoder uses and how much does that objectively contribute to improved efficiency / performance?
    1. Who cares? The Apple M2 does not have VP9/AV1 decode.

      Originally posted by Jabberwocky View Post
    2. We have seen in x86 alone that there have been big improvements to IPC when optimizing between branch prediction and micro-op cache / decode-queue. What proof do we have that we have reached a limit where we cannot optimize variable length decoding further (regardless if we keep or remove legacy instructions) ?
    Very simple: the Apple M3 beats an Intel 13900K in single-core performance, and that is a 283 W TDP versus a 30 W TDP...
    Who cares that the Intel 14900K is maybe 4% faster in single-core than a 13900K?

    From my point of view multicore is not very interesting, because you can always add more cores to get more multicore performance.

    Can you explain to me how exactly Intel plans to go from a 283 W TDP to only a 30 W TDP with the same performance?

    Originally posted by Jabberwocky View Post
  • How can we objectively measure that ARM's success is due to the "ultra" 8-wide decode step?
  • Is x86 struggling to improve the decode step because of legacy instructions, perhaps it's AVX or even something else?
I honestly don't care. Intel is on 10nm+++++++++, I honestly don't care. Apple is on 3nm, I honestly don't care.

The end consumer does not care at all whether it is the 8-wide decode or the 10nm vs 3nm part.

The end consumer only sees that Intel has a big problem.

Intel right now has a 6-wide decode design... Of course they can go to 8-wide decode; the point is that they need more transistors for that step, because a flexible-width design takes more transistors than a fixed decode width.

Intel cannot afford to spend more transistors, because their 10nm node cannot handle it.
Originally posted by Jabberwocky View Post

CPU core frontends are very complicated. The smallest change to something like cache latency. The balance between predictions and mispredictions. Every small change has a profound impact on the entire system. I enjoy studying this, but I can't honestly say I know exactly what is going on in Raptor Lake / Zen5 / M3 / Snapdragon. it's easy to make statements like "X has better single core performance because the CPU is ultra wide", but to actually prove that this architectural change is responsible for this is another story. If there's a proper study on this and we can definitively say it's too difficult to feed the execution and single core performance is over on x86,
You clearly use the wrong words. Nothing is "too difficult"; it is simply a fact that a flexible-width 8-wide decoder needs more transistors than the 6-wide decode Intel uses right now,
and Intel's 10nm node cannot handle more transistors...
ARM's fixed-width 8-wide decode design saves transistors compared to a flexible-width design, because there are fewer possibilities to handle.

In the past Intel avoided going from a 4-wide to a 6-wide decode design and instead used hyperthreading to get more multicore performance; for multicore performance you do not need a wider decoder.

If Intel could go from 10nm to 3nm, then they could spend the extra transistors to handle this complexity at 8-wide decode...
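A toy illustration of where that transistor cost comes from (this is not real decoder logic, just made-up instruction lengths, to show why variable-length instruction boundaries are the expensive part):

```c
/* Toy illustration, not real decoder logic: why fixed-length decode is
 * cheaper to widen than variable-length decode.
 *
 * Fixed 4-byte ISA: in a 32-byte fetch window the 8 instruction start
 * offsets are known up front, so 8 decoders can work fully in parallel.
 *
 * Variable-length ISA (x86 instructions are 1 to 15 bytes): the start of
 * instruction N depends on the length of instruction N-1, so a wide decoder
 * must either find lengths serially or speculatively decode at every byte
 * offset and throw the wrong results away -- that speculation is where the
 * extra transistors go. The lengths below are invented. */
#include <stdio.h>

#define WINDOW 32

int main(void)
{
    int fixed_starts = WINDOW / 4;   /* fixed-length: starts at 0, 4, 8, ... */

    int lengths[] = { 3, 5, 2, 7, 1, 4, 6, 4 };   /* invented instruction lengths */
    int pos = 0, real_starts = 0;

    for (int i = 0; i < 8 && pos < WINDOW; i++) {
        /* each start offset only becomes known after the previous length */
        printf("variable-length insn %d starts at byte %2d\n", i, pos);
        pos += lengths[i];
        real_starts++;
    }

    printf("\nfixed-length:    %d start offsets, all known in parallel\n", fixed_starts);
    printf("variable-length: %d real starts, but a brute-force parallel decoder\n"
           "                 must consider up to %d candidate offsets\n",
           real_starts, WINDOW);
    return 0;
}
```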

Originally posted by Jabberwocky View Post

I really doubt that this is the case, however I'll be happy to admit that I'm wrong. Right now though, AFAIK Snapdragon Elite is only being released June 2024 so we won't even have independent testing of Snapdragon Elite X until then.
Is the Snapdragon X Elite even a relevant point if the Apple M3 is the fastest in this comparison?

"I'll be happy to admit that I'm wrong."

You will never admit it, because you can always claim it is 10nm vs 3nm and not the ISA or CPU design.

Originally posted by Jabberwocky View Post
​​
Regarding the announced Snapdragon Elite X single core benchmarks: Even if we trust that independent testing will give the same results over all workloads then it doesn't look like it's a life-changing difference. Certainly not enough to revive dead RISC vs CISC arguments. Nor something that spells the end of x86. One thing that is impressive is the lower frequencies of the device, but it still seems to draw a lot of power (80W) at 4.3 Ghz. Maybe this will improve in the near future?
First, an 80 W TDP is low in comparison with Intel's 283 W TDP for the 13900K/14900K.

Then keep in mind that the Snapdragon X Elite is only on 4nm while the Apple M3 is on 3nm... so a Snapdragon X Elite 2.0 on 3nm would be even better.

Originally posted by Jabberwocky View Post
​​​
If something would bring x86 laptops to an end it's bad power management on the OS side in both Linux and Windows. Also the lack of innovation compared to the bells and whistles of the M2 and how many applications take advantage of the SOC on a low level. Apple does a good job at marketing their new features and makes it exciting for teams (including managers) to implement. Dopamine firing off the charts every time a new hardware feature is supported. On the other side you have Windows benchmarking and video editing software that took many years to support very useful u.arch improvements. Like Adobe's trash software that would just crash on some CPUs or not use hardware encoding for many years. This is where the real battle is being fought IMO.
Right, as I said above, I would not buy an Apple M2/M3 even if the performance were twice as high, because there are no open-source drivers for Linux and also no AV1 decode/encode...
And your argument here... I do not see any change whatsoever for Microsoft and closed-source software like Adobe's software in general. It is technically impossible to get what you want, or what we want, with closed-source software or a closed-source operating system. As you say, Apple can only do this because they control everything, and only if you control everything like Apple does can you pull this off with closed source. This means Microsoft is no real competitor to Linux; we can easily beat Microsoft, and it is already happening with Valve's Steam Deck.

Honestly, I do not understand Apple. A long time ago Apple also had servers, and they lost the server market completely. Now they could easily support Linux officially on the M1/M2/M3, but they choose not to.

Also, Apple is an MPEG LA member and for that fact alone "evil". In the Apple M1/M2 they implemented all the closed-source and patented video codecs, but open-source video codecs they adopt only very slowly... why?

You always talk about the 6800U, but AMD's 7000-series mobile chips honestly look good with AV1 encode support, while the 6000 series only has decode support.

For Linux the Apple products are not ready... being faster or having better battery life does not matter in this case.

Originally posted by Jabberwocky View Post
​​​
I hope ARM and RISC-V find more ways of improving over x86 and not in nonobjective hype tactics. Massive speculation: It seems like AMD is going after inter-CCD latency in the next 2 years. AMD will likely grow at 10 to 15% IPC with Zen5 and ~10% with Zen6 (completely new chip layout design, new infinity fabric, CCD stacking) we might see Arrow Lake and ARM growing more efficient than that over the next 2 years so it will be interesting. We will see some interesting laptops next year with AMD bringing out some new things too but we might see delays for the big stuff like Strix Halo (likely 2025) but we still should see Strix Point in 2024 due to Windows 11 AI requirements. Both AMD Strix Point and Snapdragon X Elite will go for ~40 to ~45 TOPS if the rumors are right.
I love low-power passive cooled devices, so Cortex-A5 over Cortex-A7 and Cortex-X models... never mind x86.
I do not understand the RISC-V hype; they are fake, as the Libre-SOC people discovered.

RISC-V is just a distraction from real free and fast solutions like OpenPOWER.

I have never seen any benchmark result for a fast RISC-V chip; they clearly do not exist.

"I hope ARM and RISC-V find more ways of improving over x86 and not in nonobjective hype tactics."

Do you remember the time when Intel and AMD CPUs were produced on a 14nm node?
At that time, POWER9 CPUs produced at 14nm beat all the x86 chips in single-core performance.
Just as an example of how pointless x86 is.

After that you can no longer compare the CPUs, because one is at 10nm, the other is at 3nm, and the IBM POWER10 is at 7nm.

And any difference in the production node would just make you cry "nonobjective hype tactics".

"see Strix Point in 2024 due to Windows 11 AI requirements"

What are these requirements? Last time I checked, AMD's inference accelerators were 8-bit integer,
but minifloats like 4-bit, 6-bit and 8-bit floating point clearly beat 8-bit integer.

This means the hardware company that manages to include 4/6/8-bit floating point as well as 8-bit integer will win this.
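A rough sketch of why the minifloat argument holds at the same bit width; the format below is a generic 1-sign/4-exponent/3-mantissa layout with the top exponent reserved IEEE-style for Inf/NaN, not any particular FP8 spec (real E4M3-style formats reclaim most of those codes and reach a somewhat higher maximum):

```c
/* Generic 8-bit minifloat vs int8, just to show the dynamic-range point.
 * Layout assumed here: 1 sign bit, 4 exponent bits, 3 mantissa bits, with
 * the all-ones exponent reserved IEEE-style for Inf/NaN. This is NOT any
 * particular FP8 spec. */
#include <stdio.h>

int main(void)
{
    int exp_bits = 4, man_bits = 3;
    int bias = (1 << (exp_bits - 1)) - 1;                 /* 7 */

    double max_normal = (2.0 - 1.0 / (1 << man_bits))     /* 1.875 */
                      * (double)(1 << ((1 << exp_bits) - 2 - bias));
    double min_subnormal = 1.0 / (double)(1 << (bias - 1 + man_bits));

    printf("int8:        -128 .. 127, uniform step of 1\n");
    printf("8-bit float: +/- %.0f max, smallest positive %.6g,\n"
           "             fine steps near zero, coarse steps near the max\n",
           max_normal, min_subnormal);
    return 0;
}
```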

Comment


  • #42
    Originally posted by coder View Post

    You'll find lots of analysis and investigation of these sorts of questions on Chips & Cheese. For instance, they did a detailed performance comparison (i.e. not just benchmarks, but also analysis) on ARM Neoverse N1 vs. Zen 2, a few years ago:

    Just recently, they posted analysis of ARM's X2 cores, via "Snapdragon 8+ Gen 1" (note that it's a phone SoC - not one of their laptop SoCs).


    If performance is comparable and efficiency is way better, then it could definitely have implications for the laptop and server markets.
    Thanks for the feedback. I'm going to keep a close eye on these developments and rethink my views, specifically around variable-length decoders.

    I based some of my views on this old paper: https://www.researchgate.net/publica...uction_Decoder

    The reason why I'm bringing it up is because of section "3.2 Microbenchmark Design". Do you think it's practical to write tests like that and monitor system power in order to deduce decoder efficiency?

    I'm just wondering how we can find more data on where the single-core performance / efficiency comes from in ARM's, Intel's and AMD's latest cores (without just assuming that it's the decoder width).
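    For what it's worth, the measurement side of such a test can be sketched on Linux with the powercap/RAPL energy counter; the sysfs path below is the usual one but may differ per system, the workload is only a placeholder, and since RAPL gives package-level energy, decoder cost has to be inferred by comparing two variants of the same microbenchmark (e.g. op-cache-resident vs decoder-bound) rather than measured directly:

```c
/* Sketch of the measurement side on Linux: read the package energy counter
 * exposed by the powercap (RAPL) driver, run a workload, read it again and
 * report joules. Needs read permission on the sysfs file (usually root);
 * the path below is the common one for package 0 but may differ per system. */
#include <stdio.h>
#include <stdlib.h>

static long long read_energy_uj(void)
{
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    if (!f) { perror("energy_uj"); exit(1); }
    long long uj;
    if (fscanf(f, "%lld", &uj) != 1) { fclose(f); exit(1); }
    fclose(f);
    return uj;
}

/* Placeholder workload -- swap in the microbenchmark under test. */
static volatile double sink;
static void workload(void)
{
    double x = 1.0;
    for (long i = 0; i < 1000000000L; i++)
        x = x * 1.0000001 + 0.5;
    sink = x;
}

int main(void)
{
    long long before = read_energy_uj();
    workload();
    long long after = read_energy_uj();
    /* The counter wraps at max_energy_range_uj; a robust tool handles that. */
    printf("package energy: %.3f J\n", (after - before) / 1e6);
    return 0;
}
```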

    Comment


    • #43
      Originally posted by Jabberwocky View Post
      I based some of my views on this old paper: https://www.researchgate.net/publica...uction_Decoder​
      That's analyzing Haswell -- a 10-year-old CPU with only a 4-wide decoder.

      Originally posted by Jabberwocky View Post
      ​The reason why I'm bringing it up is because of section "3.2 Microbenchmark Design". Do you think it's practical to write tests like that and monitor system power in order to deduce decoder efficiency?​
      I've seen Chips & Cheese try to evaluate the energy cost of decoders, but it was too long ago for me to remember their exact approach or which CPU they were analyzing. I think it essentially amounted to the same thing - try to blow out the uOp cache and look at what that does to power usage and performance.

      One thing to note about their older articles analyzing Zen 2 and earlier: a more recent article mentions that AMD's RAPL values, which they were relying on for power data, seem to be merely estimates rather than direct measurements.
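      A minimal sketch of that blow-out idea, assuming x86-64 with GCC/Clang inline asm: run the same number of dynamic NOPs once from a tiny loop body that stays resident in the op cache and once as a huge straight-line body that should exceed typical op-cache capacities. The 16384-NOP and iteration counts are guesses, not vendor numbers, and single-byte NOPs are easy on the simple decoders, so any gap measured this way understates the cost for denser instruction mixes:

```c
/* Blow-out sketch (x86-64, build with "gcc -O2 blowout.c"): execute the
 * same number of dynamic NOPs twice -- once from a tiny 8-NOP loop body
 * that stays resident in the op cache, once as a 16384-NOP straight-line
 * body that should exceed typical op-cache capacities (a few thousand uops)
 * while still fitting in L1i, so the legacy decoders work every iteration.
 * Compare wall time, or wrap each phase with a RAPL read to compare energy. */
#include <stdio.h>
#include <time.h>

#define NOP1   __asm__ __volatile__("nop");
#define NOP8   NOP1 NOP1 NOP1 NOP1 NOP1 NOP1 NOP1 NOP1
#define NOP64  NOP8 NOP8 NOP8 NOP8 NOP8 NOP8 NOP8 NOP8
#define NOP512 NOP64 NOP64 NOP64 NOP64 NOP64 NOP64 NOP64 NOP64
#define NOP4K  NOP512 NOP512 NOP512 NOP512 NOP512 NOP512 NOP512 NOP512
#define NOP16K NOP4K NOP4K NOP4K NOP4K

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    const long outer = 1L << 20;   /* ~17 billion NOPs per variant */

    /* Small footprint: 16384 NOPs per outer iteration from a tight loop. */
    double t0 = now();
    for (long i = 0; i < outer; i++)
        for (long j = 0; j < 16384 / 8; j++) { NOP8 }
    double t1 = now();

    /* Large footprint: the same 16384 NOPs laid out as straight-line code. */
    for (long i = 0; i < outer; i++) { NOP16K }
    double t2 = now();

    printf("op-cache-resident loop: %.2f s\n", t1 - t0);
    printf("decoder-bound loop:     %.2f s\n", t2 - t1);
    return 0;
}
```

      If the straight-line variant is noticeably slower or hungrier at the same dynamic instruction count, that difference is the op-cache/decoder effect for this particular instruction mix.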

      Originally posted by Jabberwocky View Post
      I'm just wondering how we can find more data on where the single-core performance / efficiency comes
      ​They sometimes publish comparative analysis of core efficiency.

      "Reviews across the internet show Alder Lake getting very competitive performance with very high power consumption. For example, Anandtech measured 272 W of package power during a POV-Ray run."


      I think they're missing a key point, however. When running a highly-threaded workload, the optimal strategy should be to clock all your cores so they're delivering the same energy-efficiency. Then, once the E-cores are maxed out, you continue increasing clock speeds on the P-cores. Since the efficiency of the overall chip is an average of its cores' efficiency, having those E-cores in the mix is still a win.
      [Chips & Cheese chart: energy efficiency vs. clock speed for P-cores and E-cores in x264, showing the crossover point discussed below.]
      That crossover on x264 looks worse than it really is. The idea is that once the E-cores reach the crossover point, you don't increase their clock speeds until the P-cores get to 4.0 GHz. Then, you bump the E-cores up to 3.5 GHz. After that, increase P-cores to 4.2 GHz. Next, max the E-cores at 3.8 GHz. Beyond that, any remaining power budget goes to pushing the P-cores to 4.5 GHz and beyond.

      If someone isn't really thinking very hard, you could see how they look at the graph and think: "oh, when you hit that crossover point, you just switch over to P-cores and that's it". But it's wrong (in a multithreaded workload, at least) because that's an efficiency graph and the P-cores become very inefficient towards the top of their frequency envelope.
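      One way to make that allocation strategy concrete is a greedy sketch that hands out the power budget in steps, each time to whichever cluster currently returns the most extra throughput per extra watt. The frequency/power/throughput tables below are invented for illustration and are not Alder Lake measurements:

```c
/* Toy model of the allocation strategy described above: hand out the power
 * budget in steps, each time to whichever cluster (P or E) currently gives
 * the most additional throughput per additional watt. All operating points
 * are invented. */
#include <stdio.h>

struct cluster {
    const char   *name;
    int           step;     /* current operating point */
    int           nsteps;
    const double *power;    /* watts at each operating point */
    const double *perf;     /* throughput (arbitrary units) at each point */
};

static double marginal(const struct cluster *c)
{
    if (c->step + 1 >= c->nsteps) return 0.0;  /* already at the top */
    return (c->perf[c->step + 1] - c->perf[c->step]) /
           (c->power[c->step + 1] - c->power[c->step]);
}

int main(void)
{
    /* Invented points: P-cores scale higher but get power-hungry,
     * E-cores top out early but stay efficient. */
    static const double p_pow[]  = { 20, 35, 55, 85, 125 };
    static const double p_perf[] = { 40, 60, 75, 86,  94 };
    static const double e_pow[]  = { 10, 18, 30 };
    static const double e_perf[] = { 25, 40, 52 };

    struct cluster p = { "P-cores", 0, 5, p_pow, p_perf };
    struct cluster e = { "E-cores", 0, 3, e_pow, e_perf };
    double budget = 150.0;   /* total package power budget in watts */

    for (;;) {
        double mp = marginal(&p), me = marginal(&e);
        if (mp <= 0.0 && me <= 0.0) break;        /* both clusters maxed out */
        struct cluster *pick = (mp >= me) ? &p : &e;
        double extra = pick->power[pick->step + 1] - pick->power[pick->step];
        if (p.power[p.step] + e.power[e.step] + extra > budget) break;
        pick->step++;
        printf("bump %s to point %d  (package %.0f W, throughput %.0f)\n",
               pick->name, pick->step,
               p.power[p.step] + e.power[e.step],
               p.perf[p.step] + e.perf[e.step]);
    }
    return 0;
}
```

      With these made-up tables the E-cores get bumped first while they are the more efficient place to spend watts, then the P-cores take over, which is the pattern described above.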

      Comment


      • #44
        I have often heard the argument here on Phoronix that ARM decoders are faster or more efficient, but it's hard to really prove that. Just comparing decoder width doesn't give any clues either, because x86 can do more with one instruction (meaning one instruction decodes to more micro-ops than on ARM). Also, those decoders are optimized to saturate the core; just increasing their width might not improve overall performance. And finally, a 6-wide decoder at 5 GHz is theoretically faster than an 8-wide one at 3 GHz.

        So a 3 GHz ARM design needs a wider decoder to get the same throughput as higher-clocked designs.
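        Putting numbers on that last point (peak decode throughput is just width times clock; sustained throughput of course depends on everything downstream):

```c
/* Peak decode rate is width x clock; the raw numbers make the point. */
#include <stdio.h>

int main(void)
{
    printf("6-wide @ 5 GHz: %2.0f G instructions/s peak decode\n", 6 * 5.0);
    printf("8-wide @ 3 GHz: %2.0f G instructions/s peak decode\n", 8 * 3.0);
    return 0;
}
```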

        Comment


        • #45
          Originally posted by Anux View Post
          Just comparing decoder width doesn't give any clues either, because x86 can do more with one instruction (meaning one instruction decodes to more micro-ops than on ARM).
          Except that the frontend of an x86 CPU typically has only one "complex" decoder, while the rest are "simple".

          Originally posted by Anux View Post
          Also, those decoders are optimized to saturate the core,
          If that were true, then why have the op cache? Conversely, if most ops are served from the op cache, why not make the backend wide enough to take advantage of it?

          Originally posted by Anux View Post
          just increasing their width might not improve overall performance. And finally, a 6-wide decoder at 5 GHz is theoretically faster than an 8-wide one at 3 GHz.
          The backend width doesn't match the frontend. That's what tells us decoders are a potential bottleneck.

          Here's data collected from Zen 2, across a variety of workloads, showing that the op cache indeed provides performance benefits. As I mentioned before, I think the power data might be unreliable.
          Keep in mind that the op cache doesn't totally eliminate decoder bottlenecks, so this is an underestimate of the extent to which the decoder can be one.

          Comment


          • #46
            Sorry, this thread is broken for me; I can't really quote you properly.

            "then why have the op cache?"
            For the same reason ARM has one, or basically anything has a cache: to reduce bottlenecks and maximize utilization of the different stages. The micro-op cache is an integral part of the decoder.

            "The backend width doesn't match the frontend. That's what tells us decoders are a potential bottleneck."
            As is every other stage of the CPU, depending on the workload. It is all a careful trade-off between area, power and utilization of the CPU's stages.
            Last edited by Anux; 07 November 2023, 05:01 AM.

            Comment


            • #47
              Originally posted by Anux View Post
              For the same reason ARM has one, or basically anything has a cache: to reduce bottlenecks and maximize utilization of the different stages.
              Caches always have overhead, in both silicon area and energy usage. Therefore, you only put a cache where it provides more benefit than it costs.

              In ARM's case, they've been removing them from the ARMv9 cores, now that v9 allows them to drop backward compatibility with AArch32. That simplifies their decoders to the point where they're cheap enough that the mOP cache no longer pulls its weight.

              Originally posted by Anux View Post
              The micro-op cache is an integral part of the decoder.
              There's no reason it should be, nor is it depicted that way in block diagrams.

              Originally posted by Anux View Post
              As is every other stage of the CPU, depending on the workload.
              This is a non-answer. If x86 CPUs had sufficiently wide decoders to keep their backend fed, then the decode rate should match the issue rate, which it doesn't. Intel got a big boost when it added the uOp cache in Sandy Bridge, and a similar thing happened for AMD with Zen. All of the evidence, plus the analysis and empirical data I've cited, points to x86 decoders being a bottleneck. You can continue to deny this if you want, but that doesn't make it untrue.

              Comment


              • #48
                Curious what the results would be if the CPUs were run within spec, as the peak power consumption of both chips was above spec. Was the BIOS using out-of-spec settings?

                Comment
