Intel Announces 13th Gen "Raptor Lake" - Linux Benchmarks To Come

  • AdrianBc
    replied
    Originally posted by coder View Post
    Well, let's hope they get those cost tables updated accordingly, for zenver4 in gcc and llvm! It seems an easy substitution to simply use a temporary register target and then write that out. Won't help the inveterate assembly programmers, but the enlightened among us who use compiler intrinsics should hopefully not see much impact.

    Any idea how likely they are to fix it in a future stepping?

    In the current manufacturing processes, the cost of chip revisions has become exceedingly high, running to millions of dollars for even the simplest changes.

    So CPU design companies do not make new revisions except for bugs so serious that they would expose the company to legal liability, e.g. security bugs or data-corruption bugs.

    If you look at the errata lists for the Intel CPUs (euphemistically named "Specification Update") and for the AMD CPUs (euphemistically named "Revision Guide"), for each CPU model there may be up to one hundred bugs that have the resolution "Won't fix".

    Because this bug, after being patched in microcode (the workaround always used to avoid data-corruption bugs, which cannot be tolerated), only slows execution, and because performance-improving workarounds are possible in compilers, it is likely that it will not be fixed in the desktop CPUs before whatever models AMD introduces at the end of 2023.

    In the best case, the bug was discovered early enough for it to be corrected in the laptop Zen 4 CPUs, expected at the beginning of 2023.






  • coder
    replied
    Originally posted by piotrj3 View Post
    The only case of exceeding the power limit was in some AVX loads, and even that was minor (around 5%).
    Try 13%.


    Originally posted by piotrj3 View Post
    Now you have the 7950X, and power draws as well as boost frequencies are a total rollercoaster among reviewers.
    I don't really get what you're complaining about. Isn't it the dream of overclockers to have a CPU that's only thermally-limited? If you want to impose lower power limits, you can do it in BIOS.
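    (On typical AM5 boards that means either enabling Eco Mode or setting explicit PPT/TDC/EDC caps under the Precision Boost Overdrive menu, e.g. a PPT of roughly 142 W to approximate a 105 W-TDP part, though exact menu names vary by board vendor.)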

    Originally posted by piotrj3 View Post
    GN had 251W.
    That's not at all atypical, when you do extreme overclocking, which is essentially what he did.

    Originally posted by piotrj3 View Post
    Another issue is that one reviewer earlier claimed a 65W-TDP 7950X outperforming the 12900K; the problem was that, on the same graph, the package power draw of the 7950X was 90W
    I'll grant you this one point: that TDP is misleading when the actual PPT is 1.35x that much. Based on my simplistic understanding, I don't know why they're not equal. If someone can point me at a compelling rationale, I'd appreciate it.
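    For reference, the commonly documented relationship on AM4/AM5 is PPT ≈ 1.35 × TDP, so the 170 W TDP works out to roughly 1.35 × 170 W ≈ 230 W of package power.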

    Originally posted by piotrj3 View Post
    So I am not certain that AMD will be more efficient in this generation.
    power consumption != power efficiency. Also, power efficiency changes, depending on the SKU and TDP configuration. It's not a single number that characterizes all models in all configurations.

    Because of that, it really matters why you're looking at it. If you just want to compare the microarchitecture and manufacturing process, then you will want to compare comparable models running in a similar power envelope (and not a similarly-named power envelope, but as close as you can get to one that's actually equivalent).

    If you want to compare the typical end user power efficiency, then compare comparable models at stock settings, with a normal case & cooler, running on a defined workload.

    People tend to take the highest number from the most extreme part, in the highest-power configuration, and use that to characterize the entire product line. However, that's only applicable to those intending to run that part in that configuration.
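    A made-up illustration of why the distinction matters: a configuration averaging 230 W that finishes a render in 100 s uses 230 W × 100 s = 23 kJ, while one capped at 125 W that needs 170 s uses 125 W × 170 s ≈ 21 kJ. The capped run draws less power and less energy here, but with slightly different numbers the faster, higher-power run can come out ahead on energy, which is exactly why a single peak-wattage figure says little about efficiency.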



  • coder
    replied
    Originally posted by AdrianBc View Post
    You have just said exactly the same thing that I have said.

    When both threads are active on a P-core, each of them has about 60% of the performance of the same core with only 1 active thread, so both threads increase the performance of the core to about 120%. Therefore a thread on an E-core has about the same performance as that of one of the 2 threads on a P-core with both threads active.
    According to https://www.anandtech.com/show/17047...d-complexity/9 8P2T is faster than 8P1T (both DDR5) by approximately:
    • 17.5% faster @ SPEC2017int
    • 2.1% faster @ SPEC2017fp

    So, it's a little worse than you say for int, and much worse for fp. FWIW, the numbers I quoted previously were based on the single-thread aggregate scores comparing 1P1T vs. 1E.

    Too bad they didn't test 8P1T + 8E, but we can at least see how much 8E adds to 8P2T:
    • 25.9% faster @ SPEC2017int
    • 7.9% faster @ SPEC2017fp

    Working backwards, that implies the 8P1T + 8E vs. 8P1T + 0E should be at least:
    • 30.4% faster @ SPEC2017int
    • 8.1% faster @ SPEC2017fp

    Still, that suggests the 8 E-cores are, in aggregate, 30.4% and 8.1% as fast as the 8P1T cores. Compared to 8P2T, the 8E are 25.9% and 7.9% as fast. Obviously, those numbers reveal some scaling problems.
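    Spelling out that back-calculation from the numbers above (and assuming the 8 E-cores contribute the same absolute score whether they sit on top of 8P1T or 8P2T): the E-cores add 0.259 × 1.175 ≈ 0.304 of the 8P1T int score and 0.079 × 1.021 ≈ 0.081 of the 8P1T fp score, which is where the 30.4% and 8.1% figures come from.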

    Now, an interesting fact about enabling the E-cores is that it creates a bottleneck in the ring bus, because the ring has to be down-clocked to accommodate the E-core cluster. So, the multithreaded tests with E-cores show them adding less than what they'd ideally be capable of contributing.



    Not only that, but there's doubtless some clock throttling going on, as the CPU bumps into its various power limits. And then there's DDR5, which is clearly much less of a bottleneck, but feeding so many threads with 2 channel-pairs is still going to starve them relative to their single-thread performance.

    Originally posted by AdrianBc View Post
    When all the available threads are active on an Alder Lake or Raptor Lake, the SMT threads on the P-cores and the single threads on the E-core have about the same performance,
    No, the above data and what Intel has previously stated both indicate the E-cores each add more performance than doubling up a P-core. Granted, we're only looking at aggregates over the SPEC2017 bench and for an all-thread scenario, but the data supports Intel's claims.

    Originally posted by AdrianBc View Post
    so a Raptor Lake with 8 x 2 threads on P-Cores + 16 x 1 threads on E-cores has about the same performance as a CPU with 16 x 2 threads on P-cores,
    If we tread the fraught path of extrapolation, the above data suggests 16P2T would deliver 125.64 and 8P2T + 16E would deliver 94.98 on SPEC2017int. I won't go down the same path for SPEC2017fp, because my extrapolation would be off by even more.

    However, essentially what you're saying is that 16 E-cores = 8P2T-cores. I think your error is in assuming the E-cores scale equally well. However, having 4 of them sharing a cache slice and ring bus stop is an impediment to this. I'm not saying my extrapolation is valid, but I think you overestimate them (or at least this implementation thereof).

    Originally posted by AdrianBc View Post
    This is not a coincidence. The Intel designers are not stupid so they have chosen this performance ratio so that the speed will not vary wildly when the threads happen to be migrated between cores by the operating system scheduler.
    That is not a requirement of a hybrid architecture. All that's required is for the OS to have some idea how much useful work each thread is able to do, so that it can ensure they all get scheduled fairly.

    Originally posted by AdrianBc View Post
    1 thread on each E-core and 2 threads on each P-core, and in the latter case all threads have similar performance.
    Cool story.

    Let's look at it this way. We'll divide out the points per thread, in the different permutations of P-core loading and E-core loading.
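    (The divisors are simply the thread counts: 8 for 0P + 8E and 8P1T + 0E, 16 for 8P1T + 8E and 8P2T + 0E, and 24 for 8P2T + 8E; e.g. 79.06 / 24 ≈ 3.29 int/thread.)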


                 0P + 8E   8P1T + 0E   8P1T + 8E   8P2T + 0E   8P2T + 8E
    int            29.81       53.45       69.69       62.82       79.06
    fp             38.07       72.38       78.22       73.92       79.76
    int/thread      3.73        6.68        4.36        3.93        3.29
    fp/thread       4.76        9.05        4.89        4.62        3.32

    Again, the 8P1T + 8E column is merely an estimate. However, this confirms that you really do want to load them the way Intel recommends.

    The other thing that's interesting is that the performance of an int-heavy thread is higher in an 8P2T configuration than when moving it to an E-core, but the aggregate performance still argues in favor of going 8P1T + 8E and then just balancing execution time of the threads between P-cores and E-cores.

    For fp-heavy threads, it's always better to put the thread on an E-core than double up a P-core, both in aggregate and just in terms of its performance potential.

    Originally posted by AdrianBc View Post
    Sandy Bridge had a full 256-bit width implementation for the floating point instructions, including for multiplication and addition, which matter most for the power consumption.

    Haswell added 256-bit implementations for the integer instructions and it also replaced the multiplier and adder of Sandy Bridge with two FMA units, which double the computation throughput, but it also doubled the power consumption, causing the down-clocking problems that you mention.
    Thanks for the info.

    I think we agree. It was just the notion of going to 512 bits @ 32 nm (on a general-purpose CPU) which I thought was ludicrous.



  • coder
    replied
    Originally posted by AdrianBc View Post
    Also, it appears that Zen 4 had a bug in the vpcompressd instruction, only for the case when the destination is in the memory, and that bug was discovered late, so it was patched with a microcoded sequence.

    Because of that, on Zen 4 vpcompressd with a memory destination is abnormally slow, even if it is fast with a register destination and vpexpandd is fast even with a memory operand.

    So an AVX-512 program intended to run on Zen 4 should replace vpcompressd with a memory destination with an equivalent instruction sequence using vpcompressd with a register destination.
    Well, let's hope they get those cost tables updated accordingly, for zenver4 in gcc and llvm! It seems an easy substitution to simply use a temporary register target and then write that out. Won't help the inveterate assembly programmers, but the enlightened among us who use compiler intrinsics should hopefully not see much impact.
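    To make that concrete, here is a minimal sketch of the kind of substitution I mean, using standard AVX-512 intrinsics (my own illustration, not code from gcc/llvm; the function, the "keep positive values" predicate, and the assumption that the destination buffer has a full vector of slack past the end are all hypothetical):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Stream-compact the positive int32 values of src into dst while avoiding
     * the memory-destination form of vpcompressd. Assumes dst has at least
     * 16 ints of slack past the last valid output. */
    size_t compact_positive(const int32_t *src, int32_t *dst, size_t n)
    {
        size_t out = 0;
        for (size_t i = 0; i + 16 <= n; i += 16) {
            __m512i v = _mm512_loadu_si512(src + i);
            __mmask16 keep = _mm512_cmpgt_epi32_mask(v, _mm512_setzero_si512());

            /* Slow on Zen 4 (micro-coded memory form):
             *     _mm512_mask_compressstoreu_epi32(dst + out, keep, v);
             * Instead: compress into a register, do a plain full-width store,
             * and advance the output pointer by the number of kept elements. */
            __m512i packed = _mm512_maskz_compress_epi32(keep, v);
            _mm512_storeu_si512(dst + out, packed);
            out += (size_t)_mm_popcnt_u32((unsigned)keep);
        }
        return out; /* the n % 16 tail is left for scalar code */
    }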

    Any idea how likely they are to fix it in a future stepping?



  • piotrj3
    replied
    Originally posted by atomsymbol

    They made a video about power consumption of 7950X - but they didn't make an in-depth video. An in-depth analysis of CPU power-efficiency would look somewhat different.



    I think you don't quite get/understand it. If you take the cost of electricity into account (which you should; unlike GamersNexus), then the most efficient setup of running Blender on Ryzen 7000 is a single point: it is a point "X" that is the highest one on an ⋂-shaped curve. The probability that [the values (Watts, Amperes) attributed to X depend on whether the cooler can dissipate 250W or "just" 190W] is quite low, because the most cost&power-efficient way of running Blender is well below 250W.

    Can you point me to the time where any GamersNexus Ryzen 7000 review video is showing the point X on such ⋂-shaped curve?

    Your statement that ".... one reviewer will claim 190W power draw, another 220W another 250W" is true only because those reviewers don't know how to properly review the CPU so that most potential Ryzen 7000 buyers/users can find their use-case in that review, which has a primary cause in the fact that people watching/reading those reviews don't demand those reviews to be more complex.

    Can you point me to the time in any GamersNexus Ryzen 7000 review video about which you can say "This point here: that will precisely be my use-case"?

    Without taking costs into account, the best way of running Blender on Ryzen 7000 is to use liquid nitrogen to cool the CPU.
    The issue is that it is subjective. Anyway, my issue is that AMD changed the definition of their TDP (which is exactly what the video mentions). Before, TDP was effectively the power your CPU drew from the EPS rail. Now it is something else. Intel, meanwhile, specifies two figures, a base power draw and a boost power draw, and with the exception of some AVX-512 workloads you will not exceed that boost power draw.

    So you have the Intel 12900K: https://ark.intel.com/content/www/us...-5-20-ghz.html
    You see a boost power draw of 241W. As a reviewer you might open the Intel Extreme Tuning Utility or some other tool and check whether the processor is thermal throttling. So if you don't see thermal throttling, the 12900K will perform almost exactly the same for Phoronix, GN, Linus Tech Tips, AnandTech, Ars Technica, etc., even under different coolers. Keep in mind I am talking about the 12900K with its power limits unlimited in duration. The only case of exceeding the power limit was in some AVX loads, and even that was minor (around 5%).

    Now you have the 7950X, and power draws as well as boost frequencies are a total rollercoaster among reviewers.

    On AMD's site you see one figure: 170W.
    Phoronix had a max power draw of 230W, but a max temperature of 96°C, which implies throttling.
    GN had 251W.
    Hardware Unboxed had 355W whole-system power consumption (not easy to compare, but 130W above a Ryzen 5950X).
    Linus Tech Tips had 190W and, yes, throttling.

    Another issue is that one reviewer earlier claimed a 65W-TDP 7950X outperforming the 12900K; the problem was that, on the same graph, the package power draw of the 7950X was 90W (which is just 30W under the maximum of a 5950X). And now Intel claims the 13900K in multicore workloads will have the same performance in 65W mode as the 12900K at its 241W power limit. So I am not certain that AMD will be more efficient in this generation. Because if Intel literally draws 65W on average to reach that performance, then a Ryzen dialed down from 90W to 65W will lose a lot of performance here. At that point I don't know.



  • AdrianBc
    replied
    Originally posted by coder View Post
    LOL, wut?

    No, they have only about 60% the integer performance of a P-core running 1 thread. Where the E-cores are faster is to load them instead of putting a second thread on a P-core.
    You have just said exactly the same thing that I have said.

    When both threads are active on a P-core, each of them has about 60% of the performance of the same core with only 1 active thread, so both threads increase the performance of the core to about 120%. Therefore a thread on an E-core has about the same performance as that of one of the 2 threads on a P-core with both threads active.

    When all the available threads are active on an Alder Lake or Raptor Lake, the SMT threads on the P-cores and the single threads on the E-core have about the same performance, so a Raptor Lake with 8 x 2 threads on P-Cores + 16 x 1 threads on E-cores has about the same performance as a CPU with 16 x 2 threads on P-cores, but at a smaller area and power consumption.

    This is not a coincidence. The Intel designers are not stupid so they have chosen this performance ratio so that the speed will not vary wildly when the threads happen to be migrated between cores by the operating system scheduler.

    While you are right that when only a part of the threads are active, it is always better to start a thread on an idle E-core instead of the second thread on a P-core, most programs either use only a few threads running on few of the P-cores, or they use all the available threads, 1 thread on each E-core and 2 threads on each P-core, and in the latter case all threads have similar performance.


    Originally posted by coder View Post

    Sandy Bridge was a 32 nm CPU and it didn't even implement AVX at full 256-bit width. I think they didn't do that until Haswell, which used 22 nm. And Haswell had an infamous clock-throttling issue with AVX2-heavy workloads, although it pales in comparison to the AVX-512 clock throttling problems Intel had on the 14 nm CPUs where they introduced it.

    My point is that what you're talking about is a low-clocked, in-order Larrabee core. You cannot compare that to a high-clocked out-of-order, general-purpose CPU core. Even 2016 was too soon for Intel to deploy AVX-512 on general-purpose cores @ full width. It was a big mistake, due to all of the clock-throttling problems it caused. Possibly 10 nm ESF (AKA "Intel 7") is the first time it really makes sense.

    Sandy Bridge had a full 256-bit width implementation for the floating point instructions, including for multiplication and addition, which matter most for the power consumption.

    Haswell added 256-bit implementations for the integer instructions and it also replaced the multiplier and adder of Sandy Bridge with two FMA units, which double the computation throughput, but it also doubled the power consumption, causing the down-clocking problems that you mention.

    The AVX instruction set added only minimal improvements over SSE, except for extending the registers to 256 bits and allowing 3-address instructions instead of 2-address instructions.

    The Larrabee New Instructions, later renamed AVX-512, were a completely new instruction set that was much better designed than MMX/SSE/AVX.

    AVX-512 has nothing to do with the width of the execution units, which determines the power consumption that can cause down-clocking problems. You can implement AVX-512 even with 64-bit wide execution units in a very cheap implementation.

    AVX-512 has nothing to do with whether the CPU has in-order or out-of-order execution.

    For lower cost, it would have been very easy in Sandy Bridge to implement only the 256-bit versions of the AVX-512 instructions, with only 16 registers, at a cost very close to that of the AVX implementation, while providing a much simpler path for the future extension of the ISA.

    The choice between AVX and the Larrabee New Instructions for Sandy Bridge had absolutely nothing to do with the technical merits of the 2 instruction sets. Those 2 instruction set extensions were designed in parallel by different Intel teams, working on different continents. It is pretty certain that there was no adequate communication between the teams and that their relationship was more one of competition than of cooperation.

    So it is likely that the A team would have found it absurd to discuss merging a possibly better design from some secondary team into the Sandy Bridge project, rather than developing their own ISA extension independently of the other teams, even if the NIH approach resulted in an inferior ISA.

    A couple of years later, Haswell added a few of the instructions provided earlier by Larrabee and Knights Corner, e.g. fused multiply-add and gather, but due to the initial design of AVX it was impossible to add the most important AVX-512 features, like the mask registers.
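    To illustrate the difference (a sketch of my own, not tied to any particular CPU discussed here): with AVX-512VL the mask lives in a dedicated k register even for 256-bit operations, whereas plain AVX/AVX2 has to keep the "mask" in a vector register and apply it with an explicit blend:

    #include <immintrin.h>

    /* y[i] += x[i] only where x[i] > 0, written two ways. */

    /* AVX-512VL: the predicate is a k mask register, even at 256-bit width. */
    __m256 add_positive_avx512vl(__m256 x, __m256 y)
    {
        __mmask8 m = _mm256_cmp_ps_mask(x, _mm256_setzero_ps(), _CMP_GT_OQ);
        return _mm256_mask_add_ps(y, m, y, x); /* masked-off lanes keep y */
    }

    /* AVX/AVX2: the mask is a vector of all-ones/all-zeros lanes and the
     * selection needs a separate blend instruction. */
    __m256 add_positive_avx2(__m256 x, __m256 y)
    {
        __m256 m   = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_GT_OQ);
        __m256 sum = _mm256_add_ps(y, x);
        return _mm256_blendv_ps(y, sum, m);
    }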









    Last edited by AdrianBc; 30 September 2022, 07:04 AM.



  • AdrianBc
    replied
    Originally posted by coder View Post
    Well, you're comparing Zen 4 to Skylake-era cores. So, of course it's better than those. What's more interesting is to compare it with Sapphire Rapids' Golden Cove AVX-512. Do you know of any analysis of it, via Alder Lake?
    At the end of the page

    https://www.mersenneforum.org/showthread.php?p=614191

    there is the table with the measured throughputs and latencies for Zen 4




    The same throughput and latency table for an Alder Lake with AVX-512 enabled is at




    In general Zen 4 has either the same or better throughputs and latencies in comparison with Sapphire Rapids, but there are 2 important exceptions.

    As mentioned before, Sapphire Rapids will have two 512-bit FMA units, thus double throughput for FMA.

    Besides that, Sapphire Rapids will have an approximately double throughput for the gather instructions.

    The same double throughput for gather is also valid for the AVX2 variant of gather, i.e. for Raptor Lake/Alder Lake when running AVX2 code vs. Zen 4.


    Also, it appears that Zen 4 had a bug in the vpcompressd instruction, only for the case when the destination is in the memory, and that bug was discovered late, so it was patched with a microcoded sequence.

    Because of that, on Zen 4 vpcompressd with a memory destination is abnormally slow, even if it is fast with a register destination and vpexpandd is fast even with a memory operand.

    So an AVX-512 program intended to run on Zen 4 should replace vpcompressd with a memory destination with an equivalent instruction sequence using vpcompressd with a register destination.








  • WannaBeOCer
    replied
    Originally posted by coder View Post
    As already mentioned many times, 7950X delivers very strong performance at lower power thresholds. It remains a perfectly viable & competitive solution in such configurations. Whether the same can be said of Raptor Lake remains to be seen, but I wouldn't count on it.

    Also, I'd like to see their raw measurement data. Specifically, how much of the time did the benchmarks which ran > 200 W stay at such elevated levels?
    According to Intel, the 13900K provides the same performance as the 12900K while drawing 65W. From early leaks, the 13900K already outperforms the 7950X in synthetic benchmarks and uses about the same power. With 100W more we're going to see a 6GHz Raptor Lake. Meanwhile, the mid-range 13700K/13600K should mostly be a decent amount ahead of the 7700X/7600X. I wouldn't be shocked if AMD lowers the price of the 7900X to $450 to compete with the 13700K.



  • coder
    replied
    Originally posted by piotrj3 View Post
    Don't use AMD's pictures for that. GamersNexus made an in-depth video analyzing the power consumption of the 7950X, and they found that power consumption is extremely high as long as you can cool the chip (its power draw is optimized not towards a power target but towards reaching 95°C).
    His measurements don't refute their claims. You need to understand that power consumption is not the same thing as power efficiency, and that there's more than one way to run the CPU.

    Originally posted by piotrj3 View Post
    So the 7950X can take on average... 251W just on the EPS rail ON AVERAGE during Blender if your cooling allows that.
    Alder Lake will do the same thing, on gaming boards. Intel allows it to stay in boost mode indefinitely, so the boost duration is ultimately limited by your cooling solution.

    Originally posted by piotrj3 View Post
    11:51 you have broken efficiency promises.
    If people want it to run efficiently, they just need to select the desired TDP and optionally Eco mode. Alder Lake doesn't even give you that option.

    I find it funny that people are up in arms about this. It's a race to the bottom scenario. I don't get why you somehow expect AMD to take "the high road", when it would mean losing market share to an even less-efficient Intel CPU. As long as Intel is playing these games, AMD has no choice but to respond.



  • piotrj3
    replied
    Originally posted by coder View Post
    This was true until Zen 3. Once Zen 3 happened, Intel actually had to raise clock speeds & power consumption of its 14 nm CPUs even to compete in single-threaded performance!

    That held until Alder Lake, which enabled Intel to comfortably regain the single-threaded lead, although they seemed reluctant to take their foot off the gas (i.e. clock speeds).


    Leaving aside the issue of the E-cores, let's stay focused on generational power-efficiency improvements. AMD delivered this:





    So, their fundamental efficiency indeed improved. This will be virtually impossible for Intel to do in Raptor Lake, because they have the same microarchitecture being made on virtually the same process node. So, fundamental efficiency will not drastically change.

    We can also see that AMD traded some of those efficiency gains for better performance, by increasing clock speeds. Intel will do the same. However, by not starting from a lower base like AMD, Intel's single-threaded efficiency can pretty much only get worse, in Gen 13. If they kept the same clocks as Gen 12, then we could see some small improvement, but they've already said they won't.


    The main place where Raptor Lake can possibly lower power consumption is in workloads with about 24 threads, because half of those threads will now move to the additional E-cores instead of over-taxing the 8 P-cores. In all-core workloads, the throughput added via 8 additional E-cores should actually enable better perf/W than Alder Lake. The pity is that power consumption of such workloads is so very high, due to their aggressive clocking.

    However, it's incorrect to say that Raptor Lake is chiefly about improving power-efficiency. If that were true, they wouldn't be increasing clock speeds, as well. What Intel is doing with Raptor Lake is to look for performance gains anywhere they can find them. Faster clock speeds, bigger L2 cache, faster DDR5, and more E-cores. It's all really about performance.


    That's not really true. AMD's APUs were much more power-efficient. The 5800X was an outlier, in terms of power-efficiency for the 5000-series.

    If their next-gen APUs remain monolithic, then I think it'll be a similar story. However, the penalty of Ryzen 7000's MCM architecture should be lower, now that the I/O die is 6 nm (in the 5000 series it was either 14 nm or 12 nm).
    Don't use AMD's pictures for that. GamersNexus made an in-depth video analyzing the power consumption of the 7950X, and they found that power consumption is extremely high as long as you can cool the chip (its power draw is optimized not towards a power target but towards reaching 95°C). So the 7950X can take on average... 251W just on the EPS rail ON AVERAGE during Blender if your cooling allows that. This is why the Ryzen 7950X is such a hard chip to rate: one reviewer will claim 190W power draw, another 220W, another 250W, just on the EPS rail. And all of them will also have different performance claims. Meanwhile, Intel at 241W on Alder Lake was very representative: power draw is capped, so if your cooling allows that power draw you will have the same performance as the reviewer. https://youtu.be/nRaJXZMOMPU?t=541

    11:51 you have broken efficiency promises.
    Last edited by piotrj3; 29 September 2022, 06:49 PM.

