Intel Announces 13th Gen "Raptor Lake" - Linux Benchmarks To Come


  • #81
    Originally posted by AdrianBc View Post
    You have just said exactly the same thing that I have said.

    When both threads are active on a P-core, each of them has about 60% of the performance of the same core with only 1 active thread, so both threads increase the performance of the core to about 120%. Therefore a thread on an E-core has about the same performance as that of one of the 2 threads on a P-core with both threads active.
    According to https://www.anandtech.com/show/17047...d-complexity/9 8P2T is faster than 8P1T (both DDR5) by approximately:
    • 17.5% faster @ SPEC2017int
    • 2.1% faster @ SPEC2017fp

    So, it's a little worse than you say for int, and much worse for fp. FWIW, the numbers I quoted previously were based on the single-thread aggregate scores comparing 1P1T vs. 1E.

    Too bad they didn't test 8P1T + 8E, but we can at least see how much 8E adds to 8P2T:
    • 25.9% faster @ SPEC2017int
    • 7.9% faster @ SPEC2017fp

    Working backwards, that implies the gain of 8P1T + 8E over 8P1T + 0E should be at least:
    • 30.4% faster @ SPEC2017int
    • 8.1% faster @ SPEC2017fp

    Still, that suggests the 8 E-cores are, in aggregate, 30.4% and 8.1% as fast as the 8P1T cores. Compared to 8P2T, the 8E are 25.9% and 7.9% as fast. Obviously, those numbers reveal some scaling problems.
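    Here's the back-of-envelope math as a quick sketch (nothing but the Anandtech deltas quoted above goes into it):

      # "Working backwards": the E-cores' add-on, expressed relative to 8P1T.
      # (8P2T+8E - 8P2T) / 8P1T = e_gain * (8P2T / 8P1T) = e_gain * (1 + smt_gain)
      smt = {"int": 0.175, "fp": 0.021}   # 8P2T vs. 8P1T
      e   = {"int": 0.259, "fp": 0.079}   # 8P2T+8E vs. 8P2T
      for k in smt:
          print(k, f"{e[k] * (1 + smt[k]):.1%}")   # int: 30.4%, fp: 8.1%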

    Now, an interesting fact about enabling the E-cores is that it creates a bottleneck in the ring bus, because the ring down-clocks to match the E-core cluster's ring stop. So, the multithreaded tests with E-cores show them adding less than what they'd ideally be capable of contributing.



    Not only that, but there's doubtless some clock throttling going on, as the CPU bumps into its various power limits. And then there's DDR5, which is clearly much less of a bottleneck, but feeding so many threads with 2 channel-pairs is still going to starve them relative to their single-thread performance.

    Originally posted by AdrianBc View Post
    When all the available threads are active on an Alder Lake or Raptor Lake, the SMT threads on the P-cores and the single threads on the E-core have about the same performance,
    No, the above data and what Intel has previously stated both indicate the E-cores each add more performance than doubling up a P-core. Granted, we're only looking at aggregates over the SPEC2017 bench and for an all-thread scenario, but the data supports Intel's claims.

    Originally posted by AdrianBc View Post
    so a Raptor Lake with 8 x 2 threads on P-Cores + 16 x 1 threads on E-cores has about the same performance as a CPU with 16 x 2 threads on P-cores,
    If we tread the fraught path of extrapolation, the above data suggests 16P2T would deliver 125.64 and 8P2T + 16E would deliver 94.98 on SPEC2017int. I won't go down the same path for SPEC2017fp, because my extrapolation would be off by even more.

    However, essentially what you're saying is that 16 E-cores = 8 P2T-cores. I think your error is in assuming the E-cores scale just as well, when having 4 of them share a cache slice and ring bus stop is an impediment to that. I'm not saying my extrapolation is valid, but I think you overestimate them (or at least this implementation thereof).
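    As a rough sanity check, take the 0P + 8E aggregate from the table further down and optimistically assume a second cluster of 8 scales perfectly:

      e8_int, p8t2_int = 29.81, 62.82   # 0P+8E and 8P2T+0E SPEC2017int aggregates
      print(2 * e8_int)                 # 59.62 points for a hypothetical 16E
      print(2 * e8_int / p8t2_int)      # ~0.95: ~5% short of 8P2T, even before
                                        # any shared-L2/ring-stop scaling losses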

    Originally posted by AdrianBc View Post
    This is not a coincidence. The Intel designers are not stupid so they have chosen this performance ratio so that the speed will not vary wildly when the threads happen to be migrated between cores by the operating system scheduler.
    That is not a requirement of a hybrid architecture. All that's required is for the OS to have some idea how much useful work each thread is able to do, so that it can ensure they all get scheduled fairly.
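    A toy sketch of what I mean (hypothetical capacities, and nothing like how a real scheduler works): as long as the OS knows each core type's relative capacity, it can rotate threads across cores so they all make equal progress, whatever the P/E performance ratio happens to be.

      # Toy model: identical threads on cores of known but unequal capacity.
      caps = [1.0] * 8 + [0.6] * 8   # hypothetical P-core / E-core capacities
      threads = len(caps)
      # Fair target: each thread advances at total/threads work per second,
      # achievable by periodically rotating threads between core types.
      print(sum(caps) / threads)     # 0.8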

    Originally posted by AdrianBc View Post
    1 thread on each E-core and 2 threads on each P-core, and in the latter case all threads have similar performance.
    Cool story.

    Let's look at it this way. We'll divide out the points per thread, in the different permutations of P-core loading and E-core loading.


    Config        0P + 8E   8P1T + 0E   8P1T + 8E   8P2T + 0E   8P2T + 8E
    int             29.81       53.45       69.69       62.82       79.06
    fp              38.07       72.38       78.22       73.92       79.76
    int/thread       3.73        6.68        4.36        3.93        3.29
    fp/thread        4.76        9.05        4.89        4.62        3.32

    Again, the 8P1T + 8E column is merely an estimate. However, this confirms that you really do want to load them the way Intel recommends.
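    For clarity, the per-thread rows are just each aggregate divided by the number of loaded hardware threads:

      configs = {                    # (int, fp, hardware threads loaded)
          "0P + 8E":   (29.81, 38.07,  8),
          "8P1T + 0E": (53.45, 72.38,  8),
          "8P1T + 8E": (69.69, 78.22, 16),   # the estimated column
          "8P2T + 0E": (62.82, 73.92, 16),
          "8P2T + 8E": (79.06, 79.76, 24),
      }
      for name, (i, f, t) in configs.items():
          print(f"{name:11s} int/thread {i / t:.2f}   fp/thread {f / t:.2f}")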

    The other thing that's interesting is that the performance of an int-heavy thread is higher in an 8P2T configuration than when it's moved to an E-core, but the aggregate still argues in favor of going 8P1T + 8E and then just balancing the threads' execution time between P-cores and E-cores.

    For fp-heavy threads, it's always better to put the thread on an E-core than double up a P-core, both in aggregate and just in terms of its performance potential.

    Originally posted by AdrianBc View Post
    Sandy Bridge had a full 256-bit width implementation for the floating point instructions, including for multiplication and addition, which matter most for the power consumption.

    Haswell added 256-bit implementations for the integer instructions and it also replaced the multiplier and adder of Sandy Bridge with two FMA units, which double the computation throughput, but it also doubled the power consumption, causing the down-clocking problems that you mention.
    Thanks for the info.

    I think we agree. It was just the notion of going to 512 bits @ 32 nm (on a general-purpose CPU) which I thought was ludicrous.



    • #82
      Originally posted by piotrj3 View Post
      The only going over the limit was in some AVX loads, but that was minor (like 5%).
      Try 13%.


      Originally posted by piotrj3 View Post
      now you have the 7950X, and power draws as well as boost frequencies are a total rollercoaster among reviewers.
      I don't really get what you're complaining about. Isn't it the dream of overclockers to have a CPU that's only thermally-limited? If you want to impose lower power limits, you can do it in BIOS.

      Originally posted by piotrj3 View Post
      GN had 251W.
      That's not at all atypical, when you do extreme overclocking, which is essentially what he did.

      Originally posted by piotrj3 View Post
      Another issue is that one reviewer claimed a 65W TDP 7950X outperforming the 12900K; the problem was that on the same graph, the package draw was 90W on the 7950X
      I'll grant you this one point: that TDP is misleading when the actual PPT is 1.35x that much. Based on my simplistic understanding, I don't know why they're not equal. If someone can point me at a compelling rationale, I'd appreciate it.
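      Whatever the rationale, the 1.35x ratio does at least reconcile the numbers in that complaint (quick check):

        # AMD's PPT (socket power limit) = 1.35 * TDP, per the ratio above
        for tdp in (65, 105, 170):
            print(f"{tdp}W TDP -> {1.35 * tdp:.1f}W PPT")
        # 65W -> 87.8W, right around the ~90W package draw on that graph;
        # 105W -> 141.8W (the familiar 142W); 170W -> 229.5W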

      Originally posted by piotrj3 View Post
      So I am not certain that AMD will be more efficient in this generation.
      power consumption != power efficiency. Also, power efficiency changes, depending on the SKU and TDP configuration. It's not a single number that characterizes all models in all configurations.

      Because of that, it really matters why you're looking at it. If you just want to compare the microarchitecture and manufacturing process, then you will want to compare comparable models running in a similar power envelope (and not a similarly-named power envelope, but as close as you can get to one that's actually equivalent).

      If you want to compare the typical end user power efficiency, then compare comparable models at stock settings, with a normal case & cooler, running on a defined workload.

      People tend to take the highest number from the most extreme part, in the highest-power configuration, and use that to characterize the entire product line. However, that's only applicable to those intending to run that part in that configuration.
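      To make that concrete, here's a sketch with made-up numbers (purely illustrative, not from any review):

        # One hypothetical part measured at three power limits (invented data).
        points = {65: 200, 105: 240, 230: 270}   # watts -> benchmark points
        for watts, score in points.items():
            print(f"{watts:3d}W -> {score / watts:.2f} points/W")
        # 3.08, 2.29, 1.17 points/W: the unlimited config scores highest but is
        # by far the least efficient -- and it's the number people tend to quote.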



      • #83
        Originally posted by coder View Post
        Well, let's hope they get those cost tables updated accordingly for znver4 in gcc and llvm! It seems an easy substitution to simply use a temporary register target and then write that out. It won't help the inveterate assembly programmers, but the enlightened among us who use compiler intrinsics should hopefully not see much impact.

        Any idea how likely they are to fix it in a future stepping?

        With current manufacturing processes, the cost of chip revisions has become exceedingly high, running to millions of dollars for even the simplest changes.

        So CPU design companies do not make new revisions except for bugs serious enough to expose them to legal liability, e.g. security bugs or data corruption bugs.

        If you look at the errata lists for the Intel CPUs (euphemistically named "Specification Update") and for the AMD CPUs (euphemistically named "Revision Guide"), for each CPU model there may be up to one hundred bugs that have the resolution "Won't fix".

        Because this bug, once patched via microcode (the workaround always used for data corruption bugs, which cannot be tolerated), merely slows execution, and because performance-preserving workarounds are possible in compilers, it is likely that it will not be fixed in the desktop CPUs before whatever models AMD introduces at the end of 2023.

        In the best case, the bug was discovered early enough for it to be corrected in the laptop Zen 4 CPUs, expected at the beginning of 2023.






        • #84
          Originally posted by piotrj3 View Post

          The issue is that it's subjective. Anyway, my issue is that AMD changed the definition of their TDP (which is exactly what the video mentions). Before, TDP was actually the power your CPU drew from the EPS rail.
          Are there still people who don't know the meaning of TDP? It's been discussed everywhere and it's on Wikipedia. Thermal Design Power, like the name suggests, is not electrical power consumption; it only correlates with it. If you raise the Tjunction temp, the TDP gets lower while power consumption might actually rise a little.



          • #85
            Originally posted by coder View Post
            According to https://www.anandtech.com/show/17047...d-complexity/9 8P2T is faster than 8P1T (both DDR5) by approximately:
            • 17.5% faster @ SPEC2017int
            • 2.1% faster @ SPEC2017fp

            So, it's a little worse than you say for int, and much worse for fp. FWIW, the numbers I quoted previously were based on the single-thread aggregate scores comparing 1P1T vs. 1E.

            A gain from SMT of 20% to 25%, or sometimes even up to 30%, is typical when you run completely unrelated programs on the two threads of a core, because only then is there a good chance that the main-memory loads or branch mispredictions of one thread will coincide in time with instructions from the other thread that can be executed immediately.

            The most common application that gains a lot from SMT is compiling a large software project, where each thread compiles a different source file, and the threads not only stall frequently on branch mispredictions and cache misses, but also while waiting for SSD or HDD operations.


            The SPEC benchmark is notorious for having a low gain from SMT, which is expected, because all threads run the same program.

            Moreover, the lower SMT gain for floating-point applications is also well known and expected: the execution time of such programs is dominated by loops with perfect branch prediction, they include a large percentage of computational instructions that can be overlapped with loads, and most data is reused several times, so it is loaded from the various cache levels rather than from main memory.

            Because of this, most floating-point applications can achieve a very high percentage of use of the execution units, so there are few opportunities for executing the second thread of a core. For floating-point applications, it is not uncommon to achieve better performance by disabling SMT.


            So, the numbers you presented are indeed typical of the SPEC benchmark, and they may also be representative of multi-threaded programs that load all threads with similar computations, but they are not typical of the SMT gain when unrelated programs run on the two threads, where the gain can be much higher.


            So 20% can be taken as a median value among the SMT gains of these different use cases.







            • #86
              Originally posted by AdrianBc View Post
              With current manufacturing processes, the cost of chip revisions has become exceedingly high, running to millions of dollars for even the simplest changes.

              So CPU design companies do not make new revisions except for bugs serious enough to expose them to legal liability, e.g. security bugs or data corruption bugs.
              Right. They wouldn't do a stepping just for this. However, I still wonder how many steppings they typically do over a product's lifetime. For instance, the B2 stepping of the 5800X looks to have some nice improvements that are quite plausibly just a collection of errata fixes.



              One thing I like about Raptor Lake is that, with no major microarchitecture changes, it's basically just a patched and tuned version of Alder Lake. In other words, you could look at it as the CPU Alder Lake was meant to be.

              Traditionally, I've been a late-adopter, hoping to benefit from various fixes in later chip steppings, board revisions, and firmware fixes.

              Originally posted by AdrianBc View Post
              If you look at the errata lists for the Intel CPUs (euphemistically named "Specification Update") and for the AMD CPUs (euphemistically named "Revision Guide"), for each CPU model there may be up to one hundred bugs that have the resolution "Won't fix".
              That doesn't necessarily mean that some aren't opportunistically fixed in later steppings.



              • #87
                Originally posted by AdrianBc View Post
                Moreover, the lower SMT gain for floating-point applications is also well known and expected: the execution time of such programs is dominated by loops with perfect branch prediction, they include a large percentage of computational instructions that can be overlapped with loads, and most data is reused several times, so it is loaded from the various cache levels rather than from main memory.

                Because of this, most floating-point applications can achieve a very high percentage of use of the execution units, so there are few opportunities for executing the second thread of a core. For floating-point applications, it is not uncommon to achieve better performance by disabling SMT.
                Thanks for acknowledging that point. I hope you'll further acknowledge that it blows a hole in your theory that 1E = 1P2T / 2, or that such a thing was even a design requirement of Intel's. This is too simplistic.

                They created the Thread Director specifically to aid the OS in more effective thread scheduling, in acknowledgement of the challenges a hybrid design poses.



                • #88
                  Originally posted by coder View Post
                  I'll grant you this one point: that TDP is misleading when the actual PPT is 1.35x that much. Based on my simplistic understanding, I don't know why they're not equal. If someone can point me at a compelling rationale, I'd appreciate it.
                  TDP is a made-up marketing term that is meaningless beyond a basic correlation with power use, and only a correlation within the same CPU lineup, not across generations, since AMD and Intel fully reserve the right to change the variables in their made-up calculations at any point. An example factor that goes into it is "room temperature during testing". Which room temp did they use to get the numbers they are advertising? No idea; they won't tell you that.

                  They advertise a 105W TDP part because marketing thinks it sounds better than saying it's a 142W part. Simple as that.



                  • #89
                    I'd say TDP is a rough estimate for the thermal solution, which guarantees that the CPU is going to work at least at base frequency. This is mostly for shit-tier PC builders, who can put in a minimal, cheap cooler and be more or less sure the CPU won't throttle. Actual CPU power consumption is higher, so if you want to sustain high boosts you need a better thermal solution. This is my interpretation of why the TDP parameter exists and why it's lower than actual consumption.
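                    For what it's worth, AMD has published a formula to this effect: TDP is defined from the assumed cooler, not from measured power draw. The constants below are the ones widely reported for their 105W parts, so treat them as approximate:

                      # AMD: TDP (W) = (Tcase_max - Tambient) / theta_ca
                      t_case_max = 61.8    # C, max case temperature (as reported)
                      t_ambient  = 42.0    # C, assumed ambient (as reported)
                      theta_ca   = 0.189   # C/W, cooler thermal resistance
                      print((t_case_max - t_ambient) / theta_ca)   # ~104.8 -> "105W"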

                    As for Zen 4, it's obvious that AMD made the stock power parameters stupid to compete with Intel's stupid PL2. In reality we should evaluate CPU performance in some sane power range, say 65-150W. I don't see the point of performance graphs where CPUs draw 240+ W on a mainstream desktop; it's insane. AFAIK the performance gains from running Zen 4 beyond 150W are basically negligible, so an extra 100W for <~15% is just irrational. It should not be the default.
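                    Quick math on that last point:

                      # ~150W -> ~250W buys at most ~15% more performance
                      p_lo, p_hi = 150, 250
                      print(p_hi / p_lo - 1)         # +67% power draw...
                      print(1.15 / (p_hi / p_lo))    # ...for ~0.69x the perf/W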
                    Last edited by drakonas777; 02 October 2022, 09:07 AM.



                    • #90
                      Originally posted by coder View Post
                      Try 13%.



                      I don't really get what you're complaining about. Isn't it the dream of overclockers to have a CPU that's only thermally-limited? If you want to impose lower power limits, you can do it in BIOS.


                      That's not at all atypical, when you do extreme overclocking, which is essentially what he did.


                      I'll grant you this one point: that TDP is misleading when the actual PPT is 1.35x that much. Based on my simplistic understanding, I don't know why they're not equal. If someone can point me at a compelling rationale, I'd appreciate it.


                      power consumption != power efficiency. Also, power efficiency changes, depending on the SKU and TDP configuration. It's not a single number that characterizes all models in all configurations.

                      Because of that, it really matters why you're looking at it. If you just want to compare the microarchitecture and manufacturing process, then you will want to compare comparable models running in a similar power envelope (and not a similarly-named power envelope, but as close as you can get to one that's actually equivalent).

                      If you want to compare the typical end user power efficiency, then compare comparable models at stock settings, with a normal case & cooler, running on a defined workload.

                      People tend to take the highest number from the most extreme part, in the highest-power configuration, and use that to characterize the entire product line. However, that's only applicable to those intending to run that part in that configuration.
                      GN got 251W at stock, not by extreme overclocking. The only thing you need to hit 250W+ is a very good cooler.
