AMD Ryzen 9 7900X / Ryzen 9 7950X Benchmarks Show Impressive Zen 4 Linux Performance


  • #71
    Originally posted by jrch2k8 View Post
    I would have no problem at all if the CPU came out of the box at 105 W and you could later unlock the 170 W tier, but the reverse sets a dangerous trend: motherboards, for one, will get more expensive since you have to design for 170 W (because it is the default); an AIO can become an implicit requirement because by default the chip can reach 95°C, and while AMD claims that is safe for the silicon, that is not necessarily true for other parts of the hardware, especially the power-delivery components, etc.
    I know people are going to hate this, but maybe the best solution is some form of regulation. That's often what it takes to stop an industry "race-to-the-bottom" scenario, like the power consumption race that's heated up between Intel and AMD.

    It doesn't have to be in the form of an outright ban (which I'm not in favor of), but it could be as simple as labeling requirements, or as extreme as an extra tax levied on inefficient computers. Similar techniques have previously been adopted for household appliances and even automobiles.

    It's not ideal, because I don't trust regulators to get it 100% right, and you know that manufacturers will try to game whatever they come up with, but I think it's better than nothing.

    Comment


    • #72
      Originally posted by coder View Post
      I know people are going to hate this, but maybe the best solution is some form of regulation. That's often what it takes to stop an industry "race-to-the-bottom" scenario, like the power consumption race that's heated up between Intel and AMD.

      It doesn't have to be in the form of an outright ban (which I'm not in favor of), but it could be as simple as labeling requirements, or as extreme as an extra tax levied on inefficient computers. Similar techniques have previously been adopted for household appliances and even automobiles.
      Or you have to accept that those who want lower TDPs are a small minority and most people don't care. Just buy 65 W versions; there are always plenty of them. Or use cTDP.

      Comment


      • #73
        Originally posted by Anux View Post
        Or you have to accept that those who want lower TDPs are a small minority and most people don't care. Just buy 65 W versions; there are always plenty of them. Or use cTDP.
        As jrch2k8 correctly points out, the issue is out-of-the-box behavior. That's the main thing reviewers test (though they often explore other settings as well) and how most users are likely to run the CPU.

        I'd just want to see something like an efficiency-labeling scheme that reflects the out-of-the-box config. Then, if the minority who wants high performance at all costs wants to, they can change the defaults and get that extra 10%-15% of performance for 2x the power.
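A quick back-of-the-envelope sketch of that tradeoff. All numbers here are illustrative assumptions, not measurements: suppose the 105 W default config scores 100 and the unlocked 170 W config scores ~12% higher.

```python
# Perf-per-watt comparison between a hypothetical stock 105 W config
# and an unlocked 170 W config that buys ~12% more performance.
def perf_per_watt(score, watts):
    return score / watts

stock = perf_per_watt(100, 105)      # baseline score at default power
unlocked = perf_per_watt(112, 170)   # ~12% faster at ~1.6x the power
loss_pct = (stock - unlocked) / stock * 100  # efficiency given up
print(round(loss_pct, 1))  # roughly 30% worse perf/W for the last ~12%
```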

        Without clear & consistent labeling, even most well-intentioned buyers won't be sophisticated enough to know how much power the computer could actually use.

        Comment


        • #74
          Originally posted by coder View Post
          I know people are going to hate this, but maybe the best solution is some form of regulation. That's often what it takes to stop an industry "race-to-the-bottom" scenario, like the power consumption race that's heated up () between Intel and AMD.

          It doesn't have to be in the form of an outright ban (which I'm not in favor of), but it could be as simple as labeling requirements, or as extreme as an extra tax levied on inefficient computers. Similar techniques have previously been adopted for household appliances and even automobiles.

          It's not ideal, because I don't trust regulators to get it 100% right, and you know that manufacturers will try to game whatever they come up with, but I think it's better than nothing.
          The pathetic and embarrassing slow-walking of ATX12VO by motherboard and PSU manufacturers has a far greater effect on desktop efficiency than peak CPU power. People are worried about the difference between a 9 minute compile job that takes 230 W vs a 10 minute compile job that takes 142 W, when the computer is idling at 50+W for the many hours they spend reading, thinking, and typing between compile jobs.
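The arithmetic behind that comparison is worth spelling out; the 7-hour idle figure below is an assumption for illustration, the rest comes from the numbers above.

```python
# Energy math: the per-compile-job difference is small next to
# hours of 50+ W idle draw between jobs.
def watt_hours(watts, minutes):
    return watts * minutes / 60

fast_job = watt_hours(230, 9)      # 9 min at 230 W = 34.5 Wh
slow_job = watt_hours(142, 10)     # 10 min at 142 W ~ 23.7 Wh
idle_day = watt_hours(50, 7 * 60)  # 7 hours at 50 W idle = 350 Wh
print(fast_job - slow_job, idle_day)  # ~10.8 Wh per job vs 350 Wh idle
```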

          On that note, AMD is claiming big idle power optimizations on the IO die, but I haven't seen any reviews yet that put serious focus on wall power at idle. Then again, I haven't looked that hard. PCWorld shows it on their graphs in a few places, and it doesn't look great. But the reviewer motherboards are likely to be super high end with all kinds of energy-wasting RGB nonsense, and with Zen 3, memory frequencies above 3200 MT/s caused a significant increase in idle power.

          Comment


          • #75
            Originally posted by coder View Post
            And the compiler is limited in its ability to express those dependencies by the number of ISA registers.
            Some notes, opinions and facts:

            That is only partially true: if compiled code writes to a particular CPU register (such as %rax) more than once per clock cycle, register renaming means the limit isn't the number of ISA (user-visible) registers but the number of physical registers in the CPU and the size of the ROB (reorder buffer).

            Zen 4 has an integer register file of 224 [64-bit?] entries, floating-point register file of 192 [256-bit?] entries, and a reorder buffer of 320 entries.

            Compared to Zen 3: Zen 4 integer register file is 16.6% larger, floating-point register file is 20% larger, and the reorder buffer is 25% larger. The facts that (1) AMD slides report that on average Zen 4 has approximately 13% higher IPC than Zen 3 at freq=4GHz, and that (2) IPC gains in the benchmarked applications range from 1% to 36% (the average being 13%), are NOT a coincidence.
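A sanity check of those growth percentages. The Zen 3 sizes (192 / 160 / 256) are inferred from the quoted percentages and the Zen 4 figures above; they match the published Zen 3 numbers.

```python
# Zen 3 -> Zen 4 structure growth implied by the figures above.
zen3 = {"int_rf": 192, "fp_rf": 160, "rob": 256}
zen4 = {"int_rf": 224, "fp_rf": 192, "rob": 320}
growth_pct = {k: round((zen4[k] - zen3[k]) / zen3[k] * 100, 1) for k in zen3}
print(growth_pct)  # {'int_rf': 16.7, 'fp_rf': 20.0, 'rob': 25.0}
```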

            In simplified terms: most of the other improvements in the Zen 4 core (compared to the Zen 3 core), such as the 2x larger L2 cache per core or the 12.5% higher number of µops per clock from the µop cache, are needed to keep the percentage utilization of the larger register files and of the larger ROB approximately the same as in Zen 3, so that in total Zen 4 is a well-balanced micro-architecture.

            That said, the Intel Alder Lake P-core has a ROB of 512 entries, which suggests that Alder Lake is a less-balanced architecture than Zen 4, which has a much smaller reorder buffer. The utilization (effectiveness) of Alder Lake's P-core ROB is approximately 25% lower than the utilization of Zen 4's ROB when these CPUs are running the same application.

            https://en.wikichip.org/wiki/amd/mic...tectures/zen_4

            https://en.wikichip.org/wiki/intel/m...es/golden_cove

            Originally posted by coder View Post
            Sure, you could have a CPU which identifies spills, but there are practical limits on what can be efficiently implemented in hardware. This stuff not only takes die area, but also energy and time.
            Zen 4 can execute only 2 stores per cycle. With such a small number of stores per cycle, it is almost pointless to attempt memory renaming.

            A limiting factor for more stores per cycle is page-table lookups (paged virtual memory). Memory segmentation would enable a higher number of stores per cycle, but (1) operating system developers (such as Linus Torvalds) are against memory segmentation because it substantially increases the complexity of the operating system, and (2) programming languages without segmentation-friendly type systems, such as C/C++/Go/Rust, are the prevalent programming languages today.

            Originally posted by coder View Post
            Reducing memory stalls increases the amount of time a CPU core can spend getting useful work done.
            It depends on the ratio of [the number of memory stalls] to [the number of concurrent memory requests in flight].
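That ratio can be made concrete with Little's law: sustained bandwidth is bounded by (requests in flight × line size) / memory latency. The 80 ns latency below is a hypothetical round number, not a measured Zen 4 figure.

```python
# Little's law sketch: how many requests must be in flight to hide latency.
def max_bandwidth_gbs(requests_in_flight, line_bytes=64, latency_ns=80):
    # bytes per nanosecond is numerically equal to GB/s
    return requests_in_flight * line_bytes / latency_ns

print(max_bandwidth_gbs(10))  # 8.0 GB/s with 10 outstanding misses
print(max_bandwidth_gbs(32))  # 25.6 GB/s with 32 outstanding misses
```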
            Last edited by atomsymbol; 27 September 2022, 02:26 PM.

            Comment


            • #76
              Originally posted by yump View Post

              The pathetic and embarrassing slow-walking of ATX12VO by motherboard and PSU manufacturers has a far greater effect on desktop efficiency than peak CPU power. People are worried about the difference between a 9 minute compile job that takes 230 W vs a 10 minute compile job that takes 142 W, when the computer is idling at 50+W for the many hours they spend reading, thinking, and typing between compile jobs.

              On that note, AMD is claiming big idle power optimizations on the IO die, but I haven't seen any reviews yet that put serious focus on wall power at idle. Then again, I haven't looked that hard. PCWorld shows it on their graphs in a few places, and it doesn't look great. But the reviewer motherboards are likely to be super high end with all kinds of energy-wasting RGB nonsense, and with Zen 3, memory frequencies above 3200 MT/s caused a significant increase in idle power.
              You have a point there, but my main concern is that the default power draw sets the standard for components, and that drives costs up and reliability down.

              A higher power draw by default forces manufacturers (motherboard, PSU, etc.) to upgrade their materials and costs, because sustaining that amperage with the precision the CPU needs requires a better tier of components, cables, etc., or there are real risks of fires, shorts, overheating, and so on; and in places in the world where electricity is anything but stable, it may also severely shorten the life of the components.

              For reference, back in the day I saw FX-9590s catch fire (the 8-pin CPU connector overheated and shorted) and blow VRMs on anything but the very top tier of mobos.

              But if this trend continues, don't be surprised to see $100 mobos start creeping into $200+ tiers and entry-level PSUs moving to $200 and beyond, simply because CPUs/GPUs get lazy and just up the power draw every new generation (if they can save money by just drawing more power, they will). I mean, a Raptor Lake i7 plus a 4070 that is not really a 4070 but a 4080, because they want to charge you harder for a 4070, can probably trigger the overcurrent protection on a decent-ish Gold 600 W PSU when pushing both CPU and GPU, in just one generation.

              Comment


              • #77
                Originally posted by atomsymbol View Post
                That is only partially true: if compiled code writes to a particular CPU register (such as %rax) more than once per clock cycle, register renaming means the limit isn't the number of ISA (user-visible) registers but the number of physical registers in the CPU and the size of the ROB (reorder buffer).
                I understand how register renaming works, and that once you overwrite a register's contents, it effectively becomes a new register from a data-flow perspective.

                Where you get burned by a limited number of ISA registers is if the compiler runs out of registers to hold all of the intermediate state needed to compute a result, and then has to resort to spilling stuff to memory.
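A toy way to see the pressure point: x86-64 exposes only 16 general-purpose ISA registers, so any additional simultaneously-live intermediate value has to be spilled to the stack, regardless of how many physical registers the core has.

```python
# Toy spill model: live values beyond the ISA register count must go
# to memory (the compiler cannot name physical rename registers).
def spills_needed(live_values, isa_gprs=16):
    return max(0, live_values - isa_gprs)

print(spills_needed(12))  # 0 - everything fits in registers
print(spills_needed(20))  # 4 - four values spilled to the stack
```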

                Originally posted by atomsymbol View Post
                Zen 4 has an integer register file of 224 [64-bit?] entries, floating-point register file of 192 [256-bit?] entries, and a reorder buffer of 320 entries.
                Where did you find these stats?

                And yes, the physical registers referred to would be the 64-bit ones. I'm surprised about the number of physical FPU registers, but if those are indeed 256-bit, then it's not too much more than the 128 you'd need to support 2 threads (AVX-512 has 32 x 512-bit ISA registers = 64 x 256-bit per thread).
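The capacity arithmetic from that paragraph, spelled out (assuming, as discussed, that the core tracks 512-bit architectural registers as 256-bit halves):

```python
# AVX-512 defines 32 architectural 512-bit registers; tracked as
# 256-bit halves that is 64 entries per thread, 128 for two threads.
ISA_REGS = 32
halves_per_thread = ISA_REGS * 512 // 256  # 64
two_thread_floor = 2 * halves_per_thread   # 128 entries just for ISA state
zen4_fp_rf = 192
rename_headroom = zen4_fp_rf - two_thread_floor  # 64 entries for renaming
print(halves_per_thread, two_thread_floor, rename_headroom)
```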

                Originally posted by atomsymbol View Post
                IPC gains in the benchmarked applications range from 1% to 36% (the average being 13%), are NOT a coincidence.
                Yes, but I'm saying the IPC gains would be lower, if they'd kept the same DDR4 memory and L2 cache size.

                Originally posted by atomsymbol View Post
                That said, the Intel Alder Lake P-core has a ROB of 512 entries, which suggests that Alder Lake is a less-balanced architecture than Zen 4, which has a much smaller reorder buffer. The utilization (effectiveness) of Alder Lake's P-core ROB is approximately 25% lower than the utilization of Zen 4's ROB when these CPUs are running the same application.
                Whether a microarchitecture is balanced probably has more to do with the back end. And how do you know what actual utilization is like, for Alder Lake's Golden Cove cores?

                Originally posted by atomsymbol View Post
                Zen 4 can execute only 2 stores per cycle. With such a small number of stores per cycle, it is almost pointless to attempt memory renaming.
                You'd think it would be the other way around, right? I think the point of memory renaming is to nullify spills by eliminating the corresponding loads & stores!

                Originally posted by atomsymbol View Post
                A limiting factor for more stores per cycle is page-table lookups (paged virtual memory). Memory segmentation would enable a higher number of stores per cycle, but (1) operating system developers (such as Linus Torvalds) are against memory segmentation because it substantially increases the complexity of the operating system, and (2) programming languages without segmentation-friendly type systems, such as C/C++/Go/Rust, are the prevalent programming languages today.
                This is why I think directly-addressable scratchpad memory deserves a second look in general-purpose CPUs. GPUs and DSPs continue to use it to deliver very strong performance per watt. CPUs need to find a way to do likewise.

                For instance, you could give each thread 1 page of scratchpad memory. You could implement it by locking a corresponding block of L1 cache to a page of physical RAM, with exclusive semantics. That should avoid most of the normal cache & TLB overhead associated with memory accesses to it.

                Originally posted by atomsymbol View Post
                It depends on the ratio of [the number of memory stalls] to [the number of concurrent memory requests in flight].
                What I meant was roughly the amount of time the CPU core was starved by memory latency/bandwidth. In spite of the fancy OoO execution and prefetchers, there are still times these cores are underutilized, simply because most of the work they need to do relies on pending loads. Perhaps a less likely scenario is that they're log-jammed with pending stores.
                Last edited by coder; 29 September 2022, 10:36 AM.

                Comment
