AMD Posts New "AMD-PSTATE" CPUFreq Driver Leveraging CPPC For Better Perf-Per-Watt


  • #41
    Originally posted by numacross View Post

    Hold on, you're touching many points at once. I provided an explanation why desktop parts have higher idle power usage - the Infinity Fabric is an "external" interconnect, and that still holds. No matter how well the CPU chiplets are made they still have to communicate with the IO die, and the distance is long in terms of IC design. That will take a lot of power.

    ...

    Instead of the 5980HX you should be comparing desktop APU idle power usage. How do they do at idle? I wasn't able to find a HWiNFO idle screenshot, but from this benchmark it seems that idle usage has improved by 5-8W.
    Technically, Infinity Fabric is both an internal and an external interconnect. AMD's GPUs use Infinity Fabric as well and they're monolithic. I'm not disputing that going off-chiplet uses more power, but I'm not sure it's a significant amount at all. For example, the 5600G is the 5700G with disabled CPU and GPU cores and lower max clocks, but I imagine they have the same idle clocks. The numbers you provided show that the 5600G system uses 3 watts less power at idle. The Zen 3 family puts cores to sleep at idle though, so why would the core count make that large of a difference?

    If the chiplet-based desktop CPUs are noticeably disadvantaged when it comes to idle power usage because they use chiplets, then shouldn't we be seeing a decent jump in idle usage between the one- and two-chiplet designs? Instead we're seeing consistent idle power from the 5800X to the 5950X. Even if the answer is that the second chiplet is turned off at idle, then why is there a 2-watt difference between the 5600X and the 5800X when they're both just one chiplet? I have a feeling that more is being tweaked between them.

    I went and looked at that Guru3D article, and the one thing that stood out is that those numbers appear before the review establishes the hardware used for testing, so I went and looked at what hardware each CPU and APU was actually tested with.

    Their reviews for the 5600X, 5800X, 5900X, and 5950X were all done with this setup:

    X570 ASUS Crosshair VIII HERO
    GeForce RTX 2080 Ti
    2x8 GB DDR4-3600 CL16 (G.Skill)
    The hardware they used for the 5600G and 5700G reviews is listed as:

    X570 ASUS Crosshair VIII Formula / ASUS B550 STRIX
    GeForce RTX 3090
    2x32 GB DDR4 3600 MHz (G.Skill)
    That's three motherboards, two GPUs, and different DIMMs being used across those CPUs and APUs. Since Guru3D measures system power, any of these things could account for the 5-watt difference. And to be fair, we also don't know if the integrated GPU still sips additional idle power even when it's not being used.

    Originally posted by numacross View Post
    As avem wrote, Intel's equivalent monolithic desktop chip can provide roughly equivalent features, and it manages to achieve way lower idle power usage than Zen desktop CPUs. I then theorized that in the Zen desktop case it's probably a matter of BIOS settings meant to disable power-saving features on the IO die in favor of increasing stability. It seems that I was, at least partially, right.
    We're also comparing entirely different architectures on completely different fab processes at that point, though. That's far too many variables to try to determine the possible power cost of chiplet-based versus monolithic designs. Even then, if we use Guru3D's numbers, their i9-11900K system idles at 82 watts while Zen 3 chiplet-based systems idle at 69 watts. https://www.guru3d.com/index.php?ct=...=file&id=70512

    Hell, even within the Zen 3 lineup you have a mix of different fabrication processes, since the I/O die is on GlobalFoundries' 14nm process while the core chiplets and the entirety of the APUs use TSMC 7nm.
    Last edited by Myownfriend; 09 September 2021, 08:02 PM.



    • #42
      Originally posted by Myownfriend View Post
      Technically, Infinity Fabric is both an internal and an external interconnect. AMD's GPUs use Infinity Fabric as well and they're monolithic.
      Technically technically, Infinity Fabric is an internal (APU), near-external (Ryzen desktop, EPYC on-package), external (Radeon Pro Vega II Duo), and far-external (EPYC between CPU sockets) interconnect.

      Originally posted by Myownfriend View Post
      I'm not disputing that going off-chiplet uses more power, but I'm not sure it's a significant amount at all. For example, the 5600G is the 5700G with disabled CPU and GPU cores and lower max clocks, but I imagine they have the same idle clocks. The numbers you provided show that the 5600G system uses 3 watts less power at idle. The Zen 3 family puts cores to sleep at idle though, so why would the core count make that large of a difference?
      We do not know how much power putting a core to sleep really saves. AMD has also stated that they bin their chips heavily not only with respect to frequencies but also power curves, which get programmed into the chips themselves for the SMU to use. The CPPC2 data comes from this as well.

      Originally posted by Myownfriend View Post
      If the chiplet-based desktop CPUs are noticeably disadvantaged when it comes to idle power usage because they use chiplets, then shouldn't we be seeing a decent jump in idle usage between the one- and two-chiplet designs? Instead we're seeing consistent idle power from the 5800X to the 5950X.
      My theory would be that the IF links between the chiplet and IO dies do power down when not used, but the internal IF on the IO die does not. Testing with a CCD disabled in the BIOS would probably tell us more, but I don't have a 5950X.

      Originally posted by Myownfriend View Post
      Even if the answer is that the second chiplet is turned off at idle, then why is there a 2-watt difference between the 5600X and the 5800X when they're both just one chiplet? I have a feeling that more is being tweaked between them.
      The 5600X has two fewer cores, which were physically cut, so they shouldn't draw any power. How this affects the rumored on-chiplet dual ring bus, only AMD knows.

      Originally posted by Myownfriend View Post
      I went and looked at that Guru3D article, and the one thing that stood out is that those numbers appear before the review establishes the hardware used for testing, so I went and looked at what hardware each CPU and APU was actually tested with.

      [...]

      That's three motherboards, two GPUs, and different DIMMs being used across those CPUs and APUs. Since Guru3D measures system power, any of these things could account for the 5-watt difference. And to be fair, we also don't know if the integrated GPU still sips additional idle power even when it's not being used.
      That's unfortunate, and kind of unprofessional of them...

      Originally posted by Myownfriend View Post
      We're also comparing entirely different architectures on completely different manufacturing processes at that point, though. That's far too many variables to try to determine the possible power cost of chiplet-based versus monolithic designs. Even then, if we use Guru3D's numbers, their i9-11900K system idles at 82 watts while Zen 3 chiplet-based systems idle at 69 watts. https://www.guru3d.com/index.php?ct=...=file&id=70512
      I agree, there are way too many differing parameters here, but it's an interesting topic to say the least.



      • #43
        Originally posted by numacross View Post
        My theory would be that the IF links between the chiplet and IO dies do power down when not used, but the internal IF on the IO die does not. Testing with a CCD disabled in the BIOS would probably tell us more, but I don't have a 5950X.
        I'm not sure there would be a point to powering down the internal IF on the IO die. I have a 5950X, but honestly I don't feel like messing around with it any time soon.

        Originally posted by numacross View Post
        The 5600X has two fewer cores, which were physically cut, so they shouldn't draw any power.
        As do both chiplets in the 5900X, so we should still see idle power similar to the 5600X's.

        Originally posted by numacross View Post
        I agree, there are way too many differing parameters here, but it's an interesting topic to say the least.
        Agreed.



        • #44
          Originally posted by Myownfriend View Post
          If the chiplet-based desktop CPUs are noticeably disadvantaged when it comes to idle power usage because they use chiplets, then shouldn't we be seeing a decent jump in idle usage between the one- and two-chiplet designs? Instead we're seeing consistent idle power from the 5800X to the 5950X.
          If using chiplets helped with power usage, I feel like there's no way AMD wouldn't be using them in their laptop chips. They've gone to a ton of effort to try and get those chips competitive with what Intel is offering, and if they could even shave off half a watt of power I think it'd be a no-brainer.

          Instead, they've specifically chosen to make those low power chips be monolithic. I think the conclusion to draw from that is pretty straightforward.

          It's true enough that we don't have the chip architects' testing and notes on the topic, but it's also fairly well known in engineering circles that transferring data around on a chip is fairly power-expensive, and it's been stated that power use is one of the major blockers for having multiple chips in graphics right now.



          • #45
            Michael building a kernel? It must be really, *really* interesting to push him so far.
            ## VGA ##
            AMD: X1950XTX, HD3870, HD5870
            Intel: GMA45, HD3000 (Core i5 2500K)



            • #46
              Originally posted by smitty3268 View Post
              If using chiplets helped with power usage, I feel like there's no way AMD wouldn't be using them in their laptop chips. They've gone to a ton of effort to try and get those chips competitive with what Intel is offering, and if they could even shave off half a watt of power I think it'd be a no-brainer.
              I never said it would use less power than a monolithic design. I said there are ways it could, but that generally it would use slightly more power; I'm just not super convinced it would be a noticeable amount.

              Originally posted by smitty3268 View Post
              Instead, they've specifically chosen to make those low power chips be monolithic. I think the conclusion to draw from that is pretty straightforward.
              I'm not so sure that their desktop chips would be using chiplets if the entire line topped out at one CCD and one I/O die. One of the big advantages of chiplets for AMD is that they use the same small CCD from their 6-core desktop chips up to their 64-core server/workstation chips, with one I/O die used for consumer chips and another for server chips. They were even able to use the same I/O dies two generations in a row. That's four smaller dies being used to create 73 chips over the course of two generations, with insanely better yields than if they had decided to make 16- and 64-core monolithic dies with 72MB and 256MB of cache respectively.

              Their desktop and mobile APU lines don't have that same kind of range and have very different needs. Their APUs have integrated GPUs, support an older PCIe standard, needed support for an additional memory type, etc. If they used chiplets, then their entire line of APUs would probably have consisted of one combined GPU-and-I/O die paired with one of the same CCDs they used on their desktop chips. There would have been no one-to-many relationship; the GPU-and-I/O die would already be using up some of the 7nm manufacturing capacity that could go to CCDs while being larger than a CCD; pairing it with a chiplet would require a package substrate large enough to accommodate the two dies; and yes, a monolithic die would use the slightest bit less electricity. So they probably felt it made more sense to just use a monolithic die at that point.

              Originally posted by smitty3268 View Post
              It's true enough that we don't have the chip architects' testing and notes on the topic, but it's also fairly well known in engineering circles that transferring data around on a chip is fairly power-expensive, and it's been stated that power use is one of the major blockers for having multiple chips in graphics right now.
              GPUs and CPUs are very different. GPUs require insanely more bandwidth, which results in power issues that apply to monolithic and non-monolithic dies alike but become more of a problem with multi-chip GPUs, because multi-GPU real-time graphics workloads don't scale well.

              AMD's and Nvidia's GPUs both use immediate mode rendering (IMR), which means they pull in triangles, texture them, shade them, and spit out pixels to a buffer. This process requires access to a depth buffer to make sure that things are drawn the intended way. The depth buffer is also used to determine whether the pixel you're about to texture, shade, or blend will end up behind the pixels already in that location, so the GPU can save time and not render it. A GPU does that on multiple triangles at a time, so those pipelines have to maintain a certain level of sync so that they don't texture and shade finalized pixels that are being transformed at the same time.
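
              To make that depth-buffer dependency concrete, here's a toy sketch of the kind of per-pixel test an IMR does before committing texturing/shading work. It's purely illustrative Python on my part, nothing like real hardware, and the buffer sizes and shade function are made up:

              Code:
              # Toy depth-tested fragment write, showing why the depth buffer is
              # shared state that all in-flight triangles have to agree on.
              WIDTH, HEIGHT = 640, 480
              depth_buffer = [[float("inf")] * WIDTH for _ in range(HEIGHT)]
              color_buffer = [[(0, 0, 0)] * WIDTH for _ in range(HEIGHT)]

              def write_fragment(x, y, z, shade):
                  """Only texture/shade the fragment if it lands in front of whatever
                  is already stored at (x, y); otherwise the work is skipped."""
                  if z < depth_buffer[y][x]:
                      depth_buffer[y][x] = z
                      color_buffer[y][x] = shade()  # texturing + shading happen only here
                  # With many triangles in flight, writes to the same (x, y) have to
                  # stay ordered, which is the synchronization mentioned above.

              # A nearer fragment (z=0.2) wins over a farther one (z=0.9).
              write_fragment(10, 10, 0.9, lambda: (255, 0, 0))
              write_fragment(10, 10, 0.2, lambda: (0, 255, 0))
              print(color_buffer[10][10])  # (0, 255, 0)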

              When dealing with multi-GPU setups, where NUMA issues crop up, they intentionally give up on trying to use both pools of memory efficiently or on exchanging a bunch of syncing information. In a chiplet-based design they could get around the NUMA issue with an I/O die like Zen 3's, or a primary die like RDNA 3 is going to use, so there's at least some fix for that.

              Past that, even though both GPUs have access to everything in VRAM, they can't distribute vertex transformation work for split-frame rendering across both GPUs, because they don't know where a triangle will be in screen space until after they transform it. Their solution is to either have both GPUs transform all the vertices and then only texture and shade the triangles on their side of the screen, or do alternate frame rendering. The result will sometimes be nearly twice the performance, but sometimes you'll get worse performance than with one GPU, while power usage always goes up with the number of cards you use. It's a nightmare.
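
              As a rough illustration of the duplicated work in split-frame rendering (my own toy sketch, not any vendor's actual scheme): each GPU transforms every triangle and only afterwards finds out whether the triangle even touches its half of the screen.

              Code:
              # Toy split-frame rendering: both "GPUs" run the same transform over
              # all triangles; ownership is only known after the transform.
              def sfr_kept_triangles(world_triangles, transform, gpu_id, screen_width=1920):
                  half = screen_width / 2
                  kept = []
                  for tri in world_triangles:
                      screen_tri = [transform(v) for v in tri]   # duplicated on both GPUs
                      xs = [p[0] for p in screen_tri]
                      if gpu_id == 0 and min(xs) < half:         # touches the left half
                          kept.append(screen_tri)
                      elif gpu_id == 1 and max(xs) >= half:      # touches the right half
                          kept.append(screen_tri)
                  return kept  # only these go on to be textured/shaded on this GPU

              # A triangle straddling the middle is kept (and transformed) by both GPUs.
              tri = [(900.0, 10.0), (1000.0, 10.0), (950.0, 90.0)]
              print(sfr_kept_triangles([tri], lambda v: v, 0), sfr_kept_triangles([tri], lambda v: v, 1))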

              I'm not sure how AMD's primary + secondary die solution is supposed to fix some of those other issues, but even if it just has the primary die do the vertex transformation and hand off triangles and texels to the secondary die for fragment shading based on frame position, it could still be an improvement over a dual graphics card setup. Or the two-chip setup could be only for server and workstation workloads, which already scale better in multi-GPU setups.

              Now, I specifically mentioned IMRs before for a reason. Imagination Technologies already has a GPU architecture that scales up to four chiplets, because they make GPUs that use tile-based deferred rendering (TBDR), which inherently scales to multi-chip designs much better. TBDRs transform triangles, dump them back into memory in bins of screen-space tiles, and then each chiplet pulls in tiles whose triangles are sorted, textured, shaded, and blended in that chip's own on-chip memory. Theoretically the chiplets don't even have to run in sync with each other, vertex transformation can be distributed across chips or just done by a primary chip, very little inter-chip communication is needed, and the required bandwidth between the chiplets and external memory is already achievable with AMD's chiplet-based CPUs.
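
              Here's a toy sketch of the binning step that makes this work. It's my own simplification with an assumed 32x32 tile size and plain bounding-box binning, not Imagination's actual implementation:

              Code:
              # Toy TBDR-style binning: after the vertex transform, each triangle's
              # screen-space bounding box decides which 32x32 tiles reference it.
              # Each tile's list can then be rendered independently, e.g. one chiplet
              # per group of tiles, entirely in that chip's on-chip memory.
              TILE = 32

              def bin_triangles(screen_triangles, width, height):
                  """screen_triangles: list of ((x0,y0),(x1,y1),(x2,y2)) in screen space."""
                  tiles_x = (width + TILE - 1) // TILE
                  tiles_y = (height + TILE - 1) // TILE
                  bins = {}  # (tile_x, tile_y) -> indices of triangles overlapping that tile
                  for i, tri in enumerate(screen_triangles):
                      xs = [p[0] for p in tri]
                      ys = [p[1] for p in tri]
                      tx0 = max(0, int(min(xs)) // TILE)
                      tx1 = min(tiles_x - 1, int(max(xs)) // TILE)
                      ty0 = max(0, int(min(ys)) // TILE)
                      ty1 = min(tiles_y - 1, int(max(ys)) // TILE)
                      for ty in range(ty0, ty1 + 1):
                          for tx in range(tx0, tx1 + 1):
                              bins.setdefault((tx, ty), []).append(i)
                  return bins

              # Each (tx, ty) bucket is an independent unit of work, which is why so
              # little inter-chip communication is needed once binning is done.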

              So yeah. In short, GPUs are very different from CPUs, and the biggest issue GPUs have to deal with when being chipletized comes down to AMD and Nvidia GPUs using IMR.



              • #47
                Originally posted by Myownfriend View Post

                [...]

                Now, I specifically mentioned IMRs before for a reason. Imagination Technologies already has a GPU architecture that scales up to four chiplets, because they make GPUs that use tile-based deferred rendering (TBDR), which inherently scales to multi-chip designs much better. TBDRs transform triangles, dump them back into memory in bins of screen-space tiles, and then each chiplet pulls in tiles whose triangles are sorted, textured, shaded, and blended in that chip's own on-chip memory. Theoretically the chiplets don't even have to run in sync with each other, vertex transformation can be distributed across chips or just done by a primary chip, very little inter-chip communication is needed, and the required bandwidth between the chiplets and external memory is already achievable with AMD's chiplet-based CPUs.

                So yeah. In short, GPUs are very different from CPUs, and the biggest issue GPUs have to deal with when being chipletized comes down to AMD and Nvidia GPUs using IMR.
                Both nVidia and AMD have been using some form of tiled rendering for quite some time now.



                • #48
                  Originally posted by numacross View Post

                  Both nVidia and AMD have been using some form of tiled rendering for quite some time now.
                  They do not, and it's kind of annoying that David Kanter decided to use that term when he initially discovered it, because people keep using it. That's tiled caching, and it doesn't fix the issue I mentioned for multi-GPU or multi-chip GPUs, nor does it have anywhere near the same level of bandwidth savings that a TBR would have. Tiled caching just caches a few thousand polygons, tiles them, then tries to render them out in screen space to the L2 cache before writing the results to the g-buffer in VRAM. It works best with smaller polygons that are close to each other, and its main purpose is to capture some of the overdraw during g-buffer creation in on-chip memory and write the fragment data to VRAM in a more efficient manner. That doesn't fix the issues I mentioned; it just lessens write bandwidth to VRAM by making better use of the L2 cache. I'm not even sure that AMD ever actually implemented it, because I saw posts suggesting that, and it wouldn't really have as much of a purpose when you have 128MB of on-chip cache.

                  The advantage of a TBR is that the entirety of the screen's primitives are transformed ahead of time, and when they're pulled back into the GPU, all of the triangles for that area of the screen are rasterized, shaded, textured, and blended within extremely high-bandwidth, low-latency on-chip memory. In Imagination's, Intel's, and Apple's case they also do hidden surface removal by sorting the geometry from front to back ahead of time and writing the depth buffer for all opaque geometry before rendering the other buffers, so that no pixel is shaded and no texel is sampled that won't contribute to the final image. It's like a z-prepass but better, because it gets rid of overdraw during depth buffer creation, and the results of the depth pass don't need to be written out to RAM and then brought back in on a second pass; it's just part of the pipeline. These all result in a situation where the entirety of the image can potentially be drawn on-chip. Since the tiles are so small, very little on-chip memory is required per tile (32KiB per 32x32px tile with a 256-bit g-buffer), its effectiveness at bandwidth reduction is the same regardless of output resolution, and the chips don't need to exchange load-balancing information to make sure that work is distributed evenly between chips in a multi-chip configuration. A TBR is internally a lot like a multi-chip GPU already, so it doesn't have to change anything about how it works when it's actually multi-chip. For that reason, early PowerVR Series 5 GPUs were scaled by just duplicating the entire GPU multiple times and connecting the copies with an arbiter.
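
                  The per-tile memory figure above is easy to sanity-check (the 256-bit g-buffer is the assumption already stated in this post):

                  Code:
                  # 32x32 pixels with a 256-bit (32-byte) g-buffer entry per pixel:
                  tile_pixels = 32 * 32                  # 1024 pixels per tile
                  bytes_per_pixel = 256 // 8             # 32 bytes per pixel
                  print(tile_pixels * bytes_per_pixel)   # 32768 bytes = 32 KiB of on-chip memory per tile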

                  A multi-chip IMR GPU with tiled caching would still require that one or both chips transform triangles and pass them to the other chip based on their screen-space position, load-balancing based on the render times of the previous frame, or they'd need to render alternate frames. They could also distribute work on a per-tile basis instead of halves or quadrants, and that would prevent the need for load balancing, but it would also increase the amount of inter-chip data sharing that's required. They could also try to imitate TBR rendering in software.
                  Last edited by Myownfriend; 10 September 2021, 10:18 AM.



                  • #49
                    I wonder how long it will be before this benefits other AMD hardware like older APUs. I have a Lenovo with an A12 APU. When I loaded Linux on it, the battery life dropped by a third.



                    • #50
                      Originally posted by Dr. Righteous View Post
                      I wonder how long it will be before this benefits other AMD hardware like older APUs. I have a Lenovo with an A12 APU. When I loaded Linux on it, the battery life dropped by a third.
                      Hardware prior to Zen 2 doesn't support CPPC, so it isn't supported by this driver.
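
                      For anyone wanting to check a given system, something along these lines should show whether the CPU exposes CPPC at all. It's just a rough sketch assuming the usual "cppc" flag in /proc/cpuinfo and the ACPI CPPC sysfs directory; neither is guaranteed on every kernel or firmware.

                      Code:
                      # Rough check for CPPC availability on Linux (paths/flags as assumed above).
                      import os

                      def has_cppc():
                          try:
                              with open("/proc/cpuinfo") as f:
                                  for line in f:
                                      if line.startswith("flags") and "cppc" in line.split():
                                          return True
                          except OSError:
                              pass
                          # ACPI CPPC objects show up here when the firmware exposes them.
                          return os.path.isdir("/sys/devices/system/cpu/cpu0/acpi_cppc")

                      print("CPPC available:", has_cppc())
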
                      Michael Larabel
                      https://www.michaellarabel.com/

