AMD Ryzen 7 5800X3D Continues Showing Much Potential For 3D V-Cache In Technical Computing


  • #21
    Originally posted by birdie View Post

    Or maybe a huge L3 cache costs quite a lot of money and could also result in a lot lower yields, so AMD/Intel engineers aren't stupid, they are just constrained. Also, remember Broadwell. Intel executed this feat many years ago.

    If 3D cache were just a walk in the park, why would AMD release only a single consumer SKU with it?

    Also, from what I've heard, consumer Zen 4 CPUs will not include it either.
    I think they said they want all available cache dies for their next GPU release. There is no room to get them on the current generation.

    Comment


    • #22
      Originally posted by atomsymbol View Post
      Well, but you didn't account for the cost of increasing the number of pins on the CPU which would be required by a quad-channel AM4 socket (all AM4 motherboards, all AM4 CPUs).

      ----

      I don't know the purpose of the 387 extra pins on the AM5 socket (1718 pins) compared to AM4 (1331 pins), considering that AM5 is rumoured to support DDR5 only (no support for DDR4).
      As far as I'm aware, AM5 is DDR5-only and, I think, is supposed to be dual-channel. I assume most of the extra pins are for additional PCIe lanes (likely delivered directly from the CPU rather than the chipset), more power delivery, whatever the iGPU needs, and I assume DDR5 itself needs more pins. So 387 sounds reasonable for those kinds of upgrades.
      I don't think the pins for 2 extra channels would make that big of a difference in socket cost; they would cost less than V-cache, anyway. Bear in mind that for first-gen Threadrippers, those sockets were huge and nearly half of the socket was rendered useless. When looking at this diagram, I assume the dark blue pins and some of the pink ones are for RAM:
      https://www.docdroid.net/6cDW11N/am4-pinout-diagram-pdf
      Collectively, those seem to take up about 1/4 of the package. So, maybe about 150 pins per RAM slot. That's a lot, but nothing too crazy, and roughly half of that is I think just power delivery.
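As a rough sanity check on that estimate, here is a minimal sketch of the arithmetic. The pin totals are the ones quoted in this thread; the 1/4 share for RAM is only the eyeballed figure from the diagram, and dividing it evenly between the two channels is an assumption, so treat the output as a ballpark.

```python
# Pin totals as quoted in this thread.
AM4_PINS = 1331
AM5_PINS = 1718

extra_pins_am5 = AM5_PINS - AM4_PINS  # pins AM5 adds over AM4

# Assumption: roughly 1/4 of the AM4 pins serve RAM (eyeballed from the diagram),
# split evenly between the two memory channels.
ram_fraction = 0.25
ram_pins_am4 = AM4_PINS * ram_fraction
pins_per_channel = ram_pins_am4 / 2

# Hypothetical quad-channel AM4: two more channels on top of the existing two.
extra_pins_quad = 2 * pins_per_channel

print(f"AM5 adds {extra_pins_am5} pins over AM4")
print(f"Estimated RAM pins on AM4: {ram_pins_am4:.0f} ({pins_per_channel:.0f} per channel)")
print(f"Two extra channels would need roughly {extra_pins_quad:.0f} more pins")
```

Under these assumptions, two extra channels land in the same ballpark as the 387 pins AM5 adds.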

      In any case, triple or quad channel is unlikely to happen for AM5. Even if you ignore the CPU, the iGPU could really use the bandwidth.

      Comment


      • #23
        Originally posted by agd5f View Post

        Also an increase in die size for the extra memory PHYs and data fabric routing and reworked packaging. AMD does make a quad memory channel CPU, it's called Threadripper. If you want to look at cost, using the Threadripper socket and package is probably the least costly because it already exists.
        Threadripper (LGA 4094) cost isn't a good indicator of the cost of a hypothetical AM4 with 4-channel DDR4, because Threadripper supports 4 PCIe-x16 slots while AM4 supports 1 PCIe-x16 slot.

        Comment


        • #24
          Originally posted by schmidtbag View Post
          https://www.docdroid.net/6cDW11N/am4-pinout-diagram-pdf
          Collectively, those seem to take up about 1/4 of the package. So, maybe about 150 pins per RAM slot. That's a lot, but nothing too crazy, and roughly half of that is I think just power delivery.

          In any case, triple or quad channel is unlikely to happen for AM5. Even if you ignore the CPU, the iGPU could really use the bandwidth.
          wikichip:socket_am4#Pin_Description

          wikichip:socket_am5#Pin_Description

          AM5 seems to support 2 NVMe devices (2 * PCIe-x4), instead of 1 on AM4. GPU connectivity of AM5 seems to be the same as AM4.

          Comment


          • #25
            Originally posted by piotrj3 View Post

            Half yes half no.

            Triple/quad-channel controllers help with bandwidth, so a triple/quad-channel memory controller could help a lot with something like LZ4, which gets no performance bump from 3D cache. It also helps spread the workload more evenly across more sticks.

            However, triple/quad channel won't remove the latency bottleneck. Think of it this way: if you can run the memory at the same speed and timings on triple/quad channel as on dual channel, that is at best a 50%/100% performance increase, and that is the most optimistic case. Meanwhile, in ZSTD you have a 177% performance increase.
            Two things:
            LZ4 likes core speed. It isn't bound much by main-memory bandwidth or L3 bandwidth, so L1 and IPC make LZ4 happy.

            Triple/quad channel will lower *average* latency, but not pointer chasing single threaded latency. I'm not aware of a benchmark that really can tell the difference.

            Comment


            • #26
              Originally posted by willmore View Post
              Triple/quad channel will lower *average* latency, but not pointer chasing single threaded latency. I'm not aware of a benchmark that really can tell the difference.
              If it includes synthetic benchmarks, I suppose one could create a fine-tuned benchmark that chases multiple pointers concurrently in a single thread. A CPU capable of executing 3+ loads per cycle would be required (I do not own such a CPU yet) to show a measurable difference with a 4-channel RAM. With AMD Zen 3, there is a small performance advantage when the AM4 motherboard has 4 memory modules installed instead of 2, although both of these configurations are still dual-channel DDR4 - I am not sure whether this applies to just single-threaded code, to just multi-threaded code, or to both.

              Comment


              • #27
                Originally posted by schmidtbag View Post

                In any case, triple or quad channel is unlikely to happen for AM5. Even if you ignore the CPU, the iGPU could really use the bandwidth.
                Bingo. CPU doesn't really need it, GPU does.

                That being said, consumer desktops don't really need strong IGPs, as the advantages (lower power, VRAM size, faster transfers, more compact) don't matter much there.


                What *does* need it is consumer + professional laptops. AMD/Intel should absolutely make a 4+ channel laptop platform.
                Last edited by brucethemoose; 02 May 2022, 07:39 PM.

                Comment


                • #28
                  Originally posted by atomsymbol View Post

                  If it includes synthetic benchmarks, I suppose one could create a fine-tuned benchmark that chases multiple pointers concurrently in a single thread. A CPU capable of executing 3+ loads per cycle would be required (I do not own such a CPU yet) to show a measurable difference with a 4-channel RAM. With AMD Zen 3, there is a small performance advantage when the AM4 motherboard has 4 memory modules installed instead of 2, although both of these configurations are still dual-channel DDR4 - I am not sure whether this applies to just single-threaded code, to just multi-threaded code, or to both.
                  Measuring such a difference won't be hard in workloads with low compute on very large data sets (I did it myself with AVX2): you can very easily saturate dual-channel 3600 MHz memory with just one core on a Ryzen 3600. If one Zen 2 core can do that, then 16 Zen 3 cores certainly can as well.
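For reference, a minimal sketch of the theoretical ceiling being discussed. It assumes a standard 64-bit (8-byte) data bus per channel and treats "3600MHz" as 3600 MT/s; `peak_bandwidth_gbs` is just an illustrative helper name, and measured throughput will be lower than these peak numbers.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (assumes 8 bytes per transfer per channel)."""
    return channels * bus_bytes * mega_transfers_per_s * 1e6 / 1e9

dual = peak_bandwidth_gbs(2, 3600)

for channels in (2, 3, 4):
    bw = peak_bandwidth_gbs(channels, 3600)
    gain = (bw / dual - 1) * 100  # best-case gain over dual channel
    print(f"{channels} channels @ 3600 MT/s: {bw:.1f} GB/s peak (+{gain:.0f}% vs dual)")
```

At equal speed and timings, peak bandwidth scales linearly with channel count, which is where the best-case 50% (triple) and 100% (quad) figures come from.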

                  Also, there are real benchmarks of the Intel 7820X running with 1, 2 and 4 sticks of RAM (it supports up to quad channel, but can work in single channel as well).

                  (Polish website, so you might want to autotranslate it: https://www.purepc.pl/test-pamieci-r...nnel?page=0,14 )

                  Comment


                  • #29
                    Originally posted by piotrj3 View Post
                    (Polish website, so you might want to autotranslate it: https://www.purepc.pl/test-pamieci-r...nnel?page=0,14 )
                    Those aren't single-threaded benchmarks. What I mean is: Chasing 3 64-bit pointers in a single thread (on a single CPU core). A CPU capable of executing 3 loads per cycle is required to run such a test (Ryzen 3600 can execute "only" 2 loads per clock). Measuring how 1/2/4-channel DRAM configuration affects the performance of chasing 3 pointers in a single thread is most likely just an academic exercise. Such an exercise would probably need to use 1 GiB huge-pages, not 4 KiB pages and not 2 MiB huge-pages.
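To make the access pattern concrete, here is a minimal sketch of chasing several independent pointer chains in one thread. It is only an illustration of the structure: in Python, interpreter overhead dominates, so the timings say nothing about DRAM channels. A real measurement would need a compiled language, a working set far larger than L3, and huge pages as described above. The function names and sizes are made up for this example.

```python
import random
import time

def make_cycle(n, rng):
    """Sattolo's algorithm: a random permutation of 0..n-1 that forms one single cycle."""
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i is what guarantees a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def chase(nxt, starts, steps):
    """Follow len(starts) independent chains, interleaved in a single thread."""
    pos = list(starts)
    for _ in range(steps):
        for k in range(len(pos)):
            pos[k] = nxt[pos[k]]  # each load depends only on its own chain
    return pos

rng = random.Random(0)
N = 1 << 16          # illustrative size; far too small to defeat caches
STEPS = 20_000
table = make_cycle(N, rng)

for chains in (1, 2, 3):
    starts = [c * (N // chains) for c in range(chains)]
    t0 = time.perf_counter()
    chase(table, starts, STEPS)
    dt = time.perf_counter() - t0
    print(f"{chains} chain(s): {dt / (STEPS * chains) * 1e9:.0f} ns per load")
```

The single-cycle permutation matters because it prevents a chain from falling into a short loop that fits in cache. In a compiled version, time per load should drop as chains are added for as long as the core and memory system can keep that many misses in flight.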

                    Comment


                    • #30
                      Originally posted by schmidtbag View Post
                      I've considered this issue, but it depends on your workload. I was thinking that maybe for command rates configured to T2, the pairs of channels could maybe even work asynchronously to reduce latency. I imagine that is very complicated, and could potentially interfere with threads that need more RAM than what a pair of channels has to offer.
                      In any case, more channels can improve performance for a minimal increase in cost. Bigger caches in a lot of cases cost a lot and sometimes yield no benefit.


                      That's assuming all boards include all 4 channels. I'm sure most ITX boards would either stick with 2 slots, or go with SO-DIMMs. Budget boards won't need 4 channels. I think paying 10 extra is well worth the performance gains, when you consider the cache costs a hell of a lot more than that.
                      The latency gains you are chasing here are very small and can't be pushed much further.

                      https://www.techpowerup.com/forums/t...aida64.263929/

                      The lowest latency (on Ryzens) you can practically get on a daily driver with a fairly aggressive XMP profile is around 60 ns, and if you want absolute stability, more like 80 ns. Meanwhile, L3 cache has around 7-10 times lower latency. And you can't really do anything about it: as you can see on the leaderboards, latency hasn't changed much since DDR2 (!) times. Every time the CPU fails to predict what needs to be in cache, it has to wait until it gets the data from RAM.

                      Let's imagine you make a simple function call to code that lives pretty far away. The function does something very simple (let's say 8 ns). Now, if the CPU fails to predict what is coming but has it in L3 cache, you wait an additional 8 ns for that cache, so every time such an L3 access happens, the performance of that function drops by half.

                      Now imagine it wasn't in L3 cache and we need to go... to RAM. Now that takes 64 ns, so we wait 8 times longer than for L3, and the function takes 72 ns instead of 8 ns: performance drops 9 times. This is the primary advantage of 3D cache: because the cache is larger, there is a much higher chance that what you need is in cache, and if it is, most of the time you don't pay such a big cost. This is why the average performance increase by itself isn't that great: most games show maybe a 10% improvement over the 5800X in CPU-bound scenarios. But if you look at the worst-case scenarios (1% and 0.1% lows), the 5800X3D isn't 10% ahead; it is often more like 30% ahead. This is why it is often called a gamer CPU, because gamers care more about consistent gameplay than about a few more average FPS. Meanwhile, on the Polish benchmarking website I linked, the 7820X shows that in most games quad channel doesn't provide big benefits over dual channel, in either averages or lows. In a nutshell, cache does something quad channel can only wish to achieve.
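The 8 ns / 64 ns arithmetic above can be written out as a small sketch. The stall figures are the illustrative ones from this post; the two L3 hit rates are purely hypothetical and only there to show how a larger cache shifts the average.

```python
def slowdown(work_ns, stall_ns):
    """How many times slower a call gets when a stall is added to its useful work."""
    return (work_ns + stall_ns) / work_ns

def avg_stall_ns(l3_hit_rate, l3_ns, ram_ns):
    """Expected stall per access that misses the inner caches, given an L3 hit rate."""
    return l3_hit_rate * l3_ns + (1 - l3_hit_rate) * ram_ns

WORK_NS = 8   # the "very simple" function from the example
L3_NS = 8     # illustrative L3 stall
RAM_NS = 64   # illustrative RAM stall

print(f"L3 hit:  {slowdown(WORK_NS, L3_NS):.0f}x slower")
print(f"RAM hit: {slowdown(WORK_NS, RAM_NS):.0f}x slower")

# Hypothetical hit rates: a smaller L3 versus a larger, stacked one.
for hit_rate in (0.80, 0.95):
    stall = avg_stall_ns(hit_rate, L3_NS, RAM_NS)
    print(f"L3 hit rate {hit_rate:.0%}: average stall {stall:.1f} ns, "
          f"{slowdown(WORK_NS, stall):.2f}x slower")
```

In this toy model most of the average stall comes from the misses that go to RAM, so raising the hit rate cuts it substantially. Adding channels changes neither number.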

                      If we truly want to reduce CPU <-> RAM latency, we would have to move towards many smaller channels, with RAM soldered very close to the CPU and optimized for latency rather than speed.
                      Last edited by piotrj3; 02 May 2022, 09:18 PM.

                      Comment
