AMD Ryzen Threadripper 7980X & 7970X Linux Performance Benchmarks

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    Originally posted by coder View Post
    I'm the one trying to get to the bottom of what's going on here. All you've done is whine at being called out for making an ignorant statement. And then you double-down and triple-down, because the only way you can win arguments is by trying to outlast the other poor fellow.
    If my statement were truly that ignorant, you could've ignored my post, but no, you wanted to pick a fight.
    Obviously, there's some bottleneck affecting compiling and not certain other things. One of the main differences between it and the benchmarks that scale almost linearly is the amount of I/O it's doing. Again, you're not helping.
    Right, such as any of the things I mentioned, perhaps. And perhaps not.
    Suppose I/O isn't the problem, then what?
    You've cited no fact pattern leading to such a conclusion. How does the data support that?
    I told you what to look for multiple times. You're free to use whatever reviewer you want for the 3990WX.
    Not demands, just an expectation that people are interested in having a constructive discussion. I don't have time for your fragile ego, today.
    Seems to me you've got plenty of time if you've kept this up for so long.
    But you're still replying because... why?
    Do you not see the irony in you asking that when you're the one who ostensibly doesn't want to fight?
Cool, so if anyone interested in actual facts, conclusions, or solutions leaves, the comments will just be 100% whiners, wankers, and nutters. We're already halfway there.
    Welcome to the internet.
    But on a more serious note: as you continue to fail to understand, there are multiple possible causes for this that aren't your big-brain I/O theory. I'm really curious how much you'll cower if you pair the fastest SSD with this CPU and there's still a performance discrepancy. And no, I'm not saying that will happen, because I/O is just as reasonable of a cause as, for example, having 1/3 the memory bandwidth per socket. But you're too arrogant to ever accept that maybe your one and only idea could be wrong.
    You completely missed the point. Since you're not being serious, I'm not going to waste the time to explain it to you.
    Right, just like how you "don't" want to argue with me.



    • #32
      Originally posted by schmidtbag View Post
      If my statement were truly that ignorant, you could've ignored my post, but no, you wanted to pick a fight.
      You claimed there were no unusual performance issues. If you don't like being corrected for posting wrong statements, don't post.

      Originally posted by schmidtbag View Post
Suppose I/O isn't the problem, then what?
      I already suggested an experiment to Michael. We'll see if he follows up on that.

      Originally posted by schmidtbag View Post
      I told you what to look for multiple times. You're free to use whatever reviewer you want for the 3990WX.
      That's a completely different product. If there's anything relevant about it to this one, you have yet to make the case.

      Originally posted by schmidtbag View Post
Seems to me you've got plenty of time if you've kept this up for so long.
      At this point, I'm just curious how long you're going to keep whining. What I definitely don't have time for is schooling you in details that are apparently too uninteresting for you to look into, yourself.

      Originally posted by schmidtbag View Post
      as you continue to fail to understand, there are multiple possible causes for this that aren't your big-brain I/O theory.
I never said there weren't. However, it's not constructive to just throw out a bunch of random ideas with no supporting evidence.

      Originally posted by schmidtbag View Post
      I'm really curious how much you'll cower if you pair the fastest SSD with this CPU and there's still a performance discrepancy.
      Unlike you, I don't get ahead of the data and presume to know something I don't. Therefore, I don't lose face if one line of investigation doesn't pan out.

      The only reason not to investigate or look at the data is if we presume to know the answer. In that case, facts can only make us look bad, which is probably why you're so allergic to looking at the data. You can't do good engineering with such a fragile ego, because pursuing well-founded lines of investigation that don't pan out is a normal part of the process.

      Originally posted by schmidtbag View Post
And no, I'm not saying that will happen, because I/O is just as reasonable of a cause as,
      You don't actually know how reasonable either is, because you won't look at the data which might tell you whether it is.

      Originally posted by schmidtbag View Post
for example, having 1/3 the memory bandwidth per socket.
The 7980X has 1/3rd as many channels for 2/3rds as many cores. The per-core bandwidth works out to 1/2 of the 96-core EPYC I cited. That's with the 64-core version. For the 32-core 7970X, the per-core memory bandwidth is the same as both the EPYC 9654 and the Ryzen 7950X.

If it were memory-bound, we should therefore expect the 7970X to be approximately twice as fast as the 7950X, and yet the former is only 48.9% faster at Godot compilation, 39.6% faster at Linux kernel builds, 69.0% faster at LLVM compilation, 30.7% faster at Mesa compilation, and 59.3% faster at Node.js. All of that says there's a significant bottleneck before we even get to memory bandwidth.
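To spell the arithmetic out (channel counts taken from each platform's specs; equal DDR5 speed assumed, so channels per core serves as the proxy):

```python
# Per-core memory bandwidth proxy: memory channels / cores.
# Assumes all platforms run their DDR5 at comparable speeds.
chips = {
    "EPYC 9654":   (12, 96),  # (channels, cores)
    "TR 7980X":    (4, 64),
    "TR 7970X":    (4, 32),
    "Ryzen 7950X": (2, 16),
}
per_core = {name: ch / cores for name, (ch, cores) in chips.items()}

assert per_core["TR 7980X"] == per_core["EPYC 9654"] / 2   # half the EPYC's
assert per_core["TR 7970X"] == per_core["Ryzen 7950X"] == per_core["EPYC 9654"]

# If compiles were bandwidth-bound, the 7970X (2x the 7950X's channels and
# cores) should be ~2.0x faster; the measured speedups fall well short:
observed = {"Godot": 1.489, "Linux kernel": 1.396, "LLVM": 1.690,
            "Mesa": 1.307, "Node.js": 1.593}
for bench, x in observed.items():
    print(f"{bench}: {x:.3f}x vs. an ideal ~2.0x")
```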

      See, this is what I mean about a fact pattern. You'll never find them, as long as you continue to be afraid of diving into the data.

      Originally posted by schmidtbag View Post
But you're too arrogant to ever accept that maybe your one and only idea could be wrong.
      I never said that was my only idea. I'm just following where the data seems to point, because unlike you, I actually care what the answer is. If one line of inquiry runs cold, I pursue the next best-supported and so on.

      You're not here to solve any problem. You're just here to fluff your ego by saying things you think sound smart. Your empty words are worthless.



      • #33
Rendering and compiling are very different tasks. Rendering is 99% compute bound, so increases in memory latency or lower disk read speed don’t have much effect. The dataset needed per core is tiny, and the math is hard and slow. For compilation, though, IO is as important as compute. I’m guessing the problem here is the SSD not feeding the cores fast enough, since these cores are very high frequency.



        • #34
          Originally posted by LtdJorge View Post
Rendering and compiling are very different tasks. Rendering is 99% compute bound, so increases in memory latency or lower disk read speed don’t have much effect. The dataset needed per core is tiny, and the math is hard and slow.
          With regard to the last claim, I've heard the main reason people still do CPU rendering is simply due to their scenes exceeding the available memory on their GPU. So, I wonder what's your basis for saying that.

          Originally posted by LtdJorge View Post
          For compilation, though, IO is as important as compute.
          Are you just basing this on gut instinct or what? 20 years ago, a greybeard told me the same thing. He was wrong then, and in my experience over the 20 years since, it's almost never I/O bound. However, I'm usually building C++ code on machines with a lot more memory per core than the setup Michael used for these tests.

          The few times I've observed I/O performance to be a bottleneck, it was always a matter of not having enough RAM for the number of cores being used. However, I wasn't doing a sync at the end of my builds - perhaps his compilation benchmarks are.

          Originally posted by LtdJorge View Post
          I’m guessing the problem here is the SSD not feeding the cores fast enough, since these cores are very high frequency.
          I think it could be, given the caveats I mentioned above.

There's also what I said about the SLC buffer potentially being full. If we can get some data to support this, then I'd suggest doing a sync and fstrim before entering the timed section of the compilation benchmarks. It's not ideal, but it's the easiest way I know to free up what's probably ample space in the SSD's write buffer.
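Concretely, the prep step I have in mind would look something like this (just a sketch; fstrim needs root and a TRIM-capable filesystem, and the pause length is a guess):

```shell
# Flush dirty pages and discard freed blocks before the timed build,
# so the SSD's SLC write cache has room. Run as root (or via sudo).
sync                 # commit all dirty page-cache data to the drive
fstrim -v /          # discard unused blocks; -v reports bytes trimmed
sleep 30             # give the drive idle time to fold its SLC cache into TLC
# ...then start the timed section of the compile benchmark
```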



          • #35
            Originally posted by coder View Post
            You claimed there were no unusual performance issues. If you don't like being corrected for posting wrong statements, don't post.
Except your correction was about my theories for performance issues. If your correction had been "there are actually performance issues and here's why I think that", then I'd probably have agreed. After all, when it comes to compiling benchmarks, it is a little unusual to see them not scale linearly. Though, it might not be that unusual depending on what the cause is. Alpha64 seemed to imply there were scaling issues other than compiling, hence my comment that not all tasks scale linearly.
            That's a completely different product. If there's anything relevant about it to this one, you have yet to make the case.
            The whole point of me pointing out that product was to exemplify where I got my "random nonsense" from. Again, I didn't say any of them are the problem; it's possible none of them are. The thing is, history is not immune to repetition. This next-gen Threadripper is much more intelligently designed but it still has a lot of the same differences, which can basically be summarized as sharing a physical socket with a server platform but everything is cut down.
            At this point, I'm just curious how long you're going to keep whining. What I definitely don't have time for is schooling you in details that are apparently too uninteresting for you to look into, yourself.
            At this point I'm curious how many times you're going to say something along the lines of "I don't have time for this" and yet respond anyway. You think you're winning here but you're going against your own word.
I never said there weren't. However, it's not constructive to just throw out a bunch of random ideas with no supporting evidence.
The 3990WX is supporting evidence, as is any semblance of critical thinking. You're one of those people for whom everything has to be treated literally and/or in a vacuum (not literally...) or else it's irrelevant. It's genuinely strange to me how you can honestly dismiss all of my ideas as unsupportable, because of what, hubris? To this day, we are still facing issues with schedulers making stupid decisions. It's no secret that power governors can give misleading results. It doesn't take a genius to know that having a significant drop in memory bandwidth while having higher clock speeds has a very good chance of bottlenecking (though as you proved later, perhaps we can eliminate that as a possibility). I think I/O is a very likely candidate to be the problem, but it doesn't make sense to me how you think there's supporting evidence for that. That dual-socket Epyc has triple the core count of a 7980X; that's a lot of additional bandwidth needed for the disk. The SN850 is no slouch of an SSD.
            Anyone genuinely interested in solving this problem and not suffering from a superiority complex would acknowledge that some (not all) of the things I and Alpha64 mentioned are plausible. I don't really care if I'm wrong; I acknowledged there are other possibilities and I didn't try to create an exhaustive list. What I care about is how you're judging me because your autistic brain can't handle more than one possible problem at a time, or, perhaps your ego can't accept there are other possibilities that you didn't come up with.
            Unlike you, I don't get ahead of the data and presume to know something I don't. Therefore, I don't lose face if one line of investigation doesn't pan out.
            You kinda did with the whole I/O thing...
            The only reason not to investigate or look at the data is if we presume to know the answer. In that case, facts can only make us look bad, which is probably why you're so allergic to looking at the data.
            You're kinda supporting my claim that you're not capable of thinking of more than one possibility at a time. There is another reason to not investigate: I don't plan to buy this CPU and I'm not obligated to devote time to help some stranger satisfy his/her curiosity. I pitched in my two cents because that's how much I care to give. It's not nothing, just food for thought. It's a testament to your social skills if you think I'm required to do more diligence for your sake.
            You can't do good engineering with such a fragile ego, because pursuing well-founded lines of investigation that don't pan out is a normal part of the process.
            I totally agree. You also can't do good engineering by dismissing all possible options without testing them. I think swapping in a faster SSD is a great starting point, since that's a very easy variable to swap out. But again: what are you going to do if compiling benchmarks are still too slow after a better drive is swapped in? What if one of my ideas proves to be correct? You don't have evidence that the power governor, the scheduler, microcode/firmware, VRMs, etc aren't the problem, just as I don't have evidence they are the problem. A true engineer would know this.
The 7980X has 1/3rd as many channels for 2/3rds as many cores. The per-core bandwidth works out to 1/2 of the 96-core EPYC I cited. That's with the 64-core version. For the 32-core 7970X, the per-core memory bandwidth is the same as both the EPYC 9654 and the Ryzen 7950X.
            Even half the bandwidth is a substantial drop.
If it were memory-bound, we should therefore expect the 7970X to be approximately twice as fast as the 7950X, and yet the former is only 48.9% faster at Godot compilation, 39.6% faster at Linux kernel builds, 69.0% faster at LLVM compilation, 30.7% faster at Mesa compilation, and 59.3% faster at Node.js. All of that says there's a significant bottleneck before we even get to memory bandwidth.

            See, this is what I mean about a fact pattern. You'll never find them, as long as you continue to be afraid of diving into the data.
            What makes you think I'm afraid of that? It's not fear, I'm just lazy, because I don't really care that much. I would say you did a good job to prove how memory bandwidth is unlikely to be the problem. That's great - this means there's one less thing we have to consider. I'm not butthurt about it and I'm not bitter about being wrong. I only mentioned it as a possibility worth investigating for anyone who cares. You care and you did investigate, and now we can most likely eliminate it as the problem. That's a win as far as I'm concerned.
            You're not here to solve any problem. You're just here to fluff your ego by saying things you think sound smart. Your empty words are worthless.
            That whole sentence is hypocritical.
            Last edited by schmidtbag; 22 November 2023, 10:52 AM.



            • #36
              Originally posted by coder View Post
              With regard to the last claim, I've heard the main reason people still do CPU rendering is simply due to their scenes exceeding the available memory on their GPU. So, I wonder what's your basis for saying that.


              Are you just basing this on gut instinct or what? 20 years ago, a greybeard told me the same thing. He was wrong then, and in my experience over the 20 years since, it's almost never I/O bound. However, I'm usually building C++ code on machines with a lot more memory per core than the setup Michael used for these tests.

              The few times I've observed I/O performance to be a bottleneck, it was always a matter of not having enough RAM for the number of cores being used. However, I wasn't doing a sync at the end of my builds - perhaps his compilation benchmarks are.


              I think it could be, given the caveats I mentioned above.

              Also, what I said about the SLC buffer potentially being full. If we can get some data to support this, then I'd suggest doing a sync and fstrim before entering the timed section of the compilation benchmarks. It's not ideal, but it's the easiest way I know to free up probably ample space in the SSD's write buffer.
WRT rendering, the size of the scene doesn't change the time per ray much (not linearly, at least). It of course depends on how the renderer is implemented, but for Monte Carlo ones like Cycles the amount of math operations compared to memory fetches should be huge. Rendering with any decent amount of samples, at high depth and with not only-diffuse materials, there's gonna be so many samples per pixel that, if properly optimized, the renderer can do the memory operations asynchronously and amortize the memory latency.

Edit: think BVH, it massively reduces the amount of memory accesses. There have been many optimizations invented to reduce the things to iterate on with ray/path-tracing.

About compile speeds, it's also from experience. I've seen much lower core count chips scale poorly because of running from an HDD. It depends massively on the size of the compilation units. Many small files are going to be worse for IO than fewer huge files. Although the latter is probably worse for parallelism, depending on how interdependent the code is.
              Last edited by LtdJorge; 22 November 2023, 02:36 PM.



              • #37
                Originally posted by LtdJorge View Post

WRT rendering, the size of the scene doesn't change the time per ray much (not linearly, at least). It of course depends on how the renderer is implemented, but for Monte Carlo ones like Cycles the amount of math operations compared to memory fetches should be huge. Rendering with any decent amount of samples, at high depth and with not only-diffuse materials, there's gonna be so many samples per pixel that, if properly optimized, the renderer can do the memory operations asynchronously and amortize the memory latency.

About compile speeds, it's also from experience. I've seen much lower core count chips scale poorly because of running from an HDD. It depends massively on the size of the compilation units. Many small files are going to be worse for IO than fewer huge files. Although the latter is probably worse for parallelism, depending on how interdependent the code is.
                By the way, I've been on Gentoo for a year, and there's a lot of variance with compiles. With projects like LLVM, it can take many seconds for the compiler output to move. However, there are packages where I think my terminal is the bottleneck (Kitty on Sway).

                I think it's fair to think the first is compute bound while the second is I/O bound.



                • #38
                  The compiles I run benefit from multicore up to a point, but then are IO- or frequency-dependent. That also tracks with my other workload, which is large database doodling. That’s why I opted for “just” the 7960X. There’s a balancing act between core count and frequency and IO, and for me it makes sense to maximize frequency within the Threadripper range, all of which have enough cores.



                  • #39
                    Originally posted by LtdJorge View Post
                    if properly optimized, the renderer can do the memory operations asynchronously and amortize the memory latency.
                    I take your word on the rest of it, since I haven't been down in the guts of a modern, path-tracing renderer, but this part jumped out at me. How do you mean for the memory operations to run asynchronously? At only 320 entries, Zen 4's reorder buffer isn't big enough to hide an L3 cache miss. In fact, you'd be lucky if it hides the latency to fetch the cache line from an adjacent CCD's L3, and that's only if there's enough other work to do without unmet data dependencies.
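Back-of-envelope, using ballpark figures I'm assuming (~5 GHz clock, ~80 ns DRAM load-to-use latency, ~4 sustained IPC):

```python
# Can a 320-entry reorder buffer hide a full DRAM-latency cache miss?
rob_entries = 320        # Zen 4 ROB size
clock_ghz = 5.0          # assumed boost clock
dram_latency_ns = 80.0   # assumed load-to-use latency out to DRAM
sustained_ipc = 4.0      # assumed throughput when not stalled

miss_cycles = dram_latency_ns * clock_ghz     # ~400 cycles per miss
covered_cycles = rob_entries / sustained_ipc  # ~80 cycles of independent work
print(f"miss ~{miss_cycles:.0f} cycles, ROB covers ~{covered_cycles:.0f}")
assert covered_cycles < miss_cycles           # the miss stalls the core
```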

                    Originally posted by LtdJorge View Post
Edit: think BVH, it massively reduces the amount of memory accesses.
                    BVH should reduce memory and computation per ray. However, you do first have to build it.

                    Originally posted by LtdJorge View Post
About compile speeds, it's also from experience. I've seen much lower core count chips scale poorly because of running from an HDD.
Up until late 2020, I was occasionally doing builds on a 16-thread machine with 32 GB of RAM and a hardware RAID-5 across 4 hard drives. Even CMake/Ninja builds of a medium-large codebase that took about 45 minutes to compile were pretty much entirely CPU-bound.

                    Originally posted by LtdJorge View Post
It depends massively on the size of the compilation units. Many small files are going to be worse for IO than fewer huge files. Although the latter is probably worse for parallelism, depending on how interdependent the code is.
                    Well, most header files get cached early on, so you basically just have to read the C/C++ source files as they're being compiled. Those are tiny, compared to the amount of time it typically takes to compile one (I'm assuming optimizations and warnings are enabled). The most iowait I'd typically see is a couple %.

                    Linking can definitely hit I/O, if you don't have enough RAM to cache all the .o files between the time they're generated and when they're being linked, or between when libraries were linked and when they're being linked into another library or executable. The first is unlikely, unless you're running extremely low on RAM, since object files are typically generated in a batch and linked right after. The second is a bit more likely, depending on the size of the codebase and how many targets are also linking in some of those libraries.

                    Compared to that RAID, a decent NVMe SSD should be fast enough to keep a much more powerful CPU busy. The place you could run into trouble is if there's a good deal of memory pressure and writes are slow because you're going straight to TLC NAND, instead of the SLC buffer.

                    Originally posted by LtdJorge View Post
                    By the way, I've been on Gentoo for a year, and there's a lot of variance with compiles. With projects like LLVM, it can take many seconds for the compiler output to move. However, there are packages where I think my terminal is the bottleneck (Kitty on Sway).

                    I think it's fair to think the first is compute bound while the second is I/O bound.
                    Uh, so you're basing all of this on mere conjecture, without ever having actually seen iowait spike or watching the iostat -x output? If so, that's not very helpful. That's like saying "my car stopped, I assume because I ran out of gas, since it was a while since I last fueled up."
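It doesn't take much to check; something like this while the build runs (iostat comes from the sysstat package):

```shell
# Check whether a parallel build is actually disk-bound: run it in the
# background and watch extended device stats while it goes.
make -j"$(nproc)" &
iostat -x 5 12       # 12 reports, 5 s apart: look at %iowait and device %util
wait                 # low %iowait / %util while cores are pegged => CPU-bound
```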

                    This stuff isn't rocket science.
                    Last edited by coder; 23 November 2023, 04:03 AM.



                    • #40
                      Originally posted by sharpjs View Post
                      The compiles I run benefit from multicore up to a point, but then are IO- or frequency-dependent. That also tracks with my other workload, which is large database doodling. That’s why I opted for “just” the 7960X. There’s a balancing act between core count and frequency and IO, and for me it makes sense to maximize frequency within the Threadripper range, all of which have enough cores.
                      Yeah, I guess the main benefit of the 7960X is double the memory channels of the 7950X. The tradeoff is you give up a little frequency for that + the 8 extra cores. I guess you also get 33% more L3 per core, since it seems like they made it by using 4 CCDs with only 6 cores enabled on each.
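The per-core L3 arithmetic, under my assumption of 4 CCDs at 32 MB each with 6 of 8 cores enabled:

```python
# L3 cache per core, assuming the usual 32 MB of L3 per CCD.
l3_7950x = 2 * 32 / 16   # 2 CCDs, 16 cores -> 4.0 MB/core
l3_7960x = 4 * 32 / 24   # 4 CCDs, 24 cores -> ~5.33 MB/core
gain = l3_7960x / l3_7950x - 1
print(f"{gain:.0%} more L3 per core")
assert round(gain, 2) == 0.33
```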

For me, even all of that wouldn't be enough to justify the higher cost. However, if you need the additional PCIe lanes, that could basically force your hand.

                      I do hear you on the frequency point. Much of the time, I'm doing small incremental builds, where I really want decent single-thread performance more than a ton of threads.

BTW, if you've got the budget for a ThreadRipper, I'm sure you're using a datacenter-grade SSD and not consumer trash. That should help a lot with your database I/O, as well as minimize the chance of I/O-related build bottlenecks (but then, so should adequate RAM). If you want to learn more, start here:
                      Last edited by coder; 23 November 2023, 04:16 AM.

