AMD Threadripper 2990WX Linux Benchmarks: The 32-Core / 64-Thread Beast


  • #41
    Originally posted by Michael View Post

    Yeah, it's a matter of finding a more optimal test... There are the GCC and LLVM build test profiles in PTS, which take much longer to run, albeit they are perhaps not as representative for users. I'm also trying to find the right balance of a real-world test that a lot of people would actually be building, since I'm not sure how many people build their own GCC/LLVM versus relying upon the distribution defaults.

    Time to build Mesa would be a fun one for Phoronix readers but then there is the mess of dependencies to deal with across distributions, needing up-to-date libdrm, etc. Any other ideas?
    Mesa would probably compile even faster than the kernel. My primary interest would be LLVM, which would probably be a reasonable balance between taking some time and not taking too much. Indeed, I'm not sure how much it generalizes, but to be fair I have no idea what people are compiling anyway outside of their own pet projects.

    I saw that Anandtech was doing a Chromium compile, which certainly takes a long time (probably too long), is a mess, and I'm not sure how well it generalizes either. (By the way, they had the interesting result that the 2950X beat the 2990WX on the Chromium compile test: https://www.anandtech.com/show/13124...2950x-review/9. Not entirely sure what is up with that.)
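
    For reference, timing an LLVM build by hand looks roughly like this (just a sketch, assuming a source checkout and the standard CMake/Ninja setup; point CMake at the llvm/ directory of the monorepo, or at the tree root on an older standalone checkout):

    # Release build, timed across all available cores:
    mkdir build && cd build
    cmake -G Ninja -DCMAKE_BUILD_TYPE=Release ../llvm
    time ninja -j $(nproc)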

    Comment


    • #42
      Originally posted by Michael View Post
      Time to build Mesa would be a fun one for Phoronix readers but then there is the mess of dependencies to deal with across distributions, needing up-to-date libdrm, etc. Any other ideas?
      Anandtech uses Chromium as a compile test. I'm not sure if that's the best choice, but it might give some idea.

      Comment


      • #43
        Originally posted by BNieuwenhuizen View Post

        Mesa would probably compile even faster than the kernel. My primary interest would be LLVM, which would probably be a reasonable balance between taking some time and not taking too much. Indeed, I'm not sure how much it generalizes, but to be fair I have no idea what people are compiling anyway outside of their own pet projects.

        I saw that Anandtech was doing a Chromium compile, which certainly takes a long time (probably too long), is a mess, and I'm not sure how well it generalizes either. (By the way, they had the interesting result that the 2950X beat the 2990WX on the Chromium compile test: https://www.anandtech.com/show/13124...2950x-review/9. Not entirely sure what is up with that.)
        Yeah, I will probably make more use of the build-llvm test profile if no other interesting ones come up.

        I really don't get the appeal of Chromium build tests: besides taking a long time, it doesn't seem like many people actually build their own web browser.
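
        For anyone wanting to run it themselves, it should be roughly a matter of the following (assuming the pts/build-llvm profile name; PTS will offer to install any missing dependencies):

        phoronix-test-suite install pts/build-llvm
        phoronix-test-suite benchmark pts/build-llvm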
        Michael Larabel
        https://www.michaellarabel.com/

        Comment


        • #44
          Originally posted by Tomin View Post

          Anandtech uses Chromium as a compile test. I'm not sure if that's the best choice, but it might give some idea.
          The Chromium compile test is really a mess, and I'm not sure why they use it beyond compiling something that at least Windows users recognize.
          Michael Larabel
          https://www.michaellarabel.com/

          Comment


          • #45
            Originally posted by bridgman View Post

            Do you mean "worse than any other NUMA system" or just "worse than a non-NUMA system" ?
            Worse than other NUMA systems, by several orders of magnitude. I tested several dual xeon systems, and the results are comparable to the single CPU intel systems I tested. Cross-NUMA latencies on the threadripper are fine by the way, you get like 90ns access latencies - that's several orders of magnitude below the pathological (100ms - 1000ms) spikes I'm measuring, and even below the ~100μs typical scheduling latencies of a non-broken system. Cross-NUMA memory accesses might make your process run slower, but it won't cause the CPUs to completely freeze to the point where not even the kernel can execute code on it.

            I don't know why disabling NUMA in the BIOS makes the issue more or less disappear. Well, it doesn't go away completely - but the latencies are now constrained to the order of magnitude of 1-5ms, significantly below the results I get with NUMA enabled. However, upon reconsideration, I tested to see what would happen if I only stressed a single NUMA node, and I cannot reproduce the issue in this scenario - loading all 8 cores of node 0 does not trigger the issues, but loading 4 cores of each node does. It seems as though it happens when there's a lot of contention between both NUMA nodes accessing the same memory (?) at the same time, as can often be the case when e.g. compiler processes are communicating via pipes. (Heavy compilation was where I originally observed the issue)
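
            For reference, the reproduction looks roughly like this (a sketch; cyclictest comes from the rt-tests package, and the node numbers assume a two-node Threadripper layout):

            # Load 4 cores on each NUMA node for five minutes:
            numactl --cpunodebind=0 stress-ng --cpu 4 --timeout 300s &
            numactl --cpunodebind=1 stress-ng --cpu 4 --timeout 300s &
            # Measure scheduling latency with one SCHED_FIFO measurement thread per core:
            sudo cyclictest --mlockall --smp --priority=99 --interval=200 --duration=300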

            I'm thinking it might be related to the shared page fault hardware assist, threadripper I believe uses only one per package (?), not per core as is typical for smaller systems. If this gets bottlenecked, the threads could be blocked for long periods of time waiting on the page faults to resolve for every other core first.

            Looking a bit more into that `cyclictest` tool, it really seems to be designed for testing RT kernels. If you've been using an RT kernel, however, your results are indeed extremely worrying.
            I *am* using a realtime kernel, and you are right to worry. Also, on e.g. an Intel system, using a realtime (*) kernel just reduces latencies from, say, ~1ms down to ~100μs. On Threadripper, I get latencies in the 1000ms range. This is not the same order of magnitude as realtime vs non-realtime. Even on a non-realtime system it's completely unrealistic for the CPU to stall for this long. It is also about 7 orders of magnitude slower than RAM access latencies (~100ns on the 1950X as tested on my system, even cross-NUMA).

            (*) By "realtime" I assume you mean PREEMPT (as shown by `uname -v`). An actual RT kernel is a separate patchset on top of the kernel, which, incidentally, I also tried - no difference.
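
            (For anyone wanting to check what their own kernel is doing, something along these lines works on most distributions:)

            # "#1 SMP PREEMPT ..." in the version string indicates a preemptible kernel:
            uname -v
            # The kernel config spells out the preemption model (CONFIG_PREEMPT vs the RT patchset options):
            grep -i preempt /boot/config-$(uname -r)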

            Which isn't to say that freezes should happen under a normal kernel (that's not nice, regardless of the kernel), but this application is kind of a worst-case scenario. I guess you could work around it by tweaking the scheduler?
            I don't think the scheduler is the problem. `cyclictest` runs with FIFO scheduling priority, and the kernel is fully preempt-capable (including kernel functions). Under normal circumstances (e.g. a system where only one CPU is loaded), this works fine - even on the loaded CPU. The stalls only occur when the full system is loaded. I've also already checked using the ftrace tracers (preempt, hwlat): there are no non-preemptible kernel functions running for longer than 10μs, nor are there any detected hardware latencies (which could be caused by e.g. the BIOS).
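
            (For completeness, those checks were roughly along these lines; tracefs may be mounted at /sys/kernel/debug/tracing or /sys/kernel/tracing depending on the setup, and everything below runs as root:)

            cd /sys/kernel/debug/tracing
            # Hardware/firmware-induced stalls (SMIs etc.); the maximum is reported in microseconds:
            echo hwlat > current_tracer
            sleep 60; cat tracing_max_latency
            # Longest non-preemptible kernel section (requires CONFIG_PREEMPT_TRACER):
            echo 0 > tracing_max_latency
            echo preemptoff > current_tracer
            sleep 60; cat tracing_max_latency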

            P.s. Sorry for my previous double post, I didn't see the confirmation saying that my post was under review, nor could I see it listed so I assumed it got lost.

            Comment


            • #46
              Originally posted by Tomin View Post
              Anandtech uses Chromium as a compile test. I'm not sure if that's the best choice, but it might give some idea.
              The Chromium compile on Windows is plagued by bugs and race conditions, and held together by workarounds for those.

              (Linked blog post: a twenty-month investigation into flaky Chromium build failures, with an October 2018 bug-fix update at the end.)


              I would not use it as a reliable benchmark for anything.

              Comment


              • #47
                Originally posted by haasn View Post
                Worse than other NUMA systems, by several orders of magnitude. I tested several dual xeon systems, and the results are comparable to the single CPU intel systems I tested. Cross-NUMA latencies on the threadripper are fine by the way, you get like 90ns access latencies - that's several orders of magnitude below the pathological (100ms - 1000ms) spikes I'm measuring, and even below the ~100μs typical scheduling latencies of a non-broken system. Cross-NUMA memory accesses might make your process run slower, but it won't cause the CPUs to completely freeze to the point where not even the kernel can execute code on it.

                I don't know why disabling NUMA in the BIOS makes the issue more or less disappear. Well, it doesn't go away completely - but the latencies are now constrained to the order of magnitude of 1-5ms, significantly below the results I get with NUMA enabled. However, upon reconsideration, I tested to see what would happen if I only stressed a single NUMA node, and I cannot reproduce the issue in this scenario - loading all 8 cores of node 0 does not trigger the issues, but loading 4 cores of each node does. It seems as though it happens when there's a lot of contention between both NUMA nodes accessing the same memory (?) at the same time, as can often be the case when e.g. compiler processes are communicating via pipes. (Heavy compilation was where I originally observed the issue)

                I'm thinking it might be related to the shared page fault hardware assist, threadripper I believe uses only one per package (?), not per core as is typical for smaller systems. If this gets bottlenecked, the threads could be blocked for long periods of time waiting on the page faults to resolve for every other core first.
                Thanks for the detailed response, very helpful. I have seen those long pauses on dual-Xeon systems running ML apps and found that disabling NUMA page migration made them go away. Probably more work required there to make best use of the system with balancing disabled but my first priority was to find out what was causing the stalls.

                If you have the opportunity, could you try disabling NUMA page migration to see if it affects the stalls ? The command I used was something like:

                echo 0 | sudo tee /proc/sys/kernel/numa_balancing

                (actually I think I just echoed 0 to the sysfs location as root and one of our engineers polished it up and sent it back to me )
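
                (If that does help, it should also be possible to make the setting persistent via sysctl; the file name below is just an example:)

                # One-off, equivalent to the echo above:
                sudo sysctl -w kernel.numa_balancing=0
                # Survives reboots:
                echo "kernel.numa_balancing = 0" | sudo tee /etc/sysctl.d/99-numa-balancing.conf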
                Test signature

                Comment


                • #48
                  haasn bridgman
                  You may also be seeing bad interaction between NUMA and hugepages (cf. Gaud et al., Large Pages May Be Harmful On NUMA Systems, USENIX 2014), if your application uses hugepages explicitly or transparently. Try disabling transparent hugepages support.
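
                  Something along these lines checks and flips it at runtime (the active value is the one shown in brackets):

                  # Current THP mode:
                  cat /sys/kernel/mm/transparent_hugepage/enabled
                  # Disable THP (and THP defrag) until the next reboot:
                  echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
                  echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag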

                  Comment


                  • #49
                    Originally posted by bridgman View Post

                    Thanks for the detailed response, very helpful. I have seen those long pauses on dual-Xeon systems running ML apps and found that disabling NUMA page migration made them go away. Probably more work required there to make best use of the system with balancing disabled but my first priority was to find out what was causing the stalls.

                    If you have the opportunity, could you try disabling NUMA page migration to see if it affects the stalls ? The command I used was something like:

                    echo 0 | sudo tee /proc/sys/kernel/numa_balancing

                    (actually I think I just echoed 0 to the sysfs location as root and one of our engineers polished it up and sent it back to me )
                    That is a very helpful suggestion. Echoing 0 makes the latency spikes less frequent and shorter in duration on average. I can still see some spikes in the millisecond range, and some isolated events as high as 10ms, but this is already a major improvement over the result with that setting set to `1` (where spikes above 100ms were the norm). Still not quite where it should be, but a step in the right direction.

                    This gives us more tools to play with and investigate the issue further. Maybe we should take this to the kernel.org bug tracker, now that we know a linux setting can influence the behaviour (and therefore linux might be able to do something about it)?

                    You may also be seeing bad interaction between NUMA and hugepages (cf. Gaud et al., Large Pages May Be Harmful On NUMA Systems, USENIX 2014), if your application uses hugepages explicitly or transparently. Try disabling transparent hugepages support.
                    I had previously played around with switching this setting between madvise and always (taking a page from openSUSE Tumbleweed's book here), but didn't notice an improvement (or degradation) in the observed behavior. I tried disabling it completely but also didn't notice any difference; do I need to reboot my system to properly test the effects of that change, or is it only relevant for new processes?
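
                    (One way to check whether any transparent hugepages are actually still in use, rather than guessing:)

                    # System-wide count of anonymous memory currently backed by THP:
                    grep AnonHugePages /proc/meminfo
                    # THP allocation/collapse/split counters:
                    grep thp /proc/vmstat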

                    Comment


                    • #50
                      Originally posted by haasn View Post
                      I had previously played around with switching this setting between madvise and always (taking a page from openSUSE Tumbleweed's book here), but didn't notice an improvement (or degradation) in the observed behavior. I tried disabling it completely but also didn't notice any difference; do I need to reboot my system to properly test the effects of that change, or is it only relevant for new processes?
                      To my knowledge, THPs that have already been created and are in use will not be broken up when you disable them via /sys. Boot with the transparent_hugepage=never kernel parameter to get rid of them completely.
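
                      On a GRUB-based system that typically means something like the following (the config-update command varies by distribution, e.g. grub2-mkconfig on Fedora/openSUSE):

                      # Add the parameter to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.:
                      #   GRUB_CMDLINE_LINUX_DEFAULT="quiet transparent_hugepage=never"
                      sudo update-grub   # Debian/Ubuntu
                      # After rebooting, "never" should be the bracketed value here:
                      cat /sys/kernel/mm/transparent_hugepage/enabled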

                      Comment
