A Look At Linux Application Scaling Up To 128 Threads


  • #21
    It clearly seems that running just one program, no matter how well it attempts to scale, is not the purpose of such a beast, and doesn't do that machine justice.
    Amdahl's law is inevitable. Inter-core communication bottlenecks are inevitable.
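For a rough sense of scale, here is a minimal sketch of Amdahl's law in Python. The 95% parallel fraction is an assumed, illustrative number and is not taken from any benchmark in the article; the function name is made up for this example.

```python
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Ideal speedup from Amdahl's law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


if __name__ == "__main__":
    # Assumed workload: 95% of the runtime parallelizes perfectly.
    for n in (16, 32, 64, 128):
        print(f"{n:>3} threads: {amdahl_speedup(0.95, n):.2f}x")
```

With a 5% serial portion the speedup can never reach 20x, no matter how many threads are added, and that is before any inter-core communication cost is counted.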


    • #22
      Originally posted by Spooktra
      The reality is that Ubuntu has the best hardware support out of all distros, Fedora causes audio issues, such as popping, crackling, audio distortion, I have seen the same thing with Suse, RedHat respins, Manjaro, but Ubuntu just works.
      I have never experienced any audio issues on Fedora, and I've been using it since before it was called Fedora. Either way, what does audio have to do with benchmarks?


      • #23
        Michael, thanks for bringing such a beast of a machine into the benchmarking world. I guess that means thanking Dell also.

        What I would love to see is how the HT regressions compare to Intel hardware. This assumes that the Intel hardware and AMD hardware show issues on the same tests. Right now the few HT-related performance regressions really don't look that bad. Oh, and power: how badly is this machine killing your electricity budget?


        • #24
          Originally posted by wizard69
          Michael, thanks for bringing such a beast of a machine into the benchmarking world. I guess that means thanking Dell also.

          What I would love to see is how the HT regressions compare to Intel hardware. This assumes that the Intel hardware and AMD hardware show issues on the same tests. Right now the few HT-related performance regressions really don't look that bad. Oh, and power: how badly is this machine killing your electricity budget?
          AMD sent it out, not Dell.
          Michael Larabel
          https://www.michaellarabel.com/


          • #25
            Originally posted by Adul
            It clearly seems that running just one program, no matter how well it attempts to scale, is not the purpose of such a beast, and doesn't do that machine justice.
            Amdahl's law is inevitable. Inter-core communication bottlenecks are inevitable.
            Entirely correct. Servers like this are really meant for multitasking, not highly parallel workloads. This is why I don't see myself buying a CPU beyond 16 threads for a very long time. I don't have any workloads that are especially demanding of an entire CPU, and if I feel the need to multitask, I'll just use more than one computer at the same time.


            • #26
              This was one fun benchmark, thanks!


              • #27
                Originally posted by schmidtbag
                Entirely correct. Servers like this are really meant for multitasking, not highly parallel workloads. This is why I don't see myself buying a CPU beyond 16 threads for a very long time. I don't have any workloads that are especially demanding of an entire CPU, and if I feel the need to multitask, I'll just use more than one computer at the same time.
                x2. 16 threads is probably the sweet spot for a single-user workstation. Personally, the most demanding tasks I do are transcoding video with HandBrake, which will happily eat as many cores as you can give it, and running desktop VMs in VirtualBox, where it's nice to load up the VM(s) while still having enough free cores to keep the rest of the machine snappy and responsive. And the occasional code compile.

                FWIW my aging Ivy Bridge Xeon E5-2680 v2 workstation is still quite capable in this regard with its 10c/20t and 32 GB of ECC DDR3-1866. But I will likely replace it with Epyc next year, after the 7 nm Zen2 parts hit the market. Epyc is clearly superior to Xeon technically, and it's priced better. Win-win. The process improvement going from 14 nm to 7 nm is massive, so I'm anticipating some truly amazing numbers from Zen2.
                Last edited by torsionbar28; 11 October 2018, 12:35 PM.


                • #28
                  Originally posted by varikonniemi
                  what magic does stockfish and vgr do when 32->64 threads more than doubles performance?
                  We would need to know how those 32 and 64 threads are allocated on cores. Let's look at the two different situations:
                  1) 32 threads all on one package, 64 threads spread across two packages, and no SMT

                  In the 32 thread case, a NUMA-aware allocator would keep all memory allocations local to the CCX (if possible). This would limit memory bandwidth to that of one package. When we go to 64 threads, we get twice the memory bandwidth and twice the cache. This can lead to superlinear scaling when a working set that didn't fit in cache before now fits in the increased cache space. The increase in memory bandwidth on its own allows at most linear scaling, so it's unlikely to be the cause. If the memory allocator isn't NUMA-aware, or if the program in question doesn't have strict memory locality, then communication overheads can cause either sublinear or superlinear scaling. It's hard to analyze that without detailed knowledge of the link bandwidth, link latencies, and how bursty these accesses are.

                  2) 32 threads and 64 threads are both spread across both packages

                  Much like the previous situation, the NUMA awareness of the allocator plays a big role, as does the locality of the data accesses of the program in question. Thermal considerations may also matter in this case, as 32 cores may clock higher when spread across two packages than when confined to one. This would lead to sublinear scaling, as the per-core thermal budget would halve when going from 32 to 64 threads.

                  Summary: There are way too many variables to really be able to tell, and further analysis would require detailed knowledge of the programs in question and better knowledge of the low-level architecture of this processor.
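To illustrate the cache-capacity argument from case 1, here is a toy Python model. Every number in it (hit and miss latencies, 64 MB of cache per package, the working-set sizes) is an assumption chosen for illustration, and the function names are hypothetical. It ignores memory bandwidth, interconnect traffic and clock speeds entirely, so it is a sketch of the mechanism, not a prediction for this server.

```python
def avg_access_ns(working_set_mb: float, cache_mb: float,
                  hit_ns: float = 10.0, miss_ns: float = 100.0) -> float:
    """Average access cost, assuming the hit rate equals the cached share of the working set."""
    if working_set_mb <= 0 or cache_mb < 0:
        raise ValueError("working_set_mb must be positive and cache_mb non-negative")
    hit_rate = min(1.0, cache_mb / working_set_mb)
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns


def throughput(threads: int, packages: int, working_set_mb: float,
               cache_per_package_mb: float = 64.0) -> float:
    """Relative throughput: thread count divided by the average access cost."""
    cache_mb = packages * cache_per_package_mb
    return threads / avg_access_ns(working_set_mb, cache_mb)


def scaling_32_to_64(working_set_mb: float) -> float:
    """Speedup from 32 threads on one package to 64 threads on two packages."""
    return throughput(64, 2, working_set_mb) / throughput(32, 1, working_set_mb)


if __name__ == "__main__":
    for ws in (32.0, 96.0, 1024.0):
        print(f"working set {ws:>6.0f} MB: {scaling_32_to_64(ws):.2f}x")
```

In this model the scaling is exactly 2x when the working set already fits in one package's cache, and it exceeds 2x when the working set only fits once the second package's cache is added.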


                  • #29
                    Originally posted by varikonniemi
                    what magic does stockfish and vgr do when 32->64 threads more than doubles performance?
                    It probably has a lot to do with NUMA.
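One way to see which CPUs belong to which NUMA node on a Linux box is to read the per-node `cpulist` files under `/sys/devices/system/node`. The Python below is a minimal sketch under that assumption; the helper names are made up for this example, and it returns an empty result on systems that do not expose that directory.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel CPU list such as '0-3,8,10-11' into individual CPU numbers."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map NUMA node number to its CPUs; empty dict if the sysfs directory is missing."""
    nodes: dict[int, list[int]] = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return nodes
    for node_dir in root.glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            nodes[int(node_dir.name[len("node"):])] = parse_cpulist(cpulist.read_text())
    return nodes


if __name__ == "__main__":
    for node, cpus in sorted(numa_nodes().items()):
        print(f"node {node}: {len(cpus)} CPUs -> {cpus}")
```

Comparing that map against where the 32 and 64 benchmark threads actually ran would show whether the jump crossed a package or node boundary.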


                    • #30
                      Very interesting benchmark, thank you very much. Interestingly enough, on our dual Epyc 7551 system, we render BMW27 in 54.67 seconds. I wonder what is holding back the Dell server; I would expect a better, not worse, result from an EPYC 7601 system. We also only have eight memory channels populated, as compared to the 16 in the Dell system.

                      The benchmark result in question: https://opendata.blender.org/benchma...b-71687bbdf83e
