A Look At Linux Application Scaling Up To 128 Threads

  • A Look At Linux Application Scaling Up To 128 Threads

    Phoronix: A Look At Linux Application Scaling Up To 128 Threads

    Arriving last week in our Linux benchmarking lab was a dual EPYC server -- this Dell PowerEdge R7425 is a beast of a system with two AMD EPYC 7601 processors yielding a combined 64 cores / 128 threads, 512GB of RAM (16 x 32GB DDR4), and 20 x 500GB Samsung 860 EVO SSDs. There will be many interesting benchmarks from this server in the days and weeks ahead. For some initial measurements during the first few days of stress testing this 2U rack server, here is a look at how well various benchmarks/applications are scaling from two to 128 threads.

    http://www.phoronix.com/vr.php?view=26947

  • #2
    +1 for AMD being back in the HPC game! ;-)



    • #3
      GraphicsMagick seems to scale with log(threads) rather than with the number of threads, unless I'm missing something.

      Many results show a speedup of >> 2 when going from 32 to 64 threads; probably the threads are not being split evenly between the two packages.
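A quick back-of-the-envelope sketch of the log-like flattening (my own illustration, not from the article): under Amdahl's law, even a small serial fraction makes the speedup curve bend over at high thread counts. The 5% serial fraction is an assumed number, purely for illustration:

```python
# Amdahl's-law speedup: a small serial fraction makes scaling look
# roughly logarithmic on a linear thread axis at high thread counts.
def amdahl_speedup(threads, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)

for n in (2, 4, 8, 16, 32, 64, 128):
    # With 5% serial work, 128 threads yield only ~17x, not 128x.
    print(n, round(amdahl_speedup(n, 0.05), 2))
```

Whether GraphicsMagick's curve is really a serial-fraction effect or an I/O/memory bottleneck is a separate question; this only shows the shape is consistent.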



      • #4
        How much would a system like this cost?



        • #5
          What magic do stockfish and vgr perform that going from 32 to 64 threads more than doubles performance?



          • #6
            Did you buy this system or is it on loan from Dell for review purposes? Configuring this server on Dell's online store gets it into the 5 figures very easily just from the CPU and memory options.



            • #7
              Originally posted by Mr.Radar View Post
              Did you buy this system or is it on loan from Dell for review purposes? Configuring this server on Dell's online store gets it into the 5 figures very easily just from the CPU and memory options.
              Review sample to be used for future Linux server benchmarking and other interesting performance tests.
              Michael Larabel
              http://www.michaellarabel.com/



              • #8
                Awesome machine!
                ## VGA ##
                AMD: X1950XTX, HD3870, HD5870
                Intel: GMA45, HD3000 (Core i5 2500K)



                • #9
                  I might have found the reason for the greater-than-expected scaling between 32 and 64 threads.

                  The specification table shows that every configuration up to 32 threads runs at ~2.7GHz, while the 64- and 128-thread configurations run at 3.1GHz. The lower-thread-count configurations might not have had CPU turbo working, hindering their performance.
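Rough arithmetic on that theory: if the 32-thread runs were stuck near 2.7GHz while the 64-thread runs held 3.1GHz, the clock difference alone adds ~15%, so doubling the thread count could plausibly show as more than 2x:

```python
# Check of the clock-speed theory: a 2.7GHz -> 3.1GHz bump stacked on
# a doubled thread count would explain an apparent >2x speedup.
low_clock, high_clock = 2.7, 3.1        # GHz, from the spec table
clock_gain = high_clock / low_clock     # ~1.15x from frequency alone
apparent_speedup = 2 * clock_gain       # ideal doubling * clock gain
print(round(apparent_speedup, 2))       # ~2.3
```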



                  • #10
                    Originally posted by varikonniemi View Post
                    what magic does stockfish and vgr do when 32->64 threads more than doubles performance?
                    It can depend on several factors; threads (or cores) are only part of the problem. If your data is big enough, it can choke the cache hierarchy or even the RAM bandwidth of certain cores on certain NUMA nodes, starve the L3 victim cache, etc.

                    When you see cases where the speedup is more than double, it usually means there is enough parallelism to remove the bandwidth or cache bottleneck, allowing the hardware to handle smaller chunks of data more efficiently.

                    Of course, there are other factors, both related and unrelated to bandwidth or cache, but those usually contribute on a smaller scale.

                    There is also the possibility of a runtime algorithm selector in that application. Sometimes you find a way to make an algorithm really neat and fast, only to realize later that it hits a ceiling and stops scaling -- but until that point it is the fastest implementation you can reach. Then, after months of breaking your head, you realize that the other "slow" algorithm you didn't want to use (because it was slower below that ceiling) is a scalability Chuck Norris and ends up being a lot faster once it can scale. Hence you end up switching algorithms at runtime depending on the size (or any other reasonable parameter) of the dataset, to use the most effective tool for the job.
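To make that last point concrete, here is a minimal sketch of a runtime algorithm selector (the names and the threshold are hypothetical, not from Stockfish or any real application): dispatch to the small-input winner or the scalable path based on dataset size.

```python
# Hypothetical runtime algorithm selector: choose an implementation by
# input size, since the small-N winner may stop scaling past a ceiling.

THRESHOLD = 100_000  # assumed crossover point, found empirically

def sum_direct(data):
    # Simple path: fastest on small inputs.
    return sum(data)

def sum_chunked(data):
    # Stand-in for the "slow but scalable" path: process fixed-size
    # chunks (in a real application these could feed worker threads).
    return sum(sum(data[i:i + 4096]) for i in range(0, len(data), 4096))

def smart_sum(data):
    # Dispatch at runtime on dataset size.
    return sum_direct(data) if len(data) < THRESHOLD else sum_chunked(data)
```

In practice the crossover point would be measured on the target machine, not hard-coded.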

