A Look At The Windows vs. Linux Scaling Performance Up To 64 Threads With The AMD 2990WX


  • #21
    Originally posted by kollo View Post

    Ohh, thanks for the clarification, much appreciated! Then looks like either Microsoft compiler team or OS kernel team has the work cut out for them (or both).
    I know what you are thinking. I spent a few hours going through some of the tests and ran a few of my own.

    What I am having trouble with is some of these Windows-based benchmarks at other websites. Either they are just "test runners" who know nothing of what they run, or they are not very inclined to investigate their findings. I have seen several shoulder-shrug "oh well" responses to Windows testing anomalies. It's one thing to publish a result, but it's another if you can't explain the why of that result. One site did publish a Windows and Linux side-by-side on a WX, and Windows "won" hands down. But the level of transparency is lacking; I couldn't get certain details on the test setup.

    So, like you, I started examining variances between how an app native to Windows behaves and how an app native to Linux behaves: dependencies, libraries, compiler settings, etc.

    For example, I work with a vendor application stack that is written and compiled for Windows first, then they go back and recompile it for Linux. For years we ran this app stack on Linux and took many months to tune it. When the Linux OS version became obsolete, we went back and worked with the vendor on a target OS plan. That is when they informed us that their stack was compiled Windows first, Linux second, and that the Windows version ran 10-15% faster because of that. The "why" was never revealed due to internal policies, and we wanted that 15%, so we switched the stack to Windows.

    This is why I am looking "under the covers" to see where people are getting their material. Because PTS is 100% open source, I can examine it all.

    Comment


    • #22
      Originally posted by edwaleni View Post
      This is why I am looking "under the covers" to see where people are getting their material. Because PTS is 100% open source, I can examine it all.
      It was funny: a few weeks ago someone on Twitter was suggesting to other review sites that they check out PTS... I think it was Gamers Nexus that responded that their tools were far superior to PTS, but so good that they wouldn't want to release them publicly, along with some other nonsense.

      I guess AnandTech also does some automation of their benchmark runs now, but that is all internal too, and from the sounds of it just a collection of scripts, along with a lot of other nonsense commentary about it.
      Michael Larabel
      http://www.michaellarabel.com/

      Comment


      • #23
        Originally posted by MadCatX View Post
        Anyway, would anybody chip in for a benchmark of various CPU governors and its effect on the TR2? I definitely would.
        No one else so far, but will start running up some scaling governor tests tonight and otherwise just make it a premium-only article.
        Michael Larabel
        http://www.michaellarabel.com/

        Comment


        • #24
          Incredibly interesting results. It's also no wonder so many Windows users scream murder at SMT/HT, when evidently Windows doesn't seem to handle it as elegantly as it could.

          One thing to note is that algorithms like cryptonight are almost entirely cache-bound. As soon as you fill the cache and threads start having to hit memory, your performance will drop. It's the reason why there is some tuning required when running those miners, as it is basically never optimal to throw all of your cores at it.
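
A rough sketch of that tuning arithmetic, in Python. The figures are assumptions that are not in the post: roughly 2 MiB of scratchpad per thread for classic CryptoNight, and 64 MiB of total L3 and 64 hardware threads for the 2990WX. It also ignores that the L3 is split across CCXs, so treat the result as an upper bound rather than a recommended setting.

```python
# Assumed figures (not from the post): adjust for your CPU and algorithm variant.
SCRATCHPAD_MIB = 2      # assumed per-thread working set for classic CryptoNight
L3_CACHE_MIB = 64       # assumed total L3 cache on a 2990WX
HARDWARE_THREADS = 64   # 32 cores with SMT enabled


def cache_resident_threads(l3_mib, scratchpad_mib, hardware_threads):
    """Upper bound on miner threads whose working sets all fit in L3 at once."""
    return min(hardware_threads, l3_mib // scratchpad_mib)


if __name__ == "__main__":
    threads = cache_resident_threads(L3_CACHE_MIB, SCRATCHPAD_MIB, HARDWARE_THREADS)
    print(f"Threads that fit in cache: {threads} of {HARDWARE_THREADS}")
```

Under these assumptions only half of the hardware threads can keep their data in cache, which is one way to see why running a miner on every thread tends not to pay off.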

          Comment


          • #25
            Originally posted by Michael View Post

            No one else so far, but will start running up some scaling governor tests tonight and otherwise just make it a premium-only article.
            Since when are premium-only articles a thing? I wasn't aware of that.

            And maybe you could take a look at tuned at some point?

            Comment


            • #26
              What happens if we measure the same application using the same compiler version and just switch memory consumption and OS? Using the open source chess engine Stockfish, this is possible. The Stockfish benchmark has a "hashsize" parameter which changes an application mostly working in cache into a memory-intensive application.

              Windows 10 results:

              gcc version 7.3.0 (x86_64-posix-seh-rev0, Built by MinGW-W64 project) Thread model: posix
              mingw32-make.exe profile-build ARCH=x86-64-modern

              Stockfish 9.0 compiles using gcc but the executable doesn't work on Windows, so I used the current sources from https://github.com/mcostalba/Stockfish:

              ./stockfishModern.exe bench 128 64 23

              Total time (ms) : 47765
              Nodes searched : 3284542185
              Nodes/second : 68764622

              Now let's increase the hashtable size by a factor of 100 (about 13G):

              ./stockfishModern.exe bench 12800 64 23

              Total time (ms) : 62091
              Nodes searched : 2818109663
              Nodes/second : 45386765

              There is a large performance penalty from increasing the hashtable size.
              Let's try a different compilation option, which is superior on my old Intel 6800K:

              mingw32-make.exe profile-build ARCH=x86-64-bmi2

              ./stockfishBmi2.exe bench 128 64 23

              Total time (ms) : 45391
              Nodes searched : 2634876045
              Nodes/second : 58048424

              ./stockfishBmi2.exe bench 12800 64 23

              Total time (ms) : 64589
              Nodes searched : 2800805673
              Nodes/second : 43363508

              bmi2 causes a slowdown with small hashtables.

              Now let's switch to Linux:

              Linux Ubuntu 18.2, kernel 4.18.3

              gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3) Thread model: posix

              make profile-build ARCH=x86-64-modern

              ./stockfishModern bench 128 64 23
              Total time (ms) : 43776
              Nodes searched : 3183692792
              Nodes/second : 72726900

              ./stockfishModern bench 12800 64 23
              Total time (ms) : 50948
              Nodes searched : 3157018126
              Nodes/second : 61965496

              Much reduced slowdown using large hashtables.

              make profile-build ARCH=x86-64-bmi2

              ./stockfishBmi2 bench 128 64 23

              Total time (ms) : 54809
              Nodes searched : 3256340210
              Nodes/second : 59412509

              ./stockfishBmi2 bench 12800 64 23

              Total time (ms) : 54851
              Nodes searched : 2915996663
              Nodes/second : 53162142

              Same with bmi2, but at a lower level.

              What is interesting is that if I switch to the old Stockfish 9 sources, things change dramatically:

              ./stockfishModern9.0 bench 128 64 23

              Total time (ms) : 43405
              Nodes searched : 3261040664
              Nodes/second : 75130530

              ./stockfishModern9.0 bench 12800 64 23

              Total time (ms) : 71908
              Nodes searched : 3483903614
              Nodes/second : 48449457

              I cannot do this on Windows because gcc 7.3 has problems with the 9.0 sources.
              So it is not only the OS: Stockfish changed something in the handling of its hashtable since 9.0 which improved performance with large tables, but only on Linux.
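
To put the runs above side by side, here is a small Python sketch that recomputes nodes/second from the posted times and node counts and reports how much each build slows down when going from the 128 MB to the 12800 MB hashtable. The numbers are copied from the bench output in this post; the run labels and function names are made up for this sketch.

```python
# (total time in ms, nodes searched) copied from the bench output in this post.
# "small" is the 128 MB hashtable run, "large" is the 12800 MB run.
# The labels are hypothetical names chosen for this sketch.
RUNS = {
    "windows-modern":   {"small": (47765, 3284542185), "large": (62091, 2818109663)},
    "windows-bmi2":     {"small": (45391, 2634876045), "large": (64589, 2800805673)},
    "linux-modern":     {"small": (43776, 3183692792), "large": (50948, 3157018126)},
    "linux-bmi2":       {"small": (54809, 3256340210), "large": (54851, 2915996663)},
    "linux-stockfish9": {"small": (43405, 3261040664), "large": (71908, 3483903614)},
}


def nodes_per_second(time_ms, nodes):
    """Integer nodes per second (floor division)."""
    return nodes * 1000 // time_ms


def slowdown_percent(run):
    """Percentage drop in nodes/second from the small to the large hashtable."""
    small = nodes_per_second(*run["small"])
    large = nodes_per_second(*run["large"])
    return 100.0 * (small - large) / small


if __name__ == "__main__":
    for label, run in RUNS.items():
        print(f"{label:18s} {slowdown_percent(run):5.1f}% slower with the large hashtable")
```

Comparing nodes/second rather than total time matters here, because the number of nodes searched differs from run to run.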

              Comment


              • #27
                The benchmark king returns!

                This is very important data. You can immediately see the effect (on each operating system) of cross CCX communication, as well as how turning on SMT in Windows seems to trash performance generally. How strange.

                Comment


                • #28
                  I pulled out a very old Lenovo D20 (Westmere-EP) with dual X5650s, 12c/24t in total. It had an Ubuntu distro with kernel 4.15 and 48 GB of ECC RAM. I loaded PTS 8 and ran Michael's Windows vs Linux open benchmark file.

                  Of course anything that used/needed vector support got killed as the Westmere did not have AVX, but that is not what I was testing.

                  Slower CPU, slower RAM, only 24 threads. It should have lost every test hands down.

                  It defeated or tied the "stock" Windows Threadripper WX on 9 of the tests. Technically, it shouldn't have come close on anything (and on many it didn't), as that hardware is several generations behind.

                  I can only surmise that Microsoft has an update in the works to better support TR.






                  Comment


                  • #29
                    Originally posted by edwaleni View Post

                    What Netcraft measures is not the same as what Phoronix is measuring. The two don't correlate.
                    They do. Windows doesn't scale. At least when defaults are benchmarked. And don't even think it has better response time than Linux. Stock exchange results speak for themselves. Only morons pay and use slower, less secure OS for serious tasks.
                    Last edited by Pawlerson; 20 August 2018, 02:16 PM.

                    Comment


                    • #30
                      Originally posted by Pawlerson View Post

                      They do. Windows doesn't scale. At least when defaults are benchmarked. And don't even think it has better response time than Linux. Stock exchange results speak for themselves. Only morons pay and use slower, less secure OS for serious tasks.
                      No they don't. While everything you write is true, it has no correlation with the Netcraft numbers, since none of the listed sites there use single servers. I.e., it's not the OS that is making these sites stable; it's their high-availability setup with multiple load balancers and so on that explains 100% of these numbers.

                      Comment
