A Look At The Windows vs. Linux Scaling Performance Up To 64 Threads With The AMD 2990WX

    Phoronix: A Look At The Windows vs. Linux Scaling Performance Up To 64 Threads With The AMD 2990WX

    This past week we looked at the Windows 10 vs. Linux performance for AMD's just-launched Ryzen Threadripper 2990WX and, given the interest in that, then ran some Windows Server benchmarks to see if the performance of this 64-thread CPU would be more competitive with Linux. From those Windows vs. Linux tests there has been much speculation that the performance disparity is due to the Windows scheduler being less optimized for high core/thread count processors and its NUMA awareness being less vetted than the Linux kernel's. To get a better idea, here are benchmarks of Windows Server 2019 preview versus Ubuntu Linux when testing varying thread/core counts for the AMD Threadripper 2990WX.

    http://www.phoronix.com/vr.php?view=26736

  • #2
    Minor typo:

    Originally posted by phoronix View Post
    9.0x increase from 4 threads to 64 threads for WIndows
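
    For context on the quoted figure: going from 4 to 64 threads multiplies the thread count by 16, so a 9.0x increase is well short of linear scaling. A minimal sketch of that arithmetic in C follows; the function name `scaling_efficiency` is made up for illustration and does not come from the article or its test scripts.

```c
#include <assert.h>

/* Fraction of ideal linear scaling achieved when moving from
   threads_from to threads_to threads. 1.0 means perfectly linear;
   values below 1.0 mean sub-linear scaling.
   Hypothetical helper, for illustration only. */
static double scaling_efficiency(double speedup, int threads_from, int threads_to)
{
    assert(threads_from > 0 && threads_to > 0);
    /* Ideal speedup equals the ratio of thread counts, e.g. 64 / 4 = 16. */
    double ideal = (double)threads_to / (double)threads_from;
    return speedup / ideal;
}
```

    With the numbers in the quote, `scaling_efficiency(9.0, 4, 64)` gives 9.0 / 16 = 0.5625, i.e. roughly 56% of ideal linear scaling.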


    • #3
      This is some great data, thanks for producing this. The first page shows that GCC 7.1.0 was used to build on Windows. Using a compiler on Windows that has traditionally been tuned for Linux might bias the data. Reading the command-line flags, the benchmarks also use the -lpthread (POSIX threads) implementation on Windows instead of the native Windows threads implementation. The user-space threading library implementation matters greatly as well, an aspect that is easy to overlook. Given that there is a constant-factor bias at all thread counts, and a dramatic drop at SMT, it is not in dispute that the particular "Ubuntu Linux 18.04 OS * Linux GCC 7.1.0 * POSIX threading stack * AMD Threadripper CPU" combination is faster across the board on all tested software when pitted against the "Windows 10 * Linux GCC 7.1.0 ported to Windows * POSIX threading stack ported to Windows * AMD Threadripper CPU" combination. However, there are so many variables that concluding the Windows 10 kernel must be at fault is a jump at this point, given the compilers at play. It could be that the Windows 10 scheduler is crap, or it could just be that the GCC+pthreads port on Windows is the part that needs some optimization love. (We just cannot know from this data alone.)

      It would be great to see scaling benchmarks with GCC and the POSIX threads implementation swapped out for the native VS2017 compiler and the native Windows threading stack, which are traditionally tuned for Windows, rather than the Linux-oriented GCC compiler and Linux-oriented POSIX pthreads stack. I'd love to point and laugh at Microsoft, but it must be admitted that running a POSIX threading stack on Windows is essentially emulation, similar to running Direct3D games on Linux. It would not be surprising if in the end the root issue is found there rather than in the Windows 10 scheduler. The scheduler could be as much of a turd as claimed, but then again, the part needing optimization could be the Windows port of GCC + the POSIX pthreads emulation stack. (Or both could be at fault, with the performance issues being a few % here and a few % there.)

      It would also be great to see in that combo the AMD Threadripper CPU switched out with an Intel 7980XE CPU - does the same performance scaling differential apply (on Windows vs Linux)?

      Anyhow, great data again, thanks!


      • #4
        Originally posted by kollo View Post
        This is some great data, thanks for producing this. The first page shows that GCC 7.1.0 was used to build on Windows. [...]

        GCC 7.1 was on the system, but all the tests in this article used the official Windows binaries for the programs; offhand, I don't believe any of them are source-only on Windows. That system table basically shows what hardware/software is detected on the system under test.

        Edit: also for anyone new here, all of the test scripts can be explored via https://openbenchmarking.org/
        Last edited by Michael; 08-19-2018, 01:27 PM.
        Michael Larabel
        http://www.michaellarabel.com/


        • #5
          Originally posted by kollo View Post
          This is some great data, thanks for producing this. The first page shows that GCC 7.1.0 was used to build on Windows. Using a compiler for Windows that has been traditionally tuned for Linux operation might bias the data. Reading the command line flags, the benchmarks use -lpthread (POSIX threads) implementation on Windows as well, instead of native Windows threads implementation. [...]

          In exactly what way would GCC provide optimized binaries tuned especially for Linux? And AFAIK the pthreads library in gcc/msys2 is just a wrapper library around the native win32 threading library so it should have no impact on scaling except the small overhead of an extra function call when you call locking primitives since the library is implemented as:

          Code:
           
           typedef CRITICAL_SECTION pthread_mutex_t;

           static int pthread_mutex_lock(pthread_mutex_t *m)
           {
               EnterCriticalSection(m);
               return 0;
           }

           static int pthread_mutex_unlock(pthread_mutex_t *m)
           {
               LeaveCriticalSection(m);
               return 0;
           }
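
          To put a rough number on the "small overhead of an extra function call" point, here is a hedged sketch of a portable shim in the same spirit: it maps onto CRITICAL_SECTION on Windows and onto pthread_mutex_t elsewhere, then times uncontended lock/unlock pairs on a single thread. All `shim_*` names and both helper functions are hypothetical and are not taken from any real pthreads port or from the benchmarks in the article. It says nothing about contended scaling, which is what the article measures.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Thin shim: each call forwards directly to the platform primitive. */
#ifdef _WIN32
#include <windows.h>
typedef CRITICAL_SECTION shim_mutex_t;
static void shim_init(shim_mutex_t *m)    { InitializeCriticalSection(m); }
static void shim_lock(shim_mutex_t *m)    { EnterCriticalSection(m); }
static void shim_unlock(shim_mutex_t *m)  { LeaveCriticalSection(m); }
static void shim_destroy(shim_mutex_t *m) { DeleteCriticalSection(m); }
#else
#include <pthread.h>
typedef pthread_mutex_t shim_mutex_t;
static void shim_init(shim_mutex_t *m)    { pthread_mutex_init(m, NULL); }
static void shim_lock(shim_mutex_t *m)    { pthread_mutex_lock(m); }
static void shim_unlock(shim_mutex_t *m)  { pthread_mutex_unlock(m); }
static void shim_destroy(shim_mutex_t *m) { pthread_mutex_destroy(m); }
#endif

/* Take and release the lock n times on one thread (no contention).
   Returns the number of increments performed under the lock. */
static long locked_increments(long n)
{
    assert(n >= 0);
    shim_mutex_t m;
    volatile long counter = 0;  /* volatile so the loop is not optimized away */

    shim_init(&m);
    for (long i = 0; i < n; i++) {
        shim_lock(&m);
        counter++;
        shim_unlock(&m);
    }
    shim_destroy(&m);
    return counter;
}

/* Average time per uncontended lock/unlock pair, in seconds,
   as reported by clock(). Returns 0.0 when n is 0. */
static double seconds_per_lock_pair(long n)
{
    clock_t start = clock();
    long done = locked_increments(n);
    clock_t end = clock();

    if (done <= 0)
        return 0.0;
    return ((double)(end - start) / CLOCKS_PER_SEC) / (double)done;
}
```

          Calling `seconds_per_lock_pair` with a few million iterations from a small `main` gives a per-pair cost for the wrapped primitive on whichever platform it is built. Because the shim functions are `static` one-liners, an optimizing compiler will usually inline them, in which case the wrapper itself adds no call overhead at all.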


          • #6
            Was Windows Server's power management set to max performance before testing? By default it's set to run on quite conservative CPU clocks if at all possible.

            If "yes" then Microsoft has quite a lot to think about.
            If "No", then results would not reflect the performance differences correctly.


            • #7
              Interesting data.
              Even if there are settings that would improve the results on Windows, these benchmarks were run on stock system settings (I guess?).
              Optimization costs time and money, so in this case Linux is the better choice for out-of-the-box performance.
              It would be nice to see if these differences also exist on high-end Intel CPUs.


              • #8
                Pretty interesting results - I sure wasn't expecting 64 threads to have such a major performance regression over 32 in so many tests.


                • #9
                  Originally posted by Michael View Post

                  GCC 7.1 was on the system but for all the tests in this article, they were using the official Windows binaries for the programs, offhand I don't believe any of them are only source based on Windows. For that system table, it basically shows what hardware/software is detected on the system under test.

                  Edit: also for anyone new here, all of the test scripts can be explored via https://openbenchmarking.org/
                  Ohh, thanks for the clarification, much appreciated! Then it looks like either the Microsoft compiler team or the OS kernel team has its work cut out for it (or both).


                  • #10
                    Originally posted by aht0 View Post
                    Was Windows Server's power management set to max performance before testing?
                    Michael, I'd be interested in hearing this too. It's currently one of the biggest critiques people have of these benchmarks, along with reports that the Windows binaries are not current and not compiled with Visual Studio.
