Hammering The AMD Ryzen 7 1800X With An Intense, Threaded Workload


  • #11
    Nice article. I had not been aware of OPM before this.

    Originally posted by chuckula View Post
    This goes to show the strength of Intel's overall core integration strategy at heavy-duty workloads that aren't just L1-cache centric microbenchmarks.
    Not sure that is what we are seeing here:

Michael At first glance the results suggest memory being the limiting factor - was the 5960X running quad-channel? If so, have you been able to get decently high memory speeds on the Ryzen mobos yet, or are you still running dual-channel 2133 MHz?

At one thread, the upscaling test showed similar single-core performance between the Ryzen 7 1800X and the Xeon E3 1270 v5, which are clocked the same - but this Xeon E3, a quad-core + HT workstation CPU, retails for almost $200 less.
Since this was mentioned twice, I have to make the obligatory rude noise and point out that if one is only running single-thread workloads, the same argument could be made for even cheaper dual-core parts. Parts with similar single-core performance but fewer cores are usually going to be cheaper.

The single-thread run *is* useful as a measurement of single-core performance, but cost/performance comparisons really should be done either at 8 threads or at "best results for each chip", i.e. 8 threads for the Ryzen and 4 threads for the Xeon.
    Last edited by bridgman; 16 March 2017, 03:13 PM.



    • #12
      Originally posted by chuckula View Post
      This goes to show the strength of Intel's overall core integration strategy at heavy-duty workloads that aren't just L1-cache centric microbenchmarks.

      If an AMD part from 2014 had the same margin of victory that we see here over a higher-clocked Intel part with an equivalent core count that was just released this month, then not one person here would be calling the Intel part good, even if it was somewhat cheap (although a $500 chip on a platform that has seen major motherboard support issues isn't exactly "cheap" in any book).

      Incidentally, even if the 5960X out of a new box is still expensive, if you are smart you can find very good open box deals.
1. Not true. I am frequently looking for bang/buck optimums.

2. Since Intel was until now the only player here, it was the only choice to optimize for. Now the picture is changing, and I suspect someone will look into what can be done with core allocations etc. Heck, even current compilers don't have Zen backends.

3. It's of no use to compare the price of a new part with that of one from eBay. Apples and oranges.

4. There is no data about memory frequency, which directly influences CCX communication bandwidth.

5. The code was compiled with -mtune=generic, so far from what one would use when doing some real work.






      • #13
        Originally posted by Brane215 View Post

1. Not true. I am frequently looking for bang/buck optimums.

2. Since Intel was until now the only player here, it was the only choice to optimize for. Now the picture is changing, and I suspect someone will look into what can be done with core allocations etc. Heck, even current compilers don't have Zen backends.

3. It's of no use to compare the price of a new part with that of one from eBay. Apples and oranges.

4. There is no data about memory frequency, which directly influences CCX communication bandwidth.

5. The code was compiled with -mtune=generic, so far from what one would use when doing some real work.

The "problem" with this type of benchmark is that it tries to benchmark a very complex software package without much performance information about the package. The code appears to be very dependent on BLAS performance and on other libraries (SuperLU, UMFPACK, DUNE, MPI), there is no information whatsoever about the compilation flags used for them, and it is usually not a good idea to use them from the distro repositories. Except for some BLAS operations (like dgemm), most of these libraries are much more memory-bandwidth limited than anything else, and if the Core i7 is really accessing memory over four channels it is much better positioned than a processor that can only use two.



        • #14
          Originally posted by chuckula View Post
          Do you know whether or not this package uses AVX heavily?
          For the solver parts, it is questionable how much AVX has to offer, since the load is limited by memory bandwidth. However, for the numerical differentiation, efforts are under way to exploit it:
          https://github.com/OPM/opm-material/pull/213
          Originally posted by bridgman View Post
          Nice article. I had not been aware of OPM before this.

          At first glance the results suggest memory being the limiting factor - was the 5960 running quad channel ? If so, have you been able to get decently high memory speeds on the Ryzen mobos yet, or are you still running dual-channel 2133 MHz ?
Yes, both benches are memory-bandwidth limited when you scale up. The first is a combination of numerical differentiation and an incomplete LU factorization preconditioner combined with a conjugate gradient linear solver. The second is an elliptic problem, and hence uses an algebraic multigrid based solver. Both solver approaches are known to be memory-bandwidth limited and cover a large and important part of HPC, i.e., numerical solutions to partial differential equations. Typically, SMT has nothing to offer on this type of problem.
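The bandwidth bottleneck described above comes down to the sparse matrix-vector product at the heart of both CG and AMG iterations. A minimal illustrative sketch (the CSR layout and the tiny matrix are hypothetical, not OPM's actual code):

```python
import numpy as np

def spmv_csr(indptr, indices, data, x):
    """y = A @ x for a matrix stored in CSR format.

    Per nonzero we move roughly 12 bytes (an 8-byte value plus a
    4-byte column index) for only 2 flops - about 6 bytes/flop, so
    the loop is bound by memory bandwidth, not by the vector ALUs
    (which is why AVX and SMT buy little here).
    """
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        start, end = indptr[row], indptr[row + 1]
        y[row] = data[start:end] @ x[indices[start:end]]
    return y

# 2x2 example: A = [[2, 1], [0, 3]]
indptr  = np.array([0, 2, 3])
indices = np.array([0, 1, 1])
data    = np.array([2.0, 1.0, 3.0])
x = np.array([1.0, 1.0])
print(spmv_csr(indptr, indices, data, x))  # [3. 3.]
```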



          • #15
            Originally posted by Del_ View Post
            For the solver parts, it is questionable how much AVX has to offer, since the load is limited by memory bandwidth. However, for the numerical differentiation, efforts are under way to exploit it:
            https://github.com/OPM/opm-material/pull/213

Yes, both benches are memory-bandwidth limited when you scale up. The first is a combination of numerical differentiation and an incomplete LU factorization preconditioner combined with a conjugate gradient linear solver. The second is an elliptic problem, and hence uses an algebraic multigrid based solver. Both solver approaches are known to be memory-bandwidth limited and cover a large and important part of HPC, i.e., numerical solutions to partial differential equations. Typically, SMT has nothing to offer on this type of problem.
Are you using any packages from the Trilinos library to distribute the mesh across the processes? Is the ILU preconditioner being calculated by the SuperLU library?



            • #16
It would probably be a good idea to add to the article a run of the triad benchmark, which measures memory bandwidth. This would show the possible memory-bandwidth advantage of the Core i7 over the Ryzen.
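For reference, a STREAM-style triad is only a few lines. This numpy version is a rough stand-in, not a substitute for a properly compiled STREAM binary (it will report lower numbers than tuned C), but the byte accounting is the same:

```python
import time
import numpy as np

def triad_bandwidth(n, s=3.0):
    """STREAM triad a[i] = b[i] + s * c[i]; returns (a, estimated GB/s).

    Each element touches three 8-byte doubles (two reads, one write),
    so we count 24 * n bytes per pass; write-allocate traffic is ignored.
    """
    b = np.ones(n)
    c = np.ones(n)
    t0 = time.perf_counter()
    a = b + s * c  # the triad kernel
    dt = time.perf_counter() - t0
    return a, 24.0 * n / dt / 1e9

a, gbs = triad_bandwidth(50_000_000)
# Theoretical peak for dual-channel DDR4-2133 is about 34 GB/s.
print(f"~{gbs:.1f} GB/s")
```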



              • #17
                Originally posted by defaultUser View Post
It would probably be a good idea to add to the article a run of the triad benchmark, which measures memory bandwidth. This would show the possible memory-bandwidth advantage of the Core i7 over the Ryzen.
Or, if we want to reduce the influence of memory bandwidth, run the tests again with only 2 channels enabled on the i7, so that they're on equal footing. If we see a big drop in i7 performance in that case, that bodes well for the Ryzen - i.e., once we have faster memory to play with, DDR4-3200 for example, we can expect a large boost in Ryzen performance.

                This would also bode very well for the upcoming Zen server chips, which will have *eight* channels per socket.



                • #18
If memory bandwidth is the limiting factor, you could test the 5960X with only two channels instead of four.



                  • #19
                    Originally posted by defaultUser View Post

Are you using any packages from the Trilinos library to distribute the mesh across the processes? Is the ILU preconditioner being calculated by the SuperLU library?
Solvers and preconditioners are typically from the Dune project; I will have to do a code dive to see where the ILU part comes from, but I'm not sure SuperLU is used for much of anything. The parallel part is also Dune-based for the Norne bench - Trilinos has no concept of parallel grids afaik, but Dune has. Load balancing is done with Zoltan though, so Trilinos is indeed present :-)



                    • #20
The explanation for the 16-thread result is the four channels of memory on the 5960X.

