Hammering The AMD Ryzen 7 1800X With An Intense, Threaded Workload


  • Del_
    replied
    Originally posted by defaultUser View Post

    Are you using any packages from the Trilinos library to distribute the mesh across the processes? Is the ILU preconditioner computed by the SuperLU library?
    Solvers and preconditioners are typically from the Dune project; I will have to do a code dive to see where the ILU part is from, but I'm not sure SuperLU is used for much of anything. The parallel part is also Dune-based for the Norne bench. Trilinos has no concept of parallel grids afaik, but Dune has. Load balancing is done with Zoltan though, so Trilinos is indeed present :-)



  • Filiprino
    replied
    If memory bandwidth is not enough, you could test the 5960X with only two channels instead of four.



  • torsionbar28
    replied
    Originally posted by defaultUser View Post
    It would probably be a good idea to add a run of the triad benchmark, which measures memory bandwidth, to the article. This would show the possible memory-bandwidth advantage of the Core i7 over the Ryzen.
    Or, if we want to reduce the influence of memory bandwidth, run the tests again with only two channels enabled on the i7, so that they're on equal footing. If we see a big drop in i7 performance in that case, that bodes well for Ryzen: once we have faster memory to play with, DDR4-3200 for example, we can expect a large boost in Ryzen performance.

    This would also bode very well for the upcoming Zen server chips, which will have *eight* channels per socket.



  • defaultUser
    replied
    It would probably be a good idea to add a run of the triad benchmark, which measures memory bandwidth, to the article. This would show the possible memory-bandwidth advantage of the Core i7 over the Ryzen.
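
    For reference, the triad being suggested is presumably the STREAM triad kernel. A minimal Python sketch of it (illustrative only; the real benchmark is a tuned C program operating on arrays far larger than cache and reporting MB/s):

```python
# STREAM "triad" access pattern: a[i] = b[i] + q * c[i].
# With 8-byte doubles this moves ~24 bytes per 2 flops, so the
# kernel's throughput is set by memory bandwidth, not compute.
def triad(a, b, c, q):
    """In-place triad update: a[i] = b[i] + q * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]
    return a

def triad_bytes_moved(n, elem_size=8):
    """Idealized traffic estimate: three arrays of n elements touched."""
    return 3 * n * elem_size
```

    Measuring this on both CPUs would directly expose the quad-channel vs dual-channel difference the thread is speculating about.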



  • defaultUser
    replied
    Originally posted by Del_ View Post
    For the solver parts, it is questionable how much AVX has to offer, since the load is limited by memory bandwidth. However, for the numerical differentiation, efforts are under way to exploit it:
    This is a work-in-progress PR which demonstrates possible performance improvements for flow_ebos, originally brought up by @andlaus. When specializing the Evaluation class for a given number of...


    Yes, both benches are memory-bandwidth limited when you scale up. The first is a combination of numerical differentiation and an incomplete LU factorization preconditioner combined with a conjugate gradient linear solver. The second is an elliptic problem, and hence uses an algebraic multigrid based solver. Both solver approaches are known to be memory-bandwidth limited and cover a large and important part of HPC, i.e., numerical solutions to partial differential equations. Typically, SMT has nothing to offer on these types of problems.
    Are you using any packages from the Trilinos library to distribute the mesh across the processes? Is the ILU preconditioner computed by the SuperLU library?



  • Del_
    replied
    Originally posted by chuckula View Post
    Do you know whether or not this package uses AVX heavily?
    For the solver parts, it is questionable how much AVX has to offer, since the load is limited by memory bandwidth. However, for the numerical differentiation, efforts are under way to exploit it:
    This is a work-in-progress PR which demonstrates possible performance improvements for flow_ebos, originally brought up by @andlaus. When specializing the Evaluation class for a given number of...

    Originally posted by bridgman View Post
    Nice article. I had not been aware of OPM before this.

    At first glance the results suggest memory is the limiting factor - was the 5960X running quad-channel? If so, have you been able to get decently high memory speeds on the Ryzen mobos yet, or are you still running dual-channel 2133 MHz?
    Yes, both benches are memory-bandwidth limited when you scale up. The first is a combination of numerical differentiation and an incomplete LU factorization preconditioner combined with a conjugate gradient linear solver. The second is an elliptic problem, and hence uses an algebraic multigrid based solver. Both solver approaches are known to be memory-bandwidth limited and cover a large and important part of HPC, i.e., numerical solutions to partial differential equations. Typically, SMT has nothing to offer on these types of problems.
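
    To see why both solver families end up bandwidth-bound: the dominant kernel in ILU-preconditioned CG and in algebraic multigrid alike is a sparse matrix-vector product. A minimal CSR sketch (illustrative, not the Dune implementation):

```python
# Compressed-sparse-row matrix-vector product y = A*x, the workhorse
# of Krylov and multigrid solvers. Each nonzero costs 2 flops but
# roughly 12+ bytes of traffic (8-byte value + 4-byte column index),
# so the loop streams memory rather than stressing the FPU - which is
# also why SMT gains little here.
def csr_spmv(row_ptr, col_idx, values, x):
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y
```

    For the 2x2 matrix [[2, 0], [1, 3]] stored as row_ptr=[0, 1, 3], col_idx=[0, 0, 1], values=[2, 1, 3], multiplying by x=[1, 2] gives y=[2, 7].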



  • defaultUser
    replied
    Originally posted by Brane215 View Post

    1. Not true. I am frequently looking for bang/buck optimums.

    2. Since Intel was until now the only player here, it was the only choice to optimize for. Now the picture is changing, and I suspect someone will look into what can be done with core allocations etc. Heck, even current compilers don't have Zen backends.

    3. It's of no use to compare the price of a new part with one from eBay. Apples and oranges.

    4. There is no data about memory frequency, which directly influences CCX communication bandwidth.

    5. The code was compiled with -mtune=generic, so far from what one would use for real work.

    The "problem" with this type of benchmark is that it tries to benchmark a very complex software package without much performance information about the package. The code appears to be very dependent on BLAS performance and other libraries (SuperLU, UMFPACK, DUNE, MPI), and there is no information whatsoever about the compilation flags used for them; it is also usually not a good idea to use them from the distro repositories. Except for some BLAS operations (like dgemm), most of these libraries are much more memory-bandwidth limited than anything else, and if the Core i7 is really accessing memory using four channels, it is much better positioned than a processor that can only use two.
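
    The dgemm exception can be made concrete with a back-of-envelope arithmetic-intensity comparison (idealized: cache effects and index traffic ignored, numbers illustrative only):

```python
# Arithmetic intensity = flops per byte of idealized DRAM traffic.
# dgemm reuses each matrix element ~n times, so its intensity grows
# with n and the kernel becomes compute-bound; a streaming kernel
# like triad stays at a tiny constant and is bandwidth-bound.
def dgemm_intensity(n):
    flops = 2 * n**3              # n x n matmul: 2*n^3 flops
    bytes_moved = 3 * n * n * 8   # three n x n matrices of doubles
    return flops / bytes_moved

def triad_intensity():
    return 2 / 24                 # 2 flops per 24 bytes of traffic
```

    For n = 1000 this gives roughly 83 flops/byte for dgemm versus about 0.08 for triad, three orders of magnitude apart, which is why extra memory channels matter so much more for the solver libraries than for dense BLAS.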



  • Brane215
    replied
    Originally posted by chuckula View Post
    This goes to show the strength of Intel's overall core integration strategy on heavy-duty workloads that aren't just L1-cache-centric microbenchmarks.

    If an AMD part from 2014 had the same margin of victory that we see here over a higher-clocked Intel part with an equivalent core count that was just released this month, then not one person here would be calling the Intel part good, even if it was somewhat cheap (although a $500 chip on a platform that has seen major motherboard support issues isn't exactly "cheap" in any book).

    Incidentally, even if the 5960X out of a new box is still expensive, if you are smart you can find very good open box deals.
    1. Not true. I am frequently looking for bang/buck optimums.

    2. Since Intel was until now the only player here, it was the only choice to optimize for. Now the picture is changing, and I suspect someone will look into what can be done with core allocations etc. Heck, even current compilers don't have Zen backends.

    3. It's of no use to compare the price of a new part with one from eBay. Apples and oranges.

    4. There is no data about memory frequency, which directly influences CCX communication bandwidth.

    5. The code was compiled with -mtune=generic, so far from what one would use for real work.






  • bridgman
    replied
    Nice article. I had not been aware of OPM before this.

    Originally posted by chuckula View Post
    This goes to show the strength of Intel's overall core integration strategy on heavy-duty workloads that aren't just L1-cache-centric microbenchmarks.
    Not sure that is what we are seeing here:

    Michael At first glance the results suggest memory is the limiting factor - was the 5960X running quad-channel? If so, have you been able to get decently high memory speeds on the Ryzen mobos yet, or are you still running dual-channel 2133 MHz?

    At one thread, the upscaling test showed similar single-core performance between the Ryzen 7 1800X and Xeon E3 1270 v5 that are clocked the same, but with this Xeon E3 part retailing for almost $200 less for this quad-core + HT workstation CPU.
    Since this was mentioned twice, I have to make the obligatory rude noise and point out that if one is only running single-thread workloads the same argument could be made for even cheaper dual-core parts. Parts with similar single-core performance but fewer cores are usually going to be cheaper.

    The single-thread run *is* useful as a measurement of single-core performance, but cost/performance comparisons really should be done at either 8 threads or at "best results for each chip", i.e. 8 threads for Ryzen and 4 threads for the Xeon.
    Last edited by bridgman; 16 March 2017, 03:13 PM.



  • chuckula
    replied
    Hey Michael, thanks for the in-depth analysis.

    Two questions:
    1. Do you know whether or not this package uses AVX heavily?

    2. Would it be possible to show some power-consumption numbers during one of the longer runs that uses all of the cores? [just the 5960X and 1800X would be fine]
    Last edited by chuckula; 16 March 2017, 02:18 PM.

