AMDGPU-PRO vs. NVIDIA OpenCL Performance With ArrayFire Using 18 GPUs


  • #11
    devius A few things. We're trying to estimate GPU performance through ArrayFire. While ArrayFire is a fairly solid library, itself building on top of the most optimized libraries (clMagma, cuBLAS on the CUDA backend), we're going to see results like Cholesky_f64 occasionally. Here are a few factors I think are at play:
    • The 780 Ti looks to be a reskinned, slightly updated GeForce Titan (Kepler) - the Kepler Titan, released Feb 2013, is still famous today (in academic circles) as one of the few good cards for FP64. Since Kepler, f64 performance has basically done nothing but go down by strategic choice/design, with the exception of the Pascal Teslas.
    • The 780 Ti actually has 336 GB/s memory bandwidth vs the 320 GB/s in the 1080; it's possible there are compute unit disparities leaning towards the 780 Ti as well.
    • Kepler was the first superscalar series, but perhaps vectorized implementations performed a bit better (need to confirm clMagma takes such code paths); I've heard this a couple of times before.
    • Next - the 780 Ti's February 2013 date is close to the last update of clMagma. Basically there could be a tuning problem in that they didn't take advantage of some changes, like thread-block size changes or something else, which leads to underutilization of the 1080.
    • And wrt AMD's placement - I suspect clMagma supported AMD GPUs but has an inherent tuning bias towards Nvidia cards. This problem runs deep in academic code, and in linear algebra since it is closely related, although the datedness of clMagma again applies to these cards. OTOH they did quite well on Cholesky_f32, which suggests maybe AMD just sucked hard at fp64. They've been known to since the TeraScale VLIW generations (basically anything GCN or after).
    I suspect that with a CUDA benchmark of the Nvidia cards, compared against Nvidia (and perhaps AMD) OpenCL, we may see things lean more towards Nvidia because cuBLAS is a very frequently tuned, optimized library/money maker, but if they don't, that means it's hardware manifesting the behavior, not software.
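For reference, the two bandwidth figures in the second bullet follow from memory bus width times per-pin data rate. A minimal sketch of that arithmetic, assuming the commonly published specs (384-bit GDDR5 at 7 Gbps for the 780 Ti, 256-bit GDDR5X at 10 Gbps for the 1080); the function name is only for illustration:

```python
def peak_bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Theoretical peak memory bandwidth in GB/s.

    bus_width_bits: width of the memory interface in bits.
    data_rate_gbps: effective per-pin data rate in Gbit/s.
    """
    # Divide by 8 to convert bits to bytes.
    return bus_width_bits / 8 * data_rate_gbps


# Published specs, treated here as assumptions rather than measurements.
gtx_780_ti = peak_bandwidth_gb_s(384, 7.0)   # GDDR5
gtx_1080 = peak_bandwidth_gb_s(256, 10.0)    # GDDR5X

print(f"780 Ti: {gtx_780_ti:.0f} GB/s, 1080: {gtx_1080:.0f} GB/s")
```

These are theoretical peaks; what a kernel actually sustains depends on access patterns and will be lower.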

    Comment


    • #12
      nevion
      1) The GFOR Sum numbers for AMD GPUs seem way off. There has to be something wrong with the benchmark suite.

      2) The OpenCL backend uses a heavily modified version of clMagma, while the CUDA backend uses the cuSolver library to implement dense linear algebra functions.

      3) I have never seen someone say Kepler Titan is "superscalar". This is the closest thing I can find: http://stackoverflow.com/questions/2...ut-also-supers

      devius if you have any questions about ArrayFire, drop by the chat room: https://gitter.im/arrayfire/arrayfire

      Source: I was the Chief Engineer of ArrayFire and heavily contributed to the library until September 2016.

      Comment


      • #13
        pavanky

        1) Help me find the problem: go clone https://github.com/nevion/arrayfire-benchmark - the enabling core benchmark utility - and see if you can reproduce and narrow it down. It may take me a couple of days to get around to it, and I'm not as knowledgeable about GFOR magic as I am about general routines. Note I am building ArrayFire locally against https://github.com/nevion/arrayfire/tree/pts for the arrayfire-pts test suite, and that could be related.

        2) You are technically correct (the best kind of correct), but cuSolver in turn heavily relies on cuBLAS for dense linear algebra. Point taken, but the original point stands either way, for both libraries. I'm not sure specifically what cuSolver's Cholesky routines actually do, as I wasn't able to find source for it. I'll take a look at the local clMagma to determine what's modified, for posterity.

        3) It seems to be a debated application of the term for Kepler, but there are plenty of people saying something like it. My point is that starting with this generation you do not have to use vectorized instructions for great performance, but it may still be beneficial.

        Last edited by nevion; 01-26-2017, 12:59 PM.

        Comment


        • #14
          Originally posted by Marc Driftmeyer View Post
          Why do we give a crap about AMD OpenCL on Linux until the stack is on par with Windows?
          Not to be that guy, but I really don't give a damn what is happening on Windows. Outside of work the only thing I care about are UNIX-like platforms (yes, that includes macOS), even if that means we are looking at results from less mature software.

          Comment


          • #15
            Originally posted by devius View Post
            The results are all over the place. And not just for AMD, but in some tests the 780ti beats the 1080, while in others the 780ti is in last place while the 1080 is in first place. Not sure if these results are meaningful in any way...
            They are meaningful if you are a developer looking for the "current" best platform to run your software on. Beyond that, it does blow up the common idea on the net that Nvidia is always the performance choice; it isn't, and often AMD GPUs are highly suitable to a problem.

            The only thing that really bothers me here is the GFOR Nvidia numbers which are odd to say the least.

            Comment


            • #16
              Originally posted by nevion View Post
              Wow, Fury trumped Nvidia on FFTs 1d and 2d! I'm disappointed in AMD sort performance - it's worth looking into whether AMD Bolt is as slow as ArrayFire's. I suspect the linear algebra routines need tuning and carry Nvidia tuning bias. Not sure what's up with GFOR, that's a weird routine anyway. I'll also note the ELWISE benchmarks seem mostly to go to Nvidia - I suspect it's mainly due to newer compute architectures, as it was mostly 980s and 1080s that beat out all the AMD options, and the operations for ELWISE are supposed to be dirt simple. I'm actually quite surprised by those, as I suspected they'd be memory-bandwidth dominated.

              Fantastic results, and while the meaning may be dubious, encore for ROCm and CUDA (not OpenCL)!
              I run a compute farm and from my experience AMD trashes nVIDIA in terms of price/performance ratio, and has for a long time. The trend is that nVIDIA cuts back on compute performance for the sake of energy efficiency, and sadly in order to stay competitive AMD follows, but it's still a much better value, all the way from video processing to structural analysis. The fact that nVIDIA is stuck at OpenCL 1.2 ain't helping either.

              Also, this site has an obvious bias, to say the least, towards nVIDIA. I guess AMD is simply not in a position to pay for publicity as much as the competition.
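On the quoted expectation that the ELWISE tests should be memory-bandwidth dominated: if that holds, runtime is bounded below by bytes moved divided by peak bandwidth. A rough sketch of that bound, where the function name and the problem size are assumptions for illustration, and 320 GB/s is the GTX 1080's theoretical peak:

```python
def min_elementwise_seconds(n_elements, bytes_per_element, n_arrays, bandwidth_gb_s):
    """Lower bound on runtime if the op is purely memory-bandwidth bound.

    n_arrays counts every array read or written (c = a + b touches 3).
    """
    bytes_moved = n_elements * bytes_per_element * n_arrays
    return bytes_moved / (bandwidth_gb_s * 1e9)


# c = a + b on 2**24 float32 values at an assumed 320 GB/s peak.
n = 2 ** 24
t = min_elementwise_seconds(n, 4, 3, 320.0)
print(f"lower bound: {t * 1e6:.0f} microseconds")
```

If measured ELWISE times sit well above this bound on a given card, something other than raw bandwidth (launch overhead, kernel codegen, problem size too small) is likely deciding the ranking.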

              Comment


              • #17
                ddriver - I don't mean the bias you see out and about on the web with people picking sides and getting cash for it one way or another; I mean a more endemic one baked into source code development/testing that happens even in academic and business software. Tuning (work groups or shared memory usage) is probably the most frequent offender.
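One concrete way this kind of tuning bias shows up: a work-group size picked to match Nvidia's 32-thread warp leaves lanes idle on AMD GCN, which schedules 64-wide wavefronts. A rough sketch of the effect, with a hypothetical helper name and ignoring everything else that affects occupancy (registers, local memory, latency hiding):

```python
import math


def lane_utilization(work_group_size, simd_width):
    """Fraction of SIMD lanes doing useful work for one work-group.

    A work-group runs as ceil(size / simd_width) hardware
    warps/wavefronts, each occupying simd_width lanes.
    """
    waves = math.ceil(work_group_size / simd_width)
    return work_group_size / (waves * simd_width)


NVIDIA_WARP = 32      # threads per warp
GCN_WAVEFRONT = 64    # work-items per wavefront on AMD GCN

# Compare how the same work-group size fills lanes on each vendor.
for size in (32, 64, 96, 128):
    print(size,
          lane_utilization(size, NVIDIA_WARP),
          lane_utilization(size, GCN_WAVEFRONT))
```

A size of 32 or 96 fills every lane on a 32-wide machine but only part of a 64-wide one, so code tuned and tested only on Nvidia can quietly underuse AMD hardware.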

                Comment


                • #18
                  Wasn't AMD the preferred platform for OpenCL-based Bitcoin farming, before the cards were replaced with dedicated 'calculators'?

                  I asked before about the maturity (I'm ignorant of the history...) of the AMD Pro drivers compared to nVidia's long-running set, and these farmers came to mind. I don't recall them using the GCN stuff or Pro.

                  Comment
