
Looking At The OpenCL Performance Of ATI & NVIDIA On Linux


  • #11
    MandelGPU might suffer from something similar (redefining __constant to __global), but I haven't checked.


    • #12
      Hmm. On my HD5970 for SmallPT 1.6 GPU Caustic3, I'm getting 45200 KSamples/sec on the GPU, and ~16000 KSamples/sec on my Core i7 920. Neither part is overclocked; they're at their factory default clock rates.

      The GPU number is lower than either of Michael's Radeons, but still a good deal faster than Michael's GT 240. The numbers seem unaffected by whether compiz is on. I find it hard to accept that an HD5970 gets poorer results than a 5770. Even if an HD5970 is two 5850 cores together, shouldn't even one of those cores single-handedly outperform a 5770? And wouldn't OpenCL have the smarts to use both cores automatically to make it nearly twice as fast?

      I noticed something funky about the tests, though. While the test is running, the on-screen output says something like 52000 KSamples/sec at the bottom. This is substantially larger than the 45000 KSamples/sec reported by PTS in the output. I'm not sure why there is such a large discrepancy. Bug in PTS? Bug in the test?

      Either way, it seems (disappointingly) that an HD5970 is only about 3 times faster at this test than a Core i7? It is probably more economical to use a bunch of CPUs than to use GPUs for this kind of workload, seeing how a Core i7 is much cheaper than a dual-GPU HD5970. We already know from other tests that a GPU is many, many, many times faster than the CPU at OpenGL 3D rendering, so maybe the parts needed for general-purpose GPGPU are kept to a modest level on Evergreen in order to support top-of-the-line 3D graphics. I'm not complaining, since I don't use GPGPU for anything other than PTS.
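
To put the figures quoted above side by side, here is a minimal sketch of the arithmetic. The variable and function names are made up for illustration, and the on-screen value is only the approximate number mentioned in the post. With these inputs the GPU comes out at roughly 2.8x the CPU, and the PTS figure sits about 13% below the on-screen one.

```python
# Figures quoted in the post (KSamples/sec). All names here are illustrative.
gpu_pts = 45200.0       # HD5970 result as reported by PTS
cpu_pts = 16000.0       # Core i7 920 result as reported by PTS
gpu_pts_rounded = 45000.0   # rounded PTS figure used in the discrepancy remark
gpu_onscreen = 52000.0  # approximate figure shown in the test's own window


def speedup(fast, slow):
    """How many times larger `fast` is than `slow`."""
    return fast / slow


def relative_gap(reported, observed):
    """Fraction by which `reported` falls short of `observed`."""
    return (observed - reported) / observed


print(f"GPU vs CPU speedup: {speedup(gpu_pts, cpu_pts):.3f}x")
print(f"PTS vs on-screen gap: {relative_gap(gpu_pts_rounded, gpu_onscreen):.1%}")
```

A gap of that size could plausibly come from startup or warm-up time being included in one average and not the other, but that is a guess and not something the numbers alone can confirm.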


      • #13
        Hmm, also interesting: I got 31011133.23 average for the mandelGPU test. Although the test was not running at 1920x1080, but rather the default resolution, so that may account for the difference.

        If the resolution isn't important though, as it sometimes isn't, then I'm getting about 1.5x the performance of the GTX 460 with the HD5970. This is more in line with what I was expecting.


        • #14
          Originally posted by allquixotic
          Hmm. On my HD5970 for SmallPT 1.6 GPU Caustic3, I'm getting 45200 KSamples/sec on the GPU, and ~16000 KSamples/sec on my Core i7 920. Neither part is overclocked; they're at their factory default clock rates.
          16000 KSamples/sec sounds too good to be true for a CPU, seriously.

          And wouldn't OpenCL have the smarts to use both cores automatically to make it nearly twice as fast?
          No. The GPUs are separate OpenCL devices and the program needs to explicitly use each of them.
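
As a rough illustration of what "explicitly use" means: in a real host program each GPU would show up as its own entry from `clGetDeviceIDs`, and the application would have to create a command queue per device and hand each one a share of the work. The sketch below makes no OpenCL calls at all; it only shows the partitioning step in plain Python, and `split_range` is a hypothetical helper, not part of any OpenCL binding.

```python
def split_range(global_size, num_devices):
    """Split [0, global_size) into contiguous (offset, count) chunks, one per device.

    Any remainder is spread over the first few devices so chunk sizes
    differ by at most one work-item.
    """
    if num_devices < 1:
        raise ValueError("need at least one device")
    base, extra = divmod(global_size, num_devices)
    chunks = []
    offset = 0
    for i in range(num_devices):
        count = base + (1 if i < extra else 0)
        chunks.append((offset, count))
        offset += count
    return chunks


# Assumed example: one 1920x1080 frame divided between the two GPUs of an HD5970.
for device_index, (offset, count) in enumerate(split_range(1920 * 1080, 2)):
    print(f"device {device_index}: work-items {offset}..{offset + count - 1}")
```

A program that only ever opens the first device would leave the second chunk unprocessed, which is consistent with a dual-GPU card showing single-GPU numbers.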


          • #15
            Originally posted by ssam
            do any of those benchmarks use double precision floating point?

            (and as always: it would be nice if you could put error bars on the plots)
            You need at least an HD5830 to test double precision floating point on the AMD side anyway; the 5700 series and lower only support single precision.
            Those who would give up Essential Liberty to purchase a little Temporary Safety,deserve neither Liberty nor Safety.
            Ben Franklin 1755


            • #16
              Nvidia does the same shit: DP is only supported on higher-end cards, and on consumer Fermi cards DP performance has even been artificially reduced.

              AMD doesn't yet properly support double precision anyway; it only works with an AMD-specific OpenCL extension that isn't compatible with the standard extension for DP.


              • #17
                AMD's DP isn't even IEEE compatible

                NVidia does this business jigjag shit because of the lack of competition. sigh...


                • #18
                  not terribly useful

                  Although it's probably better than nothing, these tech demos hardly constitute reasonable benchmarks.

                  Despite aims to the contrary, with OpenCL you really need to tailor the code to suit each device individually - otherwise you can end up with huge performance differences. Orders of magnitude differences. Even the few benchmarks you have demonstrate this with the huge relative swings on the same hardware.

                  It would be useful to analyse the code in question to determine why it might be running so differently on a given architecture. For example, on GPUs vectorised code makes no difference, but on Intel it makes some difference and on Cell it should make a huge difference. On Intel you get no memory access coalescing or local memory, Nvidia has L1 cache for array accesses, ATI has more registers, etc. The wavefront sizes vary between devices. And some devices can run multiple work queues simultaneously.
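
One small, concrete instance of per-device tailoring is the launch size. The sketch below pads a global work size up to a multiple of the work-group size and reports how many lanes end up idle. The helper names are hypothetical, and the group sizes in the example (64, matching an ATI wavefront, and 32, matching an Nvidia warp) are assumptions about typical hardware of this generation, not values read from any driver.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    if multiple < 1:
        raise ValueError("multiple must be positive")
    return ((n + multiple - 1) // multiple) * multiple


def idle_fraction(work_items, group_size):
    """Share of launched work-items that are padding and do no useful work."""
    padded = round_up(work_items, group_size)
    return (padded - work_items) / padded


# Assumed group sizes: 64 (ATI wavefront) and 32 (Nvidia warp).
for label, group_size in (("wavefront of 64", 64), ("warp of 32", 32)):
    padded = round_up(1000, group_size)
    print(f"{label}: launch {padded} work-items, "
          f"{idle_fraction(1000, group_size):.1%} idle")
```

The padding cost here is small, but the same kernel with a hard-coded group size tuned for one vendor can land on an awkward multiple for another, which is one of several ways identical code ends up with different efficiency per device.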


                  • #19
                    Originally posted by vrodic
                    Maybe Intel has such implementation of the OpenCL compiler, or LLVM has some OpenCL frontend/parser.
                    Intel is not an OpenCL supporter.


                    • #20
                      Actually, I think OpenCL has a lower-level API than CUDA, probably because of the large variances in vector unit designs.
                      At this point in time, everybody is experimenting with how to interface vector units.
                      Because the ATI 2000-5000 series cards have a 4 simple + 1 complex unit design, some code will run much faster than other code, depending on the types of instructions you use. On Nvidia hardware, by contrast, you have a single complex unit that you stream data into, so their hardware runs much the same regardless of the instruction order you actually use, etc...
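
A toy model can show why instruction mix matters so much on a 5-wide design. The sketch below greedily packs operations into bundles of up to five, where an operation may only be issued once everything it depends on has been issued in an earlier bundle. It is purely illustrative: it treats all five slots as identical (ignoring the simple/complex distinction), ignores latencies and register limits, and `pack_bundles` is a made-up helper, not how any real shader compiler works.

```python
def pack_bundles(deps, width=5):
    """Greedily group ops into bundles of at most `width` independent ops.

    deps[i] is the set of op indices that op i depends on. An op is ready
    once all of its dependencies were issued in earlier bundles.
    """
    remaining = set(range(len(deps)))
    done = set()
    bundles = []
    while remaining:
        ready = sorted(i for i in remaining if deps[i] <= done)
        if not ready:
            raise ValueError("cyclic dependencies")
        bundle = ready[:width]
        bundles.append(bundle)
        done.update(bundle)
        remaining.difference_update(bundle)
    return bundles


def utilization(deps, width=5):
    """Fraction of issue slots that hold real work."""
    bundles = pack_bundles(deps, width)
    return len(deps) / (len(bundles) * width)


# Ten independent ops (e.g. per-component vector math) versus a ten-op serial chain.
independent = [set() for _ in range(10)]
chain = [set()] + [{i - 1} for i in range(1, 10)]

print(f"independent ops: {utilization(independent):.0%} of slots used")
print(f"dependent chain: {utilization(chain):.0%} of slots used")
```

In this model the same number of operations takes two bundles when they are independent and ten bundles when each one waits on the previous result, which is the kind of swing the post attributes to the 4+1 design.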

                      The ATI design was optimised for rendering, and the symmetry that you require to get consistently good compute performance was not considered. I suspect that the bits for compute performance were added later (considering OpenCL only works on 4000+ cards).
                      The new 4D arch is symmetrical, and I assume that its performance will be much more consistent, since you only have to reorganize instructions to prevent stalls, almost identically to how a regular CPU works.

                      Meaning that I think we will see much more consistent performance from the new ATI arch.

                      So it is not so much that OpenCL is flawed, but more that performance varies so much because of the widely varied and specialized vector units out there.

                      Hmmm, what if Fusion parts also use the 4D symmetrical vector units? That would make even Ontario competitive in OpenCL applications...

                      I guess I'm asking too much :P