NVIDIA vs. AMD OpenCL Linux Benchmarks With Darktable 2.2


  • #11
    Originally posted by bridgman View Post
    In case you missed my response to your previous post, we switched focus from working on the Mesa/Clover OpenCL to opening up our closed source OpenCL stack some months back, so testing AMDGPU-PRO is the right thing to do even for the open stack.
    Great to hear that! Way to go! I have switched to openSUSE Tumbleweed, and I couldn't manage to get OpenCL working with my 260X. Though I don't need it at the moment.

    Comment


    • #12
      Are these tests only FP32? If so, FP64 benchmarks would be interesting. I would expect the reverse situation there: Kepler is not bad at FP64, and AMD is usually much better. Thanks for these tests though, they are really interesting. Now let's wait for AMD's new lineup coming in 2017.

      Comment


      • #13
        Originally posted by dungeon View Post

        He is using Debian, where amdgpu-pro isn't available and the open-source driver's OpenCL is unusable... since some of the focus has been shifted around to basically nothing.

        And the benchmarks don't tell him that, so in practice he gets nothing out of them. From that point of view, what he misses isn't an answer... so he misses nothing, double or even triple nothing.
        What about ROCm OpenCL? I've tried it out, but I haven't compared it to the AMDGPU-PRO drivers. Is it on par?

        Comment


        • #14
          If OpenCL were a priority for me, AMD would seem like a great choice. The 470 is pretty competitive, and you could buy three of them for the price of a single 1080. Depending on what and how you intend to process your data, you wouldn't have to worry about anything like CrossFire, so you'd get a pretty fast system.

          Comment


          • #15
            Originally posted by L_A_G View Post
            The GTX 680, 760, 780 TI, 950, 960, 970 and 980 all finishing within margin of error in the Masskrug test is pretty intriguing. What's going on there? It looks like they're all limited by some hardware resource other than the traditional available compute units or memory bandwidth.

            My guess would be that threads are stalling because the special function units (which in CUDA-based GPUs are separate from the CUDA cores and much fewer in number) are being used to capacity, so threads have to wait to use them. If that's the case, then there's probably some optimization work that could be done for much improved performance, as a GPU with 2048 cores and a 256-bit-wide memory interface should not perform within the margin of error of a card with 768 of the same cores and a 128-bit-wide memory interface, with both running at roughly the same clock rate.
            Agreed. There is obviously an issue here, but it is impossible to tell if it is within Darktable, the OpenCL API or the driver. The benchmark is evidence of a problem, but that's about it. Titling it "Nvidia vs. AMD" was premature when a 780 Ti outperforms a 980, and both are being outperformed by a factor of 3 by a 1060.

            Comment


            • #16
              Originally posted by sdack View Post
              Agreed. There is obviously an issue here, but it is impossible to tell if it is within Darktable, the OpenCL API or the driver. The benchmark is evidence of a problem, but that's about it. Titling it "Nvidia vs. AMD" was premature when a 780 Ti outperforms a 980, and both are being outperformed by a factor of 3 by a 1060.
              I'd argue that it most probably is a case of the special function units reaching capacity. They aren't updated all that often, and on the same architecture cards tend to have the same number of them regardless of how many CUDA cores there are and how wide the memory bus is.

              Last year I had a job at my university where I got to work on optimizing an HPC application using CUDA. After making some good optimizations, I found that I was getting roughly the same performance on both a GTX 970 and a 680. I scratched my head for a while until my professor advised me to avoid using modulo (%) so much, because it's apparently pretty expensive on GPUs. I had been using modulo for threads to figure out which part of the compute job was their task, and after changing it so that the same effect was achieved with the regular arithmetic operations of division, multiplication and subtraction (roughly the kind of rewrite sketched below), I found that the time to do one pretty big job had been cut by about half on the GTX 970 and by about a quarter on the 680.

              I'm not 100% sure my assessment of what happened was correct, but my guess would be that I was running up against the maximum capacity of the special function units, as CUDA cores don't seem to be capable of doing modulo in hardware and the compiler doesn't seem to know how to, or just won't, transform those modulo operations into regular arithmetic operations.
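
              For anyone curious what that kind of change looks like, below is a minimal sketch of the rewrite described above. The kernel names, buffer layout and runtime width parameter are all made up for illustration; it is not the code from that job, just the a % b == a - (a / b) * b transformation applied to a thread-index calculation.

              // Hypothetical example: map a flat thread index onto rows/columns of a
              // buffer whose width is only known at run time, first with %, then with
              // the divide/multiply/subtract equivalent.
              #include <cuda_runtime.h>

              // Version using the modulo operator. Integer % has no single hardware
              // instruction on the GPU and is expanded into a longer emulation sequence.
              __global__ void scale_with_mod(float *data, int n, int width, float factor)
              {
                  int i = blockIdx.x * blockDim.x + threadIdx.x;
                  if (i >= n) return;
                  int row = i / width;
                  int col = i % width;               // integer remainder
                  data[row * width + col] *= factor;
              }

              // Rewritten version: reuse the quotient and get the remainder with a
              // multiply and a subtract instead of a second % operation.
              __global__ void scale_without_mod(float *data, int n, int width, float factor)
              {
                  int i = blockIdx.x * blockDim.x + threadIdx.x;
                  if (i >= n) return;
                  int row = i / width;
                  int col = i - row * width;         // same value as i % width
                  data[row * width + col] *= factor;
              }

              int main()
              {
                  const int width = 1000, n = 1000 * width;
                  float *d = nullptr;
                  cudaMalloc((void **)&d, n * sizeof(float));
                  cudaMemset(d, 0, n * sizeof(float));

                  int block = 256, grid = (n + block - 1) / block;
                  scale_with_mod<<<grid, block>>>(d, n, width, 2.0f);
                  scale_without_mod<<<grid, block>>>(d, n, width, 2.0f);
                  cudaDeviceSynchronize();

                  cudaFree(d);
                  return 0;
              }

              Whether this exact rewrite buys anything depends on what the compiler already does with the paired / and %, since it can often derive one from the other; treat it as an illustration of the transformation rather than a guaranteed optimization.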

              Comment
