Blender's AMDGPU-PRO OpenCL Performance Is Crazy Slow Compared To NVIDIA CUDA
Cycles is not the best program to benchmark OpenCL; the wiki page says so: https://wiki.blender.org/index.php/OpenCL For a true OpenCL benchmark, try LuxRender's LuxMark; there you will see AMD outperform Nvidia.
These results are rather strange: my [email protected] desktop was able to render bmw27 in around 3:20, and a Core i7-3720HQ laptop did it in 5:10, both under Linux. Under Windows 10, the desktop took ~8 minutes for the bmw27 render (I don't know why).
Since I'm on an R9 280X, I've only tried OpenCL rendering under Windows, and the render was completely incorrect, so it doesn't matter how fast or slow it was.
Originally posted by Michael View Post
With NVIDIA hardware, only CUDA is exposed.
As I said in the last article, you can use OpenCL on Nvidia cards using the --debug-value 256 and Megakernel option.
Also, three things regarding timing:
- When rendering for the first time, Cycles compiles the OpenCL kernels, which can take a few minutes. Are you sure that you didn't include that time (Blender does count it as render time)? For accurate results, you have to render twice and record the second time.
- The claim that the OpenCL backend is somehow worse doesn't really make sense, since the CUDA and OpenCL backends share 95% of their code.
- If there is a difference between Cycles and other benchmarks, it's most likely due to the huge size of the Cycles kernel: typical GPGPU workloads have much smaller kernels. Therefore, issues like register spilling and branch divergence become relevant and can affect performance in unexpected ways.
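To picture the first point, here's a minimal Python sketch (purely illustrative, nothing to do with Blender's actual code) of why the first render is slower: the kernel compile happens once and is cached, so only the second run measures rendering alone.

```python
import functools
import time

@functools.lru_cache(maxsize=None)
def compile_kernel(source):
    """Stand-in for the OpenCL kernel build: expensive, but runs only once."""
    time.sleep(0.2)  # pretend compilation takes a while
    return "binary of " + source

def render(scene):
    """Return the wall-clock time of one 'render', including any compile."""
    start = time.perf_counter()
    compile_kernel("cycles_kernel.cl")  # cached after the first call
    # ... the actual rendering work would go here ...
    return time.perf_counter() - start

first = render("bmw27")   # includes the one-off compile
second = render("bmw27")  # compile is cached, so this is the honest number
assert second < first
```

The file name and timings are made up; the point is simply that benchmarking the first run measures the compiler, not the renderer.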
Originally posted by lukasstockner97 View Post
As I said in the last article, you can use OpenCL on Nvidia cards using the --debug-value 256 and Megakernel option.
Also, three things regarding timing:
- When rendering for the first time, Cycles compiles the OpenCL kernels, which can take a few minutes. Are you sure that you didn't include that time (Blender does count it as render time)? For accurate results, you have to render twice and record the second time.
- The claim that the OpenCL backend is somehow worse doesn't really make sense, since the CUDA and OpenCL backends share 95% of their code.
- If there is a difference between Cycles and other benchmarks, it's most likely due to the huge size of the Cycles kernel: typical GPGPU workloads have much smaller kernels. Therefore, issues like register spilling and branch divergence become relevant and can affect performance in unexpected ways.
Doesn't that seem like a loud and clear example of doing it the wrong way? It sure does to me. You admit that Cycles is the only one that implements such a huge kernel. And you admit that Cycles' first run can't be trusted... It seems obviously wrong.
Originally posted by duby229 View Post
Doesn't that seem like a loud and clear example of doing it the wrong way? It sure does to me. You admit that Cycles is the only one that implements such a huge kernel. And you admit that Cycles' first run can't be trusted... It seems obviously wrong.
Regarding the first run: Well, at some point you have to compile the kernel, that's just how OpenCL works. Also, you obviously have to do so before rendering for the first time. So, there aren't many options left...
As for the huge kernel: Yes, multiple smaller ones would be better. That's why Cycles has the Split Kernel for OpenCL as well as the Megakernel.
Generally, though, renderers are complex. For usual scientific workloads, you may just have some matrix operations or convolution on a large grid, which can be done in a few lines.
For a full rendering engine, however, you need a lot more code. The Split Kernel is a step, but the individual pieces are still fairly large.
So, while smaller kernels would indeed be better, doing it "right" is easier said than done.
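A toy Python sketch of the difference (hypothetical stage names, not Cycles' real source): the megakernel runs every stage in one big function, while the split kernel runs each stage as its own small pass over all rays.

```python
def intersect(ray):
    # placeholder for ray/scene intersection
    return ray * 2

def shade(hit):
    # placeholder for shading
    return hit + 1

def megakernel(ray):
    # One huge kernel: every stage lives in a single function, so the GPU
    # must reserve registers for all of it at once.
    hit = intersect(ray)
    return shade(hit)

def split_render(rays):
    # Split kernel: each stage is a separate, smaller pass over a queue of
    # rays, lowering register pressure per stage.
    hits = [intersect(r) for r in rays]   # pass 1: intersection
    return [shade(h) for h in hits]       # pass 2: shading

print(split_render([1, 2, 3]))  # [3, 5, 7]
```

Both layouts compute the same result; they differ only in how the work is partitioned into kernels, which is exactly the register-pressure trade-off being discussed.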
Originally posted by lukasstockner97 View Post
Regarding the first run: Well, at some point you have to compile the kernel, that's just how OpenCL works. Also, you obviously have to do so before rendering for the first time. So, there aren't many options left...
As for the huge kernel: Yes, multiple smaller ones would be better. That's why Cycles has the Split Kernel for OpenCL as well as the Megakernel.
Generally, though, renderers are complex. For usual scientific workloads, you may just have some matrix operations or convolution on a large grid, which can be done in a few lines.
For a full rendering engine, however, you need a lot more code. The Split Kernel is a step, but the individual pieces are still fairly large.
So, while smaller kernels would indeed be better, doing it "right" is easier said than done.
EDIT: Even from way back, I've always been under the impression that GPUs can't really perform linear jobs. They absolutely need to be programmed using a KISS principle: they -need- to do one thing, with as much parallelism as possible. If there is more than one thing that needs doing, then those things -need- to be done one at a time.
Last edited by duby229; 17 July 2016, 08:36 AM.
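That intuition can be sketched with a toy lockstep (SIMD) cost model in Python. This is purely illustrative (no real GPU behaves exactly like this), but it shows why a divergent branch on a GPU-like machine pays for both sides, while a uniform branch pays for only one.

```python
def branch_cost(mask, cost_then, cost_else):
    """Cost of an if/else in lockstep execution: a side runs if ANY lane
    takes it, and lanes taking the other side simply wait (are masked)."""
    cost = 0
    if any(mask):        # some lanes take the then-branch
        cost += cost_then
    if not all(mask):    # some lanes take the else-branch
        cost += cost_else
    return cost

uniform = branch_cost([True] * 8, 5, 7)                  # all lanes agree
divergent = branch_cost([True] * 4 + [False] * 4, 5, 7)  # lanes disagree
print(uniform, divergent)  # 5 12
```

With all eight lanes agreeing, only one side executes (cost 5); with a split, both sides execute back to back (cost 12), which is the "one thing at a time" behavior described above.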