Blender's AMDGPU-PRO OpenCL Performance Is Crazy Slow Compared To NVIDIA CUDA


  • #11
    Originally posted by arjan_intel View Post
    I wonder if the "OpenCL" mode actually uses OpenCL, or if it's really just using CPU fallbacks. I've seen (while trying to test beignet) Blender binaries floating around that got this wrong
    This mirrors my experience trying to get OpenCL working on the Blender test in PTS: no GPU activity and a fully loaded CPU.

    Comment


    • #12
      Cycles is not the best program to benchmark OpenCL, the wiki page says so: https://wiki.blender.org/index.php/OpenCL For a true OpenCL benchmark, try LuxRender's LuxMark; there you will see AMD outperform Nvidia.

      Comment


      • #13
        I'm waiting for the test to be actually run correctly. It's rather obvious from the results that the Intel CPU is absolute garbage at OpenCL. Not surprising since it's not designed for it. The FX-8350, on the other hand, is OpenCL 1.2 ready and will do much more than the Intel.

        Comment


        • #14
          These results are rather strange: my [email protected] desktop was able to render bmw27 in around 3:20, and a Core i7-3720hq laptop did it in 5:10, both under Linux. Under Windows 10, the desktop took ~8 minutes for the bmw27 render (no idea why).
          Since I'm on an R9 280X, I've only tried OpenCL rendering under Windows, and the render was completely incorrect, so it doesn't matter how fast or slow it was.

          Comment


          • #15
            Originally posted by Michael View Post

            With NVIDIA hardware, only CUDA is exposed.
            As I said in the last article, you can use OpenCL on Nvidia cards using the --debug-value 256 flag and the Megakernel option.

            Also, three things regarding timing:
            - When rendering for the first time, Cycles compiles the OpenCL kernels, which can take a few minutes. Are you sure that you didn't include that time (Blender does count it as render time)? For accurate results, you have to render twice and record the second time.
            - The claim that the OpenCL backend is somehow worse doesn't really make sense, since the CUDA and OpenCL backends share 95% of their code.
            - If there is a difference between Cycles and other benchmarks, it's most likely due to the huge size of the Cycles kernel - typical GPGPU workloads have much smaller kernels. Therefore, issues like register spilling and branch divergence become relevant and can affect performance in unexpected ways.

            Comment


            • #16
              Originally posted by lukasstockner97 View Post

              As I said in the last article, you can use OpenCL on Nvidia cards using the --debug-value 256 flag and the Megakernel option.

              Also, three things regarding timing:
              - When rendering for the first time, Cycles compiles the OpenCL kernels, which can take a few minutes. Are you sure that you didn't include that time (Blender does count it as render time)? For accurate results, you have to render twice and record the second time.
              - The claim that the OpenCL backend is somehow worse doesn't really make sense, since the CUDA and OpenCL backends share 95% of their code.
              - If there is a difference between Cycles and other benchmarks, it's most likely due to the huge size of the Cycles kernel - typical GPGPU workloads have much smaller kernels. Therefore, issues like register spilling and branch divergence become relevant and can affect performance in unexpected ways.
              Doesn't that seem like a loud and clear example of doing it the wrong way? It sure does to me. You admit that Cycles is the only one that implements such a huge kernel. And you admit that Cycles' first run can't be trusted... It seems obviously wrong.

              Comment


              • #17
                Originally posted by duby229 View Post

                Doesn't that seem like a loud and clear example of doing it the wrong way? It sure does to me. You admit that Cycles is the only one that implements such a huge kernel. And you admit that Cycles' first run can't be trusted... It seems obviously wrong.
                Regarding the first run: Well, at some point you have to compile the kernel, that's just how OpenCL works. And you obviously have to do so before rendering for the first time, so there aren't many options left...
                As for the huge kernel: Yes, multiple smaller ones would be better. That's why Cycles has the Split Kernel for OpenCL as well as the Megakernel.
                Generally, though, renderers are complex. For typical scientific workloads, you may just have some matrix operations or a convolution on a large grid, which can be done in a few lines.
                For a full rendering engine, however, you need a lot more code. The Split Kernel is a step in that direction, but the individual pieces are still fairly large.
                So, while smaller kernels would indeed be better, doing it "right" is easier said than done.

                Comment


                • #18
                  Originally posted by lukasstockner97 View Post
                  Regarding the first run: Well, at some point you have to compile the kernel, that's just how OpenCL works. And you obviously have to do so before rendering for the first time, so there aren't many options left...
                  As for the huge kernel: Yes, multiple smaller ones would be better. That's why Cycles has the Split Kernel for OpenCL as well as the Megakernel.
                  Generally, though, renderers are complex. For typical scientific workloads, you may just have some matrix operations or a convolution on a large grid, which can be done in a few lines.
                  For a full rendering engine, however, you need a lot more code. The Split Kernel is a step in that direction, but the individual pieces are still fairly large.
                  So, while smaller kernels would indeed be better, doing it "right" is easier said than done.
                  I'm no expert, so please take this knowing that you are light-years ahead of me in understanding this stuff.

                  EDIT: Even from way back, I've always been under the impression that GPUs can't really perform linear jobs; they absolutely need to be programmed using a KIS principle. They -need- to do one thing, with as much parallelism as possible. If there is more than one thing that needs doing, then those things -need- to be done one at a time.

                  Last edited by duby229; 17 July 2016, 08:36 AM.

                  Comment


                  • #19
                    Originally posted by duby229 View Post

                    EDIT: Even from way back, I've always been under the impression that GPUs can't really perform linear jobs; they absolutely need to be programmed using a KIS principle. They -need- to do one thing, with as much parallelism as possible. If there is more than one thing that needs doing, then those things -need- to be done one at a time.
                    Pretty much, yes. To be more specific: due to the architecture of GPUs, each core processes 32 (for Nvidia) or 64 (for AMD) threads at a time, always executing the same instruction across those 32/64 lanes. Therefore, you're in trouble when you have lots of branching - for example, with an if statement, both sides will be executed one after the other. If you have a loop, it runs until every thread is done. That's why the workload should be as straightforward and parallel as possible.
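                    That lockstep behavior can be sketched with a toy model in Python (the warp size, cost numbers, and function name here are purely illustrative, not Cycles code):

```python
# Toy SIMT model: a "warp" of lanes executes in lockstep, so an
# if/else whose lanes disagree costs BOTH paths, not just the longer one.

def warp_branch_cost(predicates, cost_if, cost_else):
    """Cycles a lockstep warp spends on an if/else.

    predicates: one bool per lane (True -> the lane takes the 'if' side).
    A path is only skipped if *no* lane in the warp needs it.
    """
    cost = 0
    if any(predicates):          # at least one lane takes the 'if' side
        cost += cost_if
    if not all(predicates):      # at least one lane takes the 'else' side
        cost += cost_else
    return cost

# Uniform warp: all 8 lanes agree, so only one path is executed.
print(warp_branch_cost([True] * 8, cost_if=10, cost_else=40))   # 10

# Divergent warp: a single disagreeing lane forces both paths to run serially.
print(warp_branch_cost([True] * 7 + [False], 10, 40))           # 50
```

                    One divergent lane is enough to make the whole warp pay for both sides of the branch, which is exactly why a large, branchy kernel like the Megakernel is hard on GPUs.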

                    Comment


                    • #20
                      Originally posted by lukasstockner97 View Post
                      The Split kernel is a step, but the individual pieces are still fairly large.
                      So, while smaller kernels would indeed be better, doing it "right" is easier said than done.
                      But kernels need to load data from main memory and then write the result after computation. So having too many small kernels can make your code memory-bound. I did not look at Cycles kernels but having large kernels is fine as long as there is no register spilling and there are enough threads in flight to hide memory latency. This of course is hard to get right as it can vary with GPU architectures...
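                      The memory-traffic trade-off can be estimated with a toy byte-counting model (element counts, kernel counts, and function names are illustrative assumptions, not measurements of Cycles):

```python
# Toy byte-traffic model: chaining many small kernels forces every
# intermediate result through main memory, while one fused kernel
# reads its inputs and writes its outputs only once.

def traffic_fused(n_elems, bytes_per=4):
    # One kernel: read the input once, write the output once.
    return 2 * n_elems * bytes_per

def traffic_split(n_elems, n_kernels, bytes_per=4):
    # Each small kernel reads its input and writes its output, so
    # every intermediate makes a full round trip to main memory.
    return 2 * n_elems * bytes_per * n_kernels

n = 1_000_000  # e.g. one float per pixel of a ~1 MP render
print(traffic_fused(n))      # 8000000 bytes
print(traffic_split(n, 5))   # 40000000 bytes: 5x the memory traffic
```

                      Under this (simplistic) model, splitting one kernel into five multiplies the main-memory traffic by five, which is the memory-bound risk described above; real hardware softens this with caches, but the direction of the trade-off holds.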

                      Comment
