Blender Cycles Render Engine Benchmarks With NVIDIA CUDA On Linux


  • #11
    The CUDA support isn't due to personal preference or something like that. It's just because the same code is roughly 40% slower with OpenCL vs. CUDA on Nvidia cards, and most users don't value open source ideals enough to justify such a slowdown. Also, CUDA kernels can be precompiled, while OpenCL takes about a minute to build them at runtime the first time.

    As for why OpenCL's feature set is smaller in Cycles - well, AMD's drivers have been (and still are, afaik) broken in that regard for a few years now. The Cycles kernel is pretty huge by GPGPU standards, and AMD's driver just refuses to compile it and crashes Blender instead. For about a year now, Cycles has worked on AMD with modifications that split the kernel into multiple smaller ones so the driver doesn't crash, but that split code doesn't support some features yet due to the added complexity.

    For benchmarking, it's possible to use OpenCL on Nvidia with both the split and the full kernel (Nvidia's drivers are more stable in that regard). To do so, run Blender with "--debug-value 256", go to the new "Debug" panel in the Render settings and set the OpenCL kernel type to Split or Mega (= full kernel). Then you can enable OpenCL in the User Preferences, just as it works for CUDA.
    A quick test on my system (GTX 780) takes 13.5 sec with the OpenCL split kernel, 12.4 sec with OpenCL Mega and 9.3 sec with CUDA.
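
    To script that instead of clicking through the UI, something like this should work (a rough sketch against the 2.77 Python API; the Split/Mega choice itself is still made in the Debug panel, and the assignment fails with a TypeError if no usable OpenCL platform is found):
    Code:
    # Rough sketch: select OpenCL as Cycles' compute device from a script.
    # Launch with: blender --debug-value 256 --python enable_opencl.py
    import bpy

    bpy.app.debug_value = 256  # same switch as the --debug-value 256 flag
    bpy.context.user_preferences.system.compute_device_type = 'OPENCL'
    bpy.context.scene.cycles.device = 'GPU'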

    Also, the Pascal results are probably pretty bad currently because the work group size etc. isn't specified in the code for Pascal yet, so it just reuses the Maxwell settings. CUDA 8 also produces slower .cubins than 7.5, so if you used a build that was made with CUDA 7.5 (like the official releases) and added a Pascal kernel that was built with CUDA 8, the difference might just be the compiler.
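
    To illustrate what "reuses the Maxwell settings" means in practice, here is a purely hypothetical sketch (made-up numbers, not Cycles' actual tuning code) of per-architecture launch tuning with a fallback, using pycuda only to read the compute capability:
    Code:
    # Hypothetical per-architecture tuning table with a Maxwell fallback.
    import pycuda.driver as cuda

    TUNED_BLOCK_SIZE = {
        3: (16, 8),    # Kepler  (sm_3x) - values made up for illustration
        5: (16, 16),   # Maxwell (sm_5x)
        # 6: ...       # Pascal  (sm_6x) - no tuned entry yet
    }

    cuda.init()
    major, _minor = cuda.Device(0).compute_capability()
    block = TUNED_BLOCK_SIZE.get(major, TUNED_BLOCK_SIZE[5])  # fall back to Maxwell
    print("Using block size", block, "for sm_%dx" % major)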



    • #12
      Originally posted by lukasstockner97 View Post
      The CUDA support isn't due to personal preference or something like that. It's just because the same code is roughly 40% slower with OpenCL vs. CUDA on Nvidia cards, and most users don't value open source ideals enough to justify such a slowdown. Also, CUDA kernels can be precompiled, while OpenCL takes about a minute to build them at runtime the first time.
      Well, the OpenCL kernel is horribly slow on AMD cards. In LuxRender, AMD cards beat the shit out of NVidia cards... And if they used OpenCL 2.0, they could compile their shaders to SPIR. But they don't want to.

      Originally posted by lukasstockner97 View Post
      As for why OpenCL's feature set is smaller in Cycles - well, AMD's drivers have been (and still are, afaik) broken in that regard for a few years now. The Cycles kernel is pretty huge by GPGPU standards, and AMD's driver just refuses to compile it and crashes Blender instead. For about a year now, Cycles has worked on AMD with modifications that split the kernel into multiple smaller ones so the driver doesn't crash, but that split code doesn't support some features yet due to the added complexity.
      But the CUDA implementation is pretty broken too. The current architecture of the CUDA megakernel makes it hard to add new features. Instead of simply adding them, they have to do a ton of workarounds... if they didn't, the CUDA kernel would eat every byte of VRAM the GPU has. Last time I checked, they had added a fix so that CUDA would not eat 1 GB of VRAM...

      If they put the same amount of time into the OpenCL kernel... well, look at LuxRender.

      Originally posted by lukasstockner97 View Post
      Also, the Pascal results are probably pretty bad currently because the work group size etc. isn't specified in the code for Pascal yet, so it just reuses the Maxwell settings. CUDA 8 also produces slower .cubins than 7.5, so if you used a build that was made with CUDA 7.5 (like the official releases) and added a Pascal kernel that was built with CUDA 8, the difference might just be the compiler.
      The same text comes with every new GPU/CUDA release. Even though NVidia cards are getting faster and faster in gaming and GPGPU in general, the performance in Blender Cycles does not move much. You can double the CUDA cores and double the clock frequency... and in the end you get a 20-30% speed increase...

      But the Blender devs do not want to change anything...



      • #13
        Originally posted by lukasstockner97 View Post
        Also, CUDA kernels can be precompiled, while OpenCL takes about a minute to build them at runtime the first time.
        OpenCL kernels can also be precompiled. Since OpenCL 1.0.
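
        For example, with pyopencl the binary from the first build can be cached and reused on later runs - a minimal sketch (not Cycles code; the binary is specific to the device and driver that produced it):
        Code:
        # Build an OpenCL program once, keep the binary, and rebuild from it later.
        import pyopencl as cl

        KERNEL_SRC = """
        __kernel void scale(__global float *data, const float factor) {
            int i = get_global_id(0);
            data[i] *= factor;
        }
        """

        ctx = cl.create_some_context()
        device = ctx.devices[0]

        # First run: compile from source (the slow step) and grab the binary.
        program = cl.Program(ctx, KERNEL_SRC).build()
        binary = program.get_info(cl.program_info.BINARIES)[0]  # one blob per device

        # Later runs: recreate the program from the cached binary (could live on disk).
        cached = cl.Program(ctx, [device], [binary]).build()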




        • #14
          Originally posted by torsionbar28 View Post
          Hmmm, the test is not working for me, I get the following:

          Code:
          Blender 2.77a:
          pts/blender-1.0.0 [Blend File: BMW27 - Compute: CUDA]
          Test 1 of 4
          Estimated Trial Run Count: 1
          Estimated Test Run-Time: 11 Minutes
          Estimated Time To Completion: 43 Minutes
          Started Run 1 @ 21:48:05
          Traceback (most recent call last):
          File "/home/george2/.phoronix-test-suite/installed-tests/pts/blender-1.0.0/blender-2.77a-linux-glibc211-x86_64/setgpu.py", line 3, in <module>
          bpy.context.user_preferences.system.compute_device_type = 'CUDA'
          TypeError: bpy_struct: item.attr = val: enum "CUDA" not found in ('NONE')
          
          The test run did not produce a result.
          
          Test Results:
          
          Average: 0 Seconds
          This test failed to run properly.
          You don't have CUDA installed?
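
          The quoted setgpu.py line only works when 'CUDA' actually shows up in the compute_device_type enum, i.e. when a usable CUDA installation is detected. A rough, hypothetical sketch of a more defensive version (illustrative only, not the actual pts script):
          Code:
          import bpy

          system = bpy.context.user_preferences.system
          try:
              # 'CUDA' is missing from the enum when no CUDA driver/toolkit is found
              system.compute_device_type = 'CUDA'
          except TypeError:
              raise SystemExit("No CUDA compute device available - is CUDA installed?")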
          Michael Larabel
          https://www.michaellarabel.com/



          • #15
            The issue with OpenCL has always been AMD. The Cycles mega kernel was too big and refused to compile on their driver, so instead of fixing their drivers they decided to try and fix Cycles. (Even Lux, which has a smaller kernel, had issues with AMD's OpenCL implementation.)

            The reason precompiled OpenCL won't work is that the split kernel is meant to compile only the features the scene needs, which changes with every scene (precompiling might work for Nvidia, which can compile the mega kernel).
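
            Roughly speaking, the build options - and therefore the compiled binary - differ from scene to scene, which is what defeats a simple precompiled cache. A purely illustrative sketch (feature names are made up):
            Code:
            # Hypothetical: derive kernel build options from the features a scene uses.
            def kernel_build_options(scene_features):
                all_features = {"hair", "volumes", "subsurface", "motion_blur"}
                unused = sorted(all_features - set(scene_features))
                # compile out what the scene doesn't need, keeping the kernel small
                return " ".join("-D EXCLUDE_%s" % f.upper() for f in unused)

            print(kernel_build_options({"hair"}))             # one scene's options...
            print(kernel_build_options({"hair", "volumes"}))  # ...differ from another's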

            They have been keeping the CUDA features they use to a minimum so that OpenCL isn't left in the dust (the code stays similar for now, with less of a split between the code paths).

            Another problem with using the newer OpenCL version was the lack of support for some of the older cards; the kernel already does not compile on some of the older AMD cards that do support the required OpenCL version.

            Cycles relies on a GPU's computational power, not really on all the other features that make a card good for gaming.

            Cycles does have its issues, but they made a good choice with the license, so more people are getting involved to make it better.



            • #16
              Originally posted by Michael View Post

              You don't have CUDA installed?
              That was it, thanks. When I ran the benchmark, it identified and installed a bunch of prerequisite packages, so I assumed it had everything required.



              • #17
                Originally posted by -MacNuke- View Post

                Well, the OpenCL kernel is horribly slow on AMD cards. In LuxRender, AMD cards beat the shit out of NVidia cards... And if they used OpenCL 2.0, they could compile their shaders to SPIR. But they don't want to.
                That comparison doesn't make any sense at all. First you talk about how CUDA support is bad, then you compare it with software that doesn't even support CUDA at all?
                Most likely the difference is due to Nvidia's bad OpenCL support, not the cards themselves. Also, using OpenCL 2.0 would mean not supporting lots of old hardware, and that's not great either.

                Originally posted by -MacNuke- View Post
                But the CUDA implementation is pretty broken too. The current architecture of the CUDA megakernel makes it hard to add new features. Instead of simply adding them, they have to do a ton of workarounds... if they didn't, the CUDA kernel would eat every byte of VRAM the GPU has. Last time I checked, they had added a fix so that CUDA would not eat 1 GB of VRAM...

                If they put the same amount of time into the OpenCL kernel... well, look at LuxRender.
                Oh please. The same discussion comes up regarding CPU vs. CUDA every month or so, and usually it boils down to: Do you even know how the implementation works?
                All the actual functionality is implemented once in headers and then included into the .cpp files for CPU, .cu files for CUDA and .cl files for OpenCL (of course, using lots of #defines to get the correct keywords etc.). For most features, you don't have to do any extra work to get them on all three platforms. Really, the platform-specific code is 5% at most.

                Originally posted by -MacNuke- View Post
                The same text comes with every new GPU/CUDA release. Even though NVidia cards are getting faster and faster in gaming and GPGPU in general, the performance in Blender Cycles does not move much. You can double the CUDA cores and double the clock frequency... and in the end you get a 20-30% speed increase...

                But the Blender devs do not want to change anything...
                Yes, of course, the Blender devs should just go and fix the CUDA compiler themselves - oh wait, they can't since it's proprietary. Seriously though, why would you blame Blender for problems with the compiler?
                As for GPGPU speedups being smaller than gaming speedups - it makes sense for Nvidia, since sales are decided based on gaming performance. Fast GPGPU on GeForce cards just means fewer Teslas and Quadro FX cards sold. For most GPGPU applications, by the way, the limiting factor is memory bandwidth and latency, not raw computational performance.

                Also, the obligatory end: Blender is open source. Instead of complaining on the forums, why not just go ahead and improve the OpenCL code? It's too complex? Well, guess what, that's maybe why the Split OpenCL kernel doesn't have all features yet - don't forget, Cycles is maintained by like 3 people.



                • #18
                  Originally posted by Michael View Post
                  The blend files used are from: https://code.blender.org/2016/02/new-cycles-benchmark/ and from there these people put their benchmark result files in: https://docs.google.com/spreadsheets...56d095bd#gid=0 That spreadsheet shows the best BMW GPU result as being 3 minutes, 36 seconds for an AMD Fire Pro Duo with split kernel. So I'm not sure how you got less than 60 seconds, or you must have been using a different file...
                  BMW27 (version 2.7) is the same one I used. I can't download those 500 MB quickly with my connection here. Maybe I will try tomorrow.
                  The original source is this: https://blenderartists.org/forum/sho...k-(Updated-BMW)
                  The results spreadsheet they have been using for years is this: https://docs.google.com/spreadsheet/...WhpVS1hZmV3OGc

                  However, as I mentioned, tile size is an important factor. I just tested the BMW27.blend I've got here again. The default setting (240x136 px) renders in ~3 minutes on Hawaii, a single tile (960x540) in 1:30. This is on Windows, obviously. I remember they had regressions in their OpenCL driver (it's also a hot topic for the mining folks), and I'm sure I used to run it in about ~1 min. The card was not overclocked, but it held stable clocks with no downclocking under load.

                  For tile size there are a few important rules:
                  - Use a single tile when possible (enough memory, no performance collapse); otherwise use a few large tiles instead of many small ones. This applies to both NVIDIA and AMD, but is more important on AMD. Nvidia seems to have regressions with smaller sizes, so they "need" to split the image earlier to maintain performance.
                  - Use tile sizes that divide the image evenly, so no offcut is left over. On a 1920x1080 image, a tile size of 800x600 would be bad because you'd get only two full 800x600 tiles and then some really slow ones (320x600 + 2x800x480 + 320x480) - see the small sketch below. This applies to both AMD and NVIDIA.
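
                  To make the offcut point concrete, here's a tiny illustrative Python snippet (not Blender code) that lists the tiles a given resolution splits into for a given tile size:
                  Code:
                  # Illustrative only: which tile sizes does a WxH image split into?
                  def tile_sizes(width, height, tile_w, tile_h):
                      return [(min(tile_w, width - x), min(tile_h, height - y))
                              for y in range(0, height, tile_h)
                              for x in range(0, width, tile_w)]

                  # 1920x1080 with 800x600 tiles: two full tiles plus four slow offcuts
                  print(tile_sizes(1920, 1080, 800, 600))
                  # [(800, 600), (800, 600), (320, 600), (800, 480), (800, 480), (320, 480)]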

                  So I'm not sure how useful it is to use the default size (which is obviously slow) for benchmarks...

