LCZero Chess Engine Performance With OpenCL vs. CUDA + cuDNN vs. FP16 With Tensor Cores


  • LCZero Chess Engine Performance With OpenCL vs. CUDA + cuDNN vs. FP16 With Tensor Cores

    Phoronix: LCZero Chess Engine Performance With OpenCL vs. CUDA + cuDNN vs. FP16 With Tensor Cores

    A Phoronix reader pointed out LCZero (Leela Chess Zero) a few days ago as an interesting chess engine that is powered by neural networks and supports BLAS, OpenCL, and NVIDIA CUDA+cuDNN back-ends. Particularly with the FP16 cuDNN support, this chess engine can be super fast on NVIDIA's latest Turing GPUs with tensor cores...

    http://www.phoronix.com/scan.php?pag...DIA-Benchmarks

  • #2
    Is there an intrinsic reason why OpenCL was so much slower?

    Some possibilities I have considered:

    * OpenCL is slower on this NVIDIA hardware because less effort has gone into optimizing the drivers, firmware & software
    * OpenCL is slower on this hardware because the OpenCL standard itself is less able to take advantage of the hardware features than CUDA is
    * OpenCL is slower on this hardware because NVIDIA have deliberately limited it in order to push their proprietary CUDA standard



    • #3
      Originally posted by cybertraveler View Post
      Is there an intrinsic reason why OpenCL was so much slower?

      Some possibilities I have considered:

      * OpenCL is slower on this NVIDIA hardware because less effort has gone into optimizing the drivers, firmware & software
      * OpenCL is slower on this hardware because the OpenCL standard itself is less able to take advantage of the hardware features than CUDA is
      * OpenCL is slower on this hardware because NVIDIA have deliberately limited it in order to push their proprietary CUDA standard
      I don't have a solid answer due to just getting started with lczero, but the upcoming OpenCL NVIDIA vs. AMD tests should shed some light on the lczero CL compute potential.
      Michael Larabel
      http://www.michaellarabel.com/



      • #4
        Originally posted by cybertraveler View Post
        Is there an intrinsic reason why OpenCL was so much slower?

        Some possibilities I have considered:

        * OpenCL is slower on this NVIDIA hardware because less effort has gone into optimizing the drivers, firmware & software
        * OpenCL is slower on this hardware because the OpenCL standard itself is less able to take advantage of the hardware features than CUDA is
        * OpenCL is slower on this hardware because NVIDIA have deliberately limited it in order to push their proprietary CUDA standard
        I was wondering the same thing - it's pretty weird for a 2060 in CUDA to outperform an RTX Titan in OpenCL. But I think there's a 4th possibility:
        There's not enough test data in the neural network for OpenCL.



        • #5
          Originally posted by schmidtbag View Post

          I was wondering the same thing - it's pretty weird for a 2060 in CUDA to outperform an RTX Titan in OpenCL. But I think there's a 4th possibility:
          There's not enough test data in the neural network for OpenCL.
          The same dataset was fed to both CL and CUDA.
          Michael Larabel
          http://www.michaellarabel.com/



          • #6
            Again, nothing to do with tensor cores; the 2000 series does have 2x FP16 performance.



            • #7
              It is well known that CUDA performs better than OpenCL on NVIDIA hardware, but 100% better is more than usual.



              • #8
                lc0 doesn't scale well with a high thread count; 2 or 3 threads is the best setting for the fastest speed, at least for now. Also keep in mind that different net files can yield different speeds, so it's best to keep using the same one for benchmarking.
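                For reference, the thread-count advice above can be tried directly from the command line. This is just a sketch assuming an lc0 binary is already built and on the PATH; check lc0's own help output for the exact flag names in your build:

                ```shell
                # Sketch: compare benchmark speed at 2 vs. 3 search threads
                # (assumes lc0 is on the PATH; run with the same network file both times
                #  so the node counts are comparable).
                lc0 benchmark --threads=2
                lc0 benchmark --threads=3
                ```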



                • #9
                  Great to see lc0 in some benchmarks. It is, BTW, one of the strongest chess engines in the world.
                  It would be interesting to also see numbers for Windows.
                  I get 1100 on an NVIDIA M1200 (CUDA), and around 2500 on a Vega 56.
                  The latter result seems low to me.
                  My machine learning colleagues all prefer NVIDIA; it seems like support for TensorFlow is better.



                  • #10
                    Thanks for the benchmarks.

                    The main reason it runs much better in cuda/cudnn mode is that it uses the cuDNN library for convolutions, which is optimized by NVIDIA. The OpenCL mode uses hand-written kernels by gcp (originally written for the Leela (Go) Zero project).

                    A few notes:

                    1. From the results it seems the benchmark was run in default mode (e.g. lc0 benchmark, or lc0 benchmark --backend=cudnn-fp16), which is good and works well across all hardware. However, for very fast GPUs (like the Titan RTX with FP16), the test finishes too soon and doesn't have enough work to fully utilize the GPU (I think that's the reason for the smaller gap between the 2080 Ti and Titan RTX in FP16, compared to OpenCL and cuDNN FP32).
                    To run it longer, you can use the --nodes option (unfortunately undocumented right now), e.g.:
                    lc0 benchmark --backend=cudnn-fp16 --nodes=1000000
                    (this should get ~40knps on a Titan RTX).

                    2. As the NN used by lc0 is still learning (and making progress), the network ID to use is a moving target. It's better to stick to one network ID for all benchmarks to make them comparable, as the NPS will vary with the network used (the network can change the shape of the search tree, which affects NPS). Right now, recent networks from the T30 run are the strongest for game-play (e.g. 32616). The T40 run has just started from scratch and is relatively very weak (but we hope it will catch up and exceed T30 soon). Networks from the T35 run (35xxx-36xxx) are of smaller size and will run much faster but are weaker - unless you are running on very slow hardware at a fast time control.
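                    Pinning one network as suggested above can be done by passing a fixed weights file explicitly. A sketch, assuming lc0 is on the PATH and the network file has already been downloaded; the filename here is illustrative, only the network ID 32616 comes from the note above:

                    ```shell
                    # Sketch: benchmark against one fixed, pre-downloaded network file
                    # so that repeated runs (and runs across GPUs) stay comparable.
                    lc0 benchmark --backend=cudnn-fp16 --weights=weights_32616.txt --nodes=1000000
                    ```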

                    3. lc0 supports multi-GPU setups too. You can mix and match different GPUs - even from different generations or vendors. To use it you need the multiplexing backend, e.g. for two GPUs:
                    ./lc0 benchmark --backend=multiplexing --backend-opts="(gpu=0,backend=cudnn-fp16),(gpu=1,backend=cudnn-fp16)" --threads=4 --nodes=2000000
                    The --threads parameter controls how many CPU threads to use for the search. The default is 2, which is likely the best setting for a single GPU.
                    For multiple GPUs, numGPUs+1 (or +2) is a good setting.
                    A higher number of threads can slow it down (due to synchronization / locks).

                    Right now it doesn't scale beyond ~100knps due to CPU bottlenecks, so 3 x 2080s is probably the fastest configuration for now. Anything faster than that will hit CPU bottlenecks.

