OpenCL on Vega: libamdoclsc64.so not present / Memory access fault by GPU node-1

  • OpenCL on Vega: libamdoclsc64.so not present / Memory access fault by GPU node-1

    I've been trying to get my Vega card running on either Ubuntu 16.04.3 or Debian stretch. I've tried both AMDGPU-Pro (latest 17.30) and ROCm (1.6.180), but in both cases it's a complete failure:

    - The 17.30 driver seems to start fine (clinfo connects to DRI and does a bunch of ioctls) but at some point decides it absolutely needs "amdoclsc64" or "libamdoclsc64.so", which does not exist. This results in "Number of platforms 0", or in "terminate called after throwing an instance of 'cl::Error' what(): clGetPlatformIDs aborted" when using AMD's clinfo. (A minimal enumeration check appears after this list.)
    - ROCm seems better at first (kernel is 4.11.0-kfd-compute-rocm-rel-1.6-180): clinfo works, but when I start to use a real OpenCL application, in this case luxmark 3.1, I get: "Memory access fault by GPU node-1 on address 0x111a205000. Reason: Page not present or supervisor privilege." luxmark works fine on Polaris with 17.10, so I doubt it is at fault. I did set OPENCL_ROOT to /opt/rocm/opencl.
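
    For reference, here is the kind of minimal platform check I mean: plain OpenCL C API, essentially the first thing clinfo does. This is only a sketch (error handling is trimmed, the file name and the 16-platform cap are arbitrary); build with something like gcc check_platforms.c -o check_platforms -lOpenCL:

    /* Minimal ICD-loader sanity check: does clGetPlatformIDs find anything? */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_uint n = 0;
        cl_int err = clGetPlatformIDs(0, NULL, &n);
        if (err != CL_SUCCESS || n == 0) {
            /* This is the "Number of platforms 0" case: no ICD was resolved. */
            printf("clGetPlatformIDs: err=%d, platforms=%u\n", err, n);
            return 1;
        }
        if (n > 16) n = 16;  /* arbitrary cap for this sketch */

        cl_platform_id ids[16];
        clGetPlatformIDs(n, ids, NULL);
        for (cl_uint i = 0; i < n; i++) {
            char name[256] = "", vendor[256] = "";
            clGetPlatformInfo(ids[i], CL_PLATFORM_NAME, sizeof name, name, NULL);
            clGetPlatformInfo(ids[i], CL_PLATFORM_VENDOR, sizeof vendor, vendor, NULL);
            printf("platform %u: %s (%s)\n", i, name, vendor);
        }
        return 0;
    }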

    OTOH OpenGL video acceleration with 17.30 seems to be working fine.

    I very much appreciate the fact that AMD has open-sourced their driver, which is why I put my money into a Vega card, but at this point I'm a bit desperate. Has anyone here been able to get OpenCL working with their Vega card? Please share your experience.

  • #2
    Did you install the optional ROCm packages (near the end of the install instructions) and set LLVM_BIN etc.? Without them you will get symptoms like what you are seeing...

    • #3
      Originally posted by bridgman:
      Did you install the optional ROCm packages (near the end of the install instructions) and set LLVM_BIN etc.? Without them you will get symptoms like what you are seeing...
      Hi bridgman,

      Thank you for the reply. I did follow the ROCm quick start guide, and in addition to the "rocm" meta-package I installed two packages: rocm-opencl and rocm-opencl-dev. Right now I have installed: linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-180 linux-image-4.11.0-kfd-compute-rocm-rel-1.6-180 rocm rocm-dev rocm-device-libs rocm-opencl rocm-opencl-dev rocm-profiler rocm-smi rocm-utils

      Are you sure about LLVM_BIN? I googled it and could only find references to amdgpu-pro (with people setting it to /opt/amdgpu-pro/bin, which oddly contains only clinfo). I tried setting LLVM_BIN to /opt/rocm/hcc-1.0/bin, but it made no difference.

      When running luxmark for the first time, I noticed the "compiling kernels" phase takes a little bit of time, so I'm assuming LLVM is found and the kernel is compiled. When running it again, it crashes much faster: I imagine the kernel was already in a compilation cache somewhere.

      Right now I'm not sure what to do. If you need me to run additional tests with debug variables set up (to get a more complete trace output), please do not hesitate to let me know.
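
      In the meantime, here is roughly the kind of minimal, luxmark-independent test I have in mind: build a trivial kernel and launch it once. If even this triggers the "Memory access fault by GPU node-1" message, the runtime is clearly at fault. A sketch only: standard OpenCL 1.2 calls, error handling mostly omitted, and the file and kernel names ("tiny_kernel.c", "twice") are made up for illustration.

      /* tiny_kernel.c - build: gcc tiny_kernel.c -o tiny_kernel -lOpenCL */
      #include <stdio.h>
      #include <CL/cl.h>

      /* Trivial kernel: double each element in place. */
      static const char *src =
          "__kernel void twice(__global float *buf) {\n"
          "    size_t i = get_global_id(0);\n"
          "    buf[i] = buf[i] * 2.0f;\n"
          "}\n";

      int main(void)
      {
          cl_platform_id plat;
          cl_device_id dev;
          cl_int err;
          clGetPlatformIDs(1, &plat, NULL);
          clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

          cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
          cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

          /* This is the "compiling kernels" step (what a kernel cache would skip). */
          cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
          err = clBuildProgram(prog, 1, &dev, "", NULL, NULL);
          cl_kernel k = clCreateKernel(prog, "twice", &err);

          float host[1024];
          for (int i = 0; i < 1024; i++) host[i] = (float)i;
          cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                      sizeof host, host, &err);
          clSetKernelArg(k, 0, sizeof buf, &buf);

          /* If the runtime is broken, the fault should show up around here. */
          size_t global = 1024;
          err = clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
          err = clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof host, host, 0, NULL, NULL);
          clFinish(q);
          printf("host[7] = %f (expected 14.0)\n", host[7]);

          clReleaseMemObject(buf); clReleaseKernel(k); clReleaseProgram(prog);
          clReleaseCommandQueue(q); clReleaseContext(ctx);
          return 0;
      }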

      • #4
        So moments after I wrote the above post I found the solution: simply install the meta-package "rocm-amdgpu-pro", which is provided by AMDGPU-Pro 17.30. With the vendor ICD amdocl-rocr64.icd I have working OpenCL (amdocl64.icd must be removed).

        However, you were right about LLVM_BIN: it has to point to the correct location of LLVM. It can point either to /opt/amdgpu-pro/bin (after installing LLVM 5.0-465504 from AMDGPU-Pro 17.30) or to /opt/rocm/hcc-1.0/bin. It's worth noting that the performance I get is not much different whether I use LLVM from 17.30 or from repo.radeon.com: in both cases the kernel runs not much faster than on an RX 470, so I guess there is some room for optimization.
        Last edited by cde1; 23 October 2017, 06:00 AM.

        • #5
          Originally posted by bridgman:
          Did you install the optional ROCm packages (near the end of the install instructions) and set LLVM_BIN etc.? Without them you will get symptoms like what you are seeing...
          Hi again,

          Strangely, amdocl-rocr64.icd (required for Vega) will not work with Polaris. bridgman, is there a particular reason Vega needs amdocl-rocr64.icd?

          • #6
            My initial response was based on my (incorrect) understanding that you were using AMDGPU-PRO 17.30 kernel etc. rather than the ROC kernel.

            I don't have the package names in my head to know which come from AMDGPU-PRO and which come from ROCm, but what I can say is that on AMDGPU-PRO we use ROCm paths for OpenCL on Vega while we use the older graphics-driver based path on Polaris.

            • #7
              Originally posted by bridgman:
              My initial response was based on my (incorrect) understanding that you were using AMDGPU-PRO 17.30 kernel etc. rather than the ROC kernel.

              I don't have the package names in my head to know which come from AMDGPU-PRO and which come from ROCm, but what I can say is that on AMDGPU-PRO we use ROCm paths for OpenCL on Vega while we use the older graphics-driver based path on Polaris.
              Thanks bridgman. As it turns out, I've been able to figure out a solution that brings the best performance:

              The problem was that rocm-opencl provides libamdocl64.so, but amdgpu-pro also provides a libamdocl64.so that does not work for Vega. amdocl-rocr64.icd seems to work but brings low performance.

              So I switched back to amdocl64.icd, but this time removed amdgpu-pro-i386.conf and amdgpu-pro-x86_64.conf from /etc/ld.so.conf.d and instead created "rocm-opencl.conf" containing "/opt/rocm/opencl/lib/x86_64". This way the pure ROCm 1.6 stack is used, and it provides, as expected, around twice the performance of a couple of RX 470s.
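
              For anyone reproducing this, a quick way to confirm which runtime the ICD loader actually picked after editing ld.so.conf.d is to print the platform and driver version strings for the first GPU. These are standard clGetPlatformInfo/clGetDeviceInfo queries; the exact strings each runtime reports vary by release, so treat the output as a fingerprint rather than gospel. A minimal sketch (file name is illustrative):

              /* which_runtime.c - build: gcc which_runtime.c -o which_runtime -lOpenCL */
              #include <stdio.h>
              #include <CL/cl.h>

              int main(void)
              {
                  cl_platform_id plat;
                  cl_device_id dev;
                  char pver[256] = "", dname[256] = "", dver[256] = "";

                  if (clGetPlatformIDs(1, &plat, NULL) != CL_SUCCESS) {
                      puts("no OpenCL platform found");
                      return 1;
                  }
                  clGetPlatformInfo(plat, CL_PLATFORM_VERSION, sizeof pver, pver, NULL);
                  clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
                  clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof dname, dname, NULL);
                  clGetDeviceInfo(dev, CL_DRIVER_VERSION, sizeof dver, dver, NULL);

                  /* The reported driver string differs between the ROCm and PRO runtimes. */
                  printf("platform: %s\ndevice:   %s\ndriver:   %s\n", pver, dname, dver);
                  return 0;
              }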

              By the way, the kernel "linux-image-4.11.0-kfd-compute-rocm-rel-1.6-180" was not necessary; the amdgpu.ko module compiled from 17.30 worked fine with the 1.6 ROCm stack.

              I hope things will eventually be smoothed out and fully integrated into a distro. Still, I understand there are growing pains, which is normal for a project of this size.

              • #8
                Originally posted by bridgman:
                My initial response was based on my (incorrect) understanding that you were using AMDGPU-PRO 17.30 kernel etc. rather than the ROC kernel.

                I don't have the package names in my head to know which come from AMDGPU-PRO and which come from ROCm, but what I can say is that on AMDGPU-PRO we use ROCm paths for OpenCL on Vega while we use the older graphics-driver based path on Polaris.
                By the way, do you think Wattman will be available on Linux at some point? The "powersave mode" on Windows is really nice, as it reduces power consumption from ~380 W to 300 W (at the wall) with no apparent degradation in performance (and thus better thermals / less fan noise).

                • #9
                  Hi, I have a slightly (not really) related question.

                  What is the status of mainlining the KFD bits for ROCm? I would like to test ROCm on my Vega, but I am already using a custom 4.13 kernel with DC (for obvious display-related reasons), and I am not sure whether that is enough for ROCm and, if not, what needs to be added. Ideally, the 4.15 kernel with the finally-mainlined DC code would cover both cases (one can dream).

                  • #10
                    The KFD code for dGPUs is on its way upstream, but we are past the cutoff point for 4.15, so it doesn't seem likely to make it.
