Taking Radeon ROCm 2.0 OpenCL For A Benchmarking Test Drive

  • Taking Radeon ROCm 2.0 OpenCL For A Benchmarking Test Drive

    Phoronix: Taking Radeon ROCm 2.0 OpenCL For A Benchmarking Test Drive

    Last week AMD officially released ROCm 2.0 as the newest major release of the Radeon Open Compute stack. Here are some initial benchmark figures for that Radeon Linux compute component on Polaris and Vega hardware.


  • #2
    Typos:

    Originally posted by phoronix
    For your viewing pleasusre today
    Originally posted by phoronix
    In the single precision test, the RX Vega 56 comes out ahead of the GTX 1080 Ti
    No, it does not.


    • #3
      Thanks Michael, very interesting benchmarks! I've found that raising the HBM frequency can be very beneficial in OpenCL workloads on the Vega cards. For example, if I raise it from 945 MHz to 1100 MHz, LuxMark goes up to almost 37000 on the Vega 64.
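To put the clock change in perspective, here is a minimal sketch of the arithmetic, using only the two frequencies quoted above. `relative_increase` is an illustrative helper, and the sketch says nothing about how much of the uplift a given benchmark actually realizes.

```python
def relative_increase(before: float, after: float) -> float:
    """Return the fractional increase going from `before` to `after`."""
    return (after - before) / before


# HBM clocks quoted in the post, in MHz.
hbm_uplift = relative_increase(945.0, 1100.0)
print(f"HBM clock uplift: {hbm_uplift:.1%}")  # prints "HBM clock uplift: 16.4%"
```

Memory bandwidth scales with the memory clock, so roughly 16% is the most a purely bandwidth-bound workload could gain from this change alone.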


      • #4
        I would be very interested in Blender OpenCL tests.


        • #5
          I see the AMD compute stack has massive potential but is still stuck at the start line.


          • #6
            Originally posted by Aeder
            I see the AMD compute stack has massive potential but is still stuck at the start line.
            Indeed,
            the first part of the work seems to be there already.

            Now PowerPlay tuning is needed,
            because on Linux, AMD cards consume a lot more power than on Windows.
            Latency work and optimizations should of course follow, or be done in parallel.

            But this ROCm version seems to be a good foundation for all the optimization work.


            • #7
              ROCm OpenCL now works on AMD Mobile Raven Ridge (Ryzen 2500U as found on HP Envy x360 Convertible) running on updated Fedora 29 after following the procedure. All previous OpenCL installations from amdgpu-pro were removed prior to that.

              As confirmed by rocminfo:
              Code:
              /opt/rocm/bin/rocminfo  
              =====================     
              HSA System Attributes     
              =====================     
              Runtime Version:         1.1
              System Timestamp Freq.:  1000.000000MHz
              Sig. Max Wait Duration:  18446744073709551615 (number of timestamp)
              Machine Model:           LARGE                               
              System Endianness:       LITTLE                              
               
              ==========                
              HSA Agents                
              ==========                
              *******                   
              Agent 1                   
              *******                   
                Name:                    AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
                Vendor Name:             CPU                                 
                Feature:                 None specified                      
                Profile:                 FULL_PROFILE                        
                Float Round Mode:        NEAR                                
                Max Queue Number:        0                                   
                Queue Min Size:          0                                   
                Queue Max Size:          0                                   
                Queue Type:              MULTI                               
                Node:                    0                                   
                Device Type:             CPU                                 
                Cache Info:               
                  L1:                      32KB                                
                Chip ID:                 5597                                
                Cacheline Size:          64                                  
                Max Clock Frequency (MHz):2000                                
                BDFID:                   768                                 
                Compute Unit:            8                                   
                Features:                None
                Pool Info:                
                  Pool 1                    
                    Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
                    Size:                    16776832KB                          
                    Allocatable:             TRUE                                
                    Alloc Granule:           4KB                                 
                    Alloc Alignment:         4KB                                 
                    Acessible by all:        TRUE                                
                ISA Info:                 
                  N/A                       
              *******                   
              Agent 2                   
              *******                   
                Name:                    gfx902                              
                Vendor Name:             AMD                                 
                Feature:                 KERNEL_DISPATCH                     
                Profile:                 FULL_PROFILE                        
                Float Round Mode:        NEAR                                
                Max Queue Number:        128                                 
                Queue Min Size:          4096                                
                Queue Max Size:          131072                              
                Queue Type:              MULTI                               
                Node:                    0                                   
                Device Type:             GPU                                 
                Cache Info:               
                  L1:                      16KB                                
                Chip ID:                 5597                                
                Cacheline Size:          64                                  
                Max Clock Frequency (MHz):1100                                
                BDFID:                   768                                 
                Compute Unit:            11                                  
                Features:                KERNEL_DISPATCH  
                Fast F16 Operation:      FALSE                               
                Wavefront Size:          64                                  
                Workgroup Max Size:      1024                                
                Workgroup Max Size Per Dimension:
                  Dim[0]:                  67109888                            
                  Dim[1]:                  50332672                            
                  Dim[2]:                  0                                   
                Grid Max Size:           4294967295                          
                Waves Per CU:            160                                 
                Max Work-item Per CU:    10240                               
                Grid Max Size per Dimension:
                  Dim[0]:                  4294967295                          
                  Dim[1]:                  4294967295                          
                  Dim[2]:                  4294967295                          
                Max number Of fbarriers Per Workgroup:32                                  
                Pool Info:                
                  Pool 1                    
                    Segment:                 GROUP                               
                    Size:                    64KB                                
                    Allocatable:             FALSE                               
                    Alloc Granule:           0KB                                 
                    Alloc Alignment:         0KB                                 
                    Acessible by all:        FALSE                               
                ISA Info:                 
                  ISA 1                     
                    Name:                    amdgcn-amd-amdhsa--gfx902+xnack     
                    Machine Models:          HSA_MACHINE_MODEL_LARGE             
                    Profiles:                HSA_PROFILE_BASE                    
                    Default Rounding Mode:   NEAR                                
                    Default Rounding Mode:   NEAR                                
                    Fast f16:                TRUE                                
                    Workgroup Max Dimension:  
                      Dim[0]:                  67109888                            
                      Dim[1]:                  1024                                
                      Dim[2]:                  16777217                            
                    Workgroup Max Size:      1024                                
                    Grid Max Dimension:       
                      x                        4294967295                          
                      y                        4294967295                          
                      z                        4294967295                          
                    Grid Max Size:           4294967295                          
                    FBarrier Max Size:       32                                  
              *** Done ***
              From clinfo
              Code:
              clinfo
              Number of platforms                               1
                Platform Name                                   AMD Accelerated Parallel Processing
                Platform Vendor                                 Advanced Micro Devices, Inc.
                Platform Version                                OpenCL 2.1 AMD-APP (2783.0)
                Platform Profile                                FULL_PROFILE
                Platform Extensions                             cl_khr_icd cl_amd_event_callback cl_amd_offline_devices  
                Platform Host timer resolution                  1ns
                Platform Extensions function suffix             AMD
               
                Platform Name                                   AMD Accelerated Parallel Processing
              Number of devices                                 1
                Device Name                                     gfx902-xnack
                Device Vendor                                   Advanced Micro Devices, Inc.
                Device Vendor ID                                0x1002
                Device Version                                  OpenCL 1.2  
                Driver Version                                  2783.0 (HSA1.1,LC)
                Device OpenCL C Version                         OpenCL C 2.0  
                Device Type                                     GPU
                Device Available                                Yes
                Device Profile                                  FULL_PROFILE
                Device Board Name (AMD)                         AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
                Device Topology (AMD)                           PCI-E, 03:00.0
                Max compute units                               11
                SIMD per compute unit (AMD)                     4
                SIMD width (AMD)                                16
                SIMD instruction width (AMD)                    1
                Max clock frequency                             1100MHz
                Graphics IP (AMD)                               9.2
                Device Partition                                (core)
                  Max number of sub-devices                     11
                  Supported partition types                     None
                Max work item dimensions                        3
                Max work item sizes                             1024x1024x1024
                Max work group size                             256
                Compiler Available                              Yes
                Linker Available                                Yes
                Preferred work group size multiple              64
                Wavefront width (AMD)                           64
                Preferred / native vector sizes                  
                  char                                                 4 / 4        
                  short                                                2 / 2        
                  int                                                  1 / 1        
                  long                                                 1 / 1        
                  half                                                 1 / 1        (cl_khr_fp16)
                  float                                                1 / 1        
                  double                                               1 / 1        (cl_khr_fp64)
                Half-precision Floating-point support           (cl_khr_fp16)
                  Denormals                                     No
                  Infinity and NANs                             No
                  Round to nearest                              No
                  Round to zero                                 No
                  Round to infinity                             No
                  IEEE754-2008 fused multiply-add               No
                  Support is emulated in software               No
                Single-precision Floating-point support         (core)
                  Denormals                                     Yes
                  Infinity and NANs                             Yes
                  Round to nearest                              Yes
                  Round to zero                                 Yes
                  Round to infinity                             Yes
                  IEEE754-2008 fused multiply-add               Yes
                  Support is emulated in software               No
                  Correctly-rounded divide and sqrt operations  Yes
                Double-precision Floating-point support         (cl_khr_fp64)
                  Denormals                                     Yes
                  Infinity and NANs                             Yes
                  Round to nearest                              Yes
                  Round to zero                                 Yes
                  Round to infinity                             Yes
                  IEEE754-2008 fused multiply-add               Yes
                  Support is emulated in software               No
                Address bits                                    64, Little-Endian
                Global memory size                              7360856064 (6.855GiB)
                Global free memory (AMD)                        7188336 (6.855GiB)
                Global memory channels (AMD)                    2
                Global memory banks per channel (AMD)           4
                Global memory bank width (AMD)                  256 bytes
                Error Correction support                        No
                Max memory allocation                           6256727654 (5.827GiB)
                Unified memory for Host and Device              Yes
                Minimum alignment for any data type             128 bytes
                Alignment of base address                       1024 bits (128 bytes)
                Global Memory cache type                        Read/Write
                Global Memory cache size                        16384 (16KiB)
                Global Memory cache line size                   64 bytes
                Image support                                   Yes
                  Max number of samplers per kernel             5597
                  Max size for 1D images from buffer            65536 pixels
                  Max 1D or 2D image array size                 2048 images
                  Max 2D image size                             16384x16384 pixels
                  Max 3D image size                             2048x2048x2048 pixels
                  Max number of read image args                 128
                  Max number of write image args                8
                Local memory type                               Local
                Local memory size                               65536 (64KiB)
                Local memory syze per CU (AMD)                  65536 (64KiB)
                Local memory banks (AMD)                        32
                Max constant buffer size                        6256727654 (5.827GiB)
                Max number of constant args                     8
                Max size of kernel argument                     1024
                Queue properties                                 
                  Out-of-order execution                        No
                  Profiling                                     Yes
                Prefer user sync for interop                    Yes
                Profiling timer resolution                      1ns
                Profiling timer offset since Epoch (AMD)        0ns (Wed Dec 31 16:00:00 1969)
                Execution capabilities                           
                  Run OpenCL kernels                            Yes
                  Run native kernels                            No
                  Thread trace supported (AMD)                  No
                printf() buffer size                            4194304 (4MiB)
                Built-in kernels                                 
                Device Extensions                               cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program  
               
              NULL platform behavior
                clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
                clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
                clCreateContext(NULL, ...) [default]            No platform
                clCreateContext(NULL, ...) [other]              Success [AMD]
                clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
                  Platform Name                                 AMD Accelerated Parallel Processing
                  Device Name                                   gfx902-xnack
                clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
                clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
                  Platform Name                                 AMD Accelerated Parallel Processing
                  Device Name                                   gfx902-xnack
                clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
                clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
                clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
                  Platform Name                                 AMD Accelerated Parallel Processing
                  Device Name                                   gfx902-xnack
              One unfortunate minor issue is the dependency on the pth package, which has not been maintained since 2006, instead of the modern pthsem. Hopefully that will be rectified in the future; simpler installation instructions would also be welcome. Darktable, Blender, GIMP and Kdenlive were able to detect and use ROCm OpenCL.

              What a good way to end 2018: a present for mobile Raven Ridge users.
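For anyone who wants to pull a few fields out of a rocminfo dump like the one in this post rather than read it by eye, here is a minimal sketch using only the Python standard library. It assumes the "Agent N" headings and "Key: value" layout shown in the post; `parse_rocminfo_agents` is a hypothetical helper, not part of ROCm, and the sample text is a shortened copy of the output quoted here.

```python
import re


def parse_rocminfo_agents(text: str) -> list:
    """Split rocminfo-style text into one dict of 'key: value' pairs per agent."""
    agents = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if re.fullmatch(r"Agent \d+", line):
            # A new agent section starts here.
            current = {}
            agents.append(current)
            continue
        if current is None or ":" not in line:
            continue  # banner lines, separators, or text before the first agent
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # Keep the first occurrence so nested sections (pools, ISA info)
        # do not overwrite the agent's top-level attributes.
        if value and key not in current:
            current[key] = value
    return agents


# Shortened excerpt of the rocminfo output from the post.
sample = """\
*******
Agent 1
*******
  Name:                    AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
  Device Type:             CPU
  Compute Unit:            8
*******
Agent 2
*******
  Name:                    gfx902
  Device Type:             GPU
  Max Clock Frequency (MHz):1100
  Compute Unit:            11
  Fast F16 Operation:      FALSE
"""

agents = parse_rocminfo_agents(sample)
gpus = [a for a in agents if a.get("Device Type") == "GPU"]
for gpu in gpus:
    print(gpu["Name"], gpu["Compute Unit"], gpu["Max Clock Frequency (MHz)"])
```

In practice the text would come from running /opt/rocm/bin/rocminfo and capturing its standard output; the layout may differ between ROCm releases, so treat the parsing rules as assumptions.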


              • #8
                I ran LuxMark Luxball HDR on an RX580 using Clover, and got 21887. That's strangely better than Michael's results – for the GTX1080Ti. Is there anything obvious I might be doing wrong, aside from using PTS v8.0.0, locking dpm clocks to mid-high, and not uploading results yet?


                • #9
                  Originally posted by finalzone
                  ROCm OpenCL now works on AMD Mobile Raven Ridge (Ryzen 2500U as found on HP Envy x360 Convertible) running on updated Fedora 29
                  ...
                  *******
                  Agent 2
                  *******
                  ...
                  Fast F16 Operation: FALSE
                  ...
                  ISA Info:
                  ISA 1
                  Fast f16: TRUE

                  Preferred / native vector sizes
                  char 4 / 4
                  short 2 / 2
                  int 1 / 1
                  long 1 / 1
                  half 1 / 1 (cl_khr_fp16)
                  float 1 / 1
                  double 1 / 1 (cl_khr_fp64)
                  Half-precision Floating-point support (cl_khr_fp16)
                  Denormals No
                  Infinity and NANs No
                  Round to nearest No
                  Round to zero No
                  Round to infinity No
                  IEEE754-2008 fused multiply-add No
                  Support is emulated in software No
                  I'm confused... does mobile Vega have double-rate FLOPS for F16 or not?
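For a rough sense of what is at stake, the clinfo figures quoted in this thread (11 compute units, 4 SIMDs per CU, SIMD width 16, 1100 MHz) give a theoretical peak. This is a minimal sketch, assuming the usual convention that one fused multiply-add counts as two operations per lane per clock. The doubled FP16 figure is purely hypothetical: whether this part runs packed FP16 at twice the FP32 rate is exactly what the conflicting flags leave open. `peak_gflops` is an illustrative helper, not part of any ROCm or OpenCL tool.

```python
def peak_gflops(compute_units: int, simds_per_cu: int, simd_width: int,
                clock_mhz: float, ops_per_lane_per_clock: int = 2) -> float:
    """Theoretical peak in GFLOPS: lanes * ops per lane per clock * clock."""
    lanes = compute_units * simds_per_cu * simd_width
    return lanes * ops_per_lane_per_clock * clock_mhz / 1000.0


# Values taken from the clinfo output quoted in the thread.
fp32 = peak_gflops(11, 4, 16, 1100)

# Hypothetical: only valid if packed FP16 really runs at double rate.
fp16_if_double_rate = peak_gflops(11, 4, 16, 1100, ops_per_lane_per_clock=4)

print(f"FP32 peak: {fp32:.1f} GFLOPS")
print(f"FP16 peak if double rate: {fp16_if_double_rate:.1f} GFLOPS")
```

These are upper bounds from clock and lane counts only; they do not settle which of the two reported F16 flags is correct.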

                  Originally posted by utrrrongeeb
                  I ran LuxMark Luxball HDR on an RX580 using Clover, and got 21887
                  Does it pass image validation for this result?


                  • #10
                    Originally posted by klokik
                    Does it pass image validation for this result?
                    I hadn't thought to check. Having taken another look, my impression of LuxMark is that it iteratively refines the render for a fixed time of 120 seconds, and compares whatever's finished at that point with a converged reference render. There's a threshold (around 15% of pixels mismatching?) where the displayed judgment changes from "pass" to "fail," but the process appears the same. PTS does not appear to read or consider this pass/fail judgment. If I remove the GPU clock limits, the RX580's score heads towards 29686, and somewhere in between it crosses into the "pass" range with 13.26% error. Visually, the output looks right / on the right track; if you know more about LuxMark's technical details, please explain. (I'm also cheating a bit by caching compiled kernels, but that shouldn't explain the size of the results' differences.)
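The pass/fail judgment as described in this post can be sketched as a simple threshold on the fraction of mismatching pixels. This is only an illustration of that description, not LuxMark's actual validation code: the 15% threshold is the guess from the post, and the per-pixel tolerance, the helper names and the toy pixel values are all invented for the example.

```python
def mismatch_fraction(rendered, reference, tolerance):
    """Fraction of pixels whose absolute difference from the reference exceeds `tolerance`."""
    if len(rendered) != len(reference):
        raise ValueError("images must have the same number of pixels")
    if not rendered:
        raise ValueError("images must not be empty")
    mismatched = sum(1 for a, b in zip(rendered, reference) if abs(a - b) > tolerance)
    return mismatched / len(rendered)


def verdict(fraction, fail_threshold=0.15):
    """Return 'pass' below the (assumed) threshold, 'fail' otherwise."""
    return "pass" if fraction < fail_threshold else "fail"


# Toy single-channel "images": eight pixels each, values made up for illustration.
reference = [0.10, 0.50, 0.90, 0.30, 0.70, 0.20, 0.60, 0.80]
rendered = [0.10, 0.52, 0.90, 0.45, 0.70, 0.20, 0.60, 0.80]

fraction = mismatch_fraction(rendered, reference, tolerance=0.05)
print(f"{fraction:.2%} mismatched -> {verdict(fraction)}")
```

Under these assumptions the 13.26% error mentioned above would land on the "pass" side, which matches what the post reports seeing.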

                    One wonders whether the GTX1080Ti "passed" in this article, and whether it's getting better than 140 points per watt. :-)

                    For the other LuxMark scenes, compilation takes an unreasonably long time (ten minutes), but in most cases works or can be adjusted to work (try disabling -cl-mad-enable). For example, I'm seeing a score of 42710 for Microphone, at 8.85% error. I haven't comprehensively saved results and compared them with Michael's yet. There's the small disadvantage of the mouse cursor becoming unresponsive when the heavier benchmarks are running, at least on Wayland.
                    Last edited by utrrrongeeb; 03 January 2019, 10:27 PM. Reason: added Mic and rough power efficiency
