Taking Radeon ROCm 2.0 OpenCL For A Benchmarking Test Drive

  • Taking Radeon ROCm 2.0 OpenCL For A Benchmarking Test Drive

    Phoronix: Taking Radeon ROCm 2.0 OpenCL For A Benchmarking Test Drive

    Last week AMD officially released ROCm 2.0 as the newest major release of the Radeon Open Compute stack. Here are some initial benchmark figures for that Radeon Linux compute component on Polaris and Vega hardware.

    http://www.phoronix.com/vr.php?view=27324

  • #2
    Typos:

    Originally posted by phoronix
    For your viewing pleasusre today
    Originally posted by phoronix
    In the single precision test, the RX Vega 56 comes out ahead of the GTX 1080 Ti
    No, it does not.

    • #3
      Thanks Michael, very interesting benchmarks! I've found that raising the HBM frequency can be very beneficial in OpenCL workloads on the Vega cards: for example, if I raise it from 945 MHz to 1100 MHz, LuxMark goes up to almost 37000 on the Vega 64.
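For a sense of scale, here is a back-of-envelope sketch of the theoretical memory bandwidth at both clocks. It assumes the RX Vega 64's 2048-bit HBM2 interface transferring twice per clock; the bus width is not stated in the thread, and these are theoretical peaks, not measurements.

```python
# Back-of-envelope HBM2 bandwidth for an RX Vega 64 at two memory clocks.
BUS_WIDTH_BITS = 2048      # assumed Vega 64 HBM2 interface width
TRANSFERS_PER_CLOCK = 2    # HBM2 is double data rate


def hbm_bandwidth_gbs(mem_clock_mhz: float) -> float:
    """Theoretical peak bandwidth in GB/s (10**9 bytes per second)."""
    bytes_per_clock = BUS_WIDTH_BITS // 8 * TRANSFERS_PER_CLOCK
    # bytes/clock * MHz gives MB/s; divide by 1000 for GB/s
    return bytes_per_clock * mem_clock_mhz / 1000


stock = hbm_bandwidth_gbs(945)
raised = hbm_bandwidth_gbs(1100)
print(f"945 MHz : {stock:.1f} GB/s")
print(f"1100 MHz: {raised:.1f} GB/s ({(raised / stock - 1) * 100:+.1f}%)")
```

Under those assumptions the overclock adds roughly 16% of peak bandwidth; how much of that shows up in LuxMark depends on how memory-bound the kernel is.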

      • #4
        I would be very interested in Blender OpenCL tests.

        • #5
          I see the AMD compute stack has massive potential but is still stuck at the start line.

          • #6
            Originally posted by Aeder
            I see the AMD compute stack has massive potential but is still stuck at the start line.
            Indeed.
            The first part of the work seems to be in place already.

            Now PowerPlay tuning is needed,
            because on Linux, AMD cards consume a lot more power than on Windows.
            Of course, latency work and other optimizations should follow, or be done in parallel.

            But this ROCm version looks like a good foundation for all of that optimization work.

            • #7
              Still hoping for the HIP benchmark.

              • #8
                ROCm OpenCL now works on AMD mobile Raven Ridge (the Ryzen 5 2500U, as found in the HP Envy x360 Convertible) running an updated Fedora 29, after following the procedure. All previous OpenCL installations from amdgpu-pro were removed beforehand.

                As confirmed by rocminfo:
                Code:
                /opt/rocm/bin/rocminfo  
                =====================     
                HSA System Attributes     
                =====================     
                Runtime Version:         1.1
                System Timestamp Freq.:  1000.000000MHz
                Sig. Max Wait Duration:  18446744073709551615 (number of timestamp)
                Machine Model:           LARGE                               
                System Endianness:       LITTLE                              
                 
                ==========                
                HSA Agents                
                ==========                
                *******                   
                Agent 1                   
                *******                   
                  Name:                    AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
                  Vendor Name:             CPU                                 
                  Feature:                 None specified                      
                  Profile:                 FULL_PROFILE                        
                  Float Round Mode:        NEAR                                
                  Max Queue Number:        0                                   
                  Queue Min Size:          0                                   
                  Queue Max Size:          0                                   
                  Queue Type:              MULTI                               
                  Node:                    0                                   
                  Device Type:             CPU                                 
                  Cache Info:               
                    L1:                      32KB                                
                  Chip ID:                 5597                                
                  Cacheline Size:          64                                  
                  Max Clock Frequency (MHz):2000                                
                  BDFID:                   768                                 
                  Compute Unit:            8                                   
                  Features:                None
                  Pool Info:                
                    Pool 1                    
                      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
                      Size:                    16776832KB                          
                      Allocatable:             TRUE                                
                      Alloc Granule:           4KB                                 
                      Alloc Alignment:         4KB                                 
                      Acessible by all:        TRUE                                
                  ISA Info:                 
                    N/A                       
                *******                   
                Agent 2                   
                *******                   
                  Name:                    gfx902                              
                  Vendor Name:             AMD                                 
                  Feature:                 KERNEL_DISPATCH                     
                  Profile:                 FULL_PROFILE                        
                  Float Round Mode:        NEAR                                
                  Max Queue Number:        128                                 
                  Queue Min Size:          4096                                
                  Queue Max Size:          131072                              
                  Queue Type:              MULTI                               
                  Node:                    0                                   
                  Device Type:             GPU                                 
                  Cache Info:               
                    L1:                      16KB                                
                  Chip ID:                 5597                                
                  Cacheline Size:          64                                  
                  Max Clock Frequency (MHz):1100                                
                  BDFID:                   768                                 
                  Compute Unit:            11                                  
                  Features:                KERNEL_DISPATCH  
                  Fast F16 Operation:      FALSE                               
                  Wavefront Size:          64                                  
                  Workgroup Max Size:      1024                                
                  Workgroup Max Size Per Dimension:
                    Dim[0]:                  67109888                            
                    Dim[1]:                  50332672                            
                    Dim[2]:                  0                                   
                  Grid Max Size:           4294967295                          
                  Waves Per CU:            160                                 
                  Max Work-item Per CU:    10240                               
                  Grid Max Size per Dimension:
                    Dim[0]:                  4294967295                          
                    Dim[1]:                  4294967295                          
                    Dim[2]:                  4294967295                          
                  Max number Of fbarriers Per Workgroup:32                                  
                  Pool Info:                
                    Pool 1                    
                      Segment:                 GROUP                               
                      Size:                    64KB                                
                      Allocatable:             FALSE                               
                      Alloc Granule:           0KB                                 
                      Alloc Alignment:         0KB                                 
                      Acessible by all:        FALSE                               
                  ISA Info:                 
                    ISA 1                     
                      Name:                    amdgcn-amd-amdhsa--gfx902+xnack     
                      Machine Models:          HSA_MACHINE_MODEL_LARGE             
                      Profiles:                HSA_PROFILE_BASE                    
                      Default Rounding Mode:   NEAR                                
                      Default Rounding Mode:   NEAR                                
                      Fast f16:                TRUE                                
                      Workgroup Max Dimension:  
                        Dim[0]:                  67109888                            
                        Dim[1]:                  1024                                
                        Dim[2]:                  16777217                            
                      Workgroup Max Size:      1024                                
                      Grid Max Dimension:       
                        x                        4294967295                          
                        y                        4294967295                          
                        z                        4294967295                          
                      Grid Max Size:           4294967295                          
                      FBarrier Max Size:       32                                  
                *** Done ***
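A minimal sketch for doing the same check programmatically: it picks out the GPU agents from rocminfo text like the dump above. It assumes the `Agent N` / `Name:` / `Device Type:` layout shown here, which may change between ROCm releases; the helper names are hypothetical, and `rocminfo_output()` assumes the /opt/rocm/bin path used in this post.

```python
import re
import subprocess


def rocminfo_output(path: str = "/opt/rocm/bin/rocminfo") -> str:
    """Run rocminfo and return its text output (path as used in the post above)."""
    return subprocess.run([path], capture_output=True, text=True, check=True).stdout


def gpu_agents(rocminfo_text: str) -> list:
    """Return the agent Name of every HSA agent whose Device Type is GPU."""
    gpus = []
    # Each agent section starts with a line like "Agent 2"; drop the header before the first one.
    sections = re.split(r"^[ \t]*Agent \d+[ \t]*$", rocminfo_text, flags=re.MULTILINE)[1:]
    for section in sections:
        # The first "Name:" in a section is the agent name; later ones belong to ISA entries.
        name = re.search(r"^[ \t]*Name:[ \t]*(.+?)[ \t]*$", section, flags=re.MULTILINE)
        dev_type = re.search(r"^[ \t]*Device Type:[ \t]*(\S+)", section, flags=re.MULTILINE)
        if name and dev_type and dev_type.group(1) == "GPU":
            gpus.append(name.group(1))
    return gpus


# Trimmed excerpt of the dump above, so the sketch runs without ROCm installed.
SAMPLE = """\
*******
Agent 1
*******
  Name:                    AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
  Device Type:             CPU
*******
Agent 2
*******
  Name:                    gfx902
  Device Type:             GPU
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx902+xnack
"""

print(gpu_agents(SAMPLE))  # prints ['gfx902']
```

On a live system, `gpu_agents(rocminfo_output())` should list `gfx902` for this laptop if ROCm sees the iGPU.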
                From clinfo
                Code:
                clinfo
                Number of platforms                               1
                  Platform Name                                   AMD Accelerated Parallel Processing
                  Platform Vendor                                 Advanced Micro Devices, Inc.
                  Platform Version                                OpenCL 2.1 AMD-APP (2783.0)
                  Platform Profile                                FULL_PROFILE
                  Platform Extensions                             cl_khr_icd cl_amd_event_callback cl_amd_offline_devices  
                  Platform Host timer resolution                  1ns
                  Platform Extensions function suffix             AMD
                 
                  Platform Name                                   AMD Accelerated Parallel Processing
                Number of devices                                 1
                  Device Name                                     gfx902-xnack
                  Device Vendor                                   Advanced Micro Devices, Inc.
                  Device Vendor ID                                0x1002
                  Device Version                                  OpenCL 1.2  
                  Driver Version                                  2783.0 (HSA1.1,LC)
                  Device OpenCL C Version                         OpenCL C 2.0  
                  Device Type                                     GPU
                  Device Available                                Yes
                  Device Profile                                  FULL_PROFILE
                  Device Board Name (AMD)                         AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
                  Device Topology (AMD)                           PCI-E, 03:00.0
                  Max compute units                               11
                  SIMD per compute unit (AMD)                     4
                  SIMD width (AMD)                                16
                  SIMD instruction width (AMD)                    1
                  Max clock frequency                             1100MHz
                  Graphics IP (AMD)                               9.2
                  Device Partition                                (core)
                    Max number of sub-devices                     11
                    Supported partition types                     None
                  Max work item dimensions                        3
                  Max work item sizes                             1024x1024x1024
                  Max work group size                             256
                  Compiler Available                              Yes
                  Linker Available                                Yes
                  Preferred work group size multiple              64
                  Wavefront width (AMD)                           64
                  Preferred / native vector sizes                  
                    char                                                 4 / 4        
                    short                                                2 / 2        
                    int                                                  1 / 1        
                    long                                                 1 / 1        
                    half                                                 1 / 1        (cl_khr_fp16)
                    float                                                1 / 1        
                    double                                               1 / 1        (cl_khr_fp64)
                  Half-precision Floating-point support           (cl_khr_fp16)
                    Denormals                                     No
                    Infinity and NANs                             No
                    Round to nearest                              No
                    Round to zero                                 No
                    Round to infinity                             No
                    IEEE754-2008 fused multiply-add               No
                    Support is emulated in software               No
                  Single-precision Floating-point support         (core)
                    Denormals                                     Yes
                    Infinity and NANs                             Yes
                    Round to nearest                              Yes
                    Round to zero                                 Yes
                    Round to infinity                             Yes
                    IEEE754-2008 fused multiply-add               Yes
                    Support is emulated in software               No
                    Correctly-rounded divide and sqrt operations  Yes
                  Double-precision Floating-point support         (cl_khr_fp64)
                    Denormals                                     Yes
                    Infinity and NANs                             Yes
                    Round to nearest                              Yes
                    Round to zero                                 Yes
                    Round to infinity                             Yes
                    IEEE754-2008 fused multiply-add               Yes
                    Support is emulated in software               No
                  Address bits                                    64, Little-Endian
                  Global memory size                              7360856064 (6.855GiB)
                  Global free memory (AMD)                        7188336 (6.855GiB)
                  Global memory channels (AMD)                    2
                  Global memory banks per channel (AMD)           4
                  Global memory bank width (AMD)                  256 bytes
                  Error Correction support                        No
                  Max memory allocation                           6256727654 (5.827GiB)
                  Unified memory for Host and Device              Yes
                  Minimum alignment for any data type             128 bytes
                  Alignment of base address                       1024 bits (128 bytes)
                  Global Memory cache type                        Read/Write
                  Global Memory cache size                        16384 (16KiB)
                  Global Memory cache line size                   64 bytes
                  Image support                                   Yes
                    Max number of samplers per kernel             5597
                    Max size for 1D images from buffer            65536 pixels
                    Max 1D or 2D image array size                 2048 images
                    Max 2D image size                             16384x16384 pixels
                    Max 3D image size                             2048x2048x2048 pixels
                    Max number of read image args                 128
                    Max number of write image args                8
                  Local memory type                               Local
                  Local memory size                               65536 (64KiB)
                  Local memory syze per CU (AMD)                  65536 (64KiB)
                  Local memory banks (AMD)                        32
                  Max constant buffer size                        6256727654 (5.827GiB)
                  Max number of constant args                     8
                  Max size of kernel argument                     1024
                  Queue properties                                 
                    Out-of-order execution                        No
                    Profiling                                     Yes
                  Prefer user sync for interop                    Yes
                  Profiling timer resolution                      1ns
                  Profiling timer offset since Epoch (AMD)        0ns (Wed Dec 31 16:00:00 1969)
                  Execution capabilities                           
                    Run OpenCL kernels                            Yes
                    Run native kernels                            No
                    Thread trace supported (AMD)                  No
                  printf() buffer size                            4194304 (4MiB)
                  Built-in kernels                                 
                  Device Extensions                               cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program  
                 
                NULL platform behavior
                  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
                  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
                  clCreateContext(NULL, ...) [default]            No platform
                  clCreateContext(NULL, ...) [other]              Success [AMD]
                  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
                    Platform Name                                 AMD Accelerated Parallel Processing
                    Device Name                                   gfx902-xnack
                  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
                  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
                    Platform Name                                 AMD Accelerated Parallel Processing
                    Device Name                                   gfx902-xnack
                  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
                  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
                  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
                    Platform Name                                 AMD Accelerated Parallel Processing
                    Device Name                                   gfx902-xnack
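The memory lines in the clinfo dump look inconsistent at first glance (7360856064 vs. 7188336, both shown as 6.855GiB). A short sketch reconciling them, assuming the AMD free-memory field is reported in KiB, which is what makes the two values agree:

```python
# Reconcile the memory figures from the clinfo dump above.
GIB = 2**30

global_mem_bytes = 7360856064   # "Global memory size", in bytes
free_mem_kib = 7188336          # "Global free memory (AMD)", assumed to be in KiB
max_alloc_bytes = 6256727654    # "Max memory allocation", in bytes

free_mem_bytes = free_mem_kib * 1024
print(f"global memory : {global_mem_bytes / GIB:.3f} GiB")
print(f"free memory   : {free_mem_bytes / GIB:.3f} GiB")
print(f"max allocation: {max_alloc_bytes / GIB:.3f} GiB")

# The free-memory figure converts to exactly the global size in bytes...
assert free_mem_bytes == global_mem_bytes
# ...and the largest single allocation is 85% of global memory, rounded down.
assert max_alloc_bytes == global_mem_bytes * 85 // 100
```

Whether that 85% ratio is a fixed driver policy is not something this dump alone can tell.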
                One unfortunate minor issue is the dependency on the pth package, which has not been maintained since 2006, instead of the modern pthsem. Hopefully that will be rectified in the future, and simpler installation instructions would be welcome. Darktable, Blender, GIMP and Kdenlive were all able to detect and use ROCm OpenCL.

                What a good way to end 2018 with that present for mobile Raven Ridge users.

                • #9
                  I ran LuxMark Luxball HDR on an RX 580 using Clover and got 21887. That's strangely better than Michael's result for the GTX 1080 Ti. Is there anything obvious I might be doing wrong, aside from using PTS v8.0.0, locking DPM clocks to mid-high, and not uploading results yet?

                  • #10
                    Originally posted by finalzone
                    ROCm OpenCL now works on AMD Mobile Raven Ridge (Ryzen 2500U as found on HP Envy x360 Convertible) running on updated Fedora 29
                    ...
                    *******
                    Agent 2
                    *******
                    ...
                    Fast F16 Operation: FALSE
                    ...
                    ISA Info:
                    ISA 1
                    Fast f16: TRUE

                    Preferred / native vector sizes
                    char 4 / 4
                    short 2 / 2
                    int 1 / 1
                    long 1 / 1
                    half 1 / 1 (cl_khr_fp16)
                    float 1 / 1
                    double 1 / 1 (cl_khr_fp64)
                    Half-precision Floating-point support (cl_khr_fp16)
                    Denormals No
                    Infinity and NANs No
                    Round to nearest No
                    Round to zero No
                    Round to infinity No
                    IEEE754-2008 fused multiply-add No
                    Support is emulated in software No
                    I'm confused: does mobile Vega have double FLOPS for F16 or not?
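To put numbers on the question, here is a sketch of theoretical peak throughput computed from the clinfo fields in #8 (11 CUs, 4 SIMDs per CU, 16 lanes per SIMD, 1100 MHz), counting an FMA as two operations. The doubled FP16 figure assumes packed half-precision math retiring two FP16 FMAs per lane per clock; whether gfx902 does that is exactly what the conflicting `Fast F16` fields leave open, so treat it as what "double FLOPS" would mean, not a confirmed spec.

```python
# Theoretical peak throughput of the gfx902 iGPU, from the clinfo fields in #8.
COMPUTE_UNITS = 11      # "Max compute units"
SIMD_PER_CU = 4         # "SIMD per compute unit (AMD)"
SIMD_WIDTH = 16         # "SIMD width (AMD)", lanes per SIMD
CLOCK_MHZ = 1100        # "Max clock frequency"
OPS_PER_FMA = 2         # a fused multiply-add counts as two floating-point operations


def peak_gflops(fmas_per_lane_per_clock: int) -> float:
    """Peak GFLOPS if every lane retires the given number of FMAs each clock."""
    lanes = COMPUTE_UNITS * SIMD_PER_CU * SIMD_WIDTH
    # lanes * FMAs * ops * MHz gives MFLOPS; divide by 1000 for GFLOPS
    return lanes * fmas_per_lane_per_clock * OPS_PER_FMA * CLOCK_MHZ / 1000


fp32 = peak_gflops(1)
# Assumption: packed FP16 would retire two half-precision FMAs per lane per clock.
fp16_packed = peak_gflops(2)
print(f"FP32 peak:          {fp32:.1f} GFLOPS")
print(f"FP16 peak (packed): {fp16_packed:.1f} GFLOPS")
```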

                    Originally posted by utrrrongeeb
                    I ran LuxMark Luxball HDR on an RX580 using Clover, and got 21887
                    Does it pass image validation for this result?
