**tildearrow** · 28 December 2018, 01:22 PM

Typos:

Originally posted by phoronix:

For your viewing pleasusre today

Originally posted by phoronix:

In the single precision test, the RX Vega 56 comes out ahead of the GTX 1080 Ti

No, it does not.

**kbios** · 28 December 2018, 01:25 PM

Thanks Michael, very interesting benchmarks! I've found that raising the HBM frequency can be very beneficial in OpenCL workloads on the Vega cards. For example, if I raise it from 945 MHz to 1100 MHz, LuxMark goes up to almost 37000 on the Vega 64.

**Tim Blokdijk** · 28 December 2018, 05:00 PM

I would be very interested in Blender OpenCL tests.

**Aeder** · 28 December 2018, 06:18 PM

I see the AMD compute stack has massive potential but is still stuck at the start line.

**tuxd3v** · 28 December 2018, 07:36 PM

Originally posted by Aeder:

I see the AMD compute stack has massive potential but is still stuck at the start line.

Indeed,
the first part of the work seems to be there already.

Now PowerPlay tuning is needed,
because on Linux, AMD cards consume a lot more power than on Windows.
Also, of course, latency work and optimizations should follow, or be done in parallel.

But this ROCm version seems to be a good foundation for all the optimization work.

**finalzone** · 29 December 2018, 02:36 AM

ROCm OpenCL now works on AMD Mobile Raven Ridge (Ryzen 2500U as found on the HP Envy x360 Convertible) running an updated Fedora 29 after following the procedure. All previous OpenCL installations from amdgpu-pro were removed prior to that.

As confirmed by rocminfo:

Code:

/opt/rocm/bin/rocminfo  
=====================     
HSA System Attributes     
=====================     
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (number of timestamp)
Machine Model:           LARGE                               
System Endianness:       LITTLE                              
 
==========                
HSA Agents                
==========                
*******                   
Agent 1                   
*******                   
  Name:                    AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
  Vendor Name:             CPU                                 
  Feature:                 None specified                      
  Profile:                 FULL_PROFILE                        
  Float Round Mode:        NEAR                                
  Max Queue Number:        0                                   
  Queue Min Size:          0                                   
  Queue Max Size:          0                                   
  Queue Type:              MULTI                               
  Node:                    0                                   
  Device Type:             CPU                                 
  Cache Info:               
    L1:                      32KB                                
  Chip ID:                 5597                                
  Cacheline Size:          64                                  
  Max Clock Frequency (MHz):2000                                
  BDFID:                   768                                 
  Compute Unit:            8                                   
  Features:                None
  Pool Info:                
    Pool 1                    
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    16776832KB                          
      Allocatable:             TRUE                                
      Alloc Granule:           4KB                                 
      Alloc Alignment:         4KB                                 
      Acessible by all:        TRUE                                
  ISA Info:                 
    N/A                       
*******                   
Agent 2                   
*******                   
  Name:                    gfx902                              
  Vendor Name:             AMD                                 
  Feature:                 KERNEL_DISPATCH                     
  Profile:                 FULL_PROFILE                        
  Float Round Mode:        NEAR                                
  Max Queue Number:        128                                 
  Queue Min Size:          4096                                
  Queue Max Size:          131072                              
  Queue Type:              MULTI                               
  Node:                    0                                   
  Device Type:             GPU                                 
  Cache Info:               
    L1:                      16KB                                
  Chip ID:                 5597                                
  Cacheline Size:          64                                  
  Max Clock Frequency (MHz):1100                                
  BDFID:                   768                                 
  Compute Unit:            11                                  
  Features:                KERNEL_DISPATCH  
  Fast F16 Operation:      FALSE                               
  Wavefront Size:          64                                  
  Workgroup Max Size:      1024                                
  Workgroup Max Size Per Dimension:
    Dim[0]:                  67109888                            
    Dim[1]:                  50332672                            
    Dim[2]:                  0                                   
  Grid Max Size:           4294967295                          
  Waves Per CU:            160                                 
  Max Work-item Per CU:    10240                               
  Grid Max Size per Dimension:
    Dim[0]:                  4294967295                          
    Dim[1]:                  4294967295                          
    Dim[2]:                  4294967295                          
  Max number Of fbarriers Per Workgroup:32                                  
  Pool Info:                
    Pool 1                    
      Segment:                 GROUP                               
      Size:                    64KB                                
      Allocatable:             FALSE                               
      Alloc Granule:           0KB                                 
      Alloc Alignment:         0KB                                 
      Acessible by all:        FALSE                               
  ISA Info:                 
    ISA 1                     
      Name:                    amdgcn-amd-amdhsa--gfx902+xnack     
      Machine Models:          HSA_MACHINE_MODEL_LARGE             
      Profiles:                HSA_PROFILE_BASE                    
      Default Rounding Mode:   NEAR                                
      Default Rounding Mode:   NEAR                                
      Fast f16:                TRUE                                
      Workgroup Max Dimension:  
        Dim[0]:                  67109888                            
        Dim[1]:                  1024                                
        Dim[2]:                  16777217                            
      Workgroup Max Size:      1024                                
      Grid Max Dimension:       
        x                        4294967295                          
        y                        4294967295                          
        z                        4294967295                          
      Grid Max Size:           4294967295                          
      FBarrier Max Size:       32                                  
*** Done ***

From clinfo

Code:

clinfo
Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.1 AMD-APP (2783.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback cl_amd_offline_devices  
  Platform Host timer resolution                  1ns
  Platform Extensions function suffix             AMD
 
  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 1
  Device Name                                     gfx902-xnack
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.2  
  Driver Version                                  2783.0 (HSA1.1,LC)
  Device OpenCL C Version                         OpenCL C 2.0  
  Device Type                                     GPU
  Device Available                                Yes
  Device Profile                                  FULL_PROFILE
  Device Board Name (AMD)                         AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
  Device Topology (AMD)                           PCI-E, 03:00.0
  Max compute units                               11
  SIMD per compute unit (AMD)                     4
  SIMD width (AMD)                                16
  SIMD instruction width (AMD)                    1
  Max clock frequency                             1100MHz
  Graphics IP (AMD)                               9.2
  Device Partition                                (core)
    Max number of sub-devices                     11
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             256
  Compiler Available                              Yes
  Linker Available                                Yes
  Preferred work group size multiple              64
  Wavefront width (AMD)                           64
  Preferred / native vector sizes                  
    char                                                 4 / 4        
    short                                                2 / 2        
    int                                                  1 / 1        
    long                                                 1 / 1        
    half                                                 1 / 1        (cl_khr_fp16)
    float                                                1 / 1        
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     No
    Infinity and NANs                             No
    Round to nearest                              No
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              7360856064 (6.855GiB)
  Global free memory (AMD)                        7188336 (6.855GiB)
  Global memory channels (AMD)                    2
  Global memory banks per channel (AMD)           4
  Global memory bank width (AMD)                  256 bytes
  Error Correction support                        No
  Max memory allocation                           6256727654 (5.827GiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        16384 (16KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             5597
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             2048x2048x2048 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Local
  Local memory size                               65536 (64KiB)
  Local memory syze per CU (AMD)                  65536 (64KiB)
  Local memory banks (AMD)                        32
  Max constant buffer size                        6256727654 (5.827GiB)
  Max number of constant args                     8
  Max size of kernel argument                     1024
  Queue properties                                 
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1ns
  Profiling timer offset since Epoch (AMD)        0ns (Wed Dec 31 16:00:00 1969)
  Execution capabilities                           
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Thread trace supported (AMD)                  No
  printf() buffer size                            4194304 (4MiB)
  Built-in kernels                                 
  Device Extensions                               cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program  
 
NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              Success [AMD]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx902-xnack
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx902-xnack
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx902-xnack

One unfortunate minor issue is the dependency on the pth package, which has not been maintained since 2006, instead of the modern pthsem. Hopefully that will be rectified in the future; simpler installation instructions would also be welcome. Darktable, Blender, GIMP and Kdenlive were able to detect and use ROCm OpenCL.

What a good way to end 2018: a present for mobile Raven Ridge users.

**utrrrongeeb** · 01 January 2019, 11:01 PM

I ran LuxMark Luxball HDR on an RX580 using Clover, and got 21887. That's strangely better than Michael's results – for the GTX1080Ti. Is there anything obvious I might be doing wrong, aside from using PTS v8.0.0, locking dpm clocks to mid-high, and not uploading results yet?

**klokik** · 02 January 2019, 04:20 PM

Originally posted by finalzone:

ROCm OpenCL now works on AMD Mobile Raven Ridge (Ryzen 2500U as found on HP Envy x360 Convertible) running on updated Fedora 29
...
*******
Agent 2
*******
...
Fast F16 Operation: FALSE
...
ISA Info:
ISA 1
Fast f16: TRUE

Preferred / native vector sizes
char 4 / 4
short 2 / 2
int 1 / 1
long 1 / 1
half 1 / 1 (cl_khr_fp16)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs No
Round to nearest No
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No

I'm confused... does mobile Vega have double FLOPS for F16 or not?
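The two conflicting fields can be pulled out of the rocminfo text mechanically. Below is a minimal Python sketch, assuming the output was saved as plain text. The excerpt is trimmed from the output quoted above, and `fast_f16_flags` is a hypothetical helper name, not part of any ROCm tool. It only surfaces the disagreement between the agent-level and ISA-level lines; it does not say which of the two is right.

```python
import re

# Trimmed excerpt of the rocminfo output quoted above (Agent 2 only).
ROCMINFO_EXCERPT = """\
Agent 2
  Name:                    gfx902
  Fast F16 Operation:      FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx902+xnack
      Fast f16:                TRUE
"""

# Matches lines such as "Fast F16 Operation: FALSE" or "Fast f16: TRUE".
_FAST_F16 = re.compile(
    r"^[ \t]*(Fast f16[^:\n]*):[ \t]*(TRUE|FALSE)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def fast_f16_flags(text):
    """Hypothetical helper: return (label, value) pairs for each fast-F16 line."""
    return [(m.group(1).strip(), m.group(2).upper()) for m in _FAST_F16.finditer(text)]

flags = fast_f16_flags(ROCMINFO_EXCERPT)
for label, value in flags:
    print(f"{label}: {value}")

# True when the reported values are not all the same.
inconsistent = len({value for _, value in flags}) > 1
print("inconsistent:", inconsistent)
```

Running the same helper over a full rocminfo dump would list every agent's lines in order, which makes it easier to compare cards.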

Originally posted by utrrrongeeb:

I ran LuxMark Luxball HDR on an RX580 using Clover, and got 21887

Does it pass image validation for this result?

**utrrrongeeb** · 03 January 2019, 09:57 PM

Originally posted by klokik:

Does it pass image validation for this result?

I hadn't thought to check. Having taken another look, my impression of LuxMark is that it iteratively refines the render for a fixed time of 120 seconds, and compares whatever's finished at that point with a converged reference render. There's a threshold (around 15% of pixels mismatch?) where the displayed judgment changes from "pass" to "fail," but the process appears the same. PTS does not appear to read or consider this pass/fail judgment. If I remove the GPU clock limits, the RX580's score heads towards 29686, and somewhere in between it crosses into the "pass" range with 13.26% error. Visually, the output looks right / on the right track; if you know more about LuxMark's technical details, please explain. (I'm also cheating a bit by caching compiled kernels, but that shouldn't explain the size of the results' differences.)
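To make the pass/fail idea above concrete, here is a rough Python sketch. It is not LuxMark's actual validation code: the 15% threshold is only the guess from this post, and the per-pixel `tolerance`, the flat-list "images", and the function names are all assumptions for illustration.

```python
# Illustrative only: NOT LuxMark's real validation logic.

def mismatch_fraction(render, reference, tolerance=0.05):
    """Fraction of pixels differing from the reference by more than `tolerance`.

    `tolerance` is an assumed per-pixel cutoff, not a LuxMark value.
    """
    if len(render) != len(reference) or not reference:
        raise ValueError("images must be non-empty and the same size")
    bad = sum(1 for a, b in zip(render, reference) if abs(a - b) > tolerance)
    return bad / len(reference)

def judge(render, reference, threshold=0.15):
    """Return "pass" below the (guessed) 15% mismatch threshold, else "fail"."""
    return "pass" if mismatch_fraction(render, reference) < threshold else "fail"

# Tiny grayscale "images" as flat lists of floats in [0, 1].
reference = [0.0, 0.25, 0.5, 0.75, 1.0, 0.5, 0.25, 0.0, 0.5, 1.0]
noisy     = [0.0, 0.25, 0.5, 0.75, 1.0, 0.5, 0.25, 0.0, 0.5, 0.2]  # 1 of 10 pixels off
noisier   = [0.4, 0.25, 0.9, 0.75, 1.0, 0.5, 0.25, 0.0, 0.5, 0.2]  # 3 of 10 pixels off

print(judge(noisy, reference))    # pass (10% mismatch)
print(judge(noisier, reference))  # fail (30% mismatch)
```

Under this model a render that has had less time or less clock speed to converge simply lands at a higher mismatch fraction, which would fit a score reported at 13.26% error counting as a pass.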

One wonders whether the GTX1080Ti "passed" in this article, and whether it's getting better than 140 points per watt. :-)

For the other LuxMark scenes, compilation takes an unreasonably long time (ten minutes), but in most cases it works or can be adjusted to work (try disabling -cl-mad-enable). For example, I'm seeing a score of 42710 for Microphone, at 8.85% error. I haven't comprehensively saved results and compared them with Michael's yet. There's the small disadvantage of the mouse cursor becoming unresponsive while the heavier benchmarks are running, at least in Wayland.

Taking Radeon ROCm 2.0 OpenCL For A Benchmarking Test Drive