Announcement

**defaultUser** · 14 January 2017, 01:43 PM

Originally posted by Shnatsel

TL;DR: just use GIMP? It sure would be nice to improve the GIMP profile to a point where it would be compiled inside the test to make the results reproducible. But I'm afraid that still would not provide a way to force OpenCL usage, or even to check for it.

Would the "gegl" command-line program allocate buffers more optimally than tests do? Perhaps I could patch it to output whether it's using OpenCL or not when verbose output is requested.

Actually, this is an issue with how GEGL uses OpenCL. There is no workaround; most of the time the performance is limited by these copy operations.
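To put rough numbers on why the copies and allocations can dominate, here is a minimal back-of-the-envelope sketch in Python. It is not GEGL code: the function name and every timing value are made up for illustration, and it assumes each tile is handled strictly one after another (allocate, upload, run the kernel, download).

```python
def tiled_gpu_time_us(n_tiles, alloc_us, upload_us, kernel_us, download_us,
                      reuse_buffers=False):
    """Total time in microseconds to process n_tiles one after another.

    Hypothetical cost model, not measured GEGL behaviour:
    - without buffer reuse, every tile pays for its own GPU allocation
    - with buffer reuse, the allocation is paid only once
    - the upload and download copies are paid per tile in both cases
    """
    allocations = 1 if reuse_buffers else n_tiles
    per_tile_us = upload_us + kernel_us + download_us
    return allocations * alloc_us + n_tiles * per_tile_us


if __name__ == "__main__":
    # Illustrative numbers only (integers, in microseconds).
    n, alloc, up, kern, down = 100, 500, 200, 100, 200
    per_tile_alloc = tiled_gpu_time_us(n, alloc, up, kern, down)
    reused = tiled_gpu_time_us(n, alloc, up, kern, down, reuse_buffers=True)
    kernel_total = n * kern
    print("allocate per tile:", per_tile_alloc, "us")
    print("reuse buffers:    ", reused, "us")
    print("kernel share with reuse:", kernel_total / reused)
```

With numbers like these, reusing the buffers removes the allocation cost, but the kernel is still only a small share of the total because the per-tile copies remain.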

**Jumbotron** · 14 January 2017, 02:28 PM

Originally posted by defaultUser

The folder gegl/perf contains some performance benchmarks. However, it is important to note that, due to the GEGL architecture, the performance of these tests is suboptimal: GEGL decomposes the image into tiles, and for every tile it is necessary to allocate a buffer on the GPU and copy the tile to it. In this case, what you are going to measure is basically the time to allocate things on the GPU and to copy to it. In real use, for instance in a GIMP session, the pre-allocated buffers are preserved. However, I believe it is still necessary to copy to and from the GPU, which again interferes with the results.

Pardon my potential ignorance, but wouldn't HSA (Heterogeneous System Architecture), as promoted by AMD and ARM, and HMM (Heterogeneous Memory Management), as promoted by Intel and Nvidia, take care of this issue? Or at least at such a time as the appropriate code optimizations appear in GIMP?

Here's a blurb from AMD itself about the benefits of HSA....

" HSA creates an improved processor design that exposes the benefits and capabilities of mainstream programmable compute elements, working together seamlessly. With HSA, applications can create data structures in a single unified address space and can initiate work items on the hardware most appropriate for a given task. Sharing data between compute elements is as simple as sending a pointer. Multiple compute tasks can work on the same coherent memory regions, utilizing barriers and atomic memory operations as needed to maintain data synchronization (just as multi-core CPUs do today).

The HSA team at AMD analyzed the performance of Haar Face Detect, a commonly used multi-stage video analysis algorithm used to identify faces in a video stream. The team compared a CPU/GPU implementation in OpenCL™ against an HSA implementation. The HSA version seamlessly shares data between CPU and GPU, without memory copies or cache flushes because it assigns each part of the workload to the most appropriate processor with minimal dispatch overhead. The net result was a 2.3x relative performance gain at a 2.4x reduced power level*. This level of performance is not possible using only multicore CPU, only GPU, or even combined CPU and GPU with today’s driver model. Just as important, it is done using simple extensions to C++, not a totally different programming model. "

**taxi_bs** · 14 January 2017, 02:57 PM

Suggestion for benchmarking: Flightgear. I am not really familiar with the cli-options, but it looks quite possible:

Command line options - FlightGear wiki

http://wiki.flightgear.org/Command_line_options

**Michael** · 14 January 2017, 03:08 PM

Originally posted by taxi_bs

Suggestion for benchmarking: Flightgear. I am not really familiar with the cli-options, but it looks quite possible:
http://wiki.flightgear.org/Command_line_options

People have suggested it before, but as far as I know, at last check it still didn't properly allow for automated benchmarking. I just searched for "benchmark" and "demo" on the page mentioned and am not seeing any hits.

**defaultUser** · 14 January 2017, 04:04 PM

Originally posted by Jumbotron

Pardon my potential ignorance, but wouldn't HSA (Heterogeneous System Architecture), as promoted by AMD and ARM, and HMM (Heterogeneous Memory Management), as promoted by Intel and Nvidia, take care of this issue? Or at least at such a time as the appropriate code optimizations appear in GIMP?

Here's a blurb from AMD itself about the benefits of HSA....

" HSA creates an improved processor design that exposes the benefits and capabilities of mainstream programmable compute elements, working together seamlessly. With HSA, applications can create data structures in a single unified address space and can initiate work items on the hardware most appropriate for a given task. Sharing data between compute elements is as simple as sending a pointer. Multiple compute tasks can work on the same coherent memory regions, utilizing barriers and atomic memory operations as needed to maintain data synchronization (just as multi-core CPUs do today).

The HSA team at AMD analyzed the performance of Haar Face Detect, a commonly used multi-stage video analysis algorithm used to identify faces in a video stream. The team compared a CPU/GPU implementation in OpenCL™ against an HSA implementation. The HSA version seamlessly shares data between CPU and GPU, without memory copies or cache flushes because it assigns each part of the workload to the most appropriate processor with minimal dispatch overhead. The net result was a 2.3x relative performance gain at a 2.4x reduced power level*. This level of performance is not possible using only multicore CPU, only GPU, or even combined CPU and GPU with today’s driver model. Just as important, it is done using simple extensions to C++, not a totally different programming model. "

There are two things at play here. First, about HMM (disclaimer: I just started reading about it). In essence it appears to be a way to move and synchronise data between system memory (RAM attached to the CPU) and other devices (think of things like high-performance network cards and GPUs connected over PCI Express). It frees the programmer from the responsibility of bookkeeping pointers (device or host, in CUDA parlance) and of synchronising and copying between the host (CPU + RAM) and the device (GPU, network cards, NVM); HMM will do these things automatically and potentially in a more efficient way. However, the issue of doing multiple copies or synchronisations remains, since it is related to the limitations of the hardware, for instance of the PCI Express bus.
As far as I understand, HSA is "just" a better way to use integrated accelerators on an SoC, for instance integrated graphics or DSP chips. Since these things are inside the package, all of them can access and share the main memory. However, this memory is very slow compared to the memory used on discrete GPUs (actually, back in the days of chipsets with a north bridge, Nvidia IGPs also had this ability). For discrete GPUs without direct access to the main memory, there is no magic: you need to copy data back and forth (because of that, Nvidia and others are investing in things like NVLink, which removes the PCI Express bottleneck). However, there are various tricks for doing that, such as streaming the data to the GPU and overlapping the communication with computation.
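The overlap trick can be illustrated with a simple pipeline model. This is only a sketch under stated assumptions, not GEGL, OpenCL, or CUDA code: the function names and timing values are hypothetical, every tile is assumed to take the same time in each stage, and the hardware is assumed able to run an upload, a kernel, and a download concurrently (for example via separate copy engines).

```python
def sequential_time_us(n_tiles, upload_us, compute_us, download_us):
    """No overlap: each tile is uploaded, computed, and downloaded in turn."""
    return n_tiles * (upload_us + compute_us + download_us)


def overlapped_time_us(n_tiles, upload_us, compute_us, download_us):
    """Three-stage pipeline with constant stage times.

    The first tile passes through all three stages; after that, one tile
    finishes every max(stage time), because the slowest stage sets the pace.
    Assumes upload, compute, and download can all run at the same time.
    """
    if n_tiles == 0:
        return 0
    stages = (upload_us, compute_us, download_us)
    return sum(stages) + (n_tiles - 1) * max(stages)


if __name__ == "__main__":
    # Illustrative numbers only (integers, in microseconds).
    n, up, comp, down = 100, 200, 100, 200
    print("sequential:", sequential_time_us(n, up, comp, down), "us")
    print("overlapped:", overlapped_time_us(n, up, comp, down), "us")
```

In this model the overlapped run is paced by the slowest stage. When that stage is a copy, as in the made-up numbers here, the bus is still the limit, which is why overlap helps but does not remove the bottleneck.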

**Marc Driftmeyer** · 14 January 2017, 08:19 PM

Wake us up when your test suite is PHP 7.1 enabled.

**Michael** · 14 January 2017, 08:21 PM

Originally posted by Marc Driftmeyer

Wake us up when your test suite is PHP 7.1 enabled.

Uhhh it has always been. What makes you think it's not?

Announcement

New Benchmark Test Profiles This Weekend: GIMP, Memcached, JPEG Turbo, More OpenCL
