Intel's Open-Source Compute Runtime Performing Increasingly Well Against NVIDIA's Proprietary Linux Driver


  • #11
    sigh...

    I'm tempted to say that the fact that this benchmark has no AMD GPUs says everything about how easy it is to use OpenCL on an AMD GPU on Linux in 2023... but I'd rather ask first...

    Michael,
    Is there a way to include them in this? Do you even have supported AMD cards?

    Would these benchmarks work? Are their perf, perf/watt, and perf/price at least presentable?

    PS: all of the above questions apply both to the official (AMDGPU module) and the community (ROCm) open-source options...



    • #12
      Michael,

      For the last two years, you've repeatedly suggested that the low performance of some Nvidia GPUs in FluidX3D benchmarks is possibly due to low OpenCL performance on Nvidia. This is incorrect. FluidX3D is a physics simulation with low arithmetic intensity; on all but the slowest GPUs, its performance is limited by memory bandwidth alone. In other words, FluidX3D is basically clpeak's Global Memory Bandwidth benchmark in disguise. The real culprit is that most of Nvidia's RTX 40-series GPUs have poor memory bandwidth. This can be easily shown via roofline analysis, familiar to all programmers working in HPC; or, more simply, by the strong correlation between FluidX3D scores and the hardware's memory bandwidth, whether taken from the datasheet spec or from memory benchmark results.
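
      To make the roofline point concrete, here's a minimal sketch of the model (my own illustration with assumed figures, not FluidX3D code): attainable throughput is the minimum of peak compute and arithmetic intensity times memory bandwidth, and an LBM-style kernel sits far below the "ridge point" where compute would start to matter.

      ```python
      # Minimal roofline sketch (illustrative only; the peak-FLOPS figure is an
      # assumption, the bandwidth is the RTX 4060 datasheet value).
      def attainable_gflops(peak_gflops: float, bw_gbs: float, ai: float) -> float:
          """Roofline model: min(peak compute, arithmetic intensity * bandwidth)."""
          return min(peak_gflops, ai * bw_gbs)

      peak = 15000.0   # GFLOP/s FP32, assumed for an RTX 4060-class GPU
      bw = 272.0       # GB/s, RTX 4060 datasheet bandwidth

      ridge = peak / bw  # FLOPs/byte needed before compute becomes the bottleneck
      print(f"ridge point: {ridge:.0f} FLOPs/byte")

      # LBM-style kernels have low arithmetic intensity (order ~1 FLOP/byte),
      # so they land on the bandwidth-limited slope of the roofline.
      for ai in (0.5, 1.0, 2.0, 60.0):
          print(f"AI = {ai:5.1f} FLOPs/byte -> {attainable_gflops(peak, bw, ai):7.0f} GFLOP/s")
      ```

      At ~1 FLOP/byte the model caps the kernel at roughly the bandwidth figure in GFLOP/s, two orders of magnitude below peak compute, which is why only memory bandwidth matters here.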

      Most low-end and mid-range Nvidia GPUs in the RTX 40-series have rather low memory bandwidth compared to their 30-series predecessors. This may not be an issue for gaming, as many rendering tasks are compute-bound, so the low bandwidth can be partially mitigated by the increased cache size. However, this is not the case for memory-bound physics simulations such as FluidX3D, which sweeps the entire simulation domain (> 1 GB) in every timestep with a ~0% cache hit rate.

      For example, the Nvidia RTX 3060 has 240 or 360 GB/s of memory bandwidth (depending on hardware revision, but from the benchmarks, you definitely have a 360 GB/s card), and the RTX 3070 has 448 GB/s. Meanwhile, the RTX 4060 only has 272 GB/s. Theoretically, 360 / 272 = 1.32, which predicts a 1.32x performance reduction on the RTX 4060 compared to the RTX 3060. And indeed, if you check their benchmark scores (in FP32/FP32 mode), 2141 / 1631 = 1.31, which is really close.
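
      The same check in a few lines, using only the numbers above (a trivial sketch, but it shows exactly the correlation the argument rests on):

      ```python
      # Bandwidth-bound prediction vs. measured FP32/FP32 scores (numbers from above).
      bw = {"RTX 3060": 360.0, "RTX 4060": 272.0}       # GB/s, datasheet
      mlups = {"RTX 3060": 2141.0, "RTX 4060": 1631.0}  # FluidX3D FP32/FP32 scores

      print(f"bandwidth ratio:   {bw['RTX 3060'] / bw['RTX 4060']:.2f}")        # 1.32
      print(f"performance ratio: {mlups['RTX 3060'] / mlups['RTX 4060']:.2f}")  # 1.31
      ```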

      This can be further illustrated by comparing clpeak's Global Memory Bandwidth ranking (page 6) with FluidX3D's FP32/FP32 simulation ranking (page 5) for Nvidia GPUs. The ranking is identical: RTX 3080 > RTX 3070 Ti > RTX 3070 > RTX 3060 Ti > RTX 3060 > RTX 4060. Furthermore, according to official FluidX3D benchmarks, most Nvidia GPUs, including the 30 and 40 series, operate at around 80% of their theoretical peak hardware performance (which is set by memory bandwidth), at least in FP32/FP32 mode.
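
      For anyone who wants to reproduce this, a rough ceiling per card can be computed from datasheet bandwidth alone. FluidX3D's README lists about 153 bytes of memory traffic per cell per timestep for D3Q19 in FP32/FP32 mode (treat that exact constant as an assumption here); dividing bandwidth by it gives the roofline ceiling in MLUPs/s, and applying the ~80% efficiency figure lands near the published scores:

      ```python
      # Roofline ceiling per card: bandwidth / bytes-per-cell-per-step.
      # 153 bytes/cell/step is the D3Q19 FP32/FP32 figure from FluidX3D's README
      # (an assumption here); bandwidths are datasheet values.
      BYTES_PER_CELL = 153
      EFFICIENCY = 0.80  # the ~80% of peak seen in official FluidX3D benchmarks

      datasheet_bw_gbs = {
          "RTX 3080": 760, "RTX 3070 Ti": 608, "RTX 3070": 448,
          "RTX 3060 Ti": 448, "RTX 3060": 360, "RTX 4060": 272,
      }

      for gpu, bw in datasheet_bw_gbs.items():
          ceiling = bw * 1e9 / BYTES_PER_CELL / 1e6  # MLUPs/s
          print(f"{gpu:11s} {bw:3d} GB/s -> ceiling ~{ceiling:4.0f}, "
                f"expected ~{EFFICIENCY * ceiling:4.0f} MLUPs/s")
      ```

      Note that the RTX 3070 and RTX 3060 Ti share the same datasheet bandwidth, so it's the measured clpeak numbers that break that tie in the ranking.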

      Thus, it would be incorrect and highly misleading to attribute this performance gap to OpenCL inefficiency. I'd like to suggest that Phoronix stop making this factually incorrect and misleading claim, and instead correctly attribute such differences to memory bandwidth (when the results correlate well with clpeak's bandwidth numbers) in future tests.
      Last edited by biergaizi; 05 November 2023, 05:26 AM.
