Intel's Open-Source Compute Runtime Performing Increasingly Well Against NVIDIA's Proprietary Linux Driver

  • biergaizi
    replied
    Michael

    For the last two years, you've repeatedly suggested that the low performance of some Nvidia GPUs in FluidX3D benchmarks is possibly due to low OpenCL performance on Nvidia. This is incorrect. FluidX3D is a physics simulation with low arithmetic intensity; on all but the slowest GPUs, its performance is limited by memory bandwidth alone. In other words, FluidX3D is basically clpeak's Global Memory Bandwidth benchmark in disguise. The real culprit is that most of Nvidia's RTX 40-series GPUs have poor memory bandwidth. This can be easily shown via roofline analysis, familiar to all programmers working in the field of HPC; or more simply, it can be shown by the strong correlation between FluidX3D performance and the hardware's memory bandwidth - either the datasheet spec or memory benchmark results.
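
    As a minimal sketch of the roofline argument above: attainable throughput is the smaller of the compute peak and bandwidth times arithmetic intensity. The two bandwidth figures are the ones quoted later in this post; `PEAK_GFLOPS` and `INTENSITY` are hypothetical placeholders chosen only to put the kernel in the memory-bound regime, not measured FluidX3D or datasheet values.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def is_memory_bound(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> bool:
    """True when the memory roof lies below the compute roof."""
    return bandwidth_gbs * intensity < peak_gflops


# Hypothetical values for illustration only (same compute peak assumed for both cards).
PEAK_GFLOPS = 10_000.0  # assumed FP32 compute peak, GFLOP/s
INTENSITY = 2.0         # assumed arithmetic intensity, FLOP per byte

# Memory bandwidths in GB/s as quoted in the post.
rtx3060 = attainable_gflops(PEAK_GFLOPS, 360.0, INTENSITY)
rtx4060 = attainable_gflops(PEAK_GFLOPS, 272.0, INTENSITY)

# While both cards are memory-bound, the performance ratio equals the
# bandwidth ratio, whatever the exact compute peak is.
ratio = rtx3060 / rtx4060
print(f"memory-bound: {is_memory_bound(PEAK_GFLOPS, 272.0, INTENSITY)}, ratio: {ratio:.2f}")
```

    Under these assumptions the compute peak drops out of the comparison entirely, which is why a bandwidth benchmark such as clpeak would predict the ranking.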

    Most low-end and mid-range Nvidia GPUs in the RTX 40-series have rather low memory bandwidth compared to their 30-series predecessors. This may not be an issue for gaming, as many rendering tasks are compute-bound, so the low bandwidth can be partially mitigated by increased cache size. However, this is not the case for memory-bound physics simulations such as FluidX3D, which sweep the entire simulation domain (> 1 GB) in every timestep with a 0% cache hit rate.

    For example, the Nvidia RTX 3060 had 240 or 360 GB/s of memory bandwidth (depending on hardware revision, but from the benchmarks, you definitely have a 360 GB/s card), and the RTX 3070 had 448 GB/s. Meanwhile, the RTX 4060 only has 272 GB/s of bandwidth. Theoretically, 360 / 272 = 1.32, which means a performance reduction of 1.32x on the RTX 4060 compared to the RTX 3060 - and indeed, if you check their benchmark scores (in FP32/FP32 mode), 2141 / 1631 = 1.31, it's really close.
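
    The arithmetic above can be checked with a short sketch. All four numbers are the ones quoted in this post; the dictionary and function names are only illustrative.

```python
# Figures quoted in the post above.
bandwidth_gbs = {"RTX 3060": 360.0, "RTX 4060": 272.0}      # memory bandwidth, GB/s
fluidx3d_score = {"RTX 3060": 2141.0, "RTX 4060": 1631.0}   # FP32/FP32 benchmark score


def ratio(table: dict, a: str, b: str) -> float:
    """How many times larger table[a] is than table[b]."""
    return table[a] / table[b]


bw_ratio = ratio(bandwidth_gbs, "RTX 3060", "RTX 4060")
score_ratio = ratio(fluidx3d_score, "RTX 3060", "RTX 4060")

# Relative gap between the bandwidth prediction and the measured result.
mismatch = abs(bw_ratio - score_ratio) / bw_ratio

print(f"bandwidth ratio: {bw_ratio:.2f}")
print(f"benchmark ratio: {score_ratio:.2f}")
print(f"relative mismatch: {mismatch:.1%}")
```

    The prediction from bandwidth alone and the measured ratio differ by less than 1%.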

    This can be further illustrated by comparing clpeak's Global Memory Bandwidth ranking (page 6) with FluidX3D's FP32/FP32 simulation ranking (page 5) for Nvidia GPUs. The ranking is identical in both: RTX 3080 > RTX 3070 Ti > RTX 3070 > RTX 3060 Ti > RTX 3060 > RTX 4060. Furthermore, according to official FluidX3D benchmarks, most Nvidia GPUs, including the 30 and 40 series, operate at around 80% of their theoretical peak hardware performance (which is set by memory bandwidth), at least in FP32/FP32 mode.

    Thus, it would be incorrect and highly misleading to attribute this performance gap to OpenCL inefficiency. I'd like to suggest that Phoronix stop making this factually incorrect and misleading claim, and instead attribute such differences to memory bandwidth (when they correlate well with clpeak's bandwidth results) in future tests.
    Last edited by biergaizi; 05 November 2023, 05:26 AM.

  • marlock
    replied
    sigh...

    I'm tempted to say that the fact that this benchmark has no AMD gpus says everything about how easy it is to use OpenCL on an AMD gpu on linux in 2023... but I'd rather ask first...

    Michael,
    is there a way to include them in this? do you even have supported amd cards?

    would these benchmarks work? is their perf, perf/watt and perf/price at least presentable?

    ps: all of the above questions both for the official (AMDGPU module) and community (ROCm) opensource options...


  • bug77
    replied
    Wth, another developer offering a compute stack nicely integrated with the gfx driver? Shouldn't it be a completely separate implementation that users have to hunt down and install it themselves? I thought that was the modern, open approach to supporting compute :P


  • stormcrow
    replied
    Intel can do the numbers as well as anyone else. Those results are probably why the A580 is priced where it is. It appears Intel is positioning its dGPUs as an inexpensive compute module alternative to Nvidia more than as competition for the already sewn-up and highly unforgiving gaming market. It's bold, and it may pay off. I think they may be overestimating the number of people who only want a low-end compute system and don't play games, or only do so casually, but perhaps not.

    Edit to add: If they can get a handle on the idle power burn with a firmware update, or in subsequent generations, they've probably got a winner for the low-end compute niche, which they'll be able to use later on to muscle in on the higher-margin server compute market. Familiarity with a product and technology has its own kind of loyalty. Similar to the way x86 eventually muscled out the more capable, but far more expensive and less available, "big iron" and proprietary Unix in the aughts.
    Last edited by stormcrow; 26 October 2023, 08:00 PM.


  • pong
    replied
    Good article, Michael Larabel, thanks for making it!

    It is interesting to see for someone who has interest in both the ARC & NVIDIA card capabilities / performance worlds!
    Using the nice PTS benchmarking test suite was part of my initial test / burn in / benchmark tests of my recent system build using ARC / RYZEN.

    Although it has been widely discussed incidentally, it might be interesting for ARC (and NVIDIA, AMD) users to add showing for future benchmarks the
    power consumption achieved by the GPU (and for that matter the whole system as well?) just sitting at the graphical desktop system idle but running,
    idle with the screen blanked but system running, single monitor attached to the GPU 1080Px60Hz, single 4Kx60 Hz, dual monitors, one/two monitors
    turned soft-off.

    Perhaps the lower power "Watts" bars on the GPU power consumption "in use tests" already to some extent for some tests measure the lowest achieved
    levels if the lowest power state was briefly achieved / measured just before or just after the high activity test parts commenced. The A770 seemed to have
    a low bar near (rough graphical estimate) 35W while the 3070 got down to ~20W in the FluidX3D 2.9 test for instance, though I'm surprised I didn't see any of the NV cards drop further down than that but the test wasn't intended to measure that aspect of course.

    ARC cards in the 7x series tend to have very bad idle power consumption absent some Intel-recommended ASPM settings, which many cannot seem to apply due to missing BIOS enable options or system-level incompatibility when trying to use them. And even if those are enabled, if one exceeds a modest resolution x VSYNC rate for
    one or more monitors, or more than 1 monitor it's likely to not drop to a low power consumption level in any case.
    e.g. I've never seen my A770-16-LE take less than 41W when I've measured it at an idle LINUX desktop with X570 / RYZEN 5900X / ubuntu / manjaro tests.

    I imagine the results will vary per. motherboard / BIOS ASPM capabilities / settings and whatever LINUX kernel settings may support / override those; at one point I even thought there was some kind of "force" ASPM enable flag that might be relevant to ARC idle but I'll have to research it; I think the implication was that maybe in modern times LINUX should do that automatically (most / initially relevant for laptop use cases) in which case maybe it's not relevant to adjust.

    I've lost track of what's included in other benchmarks but with all the new-ish additions of NV having tensor RT acceleration improvements for some of the stable diffusion UI configurations / versions / use cases, as well as Intel's corresponding OpenVino accelerations / their better DG2 support for pytorch, TF in GPU mode I know there are cases where both NV and ARC are performing a lot better for those SD UI workflows so that (if not already in the benchmark mix / reports for some classes of reports) might be interesting to follow for many.

    On a totally tangential front one thing I've not seen good data about for consumer GPUs is the extent to which their RAM has integrity and their calculations are correct; the enterprise ones can have the equivalent of ECC in some GDDR variants but I don't know that anyone's consumer GPUs actually enable it or have it and the consumer cards tend to be clocked more aggressively than the industrial ones. So for those doing GPU compute the question is relevant to know whether in computing for a hour, day, week, month if the RAM / computation is likely corrupted anyway for these consumer GPUs and similar case for consumer non-ECC motherboards / RAM of DDR4/5 UDIMM type.

    Just mentioning some things that could be of general interest as ideas if you're ever looking for directions to measure / publish stuff that might be able to be derivative from your existing setups / data etc. but present information as to more areas of use / concern.


  • brucethemoose
    replied
    Intel is performing/functioning well in the AI space too.

    If there is a 24GB+ Battlemage card, it will almost certainly be the replacement for my 3090. AMD is an unlikely choice with their current software state, and Nvidia is outrageously priced and a pain to deal with on linux.
    Last edited by brucethemoose; 26 October 2023, 05:15 PM.


  • ChirpinBird99
    replied
    Funny how 4060 is worse than 3060 in some tests


  • Danny3
    replied
    Wow, good job Intel!
    I wonder WTF is AMD doing in this area?


  • CochainComplex
    replied
    Originally posted by FireBurn View Post
    How about we compare it against an old Voodoo Banshee to make it really look good :P
    Yeah, let's compare it against gfx cards costing 5-10x as much.

    Do you have the same requirements for car reviewers?
    Oh yes, compare everything against a Lamborghini or Bentley. The new VW Golf is very nice, but 0-60 mph it's still way slower than a Lamborghini. Pro: size / Con: performance


  • cypx
    replied
    Originally posted by FireBurn View Post
    How about we compare it against an old Voodoo Banshee to make it really look good :P
    RTX 4060 is the most recent graphics card released by Nvidia so it seems fair
