**Jabberwocky** · 23 November 2022, 04:36 PM

I wonder if this could help Jeff Geerling's attempts at getting a discrete GPU working with the Raspberry Pi CM4.

I know these are mostly compilation changes and nowhere close to hardware QA on foreign hardware.

At the very least, if it gives Jeff another vendor to test with (Intel), it could help figure out what's going on.

Three more graphics cards on the Raspberry Pi CM4 | Jeff Geerling

https://www.jeffgeerling.com/blog/2021/three-more-graphics-cards-on-raspberry-pi-cm4

**Unklejoe** · 23 November 2022, 09:00 PM

What type of issues prevent this from working on arm already? Is there a lot of hand written x86 assembly in the code?

edit: never mind - it’s a CLFLUSH thing, as mentioned in the article

**coder** · 24 November 2022, 12:30 AM

the driver code explicitly requires the Intel x86/x86_64 CLFLUSH instruction for clearing cache lines.

OMG, why??? I'm 100% certain the kernel must have a portable way of flushing CPU cache. Why don't they just replace it with that, and then you don't need to clutter up the code with a bunch of conditional SUPPORT_INTEL_INTEGRATED_GPUS blocks.

**mangeek** · 24 November 2022, 12:42 AM

One of the coolest moments in my early nerd life was when I started booting my G3 Mac to Linux and at some point, tried out a 'PC' NIC card I had scavenged from the trash at work. It didn't work under Mac OS, but it worked fine under Linux. It opened up my eyes to the idea that the ecosystem I had been using was much smaller and more limited than I thought, and that alternative software options could open new doors.

**nyanmisaka** · 24 November 2022, 02:23 AM

Originally posted by Jabberwocky

I wonder if this could help Jeff Geerling's attempts at getting a discrete GPU working with the Raspberry Pi CM4.

I know these are mostly compilation changes and nowhere close to hardware QA on foreign hardware.

At the very least, if it gives Jeff another vendor to test with (Intel), it could help figure out what's going on.

https://www.jeffgeerling.com/blog/20...spberry-pi-cm4

The dGPU issue on Arm SBCs seems to be Raspberry Pi-specific. AMD Polaris works just fine on the RK3588-based Rock 5B SBC.

**archsway** · 24 November 2022, 09:12 PM

Originally posted by coder

why??? I'm 100% certain the kernel must have a portable way of flushing CPU cache. Why don't they just replace it with that, and then you don't need to clutter up the code with a bunch of conditional SUPPORT_INTEL_INTEGRATED_GPUS blocks.

Syscalls are slow. If you are doing a 300 byte buffer upload, you don't want to have to go all the way to the kernel to invalidate and clean the cache.

Perhaps there should be a vDSO function for cache clean/invalidate, but that doesn't exist so an architecture-specific compiler intrinsic must be used.
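To make the "architecture-specific compiler intrinsic" point concrete, here is a minimal sketch in C of a userspace range flush. It is not taken from Mesa: `flush_range`, `HAVE_CLFLUSH`, and `CACHELINE_SIZE` are hypothetical names, the 64-byte line size is an assumption, and on targets without SSE2 the sketch only counts lines rather than flushing anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_clflush and _mm_mfence */
#define HAVE_CLFLUSH 1
#else
#define HAVE_CLFLUSH 0 /* no flush is issued on this target in this sketch */
#endif

/* Assumption: 64-byte cache lines; real code should query the CPU. */
#define CACHELINE_SIZE ((uintptr_t)64)

/* Hypothetical helper: flush every cache line overlapping
 * [start, start + size). Returns the number of lines visited. */
static size_t flush_range(const void *start, size_t size)
{
    assert(start != NULL);
    if (size == 0)
        return 0;

    /* Round the start address down to a cache-line boundary. */
    uintptr_t line = (uintptr_t)start & ~(CACHELINE_SIZE - 1);
    uintptr_t end = (uintptr_t)start + size;
    size_t lines = 0;

#if HAVE_CLFLUSH
    _mm_mfence(); /* order earlier stores before the flushes */
#endif
    for (; line < end; line += CACHELINE_SIZE) {
#if HAVE_CLFLUSH
        _mm_clflush((const void *)line);
#endif
        lines++;
    }
#if HAVE_CLFLUSH
    _mm_mfence(); /* order the flushes before whatever comes next */
#endif
    return lines;
}
```

With a line-aligned pointer, `flush_range(buf, 300)` visits five cache lines without entering the kernel, which is the cost profile being argued for here.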

**coder** · 25 November 2022, 12:14 AM

Originally posted by archsway

Syscalls are slow. If you are doing a 300 byte buffer upload, you don't want to have to go all the way to the kernel to invalidate and clean the cache.

The text I quoted said "driver", hence no syscall because it's already in the kernel.

And, if there were a need to flush the CPU cache from userspace (which I rather doubt, because that's usually a detail handled in conjunction with other operations that need to happen at driver-level, like initiating a DMA transfer), then I would expect this to be common enough that glibc, Mesa, or some other userspace library would have a portable function for doing it.

Basically, there's no way this is a need that's unique to Intel. Not a chance. It's indefensible.

**archsway** · 25 November 2022, 12:51 AM

Originally posted by coder

The text I quoted said "driver", hence no syscall because it's already in the kernel.

And, if there were a need to flush the CPU cache from userspace (which I rather doubt, because that's usually a detail handled in conjunction with other operations that need to happen at driver-level, like initiating a DMA transfer), then I would expect this to be common enough that glibc, Mesa, or some other userspace library would have a portable function for doing it.

Basically, there's no way this is a need that's unique to Intel. Not a chance. It's indefensible.

For many pieces of hardware, such as a video decoder, it is simple for the kernel driver to flush the cache—it just needs to clean the video bitstream before submitting the command, and invalidate the image data at the end.

But for a GPU driver, there could be thousands of buffers and a total of gigabytes of memory. Doing a cache flush for all of this memory would take far too long, so the only realistic option would be to throw out the entire L3 cache, which would still massively hurt performance. Only the userspace driver knows when memory is actually updated, so only it can do fine-grained flushing of only the memory that needs to be flushed.

GPUs are an odd situation—they are "far away" enough from the CPU that having coherent caches doesn't always make sense, but there is still a lot of fine-grained memory access going on—upload a 300 byte uniform buffer here, read from a 2 KB SSBO there, update a 64x64 portion of the lightmap texture… So only GPU drivers require a fast way to do cache flushes from userspace.
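The fine-grained flushing described above can be sketched as simple dirty-range bookkeeping. This is only an illustration in C, not how any Mesa driver actually tracks writes: `struct dirty_range`, `dirty_mark`, `dirty_lines`, and `buffer_lines` are hypothetical names, and the sketch assumes 64-byte cache lines and a buffer whose base address is line-aligned.

```c
#include <stddef.h>

/* Assumption: 64-byte cache lines and a line-aligned buffer base. */
#define CACHELINE_SIZE 64u

/* Hypothetical tracker for one CPU-mapped GPU buffer. */
struct dirty_range {
    size_t begin; /* offset of the first dirty byte */
    size_t end;   /* one past the last dirty byte; begin == end means clean */
};

/* Record that [offset, offset + size) was written by the CPU. */
static void dirty_mark(struct dirty_range *d, size_t offset, size_t size)
{
    if (size == 0)
        return;
    if (d->begin == d->end) { /* first write since the last flush */
        d->begin = offset;
        d->end = offset + size;
        return;
    }
    if (offset < d->begin)
        d->begin = offset;
    if (offset + size > d->end)
        d->end = offset + size;
}

/* Number of cache lines that would need flushing for the dirty range. */
static size_t dirty_lines(const struct dirty_range *d)
{
    if (d->begin == d->end)
        return 0;
    size_t first = d->begin / CACHELINE_SIZE;
    size_t last = (d->end - 1) / CACHELINE_SIZE;
    return last - first + 1;
}

/* Number of cache lines covering the whole buffer, for comparison. */
static size_t buffer_lines(size_t buffer_size)
{
    return (buffer_size + CACHELINE_SIZE - 1) / CACHELINE_SIZE;
}
```

For a 1 MiB buffer with a single 300-byte update at offset 4096, `dirty_lines` reports 5 lines, against 16384 for the whole buffer. A single min/max range is deliberately crude: two distant writes merge into one large range, so a real driver would likely want something finer.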

While GCC has the architecture-independent __builtin___clear_cache function, it does not "reach" far enough to be of use here—I don't think it does anything at all on x86 CPUs, but for other CPUs it only acts on the L1 caches, as that is all that is required for code execution to read updated memory.
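For reference, this is roughly how `__builtin___clear_cache` is called. It takes the start and end of an address range and exists to keep the instruction cache in sync after code bytes have been written, which is a different job from making data visible to a device. The sketch below assumes GCC or Clang, and `copy_and_sync_icache` is a hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy bytes into dst, then ask the compiler to
 * synchronise the instruction cache for that range (GCC/Clang builtin).
 * This says nothing about whether a GPU can see the new data. */
static size_t copy_and_sync_icache(unsigned char *dst,
                                   const unsigned char *src,
                                   size_t size)
{
    memcpy(dst, src, size);
    __builtin___clear_cache((char *)dst, (char *)dst + size);
    return size;
}
```

A JIT compiler would call something like this after emitting machine code and before jumping to it.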

So why is the cache-flush Intel-specific?

- While Mesa shares a lot of frontend code between drivers, there is not so much sharing on the back end.
- Other vendors could have iGPUs with coherent caches.
- For external GPUs, I think PCIe guarantees coherency.
- The Arm-based drivers tend to map memory write-combine, so memory is not cached for reads and not kept in the cache for a long time for writes, so cache flushes are unnecessary.
- Even if Arm-based drivers did need to cache-flush, it is unlikely that the GPU would ever be used on an x86 system, so what is the point of a common function?

**mangeek** · 25 November 2022, 01:25 AM

For a long time, I've hoped that ARM-based SoCs with Intel graphics would happen. I know this isn't that, but the Intel graphics hardware and software seem like they'd be really good as a 'tile' strapped to some Cortex-A7x or Neoverse cores. Just having a mainstream graphics software stack that worked out of the gate instead of waiting years for things like VideoCore, Adreno, or Mali would be so good for the small board/tinkerer/embedded/low-end communities.

Intel Mesa Driver Changes Land For Building On Non-x86 CPUs
