
Intel Mesa Driver Changes Land For Building On Non-x86 CPUs


  • #11
    Originally posted by archsway View Post
    upload a 300 byte uniform buffer here, read from a 2 KB SSBO there, update a 64x64 portion of the lightmap texture… So only GPU drivers require a fast way to do cache flushes from userspace.
    All of those cases involve at least one call to the driver, do they not? In that case, the driver can do the cache invalidation on the address range you specify.

    Originally posted by archsway View Post
    While GCC has the architecture-independent __builtin___clear_cache function, it does not "reach" far enough to be of use here
    For data, flushing only from L1 would be utterly pointless. However, the docs say that builtin isn't for data but for the instruction cache. Since caches are split into instruction and data only at L1, that's probably why you read that it works at L1 only.
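
    For reference, a minimal sketch of what that builtin is actually intended for, i.e. publishing freshly generated (JIT'ed) code rather than making data visible to a device; the function name is made up for illustration:

        #include <string.h>

        /* Copy freshly generated machine code into an executable buffer and
         * make sure the CPU fetches the new instructions rather than stale
         * ones. On x86 the builtin is essentially a no-op (the I-cache is
         * coherent); on Arm it expands to a D-cache clean plus an I-cache
         * invalidate over the range. */
        void publish_jit_code(void *exec_buf, const void *code, size_t len)
        {
            memcpy(exec_buf, code, len);   /* write the generated instructions */
            __builtin___clear_cache((char *)exec_buf, (char *)exec_buf + len);
            /* exec_buf can now be cast to a function pointer and called */
        }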

    Originally posted by archsway View Post
    For external GPUs, I think PCIe guarantees coherency
    No.

    Originally posted by archsway View Post
    The Arm-based drivers tend to map memory write-combine, so memory is not cached for reads and not kept in the cache for a long time for writes, so cache flushes are unnecessary
    No, that's only referring to static memory ranges, like those used for memory-mapped I/O. It doesn't help with DMAs.

    Originally posted by archsway View Post
    Even if Arm-based drivers did need to cache-flush, it is unlikely that the GPU would ever be used on an x86 system, so what is the point of a common function?
    Like I said in my original post, you want the driver code to be readable, which is hurt both by having blocks of inline assembly and blocks of code that are conditional for iGPU-only. Maintaining that hard split between iGPU and dGPU can be a cumbersome and fragile exercise, assuming there's much code-sharing at all.
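
    To make that concrete, a rough sketch of the kind of common helper I mean; the function name and the hard-coded 64-byte line size are purely illustrative assumptions, and real code would query the line size:

        #include <stddef.h>
        #include <stdint.h>

        #define LINE 64  /* assumed cache-line size, for illustration only */

        /* One portable entry point; the per-ISA details live in one place
         * instead of being scattered as inline assembly through the driver. */
        static inline void flush_range_for_device(const void *addr, size_t len)
        {
            uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(LINE - 1);
            uintptr_t end = (uintptr_t)addr + len;
        #if defined(__x86_64__) || defined(__i386__)
            for (; p < end; p += LINE)
                __asm__ volatile("clflush (%0)" :: "r"(p) : "memory");
            __asm__ volatile("mfence" ::: "memory");   /* fence so the flushes complete first */
        #elif defined(__aarch64__)
            for (; p < end; p += LINE)
                __asm__ volatile("dc civac, %0" :: "r"(p) : "memory");  /* clean+invalidate to PoC */
            __asm__ volatile("dsb sy" ::: "memory");
        #else
            (void)p; (void)end;  /* other ISAs: fall back to a kernel/driver call */
        #endif
        }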

    Finally, Intel's foundry business is trying to market their various IP blocks, in a mix-n-match fashion with ARM and RISC-V CPU cores. So, it's quite plausible there could be a SoC with ARM cores and an Intel iGPU.
    Last edited by coder; 25 November 2022, 06:42 AM.



    • #12
      Originally posted by coder View Post
      All of those cases involve at least one call to the driver, do they not? In that case, the driver can do the cache invalidation on the address range you specify.
      I guess the answer is actually almost "yes". It turns out that, unlike the drivers for Arm SoCs, the drivers for the "big three" GPU manufacturers mostly use DMA for uploads and downloads, so the DMA engine is designed to be coherent with the CPU and does whatever cache invalidation is necessary. (The kernel does not read the command buffer, so it has no idea what is going on.) There are actually only a few cases (not the ones I listed) where DMA cannot be done.

      (Perhaps related, ARM is a lot better at using memory bandwidth than x86 is… maybe DMA isn't needed as much.)

      Originally posted by coder View Post
      For data, flushing only from L1 would be utterly pointless. However, the docs say that builtin isn't for data but for the instruction cache. Since caches are split into instruction and data only at L1, that's probably why you read that it works at L1 only.
      I mean, you have to both flush the data cache and then invalidate the instruction cache.

      So for AArch64 that is DC CVAU followed by IC IVAU.

      (To actually flush the cache so that the data is coherent with the GPU, you use DC CVAC.)
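
      Roughly, as a purely illustrative sketch (the function names and the hard-coded 64-byte line size are assumptions; real code reads the line sizes from CTR_EL0):

          #include <stddef.h>
          #include <stdint.h>

          #define LINE 64  /* assumed; read CTR_EL0 for the real line sizes */

          /* For newly written *code*: clean the D-cache to the point of
           * unification, then invalidate the I-cache, so the CPU fetches the
           * new instructions (this is what DC CVAU + IC IVAU are for). */
          static void sync_icache(const void *addr, size_t len)
          {
              uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(LINE - 1);
              uintptr_t end   = (uintptr_t)addr + len;

              for (uintptr_t p = start; p < end; p += LINE)
                  __asm__ volatile("dc cvau, %0" :: "r"(p) : "memory");
              __asm__ volatile("dsb ish" ::: "memory");
              for (uintptr_t p = start; p < end; p += LINE)
                  __asm__ volatile("ic ivau, %0" :: "r"(p) : "memory");
              __asm__ volatile("dsb ish\n\tisb" ::: "memory");
          }

          /* For *data* the GPU has to see: clean the D-cache to the point of
           * coherency (DC CVAC), then barrier before kicking off the GPU. */
          static void clean_range_for_gpu(const void *addr, size_t len)
          {
              uintptr_t end = (uintptr_t)addr + len;

              for (uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(LINE - 1); p < end; p += LINE)
                  __asm__ volatile("dc cvac, %0" :: "r"(p) : "memory");
              __asm__ volatile("dsb sy" ::: "memory");
          }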

      Originally posted by coder View Post
      No, that's only referring to static memory ranges, like those used for memory-mapped I/O. It doesn't help with DMAs.
      What DMAs?…

      The only drivers that actually use DMA are the ones for AMD, Intel and nVidia GPUs. (And SVGA, but that doesn't count.)

      Originally posted by coder View Post
      Like I said in my original post, you want the driver code to be readable, which is hurt both by having blocks of inline assembly and blocks of code that are conditional for iGPU-only. Maintaining that hard split between iGPU and dGPU can be a cumbersome and fragile exercise, assuming there's much code-sharing at all.
      Which is why, apart from one instance, the only clflush calls are made from a single file, intel_clflush.h.

      And the flush is actually Atom-only; most iGPUs don't need it.



      • #13
        Originally posted by archsway View Post
        I mean, you have to both flush the data cache, then invalidate the instruction cache.

        So for AArch64 that is DC CVAU followed by IC IVAU.
        This is irrelevant for data transfers to/from the GPU. The only point of clearing the CPU's instruction cache is for runtime code generation.

        Originally posted by archsway View Post
        What DMAs?…

        The only drivers that actually use DMA are the ones for AMD, Intel and nVidia GPUs.
        I'm sure the Imagination-based dGPUs that are emerging from China do it, also. It's a dGPU thing, and you listed the only current dGPU makers.



        • #14
          Originally posted by coder View Post
          This is irrelevant for data transfers to/from the GPU. The only point of clearing the CPU's instruction cache is for runtime code generation.
          I know, that is why I also mentioned DC CVAC. The related DC CIVAC seems to be the closest instruction on AArch64 to CLFLUSH.

          Originally posted by coder View Post
          I'm sure the Imagination-based dGPUs that are emerging from China do it, also. It's a dGPU thing, and you listed the only current dGPU makers.
          Perhaps…

