Intel Mesa Driver Changes Land For Building On Non-x86 CPUs

  • archsway
    replied
    Originally posted by coder View Post
    This is irrelevant for data transfers to/from the GPU. The only point of clearing the CPU's instruction cache is for runtime code generation.
    I know, that is why I also mentioned DC CVAC. The related DC CIVAC seems to be the closest instruction on AArch64 to CLFLUSH.

    I'm sure the Imagination-based dGPUs that are emerging from China do it, also. It's a dGPU thing, and you listed the only current dGPU makers.
    Perhaps…



  • coder
    replied
    Originally posted by archsway View Post
    I mean, you have to both flush the data cache, then invalidate the instruction cache.

    So for AArch64 that is DC CVAU followed by IC IVAU.
    This is irrelevant for data transfers to/from the GPU. The only point of clearing the CPU's instruction cache is for runtime code generation.

    Originally posted by archsway View Post
    What DMAs?…

    The only drivers that actually use DMA are the ones for AMD, Intel and nVidia GPUs.
    I'm sure the Imagination-based dGPUs that are emerging from China do it, also. It's a dGPU thing, and you listed the only current dGPU makers.



  • archsway
    replied
    Originally posted by coder View Post
    All of those cases involve at least one call to the driver, do they not? In that case, the driver can do the cache invalidation on the address range you specify.
    I guess the answer is actually almost "yes". Unlike the drivers for Arm SoCs, it turns out that the "big three" GPU manufacturers mostly use DMA for uploads and downloads, and so the DMA engine will be designed in a way that makes it coherent with the CPU, doing whatever cache invalidation is necessary. (The kernel does not read the command buffer, so it has no idea what is going on.) There are actually only a few cases (not the ones I listed) where DMA cannot be done.

    (Perhaps related, ARM is a lot better at using memory bandwidth than x86 is… maybe DMA isn't needed as much.)

For data, flushing only from L1 would be utterly pointless. However, the docs say that built-in isn't for data at all, but for the instruction cache. Since caches are only split into instruction & data at L1, that's probably why you read that it works on L1 only.
    I mean, you have to both flush the data cache, then invalidate the instruction cache.

    So for AArch64 that is DC CVAU followed by IC IVAU.

(To actually flush the cache in a way that is coherent with the GPU, you need DC CVAC.)

    No, that's only referring to static memory ranges, like those used for memory-mapped I/O. It doesn't help with DMAs.
    What DMAs?…

    The only drivers that actually use DMA are the ones for AMD, Intel and nVidia GPUs. (And SVGA, but that doesn't count.)

    Like I said in my original post, you want the driver code to be readable, which is hurt both by having blocks of inline assembly and blocks of code that are conditional for iGPU-only. Maintaining that hard split between iGPU and dGPU can be a cumbersome and fragile exercise, assuming there's much code-sharing at all.
Which is why, apart from one instance, the only clflush call is made from a single file, intel_clflush.h.

    And the flush is actually Atom-only; most iGPUs don't need it.



  • coder
    replied
    Originally posted by archsway View Post
    upload a 300 byte uniform buffer here, read from a 2 KB SSBO there, update a 64x64 portion of the lightmap texture… So only GPU drivers require a fast way to do cache flushes from userspace.
    All of those cases involve at least one call to the driver, do they not? In that case, the driver can do the cache invalidation on the address range you specify.

    Originally posted by archsway View Post
    While GCC has the architecture-independent __builtin___clear_cache function, it does not "reach" far enough to be of use here
For data, flushing only from L1 would be utterly pointless. However, the docs say that built-in isn't for data at all, but for the instruction cache. Since caches are only split into instruction & data at L1, that's probably why you read that it works on L1 only.

    Originally posted by archsway View Post
    For external GPUs, I think PCIe guarantees coherency
    No.

    Originally posted by archsway View Post
    The Arm-based drivers tend to map memory write-combine, so memory is not cached for reads and not kept in the cache for a long time for writes, so cache flushes are unnecessary
    No, that's only referring to static memory ranges, like those used for memory-mapped I/O. It doesn't help with DMAs.

    Originally posted by archsway View Post
    Even if Arm-based drivers did need to cache-flush, it is unlikely that the GPU would ever be used on an x86 system, so what is the point of a common function?
    Like I said in my original post, you want the driver code to be readable, which is hurt both by having blocks of inline assembly and blocks of code that are conditional for iGPU-only. Maintaining that hard split between iGPU and dGPU can be a cumbersome and fragile exercise, assuming there's much code-sharing at all.

    Finally, Intel's foundry business is trying to market their various IP blocks, in a mix-n-match fashion with ARM and RISC-V CPU cores. So, it's quite plausible there could be a SoC with ARM cores and an Intel iGPU.
    Last edited by coder; 25 November 2022, 06:42 AM.



  • mangeek
    replied
For a long time, I've hoped that ARM-based SoCs with Intel graphics would happen. I know this isn't that, but the Intel graphics hardware and software seem like they'd be really good as a 'tile' strapped to some Cortex-A7x or Neoverse cores. Just having a mainstream graphics software stack that worked out of the gate instead of waiting years for things like VideoCore, Adreno, or Mali would be so good for the small board/tinkerer/embedded/low-end communities.



  • archsway
    replied
    Originally posted by coder View Post
    The text I quoted said "driver", hence no syscall because it's already in the kernel.

And, if there were a need to flush the CPU cache from userspace (which I rather doubt, because that's usually a detail handled in conjunction with other operations that need to happen at driver-level, like initiating a DMA transfer), then I would expect this to be common enough that glibc, Mesa, or some other userspace library would have a portable function for doing it.

    Basically, there's no way this is a need that's unique to Intel. Not a chance. It's indefensible.
    For many pieces of hardware, such as a video decoder, it is simple for the kernel driver to flush the cache—it just needs to clean the video bitstream before submitting the command, and invalidate the image data at the end.

    But for a GPU driver, there could be thousands of buffers and a total of gigabytes of memory. Doing a cache flush for all of this memory would take far too long, so the only realistic option would be to throw out the entire L3 cache, which would still massively hurt performance. Only the userspace driver knows when memory is actually updated, so only it can do fine-grained flushing of only the memory that needs to be flushed.

    GPUs are an odd situation—they are "far away" enough from the CPU that having coherent caches doesn't always make sense, but there is still a lot of fine-grained memory access going on—upload a 300 byte uniform buffer here, read from a 2 KB SSBO there, update a 64x64 portion of the lightmap texture… So only GPU drivers require a fast way to do cache flushes from userspace.

    While GCC has the architecture-independent __builtin___clear_cache function, it does not "reach" far enough to be of use here—I don't think it does anything at all on x86 CPUs, but for other CPUs it only acts on the L1 caches, as that is all that is required for code execution to read updated memory.

    So why is the cache-flush Intel-specific?
    • While Mesa shares a lot between drivers for the frontend code, there is not so much sharing on the back end
    • Other vendors could have IGPUs with coherent caches
    • For external GPUs, I think PCIe guarantees coherency
    • The Arm-based drivers tend to map memory write-combine, so memory is not cached for reads and not kept in the cache for a long time for writes, so cache flushes are unnecessary
    • Even if Arm-based drivers did need to cache-flush, it is unlikely that the GPU would ever be used on an x86 system, so what is the point of a common function?



  • coder
    replied
    Originally posted by archsway View Post
    Syscalls are slow. If you are doing a 300 byte buffer upload, you don't want to have to go all the way to the kernel to invalidate and clean the cache.
    The text I quoted said "driver", hence no syscall because it's already in the kernel.

And, if there were a need to flush the CPU cache from userspace (which I rather doubt, because that's usually a detail handled in conjunction with other operations that need to happen at driver-level, like initiating a DMA transfer), then I would expect this to be common enough that glibc, Mesa, or some other userspace library would have a portable function for doing it.

    Basically, there's no way this is a need that's unique to Intel. Not a chance. It's indefensible.
    Last edited by coder; 25 November 2022, 12:17 AM.



  • archsway
    replied
    Originally posted by coder View Post
    why??? I'm 100% certain the kernel must have a portable way of flushing CPU cache. Why don't they just replace it with that, and then you don't need to clutter up the code with a bunch of conditional SUPPORT_INTEL_INTEGRATED_GPUS blocks.
    Syscalls are slow. If you are doing a 300 byte buffer upload, you don't want to have to go all the way to the kernel to invalidate and clean the cache.

    Perhaps there should be a vDSO function for cache clean/invalidate, but that doesn't exist so an architecture-specific compiler intrinsic must be used.



  • nyanmisaka
    replied
    Originally posted by Jabberwocky View Post
    I wonder if this could help Jeff Geerling's attempts at getting a discrete GPU working with the raspberry pi cm4.

    I know these are mostly compilation changes and nowhere close to hardware QA on foreign hardware.

    At the very least if it at least gives Jeff another vendor to test with (Intel) it could help figuring out what's going on.

    https://www.jeffgeerling.com/blog/20...spberry-pi-cm4
The dGPU issue on Arm SBCs seems to be Raspberry Pi-specific. An AMD Polaris card works just fine on the RK3588-based Rock 5B SBC.



  • mangeek
    replied
    One of the coolest moments in my early nerd life was when I started booting my G3 Mac to Linux and at some point, tried out a 'PC' NIC card I had scavenged from the trash at work. It didn't work under Mac OS, but it worked fine under Linux. It opened up my eyes to the idea that the ecosystem I had been using was much smaller and more limited than I thought, and that alternative software options could open new doors.

