I just read through this thread again and I'm still not seeing what the benefit of eliminating the CPU would be. You still need "persistent code" and "per-data-element code"... the only difference would be that the persistent code would move to the GPU. It is certainly possible to run persistent threads on the GPU but it tends to be either not particularly efficient or not particularly simple (you get to choose).
GPUs have a few major differences from CPUs... the list below is simplified a bit but enough for discussion (there's a rough code sketch right after it):
#1 - SIMD architecture with a large number of ALUs but a very simple implementation, either in-order or very close to it
#2 - inconsistent (but faster) memory model where writes are not immediately visible to reads unless you explicitly flush or bypass the cache
#3 - dedicated hardware to spawn a large number of threads, one per data element, driven by structures such as "vertices in a triangle strip", "pixels in a triangle", "elements in a matrix", etc.
#4 - dedicated hardware to efficiently coordinate large numbers of threads (barriers, atomic operations)
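To make #3 and #4 concrete, here's a minimal CUDA-flavoured sketch (the kernel name, array names and the 256-thread block size are made up for illustration, not taken from any real driver or spec). The launch in the comment at the bottom is what the dedicated spawn hardware turns into one thread per data element; the atomic and the barrier are the coordination primitives from #4.

// One thread per data element (#3); atomics and barriers from #4.
__global__ void scale_and_count(const float *in, float *out, int n,
                                float k, int *nonzero_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element
    if (i < n) {
        out[i] = in[i] * k;
        if (in[i] != 0.0f)
            atomicAdd(nonzero_count, 1);             // #4: hardware atomic
    }
    __syncthreads();   // #4: hardware barrier (block-wide; shown only to
                       // illustrate the primitive, it isn't needed here)
}

// Host side: this single launch is what the scheduling hardware fans out
// into n threads across the SIMDs, no software loop required:
//   scale_and_count<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f, d_count);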
GPUs tend to be much less efficient than CPUs if your code cannot take advantage of SIMD or MIMD processing. When doing "GPU things" this is usually not an issue, because the dedicated hardware from #3 above automatically loads up the SIMDs with a set of threads, each working on a different data element from the driving structure (e.g. a triangle strip), but it's very hard to make efficient use of GPU hardware when running the typical persistent threads that make up an application.
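For contrast, this is roughly what a persistent thread ends up looking like when you force it onto that hardware - a hypothetical sketch, not code from any real driver or OS. Launched as a single thread, it occupies a whole 32-wide warp while the other 31 lanes sit masked off doing nothing, which is exactly the efficiency problem above.

__global__ void persistent_service_loop(volatile int *work_flag,
                                        volatile int *work_data,
                                        volatile int *result,
                                        volatile int *shutdown)
{
    // Launched as <<<1, 1>>>: one long-lived thread polling for work,
    // the way CPU-style "persistent code" would. The other 31 lanes of
    // its warp are idle for the lifetime of the kernel.
    while (*shutdown == 0) {
        if (*work_flag != 0) {
            *result = *work_data * 2;   // stand-in for the actual work
            __threadfence_system();     // make the result visible off-chip
            *work_flag = 0;             // acknowledge completion
        }
    }
}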
Anyway, it's certainly possible, but from a performance and efficiency perspective it seems like a step backwards. You would basically have to put at least a simplified GPU driver in the SBIOS/UEFI code to get things running, and a lot of the OS and application code would end up running either on a single lane of a 32-wide SIMD (aka a vector processor) or on the scalar processor if the GPU cores have one (that might be AMD-only).
Quite a bit of the hardware in a CPU goes to providing a simple programming model for the developer on top of very complex hardware, while GPUs expose the complexity but semi-hide it behind high-level driver APIs. I don't see any obvious way around exposing that complexity to OS developers, since running an OS on top of a driver abstraction model would be very inefficient.
That said, perhaps 70% of the problem boils down to figuring out how OS and application code can make good use of SIMD hardware, and most of the rest to managing the trade-offs associated with a "dirty write" memory consistency model. Both of those appear to be big gnarly problems, however.
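As a toy illustration of the second problem, assuming a CUDA-like memory model (the variable and kernel names here are invented): a producer has to explicitly fence before publishing, or a consumer in another block can see the "ready" flag before the data it guards.

#include <cstdio>

__device__ volatile int payload;   // data being handed off
__device__ volatile int ready;     // publish flag

__global__ void handoff_demo()
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Producer: without the fence, 'ready' can become visible to
        // other blocks before 'payload' does - the "dirty write" hazard
        // OS code would have to reason about everywhere.
        payload = 42;
        __threadfence();   // flush/order the write before publishing
        ready = 1;
    } else if (blockIdx.x == 1 && threadIdx.x == 0) {
        // Consumer in another block (assumes both blocks are resident):
        while (ready == 0) { }
        printf("payload = %d\n", payload);
    }
}
// launch: handoff_demo<<<2, 1>>>();

A CPU gives you that ordering mostly for free; here every producer/consumer pair in the OS would have to get the fences right by hand.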
There were a few examples of hardware that split the difference between CPU and GPU, combining a large number of simple cores with a CPU-like programming model, and so were able to boot a standard OS. Those would be a better candidate for what I believe you have in mind; however, I don't believe any of them survived in the marketplace. Thinking of products like Knights Landing... I thought they showed promise, but they required a massive rework of existing GPU drivers to make good use of the hardware, and AFAIK that never came together in time.