Is Xeon Phi every OSS enthusiast's wet dream


  • #11
    Originally posted by mateli View Post
    Actually they would need to write a Mesa driver, as Mesa already has an OpenGL server.

    What do you mean? The Mesa drivers for NV/AMD are at 3.1 too. The Xeon Phi with a good Mesa driver has a fair chance of giving us performance that neither NV nor AMD can currently match.

    Yes, I know that the proprietary drivers have more features and performance, but that's totally irrelevant. For a bunch of reasons I need FOSS drivers and have to judge a device based on how it performs with FOSS drivers. And I know I'm not alone with such use cases.
    I said we need a "server" because Mesa needs to be split into two processes for Xeon Phi.
    One is the library for traditional OpenGL applications that have no idea about Xeon Phi.
    The other is a Xeon Phi application running in a separate process on a different host, because the Xeon Phi is effectively a standalone machine connected over the PCIe bus.

    A brief TODO list for Xeon Phi rendering:
    1. Write a Mesa driver on the host (your Core i?, Athlon, PPC or any CPU you like)
    The driver needs to translate OpenGL commands into messages in some intermediate form and pass them to the server running on Xeon Phi (a rough sketch of such a message follows this list).
    In other words, the state tracker (Gallium) stays on the host.
    (Sending OpenGL commands directly is possible, but I'd rather run the state tracker on a superscalar processor.)

    2. Write an OpenGL server on Xeon Phi
    The server needs to parse the messages and complete the rendering work.
    This server can use llvmpipe, but rewriting it from scratch is also possible, especially for some commercial OpenGL vendors.
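
    To make the split concrete, here is a minimal sketch of the kind of message passing meant above. The wire format, the opcode names and the use of a plain TCP socket to reach the card are all assumptions for illustration, not an existing Mesa or MPSS API:

    Code:
    /* Hypothetical wire format shared by the host-side Mesa driver and the
     * OpenGL server on the Xeon Phi -- illustration only, not real Mesa code. */
    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>

    enum phi_opcode { PHI_CLEAR = 1, PHI_DRAW_ARRAYS = 2, PHI_SWAP_BUFFERS = 3 };

    struct phi_msg {
        uint32_t opcode;   /* one of phi_opcode */
        uint32_t mode;     /* e.g. GL_TRIANGLES for PHI_DRAW_ARRAYS */
        uint32_t first;    /* first vertex */
        uint32_t count;    /* vertex count */
    };

    /* Host side: connect to the server running on the card. */
    static int phi_connect(const char *phi_ip, uint16_t port)
    {
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { 0 };

        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        inet_pton(AF_INET, phi_ip, &addr.sin_addr);
        if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            close(sock);
            return -1;
        }
        return sock;
    }

    /* Host side: the driver packs a draw call and ships it to the card. */
    static int phi_send_draw(int sock, uint32_t mode, uint32_t first, uint32_t count)
    {
        struct phi_msg msg = { PHI_DRAW_ARRAYS, mode, first, count };
        return send(sock, &msg, sizeof(msg), 0) == (ssize_t)sizeof(msg) ? 0 : -1;
    }

    The server on the Phi side would just recv() these structs in a loop and hand them to llvmpipe (or whatever rasterizer ends up there); bulk data such as vertex buffers would need a separate transfer path rather than going through small messages like this.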


    It's another story to optimize the OpenGL server, but I believe Xeon Phi is the future for OSS high-performance 3D.
    Last edited by zxy_thf; 21 March 2013, 04:56 AM.



    • #12
      Originally posted by Figueiredo View Post
      bridgman,

      Due to my ignorance of the subject I couldn't grasp from AMD's roadmap whether such "programmability" is also expected in the AMD camp. Obviously you can only share what has been made public already, but if you can be so kind as to briefly clarify how the HSA improvements differ from the Xeon + Xeon Phi chip, I'm sure us layman users would greatly appreciate it.
      Sorry, just noticed this now. Even without HSA, most of the programmability is already included in currently shipping GPUs. The main differences are:

      1. GPUs keep texture filtering in fixed-function hardware rather than moving it to general purpose processors. Texture processing is generally required for small rectangular areas of texture rather than individual pixels, and there are some significant performance & power efficiency benefits to be had from using fixed-function hardware because of the ability to share results from intermediate calculations more efficiently.

      2. GPUs handle the task of spreading work across parallel threads and cores using fixed function hardware rather than software, which helps a lot with scaling issues.

      Pretty much everything else has already moved from fixed function hardware into the general purpose processors. The ISA on the general purpose processors is a bit more focused on graphics and HPC tasks -- that's what the ISA guide for each new HW generation covers.

      In KC-speak, the HD 79xx has 32 independent cores, each with a scalar ALU and a 2048-bit SIMD floating point ALU (organized as 4 x 512-bit, ie 4 x 16-way SIMD), running up to 40 threads on each core. The fixed-function hardware that spreads work across threads and cores allows each of the cores to have relatively more floating point power (which is what graphics and HPC both require) and relatively less scalar power.
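
      For a sense of scale (my arithmetic, using the stock HD 7970 clock of 925 MHz, which isn't stated in the post): 32 cores x 4 SIMDs x 16 lanes = 2048 FP lanes, and with one FMA (2 flops) per lane per clock that works out to roughly 3.79 TFLOPS single precision.

      Code:
      /* Back-of-the-envelope peak throughput for the HD 79xx layout above.
       * The 925 MHz clock is the stock HD 7970 clock -- an assumption, not from the post. */
      #include <stdio.h>

      int main(void)
      {
          const double cores      = 32;     /* compute units */
          const double simds      = 4;      /* 16-wide SIMDs per core */
          const double lanes      = 16;     /* FP lanes per SIMD */
          const double flops_lane = 2;      /* one FMA = 2 flops per clock */
          const double clock_hz   = 925e6;  /* HD 7970 reference clock */

          printf("peak SP throughput: %.2f TFLOPS\n",
                 cores * simds * lanes * flops_lane * clock_hz / 1e12);  /* ~3.79 */
          return 0;
      }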

      What HSA brings is tighter integration between the main (superscalar) CPU cores and the GPU cores to reduce the overhead and programming complexity of offloading work to a separate device -- shared pageable virtual memory, cache coherency between CPU and GPU cores, simpler/faster dispatch of work between GPU and CPU etc...
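
      As a rough illustration of what that integration buys at the programming level (the gpu_* names below are invented stand-ins, not a real HSA or OpenCL API), the difference between today's explicit offload and HSA-style shared virtual memory looks something like this:

      Code:
      /* Illustration only -- gpu_* names are invented stand-ins, not a real runtime API. */
      #include <stddef.h>
      #include <stdlib.h>
      #include <string.h>

      typedef void (*kernel_fn)(float *buf, size_t n);

      /* Stand-ins so the sketch compiles; a real runtime talks to the device. */
      static void *gpu_alloc(size_t bytes)                            { return malloc(bytes); }
      static void  gpu_free(void *p)                                  { free(p); }
      static void  gpu_copy(void *dst, const void *src, size_t bytes) { memcpy(dst, src, bytes); }
      static void  gpu_launch(kernel_fn k, float *buf, size_t n)      { k(buf, n); }

      /* Discrete-style offload: explicit staging copies to and from device memory. */
      static void offload_discrete(kernel_fn k, float *data, size_t n)
      {
          float *dev = gpu_alloc(n * sizeof(float));
          gpu_copy(dev, data, n * sizeof(float));   /* host -> device */
          gpu_launch(k, dev, n);
          gpu_copy(data, dev, n * sizeof(float));   /* device -> host */
          gpu_free(dev);
      }

      /* HSA-style offload: shared pageable virtual memory and cache coherency
       * let the GPU work directly on the application's pointer -- no copies. */
      static void offload_hsa(kernel_fn k, float *data, size_t n)
      {
          gpu_launch(k, data, n);
      }

      static void double_all(float *buf, size_t n)
      {
          for (size_t i = 0; i < n; ++i)
              buf[i] *= 2.0f;
      }

      int main(void)
      {
          float data[4] = { 1, 2, 3, 4 };
          offload_discrete(double_all, data, 4);
          offload_hsa(double_all, data, 4);
          return 0;
      }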
      Last edited by bridgman; 21 March 2013, 09:45 AM.



      • #13
        re: #2, couple more comments for completeness...

        For compute, the fixed-function hardware takes N-dimensional array-level compute commands and spreads the work across cores & threads (a rough software analogue of that spreading is sketched after the list below).

        For graphics, the fixed-function hardware takes "draw using these lists of triangles" commands and implements the non-programmable parts of the GL/DX graphics pipelines:

        - pick out individual vertices and spread the vertex shader processing across cores and threads
        - reassemble processed vertices into triangles, scan convert each triangle to identify pixels
        - spread the pixel/fragment shader work across cores & threads

        (a modern graphics pipeline has a lot more stages than just vertex & fragment processing but you get the idea, same applies to the other stages as well)
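
        For the compute case, what the fixed-function dispatcher does is essentially the following, except in hardware and spread across many cores and wavefronts at once (a purely illustrative 2-D sketch, names made up):

        Code:
        /* What the dispatcher does for an N-dimensional compute command,
         * written out as plain loops over the global ID space (2-D case). */
        #include <stdio.h>

        typedef void (*compute_kernel)(int gx, int gy);   /* one work-item */

        static void dispatch_2d(compute_kernel k, int width, int height)
        {
            /* In hardware the spreading across wavefronts/cores is automatic;
             * here it is just a serial walk over all global IDs. */
            for (int gy = 0; gy < height; ++gy)
                for (int gx = 0; gx < width; ++gx)
                    k(gx, gy);
        }

        static void my_kernel(int gx, int gy)
        {
            /* The per-work-item payload that would run on a SIMD lane. */
            if (gx == 0 && gy == 0)
                printf("work-item (%d,%d) ran\n", gx, gy);
        }

        int main(void)
        {
            dispatch_2d(my_kernel, 64, 64);   /* "run my_kernel over a 64x64 grid" */
            return 0;
        }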



        • #14
          Originally posted by bridgman View Post
          In KC-speak, the HD 79xx has 32 independent cores, each with a scalar ALU and a 2048-bit SIMD floating point ALU (organized as 4 x 512-bit, ie 4 x 16-way SIMD), running up to 40 threads on each core. The fixed-function hardware that spreads work across threads and cores allows each of the cores to have relatively more floating point power (which is what graphics and HPC both require) and relatively less scalar power.
          Maybe I've learned something wrong, but I can't get the idea of "40 threads".
          I think the cores in SI run wavefronts, not independent threads. Isn't SMT an unnecessary complexity for GPUs?



          • #15
            Originally posted by zxy_thf View Post
            Maybe I've learned something wrong, but I can't get the idea of "40 threads".
            I think the cores in SI run wavefronts, not independent threads. Isn't SMT an unnecessary complexity for GPUs?
            What I'm calling a thread in KC-speak is a wavefront in GPU-speak. Basically the same thing these days... a single thread using the SIMD hardware to process 64 elements in parallel (each 16-way SIMD actually performs a vector operation on 64 elements in 4 clocks).
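
            In other words (my sketch of the execution model just described, not vendor documentation): a wavefront is 64 work-items, the SIMD is 16 lanes wide, so one vector instruction completes in 64 / 16 = 4 clocks.

            Code:
            /* One wavefront (64 work-items) issued on a 16-lane SIMD: the same
             * instruction is applied to 16 elements per clock, 4 clocks in total. */
            #include <stdio.h>

            #define WAVEFRONT_SIZE 64
            #define SIMD_LANES     16

            int main(void)
            {
                float v[WAVEFRONT_SIZE];
                int clocks = 0;

                for (int i = 0; i < WAVEFRONT_SIZE; ++i)
                    v[i] = (float)i;

                for (int base = 0; base < WAVEFRONT_SIZE; base += SIMD_LANES) {
                    for (int lane = 0; lane < SIMD_LANES; ++lane)
                        v[base + lane] *= 2.0f;   /* one 16-wide chunk of the vector op */
                    ++clocks;
                }
                printf("wavefront of %d elements on %d lanes: %d clocks\n",
                       WAVEFRONT_SIZE, SIMD_LANES, clocks);   /* prints 4 */
                return 0;
            }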

            SMT these days usually refers to dynamically sharing execution units in a superscalar processor, which is complex as you say. GPUs generally rely on thread-level parallelism rather than instruction-level parallelism (although VLIW shader cores use both, with the compiler implementing ILP), so running multiple threads is a lot less complex.

            Think about the old "barrel processor" model from the 60s and 70s, where the processor has multiple register sets and switches between threads on a per-clock basis rather than using the parallel execution units required for superscalar operation to run instructions from more than one thread in a single clock.
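
            A toy version of that barrel scheme (purely illustrative) is just round-robin issue from per-thread register sets:

            Code:
            /* Toy barrel processor: N register sets ("hardware threads"),
             * switch to the next thread every clock. */
            #include <stdio.h>

            #define NUM_THREADS 4

            struct hw_thread {
                int pc;    /* per-thread program counter (its own register set) */
                int acc;   /* a stand-in architectural register */
            };

            int main(void)
            {
                struct hw_thread t[NUM_THREADS] = { { 0, 0 } };

                /* Each clock, issue exactly one instruction from the next thread
                 * in turn; a thread stalled on memory would simply be skipped. */
                for (int clock = 0; clock < 12; ++clock) {
                    struct hw_thread *cur = &t[clock % NUM_THREADS];
                    cur->acc += cur->pc;   /* "execute" one instruction */
                    cur->pc  += 1;
                }

                for (int i = 0; i < NUM_THREADS; ++i)
                    printf("thread %d: pc=%d acc=%d\n", i, t[i].pc, t[i].acc);
                return 0;
            }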

            IIRC Larrabee uses the same approach -- multiple threads per core but only one thread at a time. I'm not sure which model KC uses but I suspect it also runs one thread at a time per core.
            Last edited by bridgman; 21 March 2013, 01:02 PM.



            • #16
              Looks like "the old barrel processor model" is now called "fine-grained temporal multithreading" and is trendy again.



              • #17
                Originally posted by bridgman View Post
                Looks like "the old barrel processor model" is now called "fine-grained temporal multithreading" and is trendy again.
                Thanks to memory's high latency

