A New Radeon Shader Compiler For Mesa


  • phoronix
    started a topic A New Radeon Shader Compiler For Mesa


    Phoronix: A New Radeon Shader Compiler For Mesa

    While Gallium3D is gaining a lot of momentum and has picked up a number of new state trackers (OpenVG and OpenGL ES, with OpenCL and OpenGL 3.1 coming soon) and features (e.g. network debugging support) in recent months, there is still a lot of work left before this architecture will enter the limelight...

    http://www.phoronix.com/vr.php?view=NzQxMA

  • elanthis
    replied
    Originally posted by rohcQaH View Post
    After reading your explanations, it seems reasonable to assume that a shader compiler is *very* hardware-specific, thus probably developed in-house,
    A large part of a shader compiler is still the generic optimization passes. GCC has different backends for different CPUs, and different optimization passes enabled for different targets, but the meat of the optimization engine is still generic across all architectures. Same with shaders. Giving up those optimization passes could give competitors a huge advantage (though, honestly, the general opinion is that ATI's drivers really suck compared to NVIDIA's, so I'm not sure there's anything ATI has that NVIDIA or even Intel is likely to be really interested in).
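The target-independence elanthis describes can be sketched with a toy constant-folding pass: it rewrites a tiny expression IR without knowing anything about the backend. (This is a hypothetical nested-tuple IR for illustration, not Mesa's or GCC's actual representation.)

```python
def fold(node):
    """Recursively fold constant subexpressions in a nested-tuple IR.

    A node is either a leaf (a constant or a named register like "r0")
    or a tuple (op, lhs, rhs). The pass knows nothing about the target:
    the same rewrite applies whether the backend emits x86 or R600 code.
    """
    if not isinstance(node, tuple):
        return node  # leaf: constant or named register
    op, lhs, rhs = node
    lhs, rhs = fold(lhs), fold(rhs)
    if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
        return {"add": lhs + rhs, "mul": lhs * rhs}[op]
    return (op, lhs, rhs)

# ("mul", ("add", 2, 3), "r0") folds to ("mul", 5, "r0") on any target;
# only the final lowering of that folded IR is hardware-specific.
print(fold(("mul", ("add", 2, 3), "r0")))
```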



  • nanonyme
    replied
    Originally posted by rohcQaH View Post
    IIRC the official statement regarding opening fglrx was a mixture of "you don't want to know what's inside" and "there's 3rd party code inside we're not allowed to open".
    I was talking about comparing the open drivers to a theoretical optimal driver, though (which would in all likelihood be better than ATi's proprietary driver at optimizing operations into instructions). Even if ATi opened up their shader compiler code, it would at best allow the open drivers to mimic them and reach similar performance, never allowing them to become better (since becoming better might require a different design for the same stuff). While the current approach probably is slower, it might yield interesting results given enough talented developers (and assuming there's still enough space between ATi's implementation and an optimal implementation to dance mambo in).



  • rohcQaH
    replied
    Originally posted by bridgman View Post
    Yep. One of the contributors to the "60-70%" performance estimate for open source vs fglrx drivers was the use of a sub-optimal shader compiler in the open source drivers.
    Maybe you aren't allowed to answer this, but...

    IIRC the official statement regarding opening fglrx was a mixture of "you don't want to know what's inside" and "there's 3rd party code inside we're not allowed to open".

    After reading your explanations, it seems reasonable to assume that a shader compiler is *very* hardware-specific, thus probably developed in-house, which could give you the option of opening it. Wouldn't that both save your OS-developers valuable time AND result in better OS-drivers and maybe (through community patches) better CS-drivers?

    Or are there some simple problems I'm overlooking, like incompatible compiler APIs/tight coupling with fglrx? Or more complicated problems, like patents/IP, non-obvious licensing problems, too much work to clear the whole thing, fear of nvidia stealing your shinies, ..?

    I don't want to sound ungrateful; AMD/ATI has shared a lot and done a lot for us linux-users, and it's impudent to ask for more. But since you keep taunting us that fglrx will always be superior because it has that great shader compiler, I'm just curious about the reasons.



  • bridgman
    replied
    Yep. One of the contributors to the "60-70%" performance estimate for open source vs fglrx drivers was the use of a sub-optimal shader compiler in the open source drivers.

    Pipelining is not much of an issue for GPUs, at least not for ours. The actual graphics pipeline is *very* long, potentially thousands of clocks or more, but data only flows one way most of the time, and read-after-write situations (eg rendering into a pixmap then using the results as a texture) are treated as exceptions with explicit cache flushes.

    Inside the shader core itself, pipelines are for all practical purposes non-existent. This is possible because GPUs are almost always dealing with hugely parallel workloads, so single-thread performance doesn't really matter. We run at relatively low frequencies compared to a CPU (which allows a much shorter pipeline), and the SIMD engines process 4 clocks worth of work at a time (eg a 16-way SIMD processes 64 threads in 4 clocks), which allows the remaining bit of pipelining to be hidden from the programming model.
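The wavefront arithmetic in the post above can be sketched in a few lines (the constant names are illustrative, not AMD's terminology):

```python
# Sketch of the figures bridgman gives: a 16-wide SIMD engine that
# repeats each instruction for 4 consecutive clocks retires a group of
# 16 * 4 = 64 threads per instruction issue, which is what lets a
# shallow ALU pipeline stay invisible to the programming model.

SIMD_WIDTH = 16        # lanes per SIMD engine
CLOCKS_PER_INSTR = 4   # each instruction is issued over 4 clocks

threads_per_group = SIMD_WIDTH * CLOCKS_PER_INSTR
print(threads_per_group)  # 64 threads covered per instruction issue
```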
    Last edited by bridgman; 07-28-2009, 09:59 PM.



  • nanonyme
    replied
    Originally posted by bridgman View Post
    Normally you use multiple slots per instruction automatically since you're dealing with 3-4 component vertex or pixel vectors, so it's pretty easy to get "decent" utilization -- but you can definitely get some extra performance in shader-intensive operations by packing other operations into the unused slots.
    Well, yeah... when I said optimal, I also meant it as the ideal situation.
    Btw, do you run into similar pipelining challenges with GPUs as with CPUs?



  • bridgman
    replied
    Normally you use multiple slots per instruction automatically since you're dealing with 3-4 component vertex or pixel vectors, so it's pretty easy to get "decent" utilization -- but you can definitely get some extra performance in shader-intensive operations by packing other operations into the unused slots.
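The utilization effect bridgman describes can be put in rough numbers, assuming the 5-slot R6xx-style instruction word mentioned elsewhere in the thread (the figures are illustrative only):

```python
# Rough utilization sketch: a 4-component vector op on a 5-slot
# instruction word fills 4 of 5 slots by itself, so "decent" utilization
# comes almost for free; co-issuing one extra independent scalar op into
# the spare slot is where the additional shader-intensive speedup comes from.

SLOTS = 5  # independent operations per instruction word (R6xx-R7xx)

def utilization(ops_packed):
    """Fraction of slots in one instruction word doing useful work."""
    return ops_packed / SLOTS

print(utilization(4))      # vec4 op alone: 0.8
print(utilization(4 + 1))  # vec4 plus one packed scalar op: 1.0
```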
    Last edited by bridgman; 07-28-2009, 07:55 PM.



  • nanonyme
    replied
    Originally posted by bridgman View Post
    R3xx-R5xx and RS6xx GPUs support two simultaneous operations per instruction (1 vector + 1 scalar) while R6xx-R7xx GPUs support up to five independent operations per instruction.
    Am I to assume that if you fail to fill all five, you end up with suboptimal instructions and slower drivers? (Of course, an optimal driver probably doesn't and can't exist anyway; it's more a matter of how far you are from it.)
    Last edited by nanonyme; 07-28-2009, 07:28 PM.



  • bridgman
    replied
    Interest in LLVM seems to have moved "up the stack" a bit, ie using LLVM to generate TGSI rather than using LLVM to translate from TGSI to native hardware instructions. It's certainly possible to use LLVM in both places, but LLVM doesn't currently handle explicitly superscalar hardware so for now the back-end translation seems to be handled best by hardware-specific code.

    R3xx-R5xx and RS6xx GPUs support two simultaneous operations per instruction (1 vector + 1 scalar) while R6xx-R7xx GPUs support up to five independent operations per instruction. Operations in a single instruction need to share inputs to some extent, so packing operations into instruction words is non-trivial at best.
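The packing problem described above can be sketched as a greedy bin-packing under a shared-input constraint. The slot count matches the post; the cap on distinct source registers per instruction word is a simplifying assumption, not the real R6xx port/bank rules, which are considerably more involved.

```python
# Hypothetical sketch of VLIW slot packing: fit independent scalar ops
# into 5-slot instruction words, where each word may read at most
# MAX_SOURCES distinct source registers (an assumed, simplified stand-in
# for the real "operations need to share inputs" restriction).

MAX_SLOTS = 5    # up to five independent operations per word (R6xx-R7xx)
MAX_SOURCES = 4  # assumed cap on distinct inputs per word (illustrative)

def pack(ops):
    """ops: list of sets of source registers, one set per scalar op.

    Greedy first-fit: put each op in the first word that still has a
    free slot and whose combined source set stays within MAX_SOURCES.
    """
    words = []
    for srcs in ops:
        for word in words:
            if (len(word["ops"]) < MAX_SLOTS
                    and len(word["srcs"] | srcs) <= MAX_SOURCES):
                word["ops"].append(srcs)
                word["srcs"] |= srcs
                break
        else:
            words.append({"ops": [srcs], "srcs": set(srcs)})
    return words

# Six independent ops over a handful of registers pack into 2 words here;
# a packer ignoring the input constraint would claim 2 words were always
# enough, which is why the real problem is non-trivial.
ops = [{"r0", "r1"}, {"r0", "r2"}, {"r1", "r2"},
       {"r3", "r4"}, {"r3"}, {"r4"}]
print(len(pack(ops)))
```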



  • Ant P.
    replied
    I wonder what happened to that LLVM shader compiler that was talked about months ago...

