A Big Comparison Of The AMD Catalyst, Mesa & Gallium3D Drivers


  • The Intel GLSL compiler goes from GLSL source to an IR (currently in a two-step process: first to a compiler-specific IR (aka "GLSL IR"), then converted to Mesa IR, and then to TGSI, I guess).

    Jerome's shader compiler goes from TGSI to hardware instructions, i.e. it does the rest of the work. Similarly, the shader compiler in the r600 driver goes from Mesa IR to hardware instructions.

    FYI the fglrx driver also works in two stages - the GL driver compiles GLSL / ARB_*P down to a proprietary representation (what we call "IL") and then the shader compiler goes from IL to hardware instructions in a second step.

    The Intel devs are thinking about generating hardware instructions directly from GLSL IR, rather than going through Mesa IR or TGSI.
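
    To make the staging concrete, here is a toy model of that lowering chain (a minimal sketch; the function names are made up for illustration and are not the real Mesa/Gallium API):

    /* Toy model of the staged lowering described above.  Each pass
     * consumes one representation and produces the next; real passes
     * transform IR data structures, not strings. */
    #include <stdio.h>

    static const char *compile_glsl(const char *src) { (void)src; return "GLSL IR"; }
    static const char *to_mesa_ir(const char *ir)    { (void)ir;  return "Mesa IR"; }
    static const char *to_tgsi(const char *ir)       { (void)ir;  return "TGSI"; }
    static const char *codegen(const char *ir)       { (void)ir;  return "hw instructions"; }

    int main(void)
    {
        /* Gallium path: GLSL source -> GLSL IR -> Mesa IR -> TGSI -> hw.
         * The Intel idea above would be codegen(compile_glsl(src)),
         * skipping the middle representations. */
        const char *src = "void main() { gl_FragColor = vec4(1.0); }";
        printf("%s\n", codegen(to_tgsi(to_mesa_ir(compile_glsl(src)))));
        return 0;
    }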



    • Originally posted by bridgman View Post
      The problem is that so far the test results aren't supporting our initial suspicions. Going in I think most of us suspected that the bottlenecks were likely to be in the kernel driver (synchronization, memory mapping etc..) but test results seem to suggest that common mesa code in the usermode 3D driver is a bigger factor. There's a lot more testing required though, and there are conflicting views re: how to interpret the test results so far.

      Performance optimization is basically:

      - run some benchmarks & save the results
      repeat forever {
      - do some profiling
      - form a theory re: where the bottleneck is
      - change some code to test the theory
      - re-run the benchmarks to see if things go faster
      - (4 times out of 5) curse and discard the theory (or save as the basis for a more complex theory)
      - (1 time out of 5) make happy noises and get some sleep
      }
      WHAT, are all AMD Linux workflows this slow and wasting lots of cycles? You're obviously doing it wrong; even a non-head-of-AMD-Linux manager can see that.

      You're saying you don't even have, or have yet to write, a simple C app doing x264-style checkasm tests, down to the decicycle range, on all the C and assembly routines in the code?

      To see exactly what I'm referring to here: if you compile x264 from git, a simple

      make checkasm; ./checkasm
      ./checkasm --bench

      gives you results that help you find and easily spot the real bottlenecks without effort, and so decide which routines to optimise first. Perhaps you could even extend this type of performance tool to all the existing Gfx kernel-space C/assembly in time. Or perhaps you could ask/commission pengvado, Dark_Shikari or holger on #x264dev to write you a basic x264-checkasm-type app if you can't be bothered or don't have the time; then everyone wins.
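
      For anyone who has not seen it, the core of such a tool is small; below is a minimal checkasm-style sketch (NOT x264's actual code; sad_16x16_c is a made-up stand-in for whatever routine you want to measure):

      /* Minimal checkasm-style microbenchmark: time a routine over many
       * runs and keep the best cycle count, the way checkasm --bench
       * ranks C vs. assembly implementations of the same routine. */
      #include <stdint.h>
      #include <stdio.h>
      #include <x86intrin.h>                  /* __rdtsc() on x86 */

      static volatile int sink;               /* keeps calls from being optimized away */

      /* toy 16x16 sum-of-absolute-differences, the kind of routine x264 benches */
      static int sad_16x16_c(const uint8_t *a, const uint8_t *b)
      {
          int sum = 0;
          for (int i = 0; i < 256; i++)
              sum += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
          return sum;
      }

      static uint64_t bench(int (*fn)(const uint8_t *, const uint8_t *),
                            const uint8_t *a, const uint8_t *b)
      {
          uint64_t best = UINT64_MAX;
          for (int run = 0; run < 1000; run++) {
              uint64_t t0 = __rdtsc();
              for (int i = 0; i < 100; i++)   /* amortize timer overhead */
                  sink = fn(a, b);
              uint64_t t = (__rdtsc() - t0) / 100;
              if (t < best)                   /* keep the least-disturbed run */
                  best = t;
          }
          return best;
      }

      int main(void)
      {
          static uint8_t a[256], b[256];      /* dummy input blocks */
          printf("sad_16x16_c: %llu cycles\n",
                 (unsigned long long)bench(sad_16x16_c, a, b));
          return 0;
      }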



      • Originally posted by spirit View Post
        I confirm, on my RV620 chipset:

        cat /var/log/Xorg.0.log | grep Pageflipping
        [     7.614] (II) RADEON(0): KMS Pageflipping: enabled
        What kernel version are you using? 2.6.38, I assume? Does Pageflipping make much of a difference for you?



        • Originally posted by popper View Post
          WHAT, are all AMD Linux workflows this slow and wasting lots of cycles? You're obviously doing it wrong; even a non-head-of-AMD-Linux manager can see that.
          Popper, with respect, you are completely missing the point. The developers have good tools for figuring out where CPU cycles are going, but performance tuning on a graphics driver is a lot more complicated: you're dealing with a half dozen independent hardware blocks in the GPU with invisible queues between them. Performance tuning on a software implementation is much simpler... unfortunately.



          • Originally posted by popper View Post
            WHAT, are all AMD Linux workflows this slow and wasting lots of cycles? You're obviously doing it wrong; even a non-head-of-AMD-Linux manager can see that.

            You're saying you don't even have, or have yet to write, a simple C app doing x264-style checkasm tests, down to the decicycle range, on all the C and assembly routines in the code?

            To see exactly what I'm referring to here: if you compile x264 from git, a simple

            make checkasm; ./checkasm
            ./checkasm --bench

            gives you results that help you find and easily spot the real bottlenecks without effort, and so decide which routines to optimise first. Perhaps you could even extend this type of performance tool to all the existing Gfx kernel-space C/assembly in time. Or perhaps you could ask/commission pengvado, Dark_Shikari or holger on #x264dev to write you a basic x264-checkasm-type app if you can't be bothered or don't have the time; then everyone wins.
            It's quite easy to optimise a single codec; only an idiot would think this scales to anything like a generic GL stack.

            Having experience in one small area of computing doesn't mean you are actually an expert.

            Dave.



            • Originally posted by bridgman View Post
              Um... yes, there's some head scratching going on (just like in the picture), but the associated question is more along the lines of "the code in the open source driver looks pretty good, but it seems to run a lot slower than Catalyst and we don't know why".
              If there's one thing I've learned after 15 years of coding, it's that code that looks good usually performs worse than complicated-looking code.



              • Originally posted by bridgman View Post
                Popper, with respect, you are completely missing the point. The developers have good tools for figuring out where CPU cycles are going, but performance tuning on a graphics driver is a lot more complicated: you're dealing with a half dozen independent hardware blocks in the GPU with invisible queues between them. Performance tuning on a software implementation is much simpler... unfortunately.
                Popper, FYI the checkasm stuff you are talking about corresponds to a single line item in the workflow, "- do some profiling".



                • Originally posted by airlied View Post
                  It's quite easy to optimise a single codec; only an idiot would think this scales to anything like a generic GL stack.
                  Now now, Dave (Airlie), at no point did I say or imply that it was or did.
                  There's no need to take offence or call names because you mistakenly read things into posts that aren't there; this is a Linux user support message board, after all.

                  Originally posted by airlied View Post
                  Having experience in one small area of computing doesn't mean you are actually an expert.

                  Dave.
                  Again with reading things into posts that aren't there... what does making a simple performance tool, to help all developers (and users alike) run and submit tests to find bottlenecks, have to do with being an expert in any particular area?

                  bridgman: "with respect, you are completely missing the point"
                  No, I considered that.

                  "performance tuning on a graphics driver is a lot more complicated"

                  Sure, and the more complicated the code, the more reason to consider writing that performance tool: cover the basic parts you can easily cover to start with, and extend it over time as more and more devs begin to use it. After all, Peter Clifton is even trying to implement and improve such performance tools: http://lists.freedesktop.org/archive...er/008557.html



                  • popper> You apparently live under the illusion that profilers, or any kind of measuring tool, can show you what to optimize. The graphics driver stack is huge; there are hundreds of functions that call each other, each spending very little time in itself. The right question usually isn't "how do I speed up this function?", it's rather something like "can I somehow change the upper layers so that this function is called less often?" Now if you start with that kind of question, you realize that if you don't know what's REALLY going on in the code at various levels, profiling is mostly USELESS and will only make you spend time on parts of the code that will give you very little speedup, if any (e.g. you may end up wondering why atomic increment is so high in the profile).
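
                    A trivial illustration of that kind of upper-layer fix (hypothetical names, not actual Mesa code): rather than shaving cycles off a function that shows up hot in the profile, stop calling it when nothing has changed:

                    /* "Call it less often" beating "make it faster".
                     * emit_blend_state() stands in for any cheap-looking
                     * driver function that profiles hot only because the
                     * layers above call it on every draw. */
                    #include <string.h>

                    struct blend_state { int enable; int src_factor; int dst_factor; };

                    static void emit_blend_state(const struct blend_state *bs)
                    {
                        (void)bs;  /* imagine commands being queued for the hw */
                    }

                    static struct blend_state cached;
                    static int cached_valid;

                    void set_blend_state(const struct blend_state *bs)
                    {
                        /* Skip redundant state emission entirely; no amount of
                         * micro-optimizing emit_blend_state() can beat not
                         * calling it at all. */
                        if (cached_valid && memcmp(&cached, bs, sizeof(*bs)) == 0)
                            return;
                        cached = *bs;
                        cached_valid = 1;
                        emit_blend_state(bs);
                    }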



                    • Originally posted by popper View Post
                      "performance tuning on a graphics driver is a lot more complicated"

                      Sure, and the more complicated the code, the more reason to consider writing that performance tool: cover the basic parts you can easily cover to start with, and extend it over time as more and more devs begin to use it. After all, Peter Clifton is even trying to implement and improve such performance tools: http://lists.freedesktop.org/archive...er/008557.html
                      So you can get some understanding: a GL stack is several layers of software communicating through various means. A quick sketch:
                      1 [mesa GL->gallium->pipe driver]
                      2 [xorg<->ddx]
                      3 [kernel]
                      4 [GPU hw]

                      1 communicates with 2 through dri1/dri2 (can impact performance)
                      2 communicates with 3 through the kernel drm api (can impact performance)
                      1 communicates with 3 through the kernel drm api (can impact performance)
                      3 communicates with 4

                      So obviously, with so many players, it's hard to point a finger, and no tool will help you there unless you build one capable of spying on everyone (such a tool would likely be insanely complex). Bottom line: given the resources we have, time spent on such a tool is time we could spend today improving performance. If you think the tool would be useful in the future, carefully consider that any of the links between the four components will likely change in incompatible ways, and you'll see that unless you have enough manpower, writing such a tool is doomed to be outdated by the time it can give any insight.

                      Oh, and spying on things such as dri1/dri2, to evaluate their cost and what's going wrong, is, if I had to guess, very complex to achieve without interfering with dri2 in non-trivial ways. The same goes for all the other links; it's not easy to instrument them.

                      As I said elsewhere, I think we still have too many CPU bottlenecks to consider GPU optimization worthwhile (at least if you care about slow CPUs and don't have one of those crazy 30" screens starving your bandwidth). So things such as a GPU top would be of little use.
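
                      For what it's worth, the crude one-off instrumentation you can do on a single link today looks something like this (a sketch only; wrapping an ioctl in a timer is illustrative, not a real spying tool):

                      /* Time one userspace -> kernel crossing by wrapping an
                       * ioctl with clock_gettime.  Which request you pass is up
                       * to you; the point is that every link would need its own
                       * ad hoc wrapper like this. */
                      #include <stdio.h>
                      #include <sys/ioctl.h>
                      #include <time.h>

                      static long long ns_now(void)
                      {
                          struct timespec ts;
                          clock_gettime(CLOCK_MONOTONIC, &ts);
                          return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
                      }

                      int timed_ioctl(int fd, unsigned long request, void *arg, const char *name)
                      {
                          long long t0 = ns_now();
                          int ret = ioctl(fd, request, arg);
                          fprintf(stderr, "%s took %lld ns\n", name, ns_now() - t0);
                          return ret;
                      }

                      Doing the same across dri2 requests or inside the ddx means patching each component separately, which is exactly the maintenance burden described above.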

