OpenCL Support In GCC?

  • #11
    Originally posted by bridgman View Post
    Running a kernel under OpenCL is pretty similar to running a pixel shader program under OpenGL -- the app says "for every pixel run this program", then throws triangles or rectangles at the GPU. The GPU then runs the appropriate shader program on every pixel, and on modern GPUs that involves running hundreds of threads in parallel (an RV770 can execute 160 instructions in parallel, each doing up to 5 floating point MADs, or 10 FLOPs per instruction).

    The per-pixel output from the shader program usually goes to the screen, but it could go into a buffer which gets used elsewhere or read back into system memory. The Mesa driver runs on the CPU but the shader programs run on the GPU.

    Same with OpenCL; driver runs on the CPU but a bunch of copies of the kernel run in parallel on the GPU. The key point is that the GPU is only working on one task at a time, but within that task it can work on hundreds of data items in parallel. That's why GPUs are described as data-parallel rather than task-parallel.

    The data-parallel vs task-parallel distinction is also why the question of "how many cores does a GPU have ?" is so tricky to answer. Depending on your criteria, an RV770 can be described as single-core, 10 core, 160 core or 800 core. The 10-core answer is probably most technically correct, while the 160-core answer probably gives the most accurate idea of throughput relative to a conventional CPU.

    Anyways, since a GPU fundamentally works on one task at a time and the driver time-slices between different tasks, it should be possible to hook into the driver and track what percentage of the time is being used by each of the tasks. That hasn't been useful in the past (since all the GPU workload typically comes from whatever app you are running at the moment) but as we start juggling multiple tasks on the GPU that will probably become more important (and more interesting to watch).
    Okay, so a pixel shader is more or less an infinite while-loop?

    So if OpenCL comes into play, does that mean the OpenGL and OpenCL drivers have to schedule whose turn it is to get data processed, since the GPU can only handle one task at a time?

    Let's say I write an OpenCL program that simulates a flow. Is that program the kernel for the GPU? Or is the kernel something Mesa would write to intercept my flow simulation program?

    How many kernels can the GPU have running?



    • #12
      Originally posted by bridgman View Post
      It's supposed to hurt. That means you're starting to understand. Congratulations

      Ever since the introduction of programmable shaders GPU drivers have included an on-the-fly compilation step (going from, say, GLSL to GPU shader instructions) and the GPU hardware has run many copies of those compiled shader programs in parallel to get acceptable performance.

      GPU vendors did a good job of hiding that complexity from the application -- but with OpenCL you get to see all the scary stuff behind the scenes.

      Back in 2002 the R300 (aka 9700) was running 8 copies of the pixel shader program in parallel, each working on a different pixel. The RV730 is comparable in terms of pixel throughput but can run 64 copies of a shader program in parallel, ie the ratio of shader power to pixel-pushing power is 4-8 times higher on the RV730. This is why modern chips can run so much *faster* on complex 3D applications even if they run *slower* on glxgears.

      Unified shader GPUs use multiple shader blocks in order to handle the mix of vertex, geometry and pixel shader work that comes with a single drawing task. In principle the blocks could be designed to work on totally different tasks but that would require a lot more silicon (more $$) and the added complexity would probably *reduce* overall performance.

      The most important concept to grasp is that with conventional programming you have a single task, executing a program which steps through an array and calculates the results for each element. With data-parallel programming you write a program that calculates the value of ONE element, then the OpenCL / Stream / CUDA runtime executes a copy of the program for each element in the array, using parallel hardware as much as possible.

      Having the runtime take care of parallelism (rather than the application) makes it possible for an application to run on anything from a single-core CPU to a stack of GPUs without recompilation.
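
      To make the contrast above concrete, here is a minimal sketch (not from the thread; the function names and the arithmetic are just placeholders). The first function is ordinary host-side C stepping through the whole array; the second is an OpenCL C kernel that computes ONE element and lets the runtime fan copies of it out across the hardware.

        /* Conventional, single-task version: one CPU thread walks the whole array. */
        void scale_serial(const float *in, float *out, int n)
        {
            for (int i = 0; i < n; ++i)
                out[i] = in[i] * 2.0f + 1.0f;
        }

        /* Data-parallel version: an OpenCL C kernel that computes ONE element.
         * The runtime launches one work-item per element; get_global_id(0)
         * tells each copy which element it is responsible for. */
        __kernel void scale_parallel(__global const float *in, __global float *out)
        {
            int i = get_global_id(0);
            out[i] = in[i] * 2.0f + 1.0f;
        }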
      I am beginning to get the feeling that GPU vendors start off by shooting down a UFO, and use its technology to make GPUs.

      I hope you don't have small green antennas



      • #13
        Originally posted by Louise View Post
        Okay, so a pixel shader is more or less an infinite while-loop?
        Sort of.. more like one pass through the loop, but with many copies running in parallel each on a separate piece of the answer.

        Originally posted by Louise View Post
        So if OpenCL comes into play, does that mean the OpenGL and OpenCL drivers have to schedule whose turn it is to get data processed, since the GPU can only handle one task at a time?
        Yep. This is how it works today when both the X driver and the Mesa driver want to use the chip. The drm driver arbitrates between multiple clients and uses a lock so that only one of them can have the GPU at a time.
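
        As a purely conceptual sketch of that arbitration (this is not the real drm interface or its locking code; the names and the pthread mutex are just stand-ins for the hardware lock), the idea boils down to:

          #include <pthread.h>
          #include <stddef.h>

          /* Stand-in for the lock the drm holds on behalf of its clients
           * (the X driver, Mesa, an OpenCL runtime, ...). */
          static pthread_mutex_t gpu_lock = PTHREAD_MUTEX_INITIALIZER;

          void submit_to_gpu(const void *cmd_buffer, size_t len)
          {
              pthread_mutex_lock(&gpu_lock);   /* only one client owns the GPU */
              /* ... hand cmd_buffer/len to the hardware ring and wait ... */
              (void)cmd_buffer; (void)len;
              pthread_mutex_unlock(&gpu_lock); /* next client gets its turn */
          }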

        Originally posted by Louise View Post
        Let's say I write an OpenCL program that simulates a flow. Is that program the kernel for the GPU? Or is the kernel something Mesa would write to intercept my flow simulation program?
        Can I use a simpler example 'cause I slept through too many physics classes ? Let's say we have a couple of arrays, and we want to run a complex program against those arrays to create a third array. The kernel would contain the code required to generate the result for one element of that third array... then OpenCL would run a separate copy of that program for each element in the third array, passing it the appropriate parameters so that the proper portions of the first two arrays would be used in the calculation.

        You would write the kernel program in C, following the OpenCL guidelines. The OpenCL runtime (mostly the driver) would compile that program on the fly to hardware-specific instructions (see the r600_isa doc for details) and then run a bazillion copies of that compiled program in parallel.
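
        To give a feel for what the host side of that looks like (a rough sketch, not code from the thread: the kernel name, array sizes and the add-two-arrays example are made up, and error checking is stripped for brevity), the driver gets the kernel source as a string, compiles it on the fly, and then queues one work-item per output element:

          #include <stdio.h>
          #include <CL/cl.h>

          /* Kernel source the OpenCL runtime compiles on the fly for whatever
           * device it finds.  Each copy computes ONE element of c = a + b. */
          static const char *src =
              "__kernel void vec_add(__global const float *a,\n"
              "                      __global const float *b,\n"
              "                      __global float *c)\n"
              "{ int i = get_global_id(0); c[i] = a[i] + b[i]; }\n";

          int main(void)
          {
              enum { N = 1024 };
              float a[N], b[N], c[N];
              for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

              cl_platform_id plat;  cl_device_id dev;  cl_int err;
              clGetPlatformIDs(1, &plat, NULL);
              clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
              cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
              cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

              /* On-the-fly compile: C source -> device-specific instructions. */
              cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
              clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
              cl_kernel k = clCreateKernel(prog, "vec_add", &err);

              cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                         sizeof a, a, &err);
              cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                         sizeof b, b, &err);
              cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, &err);
              clSetKernelArg(k, 0, sizeof da, &da);
              clSetKernelArg(k, 1, sizeof db, &db);
              clSetKernelArg(k, 2, sizeof dc, &dc);

              /* "Run a bazillion copies": one work-item per output element. */
              size_t global = N;
              clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
              clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

              printf("c[42] = %.1f\n", c[42]);   /* expect 42 + 84 = 126.0 */
              return 0;
          }

        (Build with something like gcc vec_add.c -lOpenCL; the interesting part is that the same source runs unchanged on any device with an OpenCL driver.)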

        Originally posted by Louise View Post
        How many kernels can the GPU have running?
        It varies from chip to chip, but I think it's in the "thousands" range. It depends a bit on how complex the kernel program is, specifically how many different registers it uses. The GPU is built around a big honkin' register file -- the more registers required by an individual thread, the fewer threads you can run at the same time.

        The GPU won't actually execute thousands of threads at the same time; many threads may be waiting for memory accesses and so the hardware scheduler runs only the threads which are not waiting for anything. The RV770 has enough stream processors to actually *execute* 160 threads at a time (10 16-way SIMD blocks), with each thread performing up to 5 floating point instructions per clock. That's where the 1.2 teraflop number comes from -- 160 threads x 5 operations per thread per clock x 2 FLOPs per operation (Multiply-Add) x 750 MHz clock rate.
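
        Just to spell that multiplication out (the numbers are exactly the ones quoted above):

          #include <stdio.h>

          int main(void)
          {
              /* RV770 peak throughput, using the figures from the post. */
              const double threads      = 160;   /* 10 SIMDs x 16 lanes           */
              const double ops_per_clk  = 5;     /* 5-wide superscalar per thread */
              const double flops_per_op = 2;     /* a multiply-add is 2 FLOPs     */
              const double clock_ghz    = 0.75;  /* 750 MHz                       */

              printf("peak = %.0f GFLOPS\n",
                     threads * ops_per_clk * flops_per_op * clock_ghz);
              /* prints "peak = 1200 GFLOPS", i.e. the 1.2 teraflop figure */
              return 0;
          }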

        Originally posted by Louise View Post
        I am beginning to get the feeling that GPU vendors start off by shooting down a UFO, and use its technology to make GPUs.
        It's possible, but nobody is talking. The Terminator explanation seems more believable to me.

        Originally posted by Louise View Post
        I hope you don't have small green antennas
        No, but I'm not the hardware designer
        Last edited by bridgman; 01 February 2009, 07:16 PM.



        • #14
          Originally posted by bridgman View Post
          Sort of.. more like one pass through the loop, but with many copies running in parallel each on a separate piece of the answer.
          Reading your entire answer I think I understand now

          Originally posted by bridgman View Post
          Can I use a simpler example 'cause I slept through too many physics classes ? Let's say we have a couple of arrays, and we want to run a complex program against those arrays to create a third array. The kernel would contain the code required to generate the result for one element of that third array... then OpenCL would run a separate copy of that program for each element in the third array, passing it the appropriate parameters so that the proper portions of the first two arrays would be used in the calculation.
          Excellent. So my program becomes the kernel.

          On Windows there is a lot of malware, e.g. the proof-of-concept Blue Pill virtual machine, which is very hard to detect once it is running.

          So I am just thinking about security. Should malware running in the GPU be a concern for Windows users?

          Originally posted by bridgman View Post
          You would write the kernel program in C, following the OpenCL guidelines. The OpenCL runtime (mostly the driver) would compile that program on the fly to hardware-specific instructions (see the r600_isa doc for details) and then run a bazillion copies of that compiled program in parallel.
          If only it worked the same way for CPUs

          Originally posted by bridgman View Post
          It varies from chip to chip, but I think it's in the "thousands" range. It depends a bit on how complex the kernel program is, specifically how many different registers it uses. The GPU is built around a big honkin' register file -- the more registers required by an individual thread, the fewer threads you can run at the same time.
          It is impressive how much technology the customer gets for only $100 nowadays! What a great time we are living in!

          Originally posted by bridgman View Post
          It's possible, but nobody is talking. The Terminator explanation seems more believable to me.
          Scary

          Originally posted by bridgman View Post
          No, but I'm not the hardware designer
          So the surviving aliens are only used for designing the hardware?

          I guess that justifies it



          • #15
            Originally posted by Louise View Post
            So I am just thinking about security. Should malware running in the GPU be a concern for Windows users?
            It should be a concern for all users, but since GPUs can't really have long-running processes on them (yet) the main concern is malware running on the CPU but using the GPU to gain access to areas of memory which are blocked for CPU access. There are a number of safeguards in place (in all OSes, not just Windows) to prevent this.

            One of the recurring debates on both #radeon and #dri-devel is exactly how much checking the drm should do before passing a command buffer to the GPU, eg validating memory addresses, register offsets etc...
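
            To illustrate the kind of checking being debated (a purely hypothetical sketch: the packet layout, limits and function names below are invented for illustration and are not the real radeon command-stream checker), the drm would walk each submitted buffer and reject anything touching registers or memory the client should not reach:

              #include <stdbool.h>
              #include <stddef.h>
              #include <stdint.h>

              /* Invented packet layout: one register write per command. */
              struct cmd { uint32_t reg_offset; uint64_t gpu_addr; };

              static bool cmd_is_safe(const struct cmd *c, uint64_t mem_base,
                                      uint64_t mem_size, uint32_t max_user_reg)
              {
                  if (c->reg_offset > max_user_reg)
                      return false;                        /* privileged register   */
                  if (c->gpu_addr < mem_base ||
                      c->gpu_addr >= mem_base + mem_size)
                      return false;                        /* outside client memory */
                  return true;
              }

              /* Reject the whole submission if any single command is bad. */
              static bool validate_buffer(const struct cmd *buf, size_t n,
                                          uint64_t mem_base, uint64_t mem_size,
                                          uint32_t max_user_reg)
              {
                  for (size_t i = 0; i < n; ++i)
                      if (!cmd_is_safe(&buf[i], mem_base, mem_size, max_user_reg))
                          return false;
                  return true;
              }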



            • #16
              I'd rather see LLVM support than GCC support, but in the end, just support from anywhere would make me a happy coder.

              Anyway, time to rain on the parade a bit:
              It would definitely be interesting to convert, for example, GEGL to OpenCL, but coding for OpenCL is tedious and only really suitable for computational code (preferably data-parallel code).
              This is a great talk:


              Also, kernels in the thousands range sounds like an order of magnitude too much. Nvidia Tesla tops out at 240 cores that produce a total of 933 single-precision GFLOPS and 78 double-precision GFLOPS (and it certainly costs a lot more than 100 USD), while a good new high-end CPU will maybe produce around 30-40 GFLOPS, and the Cell with 8 SPEs can deliver 100 double-precision GFLOPS (theoretically around 200 or so, I think).
              I found this from wikipedia as well:
              "nVidia 8800 Ultra performs around 576 (single precision) GFLOPS on 128 Processing elements"

              Ok, so these numbers are often unrealistic for real applications, but comparing FLOPS is probably the most accurate way to compare these different units.

              But hey, I do plan to write code for OpenCL as soon as I get the chance. However, in most cases I would prefer double precision, so I will probably have to look forward to running my code only on the Cell or on normal clusters.

              Edit:
              The wikipedia page has a nice short example:


              Louise: about the kernel, I'd just like to clarify that a typical kernel is just a short function (preferably one that requires a lot of work but uses little memory) in your code that you wish to apply to a large set of data (data parallel), or several different kernels you wish to perform in parallel (task parallel). GPUs are best at data parallel work (I don't even know if it's feasible to run anything task parallel on a GPU).
              Last edited by Micket; 01 February 2009, 08:56 PM.



              • #17
                If double precision is important you might want to look at the AMD FireStream boards. A single-GPU board has 160 5-ALU cores, 1200 single-precision GFLOPS and 240 double-precision GFLOPS.

                FireStream boards cost a lot more than $100 too
                Last edited by bridgman; 01 February 2009, 09:13 PM.



                • #18
                  Originally posted by bridgman View Post
                  It should be a concern for all users, but since GPUs can't really have long-running processes on them (yet) the main concern is malware running on the CPU but using the GPU to gain access to areas of memory which are blocked for CPU access. There are a number of safeguards in place (in all OSes, not just Windows) to prevent this.
                  Maybe an SPU on the PS3 can't be compared to a GPU, but the last SPU on the PS3 is not disabled to improve yields, as Sony would like users to think; it is running the hypervisor.

                  So I was just concerned that malware programs could do the same.

                  Originally posted by bridgman View Post
                  One of the recurring debates on both #radeon and #dri-devel is exactly how much checking the drm should do before passing a command buffer to the GPU, eg validating memory addresses, register offsets etc...
                  In terms of performance vs security?

                  Watching the hacker talks from ccc.de, it is just amazing how they exploited every little security mistake to get Linux onto the Xbox*, Wii, and PSP.

                  In one of the talks, Felix Domke (who wrote the open source 3D driver for the Xbox 360) and Michael Steil explain 17 classes of mistakes MS made on the Xbox 1. Several mistakes were made more than once.

                  One of the mistakes was: "Never underestimate a combination of weaknesses". E.g. MS checked for a poke PCI command, so the hackers couldn't use that; they just used 4 legal commands to do the job of the poke PCI command.

                  They also explain that there is no such thing as security vs performance. Either the system is secure, or it is not.


                  I guess if AMD let you give a talk at CCC about the security implementation in the open source driver, you would at the end of the day know if it was done right

                  For those who don't know CCC, it is a convention where hardware and software hackers gather to talk about the stuff they have hacked during the last year.



                  • #19
                    Originally posted by Micket View Post
                    Louise: about the kernel, I'd just like to clarify that a typical kernel is just a short function (preferably one that requires a lot of work but uses little memory) in your code that you wish to apply to a large set of data (data parallel), or several different kernels you wish to perform in parallel (task parallel). GPUs are best at data parallel work (I don't even know if it's feasible to run anything task parallel on a GPU).
                    I think it is a very interesting topic, but far above my programming level. E.g. I have never tried to use an API in C...

                    But my interest in OpenCL comes from using Matlab for fluid simulations, and I gather that they will have OpenCL support at some point, so I will need to know about data parallelism.



                    • #20
                      Originally posted by Micket View Post
                      Kernels in the thousands range sounds like an order of magnitude too much. Nvidia Tesla tops out at 240 cores
                      Both NVidia and AMD GPUs keep thousands of threads "in flight" but only execute "hundreds" at any one time. The NVidia GPU you mentioned has 240 scalar cores while the AMD GPU I mentioned has 160 superscalar cores, but both have many more threads than that in flight in order to hide the latency associated with memory accesses.

                      The Cell processor only accesses local (fast but small) memory so latency is not really a factor.
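
                      A rough back-of-the-envelope illustration of why the in-flight count has to be so much larger than the execute count (the latency and instruction-mix figures here are made-up round numbers, not chip specs):

                        #include <stdio.h>

                        int main(void)
                        {
                            /* Made-up round numbers, only to show the shape of it. */
                            const double mem_latency_cycles = 400;  /* one DRAM access      */
                            const double alu_cycles_between = 40;   /* math per access      */
                            const double threads_executing  = 160;  /* slots busy per clock */

                            /* While one batch of threads waits on memory, other batches
                             * must have ALU work ready, so the scheduler wants roughly: */
                            double in_flight = threads_executing *
                                               (1.0 + mem_latency_cycles / alu_cycles_between);
                            printf("threads in flight ~ %.0f\n", in_flight);   /* ~1760 */
                            return 0;
                        }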

                      Originally posted by Micket View Post
                      GPUs are best at data parallel work (I don't even know if it's feasible to run anything task parallel on a GPU).
                      I don't think so, at least not with today's GPUs. You can time-share the pipeline and have multiple tasks in the pipe at transitions (the end of one task plus the start of the next) but that's about it AFAIK. GPUs are so much faster than their memory that access locality (which translates into cache hits) is important, and running multiple tasks at once waters down the effective cache size very quickly.
                      Last edited by bridgman; 01 February 2009, 09:40 PM.

