OpenCL Support In GCC?


  • #16
    I'd rather see LLVM support than GCC support, but in the end, just support from anywhere would make me a happy coder.

    Anyway, time to rain on the parade a bit:
    It would definitely be interesting to convert, for example, GEGL to OpenCL, but coding for OpenCL is tedious and really only suitable for computational code (preferably data-parallel code).
    This is a great talk:
    http://software.intel.com/en-us/blog...h-tim-mattson/

    Also, kernels in the thousands range sounds like an order of magnitude too much. The Nvidia Tesla tops out at 240 cores that deliver a total of 933 single-precision GFLOPS and 78 double-precision GFLOPS (and it certainly costs a lot more than 100 USD), while a good new high-end CPU will maybe produce around 30-40 GFLOPS, and the CELL with 8 SPEs can deliver 100 double-precision GFLOPS (theoretically around 200 or so, I think).
    I found this on Wikipedia as well:
    "nVidia 8800 Ultra performs around 576 (single precision) GFLOPS on 128 Processing elements"

    OK, so these numbers are often unrealistic for real applications, but comparing FLOPS is probably the most accurate way to compare these different units.

    But hey, I do plan to write code for OpenCL as soon as I get the chance. However, in most cases I would have preferred double precision, so I will probably have to look forward to running my code only on the CELL or on normal clusters.

    Edit:
    The wikipedia page has a nice short example:
    http://en.wikipedia.org/wiki/OpenCL

    Louise: about the kernel, I'd just like to clarify that a typical kernel is just a short function (preferably one that requires a lot of work but uses little memory) in your code that you wish to apply to a large set of data (data parallel), or several different kernels you wish to perform in parallel (task parallel). GPUs are best at data parallelism (I don't even know if it's feasible to run anything task-parallel on a GPU).
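
    To make that concrete, here is roughly what a data-parallel kernel looks like in OpenCL C (just a sketch; the kernel name and arguments are made up for illustration):

    Code:
    // One work-item is launched per array element (data parallel).
    // "scale_add" and its arguments are illustrative, not from any real project.
    __kernel void scale_add(__global const float *in,
                            __global float *out,
                            const float factor)
    {
        size_t i = get_global_id(0);   /* which element this work-item handles */
        out[i] = in[i] * factor + 1.0f;
    }

    The host side then asks the runtime to launch one instance of this function per element, and the GPU schedules those instances across its cores.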
    Last edited by Micket; 02-01-2009, 07:56 PM.

    • #17
      If double precision is important, you might want to look at the AMD FireStream boards. A single-GPU board has 160 5-ALU cores, 1200 single-precision GFLOPS, and 240 double-precision GFLOPS.

      FireStream boards cost a lot more than $100, too.
      Last edited by bridgman; 02-01-2009, 08:13 PM.

      • #18
        Originally posted by bridgman View Post
        It should be a concern for all users, but since GPUs can't really have long-running processes on them (yet) the main concern is malware running on the CPU but using the GPU to gain access to areas of memory which are blocked for CPU access. There are a number of safeguards in place (in all OSes, not just Windows) to prevent this.
        Maybe an SPU on the PS3 can't be compared to a GPU, but the last SPU on the PS3 is not disabled to improve yields, as Sony would like users to think. It is running the hypervisor.

        So I was just concerned that malware programs could do the same.

        Originally posted by bridgman View Post
        One of the recurring debates on both #radeon and #dri-devel is exactly how much checking the drm should do before passing a command buffer to the GPU, eg validating memory addresses, register offsets etc...
        In terms of performance vs security?

        Watching the hacker talks from ccc.de, it is just amazing how they exploited every little security mistake to get Linux onto the Xbox, Wii, and PSP.

        In one of the talks, Felix Domke (the one who wrote the open source 3D driver for the Xbox 360) and Michael Steil explain 17 classes of mistakes MS made on the Xbox 1. Several mistakes were made more than once.

        One of the mistakes was: "Never underestimate a combination of weaknesses". E.g. MS checked for the PCI poke command, so the hackers couldn't use that directly, but they just used 4 legal commands to do the job of the PCI poke command.

        They also explain that there is no such thing as security vs. performance. Either the system is secure, or it is not.


        I guess if AMD let you give a talk at CCC about the security implementation in the open source driver, you would know by the end of the day whether it was done right

        For those who don't know CCC: it is a convention where hardware and software hackers gather to talk about the stuff they have hacked during the past year.

        • #19
          Originally posted by Micket View Post
          Louise: about the kernel, I'd just like to clarify that a typical kernel is just a short function (preferably one that requires a lot of work but uses little memory) in your code that you wish to apply to a large set of data (data parallel), or several different kernels you wish to perform in parallel (task parallel). GPUs are best at data parallelism (I don't even know if it's feasible to run anything task-parallel on a GPU).
          I think it is a very interesting topic, but far above my programming level. E.g. I have never tried to use an API in C...

          But my interest in OpenCL comes from the fact that I use Matlab for fluid simulations, and I gather that it will have OpenCL support at some point, so I would need to know about data parallelism.

          • #20
            Originally posted by Micket View Post
            Kernels in the thousands range sounds like an order of magnitude too much. The Nvidia Tesla tops out at 240 cores
            Both NVidia and AMD GPUs keep thousands of threads "in flight" but only execute "hundreds" at any one time. The NVidia GPU you mentioned has 240 scalar cores while the AMD GPU I mentioned has 160 superscalar cores, but both have many more threads than that in flight in order to hide the latency associated with memory accesses.

            The Cell processor only accesses local (fast but small) memory so latency is not really a factor.

            Originally posted by Micket View Post
            GPUs are best at data parallelism (I don't even know if it's feasible to run anything task-parallel on a GPU).
            I don't think so, at least not with today's GPUs. You can time-share the pipeline and have multiple tasks in the pipe at transitions (the end of one task plus the start of the next) but that's about it AFAIK. GPUs are so much faster than their memory that access locality (which translates into cache hits) is important, and running multiple tasks at once waters down the effective cache size very quickly.
            Last edited by bridgman; 02-01-2009, 08:40 PM.

            • #21
              Originally posted by bridgman View Post
              It's supposed to hurt. That means you're starting to understand. Congratulations

              Ever since the introduction of programmable shaders GPU drivers have included an on-the-fly compilation step (going from, say, GLSL to GPU shader instructions) and the GPU hardware has run many copies of those compiled shader programs in parallel to get acceptable performance.

              GPU vendors did a good job of hiding that complexity from the application -- but with OpenCL you get to see all the scary stuff behind the scenes.

              Back in 2002 the R300 (aka 9700) was running 8 copies of the pixel shader program in parallel, each working on a different pixel.
              Why not have one copy of the pixel shader operating on 8 different pixels, like SIMD? What's the rationale behind using so many processors if they are all running the same code?
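
              As an aside, the "scary stuff behind the scenes" that bridgman mentions is visible directly in OpenCL host code: the application hands the kernel source to the driver, which compiles it at runtime and reports a build log on failure. Below is a minimal sketch of that flow (device selection simplified, error handling mostly omitted, and the kernel itself is just an illustrative example):

              Code:
              /* Hand OpenCL C source to the driver, let it compile at runtime,
               * then launch one work-item per array element. */
              #include <CL/cl.h>
              #include <stdio.h>

              static const char *src =
                  "__kernel void scale_add(__global const float *in,\n"
                  "                        __global float *out, const float f)\n"
                  "{ size_t i = get_global_id(0); out[i] = in[i] * f + 1.0f; }\n";

              int main(void)
              {
                  cl_platform_id platform;
                  cl_device_id device;
                  cl_int err;

                  clGetPlatformIDs(1, &platform, NULL);
                  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
                  cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
                  cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

                  /* The on-the-fly compile step: source text in, GPU code out. */
                  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
                  if (clBuildProgram(prog, 1, &device, "", NULL, NULL) != CL_SUCCESS) {
                      char log[4096];
                      clGetProgramBuildInfo(prog, device, CL_PROGRAM_BUILD_LOG,
                                            sizeof(log), log, NULL);
                      printf("build failed:\n%s\n", log);
                      return 1;
                  }
                  cl_kernel kernel = clCreateKernel(prog, "scale_add", &err);

                  /* Dummy buffers so the kernel has something to work on. */
                  float data[1024] = {0};
                  cl_mem in = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                             sizeof(data), data, &err);
                  cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(data), NULL, &err);
                  float factor = 2.0f;
                  clSetKernelArg(kernel, 0, sizeof(cl_mem), &in);
                  clSetKernelArg(kernel, 1, sizeof(cl_mem), &out);
                  clSetKernelArg(kernel, 2, sizeof(float), &factor);

                  /* Real workloads launch far more work-items than the GPU has cores;
                   * the hardware keeps many of them in flight to hide memory latency. */
                  size_t global = 1024;
                  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
                  clFinish(queue);
                  return 0;
              }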

              • #22
                For those interested in bridgman's posts about the hardware, I highly recommend the articles AnandTech wrote for the GT200/RV770 launches.

                http://www.anandtech.com/video/showdoc.aspx?i=3334&p=2
                http://www.anandtech.com/video/showdoc.aspx?i=3341&p=3

                Yes, they support more than 1000 threads.

                It's very interesting to see the different approaches ATI and NVidia took. NVidia went with very simple SPs, which are all capable of doing all operations, while ATI went with more complex ones which can run multiple instructions at once - with the tradeoff being that the instructions have to be a certain mix, which means the compiler has to be very smart about how to use the resources.

                • #23
                  Originally posted by monraaf View Post
                  Why not have one copy of the pixel shader operating on 8 different pixels, like SIMD? What's the rationale behind using so many processors if they are all running the same code?
                  AFAIK we have used SIMDs for pixel shaders right back to the R300. Not sure if the vertex shaders were SIMD or not. An RV770 has 10 SIMD blocks, with a single program counter per SIMD. Each SIMD block runs the same superscalar instruction on 16 threads, and each instruction can perform up to 5 floating point operations per clock.

                  Other GPU vendors use SIMDs in a similar fashion, but without the superscalar instructions. We have multiple SIMDs because even for a single task you need some granularity to handle the mix of vertex, geometry and pixel shader processing.
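
                  (As a sanity check on those numbers, and assuming I have the figures right: 10 SIMDs x 16 threads x 5 FLOPs per instruction is 800 floating point operations per clock. If each of those is a multiply-add, the usual convention counts it as 2 FLOPs, so at a clock of roughly 750 MHz that works out to about 1200 single-precision GFLOPS - the same figure quoted for the FireStream board earlier in the thread.)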
                  Last edited by bridgman; 02-01-2009, 10:27 PM.

                  • #24
                    Originally posted by Louise View Post
                    So a GPU can have processes running? Does that mean that there could be a "ps" and "top" for GPU processes?

                    It would be so cool to have a Gnome/KDE GPU load monitor

                    [...]
                    Sorry for dragging up an old thread, but yes, you can have a "top" for the GPU. Behold intel_gpu_top:



                    I guess someone still needs to write a load monitor for your favourite desktop
