Rootbeer: A High-Performance GPU Compiler For Java

  • Rootbeer: A High-Performance GPU Compiler For Java

    Phoronix: Rootbeer: A High-Performance GPU Compiler For Java

    In recent months there has been an initiative underway called Rootbeer, which is a GPU compiler for Java code. Rootbeer claims to be more advanced than CUDA or OpenCL bindings for Java as it does static code analysis of the Java Bytecode and takes it automatically to the GPU...

    http://www.phoronix.com/vr.php?view=MTE1ODk
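
    To give a feel for the programming model, here is a rough sketch of what Rootbeer code looks like, based on the project's documented Kernel interface (each kernel's gpuMethod() is translated from bytecode to CUDA and run as a GPU thread). Package names and exact signatures vary between Rootbeer versions, so treat this as illustrative rather than authoritative.

```java
// Illustrative sketch of the Rootbeer programming model; the package name and
// exact API may differ between Rootbeer versions.
import java.util.ArrayList;
import java.util.List;

import edu.syr.pcpratts.rootbeer.runtime.Kernel;
import edu.syr.pcpratts.rootbeer.runtime.Rootbeer;

// One Kernel instance per data element; gpuMethod() becomes one GPU thread.
class ScaleKernel implements Kernel {
    private final float[] data;
    private final int index;

    ScaleKernel(float[] data, int index) {
        this.data = data;
        this.index = index;
    }

    @Override
    public void gpuMethod() {
        data[index] *= 2.0f;   // the work Rootbeer moves to the GPU
    }
}

public class ScaleOnGpu {
    public static void main(String[] args) {
        float[] data = new float[1024 * 1024];
        java.util.Arrays.fill(data, 1.0f);

        List<Kernel> kernels = new ArrayList<Kernel>();
        for (int i = 0; i < data.length; i++) {
            kernels.add(new ScaleKernel(data, i));
        }
        new Rootbeer().runAll(kernels);   // serializes the data and launches CUDA
        System.out.println(data[0]);      // 2.0
    }
}
```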

  • #2
    I guess the only question I have is: will this make my Minecraft run faster on my laptop, given its AMD 5750? Because Optifine made no difference and despite my i7 processor, I'm lucky to hit 50fps on a small(ish) window.

    Fingers crossed and all that.



    • #3
      I think Notch would have to compile Minecraft with Rootbeer for that to work.

      If I understand the principles of Rootbeer correctly, then yes, Minecraft may very well be faster (in theory, at least).

      Edit: The only question that remains for me is: Minecraft is already very GPU-intensive (many, many blocks to render), so wouldn't its performance suffer because the GPU would also have to execute what the CPU normally executes?
      Last edited by bug!; 08-13-2012, 09:58 AM.



      • #4
        Originally posted by bug! View Post
        If I understand the principles of Rootbeer correctly, then yes, Minecraft may very well be faster.
        Not at all. Minecraft would then run on the GPU, but that doesn't make it faster.

        GPUs are not faster than CPUs. They're just optimized for a different kind of workload. The trick is to keep CPU-affine workloads on the CPU, while moving GPU-affine workloads to the GPU.

        Just dumping everything on the GPU is going to end up even worse than using the CPU for rendering.


        But I don't think it would even run on the GPU. Some tasks - most importantly OS calls - need to be executed by the CPU, where the OS resides. Think file system access, audio/video-output, mouse and keyboard input, timing functions like vsync. Those are not available on your GPU.


        So while this project should make it easier to move some suitable sub-routines onto the GPU, just re-compiling everything is neither going to work, nor would it make the game faster.



        • #5
          Firstly, GPUs are faster than CPUs, at least modern ones anyway. Utilizing more than two or three of the cores is where the real challenge comes in, but in raw performance, CPUs are no match.

          However, the one real question is: can we see some benchmarks?



          • #6
            Originally posted by scaine View Post
            I guess the only question I have is: will this make my Minecraft run faster on my laptop, given its AMD 5750? Because Optifine made no difference and despite my i7 processor, I'm lucky to hit 50fps on a small(ish) window.

            Fingers crossed and all that.
            Probably not. Minecraft uses LWJGL, which contains an OpenGL wrapper. OpenGL is OpenGL, no matter whether it's called from C or Java or whatever. Still, maybe Minecraft has some CPU bottlenecks, but executing those on the GPU won't help, since the GPU is already busy doing OpenGL work.

            Rootbeer is more like a Java-based alternative to something like OpenCL. I could imagine that Fork/Join-style workloads would be pretty cool on the GPU (you could also do your rendering that way, but OpenGL is a bit more powerful for that ^^)
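
            For readers who haven't used it: here is a minimal CPU-side Fork/Join sketch (plain JDK, no Rootbeer) of the kind of data-parallel work meant above. The array size and the split threshold are arbitrary values chosen for illustration.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// CPU-side Fork/Join sketch: square every element of a large array in parallel.
// This is the kind of embarrassingly parallel work that could also map to a GPU.
public class SquareTask extends RecursiveAction {
    private static final int THRESHOLD = 10000; // arbitrary split threshold
    private final float[] data;
    private final int from, to;

    SquareTask(float[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected void compute() {
        if (to - from <= THRESHOLD) {
            for (int i = from; i < to; i++) {
                data[i] = data[i] * data[i];
            }
        } else {
            int mid = (from + to) >>> 1;
            invokeAll(new SquareTask(data, from, mid),
                      new SquareTask(data, mid, to));
        }
    }

    public static void main(String[] args) {
        float[] data = new float[1 << 22]; // ~4M elements
        java.util.Arrays.fill(data, 3.0f);
        new ForkJoinPool().invoke(new SquareTask(data, 0, data.length));
        System.out.println(data[0]); // 9.0
    }
}
```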



            • #7
              Rootbeer only works with CUDA; there is no OpenCL backend yet.



              • #8
                The upside: I often prototype in Java. That's probably because I am more experienced with Java than with other languages. This allows me to take advantage of the GPU now, which is neat.

                The downside: After delivering a Java proof-of-concept, I am less likely to rewrite it in a GPU-friendly language because I now have this feature within Java. The world is now burdened with my Java prototypes. MWAHAHAHAHAHA!




                • #9
                  Originally posted by coder543 View Post
                  Firstly, GPUs are faster than CPUs, at least modern ones anyway. Utilizing more than two or three of the cores is where the real challenge comes in, but in raw performance, CPUs are no match.

                  However, the one real question is: can we see some benchmarks?
                  GPUs aren't faster than CPUs, and CPUs aren't faster than GPUs [if you ever want to work in HPC you need to understand this].

                  Analogy: GPUs are like warriors, dumb but with lots of brute strength, while CPUs are the geek squad, very smart but lacking brute strength.

                  For example, if you take a very parallel algorithm like an MxM IDCT in floating point and run it only once on a single block of data, the CPU is worlds faster, because just passing the data to the GPU takes longer than the entire time the CPU needs to complete the operation. BUT if you have something like a video with millions of data blocks, the CPU stagnates very quickly: despite being faster per core, it doesn't have the brute strength, and this is where the GPU shines [and the cost of loading the data into the GPU can be neglected]. Even though every shader unit is massively slower [and dumber] than a CPU core, the GPU has so many of them that it can work in parallel [for example, 1500 shader cores can process 1500 data blocks per cycle while the next 1500 wait <-- in theory it's never that good, but that is the general idea].

                  All this means GPUs are good when you need to crunch numbers with basic operations in massive quantities and/or with very long precision types, and that's basically it. If you try anything else on a GPU, it will be massively slower, because GPUs are not general-purpose computing devices and lack much of the hardware present in a CPU [branch prediction, prefetching, deep pipelining, etc.].

                  So the CPU is as necessary as the GPU, and for the right task each is extremely fast. Don't believe the PR claim that a Tesla can crunch 1 teraflop while a CPU can't: that 1 TFLOP is a best-case figure, measured with a very carefully optimized dataset and nearly useless in real-life software. To get anywhere close to that speed you need to be extremely creative with your code, so that you always pass optimal data [fast enough to keep the GPU fed 100% of the time] in the optimal configuration for the specific GPU you are working with.
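
                  To make the transfer-cost argument above concrete, here is a tiny back-of-the-envelope model in Java. Every number in it (per-block CPU/GPU times, the fixed transfer cost) is an invented assumption purely for illustration, not a measurement.

```java
// Back-of-the-envelope model: GPU offload only pays off once the per-batch
// transfer cost is amortized over enough data blocks. All numbers below are
// illustrative assumptions, not measurements.
public class OffloadBreakEven {
    public static void main(String[] args) {
        double cpuPerBlockMs = 0.050;   // assumed CPU time per data block
        double gpuPerBlockMs = 0.002;   // assumed GPU time per data block
        double transferMs    = 5.0;     // assumed fixed cost to ship a batch to the GPU

        for (int blocks : new int[]{1, 100, 10000, 1000000}) {
            double cpu = blocks * cpuPerBlockMs;
            double gpu = transferMs + blocks * gpuPerBlockMs;
            System.out.printf("%,10d blocks: CPU %10.1f ms, GPU %10.1f ms -> %s wins%n",
                    blocks, cpu, gpu, cpu < gpu ? "CPU" : "GPU");
        }
    }
}
```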



                  • #10
                    Originally posted by jrch2k8 View Post
                    GPUs aren't faster than CPUs, and CPUs aren't faster than GPUs [if you ever want to work in HPC you need to understand this].
                    Well, actually they are, unless you've got a problem which cannot be parallelized.

                    Originally posted by jrch2k8 View Post
                    All this means GPUs are good when you need to crunch numbers with basic operations in massive quantities and/or with very long precision types, and that's basically it. If you try anything else on a GPU, it will be massively slower, because GPUs are not general-purpose computing devices and lack much of the hardware present in a CPU [branch prediction, prefetching, deep pipelining, etc.].
                    They actually have most of that, e.g. pipelining.



                    • #11
                      Originally posted by alexThunder View Post
                      Well, actually they are, unless you've got a problem which cannot be parallelized.
                      They actually have most of that, e.g. pipelining.
                      Most algorithms are very hard to parallelize, and even the parallel-friendly algorithms need optimizations that depend on the GPU and the dataset you use [<-- that is a very hard task -- if you want a widespread reference, google CABAC GPU].

                      They have them, but in a very rudimentary form optimized for GPU tasks, so they differ quite a lot from their CPU counterparts. If you don't believe me, try a matrix multiply [1000x1000, for example], once with branching and once without [pick the CL language you like], and check the time both take to complete [the non-branched version wins by a factor of X], so you see what I mean.

                      This is what I meant when I said neither is faster than the other: they are different tools designed to attack very different scales of problems efficiently.



                      • #12
                        Originally posted by jrch2k8 View Post
                        Most algorithms are very hard to parallelize, and even the parallel-friendly algorithms need optimizations that depend on the GPU and the dataset you use [<-- that is a very hard task -- if you want a widespread reference, google CABAC GPU].

                        They have them, but in a very rudimentary form optimized for GPU tasks, so they differ quite a lot from their CPU counterparts. If you don't believe me, try a matrix multiply [1000x1000, for example], once with branching and once without [pick the CL language you like], and check the time both take to complete [the non-branched version wins by a factor of X], so you see what I mean.

                        This is what I meant when I said neither is faster than the other: they are different tools designed to attack very different scales of problems efficiently.
                        Fortunately I don't have to test this again. I took some pages from the lecture I attended on this topic and combined them for you:

                        http://www.uploadarea.de/files/2g1ae...gv83mxiyyj.pdf

                        It's in German, but that's not that important. It contains (part of) the actual host program and the OpenCL kernel. On the last two pages you'll find some graphs showing how the (very) simple kernel performs against a sequential CPU program, one with PThreads (4-core machine with HT), OpenCL on the CPU, and OpenCL on the GPU. The last page shows the performance of the GPU kernel after some optimizations (better usage of the memory).

                        (Only the simplest kernel is on these pages, not the optimized one.)

                        The y-axis shows the time in seconds. The program was tested with a 1680x1680 matrix on an i7 920 and a GeForce 9800 GX2.

                        FYI: The lecture I attended isn't publicly online anymore, but the most recent one is: http://pvs.uni-muenster.de/pvs/lehre...vorlesung.html (German)
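
                        For anyone who wants a comparable CPU baseline in Java (the language this thread is about), here is a minimal sketch of a matrix multiply that can run sequentially or via parallel streams. The 1680x1680 size mirrors the benchmark above; the class and method names are made up for illustration.

```java
import java.util.stream.IntStream;

// Minimal CPU baseline: sequential vs. parallel-streams matrix multiply.
public class MatMulBaseline {
    // C = A * B for square n x n matrices stored row-major.
    static double[] multiply(double[] a, double[] b, int n, boolean parallel) {
        double[] c = new double[n * n];
        IntStream rows = IntStream.range(0, n);
        (parallel ? rows.parallel() : rows).forEach(i -> {
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aik * b[k * n + j];   // each thread owns row i of C
                }
            }
        });
        return c;
    }

    public static void main(String[] args) {
        int n = 1680;                       // same size as the benchmark above
        double[] a = new double[n * n], b = new double[n * n];
        java.util.Arrays.fill(a, 1.0);
        java.util.Arrays.fill(b, 2.0);
        long t0 = System.nanoTime();
        double[] c = multiply(a, b, n, true);
        System.out.printf("parallel: %.2f s, c[0]=%.1f%n",
                (System.nanoTime() - t0) / 1e9, c[0]);
    }
}
```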



                        • #13
                          Originally posted by alexThunder View Post
                          Well, actually they are, unless you've got a problem which cannot be parallelized.
                          They are if and only if you have a GPU-suitable workload. If you don't, they're slower. Not sure why you insist otherwise.

                          Is it because they're said to have more FLOPS? Sure they do, in theory. Now compare BOPS, Branching Operations per Second, and see your GPU weep.

                          There are quite a few requirements for a workload to be GPU-suitable (a sketch of a workload that meets them follows at the end of this post):
                          * it has to be massively parallelizable
                          * all parallel threads must be homogeneous. CPUs can easily run a physics thread on one core and a gameplay thread on a second, which isn't easy to do on GPUs.
                          * no communication between the threads.
                          * it should contain as few branches as possible and use simple data structures. GPUs are a lot less forgiving if your cache locality sucks.
                          * There must not be any latency requirements or actual streaming of data. You send all the input, you wait, you get all the output. No partial data anywhere.
                          * The whole workload must take long enough to overcome the overhead of setting up the GPU.

                          Of course a 40-second task with the textbook-algorithm for parallelizability is going to end up faster on the GPU. That doesn't prove anything for the general case.
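
                          As a concrete illustration of a workload that ticks those boxes (massively parallel, homogeneous threads, no inter-thread communication, essentially branch-free apart from the clamping), here is a minimal plain-Java sketch of a brightness adjustment over a large pixel buffer; on a GPU, each loop iteration would become one independent thread. The buffer size and scale factor are arbitrary assumptions.

```java
// A GPU-friendly workload: every output element depends only on the matching
// input element, with no communication between iterations. On a GPU, each
// loop iteration would map to one independent thread.
public class Brighten {
    static void brighten(int[] pixels, float factor) {
        for (int i = 0; i < pixels.length; i++) {
            int p = pixels[i];
            int r = Math.min(255, (int) (((p >> 16) & 0xFF) * factor));
            int g = Math.min(255, (int) (((p >> 8) & 0xFF) * factor));
            int b = Math.min(255, (int) ((p & 0xFF) * factor));
            pixels[i] = (r << 16) | (g << 8) | b;
        }
    }

    public static void main(String[] args) {
        int[] pixels = new int[1920 * 1080];   // arbitrary full-HD frame
        java.util.Arrays.fill(pixels, 0x406080);
        brighten(pixels, 1.5f);
        System.out.printf("first pixel: 0x%06X%n", pixels[0]);
    }
}
```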



                          • #14
                            Originally posted by rohcQaH View Post
                            They are if and only if you have a GPU-suitable workload. If you don't, they're slower. Not sure why you insist otherwise.
                            I don't. In general, my point is that they're just faster, although they might not be usable for everything.

                            For instance: motorbikes are usually faster than cars. The fact that they can hardly carry anything compared to a car doesn't make them slower, does it?

                            Originally posted by rohcQaH View Post
                            * it has to be massively parallelizable
                            Leave out the "massively". Even if a problem is not that well parallelizable, the parallelism might still be enough to outperform the CPU by a significant margin.

                            Originally posted by rohcQaH View Post
                            * all parallel threads must be homogeneous. CPUs can easily run a physics thread on one core and a gameplay thread on a second, which isn't easy to do on GPUs.
                            Ever heard of PhysX? It's used in some games. Guess where that is executed.

                            Originally posted by rohcQaH View Post
                            * no communication between the threads.
                            And how would you do synchronization then?

                            Originally posted by rohcQaH View Post
                            * it should contain as few branches as possible and use simple data structures. GPUs are a lot less forgiving if your cache locality sucks.
                            Yes, although the number of branches and using local memory are not necessarily related (in terms of performance).
                            Still, if you look at the PDF I uploaded, the graphs on the last page show the naive OpenCL implementation, usage of warp sizes, and usage of local memory/cache (in that order).
                            Without locality, it's still fast.

                            Originally posted by rohcQaH View Post
                            * There must not be any latency requirements or actual streaming of data. You send all the input, you wait, you get all the output. No partial data anywhere.
                            * The whole workload must take long enough to overcome the overhead of setting up the GPU.
                            Yes.

                            Originally posted by rohcQaH View Post
                            Of course a 40-second task with the textbook-algorithm for parallelizability is going to end up faster on the GPU. That doesn't prove anything for the general case.
                            Right, therefore you should do a bit more than that ;P



                            • #15
                              Originally posted by alexThunder View Post
                              For instance: motorbikes are usually faster than cars. The fact that they can hardly carry anything compared to a car doesn't make them slower, does it?
                              With cars you have a pretty strict metric of speed = distance over time. Which metric are you using to claim that GPUs are faster than CPUs? "Computation over time" is just too hard to define, and you'll find definitions that favor either side. Thus, neither can be declared the winner.

                              Originally posted by alexThunder View Post
                              Ever heard of PhysX? It's used in some games. Guess where that is executed.
                              It can run on either. If it does run on the GPU, it does not run concurrently with other GPU threads. They're run one after the other, with those expensive context switches and CPU involvement in between.

                              On the CPU, both could run concurrently on their own cores with virtually no overhead.

                              (Aside: there is evidence that PhysX could be faster on the CPU, but nVidia has purposefully crippled the CPU implementation to make their GPU compute look better.)

                              Originally posted by alexThunder View Post
                              And how would you do synchronization then?
                              On a GPU? You don't. The only synchronization primitive is "The CPU task is informed that the current batch of data has been processed and the results are ready."
                              Which rules out quite a few parallel algorithms.

