Rootbeer: A High-Performance GPU Compiler For Java
-
Originally posted by alexThunder:
I don't. In general, my point is that they're just faster, although they might not be usable for everything.
For instance: motorbikes are usually faster than cars. The fact that they can hardly carry anything compared to a car doesn't make them slower, does it?
-
Originally posted by alexThunder:
Then tell me what, e.g., local/global memory fences do in OpenCL, or what they're good for.
They are useful for making sure that threads within a workgroup don't get out of sync, but they CANNOT be used to synchronize all global work items in an OpenCL kernel invocation. Trust me, I've tried to build global synchronization mechanisms in OpenCL (and found fun ways to lock up my GPU in the process).
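As a CPU-side analogy (a minimal Java sketch, not OpenCL; all names here are illustrative): a CyclicBarrier behaves like barrier() inside a single workgroup, in that every thread in a fixed group must reach it before any proceeds. The crucial difference on the GPU is that no such primitive spans workgroups.

```java
import java.util.concurrent.CyclicBarrier;

public class WorkgroupBarrierDemo {
    public static void main(String[] args) throws InterruptedException {
        final int groupSize = 4;
        final int[] partial = new int[groupSize];
        // Like barrier(CLK_LOCAL_MEM_FENCE): all 4 "work items" must arrive
        // before any of them continues past this point.
        final CyclicBarrier barrier = new CyclicBarrier(groupSize);

        Thread[] threads = new Thread[groupSize];
        for (int id = 0; id < groupSize; id++) {
            final int tid = id;
            threads[id] = new Thread(() -> {
                partial[tid] = (tid + 1) * 10;   // phase 1: each thread writes its own slot
                try {
                    barrier.await();             // wait until the whole group arrives
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
                if (tid == 0) {                  // phase 2: thread 0 can now safely read all slots
                    int sum = 0;
                    for (int p : partial) sum += p;
                    System.out.println("sum=" + sum);
                }
            });
            threads[id].start();
        }
        for (Thread t : threads) t.join();
    }
}
```

This works only because the group is small and fixed; trying to build the same wait-for-everyone handshake across all global work items of a kernel is exactly what locks up the GPU.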
-
Originally posted by rohcQaH:
With cars you have a pretty strict metric: speed = distance over time. Which metric are you using to claim that GPUs are faster than CPUs? "Computation over time" is just too hard to define, and you'll find definitions that favor either side.
Btw, it's time to respond.
Originally posted by rohcQaH:
Thus, neither can be declared the winner.
Originally posted by rohcQaH:
It can run on either. If it does run on the GPU, it does not run concurrently with other GPU threads. They're run one after the other, with those expensive context switches and CPU involvement in between.
On the CPU, both could run concurrently on their own cores with virtually no overhead.
Originally posted by rohcQaH:
On a GPU? You don't. The only synchronization primitive is "the CPU task is informed that the current batch of data has been processed and the results are ready."
Which rules out quite a few parallel algorithms.
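A host-side Java sketch of that batch model (the work items here are made up for illustration): the only "synchronization" the host gets is "the whole batch is done", which is what invokeAll expresses. There is no way for one in-flight task to signal another mid-batch the way CPU threads can with locks or condition variables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchModelDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Build one "batch" of independent work items (here: squaring numbers).
        List<Callable<Integer>> batch = new ArrayList<>();
        for (int i = 1; i <= 8; i++) {
            final int x = i;
            batch.add(() -> x * x);
        }

        // invokeAll returns only when EVERY task has finished -- the GPU-style
        // "results are ready" notification. No partial results, no mid-batch signaling.
        List<Future<Integer>> results = pool.invokeAll(batch);

        int sum = 0;
        for (Future<Integer> f : results) sum += f.get();
        System.out.println("sum of squares = " + sum);

        pool.shutdown();
    }
}
```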
-
Originally posted by alexThunder:
For instance: motorbikes are usually faster than cars. The fact that they can hardly carry anything compared to a car doesn't make them slower, does it?
Originally posted by alexThunder:
Ever heard of PhysX? It's used in some games. Guess where this is executed.
On the CPU, both could run concurrently on their own cores with virtually no overhead.
(Aside: there is evidence that PhysX could be faster on the CPU, but nVidia has purposefully crippled the CPU implementation to make their GPU compute look better.)
Originally posted by alexThunder:
And how would you do synchronization then?
Which rules out quite a few parallel algorithms.
-
Originally posted by rohcQaH:
They are if and only if you have a GPU-suitable workload. If you don't, they're slower. Not sure why you insist otherwise.
For instance: motorbikes are usually faster than cars. The fact that they can hardly carry anything compared to a car doesn't make them slower, does it?
Originally posted by rohcQaH:
* It has to be massively parallelizable.
Originally posted by rohcQaH:
* All parallel threads must be homogeneous. CPUs can easily run a physics thread on one core and a gameplay thread on a second, which isn't easy to do on GPUs.
Originally posted by rohcQaH:
* No communication between the threads.
Originally posted by rohcQaH:
* It should contain as few branches as possible and simple data structures. GPUs are a lot less forgiving if your cache locality sucks.
Still, if you look at the PDF I uploaded, the graphs on the last page show a naive OpenCL implementation, usage of warp sizes, and usage of local memory/cache (in that order).
Even without locality, it's still fast.
Originally posted by rohcQaH:
* There must not be any latency requirements or actual streaming of data. You send all the input, you wait, you get all the output. No partial data anywhere.
* The whole workload must take long enough to overcome the overhead of setting up the GPU.
Originally posted by rohcQaH:
Of course a 40-second task with the textbook algorithm for parallelizability is going to end up faster on the GPU. That doesn't prove anything for the general case.
-
Originally posted by alexThunder:
Well, actually they are, unless you have a problem which cannot be parallelized.
Is it because they're said to have more FLOPS? Sure they do, in theory. Now compare BOPS, branching operations per second, and see your GPU weep.
There are quite a few requirements for a workload to be GPU-suitable:
* It has to be massively parallelizable.
* All parallel threads must be homogeneous. CPUs can easily run a physics thread on one core and a gameplay thread on a second, which isn't easy to do on GPUs.
* No communication between the threads.
* It should contain as few branches as possible and simple data structures. GPUs are a lot less forgiving if your cache locality sucks.
* There must not be any latency requirements or actual streaming of data. You send all the input, you wait, you get all the output. No partial data anywhere.
* The whole workload must take long enough to overcome the overhead of setting up the GPU.
Of course a 40-second task with the textbook algorithm for parallelizability is going to end up faster on the GPU. That doesn't prove anything for the general case.
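To illustrate the "few branches" rule, here is a small Java sketch (CPU-side and purely illustrative): a branchy absolute value versus an arithmetic, branch-free one. On a GPU, work items in the same warp that take different sides of the if are serialized (divergence), so GPU-friendly kernels prefer the second form; on a CPU the two are nearly equivalent.

```java
public class BranchlessDemo {
    // Branchy version: fine on a CPU, but on a GPU, threads in one warp that
    // disagree on the condition execute both paths one after the other.
    static int absBranchy(int x) {
        if (x < 0) {
            return -x;
        }
        return x;
    }

    // Branch-free version: the sign is folded into arithmetic, so every
    // thread executes the exact same instruction sequence.
    static int absBranchless(int x) {
        int mask = x >> 31;          // 0 for x >= 0, all ones for x < 0
        return (x ^ mask) - mask;    // conditionally negates without an if
    }

    public static void main(String[] args) {
        int[] samples = {-7, -1, 0, 3, 42};
        for (int x : samples) {
            if (absBranchy(x) != absBranchless(x)) {
                throw new AssertionError("mismatch at " + x);
            }
        }
        System.out.println("branchy and branchless agree on all samples");
    }
}
```

The same rewrite style (select/mask arithmetic instead of if/else) is what the "non-branched" matrix multiply mentioned later in the thread relies on.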
-
Originally posted by jrch2k8:
Most algorithms are very hard to parallelize, and those parallel-friendly algorithms need optimizations depending on the GPU and the dataset you use [<-- this is a very hard task; for a widespread reference, google CABAC GPU].
They have them, but very rudimentary and optimized for GPU tasks, so they differ quite a lot from their CPU counterparts. Don't believe me? Try a matrix multiply [1000x1000, for example], one with branching and one without [pick the CL language you like], and check the time both take to complete [the non-branched version wins by a factor of X], so you see what I mean.
This is what I meant when I said neither is faster than the other; they are different tools, designed to efficiently attack problems of very different scales.
It's in German, but that's not that important. There is (some part of) the actual host program and the OpenCL kernel. On the last two pages you'll find some graphs, which show how the (very) simple kernel performs against a sequential CPU program, one with PThreads (4-core machine with HT), OpenCL on the CPU, and OpenCL on the GPU. The last page shows the performance of the GPU kernel after some optimizations (better usage of the memory).
(Only the most simple kernel is on these pages, not the optimized one.)
The y-axis shows the time in seconds. The program was tested with a 1680x1680 matrix on an i7 920 and a GeForce 9800 GX2.
FYI: The lecture I attended isn't publicly online anymore, but the most recent one is: http://pvs.uni-muenster.de/pvs/lehre...vorlesung.html (German)
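The comparison described above (sequential vs. a PThreads-style parallel version) can be sketched on the CPU side in plain Java; the matrix size and thread count below are illustrative, not the ones from the slides. Each thread owns a contiguous block of rows, which is also roughly how the naive OpenCL kernel divides the work.

```java
public class MatMulDemo {
    // Sequential reference: C = A * B for n x n matrices.
    static double[][] multiplySeq(double[][] a, double[][] b, int n) {
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)       // k-before-j loop order for better cache locality
                for (int j = 0; j < n; j++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    // PThreads-style version: each thread computes a contiguous block of rows,
    // so no two threads ever write the same cell and no locking is needed.
    static double[][] multiplyPar(double[][] a, double[][] b, int n, int nThreads)
            throws InterruptedException {
        double[][] c = new double[n][n];
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int lo = t * n / nThreads, hi = (t + 1) * n / nThreads;
            workers[t] = new Thread(() -> {
                for (int i = lo; i < hi; i++)
                    for (int k = 0; k < n; k++)
                        for (int j = 0; j < n; j++)
                            c[i][j] += a[i][k] * b[k][j];
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return c;
    }

    public static void main(String[] args) throws InterruptedException {
        int n = 64;
        double[][] a = new double[n][n], b = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) { a[i][j] = i + j; b[i][j] = i - j; }

        double[][] seq = multiplySeq(a, b, n);
        double[][] par = multiplyPar(a, b, n, 4);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (seq[i][j] != par[i][j]) throw new AssertionError("mismatch");
        System.out.println("parallel result matches sequential");
    }
}
```

The GPU versions in the slides go further (warp-sized workgroups, staging tiles in local memory), but the row-partitioning idea is the same starting point.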
-
Originally posted by alexThunder:
Well, actually they are, unless you have a problem which cannot be parallelized.
They actually have most of that, e.g. pipelining.
They have them, but very rudimentary and optimized for GPU tasks, so they differ quite a lot from their CPU counterparts. Don't believe me? Try a matrix multiply [1000x1000, for example], one with branching and one without [pick the CL language you like], and check the time both take to complete [the non-branched version wins by a factor of X], so you see what I mean.
This is what I meant when I said neither is faster than the other; they are different tools, designed to efficiently attack problems of very different scales.