Rootbeer: A High-Performance GPU Compiler For Java
-
Originally posted by alexThunder:
I don't. In general, my point is that they're just faster, although they might not be usable for everything.
For instance: motorbikes are usually faster than cars. The fact that they can hardly carry anything compared to a car doesn't make them slower, does it?
-
Originally posted by alexThunder:
Then tell me what, e.g., local/global memory fences do in OpenCL, or what they're good for.
They are useful for making sure that threads within a workgroup don't get out of sync, but they CANNOT be used to synchronize all global work items in an OpenCL kernel invocation. Trust me, I've tried to build global synchronization mechanisms in OpenCL (and found fun ways to lock up my GPU in the process).
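As a CPU-side analogy (a minimal Java sketch, not OpenCL; all names here are illustrative): a CyclicBarrier behaves like barrier() inside a single workgroup, in that every thread in a fixed group must reach it before any proceeds. The crucial difference on the GPU is that no such primitive spans workgroups.

```java
import java.util.concurrent.CyclicBarrier;

public class WorkgroupBarrierDemo {
    public static void main(String[] args) throws InterruptedException {
        final int groupSize = 4;
        final int[] partial = new int[groupSize];
        // Like barrier(CLK_LOCAL_MEM_FENCE): all 4 "work items" must arrive
        // before any of them continues past this point.
        final CyclicBarrier barrier = new CyclicBarrier(groupSize);

        Thread[] threads = new Thread[groupSize];
        for (int id = 0; id < groupSize; id++) {
            final int tid = id;
            threads[id] = new Thread(() -> {
                partial[tid] = (tid + 1) * 10;   // phase 1: each thread writes its own slot
                try {
                    barrier.await();             // wait until the whole group arrives
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
                if (tid == 0) {                  // phase 2: thread 0 can now safely read all slots
                    int sum = 0;
                    for (int p : partial) sum += p;
                    System.out.println("sum=" + sum);
                }
            });
            threads[id].start();
        }
        for (Thread t : threads) t.join();
    }
}
```

This works only because the group is small and fixed; trying to build the same wait-for-everyone handshake across all global work items of a kernel is exactly what locks up the GPU.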
-
Originally posted by rohcQaH:
With cars you have a pretty strict metric: speed = distance over time. Which metric are you using to claim that GPUs are faster than CPUs? "Computation over time" is just too hard to define, and you'll find definitions that favor either side.
Btw, it's time to respond.
Originally posted by rohcQaH:
Thus, neither can be declared the winner.
Originally posted by rohcQaH:
It can run on either. If it does run on the GPU, it does not run concurrently with other GPU threads. They're run one after the other, with those expensive context switches and CPU involvement in between.
On the CPU, both could run concurrently on their own cores with virtually no overhead.
Originally posted by rohcQaH:
On a GPU? You don't. The only synchronization primitive is "the CPU task is informed that the current batch of data has been processed and the results are ready."
Which rules out quite a few parallel algorithms.
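A host-side Java sketch of that batch model (the work items here are made up for illustration): the only "synchronization" the host gets is "the whole batch is done", which is what invokeAll expresses. There is no way for one in-flight task to signal another mid-batch the way CPU threads can with locks or condition variables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchModelDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Build one "batch" of independent work items (here: squaring numbers).
        List<Callable<Integer>> batch = new ArrayList<>();
        for (int i = 1; i <= 8; i++) {
            final int x = i;
            batch.add(() -> x * x);
        }

        // invokeAll returns only when EVERY task has finished -- the GPU-style
        // "results are ready" notification. No partial results, no mid-batch signaling.
        List<Future<Integer>> results = pool.invokeAll(batch);

        int sum = 0;
        for (Future<Integer> f : results) sum += f.get();
        System.out.println("sum of squares = " + sum);

        pool.shutdown();
    }
}
```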
-
Originally posted by alexThunder:
For instance: motorbikes are usually faster than cars. The fact that they can hardly carry anything compared to a car doesn't make them slower, does it?
Originally posted by alexThunder:
Ever heard of PhysX? It's used in some games. Guess where this is executed.
On the CPU, both could run concurrently on their own cores with virtually no overhead.
(Aside: there is evidence that PhysX could be faster on the CPU, but nVidia has purposefully crippled the CPU implementation to make their GPU compute look better.)
Originally posted by alexThunder:
And how would you do synchronization then?
Which rules out quite a few parallel algorithms.
-
Originally posted by rohcQaH:
They are if and only if you have a GPU-suitable workload. If you don't, they're slower. Not sure why you insist otherwise.
For instance: motorbikes are usually faster than cars. The fact that they can hardly carry anything compared to a car doesn't make them slower, does it?
Originally posted by rohcQaH:
* It has to be massively parallelizable.
Originally posted by rohcQaH:
* All parallel threads must be homogeneous. CPUs can easily run a physics thread on one core and a gameplay thread on a second, which isn't easy to do on GPUs.
Originally posted by rohcQaH:
* No communication between the threads.
Originally posted by rohcQaH:
* It should contain as few branches as possible and simple data structures. GPUs are a lot less forgiving if your cache locality sucks.
Still, if you look at the PDF I uploaded, the graphs on the last page show a naive OpenCL implementation, usage of warp sizes, and usage of local memory/cache (in that order).
Even without locality, it's still fast.
Originally posted by rohcQaH:
* There must not be any latency requirements or actual streaming of data. You send all the input, you wait, you get all the output. No partial data anywhere.
* The whole workload must take long enough to overcome the overhead of setting up the GPU.
Originally posted by rohcQaH:
Of course a 40-second task with the textbook algorithm for parallelizability is going to end up faster on the GPU. That doesn't prove anything for the general case.
-
Originally posted by alexThunder:
Well, actually they are, unless you have a problem which cannot be parallelized.
Is it because they're said to have more FLOPS? Sure they do, in theory. Now compare BOPS, branching operations per second, and see your GPU weep.
There are quite a few requirements for a workload to be GPU-suitable:
* It has to be massively parallelizable.
* All parallel threads must be homogeneous. CPUs can easily run a physics thread on one core and a gameplay thread on a second, which isn't easy to do on GPUs.
* No communication between the threads.
* It should contain as few branches as possible and simple data structures. GPUs are a lot less forgiving if your cache locality sucks.
* There must not be any latency requirements or actual streaming of data. You send all the input, you wait, you get all the output. No partial data anywhere.
* The whole workload must take long enough to overcome the overhead of setting up the GPU.
Of course a 40-second task with the textbook algorithm for parallelizability is going to end up faster on the GPU. That doesn't prove anything for the general case.
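To illustrate the "few branches" rule, here is a small Java sketch (CPU-side and purely illustrative): a branchy absolute value versus an arithmetic, branch-free one. On a GPU, work items in the same warp that take different sides of the if are serialized (divergence), so GPU-friendly kernels prefer the second form; on a CPU the two are nearly equivalent.

```java
public class BranchlessDemo {
    // Branchy version: fine on a CPU, but on a GPU, threads in one warp that
    // disagree on the condition execute both paths one after the other.
    static int absBranchy(int x) {
        if (x < 0) {
            return -x;
        }
        return x;
    }

    // Branch-free version: the sign is folded into arithmetic, so every
    // thread executes the exact same instruction sequence.
    static int absBranchless(int x) {
        int mask = x >> 31;          // 0 for x >= 0, all ones for x < 0
        return (x ^ mask) - mask;    // conditionally negates without an if
    }

    public static void main(String[] args) {
        int[] samples = {-7, -1, 0, 3, 42};
        for (int x : samples) {
            if (absBranchy(x) != absBranchless(x)) {
                throw new AssertionError("mismatch at " + x);
            }
        }
        System.out.println("branchy and branchless agree on all samples");
    }
}
```

The same rewrite style (select/mask arithmetic instead of if/else) is what the "non-branched" matrix multiply mentioned later in the thread relies on.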
-
Originally posted by jrch2k8:
Most algorithms are very hard to parallelize, and those parallel-friendly algorithms need optimizations depending on the GPU and the dataset you use [<-- this is a very hard task; for a widespread reference, google CABAC GPU].
They have them, but very rudimentary and optimized for GPU tasks, so they differ quite a lot from their CPU counterparts. Don't believe me? Try a matrix multiply [1000x1000, for example], one with branching and one without [pick the CL language you like], and check the time both take to complete [the non-branched version wins by a factor of X], so you see what I mean.
This is what I meant when I said neither is faster than the other; they are different tools, designed to efficiently attack problems of very different scales.
It's in German, but that's not that important. There is (some part of) the actual host program and the OpenCL kernel. On the last two pages you'll find some graphs, which show how the (very) simple kernel performs against a sequential CPU program, one with PThreads (4-core machine with HT), OpenCL on the CPU, and OpenCL on the GPU. The last page shows the performance of the GPU kernel after some optimizations (better usage of the memory).
(Only the most simple kernel is on these pages, not the optimized one.)
The y-axis shows the time in seconds. The program was tested with a 1680x1680 matrix on an i7 920 and a GeForce 9800 GX2.
FYI: The lecture I attended isn't publicly online anymore, but the most recent one is: http://pvs.uni-muenster.de/pvs/lehre...vorlesung.html (German)
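The comparison described above (sequential vs. a PThreads-style parallel version) can be sketched on the CPU side in plain Java; the matrix size and thread count below are illustrative, not the ones from the slides. Each thread owns a contiguous block of rows, which is also roughly how the naive OpenCL kernel divides the work.

```java
public class MatMulDemo {
    // Sequential reference: C = A * B for n x n matrices.
    static double[][] multiplySeq(double[][] a, double[][] b, int n) {
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)       // k-before-j loop order for better cache locality
                for (int j = 0; j < n; j++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    // PThreads-style version: each thread computes a contiguous block of rows,
    // so no two threads ever write the same cell and no locking is needed.
    static double[][] multiplyPar(double[][] a, double[][] b, int n, int nThreads)
            throws InterruptedException {
        double[][] c = new double[n][n];
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int lo = t * n / nThreads, hi = (t + 1) * n / nThreads;
            workers[t] = new Thread(() -> {
                for (int i = lo; i < hi; i++)
                    for (int k = 0; k < n; k++)
                        for (int j = 0; j < n; j++)
                            c[i][j] += a[i][k] * b[k][j];
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return c;
    }

    public static void main(String[] args) throws InterruptedException {
        int n = 64;
        double[][] a = new double[n][n], b = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) { a[i][j] = i + j; b[i][j] = i - j; }

        double[][] seq = multiplySeq(a, b, n);
        double[][] par = multiplyPar(a, b, n, 4);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (seq[i][j] != par[i][j]) throw new AssertionError("mismatch");
        System.out.println("parallel result matches sequential");
    }
}
```

The GPU versions in the slides go further (warp-sized workgroups, staging tiles in local memory), but the row-partitioning idea is the same starting point.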
-
Originally posted by alexThunder:
Well, actually they are, unless you have a problem which cannot be parallelized.
They actually have most of that, e.g. pipelining.
They have them, but very rudimentary and optimized for GPU tasks, so they differ quite a lot from their CPU counterparts. Don't believe me? Try a matrix multiply [1000x1000, for example], one with branching and one without [pick the CL language you like], and check the time both take to complete [the non-branched version wins by a factor of X], so you see what I mean.
This is what I meant when I said neither is faster than the other; they are different tools, designed to efficiently attack problems of very different scales.