GPU-Based Acceleration For PostgreSQL

  • #11
    Originally posted by nakedhitman View Post
    RAM and I/O seem like bigger bottlenecks than compute, these days.
    We have a winner.

    In many types of queries, compiling GPU kernels and routing data through the GPU only slows things down further. GPU-accelerated query processing (or even FPGA-based processing; it's not mainstream, but there is demand) exists for specific types of queries that a) are long-running and b) have a high ratio of computation to I/O. If you were running those types of queries, you'd know.

    Most databases out there, such as backends for web servers, run many small queries. In those scenarios the largest overhead is query parsing and ensuring transaction safety, not the execution time of the relational operators. For a typical installation of WordPress, MediaWiki, or phpBB, I/O is only a problem if the site gets really busy or there isn't enough cache. I'd wager that the hot set of e.g. Phoronix fits entirely in RAM.
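    To make that concrete: if parsing and planning dominate, preparing a statement once and executing it many times buys more than faster operators ever would. A minimal libpq sketch (the connection string and table are made up for illustration; error handling mostly omitted):

      /* prepared.c - amortize parse/plan cost across many small queries.
       * Build: cc prepared.c -lpq */
      #include <libpq-fe.h>
      #include <stdio.h>

      int main(void)
      {
          PGconn *conn = PQconnectdb("dbname=forum");   /* hypothetical DB */
          if (PQstatus(conn) != CONNECTION_OK) return 1;

          /* Parsed and planned exactly once... */
          PQclear(PQprepare(conn, "get_post",
                            "SELECT body FROM posts WHERE id = $1", 1, NULL));

          /* ...then executed many times with only the parameter changing. */
          for (int i = 0; i < 1000; i++) {
              char id[16];
              snprintf(id, sizeof id, "%d", i);
              const char *params[1] = { id };
              PGresult *res = PQexecPrepared(conn, "get_post", 1, params,
                                             NULL, NULL, 0);
              PQclear(res);
          }
          PQfinish(conn);
          return 0;
      }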



    • #12
      @rohcQaH
      I think nakedhitman was referring to memory bandwidth, not memory capacity. Memory bandwidth is a very real problem: many people seem to think "oh, I just need more MHz" or "oh, I just need more GB", but in reality it's additional memory channels and the balance of frequency and latency that get you the best results. On a side note, what I find interesting is that AMD actually has a noticeably faster memory controller than Intel per channel, but Intel has more channels and supports higher frequencies.

      On the note of memory bandwidth and using a GPU for OpenCL: this is where I think AMD will really shine, thanks to their investments in HSA. For some databases, an APU could easily be the best processor you can get, since both CPU and GPU would be actively working on the same data. That said, I feel that in most cases an APU doesn't have any significant advantage over a discrete GPU in terms of memory performance. Maybe I'm wrong, though; I don't do GPGPU development.
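      To illustrate the capacity-vs-bandwidth point, here is a crude STREAM-style triad in plain C (my own sketch, not from the thread): once the memory channels are saturated, more GB or more cores barely move the number, while more channels do.

        /* triad.c - crude memory bandwidth probe (STREAM-style).
         * Build: cc -O2 -fopenmp triad.c -o triad */
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define N (64 * 1024 * 1024)   /* 512 MB per array: far bigger than any cache */

        int main(void)
        {
            double *a = malloc(N * sizeof *a);
            double *b = malloc(N * sizeof *b);
            double *c = malloc(N * sizeof *c);
            if (!a || !b || !c) return 1;
            for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            #pragma omp parallel for
            for (long i = 0; i < N; i++)
                a[i] = b[i] + 3.0 * c[i];    /* triad: 2 loads + 1 store per FLOP */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            /* 3 arrays * 512 MB of traffic; the arithmetic is negligible */
            printf("%.2f GB/s\n", 3.0 * N * sizeof(double) / s / 1e9);
            return 0;
        }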



      • #13
        Originally posted by schmidtbag View Post
        @rohcQaH
        I think nakedhitman was referring to memory bandwidth, not memory capacity. Memory bandwidth is a very real problem: many people seem to think "oh, I just need more MHz" or "oh, I just need more GB", but in reality it's additional memory channels and the balance of frequency and latency that get you the best results. On a side note, what I find interesting is that AMD actually has a noticeably faster memory controller than Intel per channel, but Intel has more channels and supports higher frequencies.

        On the note of memory bandwidth and using a GPU for OpenCL: this is where I think AMD will really shine, thanks to their investments in HSA. For some databases, an APU could easily be the best processor you can get, since both CPU and GPU would be actively working on the same data. That said, I feel that in most cases an APU doesn't have any significant advantage over a discrete GPU in terms of memory performance. Maybe I'm wrong, though; I don't do GPGPU development.
        One person gets it.



        • #14
          Originally posted by schmidtbag View Post
          On the note of memory bandwidth and using a GPU for OpenCL: this is where I think AMD will really shine, thanks to their investments in HSA. For some databases, an APU could easily be the best processor you can get, since both CPU and GPU would be actively working on the same data.
          I was going to say the very same thing.

          I don't think this type of problem is limited to databases, either -- a lot of programming paradigms have been designed around a single thread (or just a few). Up until now, most would correctly argue that OpenCL isn't worth it because of the transfers to graphics memory.

          HSA may end up making highly parallel programming languages (like some functional ones) shine, and APUs along with them. Pure speculation, of course. I want this tech!



          • #15
            Originally posted by schmidtbag View Post
            I think nakedhitman was referring to memory bandwidth, not memory capacity.
            I'm pretty sure his suggestion of using GPU memory as an additional cache refers to capacity, not bandwidth.

            Of course there are situations where the DB gets bottlenecked by memory bandwidth. But if that's the bottleneck, I'm not sure Postgres is the right database.

            Originally posted by schmidtbag View Post
            On the note of memory bandwidth and using a GPU for OpenCL: this is where I think AMD will really shine, thanks to their investments in HSA. For some databases, an APU could easily be the best processor you can get, since both CPU and GPU would be actively working on the same data.
            Maybe. I'm waiting for the first proof of concept implementations and benchmarks to appear.

            HSA is not a free lunch. Sharing memory between CPU and GPU means that memory access gets slower in order to guarantee cache coherency between the two. In our tests, it was often faster to go the traditional OpenCL route: migrate the buffer to the GPU, run the CL kernel with fast memory access, then migrate the buffer back to the CPU. HSA is absolutely an advantage because it enables cheap zero-copy buffer migration; I'm just not yet sold on concurrent access, at least for our uses.
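            For reference, that traditional route looks roughly like this in host code (a sketch with error checks omitted; the kernel is deliberately trivial):

              /* migrate.c - the "traditional" OpenCL route: copy in, compute
               * with fast device-local access, copy out.
               * Build: cc migrate.c -lOpenCL */
              #include <CL/cl.h>
              #include <stdlib.h>

              static const char *src =
                  "__kernel void scale(__global float *v, float f) {"
                  "    v[get_global_id(0)] *= f;"
                  "}";

              int main(void)
              {
                  size_t n = 10 * 1000 * 1000, bytes = n * sizeof(float);
                  float *host = malloc(bytes);
                  for (size_t i = 0; i < n; i++) host[i] = (float)i;

                  cl_platform_id plat; cl_device_id dev;
                  clGetPlatformIDs(1, &plat, NULL);
                  clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
                  cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
                  cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

                  /* 1. migrate the buffer to the GPU */
                  cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);
                  clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, bytes, host, 0, NULL, NULL);

                  /* 2. run the kernel against device-local memory */
                  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
                  clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
                  cl_kernel k = clCreateKernel(prog, "scale", NULL);
                  float f = 2.0f;
                  clSetKernelArg(k, 0, sizeof buf, &buf);
                  clSetKernelArg(k, 1, sizeof f, &f);
                  clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);

                  /* 3. migrate the result back (blocking read = sync point) */
                  clEnqueueReadBuffer(q, buf, CL_TRUE, 0, bytes, host, 0, NULL, NULL);

                  clReleaseMemObject(buf); clReleaseKernel(k); clReleaseProgram(prog);
                  clReleaseCommandQueue(q); clReleaseContext(ctx); free(host);
                  return 0;
              }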



            • #16
              Originally posted by rohcQaH View Post
              Of course there are situations where the DB gets bottlenecked by memory bandwidth. But if that's the bottleneck, I'm not sure Postgres is the right database.
              Depends on the implementation. Unlike games, where the VRAM constantly needs to be refreshed with new data every frame, you could load a hefty chunk of a database into VRAM and have the GPU process that data with its own instructions. GPUs have very high-bandwidth memory, so this could end up giving great results if coded properly. But yeah, if you were to feed the GPU one record at a time, it'd probably be slower than just doing everything on the CPU.
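              Kernel-side, that idea is tiny: keep a column resident in VRAM and evaluate each query's predicate in parallel, one work-item per record. The column and parameter names below are made up for illustration; only the small match bitmap ever travels back over the bus.

                __kernel void scan_price(__global const float *price,  /* VRAM-resident column */
                                         const float threshold,        /* per-query parameter  */
                                         __global uchar *match)        /* result bitmap        */
                {
                    size_t row = get_global_id(0);    /* one work-item per record */
                    match[row] = (price[row] > threshold);
                }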

              HSA is not a free lunch. Sharing memory between CPU and GPU means that memory access gets slower in order to guarantee cache coherency between the two. In our tests, it was often faster to go the traditional OpenCL route: migrate the buffer to the GPU, run the CL kernel with fast memory access, then migrate the buffer back to the CPU. HSA is absolutely an advantage because it enables cheap zero-copy buffer migration; I'm just not yet sold on concurrent access, at least for our uses.
              I thought the whole point of HSA was specifically to eliminate the need for redundant copies in memory and therefore improve memory access? I may be wrong -- I don't fully understand how HSA works. Also, in APUs, isn't the L3 cache shared with the GPU too? Again, I may be wrong.



              • #17
                Originally posted by schmidtbag View Post
                Depends on the implementation. Unlike games, where the VRAM constantly needs to be refreshed with new data every frame, you could load a hefty chunk of a database into VRAM and have the GPU process that data with its own instructions. GPUs have very high-bandwidth memory, so this could end up giving great results if coded properly. But yeah, if you were to feed the GPU one record at a time, it'd probably be slower than just doing everything on the CPU.
                Sure, but then you don't want a relational database with iterator-based operators; you'll need something that implements batch processing. Take a look at column stores. I'm not aware of any running on the GPU yet (mostly because, for the use cases where column stores get used, VRAM is way too small), but I'd expect those to be the first to jump on APUs and/or profit from FPGAs.
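                To see why the layout matters, compare the two scans below (plain C, illustrative only; real column stores add compression and vectorized operators on top):

                  #include <stddef.h>

                  /* Row store: one 64-byte record per cache line. */
                  struct row { int id; float price; char pad[56]; };

                  /* Iterator-style scan over whole records: drags 64 bytes
                   * across the bus to use 4 of them. */
                  float sum_rows(const struct row *r, size_t n)
                  {
                      float s = 0.0f;
                      for (size_t i = 0; i < n; i++)
                          s += r[i].price;
                      return s;
                  }

                  /* Columnar scan: the price column is a dense array, so the
                   * same query streams sequentially, vectorizes trivially, and
                   * maps 1:1 onto a GPU kernel with one work-item per element. */
                  float sum_column(const float *price, size_t n)
                  {
                      float s = 0.0f;
                      for (size_t i = 0; i < n; i++)
                          s += price[i];
                      return s;
                  }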

                Originally posted by schmidtbag View Post
                I thought the whole point of HSA was specifically to eliminate the need for redundant copies in memory and therefore improve memory access? I may be wrong -- I don't fully understand how HSA works. Also, in APUs, isn't the L3 cache shared with the GPU too? Again, I may be wrong.
                I'm not sure whether the L3 cache is shared, and I'm too lazy to look it up. However, the lower cache levels are not shared. If you want a buffer to be accessible from both CPU and GPU, there's a penalty for that, because at least one of them must bypass the lower cache levels.

                You can change the type of memory access for each buffer via the OpenCL API when mapping the buffer, and the runtime will ensure that the correct access mode is used. Lazy programmers can just set everything to shared (and enjoy the race conditions), but according to our preliminary tests, that's probably not going to be the fastest way to do it.
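                The post doesn't say which mechanism they used; one way to express per-buffer access modes is OpenCL 2.0 shared virtual memory, sketched below ('ctx' and 'q' are assumed to be an existing context and queue, and the device must report SVM support):

                  #include <CL/cl.h>

                  void svm_modes(cl_context ctx, cl_command_queue q, size_t bytes)
                  {
                      /* Coarse-grained: the CPU must map/unmap around its accesses,
                       * so the runtime knows when to synchronize; steady-state
                       * access stays fast. */
                      float *coarse = clSVMAlloc(ctx, CL_MEM_READ_WRITE, bytes, 0);
                      clEnqueueSVMMap(q, CL_TRUE, CL_MAP_WRITE, coarse, bytes, 0, NULL, NULL);
                      /* ... CPU fills the buffer here ... */
                      clEnqueueSVMUnmap(q, coarse, 0, NULL, NULL);

                      /* Fine-grained ("everything shared"): no map/unmap, CPU and
                       * GPU may touch it concurrently, but every access pays the
                       * coherency cost described above. */
                      float *fine = clSVMAlloc(ctx, CL_MEM_READ_WRITE |
                                                    CL_MEM_SVM_FINE_GRAIN_BUFFER, bytes, 0);

                      clSVMFree(ctx, coarse);
                      clSVMFree(ctx, fine);
                  }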

                Then again, our OpenCL kernels are very small and run over lots of data (e.g. take two arrays with 10 million entries each and add them), so memory bandwidth is our primary concern. YMMV.
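                That kind of kernel is about as bandwidth-bound as code gets; per element it moves 12 bytes (two 4-byte loads, one store) for a single add, so 10 million entries mean roughly 120 MB of traffic for 10 MFLOPs -- the memory bus, not the ALUs, sets the limit:

                  __kernel void vadd(__global const float *a,
                                     __global const float *b,
                                     __global float *c)
                  {
                      size_t i = get_global_id(0);
                      c[i] = a[i] + b[i];   /* 8 bytes read + 4 written per add */
                  }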

