hmm.. what's stronger: 48x 8086 or 1x 80386?
This really doesn't mean much.
AMD Core Counts and Bulldozer: Preparing for an APU World
Who are these guys? And where do they get those wonderful toys from?
I wonder that would mean for programmers.AMD did add that eventually, in a matter of 3 - 5 years, most floating point workloads would be moved off of the CPU and onto the GPU. At that point you could even argue against including any sort of FP logic on the "CPU" at all. It's clear that AMD's design direction with Bulldozer is to prepare for that future.
Last edited by Louise; 12-03-2009 at 06:46 PM.
I'm not sure moving *all* of the floating point logic off the CPU would ever make sense, since all kinds of programs use floating point variables and you still want those programs to run without falling back to SW floating point emulation.
I believe the discussion is more about the SIMD instruction extensions, which work on explicitly vectorized code.
Last edited by bridgman; 12-03-2009 at 07:16 PM.
Code:sed -i 's/%f/%f_gpu/g' *
Are there other things that GPU's are unquestionable better at than CPU's?
I suppose that preforming FP calculations are very parallelizable?
It's typically highly parallelizable low-precision floating point code, but over the last few years a lot of progress has been made to improve the GPU in other areas and I'm sure that will continue.Are there other things that GPU's are unquestionable better at than CPU's?
Not intrinsically, no. Multiplying 2.5 * 3.5 is no more parallel than multiplying 2 * 3. However, many of the applications that heavily use fp hardware are - anything that benefits from SSE support, for example, is probably at least somewhat parallelizable.I suppose that preforming FP calculations are very parallelizable?
Also, GPU hardware is built from the ground up around floating point operations, while most of the code that goes through a cpu is integer based. Floating point hardware is a lot more complicated (and therefore expensive) than integer hardware, so the amount of fp resources that can be justified on a cpu are fairly limited.
I believe they're talking about having a single gpu on chip to handle multiple cpu cores (threads) at a time, so that will let the hardware stretch it's legs a little and of course the compilers will probably be tuned to try to schedule as many fp operations at a time as they can.
It is sort of like the Cell architecture in that they're talking about having different kinds of specialized hardware on the same chip. Cell had a weak general purpose core with a bunch of highly specialized SPUs, while AMD is talking about multiple x86 cores along with a gpu that would handle most of the fp load.So something like the Cell architecture?
I'm just guessing here, but I think the idea would be for the hardware to automatically do all the offloading here, which is different than Cell. With Cell, you had to be very careful about what you were programming where, and I would assume that this stuff from AMD would just take normal code and have the cpu fetch/decode logic forward fp calls through to another part of the chip automatically for you.
Last edited by smitty3268; 12-03-2009 at 09:48 PM.
Sorry, when I mentioned SIMD instruction extensions I was talking about the SIMD extensions which are already built into x86 processors today. SIMD extensions are how most of the serious floating point work is done on x86 today, but unless you're writing math libraries or game engines you probably don't see them.
Originally x86 processors only handled integer work, and a separate x87 coprocessor handled floating point operations (or you trapped down to a software emulation library). The coprocessor had its own set of registers and other state information, but it pulled instructions out of the same stream as the rest of your program. The floating point coprocessor moved onto the same die as the CPU around the 486 days, but the separate instructions and registers remained (and are still there today AFAIK).
Enter SIMD. The Intel MMX extensions added integer SIMD functions -- the ability to process multiple sets of data with a single instruction. AMD's 3DNow! added the first floating point SIMD extensions, targetted at 3D game geometry but useful in other areas as well. Intel's SSE extensions (Streaming SIMD aka Screaming Sindy) added more floating point capabilities along with a third set of registers.
In general compilers don't directly use the SIMD extensions - you get at them through math libraries or calls into hand-tweeked assembler routines. This is starting to change but we're still at the early stages - OpenCL is one attempt to make the SIMD extensions on CPUs generally accessible.
While all this was happening, GPUs were gradually evolving into floating point SIMD engines as well, but with a higher degree of streaming (trading off cache coherence for throughput) and parallelism than CPUs (HD5870 has 20 SIMD engines each executing 80 floating point ops per clock, ie 1600 operations per clock or 3200 FLOPs/clock for MAD).
This obviously reminds one of Seymour Cray's comment about the emerging conflict between highly parallel microprocessor-based systems and optimized supercomputers :
It took a lot of years, but the chickens are finally starting to win.If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?
The GPU vs CPU discussion is primarily about whether the work currently handled by the CPU's SIMD extensions (SSEx) can be handled as well or better by a GPU-type architecture instead. So far it looks pretty good - the biggest application of SIMD floating point was 3D geometry and that moved to the GPU quite a few years ago. Math libraries come next, and many of them have been ported to GPUs. APIs like OpenCL and DirectCompute are designed to pick up most of the remaining workload and isolate it from the hardware specifics.
Last edited by bridgman; 12-03-2009 at 11:23 PM.
Personally I've never thought that Larrabee made much sense -- why put x86 instructions into a massively parallel architecture if you could use a new instruction set and eliminate all the complex instruction decoding? -- but I wouldn't write it off yet.