Originally posted by xfcemint
1) it looks like you know quite a bit more than me about GPU texturing (i'm learning!). when we get to it i was counting on input from Jacob (designer of Kazan), Mitch Alsup (one of the architects behind Samsung's recent GPU), and so on.
2) how 6600-style OoO works. this bit i *do* know about, and i forgot to mention something: the way it works is that every "operation" is monitored for its start and its completion, and it *doesn't matter* whether it's an FSM (like a DIV unit), a single-stage pipeline, a multi-stage pipeline, or an early-out variable-length pipeline: all that matters is that every operation is "monitored" from start to finish, 100% without fail.
consequently, what you describe (the texture unit, with its cache) *can be slotted straight into the 6600 OoO architecture as a Function Unit*.
in fact, we could if necessary add many more than just one of them.
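to make that concrete, here's a minimal, purely illustrative python sketch (it is *not* Libre-SOC code; the unit names and latencies are invented for the example) of what "monitored from start to finish" means: the scoreboard only ever sees issue/busy/done, so an FSM, a pipeline, or a texture unit with its own cache all look identical to it.

```python
# illustrative only: the scoreboard watches issue/busy/done and nothing else,
# so *any* Function Unit - FSM, pipeline, or a texture unit with its own
# cache - can sit behind the same interface.

class FunctionUnit:
    """anything that accepts an op, raises 'busy', and later signals 'done'."""
    def __init__(self, name, latency):
        self.name, self.latency = name, latency
        self.busy, self.countdown = False, 0

    def issue(self, op):
        assert not self.busy          # scoreboard only issues to a free unit
        self.busy, self.countdown, self.op = True, self.latency, op

    def tick(self):
        """advance one clock cycle; return True on completion ('done')."""
        if not self.busy:
            return False
        self.countdown -= 1
        if self.countdown == 0:
            self.busy = False
            return True
        return False

# a hypothetical texture-lookup unit is treated exactly like a DIV FSM or
# an ALU pipeline: a cache miss just means more ticks before 'done'.
units = [FunctionUnit("DIV", latency=12),
         FunctionUnit("ALU", latency=1),
         FunctionUnit("TEXTURE", latency=20)]

units[2].issue("texld r4, (u,v)")
for cycle in range(25):
    for fu in units:
        if fu.tick():
            print(f"cycle {cycle}: {fu.name} completed, result may be written back")
```

the point being: as long as the unit asserts busy when issued and done when finished, the Dependency Matrices neither know nor care what is inside it.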
For Larrabee, if Intel didn't at least provide a separate opcode for bilinear filtering, well, that is beyond stupid. I thought they had a texture unit attached to each x86 core; that would be a minimum for a GPU from my point of view.
It is very hard for me to predict how your solution is going to perform.
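(quick aside for anyone unfamiliar with the term: below is a rough, reference-only python sketch of the standard textbook bilinear-filtering formula - the weighted blend of the four texels surrounding a sample point - which is what a dedicated opcode or texture unit would compute in a single operation. nothing Larrabee-specific is implied, and edge clamping is deliberately omitted.)

```python
# textbook bilinear filtering: blend the four surrounding texels by the
# fractional position of the sample point. doing this with general scalar
# ops costs several multiplies and adds per channel, which is the argument
# for a dedicated opcode / texture unit.

def bilinear(texels, u, v):
    """texels: 2D list of floats; (u, v): sample position in texel space."""
    x0, y0 = int(u), int(v)
    x1, y1 = x0 + 1, y0 + 1
    fx, fy = u - x0, v - y0
    top    = texels[y0][x0] * (1 - fx) + texels[y0][x1] * fx
    bottom = texels[y1][x0] * (1 - fx) + texels[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

tex = [[0.0, 1.0],
       [1.0, 0.0]]
print(bilinear(tex, 0.5, 0.5))   # 0.5: an even blend of all four texels
```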
he then shows precisely how to optimise the *architecture* to get better performance. and there are some real surprises in it: the L1 cache munches enormous amounts of power, for example. you should be able to track the paper down via this https://www.researchgate.net/publica...pen_source_GPU
You have a big OoO scheduler and a rather small ALU compared to a GPU.
the question then becomes: *should* you do that, what's the business case for doing so, and will people pay money for the resultant product?
That kind of CPU has very limited compute resources to compete with a GPU. You try to compensate by using triple instruction issue to get better utilization of ALU units.
Well, if you already have an OoO scheduler, at least your solution looks relatively simple to design.
by tackling head-on what's normally considered to be a "hard" processor design, we've given ourselves the design flexibility to just go "yawn, did we need to up the number of DIV units again? ahh let me just stretch across to the keyboard and change that 2 to a 3, ahh that was really difficult".
What can I say? The simplest way to increase the compute capability of your CPU is SIMD.
we can do exactly the same thing.
we can put down *MULTIPLE* SIMD ALUs.
we do *NOT* have to do the insanity of increasing the SIMD width from 64 bit to 128 bit to 256 bit to 512 bit.
we have the option - the flexibility - to put down 1x 64-bit SIMD ALU, 2x 64-bit SIMD ALUs, 3x, 4x, 8x - and have the OoO Execution Engine take care of it.
(actually, what's more likely to happen is, when we come to do high-performance CPU-GPUs, because the Dependency Matrices increase in size O(N^2), is that we'll put down 1x 64-bit SIMD ALU, 1x 128-bit SIMD ALU, 1x 256-bit SIMD ALU and so on)
and, remember: all of this complexity - whatever happens at the back-end - is entirely hidden from the developer by the "VL" front-end: programs stay exactly the same. *no* need to even know that there are NNN back-end SIMD units.
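as a purely conceptual illustration (python pseudo-model, *not* our HDL, and the ALU counts and lane widths are invented for the example): the program sets VL and issues one vector op; how the elements get striped across whatever back-end SIMD ALUs happen to exist is the hardware's business.

```python
# conceptual model of a VL front-end: one vector 'add' of VL elements,
# executed on however many hidden back-end SIMD ALUs the implementation has.

def vector_add(vl, a, b, num_backend_alus=3, lanes_per_alu=2):
    """the programmer only ever sees 'vl'; the ALU count is a hardware detail."""
    result = [0] * vl
    chunk = num_backend_alus * lanes_per_alu      # elements retired per "cycle"
    for base in range(0, vl, chunk):              # hardware's loop, not the program's
        for alu in range(num_backend_alus):       # each ALU takes its own slice
            lo = base + alu * lanes_per_alu
            for i in range(lo, min(lo + lanes_per_alu, vl)):
                result[i] = a[i] + b[i]
    return result

# identical "program" whether the back-end is 3x 2-lane ALUs or 1x 8-lane ALU:
a, b = list(range(10)), [100] * 10
print(vector_add(10, a, b))
print(vector_add(10, a, b, num_backend_alus=1, lanes_per_alu=8))
```

both calls produce identical results: the program text never changes when the back-end grows or shrinks, which is exactly the property the VL front-end is there to guarantee.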
In that case, you need to look at the common shader code and how it can be compiled for your CPU. Shaders can be run massively in parallel, so how is your CPU going to exploit that?
Think about adding a separate texture unit, connected to the ALU. The texture unit has its own cache and direct access to memory.
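(illustrative aside, hand-written for this post rather than taken from any real shader compiler: the shape of the workload being described is a tiny kernel - one texture lookup plus a handful of ALU ops - repeated over millions of independent fragments, with the texture unit sitting beside the ALU, holding its own small cache and its own path to memory.)

```python
# rough model of the workload: shader code is one small kernel run over many
# independent fragments, and the texture unit has its own cache plus direct
# access to (a stand-in for) memory.

TEXELS = {}  # stand-in for texture memory: (u, v) -> (r, g, b)

class TextureUnit:
    """hypothetical texture unit: own cache, direct path to 'memory'."""
    def __init__(self):
        self.cache = {}

    def sample(self, u, v):
        key = (int(u), int(v))
        if key not in self.cache:                 # miss: fetch directly from memory
            self.cache[key] = TEXELS.get(key, (0, 0, 0))
        return self.cache[key]

def fragment_shader(texunit, u, v, light):
    """a typical tiny shader kernel: one texture lookup plus a few ALU ops."""
    r, g, b = texunit.sample(u, v)
    return (r * light, g * light, b * light)

# the "massively parallel" part: every fragment is independent, so a GPU runs
# thousands of these at once; the open question in this thread is how a
# VL/SIMD CPU back-end keeps its ALUs and texture unit(s) equally busy.
texunit = TextureUnit()
frame = [fragment_shader(texunit, x, y, 0.8)
         for y in range(4) for x in range(4)]
```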
studying and learning about this (and properly implementing it) i think took about 2 out of those 5 months of learning about augmented 6600 from Mitch Alsup.
I'm going to think some more about it, but I think that I don't have sufficient knowledge and experience on this subject to be of much further help. It is too hard to see how the shader code is going to be compiled and how much compute capability per transistor your CPU can provide.