Linux 2.6.36-rc5 Kernel Released; Fixes 14 Year Old Bug

  • #31
    Originally posted by bridgman View Post
    A couple of minor clarifications :

    1. Most modern x86 CPUs have multiple "processors" behind the instruction decoder capable of executing multiple instructions in a single clock. Essentially this is VLIW behind an x86 decoder.

    2. Most modern GPUs are SIMD (same instruction is executed on multiple data elements in parallel) but AFAIK only the ATI/AMD GPUs are VLIW *and* SIMD.

    3. The equivalent of SIMD in an x86 processor is the SIMD instructions (3DNow, SSE etc), while the equivalent of VLIW in an x86 processor is the ability to execute multiple instructions per clock.

    4. The problem with early VLIW processors was that they exposed the VLIW instruction set to application programs. Most current applications of VLIW hide the VLIW instruction set behind either a high level API and driver stack (GPUs) or behind a stable instruction set (with either a hardware instruction decoder or a software translator).
    1. Kinda false. The whole point of VLIW designs is to push some key RISC principles even further: move complex structures out of the processor and replace them with software solutions. For example, the reorder buffer for OoOE doesn't exist in any VLIW processor, since all the instruction packing is done at compile time. AMD's K5 did introduce the ability to translate variable-length x86 instructions into fixed-length, somewhat simpler RISC-like instructions, and the K7 extended that to allow macro- and micro-op fusion, but even then this is far from what a VLIW is supposed to be.

    2. PowerVR SGX GPUs, as far as I know, are VLIW processors with SIMD execution capabilities, but it is indeed true that ATI was the first company to make a VLIW SIMD GPU.

    3. The equivalent to VLIW in a modern-day x86 CPU is absolutely nothing; I think you are confusing it with the term "pipelined superscalar", which also describes the ability to execute more than 1 instruction per cycle. To exemplify further, you can have a pure CISC design that executes more than 1 instruction per clock (a pipelined, superscalar, 1-cycle-latency-for-everything Pentium 1 kind of thing). The SIMD side of the comparison is a separate axis; see the short SSE sketch after this list.

    4. This one is pretty hard for me to understand. Haven't there been programming languages for every type of processor since forever? Aren't APIs supposed to be hardware-agnostic? I'm sure LLVMpipe could prove this one! (BTW, the stable instruction set you talk about isn't VLIW, it's Intel's EPIC, which is based on VLIW but addresses its shortcomings.)
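
    To make the SIMD half of point 3 in the quote concrete, here is a minimal sketch using x86 SSE intrinsics; it is a hypothetical illustration, not code from any driver or kernel:

        /* One SSE instruction (ADDPS) operating on four packed floats at
         * once: the SIMD side of the x86 story, as opposed to the
         * superscalar/VLIW question of issuing several instructions per
         * clock. Hypothetical example. */
        #include <stdio.h>
        #include <xmmintrin.h>   /* SSE intrinsics */

        int main(void)
        {
            float a[4] = { 1.0f,  2.0f,  3.0f,  4.0f};
            float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
            float r[4];

            __m128 va = _mm_loadu_ps(a);      /* load 4 floats into one register */
            __m128 vb = _mm_loadu_ps(b);
            __m128 vr = _mm_add_ps(va, vb);   /* single instruction, 4 additions */
            _mm_storeu_ps(r, vr);

            printf("%.1f %.1f %.1f %.1f\n", r[0], r[1], r[2], r[3]);
            return 0;
        }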

    Comment


    • #32
      1. You are being a bit more literal than I was. Think about the reorder buffers as part of the x86 wrapper, not the VLIW processor itself, and the instruction packing being done by the x86 decoder at runtime rather than by the compiler at compile time. I'm not saying it *is* a VLIW processor, just that the multiple execution units operate in a way that is very similar to the execution units of a VLIW processor.

      You could also argue that a real VLIW processor would include flow control, whereas in a modern x86 CPU the flow control is handled at an instruction decoder level, but arguably this is no different from a 6xx-and-higher shader core where flow control is handled at a clause level while ALU operations only happen within a clause (yes this is a bit of a stretch but I'm only trying to suggest philosophical similarity not that there is an actual VLIW processor inside each x86 CPU).

      2. Didn't know that, thanks.

      3. Again, I'm not saying that an x86 CPU *has* a VLIW processor in it but if you look at the multiple execution units inside any superscalar CPU (doesn't have to be x86) then they're going to look awfully like the multiple execution units in a VLIW processor. The difference is that in a traditional VLIW processor the bundling of multiple instructions is done at compile time, while with a modern superscalar processor the bundling (determining which operations can be issued in parallel) is done at runtime.

      4. The point I was trying to make was that if you compile applications all the way down to VLIW binary code and then distribute the binary, you run into problems when you try to run the same code on older and newer VLIW processors (where the new ones have more parallel execution units). EPIC definitely helped, but my impression was that it didn't scale down to the low end enough to really catch on (i.e. nobody was making $50 EPIC processors). Next step was arguably the "distribute everything in source code" approach, but eventually the idea of distributing in (say) x86 code then extracting the ILP at runtime won out.

      The "stable instruction set" I was talking about was x86, not VLIW.

      Comment


      • #33
        Originally posted by bridgman View Post
        ...since the graphics world essentially uses ... a JIT compiler for shader programs...
        Sorry to bother, but could you elaborate a bit on JIT please? I previously thought shader programs were compiled AOT when passing through the driver layer.

        Comment


        • #34
          Originally posted by OlegOlegovich View Post
          Sorry to bother, but could you elaborate a bit on JIT please? I previously thought shader programs were compiled AOT when passing through the driver layer.
          A JIT compiler translates code "Just In Time", i.e. at runtime, when it knows exactly which hardware it is running on, so each layer from high level down to low level can be optimised for the actual target; AOT just compiles everything "Ahead Of Time", before the program runs, for a fixed target.

          Comment


          • #35
            V!NCENT, sorry, but I just do not understand. What's the point of constantly recompiling statically-typed code with primitive data types? A half-decent optimizing compiler would seem a perfect fit. That's strictly my uninformed opinion.

            Comment


            • #36
              Originally posted by OlegOlegovich View Post
              V!NCENT, sorry, but I just do not understand. What's the point of constantly recompiling statically-typed code with primitive data types? A half-decent optimizing compiler would seem a perfect fit. That's strictly my uninformed opinion.
              Would you please just read the fscking summary of the Wikipedia article I posted here? It explains in what way JIT differs from AOT and what the benefits are over AOT.

              It's in the fscking summary!

              Comment


              • #37
                Originally posted by V!NCENT View Post
                Aren't Phoronix forums lovely

                But I seriously hope bridgman or anyone else will answer my question, if it's not too stupid.

                Comment


                • #38
                  Originally posted by OlegOlegovich View Post
                  Aren't Phoronix forums lovely

                  But I seriously hope bridgman or anyone else will answer my question, if it's not too stupid.
                  OK fine... <_<'

                  JIT compiling can, by its very nature, make on-the-fly optimisations. For example, if you compile AOT for a CPU to do software rendering, you have to restrict yourself to the instructions all CPUs share. If you JIT compile for a CPU, the JIT compiler can go ahead and say "Hey, you've got an SSE instruction set and multiple cores, so I can compile for those instead of the baseline x86 instruction set and make it run faster." (A small code sketch of this follows at the end of this post.)

                  A JIT compiler also runs on the fly, so if it sees that it has to compile the same stuff multiple times, it can just tap into its cache and say "Hey, I've already done that, so I'll just hand back what I produced last time."

                  On GPUs you have Intel, nVidia and ATI cores, each with their own differences, so if you have a single state tracker (the OpenGL API, for example) and you JIT compile, the JIT compiler can take advantage of GPU-specific features to optimise the code. If you let those state trackers compile AOT, you can't do that kind of hardware-specific optimisation.

                  And once again, this is in the Wikipedia article that you were too lazy to read.
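
                  As a rough, hypothetical sketch of the "compile for what the CPU actually has" idea above (plain-C runtime dispatch rather than a real JIT, with made-up render_* stubs), selecting a code path based on detected CPU features looks like this:

                      /* Pick a code path at runtime based on the CPU's features,
                       * the same kind of decision a JIT (or LLVMpipe) can make on
                       * the fly. Uses the GCC/Clang __builtin_cpu_supports()
                       * builtin; the render_* functions are hypothetical stubs. */
                      #include <stdio.h>

                      static void render_generic(void) { puts("baseline x86 path"); }
                      static void render_sse2(void)    { puts("SSE2 path"); }
                      static void render_sse42(void)   { puts("SSE4.2 path"); }

                      int main(void)
                      {
                          __builtin_cpu_init();   /* initialise CPU feature detection */

                          if (__builtin_cpu_supports("sse4.2"))
                              render_sse42();
                          else if (__builtin_cpu_supports("sse2"))
                              render_sse2();
                          else
                              render_generic();

                          return 0;
                      }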

                  Comment


                  • #39
                    @V!NCENT you forgot branch prediction... as the code is running it can be optimised for the most commonly taken branch, whereas that is pretty much impossible with AOT unless you generate a profile first, and even then the behaviour can change at runtime.

                    Comment


                    • #40
                      Originally posted by OlegOlegovich View Post
                      Sorry to bother, but could you elaborate a bit on JIT please? I previously thought shader programs were compiled AOT when passing through the driver layer.
                      No, shaders are always JIT compiled just before execution on the GPU. There's no other way, since the same shader must be able to run on every compatible GPU, regardless of their instruction sets. There's no single GPU architecture like x86: GPUs from different vendors and different families offer wildly different instruction sets.

                      Note that while Direct3D offers an offline compiler, its output is always (JIT) recompiled before being executed on the GPU. OpenGL didn't offer any form of AOT compilation at all before version 4.1 (and even now, the binary code is specific to your GPU & drivers and cannot be redistributed - it's only useful as a cache mechanism).
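
                      As an illustration of that JIT path, here is a rough C sketch of handing GLSL source to the driver at runtime, where it gets compiled for whichever GPU is actually present; it assumes a GL 3.x context and a function loader (e.g. GLEW) are already set up, and the shader itself is a made-up example:

                          /* The driver compiles this GLSL to the GPU's own ISA at
                           * runtime, inside glCompileShader(); nothing GPU-specific
                           * is shipped with the application. Error handling trimmed. */
                          #include <GL/gl.h>

                          static const char *fs_src =
                              "#version 130\n"
                              "out vec4 color;\n"
                              "void main() { color = vec4(1.0, 0.0, 0.0, 1.0); }\n";

                          GLuint build_fragment_shader(void)
                          {
                              GLuint sh = glCreateShader(GL_FRAGMENT_SHADER);
                              glShaderSource(sh, 1, &fs_src, NULL);
                              glCompileShader(sh);   /* JIT step: GLSL -> GPU code */

                              GLint ok = GL_FALSE;
                              glGetShaderiv(sh, GL_COMPILE_STATUS, &ok);
                              return ok ? sh : 0;
                          }

                      The GL 4.1 mechanism mentioned above is glGetProgramBinary()/glProgramBinary(), which only round-trips a driver-specific blob back to the same GPU and driver, so it works as a cache rather than a portable AOT format.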

                      Comment
