Phoronix: Gallium3D LLVMpipe Isn't Yet Fit For ARM
While OpenGL is becoming a requirement for more of the Linux desktops out there, and ARM open-source graphics drivers aren't yet commonplace, using the Gallium3D LLVMpipe software rasterizer on ARM isn't yet a really viable solution...
I keep seeing these articles about LLVMpipe not working well on ARM and I wonder why anyone expects it should? To my knowledge, no one has done any work on improving LLVM performance on ARM. I can't imagine what you expect to have changed since the last time an article was published about this.
I would figure Apple has a very vested interest in LLVM doing well on ARM. (They're likely holding back stuff, but still, they're moving to LLVM/Clang for iOS...)
The problem appears to be that there just isn't enough CPU power. If you need a multicore amd64 platform to run LLVMpipe adequately, there isn't much chance of even an A15 platform doing well. To be fair, doing modern 3D on a regular CPU is hard; look at what happened to Larrabee.
Indeed. Although I meant that no one has done any performance work on LLVMpipe (not LLVM) for ARM. LLVMpipe is full of code to generate SSE* instructions. There's nothing similar for NEON.
I figured it was something similar. I've noticed that most of the Phoronix testing of ARM compilers is done without NEON (excepting a recent article from earlier this week). If LLVMpipe has SSE optimizations and nothing for NEON, then it's no surprise that it's slow on ARM (even setting aside the raw throughput differences of the CPUs in question).
I think when LLVMpipe was first written, LLVM wasn't able to generate good SSE code given vectorized IR, so the authors worked around this by using intrinsics.
I wouldn't really call it "worked around". It is true some intrinsics are used because early llvm versions didn't quite work right (for instance the code for comparison/select where up to llvm 3.0 or so backends choked on doing this vectorized).
However, with llvm IR you also cannot express some things sse (or other vector instruction sets for that matter) can do, and if you try to do it you end up with llvm IR which is too complex for llvm backends to synthesize back into simple cpu instructions.
Some of these I would blame on llvm. For instance, I really hate that llvm doesn't have min/max instructions. They are a fairly general concept that pretty much all vector extensions support, but you need to code them as compare/select, and last time I checked llvm was unable to fuse that back into a min or max. So with only sse2 you end up with a cmp instruction plus and/andnot/or (for the select), and even if you have sse41 the cmp/select isn't really ideal.
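To make the compare/select point concrete, here is a rough scalar sketch in plain C of what a per-lane min looks like when it has to be spelled as a compare followed by an and/andnot/or select. This is not llvmpipe code; the 4-lane width and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4 /* assumed width: four 32-bit lanes, like one SSE register */

/* Per-lane compare: -1 (all bits set) where a < b, 0 otherwise.
   This is the shape of mask a vector compare instruction produces. */
static void cmp_lt_mask(const int32_t a[LANES], const int32_t b[LANES],
                        int32_t mask[LANES])
{
    for (size_t i = 0; i < LANES; i++)
        mask[i] = -(int32_t)(a[i] < b[i]);
}

/* Select without a blend instruction: (mask & a) | (~mask & b).
   The three bitwise operations correspond to and, andnot and or. */
static void select_and_andnot_or(const int32_t mask[LANES],
                                 const int32_t a[LANES],
                                 const int32_t b[LANES],
                                 int32_t out[LANES])
{
    for (size_t i = 0; i < LANES; i++)
        out[i] = (mask[i] & a[i]) | (~mask[i] & b[i]);
}

/* min() spelled as compare + select: four operations per vector
   where a native min instruction would need one. */
static void min_via_cmp_select(const int32_t a[LANES], const int32_t b[LANES],
                               int32_t out[LANES])
{
    int32_t mask[LANES];

    assert(a != NULL && b != NULL && out != NULL);
    cmp_lt_mask(a, b, mask);
    select_and_andnot_or(mask, a, b, out);
}
```

Calling `min_via_cmp_select(a, b, out)` fills `out` with the smaller value from each lane. If the backend cannot recognize this pattern as a min, every one of those steps ends up in the generated code.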
But most of the intrinsics used by llvmpipe don't really fall into that category; they are simply "too weird" to make sense in a generic IR. Take the pack intrinsics: these are very useful and used extensively, and falling back on llvm IR when they aren't available is way more complicated (you can use trunc, but the necessary clamping makes it a mess and the generated code terrible).
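For anyone wondering what the clamping involves, here is a rough scalar sketch in plain C of a signed saturating pack from 32-bit to 16-bit lanes, written as clamp-then-truncate. It is only an illustration, not llvmpipe code: the function names are made up, and the lane order (first input in the low half, second in the high half) follows what the SSE2 signed pack does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp one 32-bit value into the int16_t range, then truncate.
   Without the two comparisons, a plain truncation would wrap
   out-of-range values instead of saturating them. */
static int16_t saturate_i32_to_i16(int32_t x)
{
    if (x > INT16_MAX)
        x = INT16_MAX;
    if (x < INT16_MIN)
        x = INT16_MIN;
    return (int16_t)x; /* now guaranteed to fit */
}

/* Pack two 4-lane int32 vectors into one 8-lane int16 vector:
   lanes 0..3 come from lo, lanes 4..7 from hi. */
static void pack_i32x4_to_i16x8(const int32_t lo[4], const int32_t hi[4],
                                int16_t out[8])
{
    assert(lo != NULL && hi != NULL && out != NULL);
    for (size_t i = 0; i < 4; i++) {
        out[i] = saturate_i32_to_i16(lo[i]);
        out[i + 4] = saturate_i32_to_i16(hi[i]);
    }
}
```

With a pack instruction this whole function is a single operation. Expressed generically it becomes two compares, two selects and a truncation per input vector, which is the mess referred to above.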
Frankly I'm pleasantly surprised it would work at all on arm, I guess the arm backend is in pretty good shape then.
Nothing is stopping someone from adding support for neon intrinsics, however. There is ongoing work right now on altivec intrinsics for powerpc.