Originally posted by gens
- use const and const references wherever possible: this is good for both performance and code safety!
- use references (not pointers) when passing big objects. Passing by address works in C too, but references are known to be non-NULL, which is a great boon for most users. With modern C++ (C++11) you may not even need this, as move semantics can transfer big objects without copying them.
- for functions you call often, define them in headers, giving the compiler the opportunity to inline them. If you use the STL, this is done for you (as templates have to be defined in headers, they are very often inlined by most compilers)
- for "experts": at compile time make evaluation templates that compute the constant values of the expressions
- for very tight code you may need intrinsics, or to write compiler-friendly code so the compiler "catches" the optimizations you want.
Is there anything I missed?
Let's compare that with what a compiler does, and what you have to do yourself when writing your own assembly optimizations:
- for jumps, you have to take into account the instruction distance; it is better to use short jumps over long jumps
- you have to know instruction encoding lengths (timing is not the issue on out-of-order CPUs), for example using XOR instead of MOV to set a register to zero. This can change which kind of jump is needed
- you have to do register coloring (an NP-hard problem) yourself to assign registers optimally. There is a huge body of literature on register allocation strategies. "Linear Scan Register Allocation" (LSRA) is good enough for most purposes, but full coloring is also a good strategy: it increases register allocation time by about 10x (per the paper comparing the Client and Server compilers in Java's HotSpot), but gives something like 10-15% faster code. The greedy allocator in LLVM 3.0+ is a way to allocate registers better than LSRA without doing full coloring (it is very similar to the full RA algorithm from Stanford's compiler course, but it doesn't work over the entire function). Good luck matching the compiler's default register allocation by hand in a non-trivial function.
- for a medium or long function, you have to work out by yourself all the redundancies and simplifications that a compiler would do for you (shown in the previous post)
- you have to arrange instructions so they are not inter-dependent, so that as many computations as possible execute in parallel in the out-of-order CPU pipeline
- you have to inline functions by hand, which also means re-assigning registers
- you don't have such great profilers; as far as I know, Valgrind (Cachegrind) is basically the only free one (it also works for C/C++). VTune's tools are out of reach for hobby developers
- you have to take into account different CPU instruction sets, caches, CPU features and so on
- you have to do all of this, and you have to do it safely
The list could be longer (for example, calling conventions can be tuned; it is not such a big deal to write __fastcall before a function, and that can be done on the C++ side, though it has to be matched in the assembly too). But for the main part, I'd say both lists fairly reflect my understanding of what assembly and C++ optimization imply.