Originally posted by NotMine999
- Mutex unlocking has subtly different semantics from its userland counterpart, which causes bugs if the developers using it are unaware of the differences. This is done just for the sake of a tiny bit more performance: unlocking a mutex will continue accessing the mutex's memory to process the waiter list, just to let a thread in the fast lock path execute sooner.
- Ticketed spinlocks were invented to reduce SMP contention on NUMA systems. This one is not insane, but the idea that anyone would try to squeeze extra performance out of a fundamental locking mechanism that nobody thought could be made better is just mind-boggling.
- RCU has been used everywhere to improve concurrency, including in certain trees (which is a pain to understand)
- Efforts have been made to eliminate locking in favor of the absolute minimum of memory barriers necessary to make things work as fast as they can, even on the DEC Alpha (which has the most relaxed memory ordering model there is).
- CPU prefetch has been overused to the point of harming performance in some cases, such as in linked list traversal.
- Compiler hints in the form of likely and unlikely are peppered throughout the code. This could turn into hints for the CPU branch predictor depending on the architecture, or can just result in ordering things differently so that the branch predictor is inclined to predict a certain way. It is an extremely esoteric form of optimization.
- Various tiny accessor/setter functions have been made into preprocessor definitions to forcibly inline them and avoid function call overhead. It is possible to program without such functions, but using them makes things more maintainable. Doing it the way the kernel has done it gives you the best of both worlds, but makes debugging more difficult, as you cannot instrument preprocessor definitions.
- container_of was implemented to allow certain structures that extend other structures to be implemented in a way that avoids a single pointer. The way that this is implemented uses typeof, which is a GNU compiler extension rather than part of the C standard (it was only standardized in C23).
- Plenty of kernel config options have remarks about slight slowdowns or small increases in memory usage.
- skbufs are incredibly ugly, but they reduce pointer indirections to increase network performance.
- Kernel virtual memory has been crippled to ensure that it is not used very much for the sake of slightly faster execution.
- Direct reclaim has been implemented to make memory allocations complete sooner under low memory situations (although this is debatable).
- They implemented an ugly hack called ->bmap so that swap files are as performant as swap devices. This is not compatible with anything that is not an in-place filesystem, and its value is questionable, but that is a tangent for another thread.
- They adopted a few hacks from IRIX to improve performance in certain areas, namely the non-standardized (and poorly defined) O_DIRECT and short extended attributes (which can be stored in inodes to avoid an extra disk access or two).
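To make the container_of point in the list above concrete, here is a minimal sketch of the idea in standard C. It is a simplified stand-in, not the kernel's macro: the kernel version adds a typeof-based type check, while this one relies only on offsetof. The names list_node, task, task_from_node and sum_ids are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof */

/* Simplified container_of: subtract the member's offset from the member's
 * address to get the enclosing struct. No typeof type check here, which is
 * the part the kernel version adds. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical generic list node, embedded in whatever needs to be linked. */
struct list_node {
    struct list_node *next;
};

/* Hypothetical payload struct: it embeds the node instead of the node
 * carrying a pointer back to the payload, which saves one pointer. */
struct task {
    int id;
    struct list_node link;
};

/* Recover the enclosing task from a pointer to its embedded node. */
static struct task *task_from_node(struct list_node *n)
{
    assert(n != NULL);
    return container_of(n, struct task, link);
}

/* Walk a chain of embedded nodes and sum the ids of the enclosing tasks. */
static int sum_ids(struct list_node *head)
{
    int sum = 0;
    for (struct list_node *n = head; n != NULL; n = n->next)
        sum += task_from_node(n)->id;
    return sum;
}
```

The list code only ever sees struct list_node, yet each caller can get back to its own structure without storing an extra back-pointer per node.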
Of course, there are areas where you will see that not much effort has been spent (like /dev/urandom until recently), but overall, the Linux kernel’s mainline developers do more for performance than those of just about any other platform. In some benchmarks comparing platforms, the differences show themselves quite prominently.
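For anyone who has not run into the likely and unlikely hints mentioned in the list, this is roughly what they look like. It is a sketch assuming GCC or Clang, where they are usually built on __builtin_expect; the fallback branch and the sum_buffer function are assumptions added for illustration, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* On GCC and Clang, __builtin_expect tells the compiler which outcome of a
 * condition is expected. Elsewhere the macros fall back to a plain test. */
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (!!(x))
#define unlikely(x) (!!(x))
#endif

/* Hypothetical example: sum a buffer, treating a NULL buffer as the rare
 * error case so the compiler can keep the common path straight-line. */
static long sum_buffer(const int *buf, size_t len)
{
    if (unlikely(buf == NULL))
        return -1;

    long total = 0;
    for (size_t i = 0; i < len; i++)
        total += buf[i];
    return total;
}

/* Small usage check: the hints never change results, only code layout. */
static void check_sum_buffer(void)
{
    const int values[] = { 1, 2, 3 };
    assert(sum_buffer(values, 3) == 6);
    assert(sum_buffer(NULL, 0) == -1);
}
```

The hint is purely advisory: a wrong guess costs some speed on that path, but the function returns the same value either way.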