Linux Kernel Live Patching Working Fairly Well For Millions Of Meta Servers
As with most organizations that adopt kernel live-patching, Meta turned to it to reduce server downtime from kernel updates, primarily the never-ending flow of security fixes. Fully rebooting servers, with their often lengthy POST times, can be rather problematic, whereas with kernel live-patching the fixes can be applied to the running kernel near-seamlessly when everything goes according to plan.
Livepatching allows for kernel functions to be safely patched in-place at run-time. Beyond the livepatch infrastructure within the kernel, Meta went with Red Hat's Kpatch while SUSE continues to maintain kGraft and Oracle also has Ksplice.
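For those wanting to see what is live-patched on a given machine, the kernel's livepatch infrastructure exposes each loaded patch as a directory under /sys/kernel/livepatch, with "enabled" and "transition" attributes. Below is a minimal Python sketch that reads that state. The helper name list_livepatches and its root parameter are made up for illustration, and this is not tooling described by Meta; it only assumes the sysfs layout from the upstream kernel documentation.

```python
from pathlib import Path


def list_livepatches(root="/sys/kernel/livepatch"):
    """Return {patch_name: {"enabled": ..., "transition": ...}} for loaded livepatches.

    Hypothetical helper for illustration. `root` is a parameter only so the
    function can be pointed at a test directory instead of the real sysfs tree.
    """
    base = Path(root)
    patches = {}
    if not base.is_dir():
        # Kernel built without livepatch support, or sysfs is not available.
        return patches
    for entry in sorted(base.iterdir()):
        if not entry.is_dir():
            continue
        state = {}
        for attr in ("enabled", "transition"):
            try:
                # Each attribute file holds "0" or "1" followed by a newline.
                state[attr] = (entry / attr).read_text().strip() == "1"
            except OSError:
                # Attribute missing or unreadable: report it as unknown.
                state[attr] = None
        patches[entry.name] = state
    return patches
```

Calling list_livepatches() with no arguments on a live-patched system should return one entry per loaded patch module; a patch whose "transition" value is true is still being applied or reverted across running tasks.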
Along the way in using Linux live-patching on "millions of servers", Meta's engineers have run into problems with tracing as well as some performance problems. The reported performance problems are possible 1~2 second periods of higher I/O and fsync latency, as well as higher TCP retransmit rates, while a live patch is being applied.
Meta engineers continue working on dealing with corner cases, better handling for cases like Clang-compiled PGO-optimized kernel builds, and other items to increase robustness.
Those curious about Meta's work on kernel live-patching at scale can see the LPC 2022 slide deck and the video recording.