Linux Kernel Live Patching Working Fairly Well For Millions Of Meta Servers

As with most organizations looking at kernel live-patching, Meta turned to it to reduce server downtime during kernel updates, primarily for the never-ending flow of security updates. Fully rebooting servers, with their often lengthy POST times, can be rather problematic, whereas kernel live-patching lets them move near-seamlessly to the updated kernel code when everything goes according to plan.
Livepatching allows kernel functions to be safely patched in place at run-time. On top of the livepatch infrastructure within the kernel, Meta went with Red Hat's Kpatch; SUSE continues to maintain kGraft, and Oracle has Ksplice.
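For those wanting to see what in-place patching looks like on a running system, the kernel's livepatch infrastructure exposes each loaded patch as a directory under /sys/kernel/livepatch, with `enabled` and `transition` attributes. Below is a minimal sketch, not anything taken from Meta's tooling, that reads those attributes. The function name `list_livepatches` and its `root` parameter are made up for illustration, and the sketch assumes a kernel built with livepatch support.

```python
from pathlib import Path


def list_livepatches(root="/sys/kernel/livepatch"):
    """Map each loaded livepatch name to its 'enabled' and 'transition' state.

    Each state is True/False, or None when the attribute cannot be read.
    `root` is a parameter only so the function can be pointed at a test
    directory; on a real system the default sysfs path applies.
    """
    base = Path(root)
    patches = {}
    if not base.is_dir():
        # No livepatch support in this kernel, or sysfs is not mounted here.
        return patches
    for entry in sorted(base.iterdir()):
        if not entry.is_dir():
            continue
        state = {}
        for attr in ("enabled", "transition"):
            try:
                # The kernel reports these attributes as "0" or "1".
                state[attr] = (entry / attr).read_text().strip() == "1"
            except OSError:
                state[attr] = None
        patches[entry.name] = state
    return patches


if __name__ == "__main__":
    for name, state in list_livepatches().items():
        print(name, state)
```

A patch showing `transition` as true is still being switched over task by task, which is the window in which any patching-related side effects would be expected to show up.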
Meta: Kernel Live Patching at Scale
While using Linux live-patching on "millions of servers", Meta's engineers have had to overcome tracing-related problems and have also run into some performance issues. The reported performance issues are possible 1~2 second periods, while a live patch is being applied, of higher I/O and fsync latency as well as higher TCP re-transmit rates.
Meta engineers continue working on dealing with corner cases, better handling for cases like Clang-compiled PGO-optimized kernel builds, and other items to increase robustness.
Those curious about Meta's kernel live-patching at scale work can see the LPC 2022 slide deck and the video recording embedded below.