The Linux Kernel Seeing Backport Progress Finally For The "$1.5 Million Dollar Bug"
Several weeks ago we wrote about a kernel fix for Linux 5.4 to address performance issues for highly-threaded Linux software running under CFS quotas. The fix can yield up to a 30x improvement in performance, and one company estimated that the bug cost them at least $1.5 million USD in extra resources/hardware. Now it looks like the fix will soon appear in a Linux 5.3 point release, with possible back-ports to earlier kernels.
The CFS quota performance issue was spotted with Kubernetes workloads that use a CFS scheduler quota to restrict shared CPU resources. Because of the bug, highly-threaded software was not getting its fair access to the CPU, leading to higher latency and lower performance.
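For readers wanting to gauge how often a quota-limited workload is being held back, here is a minimal sketch. It assumes the cgroup v1 `cpu.stat` layout with `nr_periods` and `nr_throttled` counters; the function name and the sample numbers are hypothetical and are not taken from the bug report.

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Return the fraction of CFS enforcement periods in which the group was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        # Each line of cpu.stat is expected to be "<name> <integer>".
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no enforcement periods recorded yet
    return stats.get("nr_throttled", 0) / periods


# Hypothetical sample contents of a cgroup's cpu.stat file.
SAMPLE = """nr_periods 1000
nr_throttled 250
throttled_time 5000000000
"""

print(f"throttled in {throttle_ratio(SAMPLE):.1%} of periods")
```

A high ratio on an application that is nowhere near its CPU limit would be one hint that something like this issue is in play, though it is not proof on its own.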
The bug has been around since late 2017 and was fixed in Linux 5.4 Git, but getting the fix back-ported has been a bit of a pain. Job portal Indeed.com, which did a lot of the due diligence on this bug, estimated that this lone issue cost them around 1.5 million dollars in additional capacity to make up for lost resources. "Our java applications are particularly hard hit as java tends to be very thread happy. Doing some napkin math, given the roughly 9000 applications we have running in our clouds we've had to over-allocate each one's CPU quota by roughly .5 cpu to account for this behavior change (worst case scenario is actually .01 cpu * cores in machine, but not all of our applications are affected equally). 9000 applications * .5 CPU = 4500 cores 4500 Cores / 88 cores per node = 51 additional machines required to satisfy the inflated quota requirements. Given each 88 core machine costs roughly $30k that equates to $1.5M that this issue is costing us."
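The napkin math in that quote can be reproduced in a few lines. This is only a sketch of the arithmetic: the figures come from the quote, while the variable names and the choice to round to the nearest machine (which matches the quoted 51) are assumptions.

```python
# Figures taken from the Indeed.com quote.
APPLICATIONS = 9000          # applications running in their clouds
EXTRA_CPU_PER_APP = 0.5      # CPU quota over-allocated per application
CORES_PER_NODE = 88          # cores in each machine
COST_PER_NODE_USD = 30_000   # rough cost of one 88-core machine

extra_cores = APPLICATIONS * EXTRA_CPU_PER_APP

# 4500 / 88 is about 51.1; rounding to nearest is an assumption that
# reproduces the 51 machines quoted (rounding up would give 52).
extra_nodes = round(extra_cores / CORES_PER_NODE)

extra_cost_usd = extra_nodes * COST_PER_NODE_USD

print(f"extra cores:    {extra_cores:.0f}")
print(f"extra machines: {extra_nodes}")
print(f"extra cost:     ${extra_cost_usd:,}")
```

The result lands slightly above $1.5 million, consistent with the rounded figure in the quote.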
The good news for stable Linux users is that Greg KH commented today that the fix applies cleanly to Linux 5.3.x and thus should be picked up for the next patch release. Getting it into the Linux 4.14/4.19 LTS trees, though, will require re-basing the patch onto those older kernel branches.