The Linux Kernel Seeing Backport Progress Finally For The "$1.5 Million Dollar Bug"

The CFS quota performance issue was spotted with Kubernetes workloads that use a CFS scheduler quota to restrict shared CPU resources. The bug resulted in highly-threaded software not getting its fair access to the CPU, leading to higher latency and lower performance.
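For readers unfamiliar with CFS quotas, here is a minimal sketch of how a CPU limit maps onto a quota, assuming the kernel's default 100 ms enforcement period (`cpu.cfs_period_us = 100000`). The helper name `cfs_quota_us` is made up for illustration and is not a kernel or Kubernetes API.

```python
# Default CFS bandwidth period in microseconds (100 ms).
CFS_PERIOD_US = 100_000


def cfs_quota_us(cpu_limit: float, period_us: int = CFS_PERIOD_US) -> int:
    """Hypothetical helper: CPU time (in microseconds) a cgroup may
    consume per period for a given CPU limit expressed in cores."""
    return int(round(cpu_limit * period_us))


# A limit of half a CPU allows 50 ms of CPU time in every 100 ms period.
print(cfs_quota_us(0.5))  # 50000
print(cfs_quota_us(2.0))  # 200000
```

Once a cgroup has used up its quota within a period, its threads are throttled until the next period begins, which is why quota accounting problems show up as latency.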
The bug has been around since late 2017 and was fixed in Linux 5.4 Git, but getting it back-ported has been a bit of a pain. Job portal Indeed.com, which did a lot of the due diligence on this bug, estimated that this lone issue cost them around 1.5 million dollars in additional capacity to make up for lost resources:
"Our java applications are particularly hard hit as java tends to be very thread happy. Doing some napkin math, given the roughly 9000 applications we have running in our clouds we've had to over-allocate each one's CPU quota by roughly .5 cpu to account for this behavior change (worst case scenario is actually .01 cpu * cores in machine, but not all of our applications are affected equally).
9000 applications * .5 CPU = 4500 cores
4500 Cores / 88 cores per node = 51 additional machines required to satisfy the inflated quota requirements.
Given each 88 core machine costs roughly $30k that equates to $1.5M that this issue is costing us."
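Indeed's napkin math can be reproduced in a few lines. This is only a sketch using the figures from the quote. The variable names are illustrative, and rounding to the nearest whole machine is an assumption made to match the quoted figure of 51.

```python
# Figures taken from Indeed's estimate quoted above.
APPLICATIONS = 9000
EXTRA_CPU_PER_APP = 0.5      # over-allocated CPU quota per application
CORES_PER_NODE = 88
COST_PER_NODE_USD = 30_000

extra_cores = APPLICATIONS * EXTRA_CPU_PER_APP
# Rounded to the nearest machine, as the quote does (an assumption).
extra_nodes = round(extra_cores / CORES_PER_NODE)
extra_cost_usd = extra_nodes * COST_PER_NODE_USD

# Worst case from the quote: .01 cpu * cores in machine.
worst_case_cpu = 0.01 * CORES_PER_NODE

print(f"extra cores: {extra_cores:.0f}")            # extra cores: 4500
print(f"extra machines: {extra_nodes}")             # extra machines: 51
print(f"extra cost: ${extra_cost_usd:,}")           # extra cost: $1,530,000
print(f"worst case per app: {worst_case_cpu:.2f} cpu")
```

The result of $1,530,000 is where the rounded "$1.5M" figure comes from.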
The good news for stable Linux users is that Greg KH commented today that the fix applies cleanly to Linux 5.3.x and thus should be picked up there for the next patch release. Getting it into the Linux 4.14/4.19 LTS trees, though, will require the patch to be re-based onto those older kernel branches.