Google Has A Problem With Linux Server Reboots Too Slow Due To Too Many NVMe Drives

Written by Michael Larabel in Hardware on 29 March 2022 at 05:30 AM EDT.
Hyperscaler problems these days? Linux servers taking too long to reboot because they have too many NVMe drives. Some of Google's many-drive servers can take more than one minute for the Linux kernel to carry out its shutdown tasks. Thankfully, Google is working on an improvement to address this, and the work may benefit other users too, albeit less noticeably.

Google engineers are proposing an asynchronous shutdown interface for the Linux kernel. Currently the Linux kernel's shutdown APIs at the bus level are synchronous, which can cause problems like the one Google reports when a single server has many NVMe storage drives. Because shutdown handling is synchronous, the drives are shut down one after another, and each NVMe drive can take about 4.5 seconds to shut down. With Google servers now having 16+ NVMe devices, that can mean an extra minute to shut down and go through the reboot phase. With the asynchronous shutdown interface and the NVMe driver adapted to use it, their reboots, and ultimately the amount of server downtime, can be reduced by roughly one minute.
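As a rough back-of-the-envelope check on those figures, the difference between the two approaches can be modeled in a few lines of Python. Only the 4.5 second per-drive figure and the 16-drive count come from the report above. The function names are hypothetical, and the assumption that overlapped shutdowns are bounded by the slowest device is an idealized best case, not a measurement.

```python
# Idealized timing model, not kernel code. The 4.5 s and 16-drive figures
# come from the article; everything else is an illustrative assumption.

def sync_shutdown_time(per_device_seconds):
    """Devices are shut down one after another, so the delays add up."""
    return sum(per_device_seconds)


def async_shutdown_time(per_device_seconds):
    """All shutdowns are started first and then waited on, so in the
    best case the total is bounded by the slowest single device."""
    return max(per_device_seconds, default=0.0)


drives = [4.5] * 16  # 16 NVMe drives at about 4.5 seconds each

sync_total = sync_shutdown_time(drives)
async_total = async_shutdown_time(drives)
saved = sync_total - async_total

print(f"synchronous:  {sync_total:.1f} s")
print(f"asynchronous: {async_total:.1f} s (idealized)")
print(f"saved:        {saved:.1f} s")
```

Under these assumptions the synchronous path costs 72 seconds against 4.5 seconds for the overlapped one, which is consistent with the roughly one minute of savings Google describes.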


The proposed patches from Google add an optional asynchronous shutdown interface at the bus level. The new interface maintains backwards compatibility with the synchronous implementation. The patch series implements the changes at the PCIe level, moving all PCI Express based devices to the async interface, and then changes the NVMe driver to exploit the async shutdown interface.
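The idea of an optional async interface with a synchronous fallback can be illustrated with a small Python sketch. This is a conceptual model only, not the kernel patch: the class names and the `shutdown_begin` / `shutdown_wait` method names are hypothetical and are not taken from Google's patches.

```python
# Conceptual model of an optional two-phase shutdown with a synchronous
# fallback. All names here are hypothetical, chosen for illustration.

class SyncDevice:
    """A device whose driver only offers the classic blocking shutdown."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def shutdown(self):
        self.log.append(f"{self.name}: shutdown (blocking)")


class AsyncDevice(SyncDevice):
    """A device whose driver also offers the optional two-phase path."""

    def shutdown_begin(self):
        # Kick off the shutdown and return without waiting for it.
        self.log.append(f"{self.name}: shutdown started")

    def shutdown_wait(self):
        # Wait for the previously started shutdown to complete.
        self.log.append(f"{self.name}: shutdown complete")


def shutdown_all(devices):
    pending = []
    for dev in devices:
        if hasattr(dev, "shutdown_begin") and hasattr(dev, "shutdown_wait"):
            dev.shutdown_begin()   # phase 1: start, do not wait
            pending.append(dev)
        else:
            dev.shutdown()         # fallback: unchanged blocking path
    for dev in pending:
        dev.shutdown_wait()        # phase 2: collect completions


log = []
shutdown_all([
    AsyncDevice("nvme0", log),
    SyncDevice("legacy0", log),
    AsyncDevice("nvme1", log),
])
print("\n".join(log))
```

In this model, drivers that do not implement the two-phase methods keep working exactly as before, which is the backwards-compatibility property described above, while devices that do implement them have their shutdown waits overlapped.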

In its current form, the proposed async shutdown interface is only around one hundred lines of new code, though granted just one driver makes use of it at the moment. But with modern high-performance Linux servers continuing to add more NVMe drives and other PCIe devices, where the Linux kernel's synchronous shutdown interface can mean extra downtime, hopefully these patches will move ahead and be mainlined in short order, along with more drivers being adapted to make use of the interface.