Linux 6.1 Will Make It A Bit Easier To Help Spot Faulty CPUs

Written by Michael Larabel in Linux Kernel on 25 August 2022 at 09:12 AM EDT.
While mostly of benefit to server administrators with large fleets of hardware, Linux 6.1 aims to make it easier to spot problematic CPUs/cores by reporting the likely socket and core when a segmentation fault occurs. That can help reveal a trend when the same CPU or core keeps turning up in crash reports.

Queued up now in TIP's x86/cpu branch for the Linux 6.1 merge window in October is a patch to print the likely CPU at segmentation fault time. Having the likely CPU core and socket in the log is useful when seg faults keep recurring on the same CPU package or on one particular core.

Rik van Riel, who authored the change, summed it up as:
In a large enough fleet of computers, it is common to have a few bad CPUs. Those can often be identified by seeing that some commonly run kernel code, which runs fine everywhere else, keeps crashing on the same CPU core on one particular bad system.

However, the failure modes in CPUs that have gone bad over the years are often oddly specific, and the only bad behavior seen might be segfaults in programs like bash, python, or various system daemons that run fine everywhere else.

Add a printk() to show_signal_msg() to print the CPU, core, and socket at segfault time.

This is not perfect, since the task might get rescheduled on another CPU between when the fault hit, and when the message is printed, but in practice this has been good enough to help people identify several bad CPU cores.

This little helper for spotting potentially faulty processors will be available starting with Linux 6.1 later this year.




It's a small but useful complement to the likes of the new Intel In-Field Scan, MCEs, EDAC reporting, etc.
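
For anyone wanting to look for the kind of trend described above, here is a minimal sketch in Python that tallies segfault reports per socket and core from kernel log lines. It assumes the new message ends with a fragment of the form "likely on CPU N (core N, socket N)"; that exact wording is an assumption here and should be checked against real output from a Linux 6.1 kernel. The function name count_segfaults_by_core is hypothetical.

```python
import re
import sys
from collections import Counter

# ASSUMPTION: the kernel appends "likely on CPU N (core N, socket N)" to the
# segfault line. Verify against real dmesg output before relying on this.
SEGFAULT_CPU = re.compile(
    r"segfault at .*likely on CPU (\d+) \(core (\d+), socket (\d+)\)"
)


def count_segfaults_by_core(lines):
    """Count segfault reports per (socket, core) pair in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = SEGFAULT_CPU.search(line)
        if match:
            _cpu, core, socket = (int(group) for group in match.groups())
            counts[(socket, core)] += 1
    return counts


if __name__ == "__main__":
    # Read kernel log lines from standard input and print the busiest cores first.
    for (socket, core), n in count_segfaults_by_core(sys.stdin).most_common():
        print(f"socket {socket} core {core}: {n} segfault(s)")
```

Piping the kernel log into the script (for example from dmesg) would list the cores with the most reports first. As the commit message notes, the reported CPU is only the likely one, so a single hit means little; it is the repeated hits on one core that are worth a closer look.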
