Linux 6.1 Will Make It A Bit Easier To Help Spot Faulty CPUs

Written by Michael Larabel in Linux Kernel on 25 August 2022 at 09:12 AM EDT.

While mostly of benefit to server administrators with large fleets of hardware, Linux 6.1 will make it easier to spot problematic CPUs and cores by reporting the likely CPU, core, and socket when a segmentation fault occurs. That helps reveal trends when the same CPU or core is repeatedly involved in crashes.

Queued now in TIP's x86/cpu branch ahead of the Linux 6.1 merge window in October is a patch that prints the likely CPU at segmentation fault time. The extra detail is most useful when seg faults keep landing on the same CPU package or a particular core.

Rik van Riel, who authored the change, summed it up as:

In a large enough fleet of computers, it is common to have a few bad CPUs. Those can often be identified by seeing that some commonly run kernel code, which runs fine everywhere else, keeps crashing on the same CPU core on one particular bad system.

However, the failure modes in CPUs that have gone bad over the years are often oddly specific, and the only bad behavior seen might be segfaults in programs like bash, python, or various system daemons that run fine everywhere else.

Add a printk() to show_signal_msg() to print the CPU, core, and socket at segfault time.

This is not perfect, since the task might get rescheduled on another CPU between when the fault hit, and when the message is printed, but in practice this has been good enough to help people identify several bad CPU cores.
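To illustrate how that extra information could be used for trend spotting, here is a minimal Python sketch that tallies segfault reports per CPU from kernel log lines. It assumes the message ends with a suffix of the form "likely on CPU N (core N, socket N)"; that exact wording, the `count_segfault_cpus` helper name, and the shortened sample lines are illustrative assumptions rather than confirmed kernel output, so check the pattern against real logs before relying on it.

```python
import re
from collections import Counter

# Assumed suffix format of the new segfault message; verify it against
# the output of your own kernel before relying on this pattern.
SEGV_CPU = re.compile(r"likely on CPU (\d+) \(core (\d+), socket (\d+)\)")


def count_segfault_cpus(lines):
    """Count segfault reports per (cpu, core, socket) tuple."""
    counts = Counter()
    for line in lines:
        match = SEGV_CPU.search(line)
        if match:
            counts[tuple(int(g) for g in match.groups())] += 1
    return counts


# Hypothetical, shortened log lines used only for demonstration.
sample_lines = [
    "bash[1201]: segfault at 0 ip ... error 6 likely on CPU 3 (core 3, socket 0)",
    "python[1310]: segfault at 8 ip ... error 4 likely on CPU 3 (core 3, socket 0)",
    "sshd[1422]: segfault at 10 ip ... error 4 likely on CPU 17 (core 1, socket 1)",
    "usb 1-1: new high-speed USB device number 2",  # unrelated line, ignored
]

# Print the most frequently reported CPUs first.
for (cpu, core, socket), n in count_segfault_cpus(sample_lines).most_common():
    print(f"CPU {cpu} (core {core}, socket {socket}): {n} segfault(s)")
```

The function accepts any iterable of strings, so an open log file or `sys.stdin` can be passed in place of the sample list. As the commit message notes, the reported CPU is only the likely one, so a single hit means little; a CPU that keeps topping the tally across many unrelated programs is the signal worth investigating.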

This little helper for spotting potentially faulty processors will be available starting with Linux 6.1 later this year.

Not directly related: I Bent A Kabylake CPU & It Still Works

The segfault CPU reporting is a small but useful complement to the likes of the new Intel In-Field Scan, MCEs, EDAC reporting, etc.
