Linux 6.10 Preps For "When Things Go Seriously Wrong" On Bigger Servers
While machine check exception (MCE) events tend to be uncommon, a change from Intel engineers lets the Linux kernel store more machine check records for "when things go seriously wrong" on increasingly high core count servers.
Until now, the Linux kernel maintained a memory pool able to store 80 machine check exception records, but Intel's Tony Luck has raised that limit to accommodate increasingly large server processors:
"Systems with a large number of CPUs may generate a large number of machine check records when things go seriously wrong. But Linux has a fixed buffer that can only capture a few dozen errors.
Allocate space based on the number of CPUs (with a minimum value based on the historical fixed buffer that could store 80 records)."
The new behavior implemented in Linux 6.10 is to maintain a pool of at least 80 records or two records per CPU core, whichever is greater. In other words, Linux 6.10+ systems with more than 40 CPU cores will see an expanded pool for storing MCE records when the system state goes awry.
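As a rough illustration of that sizing rule, here is a minimal Python sketch of the arithmetic. It is not the kernel code: the constant and function names are made up for this example, and only the values (80 records minimum, two records per CPU) come from the description above.

```python
# Illustrative names only; these are assumptions, not identifiers from the kernel source.
MIN_RECORDS = 80       # historical fixed pool size, kept as the floor
RECORDS_PER_CPU = 2    # per-CPU allowance described for Linux 6.10


def mce_pool_records(num_cpus: int) -> int:
    """Return how many MCE records the pool would hold for num_cpus CPUs."""
    return max(MIN_RECORDS, RECORDS_PER_CPU * num_cpus)


# Compare a few hypothetical system sizes.
for cpus in (8, 40, 64, 256):
    print(f"{cpus:>4} CPUs -> {mce_pool_records(cpus)} records")
```

Under these assumptions a 40-CPU system still lands on the 80-record floor, and the pool only grows past that point, for instance to 512 records at 256 CPUs.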
The change was the only item in the RAS updates merged for Linux 6.10.