Linux "PSI" Patches Report Stall/Pressure Information For CPU / Memory / Storage
One of the interesting patch series in the works is the "PSI" work by Johannes Weiner of Facebook.
PSI in this context stands for Pressure Stall Information. This information, to be exposed by future versions of the Linux kernel, makes it possible to quantify resource pressure on the system across CPU, memory, and I/O -- including within cgroups.
On a "Pressure Stall Information" enabled kernel, the percentage of time the system is stalled on CPU, memory, or I/O is exposed via /proc/pressure/ as "pressure percentages", a numerical way to quantify system health around resource over-commitment. The PSI pressure percentages can also be used as an indicator of how close the system may be to running out of memory or locking up.
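For a rough idea of how these pressure percentages could be consumed from user space, here is a minimal Python sketch. It assumes the line format documented for mainline kernels that ship PSI (for example "some avg10=0.12 avg60=0.05 avg300=0.01 total=12345" in /proc/pressure/cpu, /proc/pressure/memory, and /proc/pressure/io); the exact output of this patch series may differ. The function names parse_pressure and read_pressure are made up for illustration.

```python
from pathlib import Path


def parse_pressure(text):
    """Parse PSI-style lines into {"some": {...}, "full": {...}}.

    Assumed format (one line per kind):
        some avg10=0.12 avg60=0.05 avg300=0.01 total=12345
    """
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, value = field.partition("=")
            # "total" is a cumulative stall time counter; the avg* fields are percentages.
            values[key] = int(value) if key == "total" else float(value)
        result[kind] = values
    return result


def read_pressure(resource, root="/proc/pressure"):
    """Return parsed pressure data for "cpu", "memory" or "io", or None if unavailable."""
    try:
        return parse_pressure((Path(root) / resource).read_text())
    except OSError:
        # Kernel without PSI support, PSI disabled, or the file is not readable.
        return None


if __name__ == "__main__":
    for resource in ("cpu", "memory", "io"):
        print(resource, read_pressure(resource))
```

On a kernel without PSI the reader simply returns None for each resource, so the script is safe to run anywhere.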
When CPU, memory or IO devices are contended, workloads experience latency spikes, throughput losses, and run the risk of OOM kills.
Without an accurate measure of such contention, users are forced to either play it safe and under-utilize their hardware resources, or roll the dice and frequently suffer the disruptions resulting from excessive overcommit.
The psi feature identifies and quantifies the disruptions caused by such resource crunches and the time impact it has on complex workloads or even entire systems.
Having an accurate measure of productivity losses caused by resource scarcity aids users in sizing workloads to hardware--or provisioning hardware according to workload demand.
As psi aggregates this information in realtime, systems can be managed dynamically using techniques such as load shedding, migrating jobs to other systems or data centers, or strategically pausing or killing low priority or restartable batch jobs.
This allows maximizing hardware utilization without sacrificing workload health or risking major disruptions such as OOM kills.
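As an illustration of the dynamic management described above, the following Python sketch maps a memory pressure reading to a coarse action. Everything in it is an assumption made for the example: the "some avg10=..." line format follows the mainline PSI documentation rather than this specific patch series, the thresholds (10% and 40%) are arbitrary, and the action names are hypothetical placeholders for whatever a real job scheduler would do.

```python
import re


def avg10(pressure_text, kind="some"):
    """Return the avg10 percentage from a '<kind> avg10=...' line, or None if absent."""
    match = re.search(rf"^{kind}\s+avg10=([0-9.]+)", pressure_text, re.MULTILINE)
    return float(match.group(1)) if match else None


def choose_action(pressure_text, shed_at=10.0, pause_at=40.0):
    """Pick a coarse action from the 10-second 'some' pressure average.

    The thresholds are illustrative only and would need tuning per workload.
    """
    value = avg10(pressure_text)
    if value is None:
        return "unknown"           # no PSI data available
    if value >= pause_at:
        return "pause-batch-jobs"  # heavy stalls: stop restartable work
    if value >= shed_at:
        return "shed-load"         # moderate stalls: stop accepting new work
    return "ok"


if __name__ == "__main__":
    sample = "some avg10=12.00 avg60=4.10 avg300=1.02 total=987654\n"
    print(choose_action(sample))
```

A real controller would read /proc/pressure/memory periodically and would likely look at the longer averages as well, so that it does not react to a single short spike.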
The latest version of this patch series can be found on the kernel mailing list. Hopefully this interesting PSI addition to the Linux kernel will be ready for merging in the near future.