Intel "In-Field Scan" Coming With Sapphire Rapids As New Silicon Failure Testing Feature
Intel In-Field Scan is a hardware feature debuting with at least some of the upcoming Xeon "Sapphire Rapids" processor SKUs. It allows running circuit-level tests on a CPU core to detect hardware problems not caught by parity or ECC checks. Intel In-Field Scan (abbreviated "IFS", not to be confused with Intel Foundry Services) provides hardware hooks for performing per-core tests and reporting any silicon failures those tests find. It is designed for cloud providers, OEMs, and hyperscalers to run tests and find in-field failures due to aging silicon or other hardware problems that would not otherwise be detected by existing hardware checks such as ECC memory errors or other machine check exceptions.
Intel In-Field Scan makes a lot of sense for future Xeon Scalable server processors, helping to detect silicon issues either prior to deployment into production or afterward through routine monitoring of the aging silicon.
Exactly which silicon-level hardware tests will be conducted isn't entirely clear. The proposed Intel IFS kernel driver is just the infrastructure for handling In-Field Scan, while the tests themselves will be loaded as a binary, similar to Intel CPU microcode. The Intel IFS tests are loaded from a file and are specific to a particular CPU Family/Model/Stepping. These files are authenticated prior to use and, once loaded, are stored within secure memory.
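Since the test images are tied to a CPU Family/Model/Stepping, it can help to check what a given machine reports. Below is a minimal Python sketch that pulls those three values out of /proc/cpuinfo-style text. The function name is hypothetical and is not part of the IFS driver; it relies only on the "cpu family", "model", and "stepping" fields that x86 Linux exposes in /proc/cpuinfo.

```python
def parse_family_model_stepping(cpuinfo_text):
    """Return (family, model, stepping) for the first processor entry.

    Hypothetical helper for illustration; any value that is not found
    is returned as None.
    """
    wanted = {"cpu family": None, "model": None, "stepping": None}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Exact key match, so "model name" is not mistaken for "model".
        if sep and key in wanted and wanted[key] is None:
            wanted[key] = int(value.strip())
        if all(v is not None for v in wanted.values()):
            break  # first processor entry is fully parsed
    return wanted["cpu family"], wanted["model"], wanted["stepping"]
```

On a Linux system this could be called with the contents of /proc/cpuinfo, e.g. `parse_family_model_stepping(open("/proc/cpuinfo").read())`.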
When running on a supported Intel processor with a kernel that includes the Intel IFS driver and with the IFS test images available, the tests can be loaded via /sys/devices/system/cpu/ifs/reload. Executing the IFS tests on all available CPU cores can then be triggered by writing to /sys/devices/system/cpu/ifs/run_test. The IFS driver also allows testing individual CPU cores via sysfs.
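As a rough illustration of that flow, here is a minimal Python sketch that writes to the two sysfs files named above. The value written ("1") is an assumption, since the proposed patches may expect something different, and the function name and parameters are hypothetical. On real hardware this would need root privileges and a kernel carrying the IFS driver.

```python
from pathlib import Path

# sysfs directory as described for the proposed IFS driver.
IFS_SYSFS_DIR = Path("/sys/devices/system/cpu/ifs")


def reload_and_run(ifs_dir=IFS_SYSFS_DIR, trigger="1"):
    """Load the IFS test image, then start the tests on all cores.

    ASSUMPTION: writing "1" is what triggers each action; the exact
    value expected by the driver is not confirmed here.
    """
    ifs_dir = Path(ifs_dir)
    # Step 1: ask the driver to (re)load the test image.
    (ifs_dir / "reload").write_text(trigger)
    # Step 2: trigger the tests across all available CPU cores.
    (ifs_dir / "run_test").write_text(trigger)
```

Taking the directory as a parameter keeps the sketch easy to try against a scratch directory before pointing it at a real system.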
After carrying out an In-Field Scan test, the results are written to /sys/devices/system/cpu/ifs/status, which reports whether all CPU cores passed or failed. There are also sysfs files reporting which specific CPU cores passed, failed, or were left untested.
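Reading the result back could look something like the following Python sketch. The format of the status file's contents is not spelled out here, so the sketch simply returns the raw text with surrounding whitespace trimmed; the function name is hypothetical.

```python
from pathlib import Path


def read_ifs_status(ifs_dir="/sys/devices/system/cpu/ifs"):
    """Return the raw contents of the IFS status file.

    ASSUMPTION: the file holds a short text value; no particular
    format is assumed or parsed.
    """
    return (Path(ifs_dir) / "status").read_text().strip()
```

A monitoring script could call this after each test run and log or alert on whatever value the driver reports.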
These interfaces will allow OEMs and hyperscalers to easily carry out these silicon failure tests whenever desired, whether prior to deployment or on an ongoing basis, to look for any issues stemming from aging silicon.
The Intel In-Field Scan Linux kernel driver is currently under review on the kernel mailing list and amounts to around 1.5k lines of new code, not counting the to-be-published CPU model-specific test files that will seemingly come out later once Sapphire Rapids is formally launched.