Linux Will Now Better Handle AMD SEV-SNP To Avoid Undefined Behavior For Old VMs
Some AMD SEV-SNP features need guest-side support to work correctly. If a host running a recent Linux kernel that supports newer features of modern AMD EPYC CPUs boots a guest virtual machine whose kernel lacks support for some of those SEV features, there can be problems, and they aren't necessarily straightforward to diagnose. Surprisingly, it took until yesterday for the mainline Linux kernel to receive SEV-SNP guest feature negotiation support to deal with this real possibility of the host/hypervisor having a newer kernel than the guest VMs.
From the patch adding this SEV-SNP guest feature negotiation support:
"The hypervisor can enable various new features (SEV_FEATURES[1:63]) and start the SNP guest. Some of these features need guest side implementation. If any of these features are enabled without guest side implementation, the behavior of the SNP guest will be undefined. The SNP guest boot may fail in a non-obvious way making it difficult to debug.
Instead of allowing the guest to continue and have it fail randomly later, detect this early and fail gracefully.
SEV_STATUS MSR indicates features which the hypervisor has enabled. While booting, SNP guests should ascertain that all the enabled features have guest side implementation. In case any feature is not implemented in the guest, the guest terminates booting with GHCB protocol Non-Automatic Exit(NAE) termination request event. Populate SW_EXITINFO2 with mask of unsupported features that the hypervisor can easily report to the user."
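The check described in the patch message boils down to a bitmask comparison. The following is a minimal sketch of that idea in plain C, not the kernel's actual code: the feature names (FEAT_A, FEAT_B, FEAT_C), their bit positions, and the function name unsupported_features() are all hypothetical placeholders, and the SEV_STATUS value is passed in as an argument rather than read from the MSR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits for illustration only. These are NOT the
 * real SEV_STATUS bit assignments. */
#define FEAT_A (UINT64_C(1) << 3)
#define FEAT_B (UINT64_C(1) << 4)
#define FEAT_C (UINT64_C(1) << 5)

/* Features that lead to undefined behavior unless the guest has
 * matching code (assumed set for this sketch). */
#define NEEDS_GUEST_SUPPORT (FEAT_A | FEAT_B | FEAT_C)

/* The subset this imaginary guest kernel actually implements. */
#define GUEST_IMPLEMENTED (FEAT_A)

/* Sanity check: the guest should only claim features from the set
 * that needs guest-side support. */
static_assert((GUEST_IMPLEMENTED & ~NEEDS_GUEST_SUPPORT) == 0,
              "implemented features must be a subset of the needed set");

/* Return the mask of features the hypervisor enabled that require
 * guest support the guest does not have. Zero means it is safe to
 * keep booting. */
uint64_t unsupported_features(uint64_t sev_status)
{
    return sev_status & NEEDS_GUEST_SUPPORT & ~GUEST_IMPLEMENTED;
}
```

In a real guest, a non-zero result would presumably be the mask placed in SW_EXITINFO2 alongside the termination request, so the hypervisor can report exactly which features were the problem. Here it is simply returned to the caller.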
This is being treated as a fix, so it was picked up for Linux 6.2-rc6 rather than waiting for the next merge window. It will in turn be back-ported to the stable Linux kernel series soon.
Yesterday's x86/urgent pull request characterized the problem with, "The SEV-SNP patch looks a bit largish and perhaps, at a first glance, not really urgent material but the intent behind it is to fail gracefully when booting older kernels on newer hypervisors when latter support features which those older kernels do not know of yet. Therefore, it should go to stable so sending it now is as good a time as any...Have a SEV-SNP guest check explicitly for features enabled by the hypervisor and fail gracefully if some are unsupported by the guest instead of failing in a non-obvious and hard-to-debug way."