The Linux 5.2+ "Register Corruption" Bug / Golang Issue Was A One-Line Kernel Caching Issue
Just over a day ago, the Google engineers working on Golang determined that many Go issues stemmed from a bug in the Linux 5.2 kernel and newer that appeared to be some sort of register corruption issue. That issue has now been sorted out: fortunately it is a one-line kernel fix and boils down to a caching issue.
Kernel developer Sebastian Andrzej Siewior traced the bug to cached access to the fpu_fpregs_owner_ctx context, and his fix has since been confirmed to address both the C test case and the Golang issues. The context was being cached, but because the kernel defers loading the FPU registers until the return to userland, fpu_fpregs_owner_ctx can change during preemption and shouldn't be cached, per the patch devised to fix the issue.
The fix is simply to read fpu_fpregs_owner_ctx with this_cpu_read() instead of this_cpu_read_stable(). The latter allows the value to be cached, whereas the former makes GCC load the per-CPU variable each time it's accessed. That resolves the issue of Linux 5.2+ kernels built with GCC 9 leading to the problems hitting Golang as well as other bugs.
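To illustrate why a cached read is unsafe here, below is a minimal userspace sketch in C. It is not the kernel code: struct fpu, owner_ctx, simulate_preemption() and the other helper names are hypothetical stand-ins, and the "preemption" is just a function call that changes the owner between the point where the value is sampled and the point where it is compared. It only models the effect of reusing a stale value versus loading it again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct fpu. */
struct fpu { int id; };

/* Hypothetical stand-in for the per-CPU fpu_fpregs_owner_ctx variable. */
static struct fpu *owner_ctx;

/* Simulated preemption: another task's FPU state becomes the owner. */
static void simulate_preemption(struct fpu *other)
{
    owner_ctx = other;
}

/* Fresh read: the volatile access forces a new load on every call. */
static struct fpu *read_owner_fresh(void)
{
    return *(struct fpu *volatile *)&owner_ctx;
}

/*
 * Stale check: the owner is sampled once, then reused after the
 * simulated preemption. This models what a cached read can amount to.
 */
static bool valid_with_cached_owner(struct fpu *task_a, struct fpu *task_b)
{
    owner_ctx = task_a;
    struct fpu *cached = read_owner_fresh(); /* sampled before preemption */
    simulate_preemption(task_b);
    return task_a == cached; /* still claims task_a owns the registers */
}

/* Fresh check: the owner is loaded again after the simulated preemption. */
static bool valid_with_fresh_owner(struct fpu *task_a, struct fpu *task_b)
{
    owner_ctx = task_a;
    simulate_preemption(task_b);
    return task_a == read_owner_fresh(); /* sees that task_b is the owner */
}

/* Small driver showing the two outcomes side by side. */
void demo(void)
{
    struct fpu a = { 1 }, b = { 2 };
    assert(valid_with_cached_owner(&a, &b));  /* stale answer: "valid" */
    assert(!valid_with_fresh_owner(&a, &b));  /* fresh answer: "not valid" */
}
```

In this sketch the stale check wrongly reports that the first task still owns the registers, which is the kind of mistake that could let a task run with another task's FPU state.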
This one-line change is confirmed to resolve the problem, so it will hopefully be rolling out to the kernel stable series shortly.