That Nasty Linux Kernel Lockup Bug Is Still Unresolved
Nearly a month ago, during the Linux 3.18 release candidates, kernel developers uncovered a worrisome regression. Now, with the Linux 3.19 merge window nearly over, that issue has yet to be firmly addressed.
Throughout the Linux 3.18 kernel cycle, and likely affecting Linux 3.17 too, there has been a nasty kernel lock-up issue. It was first widely reported by Red Hat's Dave Jones, who has spent the last several weeks bisecting kernels, testing patches, and trying to figure out the root cause. Other kernel developers have also been able to reproduce the problem and various patches have been proposed, but as of this morning the issue is still present in Git master. As reported a few weeks ago in the last Phoronix article on the matter, it looks like the issue might be related to the Xen code within the Linux kernel.
There have been many mailing list posts in the "frequent lockups in 3.18rc4" thread but no conclusion. The most recent post by Dave Jones was this morning:
"Bah, I was getting all optimistic. I came home this evening to a locked up machine. Serial console had a *lot* more traces than usual though. Full log below. The 12xxx.xxxxxx traces we seemed to recover from, followed by silence for a while, before the real fun begins at 157xx.xxxxxx ..."
That's the end of the thread at the time of writing.
Earlier this week, Linus Torvalds was looking at a potentially related issue within the kernel. Linus noted, "there's something funny going on there. Anyway, I've looked at the page fault patch, and I mentioned this last time it came up: there's a nasty possible kernel loop in the 'retry' case if there's also a fatal signal pending, and we're returning to kernel mode rather than returning to user mode." Linus came up with an (untested) patch for that issue and then followed up with, "So after looking at this more, I'm actually really convinced that this was a pretty nasty bug. I'm *not* convinced that it's necessarily [Dave Jones'] bug, but I still think it could be." Judging by Dave's emails after that point, this wasn't the root problem, but it looks like another bug was squashed in the process, or at least more kernel code was cleaned up.
Stay tuned to Phoronix for any further firm leads on this Linux kernel lockup issue. At least the issue doesn't appear to be too widespread, and I haven't yet encountered it on the many systems that are automatically testing the Linux kernel daily.