Nasty Lockup Issue Still Being Investigated For Linux 3.18

Dave Jones of Red Hat was the first to report frequent lockups with the Linux 3.18 kernel. At the time he found Linux 3.17 to be okay, but under 3.18 he experienced lockups with a kernel panic, and the origin of the issue was tough to pin down.
Since Sunday, Dave Jones has continued his testing, which at times means putting load on the kernel for a day to see if the issue occurs, so tracking down this issue has been a long and tedious process. In the kernel mailing list thread about the issue, there were some interesting comments this week:
The issue might be related to Xen. A patch dating back to 2005 was pushed for Xen to fix a vmalloc_fault() path that was similar to what was reported by Dave. The patch had a comment that read "the line below does not always work. Needs investigating!" But it looks like this issue was never properly investigated.
However, on Wednesday, Dave Jones woke up to find a different issue with a soft lock-up that wasn't Xen-related.
Multiple issues might be at play. Because the bug is so difficult to track down, developers and testers might be finding multiple similar-looking bugs within the kernel. Juergen Gross wrote, "Digging deeper I found something making me believe I've seen another issue than Dave which just looked similar on the surface. :-("
Linux 3.17 might be affected too. Dave Jones confirmed from his testing, "So 3.17 also has this problem. Good news I guess in that it's not a regression, but damn I really didn't want to have to go digging through the mists of time to find the last 'good' point. At least it shouldn't hold up 3.18."
It might be related to the kernel's watchdog code, based on an investigation by Linus Torvalds. "So I'm looking at the watchdog code, and it seems racy [with regard to] parking and startup...Quite frankly, I'm just grasping for straws here, but a lot of the watchdog traces really have seemed spurious..."
The last post at the time of writing was from Dave Jones, saying he was testing the Linux 3.16 kernel to check whether the issue at least doesn't happen with that older version. That message was sent out on Thursday.
Linux 3.18-rc7 is expected tomorrow, but there is no fix in Git yet, nor have there been any new mailing list posts on Friday or so far today. With the US holidays this week, testing is further complicated by many people taking time away from their computers. I'll keep monitoring the thread, so stay tuned to Phoronix for new posts about this Linux 3.18 (and likely 3.17) problem.