That Peculiar Linux 3.18 Kernel Bug Might Be Closed Soon
For the past month, kernel developers have been investigating "a big unknown worry in a regression" that has left many key kernel developers, including Linus Torvalds, puzzled. It looks like that investigation is finally close to being resolved.
The issue was first reported by Dave Jones of Red Hat, and it has been messy to track down, especially since testing each kernel revision while bisecting the regression can take a number of hours to verify whether the bug occurs. Theories so far have included that the regression is related to Xen kernel code, that multiple issues are at play, that Linux 3.17 might also be affected, that the kernel watchdog code is involved, or that it is something else entirely. In the process of trying to figure out this regression, other kernel bugs have been uncovered and patched.
As of the middle of December, the issue was still being investigated even though Linux 3.18 had already been released. Now it looks like the investigation is coming to an end. Dave Jones also has to return the test system this coming Monday as he's leaving Red Hat.
The latest belief is that the issue might be related to HPET, the High Precision Event Timer. Yesterday Linus Torvalds posted that he agrees it could be HPET related, and that the cause would likely come down to an actual hardware bug, some SMM/BIOS power management feature causing the problem, a bug in the kernel's clock-source handling, or "gremlins" (something freakish happening). In a follow-up response, Dave Jones said he is starting to think it could be due to a BIOS or motherboard error with the Intel server motherboard he's been using, given that there was a somewhat similar bug report last year. For Dave the issue only started recently with the kernel, but if it turns out to be a hardware issue, the HPET problem wouldn't be investigated further.
Here's the last post at the time of writing. It looks like Dave will still be running some kernel tests this weekend, though after that he has to turn in the system. For now at least, those involved think it's HPET related and possibly specific to the Intel "Shark Bay" motherboard at hand. For those unfamiliar with the High Precision Event Timer of modern chipsets, there's an overview via Wikipedia.
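For readers curious whether their own Linux system is currently using HPET as its clock source, the kernel exposes this through sysfs. The following is only a minimal sketch and not something taken from the mailing list discussion: the helper names `read_clocksources` and `uses_hpet` are made up for illustration, and the check says nothing about whether a system is affected by this particular regression.

```python
from pathlib import Path

# Standard sysfs directory where Linux exposes clock-source information.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clock sources.

    Hypothetical helper for illustration. Returns (None, []) when the
    sysfs entries cannot be read, e.g. on a non-Linux system.
    """
    base = Path(base)
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        return None, []
    return current, available


def uses_hpet(current):
    """True if the active clock source is HPET (hypothetical helper)."""
    return current == "hpet"


if __name__ == "__main__":
    current, available = read_clocksources()
    if current is None:
        print("Clock-source information is not available on this system.")
    else:
        print(f"Current clock source: {current}")
        print(f"Available clock sources: {', '.join(available)}")
        print(f"Using HPET: {uses_hpet(current)}")
```

On many systems the active clock source is the TSC rather than HPET, with HPET only listed among the available alternatives.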