This morning after writing "Intel Sandy Bridge On Ubuntu 11.04 Is Still Troubling", I proceeded to build the latest Mesa / Linux kernel / libdrm / DDX Git stack to see where the latest Intel SNB code is at and how it's running for the popular Core i5 2500K processor. Before leaving three weeks ago, everything was running great, but much to my surprise, this morning it was a broken mess. Intel just regressed hard in their Sandy Bridge support for the about-to-be-released Linux 2.6.39 kernel. Whoops!
The problems experienced on the Git stack as of this morning were similar to what was encountered under a clean Ubuntu 11.04 installation with the Linux 2.6.38 kernel... When running any OpenGL test, the frame-rate was extremely choppy due to the GPU stalling. The dmesg filled up with i915_hangcheck_ring_idle errors from the Intel DRM driver. This at least signaled it being a kernel problem and also a definite regression since this i915_hangcheck_ring_idle problem was corrected earlier on in the Linux 2.6.39 cycle.
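For readers who want to check whether their own system is logging the same error, here is a minimal Python sketch. It only looks for the `i915_hangcheck_ring_idle` substring named above; the exact format of the surrounding log line is not assumed, and the helper names `hangcheck_lines` and `read_dmesg` are hypothetical ones chosen for illustration.

```python
import subprocess
from typing import List

# Substring reported in dmesg by the Intel DRM driver, as named in the article.
MARKER = "i915_hangcheck_ring_idle"


def hangcheck_lines(log_text: str) -> List[str]:
    """Return every line of the given log text that mentions the marker."""
    return [line for line in log_text.splitlines() if MARKER in line]


def read_dmesg() -> str:
    """Capture the kernel ring buffer via the dmesg command.

    Depending on the system configuration this may need root privileges.
    """
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `len(hangcheck_lines(read_dmesg()))` before and after running an OpenGL test gives a rough idea of whether the GPU is stalling during the run.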
Checking the mailing lists revealed that this was not an isolated case but a definite regression, and one that occurred in the past two weeks while I was away. There was the "PROBLEM: i915 regression between 2.6.39-rc6 and 2.6.39-rc7" message, saying the i915_hangcheck_ring_idle regression took place between 2.6.39-rc6 and 2.6.39-rc7. Separately, another e-mail entitled "Sandy Bridge GPU hang reproducer..." indicated Sandy Bridge hanging with the latest Linux 2.6.39-rc7 kernel.
It hurts a lot to see this regression introduced so late in the Linux 2.6.39 kernel cycle. The Linux 2.6.39-rc7 release came a week ago and it was expected to be the last release candidate before going gold. In fact, the Linux 2.6.39 kernel is likely just hours away from release and its Sandy Bridge support is borked.
None of the mailing list messages has received an official response from Intel yet, nor has the regression been tracked down there. Fortunately, however, the Phoronix Test Suite stack in conjunction with OpenBenchmarking.org and Phoromatic can systematically bisect the entire Linux kernel (among other Git-controlled code-bases) to find this problem. I initiated the automatic bisection and soon enough this regression was tracked down to the particular commit. What is borking the Intel SNB code in the Linux kernel Git is this commit:
Author: Andy Lutomirski <firstname.lastname@example.org>
Date: Fri May 13 12:14:54 2011 -0400
drm/i915: Revert i915.semaphore=1 default from i915 merge
My Q67 / i7-2600 box has rev09 Sandy Bridge graphics. It hangs instantly when GNOME loads and it hangs so hard the reset button does not work. Setting i915.semaphore=0 fixes it.
Semaphores were disabled in a1656b9090f7 ("drm/i915: Disable GPU semaphores by default") in 2.6.38 but were then re-enabled (by mistake?) by the merge 47ae63e0c2e5 ("Merge branch 'drm-intel-fixes' into drm-intel-next").
(It's worth noting that the offending change is i915_drv.c, which was not marked as a conflict - although a 'git show --cc' on the merge does show that neither parent had it set to 1)
Signed-off-by: Andy Lutomirski <email@example.com>
Signed-off-by: Linus Torvalds <firstname.lastname@example.org>
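The bisection that landed on the commit above is, at its core, a binary search over the commit history. The following Python sketch shows only that general idea. It is not how the Phoronix Test Suite or Phoromatic implement it, and `first_bad_commit` and `is_bad` are hypothetical names. It assumes the commits are ordered oldest to newest, the first is known good, the last is known bad, and there is a single good-to-bad transition.

```python
from typing import Callable, Sequence


def first_bad_commit(
    commits: Sequence[str], is_bad: Callable[[str], bool]
) -> str:
    """Binary-search for the first commit where is_bad() returns True.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is also bad.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")

    good, bad = 0, len(commits) - 1  # indices of known-good / known-bad
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return commits[bad]
```

In a real setup, `is_bad` would build and boot the kernel at the given commit, run an OpenGL test, and report whether the hang shows up. Each step halves the remaining range, which is why even thousands of commits need only a dozen or so builds.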
A semantic change on Friday the 13th caused the Sandy Bridge support to break. This commit disables Intel driver semaphores by default, which was the case in the Linux 2.6.38 kernel before they were later re-enabled. This was done because evidently a user (also of SNB graphics) experienced a hang on loading GNOME when semaphores were enabled. However, for other Sandy Bridge users, disabling semaphores still causes the graphics processor to hang, just later on. Fortunately, the semaphores can be controlled by a kernel module parameter (i915.semaphores=[0,1]). When reverting this commit, sure enough, the i915_hangcheck_ring_idle regression went away. Alternatively, when simply pulling the latest kernel Git, setting i915.semaphores=1 corrects the situation on this test machine as well without needing to rebuild the kernel.
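To confirm whether the running kernel was actually booted with the parameter, the kernel command line can be inspected. Below is a small Python sketch that parses a command-line string for `i915.semaphores=`. The function names are hypothetical, and treating the last occurrence as the effective one is an assumption of this sketch.

```python
from typing import Optional


def semaphores_setting(cmdline: str) -> Optional[str]:
    """Return the value given for i915.semaphores on a kernel command line.

    Returns None if the option is absent. If it appears more than once,
    the last occurrence is returned (an assumption of this sketch).
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == "i915.semaphores":
            value = val
    return value


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()
```

A result of `None` from `semaphores_setting(read_cmdline())` only means the option was not passed at boot, in which case the driver's built-in default applies. This sketch does not look at module options set elsewhere, such as in modprobe configuration.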
It may be too late though for this to be corrected in the Linux 2.6.39 kernel with its release likely just hours away (it's been more than a week since the last -rc and commit activity has settled down, signaling a possible imminent release). The semaphores default also might not be flipped back yet again if the opposite behavior is affecting other Sandy Bridge hardware differently. Some other solution may be needed elsewhere to properly correct this problem for all Intel Sandy Bridge customers. Regardless, if you are an Intel user affected by these hangs, try setting i915.semaphores=1 on the kernel command line as a temporary fix. Update: under some select workloads with semaphores enabled, I can still manage to lock up the GPU, but it's less frequent than with them disabled. It's too bad though that this regression has been living in the Linux kernel for five days now, so late in the cycle, without being properly addressed, especially when it could have been spotted immediately with our automated per-commit driver testing had that tracker and hardware been set up. At least it's not as bad as the months-old and wide-scale Linux kernel power regression.