Here Is Why The Linux 4.2 Kernel Is Messing Up On Some Ubuntu Systems
The Linux 4.2 kernel that's currently under development ships many new features, but as I've been writing about and tweeting for a while, the 4.2 Git code hasn't been booting on many systems in my test lab for over a week. Various Phoronix readers have also been able to reproduce these kernel panics, which happen almost immediately in the boot process. Here's the root problem affecting Linux 4.2 on my daily Linux benchmarking systems...
As I hadn't seen any Linux kernel mailing list thread similar to the issues I've been encountering, this weekend I finally got around to investigating the problem further on some of the test systems. I had deferred doing so earlier for lack of time and because many of the systems running into the problems are slow -- the fastest system I could use for bisecting is a dual-socket AMD Opteron 2384 box, not any Broadwell/Haswell greatness.
In the past I've been able to automatically bisect Linux kernel regressions via the Phoronix Test Suite benchmarking software for finding power consumption regressions and performance changes, but this time I wasn't initially able to do so since the affected systems were left in an unbootable state. So rather than bisecting the Linux 4.2 oddity by hand when I already had a lot of work on the table, I set about improving the Phoronix Test Suite and Phoromatic so they could automatically handle this situation and be more useful for similar problems going forward.
The first step was changing how the Phoronix Test Suite interacts with the system. When running a test sanctioned by the Phoromatic Server, if the system panics or fails to boot during the testing process or during the arbitrary set-context process for getting the system into the desired state (e.g. installing a specific version of the Linux kernel and then rebooting prior to running the tests), it now reboots automatically into the last successfully booted kernel rather than the latest kernel. Once that was taken care of, it was just a matter of making a PTS test profile of the pass/fail type that simply passes if the system boots fine, done by modifying one of Intel's open-sourced test profiles (systemd-boot-kernel).
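The fallback rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Phoronix Test Suite's actual code: the function name, the kernel version strings, and the notion of a stored set of "known good" kernels are all assumptions made for the example.

```python
def pick_fallback_kernel(installed, booted_ok):
    """Return the newest installed kernel that is known to have booted.

    installed: kernel names ordered oldest to newest (hypothetical input).
    booted_ok: set of kernel names recorded as having booted successfully.
    Returns None if no installed kernel has ever booted.
    """
    for kernel in reversed(installed):
        if kernel in booted_ok:
            return kernel
    return None


if __name__ == "__main__":
    # Hypothetical version strings, purely for illustration.
    installed = ["4.0.0", "4.1.0", "4.2.0-rc-git"]
    booted_ok = {"4.0.0", "4.1.0"}
    print(pick_fallback_kernel(installed, booted_ok))
```

How the "booted successfully" record is kept (for instance, written out by the pass/fail boot test once the system is up) is left open here.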
With those changes making it possible to recover after a kernel boot failure, the Linux 4.2 kernel code could be bisected automatically to find the revision that was causing a handful of systems in the basement server room to fail to boot. The Phoromatic-controlled process then went off and ran its Git bisection. But it didn't find the problem.
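For readers unfamiliar with bisecting, the core of it is a binary search driven by a pass/fail check, here "did the kernel boot?". The Python sketch below is a simplification that assumes a linear list of commits with one known-good and one known-bad end; real Git history is a graph, and this is not Phoromatic's implementation. The `boots` callback is a hypothetical stand-in for building, installing, and booting a given revision.

```python
from typing import Callable, Sequence


def find_first_bad(commits: Sequence[str], boots: Callable[[str], bool]) -> str:
    """Return the first commit that fails to boot.

    commits: ordered oldest to newest; commits[0] is assumed good and
             commits[-1] is assumed bad.
    boots:   hypothetical callback returning True if that revision boots.
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if boots(commits[mid]):
            lo = mid   # the breakage was introduced after mid
        else:
            hi = mid   # the breakage is at mid or earlier
    return commits[hi]
```

If every revision built this way boots, as happened here, the search has nothing to converge on, which is the hint that the difference lies outside the source revisions being tested.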
After being a bit puzzled and wondering whether the bisecting code was broken, I realized it must be a kernel Kconfig issue. When doing the kernel bisecting, the custom script uses "make localmodconfig" to speed up the kernel build by just building the modules needed by the host system. However, the daily benchmarks of the Linux kernel Git at LinuxBenchmarking.com use the Ubuntu Mainline Kernel PPA. As explained before, those packages provide a generic Ubuntu kernel built from the mainline vanilla/virgin source code without any distribution patches, etc. With them being published daily, it saves us from having to build the packages in our automated test farm, and they're available to anyone wishing to reproduce our test results, check the kernel configuration, etc. It's just a nice independent source for easily fetching a kernel that can be grabbed by others to reproduce our tests.
With there being many kernel configuration differences between "make localmodconfig" on the clean source tree and the generic configuration shipped by the Ubuntu Mainline Kernel PPA, the next step was working out how the Phoronix Test Suite could leverage the just-completed modifications to determine the problematic CONFIG_* options. Fortunately, years ago I had been playing around with PTS for "Project Karsk", which was about making it easier to find software/driver optimizations and autonomously generating an ideal kernel configuration. While those efforts were about finding the maximum kernel performance, for these purposes I just needed to figure out which CONFIG option(s) were breaking the system. After doing that, and several more beers in the process, progress was being made...
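To illustrate the general idea of hunting down a problematic CONFIG option, here is a rough Python sketch that diffs two kernel .config files and then halves the set of differing options until one remains. It is not how the Phoronix Test Suite or Project Karsk does it. It assumes a single option is to blame and ignores Kconfig dependencies between options, which a real search would have to respect. The `boots` callback is a hypothetical stand-in for building and booting a kernel with the trial configuration.

```python
def parse_config(text):
    """Parse .config text into {option: value}; unset options are omitted."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Lines such as "# CONFIG_FOO is not set" are comments and are skipped.
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def differing_options(good, bad):
    """Sorted names of options whose value differs (missing means unset)."""
    names = set(good) | set(bad)
    return sorted(n for n in names if good.get(n) != bad.get(n))


def narrow_culprit(good, bad, boots):
    """Halve the differing options until a single suspect remains.

    Assumes exactly one option causes the failure (a simplification).
    """
    candidates = differing_options(good, bad)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        trial = dict(good)
        for name in half:  # apply the "bad" setting for this half only
            if name in bad:
                trial[name] = bad[name]
            else:
                trial.pop(name, None)
        if boots(trial):
            candidates = candidates[len(candidates) // 2:]
        else:
            candidates = half
    return candidates[0] if candidates else None
```

In use, `good` would come from the localmodconfig-generated .config that boots and `bad` from the generic configuration that does not, with each trial costing one kernel build and one boot attempt.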