
Linux 6.1 Will Try To Print The CPU Core Where A Seg Fault Occurs


  • Linux 6.1 Will Try To Print The CPU Core Where A Seg Fault Occurs

    Phoronix: Linux 6.1 Will Try To Print The CPU Core Where A Seg Fault Occurs

    A change now merged for Linux 6.1 will attempt to print the CPU core where a segmentation fault happens. The hope is that, by printing the CPU/core involved, trends may materialize over time, with this information potentially being useful for helping to spot faulty CPUs...

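    For context, a quick way to see the new behaviour once you're running a 6.1 kernel is to force a segfault on purpose and then look at the kernel log. A minimal sketch (the file name is just an example, and the exact wording of the dmesg line is up to the kernel):

        /* segv_demo.c - deliberately trigger a segmentation fault.
         * Build: cc -o segv_demo segv_demo.c
         * After it crashes, check the kernel log (e.g. dmesg | tail);
         * on 6.1+ the fault report should also mention which CPU it hit.
         */
        #include <stdio.h>

        int main(void)
        {
            volatile int *p = NULL;   /* null pointer on purpose */

            printf("about to dereference NULL...\n");
            fflush(stdout);
            *p = 42;                  /* SIGSEGV happens here */
            return 0;
        }

    Note that the kernel only logs unhandled segfaults, so the program must not install its own SIGSEGV handler for the message to show up.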

  • #2
    What a strange idea, I thought segfaults were just normal C behaviour.



    • #3
      Originally posted by Mahboi View Post
      What a strange idea, I thought segfaults were just normal C behaviour.
      Well-written C programs shouldn't segfault. Frequent faults usually point to hardware failure. Bad RAM is the most common problem in my experience, followed by storage devices going bad; roughly on par with each other after that are PSU & motherboard failures (often because the one kills the other). I have rarely hit a bad CPU that didn't just fail completely (once in 30 years), but pinpointing a failed or failing CPU is probably a fairly acute problem in large systems with hundreds or thousands of physical packages and tens of cores on each package.

      Edit to add: For my personal systems in the past 10 years or so I've had three GPU failures (two AMD, one Nvidia), two separate instances of bad RAM modules, three PSUs & motherboards dying (that's what broke me of ever buying Gigabyte boards again; the other was an inherited ThinkServer), and four mechanical hard drive failures. Interestingly enough, no CPU or SSD failures.

      The usual indicator of hardware failure is that programs start crashing (segfaults or the equivalent), if there isn't an outright failure to POST/boot.
      Last edited by stormcrow; 06 October 2022, 07:27 PM.



      • #4
        So is it easier to lock processes to specific cores? If I watch CPU utilisation for loads which only partially load a system, the usage bounces between cores at an astonishing rate unless I faff around locking individual threads to individual cores.
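        For what it's worth, a process can also pin itself rather than relying on taskset; a minimal Linux-specific sketch using sched_setaffinity() (the choice of core 2 is arbitrary):

            /* pin_to_core.c - pin the calling process to a single CPU core.
             * Build: cc -o pin_to_core pin_to_core.c   (Linux only)
             */
            #define _GNU_SOURCE
            #include <sched.h>
            #include <stdio.h>
            #include <stdlib.h>

            int main(void)
            {
                int target_cpu = 2;        /* arbitrary example core */
                cpu_set_t set;

                CPU_ZERO(&set);
                CPU_SET(target_cpu, &set);

                /* pid 0 = the calling thread; threads created afterwards
                 * inherit this affinity mask. */
                if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                    perror("sched_setaffinity");
                    return EXIT_FAILURE;
                }

                printf("restricted to CPU %d, currently running on CPU %d\n",
                       target_cpu, sched_getcpu());
                return EXIT_SUCCESS;
            }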



        • #5
          Originally posted by Paradigm Shifter View Post
          So is it easier to lock processes to specific cores? If I watch CPU utilisation for loads which only partially load a system, the usage bounces between cores at an astonishing rate unless I faff around locking individual threads to individual cores.
          While I won't venture an opinion on the wisdom of locking threads to cores in any particular case, the reason threads and processes "bounce between" cores when the CPU isn't saturated is IO load/cache/thermal balancing, not some wacky indeterminate weirdness. Also, far more likely to happen going forward is processes being moved between performance and efficiency cores (on Intel/ARM systems; AMD manages that internally).



          • #6
            Originally posted by stormcrow View Post

            While I won't venture an opinion on the wisdom of locking threads to cores in any particular case, the reason threads and processes "bounce between" cores when the CPU isn't saturated is IO load/cache/thermal balancing, not some wacky indeterminate weirdness. Also, far more likely to happen going forward is processes being moved between performance and efficiency cores (on Intel/ARM systems; AMD manages that internally).
            Yes, I know the reasons. But unless a core is really bad and a process segfaults as soon as it lands on it, isn't there the potential to mis-identify the faulty core, since corruption may not be immediately evident and may only show up after further operations? That was pretty much my only point, although I wasn't really clear in how I expressed it.



            • #7
              Originally posted by Paradigm Shifter View Post

              Yes, I know the reasons. But unless a core is really bad and a process segfaults as soon as it lands on it, isn't there the potential to mis-identify the faulty core, since corruption may not be immediately evident and may only show up after further operations? That was pretty much my only point, although I wasn't really clear in how I expressed it.
              Ah understood.



              • #8
                Originally posted by Paradigm Shifter View Post

                Yes, I know the reasons. But unless a core is really bad and a process segfaults as soon as it lands on it, isn't there the potential to mis-identify the faulty core, since corruption may not be immediately evident and may only show up after further operations? That was pretty much my only point, although I wasn't really clear in how I expressed it.
                Corruption can be *very* rare but nonzero. With the so-called "mercurial cores" that Google and other hyperscalers found, they had to watch some CPUs for hours or days before they'd do something wrong. Or it would be something really weird where only very specific values led to wrong results. Compounding this, it was also sometimes an age thing, where some CPUs would only exhibit this behavior after being burned in for a few years, or even in their "old age."

                By reporting the core a segfault occurs on, the hope is presumably to log it and, over time, see a long-term trend emerge.
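                As a rough illustration of that kind of long-term logging, a tiny tool could scan kernel log lines and tally faults per CPU; a sketch that reads dmesg output on stdin (the "on CPU N" wording is an assumption about the new message format, so the pattern may need adjusting):

                    /* count_fault_cpus.c - tally fault reports per CPU.
                     * Example (hypothetical) usage:
                     *   dmesg | grep -i segfault | ./count_fault_cpus
                     */
                    #include <stdio.h>
                    #include <stdlib.h>
                    #include <string.h>

                    #define MAX_CPUS 4096

                    int main(void)
                    {
                        static long counts[MAX_CPUS];
                        char line[4096];

                        while (fgets(line, sizeof(line), stdin)) {
                            /* assume reports contain e.g. "... on CPU 17 ..." */
                            const char *p = strstr(line, "on CPU ");
                            if (!p)
                                continue;
                            long cpu = strtol(p + strlen("on CPU "), NULL, 10);
                            if (cpu >= 0 && cpu < MAX_CPUS)
                                counts[cpu]++;
                        }

                        for (int cpu = 0; cpu < MAX_CPUS; cpu++)
                            if (counts[cpu] > 0)
                                printf("CPU %4d: %ld fault(s)\n", cpu, counts[cpu]);
                        return 0;
                    }

                Run periodically, or fed from a persistent log, a lopsided count on one core would be exactly the kind of trend the change is hoping to surface.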



                • #9
                  Something interesting that came up with Sun's infamous E-cache parity error issue during an Oxide twitter space a while back:

                  The E-cache on the E10K's CPU modules would get corrupted, and because it only had parity and not ECC the corruption could be detected but not corrected. This was something that just periodically happened, though perhaps some CPU modules were more prone to it because of the silicon lottery. It turned out later that the corruption was most likely to be discovered when a *different CPU* (than the one with the corruption) snooped the corrupted cache line and, upon reading it, found it to be incorrect, so the issue would appear to have occurred on the wrong CPU. Sun techs would replace that module thinking they had solved the issue, only for it to reappear. They might end up replacing every CPU module before finally replacing the one with the actual problem.

                  I expect the general intent here is to detect things that happen in-core, which probably can't be snooped by other cores the way cache lines can, but there is a cautionary tale in that story: just because the error message came from one core doesn't necessarily mean the problem originated there.



                  • #10
                    It shouldn't be only SIGSEGV, but also SIGBUS. Others?
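                    A userspace variant of the same idea is to catch both signals and record the CPU the handler runs on; a minimal sketch with sigaction() and sched_getcpu() (only an approximation, since the handler isn't guaranteed to run on the exact CPU that took the fault, which is part of why doing it in the kernel is more useful):

                        /* fault_cpu_report.c - report the CPU a fault handler runs on. */
                        #define _GNU_SOURCE
                        #include <sched.h>
                        #include <signal.h>
                        #include <stdio.h>
                        #include <stdlib.h>
                        #include <string.h>
                        #include <unistd.h>

                        static void on_fault(int sig)
                        {
                            /* fprintf isn't async-signal-safe; fine for a demo */
                            fprintf(stderr, "signal %d handled on CPU %d\n",
                                    sig, sched_getcpu());
                            _exit(EXIT_FAILURE);
                        }

                        int main(void)
                        {
                            struct sigaction sa;

                            memset(&sa, 0, sizeof(sa));
                            sa.sa_handler = on_fault;
                            sigaction(SIGSEGV, &sa, NULL);
                            sigaction(SIGBUS, &sa, NULL);   /* the other obvious one */

                            volatile int *p = NULL;
                            *p = 1;                         /* trigger SIGSEGV */
                            return 0;
                        }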

