
Linux 6.1 Will Try To Print The CPU Core Where A Seg Fault Occurs


  • Linux 6.1 Will Try To Print The CPU Core Where A Seg Fault Occurs

    Phoronix: Linux 6.1 Will Try To Print The CPU Core Where A Seg Fault Occurs

    A change now merged for Linux 6.1 will attempt to print the CPU core where a segmentation fault happens. The hope is that, by printing the CPU/core involved, trends may materialize over time, with this information potentially being useful for helping to spot faulty CPUs...

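    For context, a quick way to see the new behaviour once you're running a 6.1 kernel is to force a segfault on purpose and then look at the kernel log. A minimal sketch (the file name is just an example, and the exact wording of the dmesg line is up to the kernel):

        /* segv_demo.c - deliberately trigger a segmentation fault.
         * Build: cc -o segv_demo segv_demo.c
         * After it crashes, check the kernel log (e.g. dmesg | tail);
         * on 6.1+ the fault report should also mention which CPU it hit.
         */
        #include <stdio.h>

        int main(void)
        {
            volatile int *p = NULL;   /* null pointer on purpose */

            printf("about to dereference NULL...\n");
            fflush(stdout);
            *p = 42;                  /* SIGSEGV happens here */
            return 0;
        }

    Note that the kernel only logs unhandled segfaults, so the program must not install its own SIGSEGV handler for the message to show up.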

  • #2
    What a strange idea, I thought segfaults were just normal C behaviour.



    • #3
      Originally posted by Mahboi View Post
      What a strange idea, I thought segfaults were just normal C behaviour.
      Well-written C programs shouldn't segfault. Frequent faults usually point to hardware failure. Bad RAM is the most common problem in my experience, followed by storage devices going bad; roughly on par with each other after that are PSU & motherboard failures (often because the one kills the other). I have rarely hit a bad CPU that didn't just fail completely (once in 30 years), but pinpointing a failed or failing CPU is probably a fairly acute problem in large systems with hundreds or thousands of physical packages and tens of cores on each package.

      Edit to add: For my personal systems in the past 10 years or so I've had three GPU failures (two AMD, one Nvidia), two separate instances of bad RAM modules, three PSUs & motherboards dying (that's what broke me of ever buying Gigabyte boards again; the other was an inherited ThinkServer), and four mechanical hard drive failures. Interestingly enough, no CPU or SSD failures.

      The usual indicator of hardware failure is that programs start crashing (segfaults or the equivalent), if there isn't an outright failure to POST/boot.
      Last edited by stormcrow; 06 October 2022, 07:27 PM.



      • #4
        So is it easier to lock processes to specific cores? If I watch CPU utilisation for loads which only partially load a system, the usage bounces between cores at an astonishing rate unless I faff around locking individual threads to individual cores.
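        For what it's worth, a process can also pin itself rather than relying on taskset; a minimal Linux-specific sketch using sched_setaffinity() (the choice of core 2 is arbitrary):

            /* pin_to_core.c - pin the calling process to a single CPU core.
             * Build: cc -o pin_to_core pin_to_core.c   (Linux only)
             */
            #define _GNU_SOURCE
            #include <sched.h>
            #include <stdio.h>
            #include <stdlib.h>

            int main(void)
            {
                int target_cpu = 2;        /* arbitrary example core */
                cpu_set_t set;

                CPU_ZERO(&set);
                CPU_SET(target_cpu, &set);

                /* pid 0 = the calling thread; threads created afterwards
                 * inherit this affinity mask. */
                if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                    perror("sched_setaffinity");
                    return EXIT_FAILURE;
                }

                printf("restricted to CPU %d, currently running on CPU %d\n",
                       target_cpu, sched_getcpu());
                return EXIT_SUCCESS;
            }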



        • #5
          Originally posted by Paradigm Shifter View Post
          So is it easier to lock processes to specific cores? If I watch CPU utilisation for loads which only partially load a system, the usage bounces between cores at an astonishing rate unless I faff around locking individual threads to individual cores.
          While I won't venture an opinion on the wisdom of locking threads to cores in any particular case, the reason threads and processes "bounce between" cores when the CPU isn't saturated is IO load/cache/thermal balancing, not some wacky indeterminate weirdness. Also, far more likely to happen going forward is processes being moved between performance and efficiency cores (on Intel/ARM systems; AMD manages that internally).



          • #6
            Originally posted by stormcrow View Post

            While I won't venture an opinion on the wisdom of locking threads to cores in any particular case, the reason threads and processes "bounce between" cores when the CPU isn't saturated is IO load/cache/thermal balancing, not some wacky indeterminate weirdness. Also, far more likely to happen going forward is processes being moved between performance and efficiency cores (on Intel/ARM systems; AMD manages that internally).
            Yes, I know the reasons. But unless a core is really bad and a process segfaults as soon as it lands on it, isn't there the potential to mis-identify the faulty core, since corruption may not be immediately evident and may only show up after further operations? That was pretty much my only point, although I wasn't really clear in how I expressed it.



            • #7
              Originally posted by Paradigm Shifter View Post

              Yes, I know the reasons. But unless a core is really bad and a process segfaults as soon as it lands on it, isn't there the potential to mis-identify the faulty core, since corruption may not be immediately evident and may only show up after further operations? That was pretty much my only point, although I wasn't really clear in how I expressed it.
              Ah understood.



              • #8
                Originally posted by Paradigm Shifter View Post

                Yes, I know the reasons. But unless a core is really bad and a process segfaults as soon as it lands on it, isn't there the potential to mis-identify the faulty core, since corruption may not be immediately evident and may only show up after further operations? That was pretty much my only point, although I wasn't really clear in how I expressed it.
                Corruption can be *very* rare but nonzero. With the so-called "mercurial cores" that Google and other hyperscalers found, they had to watch some CPUs for hours or days before they'd do something wrong. Or it would be something really weird where only very specific values led to wrong results. Compounding this, it was also sometimes an age thing, where some CPUs would only exhibit this behavior after being burned in for a few years, or even in their "old age."

                By reporting the core a segfault occurs on, the hope is presumably to log it and, over time, see a long-term trend emerge.
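                As a rough illustration of that kind of long-term logging, a tiny tool could scan kernel log lines and tally faults per CPU; a sketch that reads dmesg output on stdin (the "on CPU N" wording is an assumption about the new message format, so the pattern may need adjusting):

                    /* count_fault_cpus.c - tally fault reports per CPU.
                     * Example (hypothetical) usage:
                     *   dmesg | grep -i segfault | ./count_fault_cpus
                     */
                    #include <stdio.h>
                    #include <stdlib.h>
                    #include <string.h>

                    #define MAX_CPUS 4096

                    int main(void)
                    {
                        static long counts[MAX_CPUS];
                        char line[4096];

                        while (fgets(line, sizeof(line), stdin)) {
                            /* assume reports contain e.g. "... on CPU 17 ..." */
                            const char *p = strstr(line, "on CPU ");
                            if (!p)
                                continue;
                            long cpu = strtol(p + strlen("on CPU "), NULL, 10);
                            if (cpu >= 0 && cpu < MAX_CPUS)
                                counts[cpu]++;
                        }

                        for (int cpu = 0; cpu < MAX_CPUS; cpu++)
                            if (counts[cpu] > 0)
                                printf("CPU %4d: %ld fault(s)\n", cpu, counts[cpu]);
                        return 0;
                    }

                Run periodically, or fed from a persistent log, a lopsided count on one core would be exactly the kind of trend the change is hoping to surface.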



                • #9
                  Something interesting that came up with Sun's infamous E-cache parity error issue during an Oxide twitter space a while back:

                  The E-cache on the E10K's CPU modules would get corrupted, and because it only had parity and not ECC the corruption could be detected but not corrected. This was something that just periodically happened, though perhaps some CPU modules were more prone to it because of the silicon lottery. It turned out later that the corruption was most likely to be discovered when a *different CPU* (than the one with the corruption) snooped the corrupted cache line and, upon reading it, found it to be incorrect, so the issue would appear to have occurred on the wrong CPU. Sun techs would replace that module thinking they had solved the issue, only for it to reappear. They might end up replacing every CPU module before finally replacing the one with the actual problem.

                  I expect the general intent here is to detect things that happen in-core, which probably can't be snooped by other cores the way cache lines can, but there is a cautionary tale in that story: just because the error message came from one core doesn't necessarily mean the problem originated there.



                  • #10
                    It shouldn't be only SIGSEGV, but also SIGBUS. Others?
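                    A userspace variant of the same idea is to catch both signals and record the CPU the handler runs on; a minimal sketch with sigaction() and sched_getcpu() (only an approximation, since the handler isn't guaranteed to run on the exact CPU that took the fault, which is part of why doing it in the kernel is more useful):

                        /* fault_cpu_report.c - report the CPU a fault handler runs on. */
                        #define _GNU_SOURCE
                        #include <sched.h>
                        #include <signal.h>
                        #include <stdio.h>
                        #include <stdlib.h>
                        #include <string.h>
                        #include <unistd.h>

                        static void on_fault(int sig)
                        {
                            /* fprintf isn't async-signal-safe; fine for a demo */
                            fprintf(stderr, "signal %d handled on CPU %d\n",
                                    sig, sched_getcpu());
                            _exit(EXIT_FAILURE);
                        }

                        int main(void)
                        {
                            struct sigaction sa;

                            memset(&sa, 0, sizeof(sa));
                            sa.sa_handler = on_fault;
                            sigaction(SIGSEGV, &sa, NULL);
                            sigaction(SIGBUS, &sa, NULL);   /* the other obvious one */

                            volatile int *p = NULL;
                            *p = 1;                         /* trigger SIGSEGV */
                            return 0;
                        }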

