Linux 6.1 Will Try To Print The CPU Core Where A Seg Fault Occurs


  • marlock
    replied
    here is the thing...

Situation A) the segfault is caused by core X being defective but is only triggered later on some core (the same or another, expected to be somewhat random), so the segfault report mentions a different core each time

    Situation B) the segfault is triggered immediately by the faulty core X, and the segfault report repeatedly mentions core X

the article states that the report helps in case B, while not being relevant to case A in any way

is there a case C where core Y always triggers the segfault after core X fails at its task? even if that is possible, is it bad to at least have the data to investigate a repeatedly reported core? (see the sketch below)
    Last edited by marlock; 08 October 2022, 04:47 PM. Reason: fixed missing word and typo
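
A minimal sketch of gathering that data, assuming the new 6.1 report line ends in "likely on CPU <n>" (the exact wording of the kernel message may differ); pipe the kernel log through it and see whether one core keeps coming out on top:

    /* segtally.c: count per-CPU segfault reports from the kernel log.
     * Usage: dmesg | ./segtally
     * The "likely on CPU " search string is an assumption about the
     * 6.1 message format; adjust it to match your kernel. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_CPUS 1024

    int main(void)
    {
        static long count[MAX_CPUS];
        char line[4096];
        char *p;
        int cpu;

        while (fgets(line, sizeof line, stdin)) {
            p = strstr(line, "likely on CPU ");
            if (p && sscanf(p, "likely on CPU %d", &cpu) == 1 &&
                cpu >= 0 && cpu < MAX_CPUS)
                count[cpu]++;
        }
        for (cpu = 0; cpu < MAX_CPUS; cpu++)
            if (count[cpu])
                printf("CPU %3d: %ld segfault(s)\n", cpu, count[cpu]);
        return 0;
    }

In case B (or a hypothetical case C) one core should dominate the tally over time; in case A the counts should stay roughly uniform.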



  • Serafean
    replied
The absolute best failure I had was a chipset failure (in what historically would have been called the south bridge).
Half the PCIe lanes corrupted specific bits during data transfer (the x16 GPU was fine).
This meant that about a third of the USB ports didn't work, one of the two M.2 ports corrupted data (thanks, btrfs, for saving it), the onboard NIC didn't work, but a PCIe NIC worked depending on where it was connected...
Weirdest debugging session ever, hunting down interconnect diagrams of the X570 chipset. It all started with corrupted DHCP response packets and no network.



  • AdrianBc
    replied
    Originally posted by coder View Post
    That's a lot of failures!

    In the past 15 years, I've had the following hardware failures (@ homelab):
    • 1 motherboard
    • 1 hard disk
    • 1 PSU (but it was a cheap one I inherited, rather than one I bought).
    • The bearings died on a cheap case fan.
    • Replaced a barely-used hard drive, when it hit a few uncorrectable read errors.
    • A TV Tuner PCI card and a TV tuner USB stick both got too flaky for me to use.

    I have another motherboard that never worked, but it was a cheap BGA board that was already out of warranty and the issue might've actually been incompatible RAM. It wasn't worth it for me to troubleshoot further, so I cut my losses.

    For best hardware reliability:
    1. Don't overclock. A few % more performance likely isn't worth any flakiness before you're ready to replace the HW.
2. Use a quality UPS with line filtration & over/under-voltage protection. Ideally, sine-wave output.
    3. Use a quality PSU (power supply).
    4. Use ECC memory (requires compatible motherboard + CPU), if possible.
      • If not, at least avoid budget RAM and run > 1 pass of memtest when you install it. I also buy RAM rated for a slightly faster speed than I plan to use.
    5. Try to have ample cooling, especially of motherboard components like VRMs.
    6. Either use a case with dust filters or clean it, periodically.
      • If dust filters, periodically clean them!
7. Don't be an early adopter. Waiting until stuff has been released for 6-8 months can often net you lower prices and later revs/steppings of both boards and CPUs, plus later firmware/microcode revisions. I've done well on this point.
    8. If buying Intel, Xeon-branded SKUs tend to be binned for greater reliability. Not sure about Ryzen Pro, but maybe?
    9. Update the firmware of your SSDs, when you first install them (i.e. before formatting & copying data onto them).


I agree with this advice. I just want to add that, because I always use ECC memory in any computer larger than a NUC, I have caught a few cases where memory modules that had previously worked fine for many years started to develop frequent errors, which ECC corrected. Warned by this, I was able to replace those modules in time.

Without ECC, I likely would not have noticed any errors until some unrecoverable corruption had already happened in my files.


  • joshx1
    replied
Great change for those using Curve Optimizer on AMD. Linux already shows which cores have problems; this should help further.



  • CochainComplex
    replied
    Originally posted by coder View Post
Don't be an early adopter. Waiting until stuff has been released for 6-8 months can often net you lower prices and later revs/steppings of both boards and CPUs, plus later firmware/microcode revisions. I've done well on this point.
On reliability I agree with all points, but not concerning prices. Unfortunately, over the last few years, with "the flu" and the related supply-chain shortages, some hardware got a bit more expensive 6 months after debut and only then went back to MSRP.

Ryzen 3600X, German market, since launch:

    screenshot.png



  • oiaohm
    replied
    Originally posted by Paradigm Shifter View Post
    Yes, I know the reasons. But unless a core is really bad, and a process segfaults as soon as it lands on it, there is the potential to mis-identify? As corruption may not be immediately evident but require further operations? That was pretty much my only point, although I wasn't really clear with how I expressed it.
It's pointer maths plus threads migrating between cores that makes this likely rather than 100 percent certain.

So let's say you compute Y = X + 5 from address X, then do a store to Y. If the bug is that the CPU isn't doing the maths correctly and added, let's say, 10 instead of 5, you now have a segfault.

Now, between the add that turns address X into Y and the store that uses Y, it's possible the scheduler has transferred the thread to another CPU core for some reason.
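
Here is a minimal C sketch of that window, purely illustrative (the wild offset stands in for the hypothetical wrong add):

    /* migrate.c: the core that computes a bad address need not be the
     * core that faults on it. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        int buf[4];
        volatile int *y = buf + (1 << 20); /* stand-in for "X + 10 instead
                                              of X + 5": far outside buf */
        int cpu_at_calc = sched_getcpu();  /* core that did the arithmetic */

        fprintf(stderr, "address computed on CPU %d\n", cpu_at_calc);

        /* The scheduler may migrate the thread between the lines above
         * and the store below; the kernel attributes the segfault to
         * whichever core executes the store, which need not be
         * cpu_at_calc. */
        *y = 42;                           /* faulting store */
        return 0;
    }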

In reality, the report is 100% sure about which CPU core issued the store that made the MMU raise the segfault. The problem is that it's only, say, 99.9% sure that this is the CPU core with the actual defect. The other 0.1% is the chance that the thread was transferred between cores at just the right time to make the fault appear on a core that has no defect. Please note this could be closer to 99.999%, or as bad as 1 in 10, in the real world; we don't know yet without real-world testing.

You don't want to be taking a perfectly functional core offline. Let's say you took a core offline as soon as a segfault landed on it: even if the code you are running is otherwise segfault-free, with this margin of error it is in theory possible to disable every core except the defective one, if everything goes the wrong way.

Yes, the more reports of the same problem on the same core, the more likely the problem is real. This has to be field-deployed and monitored to work out the actual risk of a thread from a defective core being transferred to another core at just the right point, so that the other core segfaults instead of the one responsible. That would then become a policy: if a core has over X segfaults in a Y time frame, you can be fairly sure it's a dud, so disable it (see the sketch below). Of course this needs your code to be segfault-free, or buggy software becomes another source of false positives.

Adding this feature is good; working out how to use it in deployment will take some time, but once it can be used it will be a good thing. Think about it: on a 64-core chip, disabling one or two cores is already done in some cases to increase system performance, because the reduced power consumption reduces heat and allows the other cores to go faster. So disabling a core is not a really big deal any more, because it might not in fact cost you any performance. Having a core do incorrect calculations is a really big deal.
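
As a sketch of what that "disable it" step could look like: Linux already exposes a CPU-hotplug knob in sysfs, so a watchdog only needs to write "0" to it once a core crosses the threshold. The threshold and the counts below are made up for illustration:

    /* offline.c: take suspect cores offline once they cross a segfault
     * threshold. Writing "0" to the sysfs "online" file is the standard
     * Linux hotplug interface (needs root). The policy is illustrative. */
    #include <stdio.h>

    #define SEGFAULT_LIMIT 5   /* hypothetical "X segfaults in Y window" */

    static int offline_cpu(int cpu)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/online", cpu);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fputs("0", f);         /* "0" = offline, "1" = online */
        return fclose(f) == 0 ? 0 : -1;
    }

    int main(void)
    {
        /* Imagine these per-core counts came from monitoring the log. */
        long count[] = { 0, 0, 7, 1 };

        for (int cpu = 0; cpu < 4; cpu++)
            if (count[cpu] > SEGFAULT_LIMIT) {
                fprintf(stderr, "CPU %d over limit, offlining\n", cpu);
                offline_cpu(cpu);
            }
        return 0;
    }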



  • coder
    replied
    Originally posted by stormcrow View Post
Edit to add: For my personal systems in the past 10 years or so I've had 3 GPU failures (two AMD, one Nvidia), 2 separate instances of bad RAM modules, three PSUs & motherboards dying (it's what broke me of ever buying Gigabyte boards again; the other was an inherited ThinkServer), and 4 mechanical hard drive failures. Interestingly enough, no CPU or SSD failures.
    That's a lot of failures!

    In the past 15 years, I've had the following hardware failures (@ homelab):
    • 1 motherboard
    • 1 hard disk
    • 1 PSU (but it was a cheap one I inherited, rather than one I bought).
    • The bearings died on a cheap case fan.
    • Replaced a barely-used hard drive, when it hit a few uncorrectable read errors.
    • A TV Tuner PCI card and a TV tuner USB stick both got too flaky for me to use.

    I have another motherboard that never worked, but it was a cheap BGA board that was already out of warranty and the issue might've actually been incompatible RAM. It wasn't worth it for me to troubleshoot further, so I cut my losses.

    For best hardware reliability:
    1. Don't overclock. A few % more performance likely isn't worth any flakiness before you're ready to replace the HW.
2. Use a quality UPS with line filtration & over/under-voltage protection. Ideally, sine-wave output.
    3. Use a quality PSU (power supply).
    4. Use ECC memory (requires compatible motherboard + CPU), if possible.
      • If not, at least avoid budget RAM and run > 1 pass of memtest when you install it. I also buy RAM rated for a slightly faster speed than I plan to use.
    5. Try to have ample cooling, especially of motherboard components like VRMs.
    6. Either use a case with dust filters or clean it, periodically.
      • If dust filters, periodically clean them!
7. Don't be an early adopter. Waiting until stuff has been released for 6-8 months can often net you lower prices and later revs/steppings of both boards and CPUs, plus later firmware/microcode revisions. I've done well on this point.
    8. If buying Intel, Xeon-branded SKUs tend to be binned for greater reliability. Not sure about Ryzen Pro, but maybe?
    9. Update the firmware of your SSDs, when you first install them (i.e. before formatting & copying data onto them).
    Last edited by coder; 06 October 2022, 11:33 PM.



  • coder
    replied
It shouldn't be only SIGSEGV, but also SIGBUS. Others?
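
For what a process can log on its own side, a minimal sketch that catches both signals and records the faulting address plus the CPU the handler happens to run on (which, per the discussion above, is not necessarily the defective core):

    /* faultlog.c: log SIGSEGV/SIGBUS with address and current CPU,
     * then re-raise with the default action so the kernel still logs it.
     * Note: fprintf/sched_getcpu are not async-signal-safe; this is a
     * sketch, not production code. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>

    static void fault_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)ctx;
        fprintf(stderr, "signal %d at address %p, handler on CPU %d\n",
                sig, info->si_addr, sched_getcpu());
        signal(sig, SIG_DFL);
        raise(sig);
    }

    int main(void)
    {
        struct sigaction sa = { 0 };
        sa.sa_sigaction = fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
        sigaction(SIGBUS, &sa, NULL);

        *(volatile int *)8 = 1;  /* deliberate bad store: raises SIGSEGV */
        return 0;
    }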



  • Developer12
    replied
Something interesting about Sun's infamous E-cache parity error issue came up during an Oxide Twitter space a while back:

The E-cache on the E10K's CPU modules would get corrupted, and because it only had parity and not ECC, errors could be detected but not corrected. This was something that just periodically happened, though perhaps some CPU modules were more prone to it because of the silicon lottery. It turned out that the corruption was most likely to be discovered when a *different CPU* (than the one with the corruption) snooped the corrupted cache line and, upon reading it, found it to be incorrect, so the issue would appear to have occurred on the wrong CPU. Sun techs would replace that module thinking they had solved the issue, only for it to reappear. They might end up replacing every CPU module before finally replacing the one with the problem.

I expect the general intent here is to detect things that happen in-core, which (unlike cache lines) probably can't be snooped by other cores, but there is a cautionary tale in this: just because the error message came from one core doesn't necessarily mean the error originated there.



  • Developer12
    replied
    Originally posted by Paradigm Shifter View Post

    Yes, I know the reasons. But unless a core is really bad, and a process segfaults as soon as it lands on it, there is the potential to mis-identify? As corruption may not be immediately evident but require further operations? That was pretty much my only point, although I wasn't really clear with how I expressed it.
Corruption can be *very* rare but nonzero. With the so-called "mercurial cores" that Google and other hyperscalers found, they had to watch some CPUs for hours or days before they'd do something wrong. Or it would be something really weird, where only very specific values led to wrong results. Compounding this, it was also sometimes an age thing, where some CPUs would only exhibit the behavior after being burned in for a few years, or even in their "old age."

By reporting which core a segfault occurs on, the hope may be to log them and see a long-term trend.

