This can help with debugging hardware/compilers/interpreters. The most pain I have experienced in my professional career (excluding politics and mismanagement) was debugging unexpected float overflows in PHP 5 on Core 2-based Xeons.
Re: Strange segfaults....
Fast forward to 2020. I spent a long time debugging crashes when my system was under pressure. Turned out there was nothing wrong with my system.
It was my UPS relay (no issues when on UPS battery or when plugged directly into the wall).
Linux 6.1 Will Try To Print The CPU Core Where A Seg Fault Occurs
-
Originally posted by marlock: If you're aware of this config, you could configure the task from core Y to core Z, see if the segfault reports core Z or if core Y keeps segfaulting other tasks?
Also you can always just replace core Y and see what happens... if the new core segfaults you know the issue wasn't on core Y but on another core or the software itself (and put the core back to use)... if it fixes things, hurray, the feature helped prevent a blind hunt.
With the right settings you can force thread placement, but the reality is that the right workload can produce the same outcome on the Linux kernel without any settings at all. So: do you feel lucky, punk?
Corner cases are a true pain. At the base of them is Sod's law ("if something can go wrong, it will") or Murphy's law ("anything that can go wrong will go wrong").
Basically you need to be aware that the CPU core reporting the segfault may not be the cause, and deeper investigation is required due to some of the wacky corner cases that are possible.
Last edited by oiaohm; 10 October 2022, 09:46 AM.
-
Originally posted by marlock: If you're aware of this config, you could configure the task from core Y to core Z, see if the segfault reports core Z or if core Y keeps segfaulting other tasks?
Also you can always just replace core Y and see what happens... if the new core segfaults you know the issue wasn't on core Y but on another core or the software itself (and put the core back to use)... if it fixes things, hurray, the feature helped prevent a blind hunt.
If disabling it causes the total failure rate to drop down to baseline, then you have your culprit. If not, then perhaps the entire CPU needs to be replaced (unless there's another core with disproportionately high errors).
The key is simply to aggregate enough data.
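That aggregation could be sketched with a few lines of Python. Note the log format here is an assumption: the lines below only approximate what the kernel's segfault message might look like with a per-CPU suffix, and the exact wording can differ by kernel version and architecture.

```python
import re
from collections import Counter

# Hypothetical dmesg excerpt; the real 6.1 message format may differ.
DMESG = """\
app[1234]: segfault at 0 ip 00005555 sp 00007fff error 4 likely on CPU 3
app[1240]: segfault at 8 ip 00005555 sp 00007fff error 4 likely on CPU 3
app[1251]: segfault at 0 ip 00005555 sp 00007fff error 4 likely on CPU 1
"""

CPU_RE = re.compile(r"segfault at \S+ .* likely on CPU (\d+)")

def faults_per_cpu(log: str) -> Counter:
    """Count segfault reports per CPU so a disproportionately
    failing core stands out against the baseline."""
    return Counter(int(m.group(1)) for m in CPU_RE.finditer(log))

print(faults_per_cpu(DMESG))  # Counter({3: 2, 1: 1})
```

With enough samples, one CPU with a count far above the others is the candidate to offline or replace; a flat distribution points back at the software.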
-
If you're aware of this config, you could configure the task from core Y to core Z, see if the segfault reports core Z or if core Y keeps segfaulting other tasks?
Also you can always just replace core Y and see what happens... if the new core segfaults you know the issue wasn't on core Y but on another core or the software itself (and put the core back to use)... if it fixes things, hurray, the feature helped prevent a blind hunt.
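On Linux that experiment can be scripted; here is a minimal sketch using Python's `os.sched_setaffinity` (a Linux-only API) to pin the current process to a chosen core. The core numbers are illustrative.

```python
import os

def pin_to_cpu(cpu: int) -> set:
    """Pin the calling process to a single CPU and return the
    resulting affinity set so the caller can verify the move."""
    os.sched_setaffinity(0, {cpu})   # 0 = the current process
    return os.sched_getaffinity(0)

# Pin to CPU 0, then restore the original affinity mask afterwards.
original = os.sched_getaffinity(0)
print(pin_to_cpu(0))                 # {0}
os.sched_setaffinity(0, original)
```

The same move can be done from the shell with `taskset -cp <core> <pid>`; if the segfaults follow the task to the new core, the core was never the problem.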
-
Originally posted by marlock: In your case, does processor Y stay the same across reboots, or can the ultimately segfaulting process get reassigned to a new core from time to time?
In case of an actual defective core, it will always be the same core across reboots, different failing processes, etc.
Processes don't always have dynamic assignment to cores; it depends on what the user has configured.
-
Originally posted by oiaohm: I forgot a case C, where core X, due to a defect, fails to mark allocated memory as in use yet creates the pointer and then passes it along in a multi-threaded process; core Y then attempts to use a pointer that doesn't actually point to allocated memory, so it segfaults. If the process on Y has error handling for segfaults and is locked to Y, it's going to repeatedly error on that core.
In case of an actual defective core, it will always be the same core across reboots, different failing processes, etc.
-
Linux used to print a single page to /dev/console. If you had a monitor plugged into /dev/console, you could take a picture of that single page of the segfault. So we all knew the code was there to capture a segfault; no one knew just how much information it captured. The rest of the segfault wound up unordered in the bit bucket.
Glad to see the segfault will have a place to live in perpetuity.
-
Originally posted by marlock: here is the thing...
Situation A) the segfault is caused by core X being defective but triggered only later in any core (same or another, expected to be somewhat random), so the segfault report mentions a random core each time
Situation B) the segfault is triggered immediately by the faulty core X, and the segfault report repeatedly mentions core X
the article states that the report helps case B, while not in any way being relevant to case A
is there a case C where core Y always triggers the segfault after core X fails at its task? is it bad to at least have the data to investigate a repeatedly reported core even if this is possible?
A defective CPU core can be quite creative, like doing maths wrong and losing information. Losing information includes losing a page-table update that should tell the system a memory page is in use, which exists to avoid a segfault. Losing that information is at some point going to cause a segfault, and in multi-threaded applications it is in a lot of cases not on the core responsible.
Of course there is the simple case where it's not a defective core but defective code. So at least four different cases could produce the report, and there could be more.
-
here is the thing...
Situation A) the segfault is caused by core X being defective but triggered only later in any core (same or another, expected to be somewhat random), so the segfault report mentions a random core each time
Situation B) the segfault is triggered immediately by the faulty core X, and the segfault report repeatedly mentions core X
the article states that the report helps case B, while not in any way being relevant to case A
is there a case C where core Y always triggers the segfault after core X fails at its task? is it bad to at least have the data to investigate a repeatedly reported core even if this is possible?
-
The absolute best failure I had was a chipset failure (historically would have been called south bridge) .
Half the PCIe lanes corrupted specific bits during data transfer. (The x16 GPU slot was fine.)
This meant that about a third of the USB ports didn't work, one of the two M.2 ports corrupted data (thanks, btrfs, for saving it), the onboard NIC didn't work, but a PCIe NIC worked depending on where it was connected...
Weirdest debugging session ever, hunting down interconnect diagrams of the X570 chipset. It all started with corrupted DHCP response packets and no network.