
Linux 6.1 Will Try To Print The CPU Core Where A Seg Fault Occurs


  • Jabberwocky
    replied
    This can help with debugging hardware/compilers/interpreters. The most pain I have experienced in my professional career (excluding politics and mismanagement) was debugging unexpected float overflows in PHP 5 on Core2-based Xeons.

    Re: Strange segfaults....

    Fast forward to 2020. I spent a long time debugging crashes when my system was under pressure. Turned out there was nothing wrong with my system.

    It was my UPS relay (no issues with the UPS on battery or when plugged directly into the wall outlet).



  • oiaohm
    replied
    Originally posted by marlock View Post
    If you're aware of this config, you could configure the task from core Y to core Z, see if the segfault reports core Z or if core Y keeps segfaulting other tasks?

    Also you can always just replace core Y and see what happens... if the new core segfaults you know the issue wasn't on core Y but on another core or the software itself (and put the core back to use)... if it fixes things, hurray, the feature helped prevent a blind hunt.
    That is also the catch, even if you are aware of it. You can have cases where, by pure luck, a thread or process is always assigned to the same core, and multiple threads can end up like this. Remember that the Linux kernel's NUMA-aware scheduler will attempt to avoid migrating threads between cores, because transferring threads between cores has a cost. The order in which everything starts on the system can basically result in fixed core placement of threads by luck. And that luck can hold: every time you boot the system with a particular workload, everything lands on the same cores, nothing in the workload gives the kernel a reason to move anything between cores, and nothing is configured to tell the kernel to do so.

    With explicit configuration you can force thread placement, but the reality is that the right workload can produce the same thing on the Linux kernel without any settings. So: do you feel lucky, punk?

    Corner cases are a true pain. At the base of them is Sod's law ("if something can go wrong, it will") or Murphy's law ("anything that can go wrong will go wrong").

    Basically, you need to be aware that the CPU core reported for a segfault may not be the cause, and deeper investigation is required because of some of the wacky corner cases that are possible.
    Last edited by oiaohm; 10 October 2022, 09:46 AM.
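
    A quick way to sanity-check that kind of accidental pinning is to compare where each thread of a process last ran with which CPUs it is allowed to run on. Below is a minimal, illustrative Python sketch for Linux; the function names are made up for the example, but the procfs "processor" field it reads is the 39th field of /proc/<pid>/task/<tid>/stat:

        import os
        import sys

        def last_ran_cpu(pid, tid):
            """CPU a thread last ran on: field 39 of /proc/<pid>/task/<tid>/stat."""
            with open(f"/proc/{pid}/task/{tid}/stat") as f:
                stat = f.read()
            # The comm field may contain spaces, so split after the closing ')'.
            fields = stat.rsplit(")", 1)[1].split()
            return int(fields[36])  # field 39 overall; index 36 once pid and comm are stripped

        def report(pid):
            allowed = os.sched_getaffinity(pid)  # CPUs the scheduler may place this process on
            print(f"pid {pid}: allowed CPUs = {sorted(allowed)}")
            for tid in os.listdir(f"/proc/{pid}/task"):
                print(f"  tid {tid}: last ran on CPU {last_ran_cpu(pid, int(tid))}")

        if __name__ == "__main__":
            report(int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid())

    If the same threads keep showing the same CPU across runs and reboots even though the affinity mask is wide open, the workload has effectively pinned itself by luck, and a per-core segfault pattern needs to be read with that in mind.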



  • coder
    replied
    Originally posted by marlock View Post
    If you're aware of this config, you could configure the task from core Y to core Z, see if the segfault reports core Z or if core Y keeps segfaulting other tasks?

    Also you can always just replace core Y and see what happens... if the new core segfaults you know the issue wasn't on core Y but on another core or the software itself (and put the core back to use)... if it fixes things, hurray, the feature helped prevent a blind hunt.
    I think the idea is that the thread <-> core mapping will naturally tend to get jumbled, from one run to the next. Over time, a picture should emerge from this stochastic process, when one core has a markedly higher failure rate than the rest.

    If disabling that core causes the total failure rate to drop back down to baseline, then you have your culprit. If not, then perhaps the entire CPU needs to be replaced (unless there's another core with disproportionately high errors).

    The key is simply to aggregate enough data.
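
    One way to aggregate that data is to count per-core segfault reports out of the kernel log. Here is a rough Python sketch, assuming the Linux 6.1 message contains something like "segfault at ... likely on CPU N"; the exact wording is an assumption, so adjust the regex to whatever your kernel actually prints:

        import re
        import subprocess
        from collections import Counter

        # Assumed pattern for the 6.1+ segfault line; check `dmesg | grep segfault`
        # on your kernel and adjust the regex to match its exact wording.
        SEGFAULT_RE = re.compile(r"segfault at \S+ .*likely on CPU (\d+)")

        def segfaults_per_cpu():
            """Count segfault reports per CPU core found in the kernel log."""
            dmesg = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
            counts = Counter()
            for line in dmesg.stdout.splitlines():
                match = SEGFAULT_RE.search(line)
                if match:
                    counts[int(match.group(1))] += 1
            return counts

        if __name__ == "__main__":
            for cpu, n in segfaults_per_cpu().most_common():
                print(f"CPU {cpu}: {n} segfaults")

    Run it periodically (or feed it an archived journal) and a genuinely defective core should eventually stand out as an outlier, while software bugs should stay spread roughly evenly across cores.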



  • marlock
    replied
    If you're aware of this config, you could configure the task from core Y to core Z, see if the segfault reports core Z or if core Y keeps segfaulting other tasks?

    Also you can always just replace core Y and see what happens... if the new core segfaults you know the issue wasn't on core Y but on another core or the software itself (and put the core back to use)... if it fixes things, hurray, the feature helped prevent a blind hunt.
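
    For the first test you don't even need to touch hardware: pin the suspect task to a different core and watch whether the reported core follows it. A minimal Python sketch using Linux CPU affinity (the PID and core number are placeholders):

        import os
        import sys

        def pin_process(pid, cpu):
            """Restrict every thread of an existing process to a single CPU core (Linux only)."""
            for tid in os.listdir(f"/proc/{pid}/task"):
                os.sched_setaffinity(int(tid), {cpu})  # affinity is per-thread, so set each TID
            print(f"pid {pid}: now restricted to CPU {cpu}")

        if __name__ == "__main__":
            # e.g. `python3 pin_to_core.py 1234 5` to move the task off "core Y" onto "core Z"
            pin_process(int(sys.argv[1]), int(sys.argv[2]))

    The same thing can be done from the shell with util-linux's taskset (taskset -a -c <cpu> -p <pid>). If the segfaults follow the process to core Z, suspect the software; if they keep pointing at core Y under other tasks, the core itself looks guilty.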



  • oiaohm
    replied
    Originally posted by marlock View Post
    In your case, does processor Y stay the same across reboots, or can the ultimately segfaulting process get reassigned to a new core from time to time?

    In the case of an actual defective core, it will always be the same core across reboots, different failing processes, etc.
    This is where corner cases get so horrible. Think of a real-time setup where a process starts up assigned to a particular core and never changes cores, even across reboots. So core X failing while allocating memory could be something like that: it creates a buffer that is meant to be filled in by another process and returned.

    Processes don't always get dynamic assignment to cores; it depends on what users have configured.



  • marlock
    replied
    Originally posted by oiaohm View Post
    I forgot case C, where core X, due to a defect, fails to mark allocated memory as in use yet still creates the pointer and passes it along in a multi-threaded process; core Y then attempts to use a pointer that happens to point to memory that was never properly allocated, so it segfaults. If the process on Y has error handling for segfaults and is locked to Y, it's going to repeatedly error on that core.
    In your case, does processor Y stay the same across reboots, or can the ultimately segfaulting process get reassigned to a new core from time to time?

    In the case of an actual defective core, it will always be the same core across reboots, different failing processes, etc.



  • Brook-trout
    replied
    Linux used to print a single page to /dev/console. If you had a monitor plugged into /dev/console, you could take a picture of that single page of the segfault. So we all knew the code was there to capture a segfault; no one knew just how much information it captured. The rest of the segfault wound up, unordered, in the bit bucket.

    Glad to see the segfault will have a place to live in perpetuity.



  • oiaohm
    replied
    Originally posted by marlock View Post
    here is the thing...

    Situation A) the segfault is caused by core X being defective but triggered only later in any core (same or another, expected to be somewhat random), so the segfault report mentions a random core each time

    Situation B) the segfault is triggered immediately by the faulty core X, and the segfault report repeatedly mentions core X

    the article states that the report helps case B, while not in any way being relevant to case A

    is there a case C where core Y always triggers the segfault after core X fails at its task? is it bad to at least have the data to investigate a repeatedly reported core even if this is possible?
    I forgot case C, where core X, due to a defect, fails to mark allocated memory as in use yet still creates the pointer and passes it along in a multi-threaded process; core Y then attempts to use a pointer that happens to point to memory that was never properly allocated, so it segfaults. If the process on Y has error handling for segfaults and is locked to Y, it's going to repeatedly error on that core.

    A defective CPU core can be quite creative, like doing maths wrong and losing information. And losing information includes losing the page-table update information that should be passed into the system to mark memory pages as in use and so avoid a segfault. Losing this information is at some point going to cause a segfault, and in a multi-threaded application that often happens on a core other than the one responsible.

    Of course there is the simple case where it's not a defective core but defective code. So at least four different cases could produce the reported segfault, and there could be more.



  • marlock
    replied
    here is the thing...

    Situation A) the segfault is caused by core X being defective but triggered only later in any core (same or another, expected to be somewhat random), so the segfault report mentions a random core each time

    Situation B) the segfault is triggered immediately by the faulty core X, and the segfault report repeatedly mentions core X

    the article states that the report helps case B, while not in any way being relevant to case A

    is there a case C where core Y always triggers the segfault after core X fails at its task? is it bad to at least have the data to investigate a repeatedly reported core even if this is possible?
    Last edited by marlock; 08 October 2022, 04:47 PM. Reason: fixed missing word and typo



  • Serafean
    replied
    The absolute best failure I had was a chipset failure (what historically would have been called the south bridge).
    Half the PCIe lanes corrupted specific bits during data transfer (the x16 GPU was fine).
    This meant that about a third of the USB ports didn't work, one of the two M.2 ports corrupted data (thanks, btrfs, for saving it), the onboard NIC didn't work, but a PCIe NIC worked depending on where it was connected...
    Weirdest debugging session ever, hunting down interconnect diagrams of the X570 chipset. It all started with corrupted DHCP response packets and no network.

