Linux 6.1 Will Try To Print The CPU Core Where A Seg Fault Occurs

  • #21
    Originally posted by marlock
    In your case, does processor Y stay the same across reboots, or can the ultimately segfaulting process get reassigned to a new core from time to time?

    In the case of an actual defective core, it will always be the same core across reboots, different failing processes, etc.
    This is where corner cases get so horrible. Think of a real-time setup where a process starts up assigned to a particular core and never changes cores, even across reboots. So a failure blamed on core X could be something like that: a process pinned to core X creating a buffer that is then filled in by another process and returned.

    Processes aren't always dynamically assigned to cores; it depends on what users have configured.
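
    As a hedged illustration of that kind of static placement (this is not from the kernel patch or the article, just a demonstration of sched_setaffinity()), here is a minimal C sketch of a process pinning itself to one core; the core number 2 is an arbitrary example:

    /* pin_self.c - minimal sketch: pin the calling process to a single core.
       The core number is an arbitrary example value. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);                      /* pin to core 2 (example) */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return EXIT_FAILURE;
        }

        /* From here on this process only ever runs on core 2, so any
           segfault it triggers will be reported against that core. */
        printf("now running on CPU %d\n", sched_getcpu());
        pause();                               /* sit there like a pinned RT task */
        return 0;
    }

    A real-time deployment typically does this sort of pinning (or the equivalent via taskset or cpusets) at startup, which is exactly why the reported core can stay constant across reboots.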



    • #22
      If you're aware of this configuration, you could move the task from core Y to core Z and see whether the segfault now reports core Z, or whether core Y keeps segfaulting other tasks.

      Also, you can always just take core Y out of use and see what happens... if the replacement core segfaults too, you know the issue wasn't on core Y but on another core or in the software itself (and you can put core Y back into use)... if it fixes things, hurray, the feature helped prevent a blind hunt.
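
      A rough C sketch of both experiments, with a hypothetical PID and hypothetical core numbers standing in for Y and Z (re-pinning another task needs appropriate permissions, offlining a core needs root, and cpu0 usually cannot be offlined):

      /* retest_core.c - sketch of the two experiments above:
         1) move a suspect task to another core, 2) take the suspect core offline.
         PID and core numbers are hypothetical placeholders. */
      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>
      #include <sys/types.h>

      static int move_task_to_core(pid_t pid, int core)
      {
          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET(core, &set);
          /* Re-pin the already-running task: does the segfault follow it? */
          return sched_setaffinity(pid, sizeof(set), &set);
      }

      static int set_core_online(int core, int online)
      {
          char path[64];
          FILE *f;

          snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/online", core);
          f = fopen(path, "w");                /* requires root */
          if (!f)
              return -1;
          fprintf(f, "%d\n", online ? 1 : 0);
          return fclose(f);
      }

      int main(void)
      {
          /* Move PID 1234 from core 5 ("Y") to core 6 ("Z")... */
          if (move_task_to_core(1234, 6) != 0)
              perror("sched_setaffinity");

          /* ...or retire core 5 entirely and watch whether the segfaults stop. */
          if (set_core_online(5, 0) != 0)
              perror("offline core");

          return 0;
      }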



      • #23
        Originally posted by marlock
        If you're aware of this configuration, you could move the task from core Y to core Z and see whether the segfault now reports core Z, or whether core Y keeps segfaulting other tasks.

        Also, you can always just take core Y out of use and see what happens... if the replacement core segfaults too, you know the issue wasn't on core Y but on another core or in the software itself (and you can put core Y back into use)... if it fixes things, hurray, the feature helped prevent a blind hunt.
        I think the idea is that the thread <-> core mapping will naturally tend to get jumbled from one run to the next. Over time, a picture should emerge from this stochastic process if one core has a markedly higher failure rate than the rest.

        If disabling it causes the total failure rate to drop down to baseline, then you have your culprit. If not, then perhaps the entire CPU needs to be replaced (unless there's another core with disproportionately high errors).

        The key is simply to aggregate enough data.
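
        A small sketch of that aggregation, assuming the 6.1 segfault line carries a "likely on CPU N" marker as reported; treat the exact wording as an assumption and adjust the pattern for your kernel. Feed it kernel log text on stdin (e.g. dmesg | ./segfault_tally):

        /* segfault_tally.c - sketch: count segfault reports per CPU core.
           Assumes the log line contains "likely on CPU <n>"; adjust the
           pattern if your kernel's wording differs. */
        #include <stdio.h>
        #include <string.h>

        #define MAX_CPUS 1024

        int main(void)
        {
            static unsigned long hits[MAX_CPUS];
            char line[4096];
            int cpu;

            while (fgets(line, sizeof(line), stdin)) {
                const char *p = strstr(line, "likely on CPU ");
                if (p && sscanf(p, "likely on CPU %d", &cpu) == 1 &&
                    cpu >= 0 && cpu < MAX_CPUS)
                    hits[cpu]++;
            }

            /* One core with a markedly higher count than the rest is the suspect. */
            for (cpu = 0; cpu < MAX_CPUS; cpu++)
                if (hits[cpu])
                    printf("CPU %4d: %lu segfaults\n", cpu, hits[cpu]);

            return 0;
        }

        Collected over enough runs, with enough natural shuffling of threads across cores, a defective core should stand out in that histogram.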



        • #24
          Originally posted by marlock
          If you're aware of this configuration, you could move the task from core Y to core Z and see whether the segfault now reports core Z, or whether core Y keeps segfaulting other tasks.

          Also, you can always just take core Y out of use and see what happens... if the replacement core segfaults too, you know the issue wasn't on core Y but on another core or in the software itself (and you can put core Y back into use)... if it fixes things, hurray, the feature helped prevent a blind hunt.
          That is also the catch, if you are aware of it. You can have cases where, by pure luck, a thread or process has always been assigned to the same core, and multiple threads can end up like this. Remember that the Linux kernel's NUMA-aware scheduler tries to avoid moving threads between cores, because migrating a thread has a cost. The order in which everything starts on the system can, by luck, result in effectively fixed core placement of threads. And that luck can repeat: every time you boot the system with a particular workload, everything lands on the same cores, and nothing in the workload gives the kernel a reason to move anything between cores, with nothing configured to tell it to do so.

          With explicit configuration you can force thread placement, but the reality is that the right workload can produce the same effect on the Linux kernel without any settings at all. So: do you feel unlucky, punk?

          Corner cases are a true pain. At the base of them is Sod's law ("if something can go wrong, it will") or Murphy's law ("anything that can go wrong will go wrong").

          Basically, you need to be aware that the CPU core reported with a segfault may not be the cause, and deeper investigation is required because of some of the wacky corner cases that are possible.
          Last edited by oiaohm; 10 October 2022, 09:46 AM.
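
          To check whether a workload really is sitting on one core "by luck", a tiny C sketch like this (an illustration only, nothing from the kernel patch) can log every time the calling process changes CPU:

          /* watch_migration.c - sketch: report whenever this process changes CPU.
             On a lightly loaded machine the printed number often never changes,
             even though nothing was configured to pin the process. */
          #define _GNU_SOURCE
          #include <sched.h>
          #include <stdio.h>
          #include <unistd.h>

          int main(void)
          {
              int last = -1;

              for (int i = 0; i < 60; i++) {   /* sample once a second for a minute */
                  int cpu = sched_getcpu();
                  if (cpu != last) {
                      printf("now on CPU %d\n", cpu);
                      last = cpu;
                  }
                  sleep(1);
              }
              return 0;
          }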



          • #25
            This can help with debugging hardware/compilers/interpreters. The most pain I have experienced in my professional career (excluding politics and mismanagement) was debugging unexpected float overflows in PHP 5 on Core 2-based Xeons.

            Re: Strange segfaults....

            Fast forward to 2020. I spent a long time debugging crashes that occurred whenever my system was under pressure. It turned out there was nothing wrong with my system.

            It was my UPS relay (no issues with the UPS on battery or when plugged directly into the wall outlet).

