Linux 6.1 Will Try To Print The CPU Core Where A Seg Fault Occurs


  • #11
    Originally posted by stormcrow View Post
    Edit to add: For my personal systems in the past 10 years or so I've had 3 GPU failures (two AMD one Nvidia), 2 separate instances of bad RAM modules, three PSUs & motherboards dying (it's what broke me of ever buying Gigabyte boards ever again, the other was an inherited ThinkServer), and 4 mechanical hard drive failures. Interestingly enough no CPU or SSD failures.
    That's a lot of failures!

    In the past 15 years, I've had the following hardware failures (@ homelab):
    • 1 motherboard
    • 1 hard disk
    • 1 PSU (but it was a cheap one I inherited, rather than one I bought).
    • The bearings died on a cheap case fan.
    • Replaced a barely-used hard drive, when it hit a few uncorrectable read errors.
    • A TV Tuner PCI card and a TV tuner USB stick both got too flaky for me to use.

    I have another motherboard that never worked, but it was a cheap BGA board that was already out of warranty and the issue might've actually been incompatible RAM. It wasn't worth it for me to troubleshoot further, so I cut my losses.

    For best hardware reliability:
    1. Don't overclock. A few % more performance likely isn't worth any flakiness before you're ready to replace the HW.
    2. Use a quality UPS with line filtration & over/under protection. Ideally, sinewave output.
    3. Use a quality PSU (power supply).
    4. Use ECC memory (requires compatible motherboard + CPU), if possible.
      • If not, at least avoid budget RAM and run > 1 pass of memtest when you install it. I also buy RAM rated for a slightly faster speed than I plan to use.
    5. Try to have ample cooling, especially of motherboard components like VRMs.
    6. Either use a case with dust filters or clean it, periodically.
      • If dust filters, periodically clean them!
    7. Don't be an early adopter. Waiting until stuff has been released for 6-8 months can often net you lower prices and later revs/steppings of both boards and CPUs. Also, later firmware/microcode revisions. I've done well, on this point. Here are 2 big wins I scored:
    8. If buying Intel, Xeon-branded SKUs tend to be binned for greater reliability. Not sure about Ryzen Pro, but maybe?
    9. Update the firmware of your SSDs, when you first install them (i.e. before formatting & copying data onto them).
    Last edited by coder; 06 October 2022, 11:33 PM.

    Comment


    • #12
      Originally posted by Paradigm Shifter View Post
      Yes, I know the reasons. But unless a core is really bad, and a process segfaults as soon as it lands on it, there is the potential to mis-identify? As corruption may not be immediately evident but require further operations? That was pretty much my only point, although I wasn't really clear with how I expressed it.
      It's pointer math, plus threads migrating between cores, that makes this identification likely rather than 100 percent certain.

      Say you compute Y = X + 5 and then store to address Y. The bug is that the CPU is not doing the math correctly, so it added, say, 10 instead of 5, and now the store to the wrong address gives you a segfault.

      Between computing Y from X and the store using Y, it's possible the scheduler has migrated the thread to a different CPU core for some reason.

      In reality, the report is 100% certain about which CPU core's store made the MMU raise the segfault. The problem is that it's only, say, 99.9% certain that this is the core with the defect. The remaining 0.1% is the chance that the thread was migrated between cores at just the wrong time, making the fault appear on a core that has no defect. Note that in the real world this could be closer to 99.999%, or as bad as 1 in 10; we don't know yet without real-world testing.
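
      As a concrete illustration, here is a minimal sketch in C (not from the article; the function and values are hypothetical) of how the store that faults can run on a different core than the arithmetic that produced the bad address:

          /* Hypothetical sketch: the effective address is computed first,
           * but the faulting store may execute on another core if the
           * scheduler migrates the thread in between. */
          #include <stdint.h>

          void store_after_offset(uint8_t *x)
          {
              uint8_t *y = x + 5; /* a defective ALU might yield x + 10 here */
              /* <-- the scheduler may migrate the thread to another core here */
              *y = 0;             /* the segfault is attributed to whichever
                                   * core executes this store */
          }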

      You don't want to take a perfectly functional core offline. Say you did take a core offline as soon as it segfaulted, on the assumption that the code you are running is segfault-free: with this percentage of error it is in theory possible, if everything goes the wrong way, to disable every core except the defective one.

      Yes, the more reports about the same problem on the same core, the more likely the problem is real. This has to be field-deployed and monitored to work out the actual probability of a thread from a defective core being migrated at just the right point for another core to report the segfault instead of the one responsible. From that you could derive a rule: if a core has over X segfaults in a Y time frame, you can be fairly sure it is a dud, so disable it. Of course, this needs your code to be segfault-free, or bugs become another source of false positives.

      Adding this feature is good; working out how to use it in deployment will take some time, but once it can be used it will be a good thing. Think about it: on a 64-core chip, disabling one or two cores is already done in some cases to increase system performance, because reducing power consumption reduces heat and lets the other cores clock faster. So disabling a core is not a big deal any more, since it might not actually cost you any performance. Having a core do incorrect calculations is a really big deal.
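
      Mechanically, taking a core offline is already a one-line sysfs write. Here is a minimal sketch in C of how such a policy daemon might act (an assumption for illustration, not anything the kernel patch itself does; requires root, and cpu0 usually cannot be offlined on x86):

          #include <stdio.h>

          /* Write "0" to /sys/devices/system/cpu/cpuN/online to offline
           * core N; writing "1" brings it back. */
          int offline_cpu(int cpu)
          {
              char path[64];
              FILE *f;

              snprintf(path, sizeof(path),
                       "/sys/devices/system/cpu/cpu%d/online", cpu);
              f = fopen(path, "w");
              if (!f) {
                  perror(path);
                  return -1;
              }
              fputs("0", f);
              fclose(f);
              return 0;
          }

          int main(void)
          {
              return offline_cpu(2) ? 1 : 0; /* hypothetical suspect core */
          }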

      Comment


      • #13
        Originally posted by coder View Post
        Don't be an early adopter. Waiting until stuff has been released for 6-8 months can often net you lower prices and later revs/steppings of both boards and CPUs. Also, later firmware/microcode revisions. I've done well, on this point. Here are 2 big wins I scored:
        For reliability I agree with all points, but not on prices. Unfortunately, in the last few years, with "the flu" and supply-chain shortages, some hardware got a bit more expensive six months after debut and only then went back to MSRP.

        [Chart: Ryzen 3600X price on the German market since launch]

        Comment


        • #14
          Great change for those using Curve Optimizer on AMD. Linux already shows which cores have problems; this should help further.

          Comment


          • #15
            Originally posted by coder View Post
            [...]

            For best hardware reliability:
            1. Don't overclock. A few % more performance likely isn't worth any flakiness before you're ready to replace the HW.
            2. Use a quality UPS with line filtration & over/under protection. Ideally, sinewave output.
            3. Use a quality PSU (power supply).
            4. Use ECC memory (requires compatible motherboard + CPU), if possible.
              • If not, at least avoid budget RAM and run > 1 pass of memtest when you install it. I also buy RAM rated for a slightly faster speed than I plan to use.
            5. Try to have ample cooling, especially of motherboard components like VRMs.
            6. Either use a case with dust filters or clean it, periodically.
              • If dust filters, periodically clean them!
            7. Don't be an early adopter. Waiting until stuff has been released for 6-8 months can often net you lower prices and later revs/steppings of both boards and CPUs. Also, later firmware/microcode revisions. I've done well, on this point. Here are 2 big wins I scored:
            8. If buying Intel, Xeon-branded SKUs tend to be binned for greater reliability. Not sure about Ryzen Pro, but maybe?
            9. Update the firmware of your SSDs, when you first install them (i.e. before formatting & copying data onto them).


            I agree with this advice. I just want to add that, because I always use ECC memory in any computer larger than a NUC, I have caught a few cases where memory modules that had previously worked fine for many years started to produce frequent errors that ECC corrected. Warned by this, I was able to replace those modules.

            Without ECC, I likely would not have noticed any errors until some unrecoverable corruption had already happened in my files.
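
            If you want to be alerted to those corrected errors rather than discovering them by accident, Linux exposes per-memory-controller counters through the EDAC subsystem. A minimal sketch in C that reads one of them (it assumes an EDAC driver is loaded for your platform and that memory controller mc0 exists):

                #include <stdio.h>

                int main(void)
                {
                    /* ce_count = errors corrected by ECC; a steadily rising
                     * value is the early warning described above. */
                    FILE *f = fopen("/sys/devices/system/edac/mc/mc0/ce_count", "r");
                    unsigned long ce = 0;

                    if (!f) {
                        perror("EDAC counters not available");
                        return 1;
                    }
                    if (fscanf(f, "%lu", &ce) != 1)
                        ce = 0;
                    fclose(f);
                    printf("corrected ECC errors on mc0: %lu\n", ce);
                    return 0;
                }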

            Comment


            • #16
              The absolute best failure I had was a chipset failure (what historically would have been called the south bridge).
              Half the PCIe lanes corrupted specific bits during data transfer (the x16 GPU slot was fine).
              This meant that about a third of the USB ports didn't work, one of the two M.2 ports corrupted data (thanks, btrfs, for catching it), the onboard NIC didn't work, but a PCIe NIC worked depending on where it was connected...
              Weirdest debugging session ever, hunting down interconnect diagrams of the X570 chipset. It all started with corrupted DHCP response packets and no network.

              Comment


              • #17
                here is the thing...

                Situation A) the segfault is caused by core X being defective but triggered only later in any core (same or another, expected to be somewhat random), so the segfault report mentions a random core each time

                Situation B) the segfault is triggered immediately by the faulty core X, and the segfault report repeatedly mentions core X

                the article states that the report helps case B, while not in any way being relevant to case A
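
                 For concreteness, in case B the pattern would be easy to spot in dmesg. The 6.1 patch appends the core to the existing x86 segfault line, so the output looks roughly like this (the values here are made up, and the exact wording may differ):

                     a.out[1234]: segfault at 0 ip 000055d5f22a4129 sp 00007ffd60c53f28 error 4 in a.out[55d5f22a4000+1000] likely on CPU 2 (core 2, socket 0)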

                is there a case C where core Y always triggers the segfault after core X fails at its task? is it bad to at least have the data to investigate a repeatedly reported core even if this is possible?
                Last edited by marlock; 08 October 2022, 04:47 PM. Reason: fixed missing word and typo

                Comment


                • #18
                  Originally posted by marlock View Post
                  here is the thing...

                  Situation A) the segfault is caused by core X being defective but triggered only later in any core (same or another, expected to be somewhat random), so the segfault report mentions a random core each time

                  Situation B) the segfault is triggered immediately by the faulty core X, and the segfault report repeatedly mentions core X

                  the article states that the report helps case B, while not in any way being relevant to case A

                  is there a case C where core Y always triggers the segfault after core X fails at its task? is it bad to at least have the data to investigate a repeatedly reported core even if this is possible?
                  I forgot a case C: core X, due to its defect, fails to record memory as allocated yet still creates the pointer and passes it along within a multi-threaded process; core Y then attempts to use a pointer that points to unallocated memory, so it segfaults. If the process on Y has error handling for segfaults and is locked to Y, it is going to error repeatedly on that core.

                  A defective CPU core can be quite creative: doing math wrong, or losing information. By losing information I mean losing the page-table update that should tell the system a memory page is in use, which is what avoids a segfault. Losing that information is at some point going to cause a segfault, and in a multi-threaded application it is in many cases not on the core responsible.

                  Of course, there is also the simple case where it's not a defective core but defective code. So at least four different cases could cause the reporting, and there could be more.
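
                  A minimal sketch in C of that case C shape (hypothetical, with the defect itself left out; it just shows how a pointer produced on one core is consumed, and would fault, on the pinned core that uses it):

                      #define _GNU_SOURCE
                      #include <pthread.h>
                      #include <sched.h>
                      #include <stdio.h>
                      #include <stdlib.h>

                      static int *shared_ptr;          /* handed from producer to consumer */
                      static pthread_barrier_t handoff;

                      static void pin_to_cpu(int cpu)
                      {
                          cpu_set_t set;
                          CPU_ZERO(&set);
                          CPU_SET(cpu, &set);
                          pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
                      }

                      static void *producer(void *arg)
                      {
                          pin_to_cpu(0);               /* the hypothetically defective core */
                          shared_ptr = malloc(sizeof(*shared_ptr));
                          pthread_barrier_wait(&handoff);
                          return NULL;
                      }

                      static void *consumer(void *arg)
                      {
                          pin_to_cpu(1);               /* a healthy core, locked down like the
                                                        * error-handling process described above */
                          pthread_barrier_wait(&handoff);
                          *shared_ptr = 42;            /* if core 0 had mangled the pointer, the
                                                        * segfault would be reported here, on CPU 1 */
                          return NULL;
                      }

                      int main(void)
                      {
                          pthread_t p, c;
                          pthread_barrier_init(&handoff, NULL, 2);
                          pthread_create(&p, NULL, producer, NULL);
                          pthread_create(&c, NULL, consumer, NULL);
                          pthread_join(p, NULL);
                          pthread_join(c, NULL);
                          printf("no fault on healthy hardware\n");
                          return 0;
                      }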

                  Comment


                  • #19
                    Linux used to print a single page to /dev/console. If you had a monitor plugged into /dev/console, you could take a picture of that single page of the segfault. So we all knew the code to capture a segfault was there; no one knew just how much information it captured. The rest of the segfault wound up, unordered, in the bit bucket.

                    Glad to see the segfault will now have a place to live in perpetuity.

                    Comment


                    • #20
                      Originally posted by oiaohm View Post
                      I forgot C case were X failed for defect reason to allocate memory as in use yet creates the pointer and then passes it along in a multi threaded process then Y core attempt to use a pointer that happens to point to allocated memory so segfault. If process on Y has error handling for segfaults and is locked to Y its going to repeatily error on that core.
                      In your case, does core Y stay the same across reboots, or can the ultimately segfaulting process get reassigned to a new core from time to time?

                      In case of an actual defective core, it will always be the same core across reboots, different failing processes, etc.

                      Comment
