Linux 6.6 Will Avoid Unnecessary Kernel Panics On AMD Zen Systems


  • #11
    Originally posted by baka0815 View Post

    It's usually on idle when the machine locks up or just reboots while I'm on the desktop. I'm currently not doing much on the machine, mostly browsing the web. No GPU intensive work; at least no games. Just a yt-video here and there. I'm also not using GNOME or KDE but XFCE, which should be lightweight in terms of GPU load.
    How would I know if my PSU has enough power?
    Add up the maximum (peak) power consumption of each of your devices, the CPU and GPU in particular.
    Add 100W on top of that for the rest of the system.
    If this sum is greater than half of your PSU's power rating, the PSU won't be running at peak efficiency. If it is greater than 80% of your PSU's maximum power output, the PSU might be insufficient.
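The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function name `psu_headroom` and the wattages in the example are hypothetical, and the 50% / 80% thresholds and the 100W allowance are taken from the post, not from any PSU specification.

```python
def psu_headroom(component_peaks_w, psu_rating_w, rest_of_system_w=100):
    """Apply the 50% / 80% rule of thumb to a set of peak power figures.

    component_peaks_w: dict of device name -> peak draw in watts (e.g. CPU, GPU)
    psu_rating_w:      rated PSU output in watts
    rest_of_system_w:  flat allowance for everything else (100W in the post)
    """
    if psu_rating_w <= 0:
        raise ValueError("psu_rating_w must be positive")

    total_w = sum(component_peaks_w.values()) + rest_of_system_w
    load = total_w / psu_rating_w  # fraction of the PSU rating in use at peak

    if load > 0.8:
        verdict = "possibly insufficient"
    elif load > 0.5:
        verdict = "sufficient, but above the half-load mark"
    else:
        verdict = "ok"
    return total_w, load, verdict


# Hypothetical peak figures; look up the real ones for your own parts.
peaks = {"cpu": 90, "gpu": 200}
total_w, load, verdict = psu_headroom(peaks, psu_rating_w=550)
print(f"{total_w} W peak, {load:.0%} of PSU rating: {verdict}")
```

Peak figures vary a lot between models, so the result is only as good as the numbers fed in.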



    • #12
      So from now on we will have only necessary kernel panics? That's progress.

      In all fairness I have not seen an actual kernel panic message in years, kind of like I have not seen a BSOD on Windows in years, unless some piece of hardware was failing.

      Regarding the people reporting random lockups on various Linux distros, this has been my experience as well for years, but I have never been able to isolate it.

      I have noticed that it happens across distros, kernel versions, and hardware configurations, and that the likelihood of it happening increases the longer you have the computer running and during or after the transfer, or attempted transfer, of a large number of large files from one drive to another.

      What I have taken to doing is simply shutting down the system every day so that the RAM gets cleared out. I suspect there is a bug, or 9,500 of them, in the memory allocation of some library that all DEs use, or possibly in some open source driver, that causes a race condition, more specifically a data race. These types of bugs are very difficult to track down and fix because they are nondeterministic.

      Perhaps the biggest obstacle to solving these issues is that most Linux users refuse to acknowledge that they exist.



      • #13
        Originally posted by sophisticles View Post
        So from now on we will have only necessary kernel panics? That's progress.

        In all fairness I have not seen an actual kernel panic message in years, kind of like I have not seen a BSOD on Windows in years, unless some piece of hardware was failing.

        Regarding the people reporting random lockups on various Linux distros, this has been my experience as well for years, but I have never been able to isolate it.

        I have noticed that it happens across distros, kernel versions, and hardware configurations, and that the likelihood of it happening increases the longer you have the computer running and during or after the transfer, or attempted transfer, of a large number of large files from one drive to another.

        What I have taken to doing is simply shutting down the system every day so that the RAM gets cleared out. I suspect there is a bug, or 9,500 of them, in the memory allocation of some library that all DEs use, or possibly in some open source driver, that causes a race condition, more specifically a data race. These types of bugs are very difficult to track down and fix because they are nondeterministic.

        Perhaps the biggest obstacle to solving these issues is that most Linux users refuse to acknowledge that they exist.
        I only get the very rare lockup on Linux, and only when doing something with a 3D accelerator or game. I also get the rare lockup or crash on Windows when playing a game, but no BSOD, you're right; just a black screen and a hard power-off required. I also manage fleets of servers with a nearly nil crash rate, outside of failed hardware.

        It's been my experience that *consumer grade hardware* causes problems in general, as do un-integrated components (like building a system with memory that's not specifically listed as tested on the compatibility list: the specs might match, but you roll the dice), poor heat sink or thermal paste application, a badly seated component somewhere, or just a buggy BIOS that Windows has a workaround for because people with paid support got opaque fixes in behind the scenes.

        And again, it's just not true for servers, because of the amount of time servers spend in validation. In my experience, unlike consumer-level hardware, they don't have bizarre quirks, ACPI issues, broken specs, etc.



        • #14
          Ah... wouldn't that be more appropriately applied to things like memory error correction and suchlike? My understanding is that the freeze-ups and panics on AMD systems are just straight-up software bugs.



          • #15
            Originally posted by kaidenshi View Post

            I tried switching RAM and storage as well as the GPU with no results. I finally sold off that system as parts and switched back to Intel along with a Radeon 6600 and so far no issues under any distro.
            I also had random crashes, on two different systems. In both cases it was the chipset on the motherboard.
            Case 1: an old Core 2 system with the memory controller on the board, in the northbridge (remember those?). It turned out the northbridge chip was dying and would sometimes corrupt data going to or from RAM.
            Case 2: some kind of electrical failure partially killed my X570 chipset, causing random hangs. I figured it out when I discovered that a couple of PCIe lanes were simply gone and some others corrupted data (deterministically) on the way. Btrfs helped a lot, as it detected the corrupted reads. The funny thing was that there was a mode in which the data came through fine. Those were a few very frustrating weekends. After an RMA I got a new chipset, and all is good.



            • #16
              Originally posted by baka0815 View Post
              I have an AMD 5600 CPU and a RX580 GPU, which sometimes just locks up or just reboots. If it locks up, the screen (X11) is black and the PC doesn't respond to keyboard or mouse input, even REISUB does not work to reboot the machine.
              I haven't found anything interesting in the log-files in /var/log (dmesg, sysmessages, kern.log, ...).

              I ran memtest86+ for some time, which finished without problems, and I even swapped the RAM. I also changed my SSD to an NVMe one (but not because of this problem); neither change made any difference.

              Does anyone have any idea how to find the culprit?
              Interesting question. I don't know, but I've been frustrated by "random" lock-ups on Linux too, for a number of years and across OS install versions.
              One conjecture (with no specific reason to be sure of it) is that it MIGHT have to do with some things not handling suspend / resume well. I often run with the PC set to auto-sleep / suspend, progressively blanking / turning "off" the monitors, then suspending the system entirely. Although suspend / resume usually seems to "work", it's possible that something becomes unstable because of it and sometimes, eventually, causes a crash.

              Another thing I notice that's very annoying is that when a general-use desktop system has an uptime of one to several weeks, it tends to get slower, quirkier, laggier, and generally more prone to soft "glitches" and application crashes, even though the OS itself keeps running. Then at some point the system sometimes just locks up as you mentioned, and no console switching, Ctrl-Alt-Delete, etc. seems to work. For me the GUI usually just freezes while displaying a normal desktop: the clock doesn't visibly advance and the keyboard and mouse do nothing.

              One "culprit" I suspect for the lag is Firefox. Having NN windows / tabs open is plenty fast for the first days, but day by day it lags more, and when I look at top / htop / atop I see more and more memory being consumed, mostly by Firefox. On a system with, say, 64GB of RAM it'll show around 80% of RAM used, which to me is insane for a web browser. I don't care about the cache "speed-up"; it still gets laggy, and it is slower than if it cached no more than 6-12 GB and just reloaded things from the internet as needed. Anyway, the web browser may be one cause; it'd be interesting to see how the system behaves over a long uptime with no browser keeping NN things open in the background.

              I also sometimes get crashes, or at least glitches / bugs, that appear to relate to having two monitors attached to an NVIDIA GPU's output ports. Maybe something is unstable in the multi-monitor joined-desktop code. Sometimes it just crashes (Ubuntu + GNOME) when, for instance, the mouse goes to the upper-left "Activities" corner of the primary monitor, or when I press Alt-Tab or Alt to bring up a window / workspace list. Sometimes it just drops a monitor and doesn't work right, loses its VSYNC refresh settings, etc. So something is buggy in GNOME / NVIDIA drivers / X land, but I'm not sure what.

              Using the procfs "drop caches" knob (/proc/sys/vm/drop_caches) periodically (e.g. hourly) from a script / cron job seems to help de-lag / de-bloat the memory consumption I see building up, e.g. with Firefox, over NN days.
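A periodic cache drop like the one described can be sketched as a small Python helper, to be run as root from cron or a timer. The function name `drop_caches` and the `path` parameter (there only so the function can be tried against a scratch file) are illustrative assumptions; the `/proc/sys/vm/drop_caches` file and its values 1, 2 and 3 are the kernel's own interface.

```python
import os

DROP_CACHES_PATH = "/proc/sys/vm/drop_caches"


def drop_caches(mode=3, path=DROP_CACHES_PATH):
    """Ask the kernel to drop clean caches.

    mode 1 = page cache, 2 = dentries and inodes, 3 = both.
    Writing to the real /proc file requires root.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")

    # Flush dirty pages first: only clean cache entries can be dropped.
    os.sync()

    with open(path, "w") as f:
        f.write(f"{mode}\n")


# Example (as root): drop_caches(3)
```

This only discards cached file data that the kernel can reload from disk; it does not reclaim memory that an application such as a browser has allocated for itself.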

              There is a sysctl tune-able:
              vm.swappiness = 60
              which controls how readily the kernel swaps out resident pages instead of keeping them in RAM; lower values make it less inclined to swap. Lowering it has sometimes seemed to help, but the effect isn't clear.
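For checking and preparing a change to this tunable, a minimal Python sketch follows. The helper names `read_swappiness` and `sysctl_line` are hypothetical, and the 0-100 range check is a conservative assumption (newer kernels may accept higher values); `/proc/sys/vm/swappiness` is the kernel's live view of the setting.

```python
from pathlib import Path

SWAPPINESS_PATH = "/proc/sys/vm/swappiness"


def read_swappiness(path=SWAPPINESS_PATH):
    """Return the current vm.swappiness value as an integer."""
    return int(Path(path).read_text().strip())


def sysctl_line(value):
    """Format a line suitable for a sysctl configuration file."""
    # Conservative assumption: stay within the classic 0-100 range.
    if not 0 <= value <= 100:
        raise ValueError("swappiness is expected to be between 0 and 100")
    return f"vm.swappiness = {value}"


# Example: print(read_swappiness()) to see the live value, then put the
# output of sysctl_line(10) into a file under /etc/sysctl.d/ to persist it.
```

The value 10 in the example is arbitrary; which setting helps, if any, depends on the workload.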

              You could play with your syslog settings to log more diagnostics and to sync the output to disk more often, which might catch more useful diagnostic data.
              I think there are also settings for crash handling: dumping, logging, auto-reboot, etc.
              At the cost of performance, you could mount more of your file systems (e.g. /var, /, whatever) in "sync" mode, so that log messages etc. are committed more immediately and the file systems are less likely to be corrupted in a crash.



              • #17
                Originally posted by baka0815 View Post
                I have an AMD 5600 CPU and a RX580 GPU, which sometimes just locks up or just reboots. If it locks up, the screen (X11) is black and the PC doesn't respond to keyboard or mouse input, even REISUB does not work to reboot the machine.
                I haven't found anything interesting in the log-files in /var/log (dmesg, sysmessages, kern.log, ...).

                I ran memtest86+ for some time, which finished without problems, and I even swapped the RAM. I also changed my SSD to an NVMe one (but not because of this problem); neither change made any difference.

                Does anyone have any idea how to find the culprit?
                It is a kernel issue. I had the same problem many times with an AMD 5600 and an old NVIDIA GPU. Now I have an AMD RX 580 and it happens less often (about once every three months over the two years since I bought the PC). The PC is not frozen; the X server just fails to start. I work around it by pressing Ctrl+Alt+F3, logging in to my account on the terminal, and running startx, after which everything works again.

