"What the -" random restarts. I'm at a loss on how to debug

Hi,

I’m at a loss on how to properly debug random restarts of my system. I’ve been running NixOS for over a year and it was humming along nicely until a few months ago, when I started getting random restarts. They are becoming very frequent (usually daily) and are interrupting work significantly.

I’m pretty sure it’s a hardware issue, but honestly I’ve no idea which log or process to monitor, and so I have no idea how to reproduce it reliably (other than waiting).

Journalctl shows nothing of note (I think) although one time it did have some messy boot messages:

Nov 12 04:24:25 makati gnome-session-binary[3441]: DEBUG(+): GsmInhibitor: setting client-id =
Nov 12 04:24:25 makati gnome-session-binary[3441]: DEBUG(+): GsmStore: Adding object id /org/gnome/SessionManager/Inhibitor92 to store
Nov 12 04:24:25 makati gnome-session-binary[3441]: DEBUG(+): GsmManager: Inhibitor added: /org/gnome/SessionManager/Inhibitor92
Nov 12 04:24:25 makati bluetoothd[1529]: /org/bluez/hci0/dev_38_18_4C_E9_4E_F4/sep5/fd1: fd(30) ready
Nov 12 04:24:27 makati .gnome-shell-wr[3467]: Gio.DBusError: GDBus.Error:org.freedesktop.DBus.Error.Failed: error occurred in Get <...snip>
-- Boot dcce7f7358984bf28f378f072cb6bc70 --
Nov 12 04:28:30 makati kernel: Linux version 6.6.57 (nixbld@localhost) (gcc (GCC) 13.3.0, GNU ld (GNU Binutils) 2.43.1) #1-NixOS SMP PREEMPT_DYNAMIC Thu Oct 17 13:24:38 UTC 2024
-- Boot e4bef33ecc8e46ee8b1d5191cad53970 --
Nov 12 04:25:44 makati .gnome-shell-wr[3467]: Gio.DBusError: GDBus.Error:org.freedesktop.DBus.Error.Failed: error occurred in Get <...snip>
-- Boot dcce7f7358984bf28f378f072cb6bc70 --
Nov 12 04:28:30 makati kernel: Command line: initrd=\EFI\nixos\1rwiydw2hmafk77prxll963h94ph0l70-initrd-linux-6.6.57-initrd.efi init=/nix/store/cx4xsqgdbd0k2y0xi5nmcfzhvv1fy5wm-nixos-system-makati-20241030_14-51-17--v24.11.20241020.1997e4a/init loglevel=4
-- Boot e4bef33ecc8e46ee8b1d5191cad53970 --
Nov 12 04:26:28 makati .gnome-shell-wr[3467]: Gio.DBusError: GDBus.Error:org.freedesktop.DBus.Error.Failed: error occurred in Get <...snip>
-- Boot dcce7f7358984bf28f378f072cb6bc70 --
Nov 12 04:28:30 makati kernel: BIOS-provided physical RAM map:
-- Boot e4bef33ecc8e46ee8b1d5191cad53970 --
Nov 12 04:26:44 makati .gnome-shell-wr[3467]: Gio.DBusError: GDBus.Error:org.freedesktop.DBus.Error.Failed: error occurred in Get <...snip>
-- Boot dcce7f7358984bf28f378f072cb6bc70 --
Nov 12 04:28:30 makati kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
-- Boot e4bef33ecc8e46ee8b1d5191cad53970 --
Nov 12 04:27:07 makati .gnome-shell-wr[3467]: Gio.DBusError: GDBus.Error:org.freedesktop.DBus.Error.Failed: error occurred in Get <...snip>
-- Boot dcce7f7358984bf28f378f072cb6bc70 --
Nov 12 04:28:30 makati kernel: BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
Nov 12 04:28:30 makati kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000009afefff] usable
Nov 12 04:28:30 makati kernel: BIOS-e820: [mem 0x0000000009aff000-0x0000000009ffffff] reserved
Nov 12 04:28:30 makati kernel: BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
Nov 12 04:28:30 makati kernel: BIOS-e820: [mem 0x000000000a200000-0x000000000a20ffff] ACPI NVS
...

I took a shot in the dark and did a flake update and a rebuild, but no effect.
Memtest86 passed all tests.

I’m running:

OS: NixOS 24.11.20241109.76612b1 (Vicuna) x86_64
Host: ASRock B650M PG Riptide WiFi
Kernel: 6.6.60
Uptime: 43 mins
Packages: 1067 (nix-system), 1597 (nix-user), 51 (nix-default), 8 (flatpak)
Shell: fish 3.7.1
Resolution: 3840x2160
DE: GNOME 47.0 (Wayland)
WM: Mutter
WM Theme: Adwaita
Theme: Catppuccin-Macchiato [GTK2/3]
Icons: Adwaita [GTK2/3]
Terminal: tmux
CPU: AMD Ryzen 7 7700X (16) @ 5.573GHz
GPU: AMD ATI Raphael (an iGPU)
Memory: 13592MiB / 63415MiB

Any help would be greatly appreciated. It’s getting to the point where I’m considering a clean reinstall, which is a hassle!

I have enabled some config options to gather more logs, but I might not be using these effectively:

  boot.kernelParams = ["loglevel=7" "initcall_debug"];

  services.journald.rateLimitBurst = 50000;
  services.journald.rateLimitInterval = "1s";
  services.journald.extraConfig = ''
    Storage=persistent
  '';
  services.sysstat.enable = true;
  services.desktopManager.gnome.debug = true;

You can try and limit the scope of the issue. For example, does this happen on X11 or just on Wayland? Is this a Gnome/mutter issue or does it happen on other DEs?

For Gnome, there is a mutter patch for potentially reducing crashes which you can also try:

nixpkgs.overlays = [
  (final: prev: {
    mutter = prev.mutter.overrideAttrs (oldAttrs: {
      patches = (oldAttrs.patches or [ ]) ++ [
        # Avoid crashes by defaulting to high priority thread instead
        # of realtime for the KMS thread
        # https://www.phoronix.com/news/GNOME-High-Priority-KMS-Thread
        # https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/4124
        (pkgs.fetchpatch2 {
          url = "https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/4124.patch";
          hash = "sha256-h1gjyZx23NQ3VDwcGRy6hLkfgLdukao7NzH+48C/NE4=";
        })
      ];
    });
  })
];

Thanks, my plan was to use the logs to narrow it down, but in lieu of those I should start doing tests. Good thing NixOS makes that easier :)

Applied that patch, thank you for the snippet. I’ll run this for a few days first to see if the issue reappears.


This is more general Linux advice than NixOS-specific, but old-fashioned rsyslog can be configured to send logs over the network to another machine, which is often helpful. The kernel itself also has netconsole for sending console output. There’s also this weird thing that stores crash dumps in UEFI variables so you can access them after a reboot: mjg59 | Using pstore to debug awkward kernel crashes (disclaimer: I’ve never tried it!)
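
For netconsole, a minimal NixOS sketch might look like the following. It assumes netconsole is built as a kernel module (so it is configured through modprobe options rather than the kernel command line), and every address, port, and the interface name enp5s0 is a hypothetical placeholder, not something from this thread. It may also fail to attach if the network driver is not loaded yet when the module loads, so treat it as a starting point only.

```nix
{
  # Load the netconsole module at boot.
  boot.kernelModules = [ "netconsole" ];

  # Module parameter format (from the kernel's netconsole documentation):
  #   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
  # Hypothetical values: this machine is 192.168.1.10 on enp5s0 and the
  # log receiver is 192.168.1.20, listening on UDP port 6666.
  # Leaving the target MAC empty falls back to the broadcast address.
  boot.extraModprobeConfig = ''
    options netconsole netconsole=6665@192.168.1.10/enp5s0,6666@192.168.1.20/
  '';
}
```

On the receiving machine, any UDP listener on the chosen port should capture the kernel messages, for example a netcat listener in UDP mode.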

TL;DR: The AMD 7700X CPU had issues on 6.4/6.6 kernels. Running a later kernel brought stability improvements.


Just a follow up to this issue a few months later:

  • the issue persisted with
    – adding Gnome patch
    – switching to Gnome X11
    – switching to KDE Plasma

Due to its disruptive effect on work, I switched to Fedora Silverblue (41, I think?) and the problem disappeared!

I was problem-free for a couple of months, but I just couldn’t gel with Silverblue in the same way as NixOS, so I switched back and boom, the problem appeared again within hours.

I started going down the rabbit hole again and found that my AMD 7700X CPU had some stability issues with 6.4/6.6 kernels and that newer versions were much better.

I did a one-line change in my config to move to kernel 6.12.X and it’s been stable for 15 days already.
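
For anyone landing here later, a sketch of what such a one-line change could look like is below. The exact line used above isn't quoted, so this is an assumption: it relies on the nixpkgs revision in use providing the `linuxPackages_6_12` attribute.

```nix
{ pkgs, ... }:
{
  # Assumption: pkgs.linuxPackages_6_12 exists in your nixpkgs revision.
  # This pins the system to the 6.12 kernel series instead of the default.
  boot.kernelPackages = pkgs.linuxPackages_6_12;
}
```

`pkgs.linuxPackages_latest` is an alternative if you would rather track the newest kernel packaged in nixpkgs than pin a specific series.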

Thanks for the help here as the suggestions were good starting points to debug!
