"What the -" random restarts. I'm at a loss on how to debug

Hi,

I’m at a loss on how to properly debug random restarts of my system. I’ve been running NixOS for over a year and it had been humming along nicely until a few months ago, when I started getting random restarts. They are becoming very frequent (usually daily) and are interrupting my work significantly.

I’m pretty sure it’s a hardware issue, but honestly I have no idea which log or process to monitor, and consequently no idea how to reproduce it effectively (other than waiting).

journalctl shows nothing of note (I think), although one time it did contain some messy boot messages, with lines from two boots interleaved:

Nov 12 04:24:25 makati gnome-session-binary[3441]: DEBUG(+): GsmInhibitor: setting client-id =
Nov 12 04:24:25 makati gnome-session-binary[3441]: DEBUG(+): GsmStore: Adding object id /org/gnome/SessionManager/Inhibitor92 to store
Nov 12 04:24:25 makati gnome-session-binary[3441]: DEBUG(+): GsmManager: Inhibitor added: /org/gnome/SessionManager/Inhibitor92
Nov 12 04:24:25 makati bluetoothd[1529]: /org/bluez/hci0/dev_38_18_4C_E9_4E_F4/sep5/fd1: fd(30) ready
Nov 12 04:24:27 makati .gnome-shell-wr[3467]: Gio.DBusError: GDBus.Error:org.freedesktop.DBus.Error.Failed: error occurred in Get <...snip>
-- Boot dcce7f7358984bf28f378f072cb6bc70 --
Nov 12 04:28:30 makati kernel: Linux version 6.6.57 (nixbld@localhost) (gcc (GCC) 13.3.0, GNU ld (GNU Binutils) 2.43.1) #1-NixOS SMP PREEMPT_DYNAMIC Thu Oct 17 13:24:38 UTC 2024
-- Boot e4bef33ecc8e46ee8b1d5191cad53970 --
Nov 12 04:25:44 makati .gnome-shell-wr[3467]: Gio.DBusError: GDBus.Error:org.freedesktop.DBus.Error.Failed: error occurred in Get <...snip>
-- Boot dcce7f7358984bf28f378f072cb6bc70 --
Nov 12 04:28:30 makati kernel: Command line: initrd=\EFI\nixos\1rwiydw2hmafk77prxll963h94ph0l70-initrd-linux-6.6.57-initrd.efi init=/nix/store/cx4xsqgdbd0k2y0xi5nmcfzhvv1fy5wm-nixos-system-makati-20241030_14-51-17--v24.11.20241020.1997e4a/init loglevel=4
-- Boot e4bef33ecc8e46ee8b1d5191cad53970 --
Nov 12 04:26:28 makati .gnome-shell-wr[3467]: Gio.DBusError: GDBus.Error:org.freedesktop.DBus.Error.Failed: error occurred in Get <...snip>
-- Boot dcce7f7358984bf28f378f072cb6bc70 --
Nov 12 04:28:30 makati kernel: BIOS-provided physical RAM map:
-- Boot e4bef33ecc8e46ee8b1d5191cad53970 --
Nov 12 04:26:44 makati .gnome-shell-wr[3467]: Gio.DBusError: GDBus.Error:org.freedesktop.DBus.Error.Failed: error occurred in Get <...snip>
-- Boot dcce7f7358984bf28f378f072cb6bc70 --
Nov 12 04:28:30 makati kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
-- Boot e4bef33ecc8e46ee8b1d5191cad53970 --
Nov 12 04:27:07 makati .gnome-shell-wr[3467]: Gio.DBusError: GDBus.Error:org.freedesktop.DBus.Error.Failed: error occurred in Get <...snip>
-- Boot dcce7f7358984bf28f378f072cb6bc70 --
Nov 12 04:28:30 makati kernel: BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
Nov 12 04:28:30 makati kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000009afefff] usable
Nov 12 04:28:30 makati kernel: BIOS-e820: [mem 0x0000000009aff000-0x0000000009ffffff] reserved
Nov 12 04:28:30 makati kernel: BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
Nov 12 04:28:30 makati kernel: BIOS-e820: [mem 0x000000000a200000-0x000000000a20ffff] ACPI NVS
...

I took a shot in the dark and did a flake update and a rebuild, but it had no effect.
Memtest86 passed all tests.

I’m running:

OS: NixOS 24.11.20241109.76612b1 (Vicuna) x86_64
Host: ASRock B650M PG Riptide WiFi
Kernel: 6.6.60
Uptime: 43 mins
Packages: 1067 (nix-system), 1597 (nix-user), 51 (nix-default), 8 (flatpak)
Shell: fish 3.7.1
Resolution: 3840x2160
DE: GNOME 47.0 (Wayland)
WM: Mutter
WM Theme: Adwaita
Theme: Catppuccin-Macchiato [GTK2/3]
Icons: Adwaita [GTK2/3]
Terminal: tmux
CPU: AMD Ryzen 7 7700X (16) @ 5.573GHz
GPU: AMD ATI Raphael (an iGPU)
Memory: 13592MiB / 63415MiB

Any help would be greatly appreciated. It’s getting to the point where I’m considering a clean reinstall, which is a hassle!

I have enabled some config options to gather more logs, but I might not be using these effectively:

  boot.kernelParams = ["loglevel=7" "initcall_debug"];

  services.journald.rateLimitBurst = 50000;
  services.journald.rateLimitInterval = "1s";
  services.journald.extraConfig = ''
    Storage=persistent
  '';
  services.sysstat.enable = true;
  services.desktopManager.gnome.debug = true;

You can try to limit the scope of the issue. For example, does this happen on X11 or just on Wayland? Is this a GNOME/Mutter issue or does it happen on other DEs?
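
One way to test the X11-versus-Wayland question without giving up the current setup is a NixOS specialisation, which adds an extra boot entry next to the normal one. The following is only a sketch: the name `x11-test` is made up, and the GDM option path is an assumption that depends on the release (on 24.11 it should live under `services.xserver.displayManager.gdm`; newer releases have been moving these options), so check the option search for your channel first.

```nix
{ lib, ... }:
{
  # Hypothetical specialisation name; it appears as an extra boot menu entry.
  specialisation.x11-test.configuration = {
    # Assumed option path for 24.11: have GDM use X11 instead of Wayland.
    services.xserver.displayManager.gdm.wayland = lib.mkForce false;
  };
}
```

If the restarts stop when booted into that entry, that points toward the Wayland/Mutter side; if they continue, hardware or the kernel becomes more likely.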

For GNOME, there is also a Mutter patch that may reduce crashes, which you can try:

nixpkgs.overlays = [
  (final: prev: {
    mutter = prev.mutter.overrideAttrs (oldAttrs: {
      patches = (oldAttrs.patches or [ ]) ++ [
        # Avoid crashes by defaulting to high priority thread instead
        # of realtime for the KMS thread
        # https://www.phoronix.com/news/GNOME-High-Priority-KMS-Thread
        # https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/4124
        (final.fetchpatch2 {
          url = "https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/4124.patch";
          hash = "sha256-h1gjyZx23NQ3VDwcGRy6hLkfgLdukao7NzH+48C/NE4=";
        })
      ];
    });
  })
];

Thanks. My plan was to use the logs to narrow it down, but in lieu of useful logs I should start running tests. Good thing NixOS makes that easier :)

Applied that patch, thank you for the snippet. I’ll run this for a few days first to see if the issue reappears.

This is more general Linux advice than NixOS-specific, but old-fashioned rsyslog can be configured to send logs over the network to another machine, which is often helpful. The kernel itself also has netconsole for sending console output over the network. There’s also a mechanism that stores crash dumps in UEFI variables so you can access them after a reboot; see mjg59’s post “Using pstore to debug awkward kernel crashes” (disclaimer: I’ve never tried it!).
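
For the first two suggestions, a rough NixOS sketch might look like the following. Every address, port, interface name, and MAC address here is a placeholder to replace with your own values, and whether netconsole is built as a loadable module on your kernel is an assumption. Treat it as a starting point rather than a tested setup.

```nix
{ ... }:
{
  # Forward all syslog messages over UDP to a second machine.
  # 192.168.1.20 is a placeholder for the receiving machine's address.
  services.rsyslogd.enable = true;
  services.rsyslogd.extraConfig = ''
    *.* @192.168.1.20:514
  '';

  # Stream kernel console output over UDP with netconsole.
  # Parameter format: <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
  # The interface name and MAC address below are placeholders.
  boot.kernelModules = [ "netconsole" ];
  boot.extraModprobeConfig = ''
    options netconsole netconsole=6665@192.168.1.10/enp5s0,6666@192.168.1.20/aa:bb:cc:dd:ee:ff
  '';
}
```

The receiving machine needs something listening on those UDP ports (a syslog daemon for the first, any UDP listener for the second). Because netconsole needs the network interface to exist when the module loads, it may not capture messages from very early in boot.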