NixOS is randomly restarting and freezing

I have been having an issue where my laptop (NixOS 23.11) has been randomly restarting and freezing. I have not been able to find the cause of the restarts through the systemd logs. The freezing seems to lock the system entirely. This has been difficult to diagnose since it appears to be independent of my config and DE, and it is seemingly random. It has not happened while gaming and mainly happens when using my web browser or neovim (I don’t know why or how these are correlated; it is just something I noticed). Here is a list of everything I have tried.

Original system:
hyprland
running Nvidia with the open kernel module in prime sync mode

Memtester: 200 cycles with no issue
hw-probe: no critical failures. (I lost the link to the results)
checked dmesg to see if I had the error described in the ArchWiki Ryzen article (section 4.1).

reinstalled NixOS using the gnome NixOS 23.11 image.
same problems occur with me making no changes (aside from my home directory being intact and a couple nix-shells).
unable to run benchmarks due to apparently broken shared libraries (idk wth is going on here).

Current system:
gnome
nouveau driver

I’ve been working on this for a while and am running out of ideas so any help is appreciated.

Hardware: ASUS ROG G15 (2021)
AMD Ryzen 9 5900hs
16 GB of ram
RTX 3070 mobile
1 TB ssd storage


You’re going to have to show those kernel logs to us.

Does the same happen without Nvidia’s driver (using Nouveau)?

Did it start happening after some particular action, like updating from 23.05 to 23.11?
Can you find an older generation that does not have the problem?

it will take me a sec to see if I can trigger the bug again. I will post kernel logs once it happens

It has happened on both Nvidia and Nouveau.

something interesting is happening. I managed to reproduce the restart issue and was about to post the logs, then my system rebooted again and is stuck in systemd’s emergency mode.

looking through journalctl there appear to be a few things of note:
(in orange) Speculative Return Stack Overflow: IBPB-extending microcode not applied! (and another similar error referring to the kernel documentation page “Speculative Return Stack Overflow (SRSO)”)

and a set of ACPI bios errors

and a set of BTRFS errors which appear to have prevented my home directory from mounting.

trying to think of a way to get the logs to you.

Be aware of journalctl’s -k and -b flags.
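In case it helps with getting the logs off the machine, here is a minimal sketch of how those two flags combine: -k limits the output to kernel messages and -b selects a boot (0 is the current boot, -1 the previous one). The helper names `journalctl_args` and `save_log` are made up for illustration, and it assumes the journal from the previous boot was kept on disk.

```python
import subprocess


def journalctl_args(boot=-1, kernel_only=True, priority=None):
    """Build a journalctl command line for a single boot.

    boot: 0 is the current boot, -1 the one before it (the -b flag).
    kernel_only: add -k so only kernel messages are shown.
    priority: optional -p filter such as "warning" or "err".
    """
    args = ["journalctl", "--no-pager", "-b", str(boot)]
    if kernel_only:
        args.append("-k")
    if priority is not None:
        args += ["-p", priority]
    return args


def save_log(path, **kwargs):
    """Run journalctl and write its output to a file that can be uploaded."""
    result = subprocess.run(
        journalctl_args(**kwargs), capture_output=True, text=True, check=True
    )
    with open(path, "w") as fh:
        fh.write(result.stdout)
```

Something like `save_log("prev-boot-kernel.log")` after the next crash should then give a file covering only the kernel side of the boot that died.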

Is your firmware/BIOS up to date?

Did you enable the µCode option for your CPU vendor?
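For context, µCode here means CPU microcode updates; on NixOS the relevant option for an AMD CPU is, as far as I know, `hardware.cpu.amd.updateMicrocode`. To see which revision is actually loaded before and after changing it, a rough sketch like the following reads the `microcode` field from `/proc/cpuinfo`. The function name is hypothetical, and it assumes the kernel exposes that field on your machine.

```python
import re


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo text.

    Each logical CPU reports its own "microcode" line; on a healthy
    system they normally all show the same revision.
    """
    return set(
        re.findall(r"^microcode\s*:\s*(\S+)", cpuinfo_text, flags=re.MULTILINE)
    )
```

Calling `microcode_revisions(open("/proc/cpuinfo").read())` before and after a rebuild and reboot should show whether the revision changed.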

Um, that’s never good. I hope you’ve got a backup.

A day of troubleshooting and a massive detour into fixing btrfs (and making a backup) later, I now have a log for you: Dec 22 12:15:47 nixos logs before crashing - Pastebin.com

The BIOS is most likely not up to date (that might be my next approach; I figured it was unlikely to be the cause, but you never know).

EDIT: updated the BIOS. No effect.

I have no idea what µCode is.

(Your link is broken.)

._.
I should just stick to github

output of journalctl -b -1 -x

This sample was after the computer restarted itself. I can get another sample if needed.

EDIT: in case it is useful, here is another log where the system froze (this log was made after I updated my BIOS, in case that is significant)

In the second log, there is an error related to a disk of yours:

Error mounting /dev/sda2 at /run/media/dragonblade316/lfs: wrong fs type, bad option, bad superblock on /dev/sda2, missing codepage or helper program, or other error

Additionally, there are a ton of errors pertaining to thumbnails? I’d investigate that. Perhaps some corrupted on-disk state/cache. Try to reproduce this with an empty home directory (i.e. log in as root, mv your home dir to a temporary name, log in).

This smells a bit like disk corruption?

Also, what exactly is this “freeze”? Can you still switch TTYs?
Could you repro the issue and then press the magic sysrq + s a couple times before rebooting and show the log again?
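One caveat: SysRq + s (sync) only does anything if the kernel allows it. A minimal sketch for checking that, assuming the usual meaning of `/proc/sys/kernel/sysrq` (0 disables everything, 1 enables everything, any other value is a bitmask in which 16 permits sync); the function name is made up:

```python
SYSRQ_SYNC = 0x10  # bitmask bit that permits the sync command (SysRq + s)


def sync_allowed(value):
    """Tell whether a kernel.sysrq value permits SysRq + s.

    0 disables all SysRq functions, 1 enables all of them, and any
    other value is treated as a bitmask of permitted functions.
    """
    return value == 1 or bool(value & SYSRQ_SYNC)
```

`sync_allowed(int(open("/proc/sys/kernel/sysrq").read()))` should be true before you try to reproduce the freeze; otherwise the key combination is silently ignored.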

The lfs drive is an ntfs partition on an external hard drive (a remnant of me switching from windows) that has been broken since long before this started happening (I’ve been too lazy to deal with it).

Currently I am on an empty home directory (decided to wipe the computer and redo the partitions just in case), though the problem still occurs.

The freeze is comprehensive; I am unable to switch to a tty. Even rebooting takes longer (it’s not just a tap on the power button).

I have also found a way to more or less reproduce it. It is (most of the time) fine with my terminal* but does a coin flip between freezing and restarting after around 2 mins of playing a video on yt in firefox (though it seemed to happen with brave as well).

here is the new log you needed.
output of journalctl -b -1 -x after hitting sysrq + s
logs/freezelog2.txt

That’s curious. This points towards the graphics or audio stack. Unload the relevant kernel modules one by one and attempt to repro the crash.

If the cause is the GPU (quite likely IMHO), could you try disabling the nvidia card entirely? Some laptops have that option AFAIK.
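To get a list of candidates for the one-by-one unloading, here is a rough sketch that parses `/proc/modules` and keeps the entries that look like graphics or audio drivers. The prefix list is my guess for this hardware, not something taken from the logs, and the function names are hypothetical.

```python
# Hypothetical name prefixes for GPU and audio drivers; adjust for your hardware.
SUSPECT_PREFIXES = ("nouveau", "nvidia", "amdgpu", "snd_")


def parse_modules(text):
    """Parse /proc/modules content into (name, refcount, used_by) tuples.

    Fields per line: name, size, reference count, dependent modules
    (comma separated, or "-" when there are none), state, address.
    """
    modules = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        # The refcount is "-" on kernels built without module unloading.
        refcount = int(fields[2]) if fields[2].isdigit() else None
        users = [] if fields[3] == "-" else [u for u in fields[3].split(",") if u]
        modules.append((fields[0], refcount, users))
    return modules


def suspects(modules, prefixes=SUSPECT_PREFIXES):
    """Keep only the modules whose name starts with one of the prefixes."""
    return [m for m in modules if m[0].startswith(prefixes)]
```

Feeding it `open("/proc/modules").read()` gives the loaded suspects; a module with a non-empty used_by list has to wait until the modules listed there are gone, e.g. via `modprobe -r`.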

Guess I’ma need to figure out how to do that.

I might try downgrading the kernel and see if the issue disappears. I’ve upgraded my system a couple of times in the last month or two and I’m wondering if this is being caused by a kernel update.

I will let you know what happens

Wayland issues: Dec 22 12:19:38 nixos .gnome-shell-wr[1426]: Xwayland terminated, exiting since it was mandatory

Multimedia issues, bluetooth, and a lot related to wayland, keybindings and this one:

Dec 22 12:19:38 nixos .gnome-shell-wr[2129]: Unable to mount volume lfs: Gio.IOErrorEnum: Error mounting /dev/sda2 at /run/media/dragonblade316/lfs: wrong fs type, bad option, bad superblock on /dev/sda2, missing codepage or helper program, or other error

What fs is that? NTFS?

Dec 22 12:19:38 nixos .gsd-media-keys[2389]: Failed to grab accelerator for keybinding settings:hibernate
Dec 22 12:19:38 nixos .gsd-media-keys[2389]: Failed to grab accelerator for keybinding settings:playback-repeat

Problematic keybindings?

Dec 22 12:19:38 nixos pipewire[2633]: mod.jackdbus-detect: Failed to receive jackdbus reply: org.freedesktop.DBus.Error.ServiceUnknown: The name org.jackaudio.service was not provided by any .service files

Audio setup issues?

Dec 22 12:19:38 nixos gnome-shell[2661]: nvc0_screen_create:999 - Base screen init failed: -19
Dec 22 12:19:38 nixos gnome-shell[2661]: libEGL warning: egl: failed to create dri2 screen
Dec 22 12:19:38 nixos gnome-shell[2661]: nvc0_screen_create:999 - Base screen init failed: -19

Nvidia issues! Perhaps something from my repo MAY help with nvidia related issues: https://github.com/tolgaerok/nixos-kde/tree/main/core/gpu/nvidia

Look deeper by executing : journalctl

But all in all, it’s a mess

Keep us updated!


Some of this might be useful if I take more time to troubleshoot. That Xwayland one especially.

However, I am now beginning to think this is a hardware issue. I attempted to install Windows on the laptop, and it is not even making it past the file preparation stage before freezing. It is not exhibiting the random reboots that Linux was, but it is still not looking good.

Thoughts?

If another OS exhibits the same issue, it’s likely a hardware issue. I’d run a memcheck for a few hours now.

I have the same issue. At random times my AMD Framework running NixOS 23.11 freezes, and the only thing that helps is a forced reboot. Just watching the thread.

Eventually, after the system was broken for a while, installing Windows on it magically (I have no other way to describe this) worked and the problems disappeared. In its current life as a Nix server, though, it appears to be exhibiting the behavior again, just with less frequency. The ironic thing is that the Framework you are having issues with is the one I bought to replace the freezing laptop, so that is fun.

I am becoming worried it is an issue with mobile AMD CPUs in some part of Linux. GPU and RAM were the prior culprits, but neither makes sense at this point: running memory tests revealed no issues, and while the driver is enabled, the GPU is not being used by any process on my machine (and Nvidia GPUs are not an option with Framework).

IDK, if it is an issue with the kernel (which is my current suspicion) I would not know where to begin with debugging other than trying to hook up some form of debugger. Might try other kernels and see if they fix the issue.

If it was running Windows for a bit, it may have updated some firmware in the background, and that could very well cause issues like this.

Welp, after a bit of time on Windows which somehow fixed the issue and after running this thing as a nix server for a while, the issue is back. Time to get back to the debugging cycle.