I’ve been experiencing sporadic kernel panics on my laptop for the last couple of months. I have no idea what could be causing them, or how to even start debugging this issue. It happens often enough to be annoying, but not often enough for me to be able to reproduce it consistently.
I think that it’s a kernel panic, because the CAPS LOCK key indicator starts flashing, and the system becomes completely unresponsive. I can’t switch to a VT, sysrq keys don’t do anything (I have verified that they are enabled and working under normal conditions) and even the power button doesn’t do anything unless I force a hardware reset by holding it for like 30 seconds. Unfortunately, I don’t get the Black Screen Of Death during the kernel panic (the laptop monitor and the external screen just freeze on the last frame of my desktop environment) and there is seemingly nothing suspicious in journalctl after a reboot (except that the logs just cut off at a random point, of course).
The first few times, it happened during a nixos-rebuild switch. However, this doesn’t seem to be a problem with any specific generation/configuration, because after one such crash I was able to do a nixos-rebuild boot + manual reboot into that same configuration with no problems. And just now it happened randomly during normal light use (watching YouTube).
How do you even debug something like this? After the last time it happened, I decided to set boot.kernel.sysctl."kernel.panic_on_oops" = 1 and boot.crashDump.enable = true in hopes that I might be able to catch the problem at an earlier stage or to display some kind of error instead of just freezing the system, but no luck so far.
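For reference, this is roughly what those two settings look like together in configuration.nix. It is only a minimal sketch of the options named above. As far as I understand it, boot.crashDump.enable loads a crash kernel that is booted on panic, so that the dump can be saved from /proc/vmcore.

```nix
{
  # Escalate any kernel oops to a full panic instead of letting the
  # kernel try to continue in a possibly corrupted state.
  boot.kernel.sysctl."kernel.panic_on_oops" = 1;

  # Reserve memory for a crash kernel and boot into it on panic,
  # so that the memory of the crashed kernel can be inspected afterwards.
  boot.crashDump.enable = true;
}
```

Note that this only helps if the machine actually reaches the panic path. A hard hang in a driver may never get that far, which would be consistent with seeing no output at all.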
I’ve seen this post about debugging kernel panics, but I wasn’t able to find any NixOS options or packages searching for kdump and I don’t know nearly enough about this topic to be able to adapt this guide from Debian to NixOS.
I think (?) that the new drm_panic kernel feature is supposed to help with such issues (it’s supposed to show kernel panic messages even without the fb console), but I have no idea which kernel version it is available in or how to enable it.
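As far as I can tell, drm_panic was merged in Linux 6.10 behind the CONFIG_DRM_PANIC option. It only works with DRM drivers that implement panic support, so it may well do nothing with the proprietary nvidia driver. Below is an untested sketch of how one might try it on NixOS. It assumes that the selected kernel is new enough and that the option's Kconfig dependencies are satisfied (otherwise the kernel configuration step will fail). The patch name is arbitrary.

```nix
{ pkgs, ... }:
{
  # Assumption: the latest packaged kernel is at least 6.10.
  boot.kernelPackages = pkgs.linuxPackages_latest;

  # A "patch" with no diff, used only to add a kernel config option.
  boot.kernelPatches = [
    {
      name = "enable-drm-panic"; # arbitrary label
      patch = null;
      extraConfig = ''
        DRM_PANIC y
      '';
    }
  ];
}
```

Changing the kernel configuration like this means the kernel is built locally instead of being fetched from the binary cache, so the first rebuild will take a while.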
Thanks, I am indeed using the proprietary nvidia drivers, so I might try downgrading them.
However, I’d still prefer to have a way to see the kernel panic message or do a coredump of the kernel for debugging. Changing random things about my system until the crashes stop is a suboptimal solution, given that I can’t even verify that the problem is caused by the nvidia drivers.
The solution isn’t as random when the issues that you’ve described (random freezes, caps lock flashing, …) are too close to the current problem to be a coincidence. Knowing that you use the proprietary drivers only confirms that.
As to how you can debug things further, I honestly have no idea. I’ve been running the 550.78 drivers with hardware.nvidia.powerManagement.finegrained = true; under X11 and haven’t had any problems since, so I stopped messing with it. Some users reported that rolling back the drivers works, so you’re free to try that as well.
The nvidia-bug-report.sh script, which would probably help you collect more information, is also missing from NixOS, which can make debugging harder.
That being said, if you’d still like to help, you should probably follow this thread on the Nvidia forum:
“The solution isn’t as random when the issues that you’ve described (random freezes, caps lock flashing, …) are too close to the current problem to be a coincidence.”
I’ve been having this issue for a long time and just happened to decide to post it after the latest crash. If I am reading the nixpkgs git history correctly, linuxPackages.nvidiaPackages.latest got updated to the 550 driver line around March 3rd this year, and I am ~90% sure that my first kernel panic predates that (I think it was in early February).
Also, a “random freeze” and a flashing caps lock key are just the symptoms of a kernel panic (AFAIK). Additionally, my experience doesn’t perfectly match what other users are describing. Supposedly, these people are getting kernel panics quite frequently/consistently and primarily under some load (installing packages). In my case, the crashes are relatively rare (it’s been happening for months now, and I’ve only now been annoyed enough to make this thread) and have happened even during relatively idle operation.
Finally, even if this particular kernel panic was caused by the nvidia drivers, the fact that the kernel panic message isn’t displayed for some reason would still be an issue that I’d like to fix. I’m not necessarily blaming NixOS for this, but I know for a fact that you can get kernel panics to display properly (you can see the BlackSOD photos provided by people in the nvidia forums and I think I vaguely remember it working on my laptop back when I was daily driving Arch).
The problem stopped happening for me after I pinned the nvidia drivers to version 535.154.05, so it was indeed the same bug in the latest proprietary nvidia drivers.
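For anyone wanting to try the same thing, pinning a specific driver version can be sketched roughly like this. This assumes a nixpkgs revision where nvidiaPackages.mkDriver is available. The hashes are deliberately left as lib.fakeHash placeholders, because I don't want to post values that might be wrong: the build will fail with a hash mismatch and print the expected hash, which you then paste in, one at a time.

```nix
{ config, lib, ... }:
{
  services.xserver.videoDrivers = [ "nvidia" ];

  # Build the 535.154.05 driver against the kernel that is actually in use.
  hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.mkDriver {
    version = "535.154.05";
    # Placeholders, NOT real hashes: replace each one with the value
    # reported in the hash-mismatch error of the failed build.
    sha256_64bit = lib.fakeHash;
    sha256_aarch64 = lib.fakeHash;
    openSha256 = lib.fakeHash;
    settingsSha256 = lib.fakeHash;
    persistencedSha256 = lib.fakeHash;
  };
}
```

An older driver may not build against a very recent kernel, so pinning the driver can also mean staying on an older kernel for a while.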
Another option is to switch to the open source drivers (hardware.nvidia.open = true;). Keep in mind that I haven’t personally tested the open version of the driver, and there might be some performance/feature-parity issues with the open source version compared to the proprietary version.
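A minimal sketch of that second option, assuming the nvidia driver is otherwise already set up and the GPU is recent enough (the open kernel modules only support Turing and newer cards):

```nix
{
  services.xserver.videoDrivers = [ "nvidia" ];

  # Use NVIDIA's open kernel modules instead of the proprietary ones.
  # The userspace part of the driver stays proprietary either way.
  hardware.nvidia.open = true;
}
```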
It’s worth noting that this isn’t fully open source and will probably not solve the issue according to the Nvidia forum. The fully open source driver is NVK and using it will probably solve this, but you won’t be able to use CUDA anymore.
Note: Under 555.58.02, it has been happening to me less often lately, but it still occurs.
I am almost certain that nvidia-open in the forums and hardware.nvidia.open = true in NixOS refer to the same thing (the open source kernel space driver, not the userspace drivers/NVK). It makes sense that switching to the open source version of the kernel modules would fix the issue, because kernel panics, well… happen in the kernel, not in the userspace (although they certainly can be caused by something in the userspace).
While I haven’t personally verified this, there are people in the above-mentioned nvidia forum thread who report running nvidia-open driver versions 550 and 555 for weeks without any freezes. There are some feature parity/performance issues in nvidia-open, but they are relatively minor compared to NVK (in my experience), and you will eventually be forced to switch to nvidia-open anyway, since nvidia is no longer planning to develop the proprietary version of the kernel space drivers.
I’m aware. I just wanted to make the distinction that they’re only open kernel modules because calling it open source drivers would give the impression that it’s all open source, while it’s not.
It’s definitely better and I rarely have it now compared to before, but it still happens, though that might be related to something else.
If it’s more stable, I don’t really care about a slight drop in performance. NVK was good when I tried it, but the only thing keeping me from switching is CUDA.
Indeed, so let’s hope that things turn out for the better going forward. Otherwise, my next GPU definitely won’t be an Nvidia.
Using nvidia-open changes nothing for me. I’m still facing a kernel panic after just about every boot:
instantly, after gdm login, or after 5 min. Same thing over and over.
I couldn’t find anything anywhere either. Really frustrated.
Sometimes I feel like switching back to Arch, fk nvidia, and coding in peace, but I don’t wanna leave after having spent months on my nixos-config.
Really in a tough spot here.
This bug is not NixOS-specific. You would almost certainly have exactly the same issues on Arch.
As I mentioned previously in the thread, pinning the drivers to version 535.154.05 should fix the issue (assuming your problem is indeed the same as everybody else’s).
I tried, but it gave some vague error I couldn’t fix, so I dropped it for a while.
Today, however, I couldn’t get anything done because the system is acting up too much.