My GPU has been crashing for a few weeks now, but only while the system is idle: sometimes after a few minutes, sometimes after a few hours. The last NixOS generation I still have available is the only one that works without triggering this behaviour. I’m trying to find out what differs between that generation and the next one I built, which is where the crashes started.
Exact error:
Nov 11 20:58:46 othala kernel: NVRM: GPU at PCI:0000:01:00: GPU-501a87fd-8896-8b94-dbc1-5e71f1abeb04
Nov 11 20:58:46 othala kernel: NVRM: Xid (PCI:0000:01:00): 79, pid=2243, name=.kitty-wrapped, GPU has fallen off the bus.
Nov 11 20:58:46 othala kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Nov 11 20:58:46 othala kernel: NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
I’ve found that playing a video full screen (e.g. in mpv or on YouTube) keeps the GPU from going idle, which is enough to prevent the crash on the broken generations.
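My only working generation is system-209; system-210 and every generation after it fail. Both were built from the same nixpkgs revision (25.05.20251001.5b5be50), but their store paths differ: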
lrwxrwxrwx 1 root root 86 Oct 6 18:22 system-209-link -> /nix/store/rss7v043gsap3n77ydv10492ga61g9hv-nixos-system-othala-25.05.20251001.5b5be50
lrwxrwxrwx 1 root root 86 Oct 12 12:26 system-210-link -> /nix/store/vjgx78mys4c3a05fn4dvsl982ar6lz88-nixos-system-othala-25.05.20251001.5b5be50
The following command prints nothing, so there don’t seem to be any package changes:
❯ nix store diff-closures /nix/var/nix/profiles/system-209-link /nix/var/nix/profiles/system-210-link
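As far as I understand, `nix store diff-closures` only lists packages whose version changed or whose size changed noticeably, so a NixOS option that only rewrites a small file (a modprobe option, a udev rule, a systemd unit) can slip through unnoticed. The hypothetical `compare_generations` helper below is a minimal sketch, assuming bash and GNU diffutils, that diffs the parts of the two system store paths where such a change would show up: the list of top-level entries, the `kernel-params` file and the generated `etc` tree.

```shell
#!/usr/bin/env bash
# Sketch: diff two NixOS system generations beyond package versions.
# Assumes bash and GNU diffutils. compare_generations is a hypothetical helper.

compare_generations() {
  local old new
  # Resolve system-NNN-link profile links to their /nix/store paths.
  old=$(readlink -f "$1")
  new=$(readlink -f "$2")

  echo "### top-level entries"
  diff <(ls -1 "$old") <(ls -1 "$new")

  echo "### kernel-params"
  diff "$old/kernel-params" "$new/kernel-params"

  # GNU diff -r follows the symlinks inside the generated etc tree,
  # so it compares the contents of the files they point to.
  echo "### etc"
  diff -r "$old/etc" "$new/etc"

  # diff exits with 1 whenever it finds differences, which is expected here.
  return 0
}
```

Usage: `source compare-generations.sh`, then `compare_generations /nix/var/nix/profiles/system-209-link /nix/var/nix/profiles/system-210-link | less`. If the derivations are still in the store, running `nix-diff` on the two `.drv` files (found with `nix-store --query --deriver`) should give a more structured view of the same differences.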
On both generations I’ve checked the kernel boot parameters, the NVIDIA driver version and the Linux kernel version:
For 209
❯ cat /proc/cmdline
initrd=\EFI\nixos\iqn07p17b1cbrz9c7y4s3qz7h2g7ig0y-initrd-linux-6.12.49-initrd.efi init=/nix/store/rss7v043gsap3n77ydv10492ga61g9hv-nixos-system-othala-25.05.20251001.5b5be50/init loglevel=4 lsm=landlock,yama,bpf nvidia-drm.modeset=1 nvidia-drm.fbdev=1
❯ nvidia-smi
Wed Nov 12 08:13:07 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.153.02 Driver Version: 570.153.02 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Quadro P2000 Off | 00000000:01:00.0 Off | N/A |
| N/A 28C P3 N/A / 5001W | 190MiB / 4096MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1799 G ...mgw-xorg-server-21.1.18/bin/X 94MiB |
| 0 N/A N/A 2145 G kitty 2MiB |
| 0 N/A N/A 2341 G ...-143.0.3/bin/.firefox-wrapped 89MiB |
+-----------------------------------------------------------------------------------------+
❯ uname -a
Linux othala 6.12.49 #1-NixOS SMP PREEMPT_DYNAMIC Thu Sep 25 09:13:51 UTC 2025 x86_64 GNU/Linux
❯ uname -r
6.12.49
For 210
❯ cat /proc/cmdline
initrd=\EFI\nixos\iqn07p17b1cbrz9c7y4s3qz7h2g7ig0y-initrd-linux-6.12.49-initrd.efi init=/nix/store/vjgx78mys4c3a05fn4dvsl982ar6lz88-nixos-system-othala-25.05.20251001.5b5be50/init loglevel=4 lsm=landlock,yama,bpf nvidia-drm.modeset=1 nvidia-drm.fbdev=1
❯ nvidia-smi
Wed Nov 12 08:26:24 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.153.02 Driver Version: 570.153.02 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Quadro P2000 Off | 00000000:01:00.0 Off | N/A |
| N/A 46C P3 N/A / 5001W | 136MiB / 4096MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1738 G ...mgw-xorg-server-21.1.18/bin/X 70MiB |
| 0 N/A N/A 2085 G kitty 2MiB |
| 0 N/A N/A 2317 G ...7.5/libexec/electron/electron 59MiB |
+-----------------------------------------------------------------------------------------+
❯ uname -a
Linux othala 6.12.49 #1-NixOS SMP PREEMPT_DYNAMIC Thu Sep 25 09:13:51 UTC 2025 x86_64 GNU/Linux
❯ uname -r
6.12.49
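Runtime power-management settings could differ between boots without showing up in the command line or in `nvidia-smi`. A minimal sketch, assuming bash, the proprietary driver’s `/proc/driver/nvidia/` files and the GPU at PCI address 0000:01:00.0 (taken from the kernel log above): the hypothetical `snapshot_gpu_state` function dumps these files so the output from each generation can be diffed. `PROC` and `GPU_DEV` are hypothetical overrides.

```shell
#!/usr/bin/env bash
# Sketch: dump GPU-related runtime state so two boots can be compared with diff.
# Assumptions: the proprietary NVIDIA driver (for /proc/driver/nvidia/*) and the
# GPU at PCI address 0000:01:00.0. PROC and GPU_DEV are hypothetical overrides.
PROC=${PROC:-/proc}
GPU_DEV=${GPU_DEV:-/sys/bus/pci/devices/0000:01:00.0}

# Print a "== label" header followed by the file's contents,
# or note that the file is missing/unreadable on this boot.
show() {
  local label=$1 file=$2
  if [ -r "$file" ]; then
    printf '== %s\n' "$label"
    printf '%s\n' "$(cat "$file")"
  else
    printf '== %s (missing: %s)\n' "$label" "$file"
  fi
}

snapshot_gpu_state() {
  show cmdline            "$PROC/cmdline"
  show nvidia-version     "$PROC/driver/nvidia/version"
  show nvidia-params      "$PROC/driver/nvidia/params"   # module parameters in effect
  show runtime-pm-control "$GPU_DEV/power/control"      # "on" or "auto"
  show runtime-pm-status  "$GPU_DEV/power/runtime_status"
  show link-speed         "$GPU_DEV/current_link_speed"
}
```

Usage: boot system-209, run `source gpu-state.sh; snapshot_gpu_state > state-209.txt`, do the same on system-210 into `state-210.txt`, then `diff state-209.txt state-210.txt`. The runtime status and link speed naturally change from moment to moment, so differences there matter less than differences in the driver parameters or `power/control`.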
I have tried pinning the NVIDIA driver to different branches and reverting to older Linux kernels (the latest generations run 6.12.56). I have also tracked the PCIe link speed of my GPU: it drops from 8 GT/s to 5 GT/s and then to 2.5 GT/s (downgraded), after which the GPU eventually crashes on system-210 but not on system-209.
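For the link-speed tracking, a polling loop that logs only changes, with timestamps, makes it easier to line the transitions up with the journal timestamp of the Xid 79 error. This is a sketch assuming bash and the sysfs `current_link_speed`/`current_link_width` attributes of the PCI device; `watch_link`, `GPU_DEV` and `INTERVAL` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: log the GPU's PCIe link speed/width with a timestamp whenever it changes.
# GPU_DEV and INTERVAL are hypothetical overrides; the default device is the
# PCI address from the kernel log (0000:01:00.0).
GPU_DEV=${GPU_DEV:-/sys/bus/pci/devices/0000:01:00.0}
INTERVAL=${INTERVAL:-1}   # seconds between polls

# Current state as e.g. "<speed> x<width>"; if the device stops responding,
# cat fails and the state degrades to " x", which is logged as a change too.
link_state() {
  printf '%s x%s' "$(cat "$GPU_DEV/current_link_speed")" \
                  "$(cat "$GPU_DEV/current_link_width")"
}

# watch_link [N]: poll N times (0 or omitted = forever), printing only changes.
watch_link() {
  local max=${1:-0} n=0 prev="" cur
  while :; do
    cur=$(link_state)
    if [ "$cur" != "$prev" ]; then
      printf '%s %s\n' "$(date '+%F %T')" "$cur"
      prev=$cur
    fi
    n=$((n + 1))
    if [ "$max" -gt 0 ] && [ "$n" -ge "$max" ]; then
      break
    fi
    sleep "$INTERVAL"
  done
}
```

Usage: `source gpu-link.sh; watch_link >> ~/gpu-link.log`. Appending to a file instead of watching the terminal gives the transitions right before a crash a better chance of surviving on disk.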
At this point I’m fairly certain it isn’t a hardware problem; if it were, system-209 should crash too, right? It must be something I haven’t thought of yet.
What else can I check to find subtle differences between system-209 and system-210 that could cause this change in behaviour? Thanks for any suggestions.