NVIDIA GPU Falling off bus with unclear cause

My GPU has been crashing for a few weeks now, but only when idle: sometimes after a few minutes, sometimes after a few hours. My last available NixOS generation is the only one that still works and does not trigger this behaviour. I’m trying to find out what differs between this generation and the next one I built, which is where the crashes started.

Exact error:

Nov 11 20:58:46 othala kernel: NVRM: GPU at PCI:0000:01:00: GPU-501a87fd-8896-8b94-dbc1-5e71f1abeb04
Nov 11 20:58:46 othala kernel: NVRM: Xid (PCI:0000:01:00): 79, pid=2243, name=.kitty-wrapped, GPU has fallen off the bus.
Nov 11 20:58:46 othala kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Nov 11 20:58:46 othala kernel: NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)

I’ve found that playing a movie on full screen (in something like mpv or youtube) is enough to prevent my GPU from idling and crashing my system on the broken generations.

My only working generation is system-209; system-210 and every later generation fail.

lrwxrwxrwx 1 root root   86 Oct  6 18:22 system-209-link -> /nix/store/rss7v043gsap3n77ydv10492ga61g9hv-nixos-system-othala-25.05.20251001.5b5be50
lrwxrwxrwx 1 root root   86 Oct 12 12:26 system-210-link -> /nix/store/vjgx78mys4c3a05fn4dvsl982ar6lz88-nixos-system-othala-25.05.20251001.5b5be50

This returns nothing, so there don’t seem to be any package changes between the two:

❯ nix store diff-closures /nix/var/nix/profiles/system-209-link /nix/var/nix/profiles/system-210-link                                

I’ve checked the kernel boot parameters, NVIDIA driver version and Linux kernel on both systems:
For 209

❯ cat /proc/cmdline

initrd=\EFI\nixos\iqn07p17b1cbrz9c7y4s3qz7h2g7ig0y-initrd-linux-6.12.49-initrd.efi init=/nix/store/rss7v043gsap3n77ydv10492ga61g9hv-nixos-system-othala-25.05.20251001.5b5be50/init loglevel=4 lsm=landlock,yama,bpf nvidia-drm.modeset=1 nvidia-drm.fbdev=1

❯ nvidia-smi
Wed Nov 12 08:13:07 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.153.02             Driver Version: 570.153.02     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro P2000                   Off |   00000000:01:00.0 Off |                  N/A |
| N/A   28C    P3            N/A  / 5001W |     190MiB /   4096MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1799      G   ...mgw-xorg-server-21.1.18/bin/X         94MiB |
|    0   N/A  N/A            2145      G   kitty                                     2MiB |
|    0   N/A  N/A            2341      G   ...-143.0.3/bin/.firefox-wrapped         89MiB |
+-----------------------------------------------------------------------------------------+

❯ uname -a
Linux othala 6.12.49 #1-NixOS SMP PREEMPT_DYNAMIC Thu Sep 25 09:13:51 UTC 2025 x86_64 GNU/Linux

❯ uname -r
6.12.49

For 210

❯ cat /proc/cmdline

initrd=\EFI\nixos\iqn07p17b1cbrz9c7y4s3qz7h2g7ig0y-initrd-linux-6.12.49-initrd.efi init=/nix/store/vjgx78mys4c3a05fn4dvsl982ar6lz88-nixos-system-othala-25.05.20251001.5b5be50/init loglevel=4 lsm=landlock,yama,bpf nvidia-drm.modeset=1 nvidia-drm.fbdev=1

❯ nvidia-smi                                                                                         
Wed Nov 12 08:26:24 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.153.02             Driver Version: 570.153.02     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro P2000                   Off |   00000000:01:00.0 Off |                  N/A |
| N/A   46C    P3            N/A  / 5001W |     136MiB /   4096MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1738      G   ...mgw-xorg-server-21.1.18/bin/X         70MiB |
|    0   N/A  N/A            2085      G   kitty                                     2MiB |
|    0   N/A  N/A            2317      G   ...7.5/libexec/electron/electron         59MiB |
+-----------------------------------------------------------------------------------------+

❯ uname -a
Linux othala 6.12.49 #1-NixOS SMP PREEMPT_DYNAMIC Thu Sep 25 09:13:51 UTC 2025 x86_64 GNU/Linux

❯ uname -r
6.12.49

I have tried pinning the NVIDIA drivers to different branches and reverting Linux kernels (the latest generations ran 6.12.56). I also tracked the PCIe link speed of my GPU: on system-210 it drops from 8 GT/s to 5 and then to 2.5 (downgraded), and the GPU crashes after a while. On system-209 it doesn’t crash.
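
For anyone who wants to reproduce the link-speed tracking, here is a minimal sketch of one way to do it. It assumes the GPU sits at PCI address 0000:01:00.0 (taken from the NVRM messages above) and reads the standard sysfs attributes current_link_speed and current_link_width. The function names are made up for illustration, and this is not necessarily how the numbers above were collected.

```shell
#!/usr/bin/env bash
# Log the PCIe link state of one device whenever it changes.
# Assumption: the device directory is /sys/bus/pci/devices/0000:01:00.0.

# Print "<speed> x<width>" for the sysfs device directory given as $1.
read_link_state() {
    local dev_dir=$1
    local speed width
    # If an attribute cannot be read for any reason, record that
    # instead of aborting the loop.
    speed=$(cat "$dev_dir/current_link_speed" 2>/dev/null) || speed="unreadable"
    width=$(cat "$dev_dir/current_link_width" 2>/dev/null) || width="unreadable"
    printf '%s x%s\n' "$speed" "$width"
}

# Poll every $2 seconds (default 5) and print a timestamped line on change.
watch_link_state() {
    local dev_dir=$1 interval=${2:-5} last="" now
    while true; do
        now=$(read_link_state "$dev_dir")
        if [ "$now" != "$last" ]; then
            printf '%s %s\n' "$(date '+%F %T')" "$now"
            last=$now
        fi
        sleep "$interval"
    done
}

# Usage, e.g. from an ssh session so it keeps running when the display dies:
#   watch_link_state /sys/bus/pci/devices/0000:01:00.0 5
```

Printing only on change keeps the log short enough to line up against the Xid timestamps in the kernel log. The LnkSta line of lspci -vv -s 01:00.0 shows the same information for a one-off check.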

At this point I’m fairly certain it’s not a hardware problem; if it were, the system-209 generation shouldn’t work either, right? It must be something I haven’t thought of yet.

What else can I check at this point to find any subtle differences between system-209 and system-210 that can cause this change in behaviour? Thanks for any suggestions.

To reduce the mystery: can you check out the Nixpkgs revision used for 209 and see whether you can rebuild a working system from it?

Also, can you reboot, let it idle, and ssh in from outside to see when the bus speed starts dropping (to see how dissectable the issue is)?

I had indeed already tried checking out the nixpkgs revision from 209 and rebuilding a system from there. Still no luck. 209 and 210 actually use the same nixpkgs rev.

Yeah, I can ssh into my machine from another one and still interact with it. So it’s just the GPU that falls off the bus; the rest of the system stays functional. The bus speed drops to 2.5 GT/s and the GPU crashes some time after that, usually within a minute or two. However, on the generation 209 system the bus speed also drops to 2.5 GT/s, but the GPU remains functional.

Any ideas on what else I could do to diagnose the problem?

Any promising details in the dmesg?

If it is the same revision, nix-diff might help?

Nothing interesting at all in dmesg, apart from the Xid 79 kernel message about the GPU falling off the bus.
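
Since the machine usually gets rebooted after the crash, the relevant messages end up in the previous boot's kernel log rather than in the current dmesg. A small sketch for pulling just the Xid events out of it, assuming the default journalctl short output format (the same format as the error lines quoted at the top); xid_events is a made-up helper name:

```shell
#!/usr/bin/env bash
# Reduce kernel log lines on stdin to "<timestamp> xid=<code> <message>".
# Lines that are not NVRM Xid reports are dropped.
xid_events() {
    sed -nE 's/^(.*) [^ ]+ kernel: NVRM: Xid \(PCI:[^)]*\): ([0-9]+), (.*)$/\1 xid=\2 \3/p'
}

# Usage: kernel messages of the previous boot (-b -1), Xid lines only:
#   journalctl -k -b -1 --no-pager | xid_events
```

Anything the kernel logged shortly before the first Xid line (PCIe AER or power-management messages, for instance) is worth reading in the unfiltered output as well.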

Thanks for pointing me to nix-diff, I didn’t know about this tool yet. I thought nix store diff-closures would do the same, but apparently nix-diff digs a little deeper. The only difference between 209 and 210 is that I added a network password for wpa_supplicant. That’s it. I removed it again to be sure, but the problem still persists.
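
For reference, a rough sketch of how two generations can be compared with nix-diff. It assumes nix-diff is on PATH (for example via nix-shell -p nix-diff) and that the derivers of both system paths are still known to the store. diff_systems is a made-up wrapper name, not part of either tool.

```shell
#!/usr/bin/env bash
# Compare the derivations that two system generations were built from.
# $1 and $2 are profile links or store paths of the two systems.
diff_systems() {
    local old new
    # nix-store --query --deriver prints the .drv a store path was built from.
    old=$(nix-store --query --deriver "$(readlink -f "$1")") || return 1
    new=$(nix-store --query --deriver "$(readlink -f "$2")") || return 1
    # nix-diff walks both derivations and reports where they differ.
    nix-diff "$old" "$new"
}

# Usage:
#   diff_systems /nix/var/nix/profiles/system-209-link \
#                /nix/var/nix/profiles/system-210-link
```

Unlike nix store diff-closures, which summarises package version and size changes between closures, this compares the build recipes themselves, so a change to a generated config file shows up even when no package version changed.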

I compared one of my (much) later generations (271 or so) with 209 and, among the many, many lines, noticed that one kernel module differed between the two. Apparently I had added the NVIDIA kernel module manually and removed it after 209, even though the diff between 209 and 210 didn’t show this. I have added it again (boot.initrd.kernelModules = [ "nvidia" ];) and the system seems stable at first glance: I’ve let it idle for a while now and the GPU is still hanging on. I seriously wonder why this difference didn’t show up between the older generations, which still froze my GPU.

Regardless, it seems to have been solved now. Thanks for thinking along and pointing me to nix-diff, I’ll make sure to remember that one (and stay more on top of version control of my system config).

Ah great!

Yeah, I am a Nouveau user on my laptop, where I need the Nvidia card working as an output, and the card is a decade old, so the details of NVRM issues definitely escape me (and I do not know how much the device support of the non-free Nvidia driver overlaps with Nouveau’s).

This does not make a practical difference; it’s added by the nvidia module by default: nixpkgs/nixos/modules/hardware/video/nvidia.nix at c2448301fb856e351aab33e64c33a3fc8bcf637d · NixOS/nixpkgs · GitHub

Unless you explicitly disable services.xserver? Maybe that condition should be removed.