Suspend/Resume broken after 24.11 Update

polygon · December 2, 2024, 6:27pm

Suspend/resume is (partially) broken on my desktop. The screen stays black but the network comes up, so I can SSH into the machine and pull some diagnostics. The machine has a GTX 1060, so I configured it to use the proprietary NVIDIA driver, as recommended. I am running KDE Plasma 6 on the Xorg server, not Wayland. The config for this system is here:

https://github.com/polygon/dotfiles/blob/99fc5737429a0807bcdf572e1447746b2a6a802c/systems/cube/cube.nix

# Edit this configuration file to define what should be installed on
# your system.  Help is available in the configuration.nix(5) man page
# and in the NixOS manual (accessible by running ‘nixos-help’).

{ config, pkgs, unstable, ... }:
{
  # == Module Configuration ==

  # General settings for all clients/workstations
  modules.systems.client.enable = true;

  # Enable Wireguard tunnels
  modules.wireguard.mullvad.enable = true;

  # Enable VirtualBox
  modules.apps.virtualbox.enable = true;

  # Enable SyncThing
  modules.apps.syncthing.enable = true;

This file has been truncated.

The journalctl output points to issues with the graphics card:

Dec 02 18:46:28 cube kernel: PM: suspend exit
Dec 02 18:46:28 cube kernel: NVRM: GPU at PCI:0000:07:00: GPU-b75ba18c-1365-8656-6e98-8e243012b937
Dec 02 18:46:28 cube kernel: NVRM: Xid (PCI:0000:07:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: Shader Program Header 18 Error
Dec 02 18:46:28 cube kernel: NVRM: Xid (PCI:0000:07:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x405840=0x82040000
Dec 02 18:46:28 cube kernel: NVRM: Xid (PCI:0000:07:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x405848=0x80000000
Dec 02 18:46:28 cube kernel: NVRM: Xid (PCI:0000:07:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ChID 0040, Class 0000c197, Offset 00001b0c, Data 1000f010

[...]

Dec 02 18:46:31 cube kernel: NVRM: Xid (PCI:0000:07:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: Shader Program Header 11 Error
Dec 02 18:46:31 cube kernel: NVRM: Xid (PCI:0000:07:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: Shader Program Header 18 Error
Dec 02 18:46:31 cube kernel: NVRM: Xid (PCI:0000:07:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x405840=0xa2040800
Dec 02 18:46:31 cube kernel: NVRM: Xid (PCI:0000:07:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x405848=0x80000000
Dec 02 18:46:31 cube kernel: NVRM: Xid (PCI:0000:07:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ChID 009c, Class 0000c197, Offset 00002390, Data 00000000

[...]

Dec 02 18:46:44 cube kwin_x11[6983]: kwin_scene_opengl: A graphics reset attributable to the current GL context occurred.
Dec 02 18:46:44 cube kernel: BUG: unable to handle page fault for address: ffffc90015eea800
Dec 02 18:46:44 cube kernel: #PF: supervisor read access in kernel mode
Dec 02 18:46:44 cube kernel: #PF: error_code(0x0000) - not-present page
Dec 02 18:46:44 cube kernel: PGD 100000067 P4D 100000067 PUD 10026e067 PMD 19b24d067 PTE 0
Dec 02 18:46:44 cube kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Dec 02 18:46:44 cube kernel: CPU: 3 PID: 8518 Comm: vsync event mon Tainted: P           O       6.6.63 #1-NixOS
Dec 02 18:46:44 cube kernel: Hardware name: Gigabyte Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F32 05/06/2019 
Dec 02 18:46:44 cube kernel: RIP: 0010:_nv012663rm+0xf1/0x1f0 [nvidia]
Dec 02 18:46:44 cube kernel: Code: 00 48 8b b8 a0 1d 00 00 49 8b 84 24 20 09 00 00 48 c7 45 28 00 00 00 00 89 45 20 e8 f9 0b 66 00 4c 8b 45 18 8b 4d 20 49 8b 00 <8b> 10 48 89 48 20 48 8b 4d 28 48 89 48 28 0f ae f8 89 50 18 8b 45
Dec 02 18:46:44 cube kernel: RSP: 0018:ffffc90011b6bbd0 EFLAGS: 00010246
Dec 02 18:46:44 cube kernel: RAX: ffffc90015eea800 RBX: ffff888180d79408 RCX: 0000000000000be0
Dec 02 18:46:44 cube kernel: RDX: 00000000180d6d4b RSI: 0000000000009410 RDI: ffff888114918008
Dec 02 18:46:44 cube kernel: RBP: ffff888100cbdbb0 R08: ffff8881981458d8 R09: 0000000000000000 
Dec 02 18:46:44 cube kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff888127566008
Dec 02 18:46:44 cube kernel: R13: 000000000007f800 R14: 0000000000000003 R15: ffff888127566900
Dec 02 18:46:44 cube kernel: FS:  00007f127a7dc6c0(0000) GS:ffff8887fe380000(0000) knlGS:0000000000000000
Dec 02 18:46:44 cube kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 02 18:46:44 cube kernel: CR2: ffffc90015eea800 CR3: 0000000163ba4000 CR4: 00000000003506e0 
Dec 02 18:46:44 cube kernel: Call Trace:
Dec 02 18:46:44 cube kernel:  <TASK>
Dec 02 18:46:44 cube kernel:  ? __die+0x23/0x80
Dec 02 18:46:44 cube kernel:  ? page_fault_oops+0x171/0x500
Dec 02 18:46:44 cube kernel:  ? srso_return_thunk+0x5/0x5f
Dec 02 18:46:44 cube kernel:  ? srso_return_thunk+0x5/0x5f
Dec 02 18:46:44 cube kernel:  ? search_bpf_extables+0x5f/0x90
Dec 02 18:46:44 cube kernel:  ? exc_page_fault+0x158/0x160
Dec 02 18:46:44 cube kernel:  ? asm_exc_page_fault+0x26/0x30
Dec 02 18:46:44 cube kernel:  ? _nv012663rm+0xf1/0x1f0 [nvidia]
Dec 02 18:46:44 cube kernel:  _nv023448rm+0x97/0xa6 [nvidia]
Dec 02 18:46:44 cube kernel:  _nv047909rm+0x1a1/0x1b0 [nvidia]
Dec 02 18:46:44 cube kernel:  _nv022972rm+0xd9/0x160 [nvidia]
Dec 02 18:46:44 cube kernel:  _nv049933rm+0x3ff/0x500 [nvidia]
Dec 02 18:46:44 cube kernel:  _nv014741rm+0x3f1/0x690 [nvidia]
Dec 02 18:46:44 cube kernel:  _nv048059rm+0x69/0xd0 [nvidia]
Dec 02 18:46:44 cube kernel:  ? _nv000702kms+0x90/0x90 [nvidia_modeset]
Dec 02 18:46:44 cube kernel:  _nv013137rm+0x86/0xa0 [nvidia]
Dec 02 18:46:44 cube kernel:  _nv000598rm+0x5e/0x70 [nvidia]
Dec 02 18:46:44 cube kernel:  rm_kernel_rmapi_op+0x127/0x213 [nvidia]
Dec 02 18:46:44 cube kernel:  ? srso_return_thunk+0x5/0x5f
Dec 02 18:46:44 cube kernel:  nvkms_call_rm+0x4f/0x90 [nvidia_modeset]
Dec 02 18:46:44 cube kernel:  _nv002849kms+0x42/0x50 [nvidia_modeset]
Dec 02 18:46:44 cube kernel:  ? _nv002490kms+0x75/0xa0 [nvidia_modeset]
Dec 02 18:46:44 cube kernel:  ? _nv000119kms+0x67/0xa0 [nvidia_modeset]
Dec 02 18:46:44 cube kernel:  ? srso_return_thunk+0x5/0x5f
Dec 02 18:46:44 cube kernel:  ? _copy_from_user+0x2f/0x90
Dec 02 18:46:44 cube kernel:  ? srso_return_thunk+0x5/0x5f
Dec 02 18:46:44 cube kernel:  ? nvKmsIoctl+0xf9/0x270 [nvidia_modeset]
Dec 02 18:46:44 cube kernel:  ? nvkms_unlocked_ioctl+0x11a/0x190 [nvidia_modeset]
Dec 02 18:46:44 cube kernel:  ? __x64_sys_ioctl+0x9f/0xe0
Dec 02 18:46:44 cube kernel:  ? do_syscall_64+0x39/0x90
Dec 02 18:46:44 cube kernel:  ? entry_SYSCALL_64_after_hwframe+0x78/0xe2
Dec 02 18:46:44 cube kernel:  </TASK>

Dec 02 18:46:44 cube kernel: CR2: ffffc90015eea800
Dec 02 18:46:44 cube kernel: ---[ end trace 0000000000000000 ]---
Dec 02 18:46:44 cube kernel: RIP: 0010:_nv012663rm+0xf1/0x1f0 [nvidia]
Dec 02 18:46:44 cube kernel: Code: 00 48 8b b8 a0 1d 00 00 49 8b 84 24 20 09 00 00 48 c7 45 28 00 00 00 00 89 45 20 e8 f9 0b 66 00 4c 8b 45 18 8b 4d 20 49 8b 00 <8b> 10 48 89 48 20 48 8b 4d 28 48 89 48 28 0f ae f8 89 50 18 8b 45
Dec 02 18:46:44 cube kernel: RSP: 0018:ffffc90011b6bbd0 EFLAGS: 00010246
Dec 02 18:46:44 cube kernel: RAX: ffffc90015eea800 RBX: ffff888180d79408 RCX: 0000000000000be0
Dec 02 18:46:44 cube kernel: RDX: 00000000180d6d4b RSI: 0000000000009410 RDI: ffff888114918008
Dec 02 18:46:44 cube kernel: RBP: ffff888100cbdbb0 R08: ffff8881981458d8 R09: 0000000000000000
Dec 02 18:46:44 cube kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff888127566008
Dec 02 18:46:44 cube kernel: R13: 000000000007f800 R14: 0000000000000003 R15: ffff888127566900
Dec 02 18:46:44 cube kernel: FS:  00007f127a7dc6c0(0000) GS:ffff8887fe380000(0000) knlGS:0000000000000000
Dec 02 18:46:44 cube kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 02 18:46:44 cube kernel: CR2: ffffc90015eea800 CR3: 0000000163ba4000 CR4: 00000000003506e0
Dec 02 18:46:44 cube kernel: note: vsync event mon[8518] exited with irqs disabled

Some searching around the error messages brought me here:

https://wiki.archlinux.org/title/NVIDIA/Tips_and_tricks#Preserve_video_memory_after_suspend
https://download.nvidia.com/XFree86/Linux-x86_64/435.17/README/powermanagement.html

It seems that, for some reason, the NVIDIA driver's /proc interface has to be used to tell the driver that a suspend is imminent, and again after resume to tell it that the system has woken up.

NixOS does not currently seem to create nvidia-suspend.sh, nvidia-resume.sh, and nvidia-hibernate.sh. Just adding the kernel options from the first link breaks suspend/resume completely: the driver complains that it never received the suspend call on the /proc interface.

Does anyone else encounter this issue after upgrading to 24.11?

marmar · December 2, 2024, 6:35pm

I have NVIDIA Optimus with the NVIDIA drivers enabled.

My case is a bit different: when I run systemctl suspend, the OS freezes, and all I can do is force the laptop to power off…

It was working just fine some versions prior.

polygon · December 2, 2024, 7:13pm

I’ve found hardware.nvidia.powerManagement.enable (NixOS Search).

With it enabled, my screen now turns on after resume. However, it only shows console output; the desktop UI never comes back.
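
For anyone following along, a minimal sketch of how this option might be set. Only the option name comes from this post; the module wrapper around it is an assumption, and the line would normally be merged into an existing configuration.nix. NixOS Search describes the option as experimental power management through systemd and points to the NVIDIA power management documentation.

```nix
# Sketch only: merge into your existing NixOS configuration.
{ ... }:
{
  # Experimental NVIDIA power management through systemd.
  # This is the option named in the post; everything else here is scaffolding.
  hardware.nvidia.powerManagement.enable = true;
}
```

As the post says, on this machine this brought the screen back after resume but not the desktop UI.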

polygon · December 3, 2024, 9:54am

Still not sure what is going on, but it definitely is not just a driver issue. I changed the nvidia driver version to be the one I used in 24.05 (550.78), however the issue still persists. Something else must have gotten messed up in 24.11.

mmarx · December 3, 2024, 11:30am

I have the same problem, also with a GTX 1060. I’m currently running both linuxPackages_latest and the beta-branch NVIDIA drivers, and for now that seems to work, but I’ll keep an eye on that.
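
A sketch of what that combination might look like in a NixOS module. The attribute names are the usual nixpkgs ones (pkgs.linuxPackages_latest and the beta entry of nvidiaPackages); the exact configuration used here is not shown in the post, so treat this as an assumption.

```nix
# Sketch only: latest kernel plus beta-branch NVIDIA driver.
{ config, pkgs, ... }:
{
  # Newest kernel package set available in the channel.
  boot.kernelPackages = pkgs.linuxPackages_latest;

  # Beta driver, built against whichever kernel is selected above.
  hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.beta;
}
```

Taking the driver from config.boot.kernelPackages keeps the kernel module in step with the selected kernel.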

polygon · December 3, 2024, 12:37pm

Updating the driver to beta fixes the issue somewhat. My system eventually comes back into the graphical environment. However, in 24.05 it took maybe 5 seconds for the lockscreen to appear, and everything ran smoothly afterwards. Now it takes around 30 seconds for the lockscreen to appear and another 1-2 minutes until the desktop becomes responsive.

I’m still on linuxPackages.linux_6_6 due to ZFS. All more recent kernels supported by stable ZFS are already EoL.

This seems related, but unfortunately the thread went stale without a solution: Screen locker: Must switch to virtual console and back to get password dialog - #12 by Lehas777 - Help - KDE Discuss

vincenttc · December 14, 2024, 5:10pm

I ran into the same issues, but managed to make suspend somewhat work for my use case. I use i3 as a window manager, and picom as a compositor. I’m currently on driver version 565.77. What works for me is enabling hardware.nvidia.powerManagement and killing picom automatically after suspend. It is not perfect as I get hangs for a couple of seconds when opening a new alacritty terminal for the first time. The lock screen appears within a couple of seconds.

Some other variations I have tried:

Not enabling hardware.nvidia.powerManagement: I have to switch to another tty and back to get a login screen (and my wallpaper disappears).
Not killing picom automatically after suspend: the login screen never appears unless I go to another tty and kill picom from there.
Killing picom before suspend: this results in just a black screen (a video signal is still being sent, as my monitor stays awake), and switching ttys doesn’t work.
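
One way the working variant (power management enabled, picom killed after resume) could be expressed in NixOS is sketched below. This is an assumption, not the exact setup described above: the post does not say how picom is killed, and the sketch uses the generic powerManagement.resumeCommands hook with pkill from pkgs.procps.

```nix
# Sketch only: kill picom after resume via the generic resume hook.
{ pkgs, ... }:
{
  hardware.nvidia.powerManagement.enable = true;

  # Assumption: the actual mechanism used in the post is unknown.
  # pkill exits non-zero when no process matches, so "|| true" keeps
  # the resume hook from failing when picom is not running.
  powerManagement.resumeCommands = ''
    ${pkgs.procps}/bin/pkill picom || true
  '';
}
```

The hook runs system-wide and kills every picom process. The sketch does not restart picom afterwards; that would have to happen from the user session.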

polygon · December 14, 2024, 8:23pm

It doesn’t seem to be limited to suspend/resume. I noticed I get these issues even when the screen just turns off after some minutes of idling. It appears that 24.11 has some issues, whether they come from NixOS itself or just from a bad combination of software versions. There seems to be plenty of breakage in things that used to work.

palik · December 23, 2024, 11:55am

Facing the same issue on the unstable channel, without NVIDIA drivers.

polygon · January 6, 2025, 9:42pm

Some updates later, the desktop does not come back even after extensive waiting. I need to switch to a console and restart the display manager.

polygon · January 11, 2025, 3:36pm

Latest update: things started working for me again with the beta NVIDIA drivers. Strangely enough, I had to disable hardware.nvidia.powerManagement.enable again.

tengkuizdihar · February 15, 2025, 5:57am

I have the same problem: my screen goes blank after suspending. It only happens on my PC with NVIDIA, not on my laptop, which uses Intel onboard graphics. Is there already an issue for this on GitHub? I searched for one, but the closest I found is "Suspend/resume broken on NixOS 24.11 when using KVM" (Issue #369376, NixOS/nixpkgs).

I was using NixOS 24.11 at 9d3ae807ebd2981d593cddd0080856873139aa40; right now I am trying the latest one, 2ff53fe64443980e139eaa286017f53f88336dd0.

tengkuizdihar · February 15, 2025, 7:03am

update: still broke lol
update again: disabling picom fixed it, now i can suspend just like usual

polygon · February 15, 2025, 11:22am

Hope it keeps working. I often thought that things started working again, but apparently there was some kind of hardware state involved and after a while things broke again seemingly out of nowhere.

What seemed to have fixed it for me for good was moving to LTS Kernel 6.12. Almost a week now and suspend/resume is working fine through multiple cycles and full reboots.
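
A sketch of how the kernel might be pinned to 6.12 in a NixOS module. The attribute name pkgs.linuxPackages_6_12 is the usual nixpkgs one; whether it is exactly what is used here is an assumption.

```nix
# Sketch only: pin the kernel package set to the 6.12 LTS series.
{ pkgs, ... }:
{
  boot.kernelPackages = pkgs.linuxPackages_6_12;
}
```

On a ZFS system, the selected kernel also has to be one that the ZFS package in the channel supports.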

vincenttc · February 21, 2025, 9:28am

What worked for me in the end was switching to the production driver (hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.production;). Everything works fine now, and I don’t need to kill picom anymore. I still have power management enabled; I haven’t tried with it off yet.

fareycircles · March 31, 2025, 2:34am

I have the same NixOS version (24.11) and GPU (GeForce GTX 1060 3GT OC), and I’m having a similar problem: graphics corruption on waking from suspend. Updating to kernel version 6.12 didn’t solve the problem for me, although it did seem to reduce the corruption a bit.

I think I’ve tried every combination of kernel versions 6.6 / 6.12 and NVIDIA driver versions 550.135 (production) / 565.77 (stable and beta). I’ve also tried kernel version 6.13 with NVIDIA driver version 570.86.

pwaller · April 21, 2025, 6:28am

I recently got suspend/resume working reliably again on my NVIDIA system. The symptoms were exactly as you described, @fareycircles. I suspect the same thing is happening on my amdgpu system as well, but there the symptoms are nowhere near as bad.

What fixed it was the workaround from issue #369376:

systemd.services.systemd-suspend.environment.SYSTEMD_SLEEP_FREEZE_USER_SESSIONS = "false";
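
In context, that line might sit in a module like the sketch below. The first setting is the workaround quoted above. The second is an assumption: it applies the same environment variable to systemd-hibernate.service for anyone who also hibernates, and is not taken from the post.

```nix
# Sketch only: stop systemd-sleep from freezing user sessions around sleep.
{ ... }:
{
  # Workaround as quoted above (from issue #369376).
  systemd.services.systemd-suspend.environment.SYSTEMD_SLEEP_FREEZE_USER_SESSIONS = "false";

  # Assumption, not from the post: same setting for hibernation.
  systemd.services.systemd-hibernate.environment.SYSTEMD_SLEEP_FREEZE_USER_SESSIONS = "false";
}
```

With this set, user processes keep running while the system goes to sleep.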

fareycircles · April 21, 2025, 6:40pm

This didn’t work for me, but I’ve passed it on to this related thread in case it helps someone else!