NVIDIA power saving misbehaving when disconnecting laptop from power

Hello!
Disclaimer: this is my first post on any sort of forum, so I’m open to feedback on how I post :slight_smile:

I’m running into some strange nvidia power management behavior on my lenovo legion laptop when on battery power. I’m running X on my integrated amd gpu, and I’ve followed the nixos nvidia page.

While connected to the charger, it works fine; the card enters its sleep state (confirmed through “/sys/bus/pci/devices/0000:01:00.0/power/runtime_status”, where the address corresponds to my card, as well as other sources).
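For anyone wanting to repeat that check, here is a minimal sketch that prints both runtime power-management attributes for the GPU and its audio function. The function name pci_pm_report and the SYSFS_ROOT override are made up for illustration (the override only exists so the function can be pointed at a fake directory tree); the sysfs paths are the ones mentioned in this post.

```shell
#!/bin/sh
# Print the runtime power-management attributes of one PCI device.
# SYSFS_ROOT is a hypothetical override (defaults to /sys), not a standard variable.
pci_pm_report() {
    dev="${SYSFS_ROOT:-/sys}/bus/pci/devices/$1/power"
    for attr in control runtime_status; do
        if [ -r "$dev/$attr" ]; then
            printf '%s %s=%s\n' "$1" "$attr" "$(cat "$dev/$attr")"
        else
            # The attribute (or the whole device) is not present right now.
            printf '%s %s=<missing>\n' "$1" "$attr"
        fi
    done
}

# 0000:01:00.0 is the GPU and 0000:01:00.1 its audio function on this laptop;
# other machines will use different addresses.
pci_pm_report 0000:01:00.0
pci_pm_report 0000:01:00.1
```

Running it once on AC power and once on battery should make the difference between the two situations easy to compare side by side.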

However, when I unplug the charger, it mysteriously goes into “active” mode (from D3cold to D0) and stays there. I’ve disabled tlp and anything else in my configuration that could conflict with it; it’s basically down to barebones.

There’s a second PCI device which corresponds to the nvidia audio subsystem, which also needs to have power management enabled for the nvidia card to go to sleep (I think this happens automatically with the nixos config, but I’ve also been able to force it with “echo auto | sudo tee /sys/bus/pci/devices/0000:01:00.1/power/control”).
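The manual “echo auto” step can be generalized to every function in the slot. This is only a sketch: enable_runtime_pm and SYSFS_ROOT are hypothetical names introduced here, and on a real system the script has to run as root for the writes to succeed.

```shell
#!/bin/sh
# Write "auto" to power/control for every function of one PCI slot,
# e.g. 0000:01:00.0 (GPU) and 0000:01:00.1 (audio).
# SYSFS_ROOT is a hypothetical override (defaults to /sys) for dry runs.
enable_runtime_pm() {
    root="${SYSFS_ROOT:-/sys}"
    for ctl in "$root"/bus/pci/devices/"$1".*/power/control; do
        # If the glob matched nothing, the literal pattern is left over; skip it.
        [ -e "$ctl" ] || continue
        if [ -w "$ctl" ]; then
            echo auto > "$ctl" && echo "auto -> $ctl"
        else
            echo "skipped (not writable): $ctl" >&2
        fi
    done
}

# Slot address taken from this post; adjust for other hardware.
enable_runtime_pm 0000:01:00
```

The setting does not survive a reboot or a device removal, so it only helps while the audio function is still present.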

When my laptop is unplugged (or when it boots unplugged), something removes the audio PCI device, and the graphics PCI device goes into D0 active power mode permanently, even though it should still have power management enabled (confirmed with “/sys/bus/pci/devices/0000:01:00.0/power/control” returning “auto”).

I’ve done some investigating, and if I have “udevadm monitor” running when I unplug the charger, it seems the kernel is removing the device:

# udevadm monitor:

KERNEL[960.782975] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/controlC0 (sound)
KERNEL[960.783066] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input17/event13 (input)
KERNEL[960.788150] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input17 (input)
KERNEL[960.788184] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input16/event12 (input)
UDEV  [960.797849] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/controlC0 (sound)
UDEV  [960.798570] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input17/event13 (input)
UDEV  [960.799050] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input17 (input)
UDEV  [960.799881] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input16/event12 (input)
KERNEL[960.801130] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input16 (input)
KERNEL[960.801182] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input15/event11 (input)
UDEV  [960.801982] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input16 (input)
UDEV  [960.802138] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input15/event11 (input)
KERNEL[960.809736] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input15 (input)
KERNEL[960.809837] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input14/event10 (input)
UDEV  [960.810719] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input15 (input)
UDEV  [960.811042] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input14/event10 (input)
KERNEL[960.821528] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input14 (input)
KERNEL[960.821646] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/hwC0D0 (sound)
KERNEL[960.821742] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/pcmC0D9p (sound)
KERNEL[960.821785] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/pcmC0D8p (sound)
KERNEL[960.821869] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/pcmC0D7p (sound)
KERNEL[960.821908] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/pcmC0D3p (sound)
KERNEL[960.822035] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0 (sound)
UDEV  [960.822670] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/pcmC0D9p (sound)
UDEV  [960.822695] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input14 (input)
UDEV  [960.822710] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/hwC0D0 (sound)
UDEV  [960.824093] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/pcmC0D3p (sound)
UDEV  [960.824113] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/pcmC0D7p (sound)
KERNEL[960.825217] unbind   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/hdaudioC0D0 (hdaudio)
KERNEL[960.825253] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/hdaudioC0D0 (hdaudio)
UDEV  [960.825298] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/pcmC0D8p (sound)
KERNEL[960.825634] unbind   /devices/pci0000:00/0000:00:01.1/0000:01:00.1 (pci)
UDEV  [960.825674] unbind   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/hdaudioC0D0 (hdaudio)
UDEV  [960.825723] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0 (sound)
KERNEL[960.825771] remove   /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:00/device:02/wakeup/wakeup14 (wakeup)
KERNEL[960.825794] remove   /devices/virtual/devlink/pci:0000:01:00.0--pci:0000:01:00.1 (devlink)
KERNEL[960.825843] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1 (pci)
UDEV  [960.826038] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/hdaudioC0D0 (hdaudio)
UDEV  [960.826167] remove   /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:00/device:02/wakeup/wakeup14 (wakeup)
UDEV  [960.826191] remove   /devices/virtual/devlink/pci:0000:01:00.0--pci:0000:01:00.1 (devlink)
UDEV  [960.826329] unbind   /devices/pci0000:00/0000:00:01.1/0000:01:00.1 (pci)
UDEV  [960.826501] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1 (pci)
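
Since every KERNEL line in that log is followed by a matching UDEV line, the output is easier to read with the udev echoes filtered out. A small sketch of such a filter follows; kernel_events is a made-up helper name, and the sample input is three lines copied from the log above.

```shell
#!/bin/sh
# Reduce `udevadm monitor` output to kernel-originated events, printed as
# "<timestamp> <action> <subsystem> <last path component>".
kernel_events() {
    awk '$1 ~ /^KERNEL\[/ {
        ts = $1
        sub(/^KERNEL\[/, "", ts)      # strip the "KERNEL[" prefix
        sub(/\]$/, "", ts)            # strip the closing bracket
        n = split($3, parts, "/")     # $3 is the device path
        gsub(/[()]/, "", $4)          # $4 is the subsystem in parentheses
        print ts, $2, $4, parts[n]
    }'
}

# Sample input taken from the log in this post.
kernel_events <<'EOF'
KERNEL[960.825634] unbind   /devices/pci0000:00/0000:00:01.1/0000:01:00.1 (pci)
UDEV  [960.825674] unbind   /devices/pci0000:00/0000:00:01.1/0000:01:00.1/hdaudioC0D0 (hdaudio)
KERNEL[960.825843] remove   /devices/pci0000:00/0000:00:01.1/0000:01:00.1 (pci)
EOF
```

For live use, the same function can be fed from “udevadm monitor --kernel”, which only prints the kernel uevents in the first place.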

Also, dmesg at that same moment shows that something (possibly the same culprit?) tries to remove the graphics card itself:

NVRM: Attempting to remove device 0000:01:00.0 with non-zero usage count!

I’m not sure if this is the right place to post this, since it may well be a driver, laptop model, or BIOS-specific problem, but hopefully someone can help :smiley:

Also, here’s my nvidia related section of configuration.nix:


  # Enable openGL
  hardware.opengl = {
    enable = true;
    driSupport = true;
    driSupport32Bit = true;
  };

  # Load drivers for X
  services.xserver.videoDrivers = [ "amdgpu" ];

  hardware.nvidia = {
    modesetting.enable = true;
    powerManagement.enable = true;
    powerManagement.finegrained = true;
    # Use proprietary driver
    open = false;
    nvidiaSettings = true;
    # Select driver
    package = config.boot.kernelPackages.nvidiaPackages.production;


    # enable and configure PRIME
    prime = {
      offload.enable = true;
      offload.enableOffloadCmd = true;
      amdgpuBusId = "PCI:6:0:0";
      nvidiaBusId = "PCI:1:0:0";
    };
  };

  boot.kernelModules = [ "amdgpu" "nvidia" ];

Something to note is that if I add the modprobe line options nvidia "NVreg_EnableGpuFirmware=1" to my configuration.nix, then both the graphics and the audio PCI devices are removed (successfully this time).

Also, I examined all existing udev rules on my system and ruled that out as a source for this behaviour (and udevadm shows that udev reacts to the kernel doing stuff, not the other way around).

My hardware is a Lenovo Legion Slim 5 14APH8 with AMD Ryzen 7 7840HS (+ Radeon 780M) as well as an NVIDIA 4060.

I should also add that I can successfully use nvidia offloading, so at least no problems there.

Sorry for the long post, and thanks for the help.

First off, here’s the actual NixOS wiki page on nvidia: Nvidia - NixOS Wiki

Nvidia power management is kind of known to cause issues, especially when sleep is involved. I’d give updating the driver to the newest version a shot before trying to go down this rabbit hole any further. Take an example from my dotfiles.

I’d also give running it without offloading a shot, just to see if that is involved. If it is, that would narrow down the problem.

It’s also always worth reading NVIDIA’s driver docs (this one is for 555.58.02, you’re running something much older).

The unofficial wiki is outdated on the open thing, by the way, 555.58.02 doesn’t yet set it by default, but the very next driver version will - officially making it the version preferred by nvidia now. Sadly that driver has a new dependency, making it less trivial to switch to, but I’d still suggest giving the open driver a shot and seeing if it regresses anything.

Thanks for the advice!

I tried using the new driver, as well as older and newer ones, but they don’t affect the behaviour at all… Even using nouveau the same thing happens, which really doesn’t seem like a driver issue. I’ve also installed arch linux in a separate partition, and lo and behold, the exact same behaviour occurs. I’ve got a hunch it’s some strange BIOS behavior…

Welp, a bit of a nuclear option, but I wrote a kernel patch to forcibly avoid the device being removed, with a kernel parameter to turn the changes off just in case:

diff -rupN linux-vanilla/drivers/acpi/osl.c linux-nvidiapatch/drivers/acpi/osl.c
--- linux-vanilla/drivers/acpi/osl.c	2024-08-28 15:12:34.035708325 -0700
+++ linux-nvidiapatch/drivers/acpi/osl.c	2024-08-30 21:35:14.120217200 -0700
@@ -1148,10 +1148,23 @@ struct acpi_hp_work {
 	u32 src;
 };
 
+bool allow_nvidia_removal = false;
+core_param(allow_nvidia_audio_removal, allow_nvidia_removal, bool, 0644);
+
 static void acpi_hotplug_work_fn(struct work_struct *work)
 {
 	struct acpi_hp_work *hpw = container_of(work, struct acpi_hp_work, work);
 
+  // NVIDIA audio patch
+  printk(KERN_INFO "acpi_hotplug_work_fn called...\n");
+
+  const char *name = dev_name(&hpw->adev->dev);
+  if (!allow_nvidia_removal && name && (strcmp(name, "device:01") == 0 || strcmp(name, "device:02") == 0)) {
+    printk(KERN_INFO "blocked acpi attempt to remove nvidia card.\n");
+    kfree(hpw); // free the work item on the early-return path too
+    return;
+  }
+
 	acpi_os_wait_events_complete();
 	acpi_device_hotplug(hpw->adev, hpw->src);
 	kfree(hpw);
diff -rupN linux-vanilla/include/linux/acpi.h linux-nvidiapatch/include/linux/acpi.h
--- linux-vanilla/include/linux/acpi.h	2024-08-28 15:12:39.507685961 -0700
+++ linux-nvidiapatch/include/linux/acpi.h	2024-08-30 21:36:08.633150830 -0700
@@ -74,6 +74,8 @@ static inline struct fwnode_handle *acpi
 	return fwnode;
 }
 
+extern bool allow_nvidia_removal;
+
 static inline void acpi_free_fwnode_static(struct fwnode_handle *fwnode)
 {
 	if (WARN_ON(!is_acpi_static_node(fwnode)))
diff -rupN linux-vanilla/init/main.c linux-nvidiapatch/init/main.c
--- linux-vanilla/init/main.c	2024-08-28 15:12:40.025683870 -0700
+++ linux-nvidiapatch/init/main.c	2024-08-28 15:21:39.951150105 -0700
@@ -933,6 +933,7 @@ void start_kernel(void)
 	boot_cpu_hotplug_init();
 
 	pr_notice("Kernel command line: %s\n", saved_command_line);
+  printk(KERN_INFO "This is the kernel with the custom nvidia patch, yay!\n");
 	/* parameters may set static keys */
 	jump_label_init();
 	parse_early_param();

Here it is in case anyone happens to be using the same laptop on linux lol :P

Realistically, this is a bad idea and I should look at my ACPI tables or something, but on the other hand, it works fine and I’d rather not spend more time on this unless I need to :P

I added it to my configuration.nix with

boot.kernelPatches = [{
  name = "nvidia-audio-patch";
  patch = /path/to/patch.patch;
}];
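
Because the patch registers its switch with core_param, the parameter should show up under /sys/module/kernel/parameters/ and be settable on the kernel command line as allow_nvidia_audio_removal=1 (on NixOS that would presumably go into boot.kernelParams). Below is a small sketch for checking which way the switch is currently set; patch_param_state and SYSFS_ROOT are hypothetical names used only for this example.

```shell
#!/bin/sh
# Report whether the patched kernel is currently blocking the ACPI removal.
# SYSFS_ROOT is a hypothetical override (defaults to /sys) for dry runs.
patch_param_state() {
    p="${SYSFS_ROOT:-/sys}/module/kernel/parameters/allow_nvidia_audio_removal"
    if [ ! -r "$p" ]; then
        # The running kernel does not expose the parameter at all.
        echo "patch not present"
        return 0
    fi
    # Boolean kernel parameters read back as Y or N.
    case "$(cat "$p")" in
        Y|y|1) echo "removal allowed (patch disabled)" ;;
        *)     echo "removal blocked (patch active)" ;;
    esac
}

patch_param_state
```

Since the parameter is declared with mode 0644, writing Y or N to that file as root should also flip the behaviour at runtime without a reboot.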