Nvidia-Offload not working

I have searched all over the place for a solution myself, unfortunately I reached a dead end.

Posts I looked at:

  • Didn’t help since I have an Intel + Nvidia computer, but I still read it
  • The solution found there was to use legacy drivers. Looking at Nvidia’s legacy-card list, my card isn’t in it, and the latest driver’s supported-card list does include mine. I tried the same legacy drivers anyway just to be sure; it didn’t work
  • nvidia-smi detects my card; this post’s problem was that their GPU was external, and mine is not
  • Had the opposite problem: nvidia-offload was working, but their GPU wasn’t turning off when nothing needed it

Before getting to the things I tried, here are my configuration.nix, nvidia.nix and hardware-configuration.nix:

https://github.com/MintzyG/NixosDots/tree/main
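For anyone skimming, the offload-relevant part of the config follows the standard NixOS PRIME setup, roughly like this (a paraphrased sketch, not the exact file — that’s in the repo; bus IDs match the lspci output further down):

```nix
# Sketch of the standard NixOS PRIME offload setup (paraphrased; the real
# files are in the repo linked above). Bus IDs from lspci: Intel at
# 00:02.0, NVIDIA at 01:00.0.
{
  services.xserver.videoDrivers = [ "nvidia" ];

  hardware.nvidia = {
    modesetting.enable = true;
    powerManagement.enable = true;       # matches NVreg_PreserveVideoMemoryAllocations=1 in the boot log
    powerManagement.finegrained = true;  # matches "Runtime D3 status: Enabled (fine-grained)"
    prime = {
      offload.enable = true;
      offload.enableOffloadCmd = true;   # provides the nvidia-offload wrapper
      intelBusId = "PCI:0:2:0";
      nvidiaBusId = "PCI:1:0:0";
    };
  };
}
```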

Computer Info:
CPU: Intel i5-9300H (8) @ 4.100GHz
GPU: NVIDIA GeForce GTX 1650 Mobile / Max-Q
GPU: Intel CoffeeLake-H GT2 [UHD Graphics 630]
Kernel: 6.1.51
Host: ASUSTeK COMPUTER INC. X571GT
NixOS 23.05.3327.4f77ea639305

Things I tried:

glxinfo -B

sophia@nixos ~> glxinfo -B
name of display: :0
display: :0  screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: Intel (0x8086)
    Device: Mesa Intel(R) UHD Graphics 630 (CFL GT2) (0x3e9b)
    Version: 23.0.3
    Accelerated: yes
    Video memory: 15860MB
    Unified memory: yes
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
OpenGL vendor string: Intel
OpenGL renderer string: Mesa Intel(R) UHD Graphics 630 (CFL GT2)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 23.0.3
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6 (Compatibility Profile) Mesa 23.0.3
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 23.0.3
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

It is detecting the Intel integrated GPU.

Now:

nvidia-offload glxinfo -B

sophia@nixos ~> nvidia-offload glxinfo -B
name of display: :0
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  156 (NV-GLX)
  Minor opcode of failed request:  6 ()
  Value in failed request:  0x0
  Serial number of failed request:  97
  Current serial number in output stream:  97
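For anyone wondering what nvidia-offload actually does: as far as I know, the wrapper generated by enableOffloadCmd only sets the PRIME render-offload environment variables before running the command, so the same thing can be reproduced by hand with a small function (variable names per the NixOS wiki; treat this as a sketch):

```shell
# Rough equivalent of the nvidia-offload wrapper from enableOffloadCmd:
# it sets the PRIME render-offload environment and runs the given command.
offload() {
  __NV_PRIME_RENDER_OFFLOAD=1 \
  __NV_PRIME_RENDER_OFFLOAD_PROVIDER=NVIDIA-G0 \
  __GLX_VENDOR_LIBRARY_NAME=nvidia \
  __VK_LAYER_NV_optimus=NVIDIA_only \
  "$@"
}
```

If setting these by hand fails the same way, the wrapper script itself can be ruled out as the culprit.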

In the second post, they found out which drivers to use by going through the boot logs filtered for nvidia, where an error named a specific driver.

So I did the same, but I don’t know what the specific exit-code-1 failures here mean:

journalctl --boot=-1 | rg -i nvidia

set 17 12:53:00 nixos kernel: Command line: initrd=\efi\nixos\6h52cwk3pxi3fl8gwdbi9zgwbrbw6m5j-initrd-linux-6.1.51-initrd.efi init=/nix/store/dndfg3dp6xk4kww7f7789madwn0k8zcd-nixos-system-nixos-23.05.3327.4f77ea639305/init module_blacklist=i915 loglevel=4 nvidia-drm.modeset=1 nvidia.NVreg_PreserveVideoMemoryAllocations=1
set 17 12:53:00 nixos kernel: Kernel command line: initrd=\efi\nixos\6h52cwk3pxi3fl8gwdbi9zgwbrbw6m5j-initrd-linux-6.1.51-initrd.efi init=/nix/store/dndfg3dp6xk4kww7f7789madwn0k8zcd-nixos-system-nixos-23.05.3327.4f77ea639305/init module_blacklist=i915 loglevel=4 nvidia-drm.modeset=1 nvidia.NVreg_PreserveVideoMemoryAllocations=1
set 17 12:53:00 nixos kernel: nvidia: loading out-of-tree module taints kernel.
set 17 12:53:00 nixos kernel: nvidia: module license 'NVIDIA' taints kernel.
set 17 12:53:00 nixos kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 244
set 17 12:53:00 nixos kernel: nvidia 0000:01:00.0: enabling device (0000 -> 0003)
set 17 12:53:01 nixos kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.86.05  Fri Jul 14 20:46:33 UTC 2023
set 17 12:53:01 nixos kernel: nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
set 17 12:53:01 nixos kernel: nvidia-uvm: Loaded the UVM driver, major device number 240.
set 17 12:53:01 nixos systemd-modules-load[551]: Inserted module 'nvidia_uvm'
set 17 12:53:01 nixos kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.86.05  Fri Jul 14 20:20:58 UTC 2023
set 17 12:53:01 nixos (udev-worker)[582]: nvidia: Process '/nix/store/8fv91097mbh5049i9rglc73dx6kjg3qk-bash-5.2-p15/bin/bash -c 'mknod -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255'' failed with exit code 1.
set 17 12:53:01 nixos (udev-worker)[582]: nvidia: Process '/nix/store/8fv91097mbh5049i9rglc73dx6kjg3qk-bash-5.2-p15/bin/bash -c 'for i in $(cat /proc/driver/nvidia/gpus/*/information | grep Minor | cut -d \  -f 4); do mknod -m 666 /dev/nvidia${i} c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) ${i}; done'' failed with exit code 1.
set 17 12:53:01 nixos systemd-modules-load[551]: Inserted module 'nvidia_modeset'
set 17 12:53:01 nixos kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
set 17 12:53:02 nixos systemd-modules-load[551]: Inserted module 'nvidia_drm'
set 17 12:53:02 nixos kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
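About those two exit-code-1 lines: my interpretation (an assumption, not something from the posts) is that the udev commands grep /proc/devices for an "nvidia-frontend" entry, but on the 535-series driver the character device appears to be registered as plain "nvidia", so the grep comes back empty and mknod fails. A lookup tolerating both names would look something like:

```shell
# Hypothetical helper: find the NVIDIA char-device major number whether
# /proc/devices lists it as "nvidia-frontend" (older drivers) or "nvidia".
# Takes an optional file argument so it can be tested against a sample.
nvidia_major() {
  awk '$2 == "nvidia" || $2 == "nvidia-frontend" { print $1; exit }' "${1:-/proc/devices}"
}
```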

journalctl --boot -t xsession doesn’t have any mentions of nvidia

lspci

00:00.0 Host bridge: Intel Corporation 8th Gen Core 4-core Processor Host Bridge/DRAM Registers [Coffee Lake H] (rev 0d)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 0d)
00:02.0 VGA compatible controller: Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630] (rev 02)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 0d)
00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
00:14.3 Network controller: Intel Corporation Cannon Lake PCH CNVi WiFi (rev 10)
00:15.0 Serial bus controller: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 (rev 10)
00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
00:17.0 RAID bus controller: Intel Corporation 82801 Mobile SATA Controller [RAID mode] (rev 10)
00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 (rev f0)
00:1d.6 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #15 (rev f0)
00:1e.0 Communication controller: Intel Corporation Cannon Lake PCH Serial IO UART Host Controller (rev 10)
00:1e.3 Serial bus controller: Intel Corporation Cannon Lake PCH SPI Host Controller (rev 10)
00:1f.0 ISA bridge: Intel Corporation HM470 Chipset LPC/eSPI Controller (rev 10)
00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)
00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
00:1f.5 Serial bus controller: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
01:00.0 3D controller: NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] (rev a1)
02:00.0 Non-Volatile memory controller: SK hynix BC501 NVMe Solid State Drive
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

As seen in the last post I mentioned, cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
reports suspended.

cat /proc/driver/nvidia/gpus/0000:01:00.0/power
reports:

Runtime D3 status:          Enabled (fine-grained)
Video Memory:               Off

GPU Hardware Support:
 Video Memory Self Refresh: Supported
 Video Memory Off:          Supported

And after that, nvidia-smi reports:

Sun Sep 17 13:56:03 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05              Driver Version: 535.86.05    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1650        Off | 00000000:01:00.0 Off |                  N/A |
| N/A   46C    P0               6W /  50W |      4MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1074      G   ...86nfwcw6ma-xorg-server-21.1.8/bin/X        4MiB |
+---------------------------------------------------------------------------------------+

My GPU does appear. Since nvidia-smi itself wakes the GPU, this shows it is correctly sleeping when not in use and does turn on when called for.
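One thing I might still try (my own guess, not something from any of the posts): temporarily disabling runtime PM for the card, to rule out the suspend/resume path as the cause of the BadValue error. A tiny helper sketch (the path argument is just so it can be tested against a dummy directory):

```shell
# Hypothetical debugging helper: force a PCI device's runtime power
# management to "on" so the card stays awake. Pass the device's sysfs
# directory, e.g. /sys/bus/pci/devices/0000:01:00.0 (needs root there).
force_awake() {
  echo on > "$1/power/control"
}
```

Writing "auto" back to the same file restores the default runtime-PM behaviour afterwards.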

xrandr --listproviders also detects my GPU

Providers: number : 2
Provider 0: id: 0x43 cap: 0xf, Source Output, Sink Output, Source Offload, Sink Offload crtcs: 3 outputs: 2 associated providers: 0 name:modesetting
Provider 1: id: 0x26d cap: 0x0 crtcs: 0 outputs: 0 associated providers: 0 name:NVIDIA-G0

sudo lshw -c display

 *-display                 
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:137 memory:a3000000-a3ffffff memory:90000000-9fffffff memory:a0000000-a1ffffff ioport:4000(size=128) memory:a4000000-a407ffff
  *-display
       physical id: 2
       bus info: pci@0000:00:02.0
       version: 02
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi pm bus_master cap_list rom
       configuration: driver=i915 latency=0
       resources: irq:146 memory:a2000000-a2ffffff memory:80000000-8fffffff ioport:5000(size=64) memory:c0000-dffff

This confirms I’m using the correct bus IDs in the config file.

I also tried the production, stable and beta driver versions.

I tried blacklisting the Intel GPU with boot.kernelParams = [ "module_blacklist=i915" ];. With this, X doesn’t start and I get stuck at the systemd screen, having to use a TTY to remove the blacklist.
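For completeness: if the goal of blacklisting i915 was to render everything on the NVIDIA card, the supported route should be PRIME sync mode rather than a module blacklist (a sketch; as far as I know it is mutually exclusive with offload mode):

```nix
# Alternative to blacklisting i915: PRIME sync mode renders everything on
# the NVIDIA GPU while the Intel GPU still drives the displays.
hardware.nvidia.prime = {
  sync.enable = true;  # replaces offload.enable; the two can't be combined
  intelBusId = "PCI:0:2:0";
  nvidiaBusId = "PCI:1:0:0";
};
```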

I just don’t know what to do or try anymore

My rather not-so-helpful news is: I had the exact same problem about two years ago - everything set up and detected correctly, but nvidia-offload would only respond with the “BadValue” error. Have you tried running anything with lutris’ “enable offload” option, or with Gnome’s “run with discrete gpu” action?

I’m currently using an almost identical nvidia.nix, only without the powerManagement entry and the enableOffloadCmd value, and it works now.

Sadly, what got it to work for me was an update of the linux kernel to a newer version (though unstable versions will fail building nVidia drivers from time to time).

So, my machine was on kernel 6.1.54; I switched to _latest, which is 6.5.4, hoping for a miracle. Unfortunately it did not make it work, sadge.

I also tried the Lutris thing: its prime-run option just made performance even worse than when I ran the same CAD application without it. I can’t really verify it either, since I can’t run CLI apps like glxinfo through Lutris to see if it actually invokes the GPU. However, despite the worse performance, it does run the application, unlike the nvidia-offload command.

For the GNOME thing, I searched and only found a GNOME Shell feature related to it, but it seems I would have to install GNOME to test that, or am I wrong? I couldn’t find anything else on it. What I did find is that it needs switcheroo-control to provide the “run with discrete GPU” action, so I tried installing that, but it doesn’t work.

sudo systemctl status switcheroo-control.service (or start) returns:
Unit switcheroo-control.service could not be found.

Trying to enable it returns:
Failed to enable unit: Unit file switcheroo-control.service does not exist.

Trying to use it: switcherooctl list returns nothing and no errors, while switcherooctl --list returns:

Traceback (most recent call last):
  File "/nix/store/fjjpiz6m74q189ahik57v0x1nvh1jjcw-switcheroo-control-2.3/bin/.switcherooctl-wrapped", line 181, in <module>
    launch(args, gpu)
  File "/nix/store/fjjpiz6m74q189ahik57v0x1nvh1jjcw-switcheroo-control-2.3/bin/.switcherooctl-wrapped", line 67, in launch
    os.execvp(args[0], args)
  File "/nix/store/bc45k1n0pkrdkr3xa6w84w1xhkl1kkyp-python3-3.10.12/lib/python3.10/os.py", line 575, in execvp
    _execvpe(file, args)
  File "/nix/store/bc45k1n0pkrdkr3xa6w84w1xhkl1kkyp-python3-3.10.12/lib/python3.10/os.py", line 617, in _execvpe
    raise last_exc
  File "/nix/store/bc45k1n0pkrdkr3xa6w84w1xhkl1kkyp-python3-3.10.12/lib/python3.10/os.py", line 608, in _execvpe
    exec_func(fullname, *argrest)
FileNotFoundError: [Errno 2] No such file or directory

So maybe it wouldn’t have worked with GNOME either? Or maybe it needs something else to work? Or did I do something wrong?
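One possible explanation (my understanding, worth double-checking): on NixOS, adding the switcheroo-control package to systemPackages only installs the switcherooctl CLI; the systemd unit and D-Bus service come from the module option, which would explain the “unit does not exist” error:

```nix
# Enables the switcheroo-control daemon (systemd unit + D-Bus service);
# installing the package alone only provides the switcherooctl CLI.
services.switcherooControl.enable = true;
```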

Another thing I should point out: games work and run fine with the GPU, whether through Steam or Minecraft via Prism Launcher, strangely enough. However, those launchers have their own PRIME implementations; the problem comes when I run standalone apps like my CAD programs, which just give the BadValue error when run with offload.

And here are my dotfiles again, since I deleted the old repo when migrating to flakes and can’t edit the original message now