Difficulty with offloading & NVIDIA eGPU

enricozb · November 15, 2021, 1:08am

I have a Framework laptop with a GTX 1070 in an eGPU enclosure, connected over Thunderbolt, and I want to get offloading working using only the laptop’s internal display, with no external monitors. I apologize in advance for the code/log-heavy post, but I’m pretty lost as to what I am doing incorrectly. Is my desired setup of an eGPU using a laptop’s internal display even possible? I want to run X on the integrated (Intel) graphics, but offload certain programs (namely Steam) to the eGPU. What follows is all of the information I think is relevant. Thank you in advance for any and all help.

The GPU is detected by nvidia-smi:

Sun Nov 14 14:59:31 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   33C    P8     5W / 151W |      2MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, xrandr --listproviders only shows one provider:

Providers: number : 1
Provider 0: id: 0x48 cap: 0xf, Source Output, Sink Output, Source Offload, Sink Offload crtcs: 4 outputs: 5 associated providers: 0 name:modesetting

And, likely related to this, nvidia-offload glxinfo fails with:

name of display: :0
Failed to establish dbus connectionX Error of failed request:  BadAlloc (insufficient resources for operation)
  Major opcode of failed request:  152 (GLX)
  Minor opcode of failed request:  5 (X_GLXMakeCurrent)
  Serial number of failed request:  0
  Current serial number in output stream:  28

Notice the error is BadAlloc. I haven’t seen this error in other NixOS threads; I’ve only found BadValue.

Here is my nvidia config:

{ config, pkgs, ... }:

let
  nvidia-offload = pkgs.writeShellScriptBin "nvidia-offload" ''
    export __NV_PRIME_RENDER_OFFLOAD=1
    export __NV_PRIME_RENDER_OFFLOAD_PROVIDER=NVIDIA-G0
    export __GLX_VENDOR_LIBRARY_NAME=nvidia
    export __VK_LAYER_NV_optimus=NVIDIA_only
    exec -a "$0" "$@"
  '';
in
{
  environment.systemPackages = [
    nvidia-offload
    pkgs.glxinfo
  ];

  services.xserver.videoDrivers = [ "nvidia" ];

  hardware.nvidia.prime = {
    offload.enable = true;
    intelBusId = "PCI:0:2:0";
    nvidiaBusId = "PCI:4:0:0";
  };
}

and here is the portion dedicated to setting up i3 (the only other place where xserver is referenced):

{ pkgs, ... }:

{
  imports = [
    ./rofi/rofi.nix
    ./picom/picom.nix
  ];

  environment.pathsToLink = [ "/libexec" ];

  services.xserver = {
    enable = true;
    dpi = 144;

    libinput = {
      enable = true;
      touchpad = {
        tapping = false;
        naturalScrolling = true;
        clickMethod = "clickfinger";
      };
    };

    xkbOptions = "caps:escape";

    desktopManager.xterm.enable = false;

    # TODO(enricozb): what is this?
    displayManager.defaultSession = "none+i3";

    windowManager.i3 = {
      package = pkgs.i3-gaps;
      configFile = ./config;
      enable = true;
    };
  };

  environment.etc = {
    "i3/scripts".source = ./scripts;
    "i3/wallpaper.jpg".source = ./wallpapers/carmine.jpg;
  };

  environment.systemPackages = with pkgs; [
    # status bar
    python310
    pamixer
    playerctl

    xsel

    rofi      # application launcher most people use
    feh       # background setter
    flameshot # screenshots
  ];
}

The bus IDs above were determined from the following output from lspci:

00:02.0 VGA compatible controller: Intel Corporation TigerLake-LP GT2 [Iris Xe Graphics] (rev 01)
04:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)

Lastly, journalctl --boot=-1 | rg -i nvidia shows the following:

Nov 14 14:20:18 frame kernel: Command line: initrd=\efi\nixos\1x55i4q5pz6plnx7ik2qk7w5ykjfqza0-initrd-linux-5.14.15-initrd.efi init=/nix/store/qlcrhqc8px2riqf30h3n9vzhg2ccw5xx-nixos-system-frame-21.11pre326640.e544ee88fa4/init loglevel=4 net.ifnames=0 nvidia-drm.modeset=1
Nov 14 14:20:18 frame kernel: Kernel command line: initrd=\efi\nixos\1x55i4q5pz6plnx7ik2qk7w5ykjfqza0-initrd-linux-5.14.15-initrd.efi init=/nix/store/qlcrhqc8px2riqf30h3n9vzhg2ccw5xx-nixos-system-frame-21.11pre326640.e544ee88fa4/init loglevel=4 net.ifnames=0 nvidia-drm.modeset=1
Nov 14 14:20:18 frame kernel: nvidia: loading out-of-tree module taints kernel.
Nov 14 14:20:18 frame kernel: nvidia: module license 'NVIDIA' taints kernel.
Nov 14 14:20:18 frame kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 245
Nov 14 14:20:18 frame kernel: nvidia 0000:04:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Nov 14 14:20:18 frame kernel: input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:07.0/0000:02:00.0/0000:03:01.0/0000:04:00.1/sound/card1/input4
Nov 14 14:20:18 frame kernel: input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:07.0/0000:02:00.0/0000:03:01.0/0000:04:00.1/sound/card1/input5
Nov 14 14:20:18 frame kernel: input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:07.0/0000:02:00.0/0000:03:01.0/0000:04:00.1/sound/card1/input6
Nov 14 14:20:18 frame kernel: input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:07.0/0000:02:00.0/0000:03:01.0/0000:04:00.1/sound/card1/input7
Nov 14 14:20:18 frame kernel: input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:07.0/0000:02:00.0/0000:03:01.0/0000:04:00.1/sound/card1/input8
Nov 14 14:20:18 frame kernel: input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:07.0/0000:02:00.0/0000:03:01.0/0000:04:00.1/sound/card1/input9
Nov 14 14:20:18 frame kernel: input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:00/0000:00:07.0/0000:02:00.0/0000:03:01.0/0000:04:00.1/sound/card1/input10
Nov 14 14:20:18 frame kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  495.44  Fri Oct 22 06:13:12 UTC 2021
Nov 14 14:20:18 frame kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  495.44  Fri Oct 22 06:05:22 UTC 2021
Nov 14 14:20:18 frame kernel: [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
Nov 14 14:20:18 frame kernel: nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
Nov 14 14:20:19 frame systemd-udevd[532]: nvidia: Process '/nix/store/wadmyilr414n7bimxysbny876i2vlm5r-bash-5.1-p8/bin/bash -c 'mknod -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255'' failed with exit code 1.
Nov 14 14:20:19 frame systemd-modules-load[511]: Inserted module 'nvidia_uvm'
Nov 14 14:20:19 frame kernel: nvidia-uvm: Loaded the UVM driver, major device number 238.
Nov 14 14:20:19 frame kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 1
Nov 14 14:20:19 frame systemd-modules-load[511]: Inserted module 'nvidia_drm'
Nov 14 14:20:19 frame systemd-udevd[543]: card1-DVI-D-1: Process '/nix/store/wadmyilr414n7bimxysbny876i2vlm5r-bash-5.1-p8/bin/bash -c 'mknod -m 666 /dev/nvidia1 c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 1'' failed with exit code 1.
Nov 14 14:20:19 frame systemd-udevd[543]: card1: Process '/nix/store/wadmyilr414n7bimxysbny876i2vlm5r-bash-5.1-p8/bin/bash -c 'mknod -m 666 /dev/nvidia1 c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 1'' failed with exit code 1.
Nov 14 14:20:19 frame systemd-udevd[543]: card1-HDMI-A-1: Process '/nix/store/wadmyilr414n7bimxysbny876i2vlm5r-bash-5.1-p8/bin/bash -c 'mknod -m 666 /dev/nvidia1 c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 1'' failed with exit code 1.
Nov 14 14:20:19 frame systemd-udevd[543]: card1: Process '/nix/store/wadmyilr414n7bimxysbny876i2vlm5r-bash-5.1-p8/bin/bash -c 'mknod -m 666 /dev/nvidia1 c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 1'' failed with exit code 1.
Nov 14 14:20:19 frame systemd-udevd[543]: card1: Process '/nix/store/wadmyilr414n7bimxysbny876i2vlm5r-bash-5.1-p8/bin/bash -c 'mknod -m 666 /dev/nvidia1 c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 1'' failed with exit code 1.
Nov 14 14:20:19 frame systemd-udevd[543]: card1: Process '/nix/store/wadmyilr414n7bimxysbny876i2vlm5r-bash-5.1-p8/bin/bash -c 'mknod -m 666 /dev/nvidia1 c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 1'' failed with exit code 1.
Nov 14 14:20:19 frame systemd-udevd[543]: card1: Process '/nix/store/wadmyilr414n7bimxysbny876i2vlm5r-bash-5.1-p8/bin/bash -c 'mknod -m 666 /dev/nvidia1 c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 1'' failed with exit code 1.
Nov 14 14:20:19 frame systemd-udevd[543]: card1: Process '/nix/store/wadmyilr414n7bimxysbny876i2vlm5r-bash-5.1-p8/bin/bash -c 'mknod -m 666 /dev/nvidia1 c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 1'' failed with exit code 1.

I wonder what those process failures are at the very end of the journalctl log…

Again, thank you in advance for any help, and please let me know if I can provide any more information.

enricozb · November 15, 2021, 9:02am

I think I might have fixed this myself. After reading a section of the eGPU page on the Arch Wiki, I noticed that my generated xorg.conf differed from the recommended one. (To output the generated xorg.conf file, I set services.xserver.exportConfiguration = true;) Specifically, my xorg.conf had this section:

Section "Device"
  Identifier "Device-nvidia[0]"
  Driver "nvidia"
  Option "AccelMethod" "glamor"
  BusID "PCI:4:0:0"
EndSection

Compared to the Arch Wiki page, though, it’s missing Option "AllowExternalGpus" "True". After searching for AllowExternalGpus in the nixpkgs repo, I found that setting

hardware.nvidia.prime.sync.allowExternalGpu = true;

fixed it! Again, I’m not using sync, but it looks like setting this option here sets it for offload as well.
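
For anyone landing here later, this is roughly what the NVIDIA part of the config looks like with the fix folded in. Treat it as a sketch based on the options quoted in this thread, not a verified module: the bus IDs are the ones from my lspci output and will differ on other machines, and the wrapper script and package list from the original config are left out for brevity. Newer nixpkgs revisions may expose the external GPU option under a different path (possibly hardware.nvidia.prime.allowExternalGpu), so check the option search for your channel.

```nix
{ config, pkgs, ... }:

{
  services.xserver.videoDrivers = [ "nvidia" ];

  # Symlinks the generated X configuration to /etc/X11/xorg.conf,
  # which makes it easy to compare against the Arch Wiki example.
  services.xserver.exportConfiguration = true;

  hardware.nvidia.prime = {
    offload.enable = true;

    # Bus IDs taken from lspci (00:02.0 and 04:00.0 on this machine).
    intelBusId = "PCI:0:2:0";
    nvidiaBusId = "PCI:4:0:0";

    # Despite living under "sync", this is what added the
    # AllowExternalGpus option to the NVIDIA Device section here.
    sync.allowExternalGpu = true;
  };
}
```

After a rebuild and restarting X, the Device section in /etc/X11/xorg.conf should contain the AllowExternalGpus option.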

Now xrandr --listproviders outputs:

Providers: number : 2
Provider 0: id: 0x48 cap: 0xf, Source Output, Sink Output, Source Offload, Sink Offload crtcs: 4 outputs: 5 associated providers: 0 name:modesetting
Provider 1: id: 0x24e cap: 0x2, Sink Output crtcs: 4 outputs: 8 associated providers: 0 name:NVIDIA-G0

as expected. And nvidia-offload glxinfo | grep "OpenGL renderer" outputs:

Failed to establish dbus connectionOpenGL renderer string: NVIDIA GeForce GTX 1070/PCIe/SSE2

Not sure how to interpret the failed dbus connection.