How to use NVIDIA V100/A100 GPUs?

AFAIU the currently packaged NVIDIA driver is 495.44. However, this driver doesn’t support important “data center” GPUs, especially V100s and A100s. For example, nvidia-smi works just fine on a g4dn.xlarge instance (T4), but fails to detect the GPU on a p3.2xlarge instance (V100):

❯ nvidia-smi
No devices were found

How does one use V100s/A100s in NixOS? It looks like there are drivers readily available here.

When Supermicro starts producing boards again, our university chair will hopefully also get a similar card, and then I can look into that.

1 Like

I have access to a machine to test on, but I have no idea how to go about adding a driver to NixOS… do you have any recommendations on where to start?

I have an example here: nixos/nvidia: add vaapi support by jonringer · Pull Request #162660 · NixOS/nixpkgs · GitHub

You just need to package the drivers (probably not trivial) and then add them to hardware.opengl.extraPackages. This will mount them at /run/opengl-driver/, so your applications just need to add /run/opengl-driver/lib to their RUNPATH. You can use the addOpenGLRunpath hook for ELF binaries, or you can export LD_LIBRARY_PATH=/run/opengl-driver/lib before launching your application on NixOS.
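A rough sketch of what the NixOS side of that could look like (my-datacenter-driver is a hypothetical placeholder for whatever derivation you end up packaging the data-center driver into):

# configuration.nix fragment (sketch); assumes you have packaged the driver yourself
hardware.opengl.enable = true;
hardware.opengl.extraPackages = [ pkgs.my-datacenter-driver ];  # placeholder package name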

Related: WIP nixos/opengl: move to hardware.drivers by jonringer · Pull Request #158079 · NixOS/nixpkgs · GitHub

2 Likes

So here is my generic advice (I am not an expert on graphics drivers, especially NVIDIA’s):

First, I would make sure your tools can find the libraries that jonringer describes. If you export the environment variable LD_DEBUG=libs (by typing export LD_DEBUG=libs), then your libc will log which directories it searches for libraries when starting an application. Make sure it finds the libraries it is looking for in /run/opengl-driver/lib. I saw that the driver package is just a packed archive with a shell script around it; you can unpack it with sh ./the-archive.run -x and then try to copy all the libraries into a new package in a Nix build. I assume the other NVIDIA drivers follow a similar design, so have a look at how they are packaged. Then put the resulting package into /run/opengl-driver/lib by adding it to hardware.opengl.extraPackages, as described by @jonringer.
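For example, a minimal sketch (the archive name is just the placeholder from above):

# ask the dynamic loader to log its library search (the log goes to stderr)
export LD_DEBUG=libs
nvidia-smi 2>&1 | grep /run/opengl-driver

# extract the NVIDIA .run installer without installing it
sh ./the-archive.run -x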

If it still does not work even after installing your drivers, you might be able to use strace to see how the driver tries to access the hardware through userspace interfaces, i.e. device files in /dev. You might see that it tries to enumerate device nodes that don’t exist or that cannot be opened because of permission errors.
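Something along these lines (a sketch; adjust the syscall filter to whatever you are debugging):

# watch which files and device nodes nvidia-smi tries to open or create
strace -f -e trace=openat,mknodat nvidia-smi 2>&1 | grep /dev/nvidia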

2 Likes

Driver version support aside, nvidia-smi seems to work even with 495.44 on an A100:

hardware = {
  nvidia = {
    nvidiaPersistenced = true;
  };
  opengl.driSupport32Bit = true;
};

services.xserver.videoDrivers = [ "nvidia" ];
[root@nix-gpu-poc:~]# nvidia-smi
Fri Mar 11 13:37:55 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:00:10.0 Off |                    0 |
| N/A   23C    P0    31W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

EDIT:
As expected, CUDA will not work with this version. But your nvidia-smi issue might be something else

[nix-shell:~/cuda-samples/Samples/1_Utilities/deviceQuery]# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
1 Like

Huh, very interesting! I wonder if I just didn’t have the right config set up.

Why is it expected that CUDA wouldn’t work? AFAIU there are not separate CUDA distributions for different classes of GPUs – unlike the device drivers.

I don’t know if there is an official reason it won’t work, but the consumer GPU drivers (https://us.download.nvidia.com/XFree86/Linux-x86_64/510.54/NVIDIA-Linux-x86_64-510.54.run) have caused me issues with an A100 even on Ubuntu. It could just be that I set it up wrong, though.

I tried to use the Tesla drivers with this override:

let nvdriver = pkgs.linuxPackages.nvidia_x11.overrideAttrs (oldAttrs: {
  src = pkgs.fetchurl {
    url = "https://us.download.nvidia.com/tesla/510.47.03/NVIDIA-Linux-x86_64-510.47.03.run";
    sha256 = "sha256:146nwmwn5xwa52jgmc4m1kjr0zpj7b5rqicn1ly8sc9nx3d2397j";
  };
  version = "510.47.03-a100";
}); in
{
  hardware.nvidia.package = nvdriver;

  # I don't know if this is needed
  boot.kernelPackages = pkgs.linuxPackages // {
    nvidiaPackages.stable = nvdriver;
  };
}

This along with CUDA 11.6.1 still throws the above error.
strace shows that it’s looking for CUDA libs in the wrong location.

write(1, " CUDA Device Query (Runtime API)"..., 66 CUDA Device Query (Runtime API) version (CUDART static linking)

) = 66
futex(0x7f6d243f3048, FUTEX_WAKE_PRIVATE, 2147483647) = 0
openat(AT_FDCWD, "/nix/store/z56jcx3j1gfyk4sv7g8iaan0ssbdkhz1-glibc-2.33-56/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/nix/store/4ix6hsc66x3hjghvxrdvgsyh92nlihx9-gcc-10.3.0-lib/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/nix/store/z56jcx3j1gfyk4sv7g8iaan0ssbdkhz1-glibc-2.33-56/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/nix/store/z56jcx3j1gfyk4sv7g8iaan0ssbdkhz1-glibc-2.33-56/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
futex(0x4a4c50, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(1, "cudaGetDeviceCount returned 35\n", 31cudaGetDeviceCount returned 35
) = 31
write(1, "-> CUDA driver version is insuff"..., 64-> CUDA driver version is insufficient for CUDA runtime version
) = 64
write(1, "Result = FAIL\n", 14Result = FAIL
)         = 14
exit_group(1)                           = ?
+++ exited with 1 +++
1 Like

You can probably just provide the right path with LD_LIBRARY_PATH or by setting the rpath of the binary.
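For instance (a sketch, assuming the driver libraries end up under /run/opengl-driver/lib as discussed above; adjust the path to wherever your driver package actually puts them):

# option 1: point the dynamic loader at the driver libraries for this one invocation
LD_LIBRARY_PATH=/run/opengl-driver/lib ./deviceQuery

# option 2: bake the search path into the binary instead
patchelf --set-rpath /run/opengl-driver/lib ./deviceQuery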

Yep, that did it

[nix-shell:~/cuda-samples/Samples/1_Utilities/deviceQuery]# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA A100-PCIE-40GB"
  CUDA Driver Version / Runtime Version          11.6 / 11.6
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 40536 MBytes (42505207808 bytes)
  (108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1215 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 16
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.6, CUDA Runtime Version = 11.6, NumDevs = 1
Result = PASS

In addition to the system config above for using the right driver, I’m using this shell.nix to set up the env vars

{ pkgs ? import <nixpkgs> {} }:

let
  cuda = with pkgs; callPackage "/etc/nixpkgs/pkgs/development/compilers/cudatoolkit/common.nix" {
    version = "11.6.1-a100";
    url = "https://developer.download.nvidia.com/compute/cuda/11.6.1/local_installers/cuda_11.6.1_510.47.03_linux.run";
    sha256 = "sha256:0zjdk166ihiqhcd5a8zwphvx1skmzzxnd6162c0j0x0bw3y9l8db";
    gcc = gcc10;
  };
  nvdriver = pkgs.linuxPackages.nvidia_x11.overrideAttrs (oldAttrs: {
    src = pkgs.fetchurl {
      url = "https://us.download.nvidia.com/tesla/510.47.03/NVIDIA-Linux-x86_64-510.47.03.run";
      sha256 = "sha256:146nwmwn5xwa52jgmc4m1kjr0zpj7b5rqicn1ly8sc9nx3d2397j";
    };
    version = "510.47.03-a100";
  });

in

pkgs.mkShell {
  buildInputs = with pkgs; [
    cuda
    nvdriver
    gdb
  ];
  shellHook = ''
    export CUDA_PATH=${cuda}
    export LD_LIBRARY_PATH=${nvdriver}/lib
  '';
}

I thought having the packages in buildInputs would automatically set the library paths. Is that wrong?

Why is it expected that CUDA wouldn’t work? AFAIU there are not separate CUDA distributions for different classes of GPUs – unlike the device drivers.

I stand corrected. It turns out that this is all much more confusing than I realized and that for the most part the drivers are all the same. So as long as there’s a recent enough driver it should work with any model of GPU. (Shout out to @kmittman for explaining!)

Based on what you said, I tried using the consumer drivers. Turns out I was wrong about needing to use the Tesla drivers; the consumer drivers work too, as long as LD_LIBRARY_PATH is set correctly. I think the difference between the two is just some software features like MIG, though I can’t find any documentation from NVIDIA about it. I’d appreciate it if you could share any insights you got from kmittman about the differences between the GeForce and Tesla drivers.

I’m no expert in this stuff… AFAIU the core driver itself is the same but there are variations in the auxiliary features and odd bits included – CUDA utilities, docs, slightly different patches for games, etc.

AFAICT it’s an attempt at price discrimination. Kinda like how they sell “data center” chips at a significant markup over equivalently powerful consumer ones.

I just tried this workaround, but nvidia-smi is still not able to pick up my V100 on a p3.2xlarge EC2 instance. I’m not sure that it’s changing the driver at all, since after running nixos-rebuild switch I see

❯ readlink $(which nvidia-smi)
/nix/store/xz78pb0h456j4ly3pqa3aw4gksz8d72i-nvidia-x11-495.44-5.10.106-bin/bin/nvidia-smi

even though the version should be overridden to 510.47.03. Here’s my system config: nix-info.txt · GitHub.

@illustris did you encounter anything like this? Do you have any idea what I could be doing wrong?

Update: I also added

  hardware.nvidia.nvidiaPersistenced = true;
  hardware.opengl.driSupport32Bit = true;

to no avail.

FWIW, I discovered that although the /nix/store path still says 495.44, it appears that nvidia-smi is the right version after all… No idea why it’s not finding my V100 device.

❯ nvidia-smi -h
NVIDIA System Management Interface -- v510.47.03

Another update: I also tried switching to cudatoolkit 11.5, but that doesn’t seem to do anything for me. Also added the strace logs to the gist: nix-info.txt · GitHub

Do you have

services.xserver.videoDrivers = [ "nvidia" ];

set? From what I understand, this needs to be set even when you have a headless NixOS server.

Yes, I set that here: nix-info.txt · GitHub. The abridged version is

  # NVIDIA drivers, etc.
  nixpkgs.config.allowUnfree = true;
  services.xserver.videoDrivers = [ "nvidia" ];
  hardware.nvidia.package = nvdriver;
  boot.kernelPackages = pkgs.linuxPackages // {
    nvidiaPackages.stable = nvdriver;
  };
  hardware.opengl.enable = true;

I’ve been testing on Proxmox with NixOS VMs. I’ll try p3.2xlarge instances specifically in some time and see if I can reproduce the issue you’re facing.

1 Like

Looking at the strace, I’m seeing that it’s trying and failing to open a few files:

  • /usr/bin/nvidia-modprobe
  • /dev/nvidia-caps/nvidia-cap1
  • /dev/nvidia-caps/nvidia-cap2
  • /dev/nvidia0

It also fails to mknodat:

mknodat(AT_FDCWD, "/dev/nvidia0", S_IFCHR|0666, makedev(0xc3, 0)) = -1 EACCES (Permission denied)

@illustris would you be able to share the results of strace nvidia-smi on your system so that we can diff the results?
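(For easy diffing it might help to write the trace to a file, e.g. something like

strace -o nvidia-smi.trace nvidia-smi

where -o sends strace’s own output to the given file instead of stderr.)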

Weird… you should already have /dev/nvidia0. Can you see nvidia in lsmod?

[root@nix-gpu-poc:~]# lsmod | grep -i nvidia
nvidia_drm             73728  0
nvidia_modeset       1163264  1 nvidia_drm
nvidia_uvm           1204224  0
drm_kms_helper        307200  5 bochs,drm_vram_helper,nvidia_drm
nvidia              39112704  17 nvidia_uvm,nvidia_modeset
drm                   643072  8 drm_kms_helper,bochs,drm_vram_helper,nvidia,drm_ttm_helper,nvidia_drm,ttm
i2c_core              102400  5 drm_kms_helper,nvidia,psmouse,i2c_piix4,drm

Here’s my strace. It doesn’t need to do mknod because the device already exists.