How to use NVIDIA V100/A100 GPUs?

So here is my generic advice (I am not an expert on graphics drivers, especially NVIDIA's):

First, I would make sure your tools can find the libraries that jonringer describes. If you export the environment variable LD_DEBUG=libs (by typing export LD_DEBUG=libs), then your libc will log which directories it searches for libraries when starting an application. Make sure it finds the libraries it is looking for in /run/opengl-driver/lib.

I saw the driver package is just a packed archive with a shell script around it. You can unpack it with sh ./the-archive.run -x and then try to copy all the libraries into a new package in a Nix build. I assume that other NVIDIA drivers follow a similar design, so have a look at how they are packaged. Then put the resulting package into /run/opengl-driver/lib by adding it to hardware.opengl.extraPackages as described by @jonringer.
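For example (with the-archive.run standing in for whatever driver archive you downloaded, and your-cuda-app for whatever binary you are testing):

export LD_DEBUG=libs
./your-cuda-app 2> ld-debug.log          # the loader logs every directory it searches to stderr
grep /run/opengl-driver/lib ld-debug.log # the driver directory should show up in the search path

# unpack the self-extracting driver archive to see which libraries it ships
sh ./the-archive.run -x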

If it still does not work even after installing your drivers, you might be able to use strace to see how the driver tries to access the hardware through userspace interfaces, i.e. device files in /dev. You might see that it tries to enumerate device nodes that don’t exist or that cannot be opened because of permission errors.
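Something along these lines (again with your-cuda-app as a placeholder) should show which device nodes it tries to open:

strace -f -e trace=openat,ioctl,mknodat ./your-cuda-app 2>&1 | grep -E 'nvidia|/dev/'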


Driver version support aside, nvidia-smi seems to work even with 495.44 on an A100:

hardware = {
  nvidia = {
    nvidiaPersistenced = true;
  };
  opengl.driSupport32Bit = true;
};

services.xserver.videoDrivers = [ "nvidia" ];

[root@nix-gpu-poc:~]# nvidia-smi
Fri Mar 11 13:37:55 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:00:10.0 Off |                    0 |
| N/A   23C    P0    31W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

EDIT:
As expected, CUDA will not work with this version. But your nvidia-smi issue might be something else.

[nix-shell:~/cuda-samples/Samples/1_Utilities/deviceQuery]# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

Huh, very interesting! I wonder if I just didn’t have the right config set up.

Why is it expected that CUDA wouldn’t work? AFAIU there are not separate CUDA distributions for different classes of GPUs – unlike the device drivers.

I don’t know if there is an official reason it won’t work, but the consumer GPU drivers (https://us.download.nvidia.com/XFree86/Linux-x86_64/510.54/NVIDIA-Linux-x86_64-510.54.run) have caused me issues with the A100 even on Ubuntu. It could just be that I set it up wrong, though.

I tried to use the Tesla drivers with this override:

let nvdriver = pkgs.linuxPackages.nvidia_x11.overrideAttrs (oldAttrs: {
  src = pkgs.fetchurl {
    url = "https://us.download.nvidia.com/tesla/510.47.03/NVIDIA-Linux-x86_64-510.47.03.run";
    sha256 = "sha256:146nwmwn5xwa52jgmc4m1kjr0zpj7b5rqicn1ly8sc9nx3d2397j";
  };
  version = "510.47.03-a100";
}); in
{
  hardware.nvidia.package = nvdriver;

  # I don't know if this is needed
  boot.kernelPackages = pkgs.linuxPackages // {
    nvidiaPackages.stable = nvdriver;
  };
}

This along with CUDA 11.6.1 still throws the above error.
strace shows that it’s looking for CUDA libs in the wrong location.

write(1, " CUDA Device Query (Runtime API)"..., 66 CUDA Device Query (Runtime API) version (CUDART static linking)

) = 66
futex(0x7f6d243f3048, FUTEX_WAKE_PRIVATE, 2147483647) = 0
openat(AT_FDCWD, "/nix/store/z56jcx3j1gfyk4sv7g8iaan0ssbdkhz1-glibc-2.33-56/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/nix/store/4ix6hsc66x3hjghvxrdvgsyh92nlihx9-gcc-10.3.0-lib/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/nix/store/z56jcx3j1gfyk4sv7g8iaan0ssbdkhz1-glibc-2.33-56/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/nix/store/z56jcx3j1gfyk4sv7g8iaan0ssbdkhz1-glibc-2.33-56/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
futex(0x4a4c50, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(1, "cudaGetDeviceCount returned 35\n", 31cudaGetDeviceCount returned 35
) = 31
write(1, "-> CUDA driver version is insuff"..., 64-> CUDA driver version is insufficient for CUDA runtime version
) = 64
write(1, "Result = FAIL\n", 14Result = FAIL
)         = 14
exit_group(1)                           = ?
+++ exited with 1 +++

You can probably just provide the right path with LD_LIBRARY_PATH or by setting the rpath of the binary.
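For example, assuming the driver libraries ended up in /run/opengl-driver/lib (adjust the path if you point at the nvidia_x11 package output instead):

# option 1: tell the dynamic loader where to look at run time
export LD_LIBRARY_PATH=/run/opengl-driver/lib
./deviceQuery

# option 2: bake the path into the binary while keeping its existing rpath
patchelf --set-rpath "/run/opengl-driver/lib:$(patchelf --print-rpath ./deviceQuery)" ./deviceQuery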

Yep, that did it

[nix-shell:~/cuda-samples/Samples/1_Utilities/deviceQuery]# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA A100-PCIE-40GB"
  CUDA Driver Version / Runtime Version          11.6 / 11.6
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 40536 MBytes (42505207808 bytes)
  (108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1215 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 16
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.6, CUDA Runtime Version = 11.6, NumDevs = 1
Result = PASS

In addition to the system config above for using the right driver, I’m using this shell.nix to set up the env vars

{ pkgs ? import <nixpkgs> {} }:

let
  cuda = with pkgs; callPackage "/etc/nixpkgs/pkgs/development/compilers/cudatoolkit/common.nix" {
    version = "11.6.1-a100";
    url = "https://developer.download.nvidia.com/compute/cuda/11.6.1/local_installers/cuda_11.6.1_510.47.03_linux.run";
    sha256 = "sha256:0zjdk166ihiqhcd5a8zwphvx1skmzzxnd6162c0j0x0bw3y9l8db";
    gcc = gcc10;
  };
  nvdriver = pkgs.linuxPackages.nvidia_x11.overrideAttrs (oldAttrs: {
    src = pkgs.fetchurl {
      url = "https://us.download.nvidia.com/tesla/510.47.03/NVIDIA-Linux-x86_64-510.47.03.run";
      sha256 = "sha256:146nwmwn5xwa52jgmc4m1kjr0zpj7b5rqicn1ly8sc9nx3d2397j";
    };
    version = "510.47.03-a100";
  });

in

pkgs.mkShell {
  buildInputs = with pkgs; [
    cuda
    nvdriver
    gdb
  ];
  shellHook = ''
    export CUDA_PATH=${cuda}
    export LD_LIBRARY_PATH=${nvdriver}/lib
  '';
}

I thought having the packages in buildInputs would automatically set the library paths. Is that wrong?

Why is it expected that CUDA wouldn’t work? AFAIU there are not separate CUDA distributions for different classes of GPUs – unlike the device drivers.

I stand corrected. It turns out that this is all much more confusing than I realized and that for the most part the drivers are all the same. So as long as there’s a recent enough driver it should work with any model of GPU. (Shout out to @kmittman for explaining!)

Based on what you said, I tried using the consumer drivers. Turns out I was wrong about needing to use Tesla drivers. Consumer drivers will work too, as long as LD_LIBRARY_PATH is set correctly. I think the difference between the two is just some software features like MIG, though I can’t find any documentation from NVIDIA about it. I’d appreciate it if you could share any insights you got from kmittman about the differences between GeForce and Tesla drivers.

I’m no expert in this stuff… AFAIU the core driver itself is the same but there are variations in the auxiliary features and odd bits included – CUDA utilities, docs, slightly different patches for games, etc.

AFAICT it’s an attempt at price discrimination. Kinda like how they sell “data center” chips at a significant markup over consumer chips of equivalent power.

I just tried this workaround but nvidia-smi is still not able to pick up my V100 on a p3.2xlarge EC2 instance. I’m not sure that it’s changing the driver at all since after running nixos-rebuild switch I see

❯ readlink $(which nvidia-smi)
/nix/store/xz78pb0h456j4ly3pqa3aw4gksz8d72i-nvidia-x11-495.44-5.10.106-bin/bin/nvidia-smi

even though the version should be overridden to 510.47.03. Here’s my system config: nix-info.txt · GitHub.

@illustris did you encounter anything like this? Do you have any idea what I could be doing wrong?

Update: I also added

  hardware.nvidia.nvidiaPersistenced = true;
  hardware.opengl.driSupport32Bit = true;

to no avail.

FWIW I discovered that although the /nix/store path is wrong, it appears that nvidia-smi is the right version after all… No idea why it’s not finding my V100 device.

❯ nvidia-smi -h
NVIDIA System Management Interface -- v510.47.03

Update update: I also tried switching to cudatoolkit 11.5 but that doesn’t seem to do anything for me. Also added the strace logs to the gist: nix-info.txt · GitHub

Do you have

services.xserver.videoDrivers = [ "nvidia" ];

set? From what I understand, this needs to be set even when you have a headless NixOS server.

Yes, I set that here: nix-info.txt · GitHub. The abridged version is

  # NVIDIA drivers, etc.
  nixpkgs.config.allowUnfree = true;
  services.xserver.videoDrivers = [ "nvidia" ];
  hardware.nvidia.package = nvdriver;
  boot.kernelPackages = pkgs.linuxPackages // {
    nvidiaPackages.stable = nvdriver;
  };
  hardware.opengl.enable = true;

I’ve been testing on Proxmox with NixOS VMs. I’ll try p3.2xlarge instances specifically in a while and see if I can reproduce the issue you’re facing.


Looking at the strace, I’m seeing that it’s trying and failing to open a few files:

  • /usr/bin/nvidia-modprobe
  • /dev/nvidia-caps/nvidia-cap1
  • /dev/nvidia-caps/nvidia-cap2
  • /dev/nvidia0

It also fails to mknodat:

mknodat(AT_FDCWD, "/dev/nvidia0", S_IFCHR|0666, makedev(0xc3, 0)) = -1 EACCES (Permission denied)
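(To double-check which of those paths actually exist and with what permissions, something like this should do:)

ls -l /dev/nvidia* /dev/nvidia-caps 2>&1
command -v nvidia-modprobe || echo "no nvidia-modprobe on PATH"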

@illustris would you be able to share the results of strace nvidia-smi on your system so that we can diff the results?

Weird… you should already have /dev/nvidia0. Can you see nvidia in lsmod?

[root@nix-gpu-poc:~]# lsmod | grep -i nvidia
nvidia_drm             73728  0
nvidia_modeset       1163264  1 nvidia_drm
nvidia_uvm           1204224  0
drm_kms_helper        307200  5 bochs,drm_vram_helper,nvidia_drm
nvidia              39112704  17 nvidia_uvm,nvidia_modeset
drm                   643072  8 drm_kms_helper,bochs,drm_vram_helper,nvidia,drm_ttm_helper,nvidia_drm,ttm
i2c_core              102400  5 drm_kms_helper,nvidia,psmouse,i2c_piix4,drm

Here’s my strace. It doesn’t need to do mknod because the device already exists.

Huh, so on a p3.2xlarge instance running Ubuntu 20.04.4, I get

ubuntu@threedoo:~$ lsmod | grep -i nvidia
nvidia_uvm           1052672  2
nvidia_drm             61440  2
nvidia_modeset       1159168  2 nvidia_drm
nvidia              39059456  217 nvidia_uvm,nvidia_modeset
drm_kms_helper        253952  1 nvidia_drm
drm                   557056  6 drm_kms_helper,nvidia,nvidia_drm
ubuntu@threedoo:~$ ls /dev/nvidia*
/dev/nvidia-modeset  /dev/nvidia-uvm-tools  /dev/nvidiactl
/dev/nvidia-uvm      /dev/nvidia0

But on a p3.2xlarge instance running NixOS

❯ lsmod | grep -i nvidia
nvidia_uvm           1183744  0
nvidia_drm             69632  0
nvidia_modeset       1163264  1 nvidia_drm
nvidia              39100416  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        270336  4 cirrus,nvidia_drm
drm                   614400  5 drm_kms_helper,nvidia,cirrus,nvidia_drm
i2c_core              102400  5 drm_kms_helper,nvidia,psmouse,i2c_piix4,drm
❯ ls /dev/nvidia*
/dev/nvidia-modeset  /dev/nvidia-uvm-tools  /dev/nvidiactl
/dev/nvidia-uvm      /dev/nvidia1
❯ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.10.106, NixOS, 21.11 (Porcupine)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.3.16`
 - channels(root): `"nixos-21.11.336674.e80f8f4d833, nix-ld"`
 - channels(skainswo): `"home-manager, nixpkgs-unstable-22.05pre343295.adf7f03d3bf"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

So for some reason it’s numbered /dev/nvidia1 instead of /dev/nvidia0. Perhaps this is the culprit? The nvidia-smi strace reveals that it’s only looking for /dev/nvidia0 at least.

Does anyone know how these devices get named/numbered?
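(One thing that might help narrow it down, if I remember right: the per-GPU information file under /proc/driver/nvidia reports the minor number the kernel module assigned.)

grep -i minor /proc/driver/nvidia/gpus/*/information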

FWIW I tried

❯ sudo ln -s /dev/nvidia1 /dev/nvidia0

but that didn’t work :frowning:

Looks like I won’t be able to spin up a P3 instance any time soon… AWS is declining limit-increase requests because of shortages and asking customers to contact their sales team for additional review.

Can you share dmesg? It might have some messages from the nvidia driver explaining why the device number is different.

Also, can you show ls -l /dev/nvidia* to see devnode numbers?

crw-rw-rw- 1 root root 195,   0 Mar 13 07:45 /dev/nvidia0

On yours I’m expecting the major and minor numbers to be 195,1.
You could try something along the lines of

rm /dev/nvidia1
mknod /dev/nvidia0 c 195 1
chmod 666 /dev/nvidia0

On Ubuntu,

❯ ls -l /dev/nvidia*
crw-rw-rw- 195,254 root 31 Mar 19:41 /dev/nvidia-modeset
crw-rw-rw-   253,0 root 31 Mar 19:41 /dev/nvidia-uvm
crw-rw-rw-   253,1 root 31 Mar 19:41 /dev/nvidia-uvm-tools
crw-rw-rw-   195,0 root 31 Mar 19:41 /dev/nvidia0
crw-rw-rw- 195,255 root 31 Mar 19:41 /dev/nvidiactl

and on NixOS:

Unable to determine time zone: No such file or directory (os error 2)
crw-rw-rw- 195,254 root 31 Mar 19:53 /dev/nvidia-modeset
crw-rw-rw-   245,0 root 31 Mar 19:53 /dev/nvidia-uvm
crw-rw-rw-   245,0 root 31 Mar 19:53 /dev/nvidia-uvm-tools
crw-rw-rw-   195,1 root 31 Mar 19:53 /dev/nvidia1
crw-rw-rw- 195,255 root 31 Mar 19:53 /dev/nvidiactl

The notable difference to me appears to be that on Ubuntu it’s 195,0 whereas on NixOS it’s 195,1. There’s also a difference in the /dev/nvidia-uvm* stuff, but I’m not sure if that matters?

I also tried the rm/mknod/chmod approach. It doesn’t work, but it possibly gets us a little further:

$ strace nvidia-smi
...
stat("/dev/nvidia0", {st_mode=S_IFCHR|0666, st_rdev=makedev(0xc3, 0x1), ...}) = 0
unlink("/dev/nvidia0")                  = -1 EACCES (Permission denied)
stat("/usr/bin/nvidia-modprobe", 0x7ffcf762b9c0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/proc/driver/nvidia/params", O_RDONLY) = 4
newfstatat(4, "", {st_mode=S_IFREG|0444, st_size=0, ...}, AT_EMPTY_PATH) = 0
read(4, "ResmanDebugLevel: 4294967295\nRmL"..., 1024) = 827
close(4)                                = 0
stat("/dev/nvidia0", {st_mode=S_IFCHR|0666, st_rdev=makedev(0xc3, 0x1), ...}) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffcf7629670) = 0
getpid()                                = 1912
newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}, AT_EMPTY_PATH) = 0
write(1, "No devices were found\n", 22No devices were found

You might try the us-west-2 region. My understanding is that usage limits are often set per region. I’m just on a rinky-dink personal account and I’m able to get p3 quota in us-west-2, so it might be worth a shot!