How to use NVIDIA V100/A100 GPUs?

Huh, so on a p3.2xlarge instance running Ubuntu 20.04.4, I get

ubuntu@threedoo:~$ lsmod | grep -i nvidia
nvidia_uvm           1052672  2
nvidia_drm             61440  2
nvidia_modeset       1159168  2 nvidia_drm
nvidia              39059456  217 nvidia_uvm,nvidia_modeset
drm_kms_helper        253952  1 nvidia_drm
drm                   557056  6 drm_kms_helper,nvidia,nvidia_drm
ubuntu@threedoo:~$ ls /dev/nvidia*
/dev/nvidia-modeset  /dev/nvidia-uvm-tools  /dev/nvidiactl
/dev/nvidia-uvm      /dev/nvidia0

But on a p3.2xlarge instance running NixOS

❯ lsmod | grep -i nvidia
nvidia_uvm           1183744  0
nvidia_drm             69632  0
nvidia_modeset       1163264  1 nvidia_drm
nvidia              39100416  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        270336  4 cirrus,nvidia_drm
drm                   614400  5 drm_kms_helper,nvidia,cirrus,nvidia_drm
i2c_core              102400  5 drm_kms_helper,nvidia,psmouse,i2c_piix4,drm
❯ ls /dev/nvidia*
/dev/nvidia-modeset  /dev/nvidia-uvm-tools  /dev/nvidiactl
/dev/nvidia-uvm      /dev/nvidia1
❯ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.10.106, NixOS, 21.11 (Porcupine)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.3.16`
 - channels(root): `"nixos-21.11.336674.e80f8f4d833, nix-ld"`
 - channels(skainswo): `"home-manager, nixpkgs-unstable-22.05pre343295.adf7f03d3bf"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

So for some reason it’s numbered /dev/nvidia1 instead of /dev/nvidia0. Perhaps this is the culprit? At least, running nvidia-smi under strace shows that it only looks for /dev/nvidia0.

Does anyone know how these devices get named/numbered?

FWIW I tried

❯ sudo ln -s /dev/nvidia1 /dev/nvidia0

but that didn’t work.

Looks like I won’t be able to spin up a P3 instance any time soon… AWS is declining limit increase requests because of shortages and asking customers to contact their sales team for additional review.

Can you share dmesg? It might have some messages from the nvidia driver explaining why the device number is different.

Also, can you show ls -l /dev/nvidia* to see devnode numbers?

crw-rw-rw- 1 root root 195,   0 Mar 13 07:45 /dev/nvidia0

On yours I’m expecting the major and minor numbers to be 195,1.
You could try something along the lines of

rm /dev/nvidia1
mknod /dev/nvidia0 c 195 1
chmod 666 /dev/nvidia0
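
Before recreating the node, it might be worth checking which minor number the driver itself reports, rather than inferring it from the node name. A minimal sketch, assuming the driver exposes a "Device Minor" line in /proc/driver/nvidia/gpus/<bus-id>/information (that path and field are my recollection of the proprietary driver's procfs layout, and nvidia_minor is just a helper name made up for this post):

```shell
# Print the "Device Minor" value from a per-GPU information file.
# $1: path to the file (assumed layout: "Device Minor: <n>").
nvidia_minor() {
    awk -F':' '/^Device Minor/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}

# Walk every GPU the driver lists; skip silently if the glob matches nothing.
for info in /proc/driver/nvidia/gpus/*/information; do
    [ -r "$info" ] || continue
    echo "$info: minor $(nvidia_minor "$info")"
done
```

If this prints minor 1 for the only GPU, the mknod line above uses the right minor and the odd part is only the node name.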

On Ubuntu,

❯ ls -l /dev/nvidia*
crw-rw-rw- 195,254 root 31 Mar 19:41 /dev/nvidia-modeset
crw-rw-rw-   253,0 root 31 Mar 19:41 /dev/nvidia-uvm
crw-rw-rw-   253,1 root 31 Mar 19:41 /dev/nvidia-uvm-tools
crw-rw-rw-   195,0 root 31 Mar 19:41 /dev/nvidia0
crw-rw-rw- 195,255 root 31 Mar 19:41 /dev/nvidiactl

and on NixOS:

Unable to determine time zone: No such file or directory (os error 2)
crw-rw-rw- 195,254 root 31 Mar 19:53 /dev/nvidia-modeset
crw-rw-rw-   245,0 root 31 Mar 19:53 /dev/nvidia-uvm
crw-rw-rw-   245,0 root 31 Mar 19:53 /dev/nvidia-uvm-tools
crw-rw-rw-   195,1 root 31 Mar 19:53 /dev/nvidia1
crw-rw-rw- 195,255 root 31 Mar 19:53 /dev/nvidiactl

The notable difference to me appears to be that on Ubuntu it’s 195,0 whereas on NixOS it’s 195,1. There’s also a difference in the /dev/nvidia-uvm* stuff, but I’m not sure if that matters?
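
On the /dev/nvidia-uvm* difference: as far as I know the UVM module gets its major number allocated dynamically, so 253 on one machine and 245 on another would be expected and probably harmless. A rough sketch for checking what the running kernel registered, assuming the usual /proc/devices format of "<major> <name>" per line (major_for is a made-up helper, and the exact driver names vary by driver version, so grep -i nvidia /proc/devices first to see what yours are called):

```shell
# Look up the major number registered under a driver name.
# $1: name as it appears in /proc/devices
# $2: file to read (defaults to /proc/devices)
major_for() {
    awk -v name="$1" '$2 == name { print $1 }' "${2:-/proc/devices}"
}

if [ -r /proc/devices ]; then
    # "nvidia-uvm" is an assumed name; adjust to whatever grep shows.
    echo "nvidia-uvm major: $(major_for nvidia-uvm)"
fi
```

The number printed should match the major shown by ls -l for /dev/nvidia-uvm on the same machine.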

I also tried the rm/mknod/chmod sequence suggested above. It doesn’t work, but it possibly gets us a little further:

$ strace nvidia-smi
...
stat("/dev/nvidia0", {st_mode=S_IFCHR|0666, st_rdev=makedev(0xc3, 0x1), ...}) = 0
unlink("/dev/nvidia0")                  = -1 EACCES (Permission denied)
stat("/usr/bin/nvidia-modprobe", 0x7ffcf762b9c0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/proc/driver/nvidia/params", O_RDONLY) = 4
newfstatat(4, "", {st_mode=S_IFREG|0444, st_size=0, ...}, AT_EMPTY_PATH) = 0
read(4, "ResmanDebugLevel: 4294967295\nRmL"..., 1024) = 827
close(4)                                = 0
stat("/dev/nvidia0", {st_mode=S_IFCHR|0666, st_rdev=makedev(0xc3, 0x1), ...}) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffcf7629670) = 0
getpid()                                = 1912
newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}, AT_EMPTY_PATH) = 0
write(1, "No devices were found\n", 22No devices were found
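
For anyone comparing the strace output with ls -l: strace prints device numbers in hex inside makedev(...), while ls -l prints them in decimal. A small sketch for converting between the two (hex_dev_to_dec is just a helper name made up here):

```shell
# Convert the hex major/minor from strace's makedev(major, minor)
# into the decimal "major,minor" form that ls -l shows.
hex_dev_to_dec() {
    printf '%d,%d\n' "$1" "$2"
}

hex_dev_to_dec 0xc3 0x1   # the /dev/nvidia0 node from the stat() calls
hex_dev_to_dec 0x88 0     # fd 1, the terminal nvidia-smi writes to
```

The first call prints 195,1, so nvidia-smi is seeing the recreated /dev/nvidia0 node with the NVIDIA major and minor 1, and it still reports that no devices were found.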

You might try in the us-west-2 region. My understanding is that usage limits are often based on region. I’m just on a rinky-dink personal account and I’m able to get p3 quota in us-west-2, so it might be worth a shot!

Hmm… for reasons beyond my comprehension this seems to be fixed in 22.05!
