AFAIU the NVIDIA driver currently packaged is 495.44. However, this driver doesn’t support important “data center” GPUs, esp. V100s and A100s. For example, nvidia-smi works just fine on a g4dn.xlarge instance (T4), but fails to detect the GPU on a p3.2xlarge instance (V100):
❯ nvidia-smi
No devices were found
How does one use V100s/A100s in NixOS? It looks like there are drivers readily available here.
You need to package the drivers (probably not trivial) and then add them to hardware.opengl.extraPackages. This will mount them at /run/opengl-driver/, so your applications only need /run/opengl-driver/lib in their RUNPATH. You can use the addOpenGLRunpath hook for ELF binaries, or export LD_LIBRARY_PATH=/run/opengl-driver/lib before launching your application on NixOS.
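A minimal sketch of the LD_LIBRARY_PATH route, assuming the driver libraries really ended up in /run/opengl-driver/lib. The helper name use_driver_libs is made up for illustration; it only extends LD_LIBRARY_PATH when libcuda.so.1 is actually present in the directory, so a broken setup fails loudly instead of silently.

```shell
# Hypothetical helper (not part of NixOS): prepend a driver library
# directory to LD_LIBRARY_PATH, but only if it contains libcuda.so.1.
use_driver_libs() {
  # Directory to check; defaults to the path mentioned in this thread.
  _dir="${1:-/run/opengl-driver/lib}"
  if [ -e "$_dir/libcuda.so.1" ]; then
    # Keep any existing LD_LIBRARY_PATH entries after the driver dir.
    export LD_LIBRARY_PATH="$_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    echo "using driver libraries from $_dir"
  else
    echo "libcuda.so.1 not found in $_dir" >&2
    return 1
  fi
}
```

Typical use would be something like `use_driver_libs && ./deviceQuery` in the shell you launch the application from.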
So here is my generic advice (I am not an expert on graphics drivers, especially NVIDIA's):
First I would make sure your tools can find the libraries that jonringer describes. If you set the environment variable LD_DEBUG=libs (by typing export LD_DEBUG=libs), your libc will log which directories it searches for libraries when starting an application. Make sure it finds the libraries it is looking for in /run/opengl-driver/lib. I saw that the package is just a packed archive with a shell script around it. You can unpack it with sh ./the-archive.run -x and then try to copy all the libraries into a new package in a nix build. I assume the other NVIDIA drivers follow a similar design, so have a look at how they are packaged. Then put the resulting package into /run/opengl-driver/lib by adding it to hardware.opengl.extraPackages, as described by @jonringer.
If it still does not work even after installing your drivers, you might be able to use strace to see how the driver tries to access the hardware through userspace interfaces, i.e. device files in /dev. You might see that it tries to enumerate device nodes that don’t exist or that cannot be opened because of permission errors.
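The LD_DEBUG check can be sketched roughly like this, assuming a glibc-based system (LD_DEBUG is a glibc loader feature). ld_search_for is a hypothetical helper name; it runs a command with LD_DEBUG=libs for that one command only and keeps just the loader messages that mention a given library.

```shell
# Hypothetical helper: show the dynamic loader's search messages for
# one library while running a command.
#   $1      library name fragment, e.g. libcuda
#   $2...   the command to run
ld_search_for() {
  _lib="$1"
  shift
  # The loader writes its debug log to stderr; send stderr into the
  # pipe and discard the program's normal stdout.
  LD_DEBUG=libs "$@" 2>&1 >/dev/null | grep -- "$_lib"
}
```

For example, `ld_search_for libcuda ./deviceQuery` should list the directories that were tried for libcuda.so.1, which makes it easy to see whether /run/opengl-driver/lib is among them.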
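A rough sketch of the strace step, assuming strace is installed and the tool opens files via openat (as in the traces later in this thread). failed_opens is a hypothetical filter name; it reads strace output on stdin and prints only the paths whose open failed because the file is missing or access was denied.

```shell
# Hypothetical filter: read strace output on stdin and print the paths
# of open()/openat() calls that failed with ENOENT, EACCES or EPERM.
failed_opens() {
  grep -E 'open(at)?\(.*= -1 (ENOENT|EACCES|EPERM)' \
    | sed -E 's/^[^"]*"([^"]*)".*/\1/'
}
```

Usage would look like `strace -f -e trace=openat nvidia-smi 2>&1 | failed_opens | sort -u`. Missing /dev/nvidia* nodes or libcuda.so.1 lookups in the wrong store paths should stand out in that list.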
[root@nix-gpu-poc:~]# nvidia-smi
Fri Mar 11 13:37:55 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:00:10.0 Off |                    0 |
| N/A   23C    P0    31W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
EDIT:
As expected, CUDA will not work with this version, but your nvidia-smi issue might be something else.
[nix-shell:~/cuda-samples/Samples/1_Utilities/deviceQuery]# ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
I tried to use the tesla drivers with this override:
let
  nvdriver = pkgs.linuxPackages.nvidia_x11.overrideAttrs (oldAttrs: {
    src = pkgs.fetchurl {
      url = "https://us.download.nvidia.com/tesla/510.47.03/NVIDIA-Linux-x86_64-510.47.03.run";
      sha256 = "sha256:146nwmwn5xwa52jgmc4m1kjr0zpj7b5rqicn1ly8sc9nx3d2397j";
    };
    version = "510.47.03-a100";
  });
in
{
  hardware.nvidia.package = nvdriver;
  # I don't know if this is needed
  boot.kernelPackages = pkgs.linuxPackages // {
    nvidiaPackages.stable = nvdriver;
  };
}
This along with CUDA 11.6.1 still throws the above error.
strace shows that it’s looking for CUDA libs in the wrong location.
write(1, " CUDA Device Query (Runtime API)"..., 66 CUDA Device Query (Runtime API) version (CUDART static linking)
) = 66
futex(0x7f6d243f3048, FUTEX_WAKE_PRIVATE, 2147483647) = 0
openat(AT_FDCWD, "/nix/store/z56jcx3j1gfyk4sv7g8iaan0ssbdkhz1-glibc-2.33-56/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/nix/store/4ix6hsc66x3hjghvxrdvgsyh92nlihx9-gcc-10.3.0-lib/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/nix/store/z56jcx3j1gfyk4sv7g8iaan0ssbdkhz1-glibc-2.33-56/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/nix/store/z56jcx3j1gfyk4sv7g8iaan0ssbdkhz1-glibc-2.33-56/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
futex(0x4a4c50, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(1, "cudaGetDeviceCount returned 35\n", 31cudaGetDeviceCount returned 35
) = 31
write(1, "-> CUDA driver version is insuff"..., 64-> CUDA driver version is insufficient for CUDA runtime version
) = 64
write(1, "Result = FAIL\n", 14Result = FAIL
) = 14
exit_group(1) = ?
+++ exited with 1 +++
[nix-shell:~/cuda-samples/Samples/1_Utilities/deviceQuery]# ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA A100-PCIE-40GB"
CUDA Driver Version / Runtime Version 11.6 / 11.6
CUDA Capability Major/Minor version number: 8.0
Total amount of global memory: 40536 MBytes (42505207808 bytes)
(108) Multiprocessors, (064) CUDA Cores/MP: 6912 CUDA Cores
GPU Max Clock rate: 1410 MHz (1.41 GHz)
Memory Clock rate: 1215 Mhz
Memory Bus Width: 5120-bit
L2 Cache Size: 41943040 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 167936 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 16
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.6, CUDA Runtime Version = 11.6, NumDevs = 1
Result = PASS
In addition to the system config above for using the right driver, I’m using this shell.nix to set up the env vars
{ pkgs ? import <nixpkgs> {} }:
let
  cuda = with pkgs; callPackage "/etc/nixpkgs/pkgs/development/compilers/cudatoolkit/common.nix" {
    version = "11.6.1-a100";
    url = "https://developer.download.nvidia.com/compute/cuda/11.6.1/local_installers/cuda_11.6.1_510.47.03_linux.run";
    sha256 = "sha256:0zjdk166ihiqhcd5a8zwphvx1skmzzxnd6162c0j0x0bw3y9l8db";
    gcc = gcc10;
  };
  nvdriver = pkgs.linuxPackages.nvidia_x11.overrideAttrs (oldAttrs: {
    src = pkgs.fetchurl {
      url = "https://us.download.nvidia.com/tesla/510.47.03/NVIDIA-Linux-x86_64-510.47.03.run";
      sha256 = "sha256:146nwmwn5xwa52jgmc4m1kjr0zpj7b5rqicn1ly8sc9nx3d2397j";
    };
    version = "510.47.03-a100";
  });
in
pkgs.mkShell {
  buildInputs = with pkgs; [
    cuda
    nvdriver
    gdb
  ];
  shellHook = ''
    export CUDA_PATH=${cuda}
    export LD_LIBRARY_PATH=${nvdriver}/lib
  '';
}
I thought having the packages in buildInputs would automatically set the library paths. Is that wrong?
Why is it expected that CUDA wouldn’t work? AFAIU there are not separate CUDA distributions for different classes of GPUs – unlike the device drivers.
I stand corrected. It turns out that this is all much more confusing than I realized and that for the most part the drivers are all the same. So as long as there’s a recent enough driver it should work with any model of GPU. (Shout out to @kmittman for explaining!)
Based on what you said, I tried using the consumer drivers. Turns out I was wrong about needing to use Tesla drivers. Consumer drivers will work too, as long as LD_LIBRARY_PATH is set correctly. I think the difference between the two is just some software features like MIG, though I can’t find any documentation from NVIDIA about it. I’d appreciate it if you could share any insights you got from kmittman about the differences between GeForce and Tesla drivers.
I’m no expert in this stuff… AFAIU the core driver itself is the same but there are variations in the auxiliary features and odd bits included – CUDA utilities, docs, slightly different patches for games, etc.
AFAICT it’s an attempt at price discrimination. Kinda like how they sell “data center” chips at a significant markup over consumer ones of equivalent power.
I just tried this workaround but nvidia-smi is still not able to pick up my V100 on a p3.2xlarge EC2 instance. I’m not sure that it’s changing the driver at all since after running nixos-rebuild switch I see
FWIW I discovered that although the /nix/store path is wrong, it appears that nvidia-smi is the right version after all… No idea why it’s not finding my V100 device.
❯ nvidia-smi -h
NVIDIA System Management Interface -- v510.47.03
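One way to check whether the rebuild actually changed the driver is to compare the nvidia-smi version above with the version of the kernel module that is currently loaded. A small sketch, assuming the module exposes its version in /proc/driver/nvidia/version as the NVIDIA Linux driver normally does; nvidia_kmod_version is a hypothetical helper name.

```shell
# Hypothetical helper: print the version of the loaded NVIDIA kernel
# module by parsing the "NVRM version" line of the given file.
nvidia_kmod_version() {
  # File to parse; defaults to the procfs entry of the loaded module.
  _f="${1:-/proc/driver/nvidia/version}"
  if [ ! -r "$_f" ]; then
    echo "cannot read $_f (is the nvidia module loaded?)" >&2
    return 1
  fi
  # The first dotted number on the NVRM line is the driver version.
  grep 'NVRM version' "$_f" | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n1
}
```

If `nvidia_kmod_version` prints something other than the version reported by `nvidia-smi -h`, the user-space tools and the kernel module come from different driver builds, which would at least tell you the override did not reach the kernel side. This is only a diagnostic, not a confirmed explanation for the missing V100.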
Update update: I also tried switching to cudatoolkit 11.5 but that doesn’t seem to do anything for me. Also added the strace logs to the gist: nix-info.txt · GitHub