I have a flake which provides a python with pytorch-bin, which successfully uses CUDA on my NixOS development machine and even uses MPS on M2 Macs.
I want to use this on an HPC system whose compute nodes have NVIDIA GPUs.
Without Nix, using CUDA on this system requires loading the CUDA module via lmod: module load CUDA. (This loads the default version, 12.2.2, but other versions are available too.)
This command seems to prepend
/scicomp/builds/Rocky/8.7/Common/software/CUDA/12.2.2/nvvm/bin:/scicomp/builds/Rocky/8.7/Common/software/CUDA/12.2.2/bin
to PATH, and
/scicomp/builds/Rocky/8.7/Common/software/CUDA/12.2.2/nvvm/lib64:/scicomp/builds/Rocky/8.7/Common/software/CUDA/12.2.2/extras/CUPTI/lib64:/scicomp/builds/Rocky/8.7/Common/software/CUDA/12.2.2/lib
to LD_LIBRARY_PATH.
Is it possible to make the pytorch provided by my flake use the CUDA GPUs on this system?
unpacking sources
Creating directory NVIDIA-Linux-x86_64-530.30.02
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 530.30.02/nix/store/8zpy6cffgk57wb8vnpyas6c9x21bixmj-NVIDIA-Linux-x86_64-530.30.02.run: line 729: /build/makeself.mtsq6w7s/zstd: No such file or directory
/nix/store/8zpy6cffgk57wb8vnpyas6c9x21bixmj-NVIDIA-Linux-x86_64-530.30.02.run: line 720: /dev/tty: No such device or address
Terminated
xz: (stdin): File format not recognized
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors
Could this have something to do with trying to run this on a compute node to which it is impossible to log in? (The login nodes do not have any NVIDIA GPUs.)
… even though ls in the job does report the presence of /dev/tty.
Edit: line 729 calls a function containing line 790, so the tty problem occurs after the /build/makeself.etc one:
It looks like some of these are the normal userspace libraries; you want to remove those from LD_LIBRARY_PATH, because it has higher priority than DT_RUNPATH and you might end up loading an old libc or libstdc++. You need to figure out which of these directories, or of the directories listed in /etc/ld.so.conf (if it exists), contains libcuda.so and the other “driver” libraries. Only those must be on LD_LIBRARY_PATH, and only if those directories do not mix driver libraries with the normal libraries (in which case you might again run into a conflict).
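A minimal sketch of that search, assuming glibc-style naming (libcuda.so*); the helper name is made up:

```shell
#!/bin/sh
# Walk the directories currently on LD_LIBRARY_PATH and report which of
# them contain the driver library libcuda.so -- per the advice above,
# only those should remain on LD_LIBRARY_PATH. (Directories from
# /etc/ld.so.conf would need the same check.)
find_driver_dirs() {
  printf '%s\n' "$1" | tr ':' '\n' | while read -r dir; do
    # A directory qualifies if any libcuda.so* file is present in it.
    if [ -d "$dir" ] && ls "$dir"/libcuda.so* >/dev/null 2>&1; then
      echo "$dir"
    fi
  done
}

find_driver_dirs "$LD_LIBRARY_PATH"
```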
You can automate (most of?) this process by using nixglhost.
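For the record, a typical invocation looks something like the following; the flake ref is numtide’s nix-gl-host, and the torch one-liner is just an example workload:

```shell
#!/bin/sh
# nixglhost collects the host's NVIDIA driver libraries into a cache
# directory and re-execs the wrapped program with that directory on its
# library path, so Nix-built binaries can find libcuda.so.
if command -v nix >/dev/null 2>&1; then
  nix run github:numtide/nix-gl-host -- \
    python -c 'import torch; print(torch.cuda.is_available())'
else
  echo "nix not available here; skipping"
fi
```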
You can always inspect which exact shared libraries (their paths) are being loaded by running your program with LD_DEBUG=libs and grepping for the ones of interest (libcuda.so, libnvidia-gl.so).
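Concretely, using a torch import as the example program (the grep pattern is a guess at the library names of interest):

```shell
#!/bin/sh
# LD_DEBUG=libs makes the glibc dynamic loader trace its library search
# on stderr; filtering for NVIDIA names shows exactly which driver
# libraries were loaded, and from which directories.
LD_DEBUG=libs python -c 'import torch' 2>&1 \
  | grep -E 'libcuda|libnvidia' || echo "no NVIDIA libraries were loaded"
```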
Regarding the workflow, it might also be convenient to use Nixpkgs’ singularity-tools.buildImage and ship the Nix-built software onto the cluster as images
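A sketch of that workflow, assuming the flake exposes such an image as a .#image output (a made-up name). Note that Singularity/Apptainer needs the --nv flag to bind the host NVIDIA driver into the container:

```shell
#!/bin/sh
# Build the image with Nix, ship it to the cluster, then run it.
# --nv is what makes the host's NVIDIA driver (libcuda.so etc.) visible
# inside the container; without it, CUDA will not work in there.
if command -v nix >/dev/null 2>&1 && command -v singularity >/dev/null 2>&1; then
  nix build .#image   # hypothetical flake output built with singularity-tools.buildImage
  singularity run --nv ./result
else
  echo "nix and/or singularity not available; skipping"
fi
```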
So the module which adds those things to LD_LIBRARY_PATH etc. should be loaded, and nixglhost will then massage the environment to make things work in Nix? … well, I have a proof of concept that uses nixglhost and claims to have working CUDA both with and without loading that module.
How so? It seems a lot slower and less convenient than using nix directly via nix-portable (though nix-portable isn’t without issues), and I don’t see how Singularity (or Apptainer, as it’s called these days) would help with the CUDA conundrum. In my initial attempts, I get no CUDA in the container.
It wouldn’t make any difference to CUDA, but if e.g. your store is on a distributed file system it would (could) compensate for all the indirection that Nixpkgs creates (symlinks, etc.). Btw, out of curiosity, can you tell me more about your cluster? Does your nix-portable use user namespaces or does it fall back to ptrace?
Please gist the LD_DEBUG=libs output for both cases; just as a guess, if the cluster deploys both the X11 driver and cuda_compat, there’s a chance that nixglhost would pick up the former.
Can’t reproduce the inconsistency at the moment: it now works on A100 even if the module is loaded. But I have been fiddling around with exactly how nixglhost is invoked (nix run github:numtide/nix-gl-host vs nix develop my/flake# vs a local installation), so maybe the problem only manifests itself in a subset of these approaches.
But given that I have a solution that works, understanding the discrepancy does not have sufficient priority right now.
Could you share your solution @jacg, please? Our cluster also uses lmod and I have problems setting my ML environment up correctly with nix. Specifically, I need CUDA support for pytorch.
I’m sorry, this is a huge context switch for me and I don’t really have the time to extract the signal from the noise of our project-specific cruft and give you something that works in isolation. I think the most important bits are here (NB: this comes from a flake that uses nosys, so the treatment of system is not what you would typically see; you would have to adapt that to whatever you’re used to doing):
In brief: the solution uses torch-bin instead of torch and allows some unfree packages that torch-bin needs.
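For example (assuming a flake-based dev shell; the env-var route below avoids editing the flake, while setting config.allowUnfree = true when importing nixpkgs inside the flake is the permanent fix):

```shell
#!/bin/sh
# torch-bin redistributes NVIDIA's proprietary binaries, so Nixpkgs
# marks it unfree; evaluation fails unless unfree packages are allowed.
# One-off, from the command line (--impure is needed so the environment
# variable is visible during flake evaluation):
if command -v nix >/dev/null 2>&1; then
  NIXPKGS_ALLOW_UNFREE=1 nix develop --impure
else
  echo "nix not available here; skipping"
fi
```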
I can’t find a coherent story to tell you in the time I have available for the necessary archaeology, so if this doesn’t give you sufficient inspiration, I suggest you follow @SergeK’s advice and start a new thread describing your specific problems and the solutions you have tried.
I’m sorry I can’t give you a more helpful answer right now.