I have a flake which provides a python with pytorch-bin, which successfully uses CUDA on my NixOS development machine and even uses MPS on M2 Macs.
I want to use this on an HPC system whose compute nodes have NVIDIA GPUs.
In Nix-less usage, CUDA on this system is made available by loading the CUDA module with lmod: module load CUDA. (This loads the default version, 12.2.2, but other versions are available too.)
This command seems to prepend

```
/scicomp/builds/Rocky/8.7/Common/software/CUDA/12.2.2/nvvm/bin:/scicomp/builds/Rocky/8.7/Common/software/CUDA/12.2.2/bin
```

to PATH, and

```
/scicomp/builds/Rocky/8.7/Common/software/CUDA/12.2.2/nvvm/lib64:/scicomp/builds/Rocky/8.7/Common/software/CUDA/12.2.2/extras/CUPTI/lib64:/scicomp/builds/Rocky/8.7/Common/software/CUDA/12.2.2/lib
```

to LD_LIBRARY_PATH.
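For reference, the module's effect can be inspected directly on the cluster (a sketch; the module name and version are the defaults reported above, your site may differ):

```bash
# Load the default CUDA module (12.2.2 on this cluster).
module load CUDA

# Lmod prints the module definition to stderr; filter for the variables it touches.
module show CUDA 2>&1 | grep -E 'PATH'

# Confirm what actually ended up in the environment.
echo "$PATH"
echo "$LD_LIBRARY_PATH"
```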
Is it possible to make the pytorch provided by my flake use the CUDA GPUs on this system?
```
unpacking sources
Creating directory NVIDIA-Linux-x86_64-530.30.02
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 530.30.02
/nix/store/8zpy6cffgk57wb8vnpyas6c9x21bixmj-NVIDIA-Linux-x86_64-530.30.02.run: line 729: /build/makeself.mtsq6w7s/zstd: No such file or directory
/nix/store/8zpy6cffgk57wb8vnpyas6c9x21bixmj-NVIDIA-Linux-x86_64-530.30.02.run: line 720: /dev/tty: No such device or address
Terminated
xz: (stdin): File format not recognized
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors
```
Could this have something to do with trying to run this on a compute node to which it is impossible to log in? (The login nodes do not have any NVIDIA GPUs.)
… even though ls in the job does report the presence of /dev/tty.
Edit: line 729 calls a function containing line 720, so the tty problem occurs after the /build/makeself…/zstd one.
It looks like some of these are normal userspace libraries; those you want to remove from LD_LIBRARY_PATH, because it takes priority over DT_RUNPATH and you might end up loading an old libc or libstdc++. You need to figure out which of these directories, or of the directories listed in /etc/ld.so.conf (if it exists), contains libcuda.so and the other "driver" libraries. Only those must be in LD_LIBRARY_PATH, and only if those directories do not mix driver libraries with normal libraries (in which case you might again run into a conflict).
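One way to check which entries actually contain the driver libraries (a sketch, to be run on a GPU node after loading the module):

```bash
# Look for the driver's userspace libraries in each directory on LD_LIBRARY_PATH.
for dir in ${LD_LIBRARY_PATH//:/ }; do
    ls "$dir"/libcuda.so* 2>/dev/null
done

# The loader cache often reveals where the host keeps the driver libraries.
ldconfig -p | grep -E 'libcuda|libnvidia'
```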
You can automate (most of?) this process by using nixglhost.
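For example, something along these lines (a sketch; the flake reference is nix-gl-host's upstream repository, and the Python one-liner is just an illustrative test program, assumed to be the Nix-built Python from your flake's dev shell):

```bash
# Let nixglhost expose the host's NVIDIA driver libraries to the Nix-built program,
# then check whether torch can see the GPU.
nix run github:numtide/nix-gl-host -- \
    python -c "import torch; print(torch.cuda.is_available())"
```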
You can always inspect which exact shared libraries (their paths) are being loaded by running your program with LD_DEBUG=libs and grepping for the ones of interest (libcuda.so, libnvidia-gl.so).
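For instance (a sketch, using the torch import as the program under test):

```bash
# The dynamic loader writes its trace to stderr; save it, then pick out the driver libraries.
LD_DEBUG=libs python -c "import torch; print(torch.cuda.is_available())" 2> ld-debug.log
grep -E 'libcuda\.so|libnvidia' ld-debug.log
```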
Regarding the workflow, it might also be convenient to use Nixpkgs' singularity-tools.buildImage and ship the Nix-built software onto the cluster as images.
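A sketch of that workflow, assuming the flake exposes an output named image built with singularity-tools.buildImage (the output name, host name and file names are illustrative):

```bash
# Build the image where Nix is available, then copy it to the cluster.
nix build .#image
scp result cluster:torch.img

# On a GPU node: --nv bind-mounts the host's NVIDIA driver into the container.
apptainer exec --nv torch.img \
    python -c "import torch; print(torch.cuda.is_available())"
```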
So the module which adds those things to LD_LIBRARY_PATH etc. should be loaded, and nixglhost will then massage the environment to make things work in Nix? … well, I have a proof of concept that uses nixglhost and claims to have working CUDA both with and without loading that module.
How so? It seems a lot slower and less convenient than using nix directly via nix-portable (though nix-portable isn't without issues), and I don't see how Singularity (or apptainer, as it's called these days) would help with the CUDA conundrum. In my initial attempts, I get no CUDA in the container.
It wouldn't make any difference to CUDA, but if, e.g., your store is on a distributed file system, it would (could) compensate for all the indirection that Nixpkgs creates (symlinks, etc.). Btw, out of curiosity, can you tell me more about your cluster? Does your nix-portable use user namespaces or does it fall back to ptrace?
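A quick way to answer the namespace question on a compute node (a sketch; nix-portable prefers bubblewrap, which needs unprivileged user namespaces, and otherwise falls back to proot, which uses ptrace):

```bash
# If this succeeds, unprivileged user namespaces are available and nix-portable
# can use bubblewrap; if it fails, expect the slower proot/ptrace fallback.
unshare --user --map-root-user true \
    && echo "user namespaces available" \
    || echo "no unprivileged user namespaces"
```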
Please gist LD_DEBUG=libs for both cases. Just as a guess: if the cluster deploys both the x11 driver and cuda_compat, there's a chance that nixglhost would pick up the former.
Can't reproduce the inconsistency at the moment: it now works on the A100 even if the module is loaded. But I have been fiddling around with exactly how nixglhost is invoked (nix run github:numtide/nix-gl-host vs nix develop my/flake# vs a local installation), so maybe the problem only manifests itself in a subset of these approaches.
But given that I have a solution that works, understanding the discrepancy does not have sufficient priority right now.
Could you share your solution @jacg, please? Our cluster also uses lmod and I have problems setting my ML environment up correctly with nix. Specifically, I need CUDA support for pytorch.
I'm sorry, this is a huge context switch for me and I don't really have the time to extract the signal from the noise of our project-specific cruft and give you something that works in isolation. I think that the most important bits are here (NB: this comes from a flake that uses nosys, so the treatment of system is not what you would typically see, and you would have to adapt that to whatever you're used to doing):
In brief: the solution uses torch-bin instead of torch and allows some unfree packages that torch-bin needs.
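As a rough stand-in for the missing snippet (this is not the poster's flake, just the same idea expressed as a one-off command; torch-bin and the allowUnfree switch are standard nixpkgs mechanisms):

```bash
# Build a Python wrapped with the prebuilt torch-bin wheel, allowing the unfree
# packages it pulls in, then check whether it can see a CUDA device.
nix build --impure --expr '
  with import (builtins.getFlake "nixpkgs") { config.allowUnfree = true; };
  python3.withPackages (ps: [ ps.torch-bin ])
'
./result/bin/python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
```

On the cluster, the final check still needs the host driver made visible, e.g. by running it under nixglhost as discussed above.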
I can't find a coherent story to tell you in the time I have available for the necessary archaeology, so if this doesn't give you sufficient inspiration, I suggest you follow @SergeK's advice and start a new thread describing your specific problems and the solutions you have tried.
I’m sorry I can’t give you a more helpful answer right now
I'm having a similar error when extracting the .run file of other packages, e.g. the Ascend toolkit, which is something similar to cudatoolkit but for Ascend NPUs. Curious whether this problem is caused by makeself?