Getting PyTorch to use CUDA on an HPC system

I have a flake which provides a python with pytorch-bin, which successfully uses CUDA on my NixOS development machine and even uses MPS on M2 Macs.

I want to use this on an HPC system whose compute nodes have NVIDIA GPUs.

Without Nix, using CUDA on this system requires loading the CUDA module with lmod: module load CUDA. (This uses the default version, 12.2.2, but other versions are available too.)

This command seems to prepend
/scicomp/builds/Rocky/8.7/Common/software/CUDA/12.2.2/nvvm/bin:/scicomp/builds/Rocky/8.7/Common/software/CUDA/12.2.2/bin
to PATH and

/scicomp/builds/Rocky/8.7/Common/software/CUDA/12.2.2/nvvm/lib64:/scicomp/builds/Rocky/8.7/Common/software/CUDA/12.2.2/extras/CUPTI/lib64:/scicomp/builds/Rocky/8.7/Common/software/CUDA/12.2.2/lib
to LD_LIBRARY_PATH
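
One way to confirm exactly what the module changes (module show is a standard Lmod command; diffing the environment is a generic fallback, and the file names below are just illustrative):

module show CUDA          # prints the prepend_path calls for PATH, LD_LIBRARY_PATH, etc.

env | sort > env.before
module load CUDA
env | sort > env.after
diff env.before env.after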

Is it possible to make the pytorch provided by my flake use the CUDA GPUs on this system?

Edit: the Nix here is provided by nix-portable.

Trying to throw nixGL at this problem, I get

unpacking sources
Creating directory NVIDIA-Linux-x86_64-530.30.02
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 530.30.02/nix/store/8zpy6cffgk57wb8vnpyas6c9x21bixmj-NVIDIA-Linux-x86_64-530.30.02.run: line 729: /build/makeself.mtsq6w7s/zstd: No such file or directory
/nix/store/8zpy6cffgk57wb8vnpyas6c9x21bixmj-NVIDIA-Linux-x86_64-530.30.02.run: line 720: /dev/tty: No such device or address

Terminated
xz: (stdin): File format not recognized
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors

Could this have something to do with trying to run this on a compute node to which it is impossible to log in? (The login nodes do not have any NVIDIA GPUs.)

… even though ls in the job does report the presence of /dev/tty.
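
(A likely explanation: /dev/tty can exist as a device node yet still be unopenable when the process has no controlling terminal, which is typical for batch jobs; opening it then fails with ENXIO, i.e. exactly "No such device or address". A quick check from inside a job script, as a sketch:)

# Try to open /dev/tty for writing; the redirect fails if there is no controlling terminal.
if ( : > /dev/tty ) 2>/dev/null; then
  echo "/dev/tty is usable (controlling terminal present)"
else
  echo "no controlling terminal: opening /dev/tty fails with ENXIO"
fi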

Edit: line 729 calls the UnTAR function, which contains line 720, so the tty problem occurs after the /build/makeself… one:

718   │ UnTAR() {
719   │     tar xvf - 2> /dev/null || {
720   │         echo "Extraction failed." > /dev/tty; kill -15 $$;
721   │     };
722   │ }
723   │ 
724   │ $echo -n "Uncompressing $label"
725   │ cd $tmpdir ; res=3
726   │ 
727   │ [ "$keep" = "y" ] || trap '$echo "Signal caught, cleaning up" > /dev/tty; cd $TMPROOT; rm -rf $tmpdir; exit 15' 1 2 15
728   │ 
729   │ if (cd "$location"; tail -n +$skip $0; ) | zstd -d  | UnTAR | (while read a; do $echo -n "."; done; $echo; ); then

How can I find out where /build is?

Looks like some of these are normal userspace libraries - those you want to remove from LD_LIBRARY_PATH, because LD_LIBRARY_PATH takes priority over DT_RUNPATH and you might end up loading an old libc or libstdc++. You need to figure out which of these directories, or of the directories listed in /etc/ld.so.conf (if it exists), contain libcuda.so and the other “driver” libraries. Only those should be in LD_LIBRARY_PATH, and only if those directories do not mix driver libraries with normal libraries (in which case you might again run into a conflict).
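
A small sketch of that triage, assuming the CUDA module has been loaded (libcuda.so is the name to look for; everything else below is just illustrative):

# Which LD_LIBRARY_PATH entries actually contain driver libraries?
IFS=: read -ra dirs <<< "$LD_LIBRARY_PATH"
for d in "${dirs[@]}"; do
  if ls "$d"/libcuda.so* >/dev/null 2>&1; then
    echo "driver libs in: $d"       # keep this entry
  else
    echo "no libcuda.so in: $d"     # candidate for removal
  fi
done

# The system linker cache (built from /etc/ld.so.conf) may also show where the driver lives:
ldconfig -p | grep -E 'libcuda|libnvidia'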

You can automate (most of?) this process by using nixglhost.
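
For example, something along these lines (the exact invocation is an assumption, so check the nix-gl-host README; the python here is assumed to be the Nix-built one already on PATH, e.g. from the flake’s dev shell):

# nixglhost copies the host driver libraries into a cache and points the loader at them:
nix run github:numtide/nix-gl-host -- \
  python -c 'import torch; print(torch.cuda.is_available())'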

You can always inspect which exact shared libraries (their paths) are being loaded by running your program with LD_DEBUG=libs and grepping for the ones of interest (libcuda.so, libnvidia-gl.so).
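
For example (the grep pattern is just illustrative):

# LD_DEBUG output goes to stderr, hence the redirect:
LD_DEBUG=libs python -c 'import torch; torch.cuda.is_available()' 2>&1 \
  | grep -E 'libcuda|libnvidia'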

Regarding the workflow, it might also be convenient to use Nixpkgs’ singularity-tools.buildImage and ship the Nix-built software onto the cluster as images.
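
On the shell side that workflow might look roughly like this (the flake attribute name is hypothetical; --nv is the Singularity/Apptainer flag that binds the host NVIDIA driver into the container):

# Build the image with Nix; "sif" is a made-up attribute name:
nix build .#sif
# Copy the result to the cluster, then run it with the host GPU driver bound in
# (./result may be the image file itself or contain it, depending on the builder):
singularity exec --nv ./result python -c 'import torch; print(torch.cuda.is_available())'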

So the module which adds those things to LD_LIBRARY_PATH etc. should be loaded, and nixglhost will then massage the environment to make things work in Nix? … well, I have a proof of concept that uses nixglhost and claims to have working CUDA both with and without loading that module.

How so? It seems a lot slower and less convenient than using nix directly via nix-portable (though nix-portable isn’t without issues), and I don’t see how Singularity (or Apptainer, as it’s called these days) would help with the CUDA conundrum. In my initial attempts, I get no CUDA in the container.

It wouldn’t make any difference to CUDA, but if e.g. your store is on a distributed file system it would (could) compensate for all the indirection that Nixpkgs creates (symlinks, etc.). Btw, out of curiosity, can you tell me more about your cluster? Does your nix-portable use user namespaces or does it fall back to ptrace?

I would expect it to be using namespaces, as this is a pretty new system. How can I be sure?

lsns shows a bunch of COMMANDs in /nix/store, so I guess that’s a yes.
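
A couple of quick checks that aren’t specific to nix-portable but confirm whether unprivileged user namespaces are available at all:

# Try to create a user namespace directly:
unshare --user --map-root-user true && echo "user namespaces work"
# 0 here means they are disabled system-wide:
cat /proc/sys/user/max_user_namespaces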

… when I’m on a node that has RTX 3090s, but when I’m on one that has A100s it only works if the module is NOT loaded. Curious.

Please gist the LD_DEBUG=libs output for both cases; just as a guess, if the cluster deploys both the x11 driver and cuda_compat, there’s a chance that nixglhost would pick up the former.

Can’t reproduce the inconsistency at the moment: it now works on the A100 node even if the module is loaded. But I have been fiddling around with exactly how nixglhost is invoked (nix run github:numtide/nix-gl-host vs nix develop my/flake# vs a local installation), so maybe the problem only manifests itself in a subset of these approaches.

But given that I have a solution that works, understanding the discrepancy does not have sufficient priority right now.

Could you share your solution @jacg, please? Our cluster also uses lmod and I have problems setting my ML environment up correctly with nix. Specifically, I need CUDA support for pytorch.

It’s probably better if you start a new thread, describe your situation in detail, and cite the previous threads.

I’m sorry, this is a huge context switch for me and I don’t really have the time to extract the signal from the noise of our project-specific cruft and give you something that works in isolation. I think the most important bits are here (NB: this comes from a flake that uses nosys, so the treatment of system is not what you would typically see, and you would have to adapt that to whatever you’re used to doing):

{ self
, nixpkgs # <---- This `nixpkgs` has systems removed e.g. legacyPackages.zlib
, ...
}: let
  pkgs = import nixpkgs {
    inherit (nixpkgs.legacyPackages) system;
    config.allowUnfreePredicate = pkg: builtins.elem (nixpkgs.lib.getName pkg) [
      "triton"
      "cuda_cudart"
      "cuda_nvtx"
      "torch"
    ];
  };

  python-with-packages = pkgs.python3.withPackages(ps: [ ps.torch-bin ]);

in {
  devShell = self.devShells.clang;

  devShells.clang = pkgs.mkShell.override { stdenv = pkgs.clang_16.stdenv; } {
    packages = [ python-with-packages ];
  };
}

In brief: the solution uses torch-bin instead of torch and allows some unfree packages that torch-bin needs.
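
On the cluster, usage would then look roughly like this (adapt the flake reference to your setup; nixglhost has to be available separately, e.g. via nix run as mentioned above):

# Enter the dev shell that provides python-with-packages:
nix develop my/flake#clang

# Run PyTorch through nixglhost so the host driver libraries are picked up:
nixglhost python -c 'import torch; print(torch.cuda.is_available())'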

I can’t find a coherent story to tell you in the time I have available for the necessary archaeology, so if this doesn’t give you sufficient inspiration, I suggest you follow @SergeK’s advice and start a new thread describing your specific problems and the solutions you tried.

I’m sorry I can’t give you a more helpful answer right now.