Improving NixOS data science infrastructure: CI for MKL&CUDA

They are now including CUDA in the prebuilt binaries, which makes it even easier to package. Only downside is that libtorch_cuda.so is now a 709MiB binary ;).

We now have a derivation python3Packages.pytorch-bin with CUDA support:

https://github.com/NixOS/nixpkgs/pull/96669

This should help those who want to avoid the heavy build of python3Packages.pytorch. I also opened a PR for libtorch-bin for the C++ API (which is also used by e.g. the Rust tch crate), so hopefully we’ll have that soon as well:

https://github.com/NixOS/nixpkgs/pull/96488

Forgot to add: the upstream builds use MKL as their BLAS library. This should generally give better performance than multi-threaded OpenBLAS, which is our default system-wide BLAS and is therefore what python3Packages.pytorch uses by default. Multi-threaded OpenBLAS also does not work correctly if your application uses any kind of threading.
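For anyone who wants to check which kind of build they ended up with, here is a minimal sketch. It assumes torch is importable in the current environment; the helper name describe_torch_build is made up for illustration.

```python
def describe_torch_build():
    """Return a small dict describing the installed torch build, or None."""
    try:
        import torch
    except ImportError:
        return None  # torch is not installed in this environment
    return {
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "mkl_available": torch.backends.mkl.is_available(),
    }


print(describe_torch_build())
```

With the upstream binaries I would expect mkl_available to be True, but that is an expectation, not something this snippet guarantees.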

Unfortunately, on AMD Ryzen CPUs, MKL will use slower SSE kernels. With the MKL version that libtorch/PyTorch use, you can force the use of AVX2 kernels with export MKL_DEBUG_CPU_TYPE=5.
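If you would rather not export the variable in your whole shell session, a small launcher can set it for a single process. This is only a sketch: the function names are hypothetical, and it assumes the bundled MKL still honours MKL_DEBUG_CPU_TYPE (other MKL releases may ignore it).

```python
import os
import subprocess


def env_with_mkl_avx2(base_env=None):
    """Return a copy of the environment with MKL_DEBUG_CPU_TYPE=5 set."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: the MKL build in use reads this variable at load time.
    env["MKL_DEBUG_CPU_TYPE"] = "5"
    return env


def run_with_mkl_avx2(argv):
    """Run a command with the override applied and return its exit code."""
    return subprocess.run(argv, env=env_with_mkl_avx2()).returncode


# Hypothetical usage: run_with_mkl_avx2(["python", "train.py"])
```

The variable has to be in the environment before MKL is loaded, which is why the sketch starts a child process instead of setting it after torch has been imported.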

I’d assume this is where most of the people affected would be.

If you’re on Linux 5.9, you’ll need to circumvent the GPL-condom to use nvidia-uvm for CUDA. I spent more time than I’d like debugging in the wrong places.

Thanks for the heads-up! I was wondering why I was getting "CUDA error: unknown error" errors. Some strace-ing revealed that /dev/nvidia-uvm could not be opened. Running modprobe manually showed an error that reminded me of your comment.
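A quicker way to get to the same place than strace is to try opening the device node directly. The following is a rough sketch; probe_device is a hypothetical helper, and a successful open only shows that the node exists and is accessible, not that CUDA as a whole works.

```python
import errno
import os


def probe_device(path="/dev/nvidia-uvm"):
    """Try to open a device node and return a short status string."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        # Map the errno to its symbolic name, e.g. ENOENT or EACCES.
        name = errno.errorcode.get(exc.errno, "unknown")
        return f"cannot open {path}: {name} ({exc.strerror})"
    os.close(fd)
    return f"{path} opened successfully"


print(probe_device())
```

If the node is missing entirely, the kernel module is probably not loaded, and the output of a manual modprobe is the next thing to look at.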

It’s annoying to run into these Linux ↔ NVIDIA licensing shenanigans when you are just trying to get work done :(

I haven’t tested it, but there is this patch: https://github.com/Frogging-Family/nvidia-all/blob/f1d3c6cf024945e7a477ed306bd173fa6b81d72d/patches/kernel-5.9.patch

Officially, we need to wait a month for NVIDIA to fix it; see "NVIDIA Doesn't Expect To Have Linux 5.9 Driver Support For Another Month" (Phoronix).

Luckily NixOS makes it so easy to switch kernels :), so it’s not a real problem to stick to a slightly older kernel.

I have tried to use the new BLAS/LAPACK infrastructure to build R with MKL, and the resulting R produces incorrect results for matrix multiplication (see my comment on the pull request "Add BLAS/LAPACK switching mechanism" by matthewbauer, NixOS/nixpkgs #83888). I have opened an issue, "R built with MKL computes incorrect results" (NixOS/nixpkgs #104026), which has the code I used to build R.

@danieldk CUDA and OpenCL are working with kernel 5.9 in the latest version of the NVIDIA driver.

They are. I switched to 5.9 a while ago (maybe 2 weeks?) and have been using CUDA a lot with Torch.
