Improving NixOS data science infrastructure: CI for MKL & CUDA

An alternative would be to have every PR go through staging. We could make an exception for security-critical ones of course.

Then we’d merge staging-next if and only if Hydra is all green. That way, everybody who wants to get something into master is motivated to monitor the staging-next status. We should encourage people to use reverts as the first measure for fixing breakages; otherwise people would be incentivized to just push to staging and let others deal with the fallout.

Maybe we could even get hydra to do git-bisect somewhat automatically. That would be a lot cheaper than building every PR.

So the flow would then be: I want to update a package. I open a PR against staging, do the usual quality checks that are common today (does it still work? do some reverse dependencies build?) and then get it merged.

At some point, staging gets promoted to staging-next. I see that there are several breakages in staging-next, none of which are caused by my update. It’s pretty obvious that failure 1 was caused by PR X, so I open a new PR to revert those changes and ping the author of the original PR. Some other failures are not quite as obvious, so I (or someone else) run a git-bisect. Eventually everything builds, staging-next gets merged, and the next staging gets promoted.

2 Likes

Noob question here: is it possible to install TensorFlow with GPU support from PyPI on NixOS?

Not out of the box, since the dynamic linker and library paths will be incorrect for NixOS. However, the tensorflow-bin derivation uses the PyPI package and patches up the library dependencies.
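
For illustration, a minimal shell.nix sketch along those lines; whether tensorflow-bin takes a cudaSupport override, and its exact attribute name, depends on the nixpkgs version you pin:

    # shell.nix sketch -- assumes tensorflow-bin accepts a cudaSupport flag;
    # adjust the attribute names to your nixpkgs version.
    let
      pkgs = import <nixpkgs> { config.allowUnfree = true; };
      python = pkgs.python3.withPackages (ps: [
        (ps.tensorflow-bin.override { cudaSupport = true; })
      ]);
    in pkgs.mkShell { buildInputs = [ python ]; }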

1 Like

I now have Hydra running thanks to the help of many on GitHub and IRC. Here’s the NixOps deploy: GitHub - tbenst/nix-data-hydra: Hydra deployment for Nix Data. The next challenge is to figure out how to distribute only items that are some form of unfreeRedistributable.

Here are two approaches that come to mind:

  • Fork Nixpkgs and patch nixpkgs/lib/licenses.nix to manually remove free = false for licenses that can be redistributed, e.g. unfreeRedistributable, issl, nvidia_cuda, and nvidia_cudnn [1].
  • Add a new attribute to licenses called e.g. redistributable, and create a system-wide allowRedistributable flag.

The former we can do on our own. The latter is IMHO the better solution, but I’m not sure how to go about it, both in terms of the code base and politically with the maintainers; I’m not sure this idea would be well received.

[1] We need to update the license for Nvidia to be more precise than unfree. I made a pull request here.

Edit: I realized the manual has a nice section on this. A third option is to handle it with a well-crafted overlay: Nixpkgs 23.11 manual | Nix & NixOS
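
For concreteness, a rough sketch of what that filtering could look like on the config side, assuming the allowUnfreePredicate route and that meta.license is a single license set (it can also be a list):

    # ~/.config/nixpkgs/config.nix sketch: only allow unfree licenses we
    # believe are redistributable. Assumes meta.license is a single license
    # set rather than a list of them.
    {
      allowUnfreePredicate = pkg:
        builtins.elem (pkg.meta.license or null)
          (with (import <nixpkgs/lib>).licenses; [ unfreeRedistributable issl ]);
    }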

2 Likes

Quick update: my Hydra is broken and no jobs de-queue; see hydra-queue-runner gets stuck while there are items in the queue · Issue #366 · NixOS/hydra · GitHub. If anyone can help troubleshoot, let me know; I’m happy to give SSH access!

Also, if anyone has the bandwidth to create a new Nvidia derivation that aims to be redistributable, that would be awesome. I presume the build could be modified to copy only these specific files to $out; a rough sketch follows the lists below. My understanding is that the output should include only the following files (per my reading of the license, and this is what Anaconda distributes).

cuDNN:

cudnn.h
libcudnn.so

CUDA:

lib/libcublas.so
lib/libcublasLt.so
lib/libcudart.so
lib/libcufft.so
lib/libcufftw.so
lib/libcurand.so
lib/libcusolver.so
lib/libcusparse.so
lib/libnppc.so
lib/libnppial.so
lib/libnppicc.so
lib/libnppicom.so
lib/libnppidei.so
lib/libnppif.so
lib/libnppig.so
lib/libnppim.so
lib/libnppist.so
lib/libnppisu.so
lib/libnppitc.so
lib/libnpps.so
lib/libnvToolsExt.so
lib/libnvblas.so
lib/libnvgraph.so
lib/libnvjpeg.so
lib/libnvrtc-builtins.so
lib/libnvrtc.so
lib/libnvvm.so
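
If anyone wants to pick this up, here is a hypothetical overlay sketch of the idea for cuDNN; the attribute names, the lib/ vs. lib64/ layout, and whether this file set actually satisfies the license are all assumptions on my part:

    # Hypothetical overlay: repackage only the cuDNN files listed above
    # into a new store path, leaving the original derivation untouched.
    self: super: {
      cudnnRedist = super.runCommand "cudnn-redistributable" { } ''
        mkdir -p $out/include $out/lib
        cp ${super.cudnn_cudatoolkit_10_1}/include/cudnn.h $out/include/
        # -P keeps the libcudnn.so -> libcudnn.so.X symlink chain intact
        cp -P ${super.cudnn_cudatoolkit_10_1}/lib/libcudnn.so* $out/lib/
      '';
    }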

My offer to help you get hercules-agent running still stands :slight_smile:

1 Like

Thanks! I think I finally understand the name… Hercules, slayer of Hydra :laughing:

@tomberek and I were able to get Hydra running, although there are some definite pain points in Nix with large files like cuda.run (3 GB), and in Hydra with large derivations like PyTorch (12 GB! We had to disable store-uri compression). It also took a fair bit of effort to figure out distributed builds; we didn’t realize that we needed an SSH key for hydra-queue-runner.
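
For anyone hitting the same wall, the moving parts looked roughly like this (a sketch, not our exact config; the host name, key path, and features are placeholders, and the queue runner needs read access to the key):

    # NixOS sketch for the Hydra master; all values are placeholders.
    # nix.buildMachines generates /etc/nix/machines, which the Hydra
    # queue runner reads by default on NixOS.
    {
      nix.buildMachines = [{
        hostName = "builder1.example.org";
        system = "x86_64-linux";
        maxJobs = 4;
        speedFactor = 2;
        supportedFeatures = [ "big-parallel" ];
        sshUser = "nixbuild";
        # must be readable by the hydra-queue-runner user
        sshKey = "/var/lib/hydra/queue-runner/.ssh/id_ed25519";
      }];
    }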

Would love to chat with you about Cachix, though; I’ll drop you a DM.

1 Like

If you can get Cachix working, that’s how the Hercules agent gets derivations and outputs; nothing goes through our central server.

If you need compute resources, we have a 16-core build box in the nix-community project. It would be nice to see it running with more CPU utilization :wink:

That’d be amazing! I’ll shoot you a DM.

Any progress? I’ve basically given up on compiling PyTorch with CUDA locally; it just isn’t feasible time-wise without leaving it running overnight. The base expression takes under 40 minutes, but with CUDA support enabled I was only at ~63% after 3.5 hours.

1 Like

Luckily, our machines have plenty of cores, so it does not take that long. But it is long enough to be annoying when we bump nixpkgs and something in PyTorch’s closure is updated. So instead, I have started to just use the upstream binaries and patchelf them.

I know it’s not as nice as source builds, but ‘builds’ finish in seconds. Still, it would be nice to have a (semi-)official binary cache for source builds.

(I primarily use libtorch, so this is only a derivation for that, but the same could be done for the Python module.)
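
Roughly, the approach looks like this (a hypothetical sketch, not my exact expression; the URL, version, hash, and CUDA attributes are placeholders):

    # Hypothetical libtorch-bin: fetch the upstream archive and let
    # autoPatchelfHook rewrite the library paths. Placeholder URL/hash.
    { stdenv, lib, fetchzip, autoPatchelfHook
    , cudatoolkit_10_1, cudnn_cudatoolkit_10_1 }:
    stdenv.mkDerivation rec {
      pname = "libtorch-bin";
      version = "1.4.0";
      src = fetchzip {
        url = "https://download.pytorch.org/libtorch/<cuda-variant>/libtorch-shared-with-deps-${version}.zip";
        sha256 = lib.fakeSha256;  # replace after the first build attempt
      };
      nativeBuildInputs = [ autoPatchelfHook ];
      # autoPatchelfHook resolves needed libraries from these inputs;
      # libcuda.so itself comes from the driver at runtime.
      buildInputs = [ stdenv.cc.cc.lib cudatoolkit_10_1 cudnn_cudatoolkit_10_1 ];
      installPhase = ''
        mkdir -p $out
        cp -r include lib share $out/
      '';
    }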

4 Likes

Last time I heard, stites was working on a Hydra for the GPU libs.

For now you can use his repo (GitHub - stites/pytorch-world: nix scripts for pytorch-related libraries) with the Cachix binary cache.

@stites: how close are you to a working automated binary cache for PyTorch? :slight_smile:

1 Like

That’s great :slight_smile: I think it would be useful to say why it’s better or worse than the officially recommended conda installation.

The README is also missing a “Getting started” section for those unfamiliar with Nix.

Made quite a bit of progress. We now build these jobsets against MKL and CUDA: GitHub - nix-community/nix-data-science: Standard set of packages and overlays for data-scientists [maintainer=@tbenst]. The build results are available at https://hydra.nix-data.org/project/nix-data.

So if you use the pinned nixpkgs on 20.03 and the same overlays, you should at least have some assurance that the long builds will succeed.
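
In other words, something along these lines (a sketch: the rev, hash, and overlay path are placeholders; check the repo for the actual pin and overlay entry point):

    # Hypothetical pin: replace <rev> and <sha256> with the values the
    # jobsets use, and the overlay path with whatever the repo exports.
    let
      nixpkgs = builtins.fetchTarball {
        url = "https://github.com/NixOS/nixpkgs/archive/<20.03-rev>.tar.gz";
        sha256 = "<sha256>";
      };
      nix-data = builtins.fetchTarball
        "https://github.com/nix-community/nix-data-science/archive/master.tar.gz";
    in import nixpkgs {
      config.allowUnfree = true;
      overlays = [ (import "${nix-data}/overlay.nix") ];
    }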

For caching, in practice we just need to integrate with Cachix to upload the binaries and we’ll be good to go.

The only thing holding us back is uncertainty around the licensing situation. I just sent Nvidia another email. As pointed out by @xbreak, I think it is reasonable to conclude that we do not currently modify the object code of the binaries, but only the library metadata.

3 Likes

That is so sweet to hear! That will definitely have a huge positive impact :slightly_smiling_face:

Thanks so much for your work

I’m not sure those are going to be required anymore because of the recent updates (see tbenst’s answer).

Thanks for the effort! Are you planning to add an overlay for R with MKL instead of OpenBLAS? We are trying to create one (or update R in nixpkgs to have an option to use MKL). MKL is the only thing that keeps my team from abandoning MRO; Microsoft seems to have lost interest in R, and MRO is stuck at version 3.5.3.

Great idea! We are currently building the tidyverse and a few other R packages.

Care to make a pull request adding an R overlay? If not, I’ll get around to it eventually.

Right now it’s just two jobs (one to build an R environment and one to build RStudio), but I’ve been meaning to do separate jobs for each R package.
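
For the overlay itself, the switchable BLAS/LAPACK mechanism in newer nixpkgs keeps it small (a sketch; older releases may need R’s blas/lapack inputs overridden directly, and MKL is unfree, so allowUnfree is needed too):

    # Overlay sketch: point the default BLAS/LAPACK at MKL so that R (and
    # anything else using the generic blas/lapack) picks it up.
    self: super: {
      blas = super.blas.override { blasProvider = super.mkl; };
      lapack = super.lapack.override { lapackProvider = super.mkl; };
    }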

Patching the binaries wasn’t as bad as I thought. I’m not sure whether everything was patched, but CUDA support is distributed with the binary from PyPI.
The closure can probably be reduced, but for my purposes it works and is far faster than attempting to compile it from source.

Nix expressions:
https://paste.sr.ht/~eadwu/3559ec6647fbe79e57b4b0b9b67ddd0d9130ffae

In case anyone comes across this: I’m not sure how strict a dependency this is, but it seems to prefer CUDA 10.1 (or at least one of the executables links against a CUDA 10.1 library):

    cudnn = pkgs.cudnn_cudatoolkit_10_1;
    cudatoolkit = pkgs.cudatoolkit_10_1;