Improving NixOS data science infrastructure: CI for MKL & CUDA

Appreciate the productive & solution-oriented discussion :smiley: . Certainly a cachix-style solution that all data scientists could just add to their configuration.nix would be great. But that doesn’t address the CI/CD problem that Hydra solves so beautifully for an entire OS. Anyone who has tried to use an overlay with numpy.blas = mkl knows this pain. In the one year I’ve been using NixOS, I’ve never succeeded in building python3.withPackages (ps: with ps; [ numpy scipy pytorch sklearn ]) on unstable. I think this is largely because the maintainer test burden on reverse dependencies is high due to extremely long builds.
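For anyone unfamiliar with the pain being referred to: a minimal sketch of such an overlay, assuming the numpy expression still accepts a blas argument as it did around 19.09/20.03, looks roughly like this (not a tested expression):

    # Hypothetical sketch: point numpy at MKL so the scientific-Python
    # closure is built against it.
    self: super: {
      python3 = super.python3.override {
        packageOverrides = pySelf: pySuper: {
          numpy = pySuper.numpy.override { blas = super.mkl; };
        };
      };
    }

Because nothing in that closure is cached by Hydra, every reverse dependency (scipy, pytorch, sklearn, etc.) then rebuilds from source on the user’s machine.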

I completely understand why NixOS is free-software only by default. I think Ubuntu provides a good example here: main & universe are free & open source, but users can opt in to restricted (which provides binaries for CUDA) as well as multiverse (which provides binaries for MKL). My understanding is multiverse also provides builds against these libraries, e.g. caffe-cuda, although I haven’t installed it myself.

I think Canonical can be trusted to keep track of legality. Not to mention that RHEL/CentOS, Fedora, SLES, and OpenSUSE also redistribute CUDA. Conda does as well, and if I’m not mistaken, so does pip.

I’m confident we are in clear, legal territory if we wrap rather than modify, but if the main concern is licensing, I’m happy to reach out to Nvidia for clarification as it pertains to NixOS / hydra and report back.

Splitting CUDA into parts according to the license sounds like a good idea to me.

Just to clarify, CUDA itself is under one license, but separate CUDA libraries like cuDNN have slightly different supplements. Modern deep learning frameworks like PyTorch and TensorFlow depend on both CUDA and cuDNN. The only other Nvidia dependency I’m aware of for PyTorch is nccl, which has permissive licensing.

2 Likes

RHEL/CentOS, Fedora, SLES, and OpenSUSE also redistribute CUDA

We would be doing a slightly different thing, which is probably fine, but someone needs to make the call. Note that for mainline Hydra CUDA is nothing special, so the maintainers need to be able to make such calls uniformly and reliably… I think for a few years we only built non-branded Firefox on Hydra because of the minor patches for the NixOS layout of things, which were declared definitely fine as soon as we actually started discussing with Mozilla. I’m extrapolating from that precedent.

Speaking of nontrivial things: changing the layout during redistribution might make it verbatim redistribution of constituent parts, not verbatim redistribution of the original work in its entirety. I have no idea, and I have no desire to bear responsibility for calling it either way.

A separate collaboration of CUDA and MKL users does single out CUDA, can delegate the calls to people who have read the CUDA and MKL licenses in detail, etc.

the CI/CD problem that hydra solves so beautifully for an entire OS

I think CUDA is large enough that you should stop thinking in terms of an entire OS. Nix means that you can have a separate branch that stabilises some set of the related heavy things, and install the development environment for your data pipeline from there, without breaking your system. A bit suboptimal, but any other solution means holding your ground against the huge churn.
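For example (a sketch with a placeholder URL/revision, not a recommendation of any specific commit), a shell.nix for the data pipeline can pin its own nixpkgs independently of the system channel:

    # Hypothetical sketch: pin a known-good nixpkgs revision for the heavy
    # data-science closure, without touching the system's channel.
    let
      pinned = import (builtins.fetchTarball {
        url = "https://github.com/NixOS/nixpkgs/archive/<known-good-rev>.tar.gz";
        # sha256 = "<pin this for reproducibility>";
      }) { config.allowUnfree = true; };  # needed for MKL/CUDA builds
    in
    pinned.mkShell {
      buildInputs = [
        # add the rest of your stack here
        (pinned.python3.withPackages (ps: with ps; [ numpy scipy pytorch ]))
      ];
    }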

I think this is largely because the maintainer test burden on reverse dependencies is high due to extremely long builds.

Note that this also means that ofBorg would time out, that nobody would be surprised, and that not everyone would care.

I know RFC46 is not yet accepted, and it is about platforms and not individual packages, but I like the terminology we worked on there. Trying to keep everything heavy around Python data crunching green on unstable would mean asking for Tier-2 impact on merges for things that are not really served at the Tier-2 level by existing PR testing tooling. Actually, @timokau achieves this for Sage by doing a huge amount of high-quality work and always being there to help whenever any question about Sage impact arises. In the case of Sage, though, there is a lot of actual investigation to be done.

If there are enough people interested in doing the reviews, investigations, and fixes for master PRs to the things related to the data-crunching packages, just having dedicated build capacity with unprivileged accounts for everyone on the response team could be enough. Note that you need people to iterate quickly on proposed fixes, which probably means nix build is quite useful regardless of a Hydra instance.

Of course, once you have a reputation for keeping up with debugging the relevant PRs, an explicit OK from Nvidia and a binary cache (possibly lagging a few weeks) with a significant user base, you might have an easier time convincing people to make an exception. On the other hand, at that point this exception will be a smaller improvement than it would be right now.

In any case, I guess the first steps are the same regardless of the final goal: refactor to increase the chance of legality of redistribution, try to get Nvidia’s comments, and organise a debugging team.

3 Likes

Until nixpkgs does CI properly (all packages need to be green in order for a change to pass), this is just chasing our tail.

I have the same problem maintaining Cachix: Haskell packages are bumped and failing packages are just marked as broken. I don’t get a chance to fix mine, which results in a few angry people every few weeks.

I think the short-term solution is to use a separate distribution (a separate GitHub repository with its own binary cache, CI, etc.) where you can control the whole pipeline.

I’ve seen a few people interested in a data science repository, so getting a budget to host the agents should be possible.

Long term, nixpkgs has to change the process. I’m very much for a Linux-kernel-like workflow, which would enable everyone to work with the people they want to; it scales, as the Linux kernel has proven. Enforcing green CI across all PRs into a single branch just won’t scale, unless we get a crazy amount of computing resources we can waste.

3 Likes

I’ve been building the TensorFlow stack recently for various combinations of Python versions, optimization settings, and systems, pushing to Cachix and a private bucket. It would help usability to know which attribute sets work, but that requires enough people to curate and update them.

Some of the packages in this space require large builds. I’m considering experimenting with cached builds using ccache or Bazel.
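One way to experiment with ccache in nixpkgs (a sketch only; setting up the cache directory and allowing it in the build sandbox is glossed over here, and the package picked is just an example) is to swap in ccacheStdenv for a particularly heavy package via an overlay:

    # Hypothetical sketch: build one heavy package with the ccache-wrapped stdenv.
    self: super: {
      opencv4 = super.opencv4.override { stdenv = super.ccacheStdenv; };
    }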

1 Like

Long term, nixpkgs has to change the process. I’m very much for a Linux-kernel-like workflow, which would enable everyone to work with the people they want to; it scales, as the Linux kernel has proven.

I am not sure that the impact distribution is similar enough to blindly trust that scaling data: we might have a larger ratio of commits that substantially change high-impact parts; we cannot fully control the impact of our changes even in principle (let’s call a spade a spade here); and a single full rebuild of Nixpkgs takes days, instead of the tens of minutes for an everything-enabled kernel.

And for the low-reverse-dependency changes, which are similar in impact distribution to what happens with Linux wireless or Btrfs or whatever, building a PR doesn’t include a large build anyway, even now.

An alternative would be to have every PR go through staging. We could make an exception for security-critical ones of course.

Then we’d merge staging-next if and only if hydra is all-green. That way everybody who wants to get something into master is motivated to monitor the staging-next status. We should encourage people to use reverts as a first measure to fix things, since otherwise it would incentivize people to just push to staging and let other people deal with the fallout.

Maybe we could even get hydra to do git-bisect somewhat automatically. That would be a lot cheaper than building every PR.

So the flow would then be: I want to update a package. I open a PR against staging, do the usual quality checks that are common today (does it still work? do some reverse dependencies build?) and then get it merged.

At some point, staging gets promoted to staging-next. I see that there are several breakages in staging-next, none of which are caused by my update. It’s pretty obvious that failure 1 was caused by PR X, so I open a new PR to revert those changes and ping the author of the original PR. Some other failures are not quite as obvious, so I (or someone else) run a git-bisect. Eventually everything builds, staging-next gets merged, and the next staging gets promoted.

2 Likes

Noob question here: is it possible to install TensorFlow with GPU support from PyPI on NixOS?

Not out of the box, since the dynamic linker/library paths will be incorrect for NixOS. However, the tensorflow-bin derivation uses the PyPI package and patches up the library dependencies.
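The general pattern, sketched here with placeholder names and URLs (this is not the actual tensorflow-bin expression), is to fetch the prebuilt wheel and let autoPatchelfHook rewrite the bundled libraries so they resolve against Nix store paths:

    # Hypothetical sketch of packaging a prebuilt manylinux wheel on NixOS.
    { lib, fetchurl, python3Packages, autoPatchelfHook, stdenv, zlib }:

    python3Packages.buildPythonPackage rec {
      pname = "example-gpu-lib-bin";   # placeholder name
      version = "0.0.0";
      format = "wheel";

      src = fetchurl {
        url = "https://files.pythonhosted.org/packages/.../example-0.0.0-cp38-cp38-manylinux2010_x86_64.whl";
        sha256 = lib.fakeSha256;       # placeholder
      };

      nativeBuildInputs = [ autoPatchelfHook ];
      # Libraries the vendored .so files are expected to link against.
      buildInputs = [ stdenv.cc.cc.lib zlib ];
    }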

1 Like

I now have Hydra running, thanks to the help of many on GitHub and IRC. Here’s the NixOps deploy: GitHub - tbenst/nix-data-hydra: Hydra deployment for Nix Data. The next challenge is to figure out how to distribute only items that are under some form of unfreeRedistributable license.

Here are two approaches that come to mind:

  • Fork Nixpkgs and patch nixpkgs/lib/licenses.nix to manually remove free = false for licenses that can be redistributed, e.g. unfreeRedistributable, issl, nvidia_cuda, and nvidia_cudnn [1].
  • Add a new attribute to licenses, e.g. redistributable, and create a system-wide allowRedistributable flag.

The former we can do on our own. The latter is imho the better solution, but I’m not sure how to go about it, both in terms of the code base and politically with maintainers (not sure if this thought would be well-received).

[1] We need to update the license for Nvidia to be more precise than unfree. I made a pull request here.

Edit: realized the manual has a nice section on this. A third option is to handle it with a well-crafted overlay or config: Nixpkgs 23.11 manual | Nix & NixOS
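For illustration, a minimal sketch of the predicate approach from that manual section (assuming meta.license is a single license attrset rather than a list; the license names to allow are placeholders, not a legal determination):

    # Hypothetical sketch for ~/.config/nixpkgs/config.nix or the Hydra box:
    # only allow unfree packages whose license we believe permits redistribution.
    {
      allowUnfreePredicate = pkg:
        builtins.elem (pkg.meta.license.shortName or "unknown") [
          "unfreeRedistributable"
          "issl"
        ];
    }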

2 Likes

Quick update: my Hydra is broken, as no jobs de-queue (see hydra-queue-runner gets stuck while there are items in the queue · Issue #366 · NixOS/hydra · GitHub). If anyone can help troubleshoot, let me know; happy to give ssh access!

Also, if anyone has the bandwidth to create a new Nvidia derivation that aims to be redistributable, that would be awesome. I presume the build could be modified to only copy these specific files to $out. My understanding is the output should only include the following files (per my reading of the license; this is what Anaconda distributes). A rough sketch follows the lists below.

cuDNN:

cudnn.h
libcudnn.so

CUDA:

lib/libcublas.so
lib/libcublasLt.so
lib/libcudart.so
lib/libcufft.so
lib/libcufftw.so
lib/libcurand.so
lib/libcusolver.so
lib/libcusparse.so
lib/libnppc.so
lib/libnppial.so
lib/libnppicc.so
lib/libnppicom.so
lib/libnppidei.so
lib/libnppif.so
lib/libnppig.so
lib/libnppim.so
lib/libnppist.so
lib/libnppisu.so
lib/libnppitc.so
lib/libnpps.so
lib/libnvToolsExt.so
lib/libnvblas.so
lib/libnvgraph.so
lib/libnvjpeg.so
lib/libnvrtc-builtins.so
lib/libnvrtc.so
lib/libnvvm.so
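
Here is a rough sketch of what such a repackaging derivation might look like for the CUDA libraries listed above (cuDNN would be a similar, separate derivation). The attribute name cudatoolkit, the toolkit’s output layout, and the license tagging are all assumptions to be checked against the real expressions and the EULA:

    # Hypothetical sketch: repackage only the libraries listed above from the
    # full toolkit output. Not a legal determination of what is redistributable.
    { lib, stdenv, cudatoolkit }:

    stdenv.mkDerivation {
      pname = "cuda-redist-libs";
      version = cudatoolkit.version;

      dontUnpack = true;   # nothing to build; we only copy files

      installPhase = ''
        mkdir -p $out/lib
        for l in cublas cublasLt cudart cufft cufftw curand cusolver cusparse \
                 nppc nppial nppicc nppicom nppidei nppif nppig nppim nppist \
                 nppisu nppitc npps nvToolsExt nvblas nvgraph nvjpeg \
                 nvrtc-builtins nvrtc nvvm; do
          # Assumes the toolkit's shared objects live under ${cudatoolkit}/lib;
          # adjust for lib64 or split outputs as needed.
          cp -P ${cudatoolkit}/lib/lib$l.so* $out/lib/
        done
      '';

      meta.license = lib.licenses.unfreeRedistributable;
    }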

My offer to help you get hercules-agent running still stands :slight_smile:

1 Like

Thanks! I think I finally understand the name… Hercules, slayer of Hydra :laughing:.

@tomberek and I were able to get Hydra running, although there are some definite pain points in Nix with large files like cuda.run (3 GB), and in Hydra with large derivations like PyTorch (12 GB! We had to disable store-uri compression). It also took a fair bit of effort to figure out distributed builds; we didn’t realize that we needed an ssh key for hydra-queue-runner.
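For anyone hitting the same wall, the wiring is roughly the following (hostname, key path, and features are placeholders; the key must let the queue-runner user ssh to the builder):

    # Hypothetical sketch of a NixOS config enabling remote builds; the Hydra
    # queue runner can also pick these up via the generated /etc/nix/machines.
    {
      nix.distributedBuilds = true;
      nix.buildMachines = [{
        hostName = "builder.example.org";
        system = "x86_64-linux";
        sshUser = "hydra";
        sshKey = "/var/lib/hydra/queue-runner/.ssh/id_ed25519";
        maxJobs = 8;
        supportedFeatures = [ "big-parallel" ];
      }];
    }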

Would love to chat with you about Cachix, though; I’ll drop you a DM.

1 Like

If you can get Cachix working, that’s how the Hercules agent gets derivations and outputs; nothing goes through our central server.

If you need compute resources, we have a 16-core build box in the nix-community project. It would be nice to see it running with more CPU utilization :wink:

That’d be amazing! I’ll shoot you a DM.

Any progress? I’ve basically given up on compiling PyTorch with CUDA locally; it just isn’t time-feasible without leaving it on overnight. The base expression takes under 40 minutes, but with CUDA support enabled I was only at ~63% after 3.5 hours.

1 Like

Luckily, our machines have plenty of cores, so it does not take that long. But it is long enough to be annoying when we bump nixpkgs and something in PyTorch’s closure is updated. So instead I have started to just use upstream binaries and patchelf them.

I know it’s not as nice as source builds, but ‘builds’ finish in seconds. Still, it would be nice to have a (semi-)official binary cache for source builds.

(I primarily use libtorch, so this is only a derivation for that, but the same could be done for the Python module.)

4 Likes

Last time I heard, stites was working on a Hydra for the GPU libs.

For now you can use his GitHub - stites/pytorch-world: nix scripts for pytorch-related libraries repo with the cachix binary cache.

@stites: how close are you to a working automated binary cache for PyTorch? :slight_smile:

1 Like

That’s great :slight_smile: I think it would be useful to say why it’s better or worse than the officially recommended conda installation.

The README is also missing a “Getting started” section for those unfamiliar with Nix.

Made quite a bit of progress. We now build these jobsets against MKL and CUDA: GitHub - nix-community/nix-data-science: Standard set of packages and overlays for data-scientists [maintainer=@tbenst]. The build results are available here: https://hydra.nix-data.org/project/nix-data.

So if you use the pinned nixpkgs on 20.03 and the same overlay, you should at least have some assurance that the long build will succeed.

For caching, in practice we just need to integrate with Cachix to upload the binaries and we’ll be good to go.

The only thing holding us back is uncertainty around the licensing situation. I just sent Nvidia another email. As pointed out by @xbreak, I think it is reasonable to conclude that we do not currently modify the object code of the binaries, but rather modify the library metadata.

3 Likes