An alternative would be to have every PR go through staging. We could make an exception for security-critical ones, of course.
Then we’d merge staging-next if and only if Hydra is all-green. That way, everybody who wants to get something into master is motivated to monitor the status of staging-next. We should encourage people to use reverts as the first measure to fix breakage; otherwise people are incentivized to just push to staging and let others deal with the fallout.
Maybe we could even get Hydra to run git bisect somewhat automatically. That would be a lot cheaper than building every PR.
So the flow would then be: I want to update a package. I open a PR against staging, do the usual quality checks that are common today (does it still work? do some reverse dependencies build?) and then get it merged.
At some point, staging gets promoted to staging-next. I see that there are several breakages in staging-next, none of which are caused by my update. It’s pretty obvious that failure 1 was caused by PR X, so I open a new PR to revert those changes and ping the author of the original PR. Some other failures are not quite as obvious, so I (or someone else) run git bisect. Eventually everything builds, staging-next gets merged, and the next staging gets promoted.
1. Fork Nixpkgs and patch nixpkgs/lib/licenses.nix to manually remove free = false for licenses that can be redistributed, e.g. unfreeRedistributable, issl, nvidia_cuda, and nvidia_cudnn.
2. Add a new attribute to licenses called e.g. redistributable, and create a system-wide allowRedistributable flag.
The former we can do on our own. The latter is IMHO the better solution, but I’m not sure how to go about it, both in terms of the code base and politically with maintainers; I’m not sure this idea would be well-received.
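The second option could look roughly like the sketch below. To be clear, the `redistributable` attribute, the `allowRedistributable` flag, and the check logic are all assumptions for illustration, not existing Nixpkgs API:

```nix
# In lib/licenses.nix: keep free = false, but add a hypothetical flag
# for licenses whose terms permit redistributing unmodified binaries.
unfreeRedistributable = {
  fullName = "Unfree redistributable";
  free = false;
  redistributable = true;  # hypothetical new attribute
};

# In the meta-checking logic, roughly: a package is allowed if it is
# free, or if the user opted in and every license is redistributable.
# allowed = lib.all
#   (l: (l.free or false)
#     || (config.allowRedistributable or false && l.redistributable or false))
#   licenses;
```

The appeal of this shape is that it mirrors the existing `free` attribute and the existing allow-unfree opt-in, so it would not require restructuring licenses.nix.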
We need to update the license for Nvidia to be more precise than unfree. I made a pull request here.
Also, if anyone has bandwidth to create a new nvidia derivation that aims to be redistributable, that would be awesome. I presume the build could be modified to only copy these specific files to $out. My understanding is that the output should only include the following files (per my reading of the license; this is what Anaconda distributes).
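If the permitted file list can be pinned down, the change might be as small as restricting the install phase. A hypothetical sketch via an override; the file globs below are placeholders, not the actual list from the EULA:

```nix
# Sketch only: restrict the installed files to the redistributable set.
cudatoolkit.overrideAttrs (old: {
  installPhase = ''
    mkdir -p $out/lib
    # Placeholder globs; replace with the files the EULA's
    # redistribution attachment actually permits.
    for f in libcudart.so* libcublas.so*; do
      cp -d lib64/$f $out/lib/
    done
  '';
  # Assumes `lib` is in scope (e.g. final.lib in an overlay).
  meta = old.meta // { license = lib.licenses.unfreeRedistributable; };
})
```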
Thanks! I finally think I understand the name… Hercules, slayer of Hydra.
@tomberek and I were able to get Hydra running, although there are some definite pain points in Nix with large files like cuda.run (3 GB), and in Hydra with large derivations like PyTorch (12 GB! We had to disable store-uri compression). It also took a fair bit of effort to figure out distributed builds; we didn’t realize that we needed an SSH key for hydra-queue-runner.
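For anyone hitting the same wall: hydra-queue-runner reads the standard Nix build-machines file, and it connects over SSH as its own user, so the key must be one that user can read. The host name and paths below are illustrative:

```
# /etc/nix/machines
# format: URI systems ssh-key maxJobs speedFactor supportedFeatures
ssh://nix@builder1 x86_64-linux /var/lib/hydra/queue-runner/.ssh/id_ed25519 8 1 big-parallel
```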
Would love to chat with you about Cachix, though; I’ll drop you a DM.
Any progress? I’ve basically given up on compiling PyTorch with CUDA locally; it just isn’t time-feasible without leaving it running overnight. The base expression takes under 40 minutes, but with CUDA support enabled I was only at ~63% after 3.5 hours.
Luckily, our machines have plenty of cores, so it does not take that long. But it is long enough to be annoying when we bump nixpkgs and something in PyTorch’s closure is updated. So instead, I have started to just use the upstream binaries and patchelf them.
I know it’s not as nice as source builds, but ‘builds’ finish in seconds. Still, it would be nice to have a (semi-)official binary cache for source builds.
(I primarily use libtorch, so this is only a derivation for that, but the same could be done for the Python module.)
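A minimal sketch of what such a derivation can look like, using autoPatchelfHook on the upstream binary distribution. The URL, hash, version, and dependency set are illustrative, not the actual derivation:

```nix
{ lib, stdenv, fetchzip, autoPatchelfHook }:

stdenv.mkDerivation {
  pname = "libtorch-bin";
  version = "1.5.0";  # illustrative

  src = fetchzip {
    url = "...";              # upstream libtorch zip; URL elided here
    sha256 = lib.fakeSha256;  # replace with the real hash
  };

  nativeBuildInputs = [ autoPatchelfHook ];
  buildInputs = [ stdenv.cc.cc.lib ];  # plus CUDA libraries as needed

  # autoPatchelfHook rewrites the RPATHs of the prebuilt .so files so
  # they resolve against the Nix store instead of FHS paths.
  installPhase = ''
    mkdir -p $out
    cp -r include lib share $out/
  '';
}
```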
So if you use the pinned nixpkgs on 20.03 and the same overlay, you should at least have some guarantee that the long build will succeed.
For caching, in practice we just need to integrate with Cachix to upload the binaries and we’ll be good to go.
The only thing holding us back is uncertainty around the licensing situation. I just sent Nvidia another email. As @xbreak pointed out, I think it is reasonable to conclude that we do not currently modify the object code of the binaries; rather, we modify the library metadata.
Thanks for the effort! Are you planning to add an overlay for R with MKL instead of OpenBLAS? We are trying to create one (or update R in nixpkgs to have an option to use MKL). MKL is the only thing that keeps my team from abandoning MRO. Microsoft seems to have lost interest in R, and MRO is stuck at version 3.5.3.
Patching the binaries wasn’t as bad as I thought. I’m not sure everything was patched, but CUDA support is distributed with the binary from PyPI.
The closure can probably be reduced, but for my purposes it works and is far faster than attempting to compile from source.