Nixpkgs's current development workflow is not sustainable

As a heavy user of nixpkgs pytorch (thanks btw @rehno), I’m confused by the discussion of why packages can’t be arbitrarily updated or removed. I don’t have the staging experience you guys have (@delroth, @samuela), and to be clear: maintainer involvement/pinging, modularity, the PR backlog, and discovering not-broken versions of packages are all obviously serious problems to me.

I thought the point was that Nix never promised working builds, just reproducibility. Even nixpkgs-unstable is a misnomer, since it’s not like the git history is being rewritten. Updates break things: on Debian, on macOS, on Arch, and I’m pretty sure npm and pip don’t even allow updates to be published until they break at least 5 downstream projects.

@rehno I’ve got a project using torch_1_8_1, cuda-enabled torch_1_9_0, and torch_1_9_1 via overlays, all in the same file. When I was upgrading, 1_9_1 was broken… and I just went and found another version that wasn’t. If nixpkgs.torch randomly changed to 1_11_0 and broke everything, I probably wouldn’t even know for 2 years. Finding a working version might be an open problem, but I’m never irritated when an update goes poorly; I expect it to go poorly.
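For anyone curious, that kind of side-by-side pinning can be sketched as an overlay that imports an older nixpkgs revision. The revision is a placeholder and `lib.fakeSha256` must be swapped for the real hash reported on the first build attempt:

```nix
# Illustrative sketch, not a tested expression: pin an older PyTorch by
# importing the nixpkgs revision that still carries it.
final: prev:
let
  oldPkgs = import (prev.fetchFromGitHub {
    owner = "NixOS";
    repo = "nixpkgs";
    rev = "<commit that still carries torch 1.9.1>";  # placeholder
    sha256 = prev.lib.fakeSha256;                     # replace with real hash
  }) { inherit (prev) system; };
in {
  # expose the pinned version under its own attribute, next to prev's torch
  torch_1_9_1 = oldPkgs.python3Packages.pytorch;
}
```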

What would be nice is if, when a package build failed, there was a link to a repo that had only that package in it, so I could fork the repo, patch it, use the patch, and ping the maintainer with a PR. Just like any other node/python/deno/rust/ruby/elixir/crystal/haskell/vs-code/atom package.


Sandro, before going into specifics on your reply, I’d just like to state what might be obvious: while I personally think there are issues and possible improvements to be made to the workflow that is currently being used, I’m still extremely grateful for your work (and FRidh’s, and many others’). This is a really hard problem to tackle, and the current (imo) flawed process is still better than nothing in many ways.

That sounds like some tooling improvements are in order. Why can’t Hydra send the maintainer(s) of a derivation a breakage notification when a failure happens on certain designated “important” branches (master, staging-next, release branches)? All the data should already be available for that.

This is a bit curious to me and goes against my assumptions. Is there any reason why Python modules would be less maintained (by their listed maintainer) than other derivations in nixpkgs? If they are less maintained, any idea why? If they are similarly well maintained as the rest of nixpkgs, any reason why we have a higher standard for freshness? We don’t do these mechanical bulk bumps for e.g. all GNU packages or other sets you could think about.

As a side note, sometimes there are good reasons to keep packages behind for a short time. We have a bunch of modules that mainly see use in nixpkgs because they’re home-assistant dependencies. HA is super finicky with its version requirements, it’s a pain, and I know we tend to batch updates along with HA version bumps to avoid too much breakage. Maintainers are more likely to be aware of that than a bulk update script and/or a reviewer without context. A recent-ish example I remember there is this pyatmo version bump which was merged before I (maintainer) had any chance to say a word on the PR (5h from PR creation to merge).

I try to be fairly responsive. I built my own tooling to send myself notifications when repology shows my maintained packages as out of date, which frankly sounds like something we ought to have as a standard opt-in - if not opt-out - tool for nixpkgs. I do think there are situations when maintainers are unresponsive. I don’t think it’s reasonable to assume that’s the default situation.
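Since a few people asked, a hedged sketch of what such opt-in tooling could look like, built on the public Repology API (`https://repology.org/api/v1/project/<name>`). The package list and the `nix_unstable` repo name are assumptions you would adapt; this is not the actual script I run:

```python
# Sketch: report maintained packages whose nixpkgs version lags behind
# the newest release known to Repology.
import json
import urllib.request


def lagging(entries, repo="nix_unstable"):
    """Given Repology project entries, return (our_version, newest_version)
    if `repo` lags behind the newest known release, else None."""
    newest = [e["version"] for e in entries if e.get("status") == "newest"]
    ours = [e["version"] for e in entries if e.get("repo") == repo]
    if ours and newest and ours[0] not in newest:
        return ours[0], newest[0]
    return None


def check(project):
    """Fetch one project's entries from the Repology API and compare."""
    url = f"https://repology.org/api/v1/project/{project}"
    with urllib.request.urlopen(url) as resp:
        return lagging(json.load(resp))


if __name__ == "__main__":
    for pkg in ["pyatmo"]:  # your maintained packages go here
        lag = check(pkg)
        if lag:
            print(f"{pkg}: nixpkgs has {lag[0]}, newest is {lag[1]}")
```

Hooking the output up to mail or a chat webhook is left as an exercise; the point is that the data is already public and the comparison is trivial.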

I unfortunately don’t think this is realistic for most maintainers. nixpkgs work is something I do on the side when I have time after my actual full-time work; I can’t spend that time trying to find all the changes that might be happening throughout the project and might be impacting my work. I suspect this is the case for many maintainers. Notifications are a different thing, though: they are directly actionable for me and would point me directly at what to fix.

Ideally, once python-updates or staging or some of these “wide impact branches” gets into a mostly stable state (like, most things work, only a few % of breakages remain), we’d flip a switch somewhere that sends notifications to all maintainers saying “we’re planning to merge this in X days, the following derivations that you maintain are broken by these large scale changes, please take action by coordinating fixes on PR #nnnnn and sending fixes on the yyyyyyy branch”.

This is still reactive vs. proactive, but we apply the shift-left principle: maintainers are told about the breakages much earlier giving them more time to act before most users are impacted, and potentially finding bugs before the branch is merged (more eyes on the work!).

FWIW, I don’t have a strong opinion on whether maintaining multiple versions is a good solution or just creating more maintenance problems. My gut feeling is towards the latter, but I didn’t spend much time thinking about this.

PS: I would have replied earlier, but I was spending my time yesterday fixing a derivation build that got broken by the last set of Python updates… flexget: unbreak by adding some more explicit dependencies by delroth · Pull Request #170106 · NixOS/nixpkgs · GitHub


I don’t know why there would be less support, but I can say it wasn’t until I started using nix that I realized how much system baggage Python modules have/assume. Almost no major Python module is actually written with just Python: numpy, opencv, matplotlib, even the stupid CLI progress bar tqdm (which claims it has no dependencies!) still hooks into tkinter, keras, and matplotlib and causes problems because of it. Python modules cause me the most pain of anything I’ve set up in nix, and I’ve compiled Electron apps in nix-shell.

Literally this morning with my robotics group we found that `import torch; import numpy` causes a segfault, but changing the order to `import numpy; import torch` works fine: the whole program runs, torch works great. It’s an absolute house of cards.


One way to think about it is this:

Every package bump that involves transitive dependencies is a little mini release ceremony. However it’s all on one person to make it all the way through to the very end before running out of steam:

  1. Update your package + dependencies of your package
  2. Fix the reverse dependencies of your dependencies (excluding your package)
  3. Fix the dependencies of those reverse dependencies
  4. Fix the reverse dependencies of those dependencies, and so forth…

My proposal is essentially to leave some packages (completely?) disconnected from the dependency graph until enough of them have been updated before taking that big stride forward.

Potentially, if people start connecting up the graph prematurely, it could cause problems? But I think the idea would be to keep hydra “evergreen”, so to speak, so that things keep building; if the top-level/* pins haven’t been rolled forward yet, you have to pin on your own (in your own overlay).


We already do that for hypothesis (usually we are 2 to 4 versions behind), pytest, setuptools, and probably some more that can cause breakages in almost every package. Maybe we need to formalize this list and add more things to it. We should also probably add more important downstream users to packages’ passthru tests. I encourage anyone who has good examples to send PRs.
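A hedged sketch of what adding such downstream users could look like, using the existing `passthru.tests` convention; all package names here are made up for illustration:

```nix
# Hypothetical library exposing important reverse dependencies as tests,
# so tooling like nixpkgs-review (and reviewers) rebuild them on every bump.
buildPythonPackage rec {
  pname = "somelib";    # hypothetical package
  version = "1.2.3";
  # src, dependencies, etc. elided

  passthru.tests = {
    # rebuild these consumers whenever somelib changes
    inherit (python3.pkgs) some-consumer another-consumer;  # hypothetical names
  };
}
```

The tests can then be built explicitly with something along the lines of `nix-build -A python3Packages.somelib.passthru.tests`.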

Also feel free to ping me if you get stuck trying to debug some Python failure, but only if you have a complete log. I don’t have time to dig deeper into issues when I am on the go, but I can usually take a quick look at a log and drop some ideas about what might be causing it or where to dig further.

@hexa can we somehow do this? This sounds like something which could improve the situation for everyone drastically. Maybe a notification channel that people can subscribe to?

Nothing we can really fix to be honest.

Like Debian, we can’t have every version and every variant of a software. We can make exceptions like for example for ncdu because version 2 does not yet work on darwin.

Probably, but the majority of people doing dev work on every other distro are also using some software from somewhere else. We can’t possibly satisfy everyone’s needs.

For example, I have many overlays where I pull in patches from unmerged PRs or update to a development version of some software. It wouldn’t really work if those were in nixpkgs.

No, it doesn’t. If you have very special and very customized needs, you can quite easily adapt nixpkgs to work for you. If there are overlays that a big majority of people need to use, we need to think about how to upstream them, but we can’t make nixpkgs work for every situation for everyone.

Do you and all the others have any overlays on your systems right now that really should be upstreamed? Please share them so we can talk about them and hopefully find better answers to the problems than “some program needs to be updated to a new version of X”.

Those are literally overlays where you can just copy the contents of a .nix file in, plus a few lines of boilerplate code, to get things working quickly again. I often do that if I really don’t have time and need things back working fast.
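That copy-the-fix workflow can also be done without copying files at all, by pulling the patch straight off the unmerged PR. A sketch; the PR number is a placeholder and `lib.fakeSha256` must be replaced with the hash Nix reports on the first build:

```nix
# Illustrative overlay: apply an unmerged nixpkgs PR's fix as a patch.
final: prev: {
  somepkg = prev.somepkg.overrideAttrs (old: {
    patches = (old.patches or [ ]) ++ [
      (prev.fetchpatch {
        url = "https://github.com/NixOS/nixpkgs/pull/000000.patch";  # placeholder PR
        sha256 = prev.lib.fakeSha256;  # replace with the reported hash
      })
    ];
  });
}
```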

There actually were email notifications, but they were disabled because they sent way too many emails, which all quickly landed in spam. The feature could maybe be resurrected as opt-in.

No; maybe there are too many packages? Maybe people got used to not needing to maintain them that well.

I don’t think we can really compare Python to GNU libraries. A comparable ecosystem would be node/npm/etc., Ruby, Go, Haskell, or Rust. Go and Rust mostly use vendored dependencies, so this particular issue does not apply to them, but others do, like vendored packages not getting CVEs patched. Ruby in nixpkgs is an order of magnitude smaller than Python. And finally node, which has even bigger auto-updates and is not even recursed into because it is too big. Haskell in nixpkgs is kind of similar to Python, with big auto-updates that have already broken hadolint for me multiple times. The only difference I know of is that Haskell packages follow semver better.

Home-Assistant is an end-user program, which allows us to override anything it requires to a known-working version, something we cannot easily do in pythonPackages. Also, many of its dependencies are not widely used in nixpkgs.

Is that open source? Can we have a link to it?


Ruby has the Gemfile.lock standard which allows vendoring via bundix, so you mostly only need to put non-ruby dependency fixes into nixpkgs. So I think it is somewhat comparable to Go/Rust. Except no one has figured out how to hash the Gemfile.lock/bundix’s gemset.nix to avoid adding it to nixpkgs for applications.


I’ve been working on getting us merge trains, but we’re still not there with GitHub.

Reading all the well-put feedback here, it seems that for staging merges we’d like to compile a list of packages broken compared to master and ping all their maintainers.

Anyone that’s up for writing such a script?


Quick concrete suggestion that could help with “versioning” workflow via overlays: Flake inputs should be able to do a sparse checkout of a Git repository · Issue #5811 · NixOS/nix · GitHub

Compared to my current workflow, which involves painfully updating hashes.


[RFC 0109] Allow "import from derivation" in Nixpkgs, simply-stupidly, and safely by Ericson2314 · Pull Request #109 · NixOS/rfcs · GitHub I think would help with these things. I guess we might need more shepherds to unblock things?

Our community is in limbo as waste-of-time projects like Flakes suck all the energy out of the room, yet there is little governance mechanism to coordinate tackling the problems Nixpkgs faces in bold, innovative ways.

We have these conversations year after year, but nothing happens except various tireless contributors working harder and harder.

The only way to make things more sustainable is

  • to improve the technology Nixpkgs uses
  • to work with upstream communities to tackle the social problems together.

If we have fixed governance, we can actually tackle these issues. But right now, we cannot.


Our community is in limbo as waste-of-time projects like Flakes suck all the energy out of the room

Characterizing flakes as a waste of time while at the same time bringing up your pet project RFC 109 (which won’t actually solve the issues brought up in this discussion) seems rather odd.

Flakes do in principle allow us to reduce the scope of Nixpkgs, by moving parts of Nixpkgs into separate projects. E.g. a lot of “leaf” packages and NixOS modules could be moved into their own flakes pretty easily. But it’s important to understand that this does not solve the integration/testing problem - it just makes it somebody else’s problem.


I have come to a similar conclusion (not based on this particular example) and we’re making some progress, but these things take a very long time to communicate well. We also don’t want to rush them, even if they are long overdue.

Rebuilding trust and making responsibilities explicit is important for the future of Nix.


But it’s important to understand that this does not solve the integration/testing problem - it just makes it somebody else’s problem.

We agree! That is why I don’t believe in breaking up Nixpkgs — maybe some stuff can move out, but we still need a final “integrate it all” repo, and keeping it up to date is just as hard.

Characterizing flakes as a waste of time

You have on other occasions said that Flakes is intentionally not trying to change how Nixpkgs works in the short term. I also agree that is a good decision.

But if that’s true in the short term, and in the longer term splitting up Nixpkgs (one thing flakes could be good for) doesn’t help, then Flakes is a waste of time as far as Nixpkgs is concerned.

(You have intended Flakes to help with the learning curve, which is more plausible. Maybe more users that don’t rage-quit Nix, which we agree is a problem, could even help with Nixpkgs too. But we can’t really discuss that until that vision is written up, and it’s off topic here anyway.)

While at the same time bringing up your pet project RFC 109 (which won’t actually solve the issues brought up in this discussion)

Well first of all, RFC 109 is hardly a pet from an engineering standpoint :).

RFC 92 is a pet: I find it beautiful, but with a long, slow payoff before practical benefits. RFC 109 is a gross hack to try to get us some immediate benefits. Gross hacks are not pets.

The point of RFC 109 is to allow using lang2nix tooling in Nixpkgs immediately, which I think does somewhat help with these issues: we can start deleting checked-in autogenerated code, which is a maintenance burden, and using more autogenerated code where the costs of doing so today meant we opted not to.

Given there is explicit talk about python packages, multiple versions, etc. in this thread, I think that’s on topic! It’s not a slam dunk and doesn’t solve everything, but it gets us to a slightly better position.

Do you have paths to any “easy wins” in the short term to counter-propose?


Hello! I’m a bit late to the party…

I, too, have been trying to participate in maintaining CUDA packages lately. In fact, they are almost exactly the part of nixpkgs that I’d like to discuss here, and the part that (this time) brought to the surface many of the pain points that @samuela mentions. At the risk of going off-topic, I will try to fill in some details.

The context is that nixpkgs packages a lot of complex “scientific computing” software that is, for practical purposes, most commonly deployed with unfree dependencies like CUDA (think jax, pytorch or… blender). In fact, with a bit of work Nix appears to be a pretty good fit for deploying all of that software. All of the same packages are available through other means of distribution as well, like Python’s PyPI, conda, or mainstream distributions’ repositories. Most of the time they will “just work”; you’ll get tested, pre-built, up-to-date packages. Except they break, for all sorts of reasons. One Python package overwrites files from another Python package and all of a sudden faiss cannot see the GPU anymore. Fundamentally, with Nix and nixpkgs one could implement everything that these other building-packaging-distribution systems do, but with more control and predictability. With fewer breaks.

That is the theory. The practice is that nixpkgs, for known reasons, has no continuous integration running for CUDA-enabled software. The implication is that even configuration and build failures go unnoticed by maintainers, let alone integration failures, or failures in tests involving hardware acceleration. This also means eventual rot in any chunks of Nix code that touch CUDA. It probably wouldn’t be too far from the truth to say that the occasionally partially-working state of these packages (their CUDA-enabled versions, that is) has been largely maintained through unsystematic pull requests from interested individuals, looked after by maintainers of adjacent parts of nixpkgs.

One attempt to address this situation or, rather, an ongoing exploration of possible ways to address it was the introduction of @NixOS/cuda-maintainers, called for by @samuela. It is my impression so far that this has been an improvement:

  • this introduced (somewhat) explicit responsibility for previously un-owned parts of nixpkgs;
  • we started running builds (cf. this) and caching results (cf. and many thanks to @domenkozar!) for the unfree sci-comp packages on a regular basis, which means that related regressions are not invisible anymore;
  • in parallel, @samuela is running a collection of crafted integration tests, in a CI that tries, on schedule, to notify authors about merged commits that introduce regressions, cf. nixpkgs-upkeep;
  • three previous items made it safe enough to start slowly pruning the outdated hacks and patches in these expressions, and even to perform substantial changes to how CUDA code is organized overall (notably, the introduction of the cudaPackages with support for overrideScope'; many thanks to @FRidh!)
  • working in-tree also has the additional benefit that the many small fixes, extensions, and adjustments that people make in overlays can start migrating upstream, and might even be reflected in how downstream packagers handle CUDA dependencies when they first introduce them.

Essentially, this is precisely the kind of initiative that @7c6f434c suggests: we target a well-scoped part of nixpkgs and try to shift the status quo from “this is consistently broken” to “mostly works and has users”.

Obviously, there are many limitations. First of all, we don’t have any dedicated hardware for running those builds and tests, which means our current workflow simply isn’t sustainable: it exists only as long as we are personally involved. Lack of dedicated hardware also implies that it’s simply infeasible for us to build staging (we’ve tried). In turn, this implies that we can only address regressions after the fact, when they have already reached master - one of @samuela’s concerns. It gets worse, however: there’s no feedback between our fragile, constantly-changing, hand-crafted CI and the CI of nixpkgs. Thus when regressions have reached master, there’s nothing stopping them from flowing further into nixos-unstable and nixpkgs-unstable! Unless the regressions also affect some of the selected free packages, they’ll be automatically merged into the unstable branches. This problem is not hypothetical: for example, many of the regressions caused by the gcc bump took longer to address for cuda-enabled packages than for the free packages.

One conclusion is that our original implicit goal of keeping the unstable branches “mostly green” was simply naive and wrong, and we can’t but choose a different policy. A different policy both for maintaining and for consuming these unfree packages: I tried for a while (months) to stay on the release branch. I had to update to nixpkgs-unstable because the release had too many broken things that we’d already fixed in master, but even regardless: the release branch has (for our purposes) too low a frequency, missing packages, missing updates. My understanding is that many people (@samuela included) treat nixpkgs-unstable as a rolling release branch. From the discussion in this thread, this interpretation appears to be not entirely correct, but maybe it’s what we need. One alternative workflow we’ve considered (but haven’t discussed in depth) for cuda-packages specifically is maintaining and advertising our own “rolling release” branch, which we would merge things into only after checking against our own (unfree-aware and focused) CI. Perhaps this is also where merge trains could be used to save some compute.

The complexity of building staging and even master (it turns out that when you import with config.cudaSupport = true, or override blas and lapack, you trigger a whole lot of rebuilds) begs a further question: do we have to build that much?

There might be a lot of fat to cut. One thing we’ve discovered from the inherited code base, for example, is that the old runfile-based cudatoolkit expression, whose NAR weighs slightly above 4GiB, has a dependency on fontconfig and alsa-lib, among other surprising things. The new split cudaPackages don’t have this artifact, but the migration is still in progress and we do depend on the old cudatoolkit. That’s very significant. It means that every time fontconfig, or alsa-lib, or unixODBC, or gtk2, or the-list-goes-on is updated, we have to pump an additional 4GiB into the nix store, we have to rebuild cudnn and magma, we have to rebuild pytorch and tensorflow, we have to rebuild jaxlib, all of which fight in the super heavyweight class!
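For reference, this kind of surprising closure dependency can be traced with the stock Nix CLI; the invocations below are illustrative (attribute paths assume a flakes-enabled Nix and a realized store):

```
# total closure size of the runfile-based toolkit, human-readable
nix path-info -Sh nixpkgs#cudatoolkit

# why does the toolkit's closure contain fontconfig at all?
nix why-depends nixpkgs#cudatoolkit nixpkgs#fontconfig
```

`nix why-depends` prints the chain of store-path references that keeps the dependency alive, which is usually enough to spot the offending output or wrapper.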

The issue is far more systemic, however, than just cuda updating too often. These same behemoth packages, pytorch and tensorflow, have, through a few levels of indirection, such comparatively small and high-frequency dependencies as pillow (an image-loading package for Python that’s only used by some utils modules at runtime), and pyyaml, and many more. Many of these are in propagatedBuildInputs, some are checkInputs, but most of them are never ever used in the buildPhase and cannot possibly affect the output. What they do is signal a false positive and cause a very expensive rebuild. On schedule. Is this behaviour inherent and unavoidable for any package set written in Nix? Obviously not.

I suspect one could, if desired, write “a” nixpkgs that would literally be archlinux and rebuild just as often (which is probably rather infrequent, compared to “the” nixpkgs), just by introducing enough boundaries and indirections. Split phases in most packages. Not that this would be useful per se; it’s interesting as the opposite extreme, the other end of the spectrum from what current nixpkgs is. And maybe we should head somewhere in between: rebuild when build results could have actually changed, test when test results could.

This brings me to the “nixpkgs is too large” part. I guess it’s pretty large. The actual problem could be, however, that too many things depend on too many things. I’m new to Nix, and just a year ago I had many more doubts and questions about decisions made in nixpkgs, not least of them the choice of monorepo. Now I have actually come to respect the current structure: it might really be boosting synchronization between so many independent teams and individual contributors, helping the eventual consistency. When a change happens, the signal propagates to all affected parties much sooner than it probably ever could with subsystems in separate flakes. I don’t think we need to change this. Keeping the “signal” metaphor, I think we need to prune and hone the centralized nixpkgs that we have so as to reduce the “noise”, like these false positives about maybe-changing outputs. We need better compartmentalization in the sense of how many links there are between packages (that’s common sense), but that does not necessitate splitting nixpkgs into multiple repositories.

I like the idea about automated broken markers (they save compute, and they spread the signal that a change is needed). I also like the idea of integrating our CI into the status checks. Obviously, for that to happen the unfree CI must be stabilized first, and we must find a sustainable (and scalable) source of storage and compute for it. I don’t think it impossible at all that build failures even for unfree packages might be integrated as merge blockers in future. In fact, I think it inevitably must happen as Nix and nixpkgs grow, and new users come with the expectations they have from other distributions: these parts of nixpkgs will see demand, and maintaining them “in the blind” is impossible. It’s not happening overnight of course. As pointed out by others, there are bureaucracy and trust issues, there are even pure technical difficulties: even if we came up with an agreeable solution, it’s just a lot of work to build a separate CI that wouldn’t contaminate what’s expected to be “free”, and integrate that with existing automated workflows.


And this is, I think, the real issue: the expectation that master and nix*-unstable are rolling releases that should always be green. Having every package pass all the time is utterly impossible. There are incompatibilities that either

  • can be resolved easily by whoever made the change impacting the package now breaking;
  • can be resolved by the maintainer, but generally speaking this takes time, if it happens at all. We can’t hold back all updates of non-leaf packages;
  • cannot be resolved without significant effort, either with or without upstream.

In that sense, I think users should know what packages matter to them and have their “own channels”. Following nix*-unstable works fine for core packages, but there is indeed a good chance that a package or two you like does not work. Then it is better, I think, to have your own channel, and preferably not one but several, where you split up your dependencies into groups. E.g., your NixOS core system might follow nixos-unstable just fine, while your Python environment with scientific computing packages follows another channel, independently. Users can set this up with CI themselves, and it is even easier with flakes. What would be nice to have is perhaps a better service for this than having to set up GH Actions and optionally a binary cache.
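A hedged sketch of that split-channel setup as a flake; the input names and the pinned revision are illustrative, not a recommendation:

```nix
{
  inputs = {
    # rolling channel for the core system
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    # independently advanced pin for the scientific Python stack
    nixpkgs-sci.url = "github:NixOS/nixpkgs/<known-good-commit>";
  };

  outputs = { self, nixpkgs, nixpkgs-sci }:
    let
      system = "x86_64-linux";
      pkgs = nixpkgs.legacyPackages.${system};
      sciPkgs = nixpkgs-sci.legacyPackages.${system};
    in {
      # core tools come from the rolling input, the Python environment
      # from the pinned one; each input can be bumped on its own schedule
      devShells.${system}.default = pkgs.mkShell {
        packages = [ (sciPkgs.python3.withPackages (ps: [ ps.numpy ])) ];
      };
    };
}
```

Bumping only `nixpkgs-sci` when your CI says the scientific set is green is then a single `nix flake lock --update-input nixpkgs-sci`.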

In the case of Python, we should do the wheel building in a separate derivation. That could save a lot of work for larger builds, and especially when we go to content-addressed derivations it gets even more interesting, as it could result in not having to do any further rebuilds at all.


Not really?
For me, two channels have always led to big, uncontrollable issues, and switching leads to state issues and, ultimately, replacement of the whole system, as it is utterly impossible to debug and fix…
In my experience that only works for small and limited environments (and even there, not over time).

(My desktop is neither small nor standard-sized, nor are my data science environments; they depend, e.g. via the GPU and the Spyder IDE, Jupyter, etc., on graphics/UI.)

I would be happy to see any kind of change and/or progress in the python-nixos ecosystem (after years of resignation).


If each and every user is expected to maintain their own channel/overlay/flake, what is the point of having a centralized package repository? If we’re going to pass the maintenance off to users then it doesn’t seem like we’re fit to do the maintenance in the first place.


There can be more community-maintained channels, which is a relatively cheap approach. E.g., a scientific Python channel that only advances when the subset of scientific computing packages passes.


fix python to allow multiple versions of dependencies
Allowing Multiple Versions of Python Package in PYTHONPATH/nixpkgs

then expensive packages (tensorflow …) can pin all their dependencies (“bottom-up pinning”, inversion of control), and we avoid rebuilds

then nix can get closer to its promise

the current python situation (collisions in site-packages/) is like the FHS situation (collisions in usr/) that nix wants to solve

… and we avoid rebuilds

the other strategy is to make builds cheaper, for example with Incremental builds
but it requires more complexity (normalize sources, store objects, patch objects)


It’s very nice to read about the CUDA specifics!

In the vein of the above, it would be great to:

  • Find a way to build these packages in CI! It would be nice if someone in this community could find someone at NVidia to get the licensing (is that the only issue?) sorted out. Or do we need donated hardware? For all we know, NVidia could itself benefit from some Nix CI :O.

  • See if some OpenCL / Vulkan champion wants to build up our free-software alternatives? I feel like Nixpkgs could also be a good venue for putting all the myriad pieces together to allow non-CUDA GPGPU to work well. Ideally we’d get an upstream OpenCL / Vulkan project interested in Nixpkgs precisely because it is the best way to put all the pieces together.


I think this is a great idea! I’m not sure what the hold up would be other than getting hydra to build unfree packages. We don’t even need to cache them for now! Simply building them would be immensely useful for CI. What’s necessary in order to get hydra to build unfree packages?

There’s no need for special hardware to build CUDA-enabled software, but a GPU is necessary for running GPU-enabled tests. Presently, there are no GPU-enabled tests anywhere in nixpkgs AFAIK. (The Nix build environment blocks access to the GPU by default.) I created a preliminary set of tests here: GitHub - samuela/cuda-nix-testsuite but it’s still early days and I think we’re still working on figuring out what the best solution is for tests requiring a GPU.
