Right now the nixpkgs-unstable and nixos-unstable channels are often stalled for weeks at a time because Hydra builds or tests are failing. This is not only inconvenient but also potentially dangerous, as it blocks security-critical updates from being released.
I think this can mostly be avoided by switching to a merge-train style setup, where merges are not committed to master until they are fully tested. This makes it very unlikely that master goes red. I wrote a blog post about the concept, but I will re-explain it fully here as it relates to nixpkgs.
The basic concept is simple. When a PR is “ready”, instead of merging to master and calling it a day, we merge it into a queue and start running the regular tests. Only when the relevant tests pass does master get updated.
Now, this is very similar to how the -unstable channels work today. However, the key difference is that when the tests fail, the change is automatically removed from the queue and all of the other changes are queued up for re-testing. This means that only the offending change gets skipped and later PRs keep moving as usual, as opposed to the channel staying blocked until someone fixes it manually.
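To make the mechanics concrete, here is a minimal Python sketch of the (unbatched) queue behaviour described above. Everything in it is illustrative: `run_tests` and `publish_master` are hypothetical hooks standing in for a Hydra evaluation and a channel bump, not real APIs.

```python
from collections import deque

def process_merge_queue(queue: deque, master: list, run_tests, publish_master):
    """Toy merge train: a change only reaches master after the combined
    result (master + change) has passed the regular tests.

    run_tests(commits)   -- hypothetical hook: build/test the candidate tree
    publish_master(tree) -- hypothetical hook: fast-forward master / bump channel
    """
    while queue:
        candidate = queue.popleft()
        proposed = master + [candidate]
        if run_tests(proposed):
            # Tests passed: only now does master move, so it never goes red.
            master = proposed
            publish_master(master)
        else:
            # Tests failed: drop just this change; everything behind it is
            # retested against master without it and keeps moving as usual.
            print(f"rejecting {candidate!r}; later changes continue unaffected")
    return master
```

In practice the changes would be tested speculatively and in batches rather than one at a time, which is exactly the complication discussed below.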
However, there are complications:
Batching
As I understand it, we don't have enough capacity in Hydra to test each commit separately. Even though Hydra will only rebuild and retest changed derivations, my understanding is that batching is still a notable load reduction.
This means that we will need to continue batching changes. However, this complicates things quite a bit, because when a batch fails we don't know exactly which commit caused it. We have a couple of options:
- Simply abort all of the queued changes. This allows them to be rebased, and we hope that the regular pre-merge testing will then find the one that had logical conflicts.
- Run a bisection-like analysis to find the original culprit (sketched below). This will burn some Hydra capacity, but assuming it happens pretty rarely it shouldn't be too much of an issue, and by being smart (like only rebuilding the broken derivations) it shouldn't be too expensive. Furthermore, afterwards we should be able to process bigger batches until we catch up (assuming that we don't accumulate conflicting changes quickly enough that it spirals into a cycle of constant bisection).
I think either would be an improvement on the long broken periods we have now, but unfortunately I don't think the built-in GitHub merge-train support can handle either of these cases, so we may have to roll our own (or at least have some tooling on top of the GitHub solution).
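For reference, a rough sketch of what the culprit-finding bisection could look like. It assumes the failing batch applied cleanly on top of a green master, and `test_batch` is a hypothetical hook (e.g. a Hydra evaluation restricted to the derivations that actually broke), not an existing API:

```python
def find_culprit(changes: list, test_batch):
    """Bisect a failing batch to find the first change that breaks the tests.

    Invariant: the empty prefix (plain master) passes and the full batch fails.
    test_batch(prefix) -- hypothetical hook: test master plus the given prefix.
    """
    lo, hi = 0, len(changes)  # prefix of length lo passes, length hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if test_batch(changes[:mid]):
            lo = mid  # breakage is introduced later in the batch
        else:
            hi = mid  # breakage is already present in this shorter prefix
    return changes[hi - 1]  # first change whose inclusion makes the tests fail
```

Each culprit costs roughly log2(batch size) extra test rounds, and each round only needs to rebuild the derivations that were broken in the original batch, which is why I don't expect it to be too expensive as long as it stays rare.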
Impure Breakages
Sometimes things are broken by URLs dropping off the face of the planet or similar. Luckily this shouldn't be a major issue because of Hydra's caching, but it can still blame "good" changes for revealing impure breakages. I don't think there is much we can do about this; we will just have to fix the issue before merging changes that cause the broken packages to be rebuilt.
Flakey Breakages
Flakey derivations will become more painful than they currently are, because they will cause merges to be rejected instead of just being merged anyway and succeeding on the second (or third) evaluation. This can be mitigated by retries, but those cause extra Hydra load, and the only real solution is to fix the root cause. I think that by quickly marking non-critical flakey packages as broken, this will not be a blocker.
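As a rough illustration of what a bounded retry policy could look like (purely hypothetical; `build` stands in for whatever actually rebuilds the derivation):

```python
def build_with_retries(drv: str, build, max_attempts: int = 3) -> bool:
    """Retry a possibly-flakey derivation a bounded number of times.

    build(drv) -- hypothetical hook returning True on a successful build.
    Retries cost extra Hydra capacity, so the budget is kept small; anything
    that exhausts it should be marked broken quickly and fixed at the root.
    """
    for attempt in range(1, max_attempts + 1):
        if build(drv):
            return True
        print(f"{drv}: attempt {attempt}/{max_attempts} failed")
    # Out of retries: reject the merge and flag the derivation so it stops
    # blocking unrelated changes.
    return False
```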
Summary
I think this could help us keep the -unstable channels green. I need to spend a bit more time checking whether this would be expected to catch most of the breakages we see, but I would be interested in any thoughts people have on the idea itself, or on easy ways to implement it for nixpkgs.
Sidenote: If this works well, we could probably drop the nixpkgs-unstable/nixos-unstable distinction, as IIUC the nixpkgs-unstable channel only exists because it is expected to break less often than nixos-unstable due to fewer tests being run.