New Merge Policy for an always-green Hydra

Right now Hydra for nixpkgs-unstable and nixos-unstable is often stalled for weeks at a time because Hydra builds or tests are failing. This is not only inconvenient, but also potentially dangerous, as it blocks security-critical updates from being released.

I think this can be mostly avoided by switching to a merge-train style setup, where merges are not committed to master until they are fully tested. This makes it very unlikely that master goes red. I wrote a blog post about the concept, but I will fully re-explain it here as it relates to nixpkgs.

The basic concept is simple. When a PR is “ready”, instead of merging to master and calling it a day, we merge it into a queue and run the regular tests. Only when the relevant tests pass does master get updated.

Now, this is very similar to how the -unstable channels work today. The key difference is that when the tests fail, the change is automatically removed from the queue and all of the other changes are queued up for re-testing. This means that only the offending change gets skipped and later PRs keep moving as usual, as opposed to today, where channels stay blocked until someone fixes them manually.
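To make the queue behaviour concrete, here is a minimal sketch in Python (purely illustrative; `run_tests` and `merge_to_master` are hypothetical stand-ins for a full Hydra test run and the actual merge operation):

```python
from collections import deque

def process_queue(queue, run_tests, merge_to_master):
    """Sketch of the merge-queue loop: test each queued change on top of
    the current master and merge it only if tests pass. Failing changes
    are simply dropped, so later PRs keep moving."""
    pending = deque(queue)
    while pending:
        change = pending.popleft()
        if run_tests(change):          # full test run of master + change
            merge_to_master(change)    # master only advances when green
        else:
            print(f"dropping {change}: tests failed")  # author rebases and retries
```

For example, with a fake test predicate that rejects one change, `process_queue(["a", "bad", "c"], lambda c: c != "bad", merged.append)` leaves master with `["a", "c"]` merged and `"bad"` skipped.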

However there are complications:

Batching

As I understand it, we don’t have enough capacity in Hydra to test each commit separately. Even though Hydra will only rebuild and retest changed derivations, my understanding is that batching is still a notable load reduction.

This means that we will need to continue to batch changes. However, this complicates things quite a bit, because when a batch fails we don’t know exactly which commit caused the failure. We have a couple of options:

  1. Simply abort all of the queued changes. This allows them to be rebased, and we hope that the regular pre-merge testing will find the one that had a logical conflict.
  2. Run a bisection-like analysis to find the culprit. This will burn some Hydra capacity, but assuming it happens rarely it shouldn’t be too much of an issue, and by being smart (e.g. only rebuilding the broken derivations) it shouldn’t be too expensive. Afterwards we should be able to process bigger batches until we catch up. (Assuming we don’t accumulate conflicting changes quickly enough that it spirals into a cycle of constant bisection.)
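A rough sketch of what option 2 could look like, assuming exactly one bad change per failing batch and a hypothetical `batch_is_green` callback that rebuilds only the broken derivations with a given prefix of the batch applied:

```python
def find_culprit(changes, batch_is_green):
    """Bisect a failing batch down to the first bad change.
    `batch_is_green(prefix)` tests the batch with only that prefix of
    changes applied (cheap if it only rebuilds the broken derivations).
    Assumes exactly one change in the batch causes the failure."""
    lo, hi = 0, len(changes)  # invariant: prefix[:lo] green, prefix[:hi] red
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if batch_is_green(changes[:mid]):
            lo = mid
        else:
            hi = mid
    return changes[hi - 1]
```

This needs only O(log n) extra test runs per failure, which is what makes the “bigger batches until we catch up” recovery plausible.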

I think either will be an improvement on the long broken times we have now, but unfortunately I don’t think the built-in GitHub merge-train support can handle either of these cases, so we may have to roll our own (or at least have some tooling on top of the GitHub solution).

Impure Breakages

Sometimes things break because URLs drop off the face of the planet, or similar. Luckily this shouldn’t be a major issue because of Hydra’s caching, but it can blame “good” changes for merely revealing impure breakages. I don’t think there is much we can do about this; we will just have to fix the issue before merging in changes that cause the broken packages to be rebuilt.

Flaky Breakages

Flaky derivations will become more painful than they currently are, because they will cause merges to be rejected instead of just being merged and succeeding on the second (or third) evaluation. This can be mitigated by retries, but that adds extra Hydra load; the only real solution is to fix the root cause. I think that by quickly marking non-critical packages as broken, this will not be a blocker.
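For illustration, bounded retries could look something like this (a sketch; `build` stands in for a Hydra build of the flaky derivation, and the cap keeps the extra load and the temptation to paper over root causes in check):

```python
def build_with_retries(build, max_retries=2):
    """Retry a flaky build a bounded number of times before rejecting the
    merge. Unbounded retries would hide the root cause and burn Hydra
    capacity, so the cap is deliberately small."""
    for attempt in range(1 + max_retries):
        if build():
            return True
    return False
```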

Summary

I think this could help us keep the -unstable channels green. I need to spend a bit more time checking whether this would be expected to catch most of the breakages we see, but I would be interested in any thoughts on the idea itself or on easy ways to implement it for nixpkgs.

Sidenote: if this works well, we can probably drop the nixpkgs/nixos channel distinction, as IIUC the nixpkgs channel only exists because it is expected to break less often than nixos, due to fewer tests being run.


I think that k8s is doing a similar thing, so it might be worth having a look over there.

IIRC, the merge train is based on label queries. See also:


The label-based merge looks good. This is probably what we would need to do if we can’t fit our batching requirements into GitHub’s native merge-train feature.

But yes, as you mentioned, I would not merge directly but put changes into a merge train, so that we can test the final merged result instead of testing each change separately.

@timokau needs to read this, too. /cc marvin

+1 for this.

https://bors.tech/ is a bot that does pretty much this. All it needs is for Hydra to report build status on a single Git branch.


The requirements for building Nixpkgs remind me of Bake, which you may want to have a look at. Unfortunately it hasn’t been maintained for some time but it has some interesting underlying ideas that seem relevant for Nixpkgs. From the blog post Maximum Patches, Minimum Time:

The problem Bake is trying to solve is:

  • You have a set of tests that must pass, which take several hours to run.
  • You have a current state of the world, which passes all the tests.
  • You have a stream of hundreds of incoming patches per day that can be applied to the current state.
  • You want to advance the state by applying patches, ensuring the current state always passes all tests.
  • You want to reject bad patches and apply good patches as fast as possible.
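For what it’s worth, the core loop Bake describes could be sketched roughly like this (my own illustrative version, not Bake’s actual algorithm, which is considerably smarter about batching and scheduling):

```python
def advance(state, patches, passes_tests, apply_patch):
    """Greedy take on the Bake problem: speculatively apply the whole
    pending batch; if the combined result fails the tests, fall back to
    applying patches one by one, rejecting the bad ones. The accepted
    state always passes all tests."""
    candidate = state
    for p in patches:
        candidate = apply_patch(candidate, p)
    if passes_tests(candidate):
        return candidate, []                 # whole batch is good: one test run
    accepted, rejected = state, []
    for p in patches:                        # slow path: isolate bad patches
        trial = apply_patch(accepted, p)
        if passes_tests(trial):
            accepted = trial
        else:
            rejected.append(p)
    return accepted, rejected
```

The appeal is that in the common all-green case a whole batch costs a single test run, and the expensive per-patch path only runs when something actually broke.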

Bake sounds like exactly the workflow and logic that I am proposing!

Thanks for the ping! This is not really relevant for marvin. Marvin’s main purpose is basically allocating (hu)manpower (can’t think of a non-gendered term right now), not merging anything.

I do think it’s a very promising idea though; I have proposed something similar in the past.

One potential issue with your suggested approach comes to mind: How do you handle things that currently go through staging? If you keep the staging branch, you would still need to serialize the merge back into master which could potentially block things that would go straight through master right now.

I think the proposal sounds good. I have a question though: would it be useful to have a distinction between packages that could result in failing NixOS tests and other packages (which could go directly in master)?

I think this has two benefits:

  1. It reduces the size of builds needed for the merge train and bisection.
  2. Some things fail on Hydra because of stupid reasons. To give one example: amd-blis fails when it ends up on a Hydra node that is so old that it doesn’t have AVX instructions. Since this is a quick-to-compile leaf package, it’s OK to have it fail on Hydra on the odd occasion it ends up on an old build node.

I think this is a label that we do not have yet (unless I missed it), and which could probably be added relatively easily: affects-nixos-test, and perhaps a report to go with it, so that people could trigger the tests with ofborg.


Well, it sounds nice, but I’m afraid it won’t be as easy to implement as it sounds (in our case). All the checks are relatively expensive right now. Just evaluating all NixOS tests takes over ten minutes by itself.* We only do it once a day by default, surely in order to save resources. I think they also get rerun way more often than “necessary”, as this stuff just isn’t optimized for “good caching”.

:thinking: if we want to give more machine resources in exchange for improving this, I’d suggest (discussing) two relatively easy steps:

  • Evaluate trunk-combined more often.
    With that trivial step (a couple clicks for me) it would be much easier to quickly determine that we have a problem and narrow it down to the merge/commit that started it.
  • Include tests in a staging-next jobset.
    Right now there’s the complication that our “release” expressions don’t have a good way of including both Darwin stuff and NixOS tests in a single jobset, which is AFAIK why so far the staging-next branch (the big changes, often riskier) gets merged to master without running NixOS tests.
    Perhaps /cc @FRidh on that. We’ve certainly discussed this already, and maybe there’s been some kind of WIP, but I haven’t seen any of it reach practice.

* If I look at eval without tests vs. eval without darwin but with tests, I see eval times 96 vs. 706 seconds.


Perhaps I should explain… I do love perfect solutions, but my experience is that without gradual steps there’s usually quite a big risk of expending lots of energy and not really “deploying” anything at all.


One potential issue with your suggested approach comes to mind: How do you handle things that currently go through staging? If you keep the staging branch, you would still need to serialize the merge back into master which could potentially block things that would go straight through master right now?

Yes, I think you would treat the merge from staging to master the same as any other merge. It would have to go into the queue. Of course a lump of changes is more likely to fail but maybe that is working as intended?

vcunat Great contributor
September 5

We only do it once a day by default

Oh, I didn’t realize it was this infrequent; I probably should have checked. I guess I should set up my recurring donation to contribute to the evaluation. Even with once-a-day evaluations it might make sense to implement this, though we might need to switch to imperfect heuristics.

:thinking: if we want to give more machine resources in exchange for improving this, I’d suggest (discussing) two relatively easy steps:

These both sound helpful already, even if they don’t give the full benefits of the original proposal.

vcunat Great contributor
September 5

Perhaps I should explain… I do love perfect solutions, but my experience is that without gradual steps there’s usually quite a big risk of expending lots of energy and not really “deploying” anything at all.

I think the perfect solution is a full evaluation on each merge :sweat_smile: but that doesn’t seem feasible at this point.

However, this could be piloted somewhat in parallel. We could treat merges to master as the “queue” and see how the evaluations fare. Of course we won’t have the ability to do the actual “drop from queue” or “revert” action, but it may be possible to do some analysis on what we expect the result to look like before any big switch.
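Such a dry run could be as simple as replaying a green/red history of past evaluations and counting how many batches would have needed a drop/bisect step (a hypothetical sketch; extracting that per-merge history from Hydra is the real work):

```python
def simulate_queue(history, batch_size=10):
    """Rough dry-run analysis over past channel evaluations: treat
    consecutive merges as queue batches and count how many batches would
    have needed a drop/bisect step, i.e. contained at least one red
    evaluation. `history` is a list of booleans, True = green."""
    batches = [history[i:i + batch_size]
               for i in range(0, len(history), batch_size)]
    needing_bisect = sum(1 for batch in batches if not all(batch))
    return needing_bisect, len(batches)
```

If the fraction of red batches turns out to be small, that would suggest the bisection overhead is affordable; if most batches are red, we would spiral into constant bisection and need smaller batches or more pre-merge testing first.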

If there is interest I can look into doing this analysis and we can decide if we want to proceed further.

So how are you planning to proceed after factoring in all the feedback? Call out for participants in an ad-hoc working group on the design and implementation on a merge-train-like pull request merging workflow? What skill sets are needed?

With some more hands on the table, it might be easier to shoot for the bigger goals — regardless whether they are ultimately achieved or not.


If there is interest (which there seems to be), I think that what you mentioned seems like the right approach. There is definitely still some information that needs gathering, as I have already learned a lot from this discussion. But once the input settles down, what I will probably do is summarize the requirements and write up a sketch of the steps we could take to move towards a merge-train style workflow, then send it out to some sort of working group to fill in the details before submitting an RFC. I don’t know if there is a common workflow for preparing RFCs, or if it makes sense to submit a PR to the RFC repo for a procedure change like this?

I don’t know if we need any particular skills for this. Some basic statistics for estimating how the change flow would be affected, and some knowledge of GitHub and Hydra, would probably be useful for evaluating how much work possible solutions would be.


I think a lot of this would be possible if we had an infinite amount of compute resources. Ultimately, it’s very expensive to expand compute. This is doubly true for platforms like Darwin, where OS licensing can double the compute costs.

I think a lot of this is mitigated by the use of “nixpkgs-review”. For a lot of the more critical packages, I can ripgrep for NixOS tests which include the package and just run them locally. Obviously, it would be better if CI did this.

I’m actually somewhat opposed to having an “always green” policy. It means that if someone wants to update something like libxml2, they would be responsible for fixing all the broken reverse dependencies. We kind of do this with “staging-next”, where we try to stabilize any of the bigger failures, but maintainers of smaller packages are still largely expected to adjust their own packages if they break. It’s not a great process, but I’m not sure of a better way without causing burnout among the main contributors.

I’m not saying that “always green” shouldn’t be a goal; I’m just saying that with the current amount of man-hours and compute, it’s not really feasible. If we did want to make it feasible, we would either need to trim down the size of nixpkgs significantly or have a massive influx of contributors.

jonringer Great contributor
September 5

I’m actually somewhat opposed to having the “always” green policy. This means that if someone wants to update something like libxml2, they would be responsible for fixing all the broken reverse dependencies. We kind of do this with “staging-next” where we will try to stabilize any of the bigger failures. But maintainers of smaller packages are largely needed to then adjust their packages if they are broken. It’s not a great process, but I’m not sure of a better way without causing burnout on all the main contributors.

I don’t think merging the breakage to master and waiting for all of the maintainers of dependent packages to fix it is the best approach here. I think the better approach is to merge to a staging branch, then get the maintainers to submit fixes either to master (if the change is forwards-compatible) or to the staging branch. This way we can slowly collect the pieces of the migration without blocking the master branch. Of course this still doesn’t help with unresponsive maintainers, but at some point we need to call it and mark the unmaintained packages as broken. It should be possible to do that and then merge to master without ever breaking the branch, and it doesn’t seem like much extra work to me.

I’m not saying that “always” green shouldn’t be a goal, I’m just saying that with the current amount of man-hours and compute, it’s not really feasible. If we did want to make this feasible

I think it should be a goal, if only for the ability to ship security fixes. The other option is to create a separate mechanism to update a channel with a security fix applied, essentially sidestepping a broken master branch.

However, there are other benefits too. For example, people contributing packages and fixes would like to see their changes go out quickly; it is quite annoying when your change gets blocked for a week or so. Additionally, pulling master to start a change only to find it broken is quite annoying.


I don’t think merging the breakage to master and waiting for all of the maintainers of dependent packages to fix it is the best approach here.

Ever tried to bump glibc, gcc or openssl? Getting every single leaf package to build (and also to work, which is a totally different story, btw) is just not practical. A better solution would be to fix Hydra’s email notifications and to re-enable them after that.

I agree with @jonringer here. We also have ZHF iterations for a reason IMHO.

Also, it seems that Hydra-related impurities are being forgotten here: sometimes a package breaks due to too much load on a builder (I guess), or something fails with some kind of Illegal instruction error depending on the builder’s CPU. Another story is timed-out tests, or the fact that we don’t have that many aarch64 machines, so a lot of those builds occasionally get cancelled.


I still think a mixed-responsibility approach is best here:

  • The author of the change is responsible for notifying maintainers of broken reverse dependencies before the change is merged, and for giving them reasonable time to fix the breakage.
  • The maintainers are then responsible for implementing a fix.
  • The author is responsible for marking the remaining packages as broken after that reasonable time.

That should be scalable and maintain the always-green property (assuming “mark as broken” counts as green).

Edit: of course, the first step could work together with some sort of staging system and (semi-)automatic notifications. The “before” is the key part here, though.
