New merge policy for an always-green Hydra

The label-based merge looks good. This is probably what we would need to do if we can’t force our batching requirements into GitHub’s native merge-train feature.

But yes, as you mentioned, I would not merge directly but put changes into a merge train, so that we test the final merged result instead of testing each change separately.

@timokau needs to read this, too. /cc marvin

+1 for this.

https://bors.tech/ is a bot that does pretty much this. All it needs is for Hydra to report build status on a single Git branch.

10 Likes

The requirements for building Nixpkgs remind me of Bake, which you may want to have a look at. Unfortunately it hasn’t been maintained for some time but it has some interesting underlying ideas that seem relevant for Nixpkgs. From the blog post Maximum Patches, Minimum Time:

The problem Bake is trying to solve is:

  • You have a set of tests that must pass, which take several hours to run.
  • You have a current state of the world, which passes all the tests.
  • You have a stream of hundreds of incoming patches per day that can be applied to the current state.
  • You want to advance the state by applying patches, ensuring the current state always passes all tests.
  • You want to reject bad patches and apply good patches as fast as possible.
8 Likes

Bake sounds like exactly the workflow and logic that I am proposing!
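
To make the intended logic concrete, here is a rough sketch of the queue behaviour I have in mind. This is not Bake’s actual implementation; `advance`, `apply_batch` and `tests_pass` are placeholders for whatever Git/Hydra integration we would end up with:

```python
def advance(state, patches, apply_batch, tests_pass):
    """Apply as many patches as possible while keeping all tests green.

    apply_batch(state, batch) -> candidate state with `batch` applied on top
    tests_pass(state)         -> True iff the full test suite passes
    Returns (new_state, rejected_patches).
    """
    if not patches:
        return state, []
    candidate = apply_batch(state, patches)
    if tests_pass(candidate):
        return candidate, []              # the whole batch is good
    if len(patches) == 1:
        return state, list(patches)       # a single failing patch: reject it
    # Bisect: handle each half separately to isolate the bad patch(es).
    mid = len(patches) // 2
    state, rejected_a = advance(state, patches[:mid], apply_batch, tests_pass)
    state, rejected_b = advance(state, patches[mid:], apply_batch, tests_pass)
    return state, rejected_a + rejected_b
```

In practice `tests_pass` would mean “the relevant Hydra jobset finished green”, which is exactly where the cost concerns discussed below come in.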

Thanks for the ping! This is not really relevant for marvin. Marvin’s main purpose is basically allocating (hu)manpower (can’t think of a non-gendered term right now), not merging anything.

I do think it’s a very promising idea, though; I have proposed something similar in the past.

One potential issue with your suggested approach comes to mind: How do you handle things that currently go through staging? If you keep the staging branch, you would still need to serialize the merge back into master, which could potentially block things that currently go straight into master.

1 Like

I think the proposal sounds good. I have a question, though: would it be useful to distinguish between packages that could result in failing NixOS tests and other packages (which could go directly into master)?

I think this has two benefits:

  1. It reduces the amount of building needed for the merge train and for bisection.
  2. Some things fail on Hydra for stupid reasons. To give one example: amd-blis fails when it ends up on a Hydra node that is so old that it doesn’t have AVX instructions. Since it is a quick-to-compile leaf package, it’s ok to have it fail on Hydra on the odd occasion it ends up on an old build node.

I think this is a label that we do not have now (unless I missed it), and it could probably be added relatively easily: affects-nixos-test, perhaps with a report to go with it, so that people could trigger the tests with ofborg.
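
As a sketch of how the label could be applied automatically: a fair number of packages already point at their NixOS tests via `passthru.tests`, so a bot could evaluate that attribute for every package a PR touches and apply `affects-nixos-test` whenever it finds something. This assumes the `passthru.tests` convention is present for the packages we care about; anything without it would still need manual labelling (or a text search over `nixos/tests`). The helper and variable names below are hypothetical:

```python
import json
import subprocess


def nixos_tests_for(attr, nixpkgs="./."):
    """Names of the NixOS tests a package declares via passthru.tests, or [].

    Assumes `attr` is a top-level nixpkgs attribute (e.g. "amd-blis") and that
    `nixpkgs` points at the checkout under review.
    """
    expr = (
        f"builtins.attrNames "
        f"((import {nixpkgs} {{}}).{attr}.passthru.tests or {{}})"
    )
    result = subprocess.run(
        ["nix-instantiate", "--eval", "--strict", "--json", "-E", expr],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


# Hypothetical labelling rule for a bot:
# if any(nixos_tests_for(a) for a in attrs_touched_by_pr):
#     add_label("affects-nixos-test")
```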

2 Likes

Well, it sounds nice, but I’m afraid it won’t be as easy to implement as it sounds (in our case). All the checks are relatively expensive right now. Just evaluating all NixOS tests takes over ten minutes by itself.* We only do it once a day by default – surely in order to save resources. I think they also get rerun way more often than “necessary”, as this stuff just isn’t optimized for “good caching”.

:thinking: if we want to trade more machine resources for improvements here, I’d suggest (discussing) two relatively easy steps:

  • Evaluate trunk-combined more often.
    With that trivial step (a couple of clicks for me) it would be much easier to quickly determine that we have a problem and to narrow it down to the merge/commit that started it.
  • Include tests in a staging-next jobset.
    Right now there’s the complication that our “release” expressions don’t have a good way of including both Darwin stuff and NixOS tests in a single jobset, which is AFAIK why, so far, the staging-next branch (big, often riskier changes) gets merged to master without running NixOS tests.
    Perhaps /cc @FRidh on that. We’ve certainly discussed this already and maybe there’s been some kind of WIP, but I haven’t seen any of it put into practice yet.

* If I compare an eval without tests to an eval without Darwin but with tests, I see eval times of 96 vs. 706 seconds.

7 Likes

Perhaps I should explain… I do love perfect solutions, but my experience is that without gradual steps there’s usually quite a big risk of expending lots of energy and not really “deploying” anything at all.

4 Likes

One potential issue with your suggested approach comes to mind: How do you handle things that currently go through staging? If you keep the staging branch, you would still need to serialize the merge back into master, which could potentially block things that currently go straight into master.

Yes, I think you would treat the merge from staging to master the same as any other merge. It would have to go into the queue. Of course a lump of changes is more likely to fail but maybe that is working as intended?

vcunat:

We only do it once a day by default

Oh, I didn’t realize it was this infrequent; I probably should have checked. I guess I should set up my recurring donation to contribute to the evaluation costs. Even with once-a-day evaluations it might still make sense to implement this, though we might need to switch to imperfect heuristics.

:thinking: if we want to trade more machine resources for improvements here, I’d suggest (discussing) two relatively easy steps:

These both sound helpful already, even if they don’t give the full benefits of the original proposal.

vcunat:

Perhaps I should explain… I do love perfect solutions, but my experience is that without gradual steps there’s usually quite a big risk of expending lots of energy and not really “deploying” anything at all.

I think the perfect solution is a full evaluation on each merge :sweat_smile: but that doesn’t seem feasible at this point.

However, this could be deployed somewhat in parallel. We could treat merges to master as the “queue” and see how the evaluations fare. Of course we won’t have the ability to take the actual “drop from queue” or “revert” actions, but it may be possible to do some analysis on what we expect the result to look like before any big switch.

If there is interest, I can look into doing this analysis and we can decide whether we want to proceed further.
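
For what it’s worth, a first pass at that analysis could be quite simple: take the recent history of merges to master together with the evaluation result (e.g. from trunk-combined) that first contained each merge, and replay it as if a queue had been in place. A rough sketch; the function name and the input format are just assumptions for illustration, and the real data would have to be scraped from Hydra:

```python
def simulate_queue(history):
    """Replay (merge_commit, tests_green) pairs as if a merge queue existed.

    `history` is assumed to be chronological, one entry per merge to master,
    with `tests_green` taken from the first evaluation (e.g. trunk-combined)
    that contained the merge.  Returns a few counters to discuss.
    """
    stats = {"would_merge": 0, "would_bounce": 0, "longest_red_streak": 0}
    red_streak = 0
    for _commit, green in history:
        if green:
            stats["would_merge"] += 1
            red_streak = 0
        else:
            # Under the proposed policy this merge is sent back to its PR
            # instead of landing on master and blocking everything behind it.
            stats["would_bounce"] += 1
            red_streak += 1
            stats["longest_red_streak"] = max(stats["longest_red_streak"], red_streak)
    return stats


# Made-up example data, just to show the shape of the result:
print(simulate_queue([("abc123", True), ("def456", False), ("0a1b2c", True)]))
```

The interesting numbers would be how many merges would have been bounced back and how long master stays red today, i.e. exactly what the “drop from queue”/“revert” actions are meant to eliminate.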

So how are you planning to proceed after factoring in all the feedback? Call for participants in an ad-hoc working group on the design and implementation of a merge-train-like pull request merging workflow? What skill sets are needed?

With some more hands on deck, it might be easier to shoot for the bigger goals, regardless of whether they are ultimately achieved.

1 Like

If there is interest (which there seems to be), I think what you mentioned is the right approach. There is definitely still some information that needs gathering; I have already learned a lot from this discussion. Once the input settles down, what I will probably do is summarize the requirements and write up a sketch of the steps we could take to move towards a merge-train-style workflow, then send it to some sort of working group to fill in the details before submitting an RFC. I don’t know if there is a common workflow for preparing RFCs, or whether it makes sense to submit a PR to the RFC repo for a process change like this.

I don’t know if we need any particular skills for this. Some basic statistics would help for estimating how it would affect the flow of changes, and some knowledge of GitHub and Hydra would be useful for judging how much work the possible solutions would involve.

1 Like

I think a lot of this would be possible if we had an infinite amount of compute resources. Ultimately, it’s very expensive to expand compute capacity. This is doubly true for platforms like darwin, where OS licensing can double the compute costs.

I think a lot of this is mitigated by the use of nixpkgs-review. For many of the more critical packages, I can ripgrep for NixOS tests which include the package and just run them locally. Obviously, it would be better if CI did this.

I’m actually somewhat opposed to having the “always green” policy. This means that if someone wants to update something like libxml2, they would be responsible for fixing all the broken reverse dependencies. We kind of do this with “staging-next”, where we will try to stabilize any of the bigger failures. But maintainers of smaller packages are then largely expected to adjust their own packages if they are broken. It’s not a great process, but I’m not sure of a better way that doesn’t cause burnout for all the main contributors.

I’m not saying that “always green” shouldn’t be a goal; I’m just saying that with the current amount of man-hours and compute, it’s not really feasible. If we did want to make this feasible, we would either need to trim down the size of nixpkgs significantly or have a massive influx of contributors.

6 Likes

jonringer:

I’m actually somewhat opposed to having the “always green” policy. This means that if someone wants to update something like libxml2, they would be responsible for fixing all the broken reverse dependencies. We kind of do this with “staging-next”, where we will try to stabilize any of the bigger failures. But maintainers of smaller packages are then largely expected to adjust their own packages if they are broken. It’s not a great process, but I’m not sure of a better way that doesn’t cause burnout for all the main contributors.

I don’t think merging the breakage to master and waiting for all of the maintainers of dependent packages to fix it is the best approach here. I think the better approach is to merge to a staging branch, then get the maintainers to submit fixes either to master (if the change is forwards-compatible) or to the staging branch. This way we can slowly collect the pieces of the migration without blocking up the master branch. Of course this still doesn’t help with unresponsive maintainers, but at some point I guess we need to call it and mark the unmaintained packages broken. It should be possible to do that and then merge to master without ever breaking the branch, and it doesn’t seem like much extra work to me.

I’m not saying that “always green” shouldn’t be a goal; I’m just saying that with the current amount of man-hours and compute, it’s not really feasible. If we did want to make this feasible

I think it should be a goal, if only for the ability to ship security fixes. The other option is to create a separate mechanism for updating a channel with a security fix added, essentially sidestepping a broken master branch.

There are other benefits too, however. For example, people contributing packages and fixes would like to see their changes go out fast; it is quite annoying when your change gets blocked for a week or so. Additionally, pulling master to start a change only to find it broken is quite annoying.

4 Likes

I don’t think merging the breakage to master and waiting for all of the maintainers of dependent packages to fix it is the best approach here.

Ever tried to bump glibc, gcc or openssl? Getting every single leaf package to build (and also to work, which is a totally different story btw) is just not practical. A better solution would be to fix Hydra’s email notifications and re-enable them after that.

I agree with @jonringer here. We also have ZHF iterations for a reason IMHO.

Also, it seems as if Hydra-related impurities are being forgotten here: sometimes a package breaks due to too much load on a builder (I guess), or something fails with some kind of Illegal instruction error depending on the builder’s CPU. Timed-out tests are another story, as is the fact that we don’t have that many aarch64 machines, so a lot of those builds occasionally get cancelled.

5 Likes

I still think a mixed-responsibility approach is best here:

  • The author of the change is responsible for notifying maintainers of broken reverse dependencies before the change is merged and for giving them reasonable time to fix the breakage.
  • The maintainers are then responsible for implementing a fix.
  • The author is responsible for marking the remaining packages as broken after that reasonable time has passed.

That should be scalable and should maintain the always-green property (assuming “mark as broken” counts as green).

Edit: Of course the first step could work together with some sort of staging system and (semi-)automatic notifications. The “before” is key here, though.

9 Likes

I agree. I think that with some decent tooling, at least for notifications and ideally also for marking packages as broken, this would be a good solution.
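
A lot of the information needed for notifications is already in the tree, since `meta.maintainers` carries GitHub handles. Given a list of reverse dependencies that a change breaks (from nixpkgs-review, a Hydra jobset comparison, or similar), drafting the ping comment is mostly plumbing. A rough sketch, with hypothetical helper names and the broken-package list assumed to come from elsewhere:

```python
import json
import subprocess


def github_handles(attr, nixpkgs="./."):
    """GitHub handles of a package's maintainers, read from meta.maintainers."""
    expr = (
        f"map (m: m.github or null) "
        f"((import {nixpkgs} {{}}).{attr}.meta.maintainers or [])"
    )
    result = subprocess.run(
        ["nix-instantiate", "--eval", "--strict", "--json", "-E", expr],
        capture_output=True, text=True, check=True,
    )
    return [h for h in json.loads(result.stdout) if h]


def ping_comment(broken_attrs):
    """Draft a PR comment pinging maintainers of soon-to-be-broken packages."""
    lines = [
        "This change breaks the packages below; please fix them (or mark them",
        "broken) before it is merged:",
        "",
    ]
    for attr in broken_attrs:
        handles = " ".join(f"@{h}" for h in github_handles(attr)) or "(no maintainer)"
        lines.append(f"- `{attr}`: {handles}")
    return "\n".join(lines)
```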

1 Like

Is there an issue or discussion for things like this? It has been brought up a couple of times in this discussion, and it seems like a problem that shouldn’t be too hard to solve.