Steps towards even more PR automation

Kicked off by this thread on “Quality control for packages”, I would like to present my long-term vision and wishes for automating package upgrades and QA.

I would especially like to hear what’s already planned by our great people working on build automation, but also from those who spend a lot of time reviewing/testing/merging PRs, and in general whether others share this vision.

Starting with a quote from the other thread:

That’s right. ofborg and r-ryantm are already very good steps in the right direction. I think we should go all-in on even more automation.

This is how I think we can master the pull request flood in the long term:

  • Fully automatic nix-review builds for all r-ryantm bot upgrades, so that all dependent packages are automatically rebuilt. This should also happen for mass rebuilds, and it should be against master, even if the PR is against staging, because staging is often broken and the master build will tell us most of what we need.
  • Fully automatic NixOS tests for them as well, with results posted on the PR. Right now we run NixOS tests only on user demand.
  • Automatic merging of r-ryantm bot point releases, e.g. 1.2.3 -> 1.2.4, after the above checks have passed.
  • Add NixOS tests for a lot more apps (both CLI and desktop).
  • Establish an evidence system for NixOS tests that says: “If this test passes / is accepted by a human, then the following packages X, Y, Z are probably working fine.” For example, we could say that if a test screenshotting Firefox showing various image file types and fonts looks OK (a human would make that statement on the PR), then libpng, libjpeg, and all the font libraries Firefox depends on are probably OK. Thus, the workflow would be that if a bot proposes an update to libpng, further automation would say “hey humans, I have built the dependent Firefox and Evince against this, here are the test outputs; if you say the screenshots of one of those look OK, I will auto-merge this PR”.
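As a sketch of what such an evidence declaration could look like: nixpkgs already has a `passthru.tests` convention, but the `vouchesFor` attribute below does not exist anywhere; it is made up purely to illustrate the idea.

```nix
# Hypothetical schema: attach a rendering test to Firefox and declare
# which of its dependencies a passing (human-accepted) test vouches for.
# Neither "render-screenshots" nor "vouchesFor" exist in nixpkgs today.
{
  passthru.tests.render-screenshots = nixosTests.firefox; # illustrative
  passthru.vouchesFor = [ "libpng" "libjpeg" "freetype" "fontconfig" ];
}
```

Tooling could then walk `vouchesFor` backwards: on a libpng bump, find every package that vouches for libpng and run (and present) those tests.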

I believe this will allow us to remove most of the labour involved in upgrades, giving us more time to focus on the difficult stuff.

For security and performance, we should:

  • Add an explicit label nonmalicious-checked that committers can apply to tell automated infrastructure that the code involved is safe to run. This is similar to the current @grahamcofborg build, but more explicit: it would not conflate this assertion with the actual build command as we do now, and it would not be limited to ofborg. Infrastructure should only rely on the commit that was labelled this way; force-pushes or new commits have to be re-labeled. This label would not pass judgment on whether the change is a good one to merge, just assert that it does not obviously contain code to abuse infrastructure.

  • Forbid PRs to be merged without this label. (This just makes current workflow explicit.)

  • Move to building all pre-merge stuff in throwaway VMs, so that we need no longer care about malicious code for the purpose of testing PRs.

  • As an optimisation: the outputs of the throwaway VMs should be kept for a while. As soon as the corresponding change is labelled nonmalicious-checked, those contents shall be unlocked to be shared with e.g. Hydra, so that we don’t have to build twice.

  • Allow people to more easily contribute build power based on throwaway VMs. Everybody should be able to run a daemon on their machines that hooks into an ofborg-like system and accepts jobs from it, and then post results to the PRs (edit for clarity: to provide evidence whether builds or tests break; untrusted community builder outputs would not be put on the official binary cache, but instead be re-built by trusted Hydra after merge; it’s the testing-while-iterating that’s expensive and needs community compute power, not the final Hydra build).

    I suspect we would easily increase the currently available Hydra+ofborg resources tenfold this way, making it easy enough to test even every mass-rebuild PR against master (as opposed to batching in staging). I also expect it’ll make it easier for companies to contribute spare server power to this task, as doing so is often easier than getting a company to approve donations. Thinking this thought to its end, we could make it even easier for companies to chime in by permitting mini-plugs in the automated build-result postings; for example, the bot post could read “All NixOS tests passing. These build results were contributed by: AprilTree – all you need for devops”. (This approach allows companies to support orgs from the marketing budget, instead of a “charitable causes” budget, which most companies don’t have.)

Finally, community-process-wise we should:

  • Use sharded-out sprints much more. They work so well. I was very impressed by how fast the migration of all Perl tests to Python went, with ~265 tests being ported in record time by many contributors. We could do the same for changes like Qt wrapping, so that the vast majority of packages can be moved to a new approach swiftly and the new approach gets broad testing.
  • Make a sprint to give each unmaintained package a maintainer.

Opinions?

11 Likes
  • Fully automatic nixpkgs-review is definitely on my todo list for nixpkgs-update.
  • Fully automatic NixOS tests for packages are already supported in nixpkgs-update via passthru.tests. For example, gjs does it. (Side note: I had to blacklist the gjs NixOS test because it was taking over 10 hours to run on a powerful machine; I haven’t yet looked into why it was so slow.)
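For readers unfamiliar with the mechanism, the wiring is roughly the following (a sketch based on how gjs does it in nixpkgs; attribute paths may differ in detail):

```nix
# Sketch: a package advertises the NixOS tests that exercise it via
# passthru.tests, so tooling like nixpkgs-update can discover and run them.
stdenv.mkDerivation rec {
  pname = "gjs";
  # ... sources, build inputs, phases elided ...

  passthru.tests = {
    installed-tests = nixosTests.installed-tests.gjs;
  };
}
```

Any tool can then evaluate `gjs.passthru.tests` and build the resulting derivations to decide whether a bump is safe.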

The main problem I see with this is that we don’t have a good way to collect and aggregate changelog/news files. If we could do this, then we could maintain a list of changes since the last time a human has “patrolled” the derivation looking for new optional dependencies, changes breaking runtime, etc.

4 Likes

With new Python PRs, I generally require that unit tests are run as part of the checkPhase. Breaking changes should be caught with a nix-review.

Edit: Instead of trying to capture breaking changes upstream, we could validate that the scenarios work downstream. This is also why NixOS tests are nice.
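A minimal sketch of what “unit tests in the checkPhase” means for a Python package (the package name, test path, and hash are placeholders, not a real package):

```nix
# Sketch: run upstream's pytest suite during the build, so a broken
# bump fails at build time instead of at runtime.
buildPythonPackage rec {
  pname = "example";    # placeholder
  version = "1.0.0";    # placeholder

  src = fetchPypi {
    inherit pname version;
    sha256 = "..."; # placeholder
  };

  checkInputs = [ pytest ];
  checkPhase = ''
    pytest tests/
  '';
}
```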

Yeah, but that’s only if the package in question is bumped, right? What happens if a new PHP version is released and you then want to verify that Nextcloud still works correctly? I don’t think we aggregate downstream package tests, but I could be mistaken.

EDIT v2: @ryantm don’t take this harshly, r-ryantm is a great benefit to nixpkgs and I think we all appreciate that it’s able to keep large sections of nixpkgs up to date. I think the tooling still has plenty of room to grow, and I’m just excited by the opportunities.

1 Like

I was thinking of making another bot, or extending r-ryantm, to listen to an RSS feed of PyPI updates. If the package bump causes <50 rebuilds, then I would just do a nix-review, and if the review passed, create a PR which in most cases could be easily merged. If something failed the build, the review, or the rebuild-count check, then I would log it similarly to the r-ryantm logs.
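The gating logic described here could be condensed into a small pure function; the threshold comes from the post, but the function itself is hypothetical:

```nix
# Hypothetical gating logic for an auto-PR bot: open a PR only for
# small updates whose build and nix-review both pass; log everything else.
let
  shouldOpenPr = { rebuildCount, buildOk, reviewOk }:
    rebuildCount < 50 && buildOk && reviewOk;
in
  shouldOpenPr { rebuildCount = 12; buildOk = true; reviewOk = true; }
  # evaluates to true
```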

1 Like

I would love to have this! It makes sense on so many levels. However, I see some trust-related issues with this model.

We are currently using the Hydra builds to populate the binary cache. With this model in mind, distributing the build means we would need to either:

  1. trust each and every builder involved in this distributed CI, or
  2. have a way to verify the produced builds before sending them to the cache.

Point 1 obviously wouldn’t scale.

Point 2 is pretty tricky, especially given that a lot of builds are still not binary-reproducible. Even if we assume we’ll only use this system on the reproducible subset of nixpkgs[1], we would still need to implement a whole infrastructure in charge of assessing trust. I know that a CT-like system for assessing the trust we put in the Nix cache was discussed during the last NixCon; I don’t know how much progress has been made in that area since.

[1] I’m not even sure that finding the said subset would be a feasible task.

2 Likes

With new Python PRs, I generally require that unit tests are run as part of the checkPhase. Breaking changes should be caught with a nix-review.

Edit: Instead of trying to capture breaking changes upstream, we could validate that the scenarios work downstream. This is also why NixOS tests are nice.

There are also some changes that get mentioned in the release notes and deserve attention, but concern corner cases likely to be missed by our tests.

So eventually looking at the changelogs is a good idea even if we do some bumps without it.

Yeah, but that’s only if the package in question is bumped, right? What happens if a new PHP version is released and you then want to verify that Nextcloud still works correctly? I don’t think we aggregate downstream package tests, but I could be mistaken.

I guess the point is that maybe we should eventually add nextcloud tests to PHP’s test list.
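Concretely, that could look roughly like the following on the PHP derivation (a sketch only; whether `nixosTests.nextcloud` is the exact attribute to pull in, and whether an overlay or an in-tree change is preferable, is left open):

```nix
# Sketch: make PHP advertise that the Nextcloud NixOS test exercises it,
# so bump tooling knows to run that test on every PHP upgrade.
php.overrideAttrs (old: {
  passthru = (old.passthru or {}) // {
    tests = (old.passthru.tests or {}) // {
      inherit (nixosTests) nextcloud;
    };
  };
})
```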

  • Move to building all pre-merge stuff in throwaway VMs, so that we need no longer care about malicious code for the purpose of testing PRs.

Note that the fun part is not just container escape (or container + VM escape; these also happen from time to time!), but also the need for fine-grained network isolation of fixed-output derivations.

Although one could try to build all FODs on central infrastructure as they are usually less CPU-intensive.

  • Allow people to more easily contribute build power based on throwaway VMs. Everybody should be able to run a daemon on their machines that hooks into an ofborg-like system and accepts jobs from it, and then post results to the PRs. I suspect we would easily tenfold the currently available Hydra+ofborg resources this way.

OfBorg previously ran on community computing hardware. It didn’t have a completely crazy amount of hardware anyway. And there is coordination overhead in all of that (which I observed, as I had a builder). I think the move to central infrastructure was partially because communicating exact sandboxing guarantees and managing FOD network access on non-centralised builders is a nontrivial amount of coordination work (and isolation is technically easier in a fully controlled setup fully known to one person).

Thinking this thought to its end, we could make it even easier for companies to chime in by permitting mini-plugs in the automated build-result postings; for example, the bot post could read “All NixOS tests passing. These build results were contributed by: AprilTree – all you need for devops”. (This approach allows companies to support orgs from the marketing budget, instead of a “charitable causes” budget, which most companies don’t have.)

And if we want to ever use third-party builds, we need to track that they reproduce on technically and administratively independent builders (otherwise a single break-in or a single deliberate attack could poison the cache easily), so at least name+website link is necessary information anyway.

I also think it’s very important to give PR authors more control over their PRs’ metadata (for example through a tagging bot) and to implement something like RFC 30 (which I started but then had to abandon for lack of time).

The main thing is that there should be a single tag (“needs review” or similar) that marks all commits that are immediately actionable for any reviewer. That action could well be “ask followup questions” (and change the tag to “needs info”), “ping other reviewer with domain knowledge” (and change the tag, maybe also “needs info”), request changes (“needs work”) etc.

1 Like

Correct! Probably we should extend nixpkgs-review to run NixOS tests too!

edit: Here’s the feature request for that: https://github.com/Mic92/nixpkgs-review/issues/77

1 Like

This is not exactly what I had in mind – I should have been a bit clearer on that. What I mean is much simpler.

There are 2 components of trust:

  1. Do we trust the inputs (this is what nonmalicious-checked asserts) not to mess with build infrastructure or introduce backdoors?
  2. Do we trust some infrastructure/builder to put things on the binary cache?

I do not propose that community-contributed builds from non-trusted infrastructure make it into the binary cache.

Instead I mean the following simple rules:

  • If something is nonmalicious-checked, any infrastructure can safely build it, even if it does not have VM level isolation.
  • If something is nonmalicious-checked, AND it was built by anybody’s infrastructure (trusted or not), THEN it can be merged.
    If it was built on untrusted infrastructure, then trusted infrastructure will re-build it after the merge as happens now.
  • If something is nonmalicious-checked, AND was built by trusted infrastructure, THEN it can make it onto the binary cache.

So in general, contributed untrusted infrastructure is only used to decide whether any build or test breaks, not to produce code that users run.

In my proposed model, the biggest negative impact a malicious community-contributed builder could have is to claim e.g. a Firefox upgrade “works fine” when in fact the build breaks (which is not a big problem, detected automatically after the merge when Hydra builds it, and we can exclude the lying builder afterwards); it cannot result in untrusted code on our binary caches.
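The rules above can be condensed into two tiny predicates (hypothetical names, written here purely as an illustration of the policy):

```nix
# Hypothetical policy predicates matching the rules above.
let
  # Any builder's result suffices as merge evidence...
  canMerge = { nonmaliciousChecked, builtAnywhere }:
    nonmaliciousChecked && builtAnywhere;

  # ...but only trusted builds may populate the binary cache.
  canPopulateCache = { nonmaliciousChecked, builtByTrusted }:
    nonmaliciousChecked && builtByTrusted;
in {
  merge = canMerge { nonmaliciousChecked = true; builtAnywhere = true; };
  # evaluates to true
  cache = canPopulateCache { nonmaliciousChecked = true; builtByTrusted = false; };
  # evaluates to false: an untrusted build never reaches the cache
}
```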

I have updated the description accordingly now:

Everybody should be able to run a daemon on their machines that hooks into an ofborg-like system and accepts jobs from it, and then post results to the PRs (edit for clarity: to provide evidence whether builds or tests break; untrusted community builder outputs would not be put on the official binary cache, but instead be re-built by trusted Hydra after merge; it’s the testing-while-iterating that’s expensive and needs community compute power, not the final Hydra build).

2 Likes

Hi all,

First off I’d like to say I agree in spirit with most of this thread, and it’s
great to see others thinking about it. We need to lean heavily on automation and
“work smarter not harder” to keep up with the world and avoid burning our people
out.

On this front, I have a couple improvements I’d like to call out.

  1. With help from @ryantm, I’ve done work to refactor nixpkgs-update
    by separating out the updater “rewrite rules” from the nuts-and-bolts of
    cloning the repo, doing builds, interacting with Repology, sending a PR, etc.
    The goal here is to make it easy to add a new improvement/rewrite rule that
    gets run when versions upgrade.
    https://github.com/ryantm/nixpkgs-update/blob/321c8c59b2faa1a9ae89224fa125fe081dd56f71/src/Update.hs#L212-L222

  2. Using (1), we’ve added a rewrite rule that can quote
    meta.homepage for https://github.com/NixOS/rfcs/pull/45. Example here:
    https://github.com/NixOS/nixpkgs/pull/82390
    This is a fairly simple proof-of-concept rewrite rule.

  3. Using (1), we’ve added a rewrite rule that is capable of updating
    Rust application packages automatically, like this:
    https://github.com/NixOS/nixpkgs/pull/82384

  4. Using (1), we’ve added a rewrite rule that is capable of updating Golang
    application packages automatically, like this:
    https://github.com/NixOS/nixpkgs/pull/82492
    For a PR example of adding a new rule, this is what added golang support:
    https://github.com/ryantm/nixpkgs-update/pull/163

  5. With help from @mic92, nixpkgs-review is now capable of operating
    non-interactively with a --no-shell cmdline argument. This is a building
    block that can let us hook it into other bots like nixpkgs-update, or
    ad-hoc mass PR verification scripts, or other CI bots, and so on.
    https://github.com/Mic92/nixpkgs-review/pull/84

Aside from celebrating the concrete improvements already implemented-and-deployed,
the main reason to emphasize these is that they are building blocks for further
automation, so if you have ideas here it ought to be easier to implement them now!


A quick note on the philosophy of (1): this is not meant to replace treewide
PRs or CI checks. The purview of nixpkgs-update is still in updating
individual package versions, but we often have some “not absolutely urgent but
still nice to have” improvements in the works that we’d like to slowly sweep
through NixPkgs, like RFC 45, or things like removing unnecessary rec, or
using fetchFromGitHub for release tarballs instead of fetchzip, etc. One way to
accomplish these is by having someone do a treewide PR with an ad-hoc
sed/hnix/editor-macro generating the diff, but these can be hard to code review
and create mass rebuilds + merge conflicts, and while appropriate for relatively
important/critical changes are less compelling for small cleanups. Moreover they
don’t keep applying over time.

The above gives that hook to have the updater bot slowly drag in small
improvements over time, spread out as packages themselves get updates and
reviews/testing as part of their normal lifecycle.


Some comments on the above thread:

  1. Based on experience, I’d advocate against having auto-merge bots. Potential
    security vulnerabilities and complexity aside, we’ve had multiple issues
    where the nixpkgs-update bot generated garbage that still passed CI, and was
    only saved by human code review (huge thanks to @jonringer and @marsam for
    reviewing+merging a ton of my PRs, btw). I’d argue we should focus the bulk
    of our effort on PR generation / PR validation with CI, at least initially,
    because it’s lower risk and higher RoI.

  2. The ability to farm out temporary OfBorg capacity would be fantastic. Much of
    the above was an output of me initially writing my own scripts for
    https://github.com/NixOS/nixpkgs/issues/79975. I burnt out the OfBorg bot by
    going too fast, then learned from @grahamc how nixpkgs-update does rate
    limiting and implemented it, which got me thinking about updating
    nixpkgs-update to make these kinds of updates easier in the future. Once I
    added rate limiting, the script spent most of its time sleeping with 72 cores
    idle waiting for OfBorg to catch up. It would have been nice to connect that
    capacity temporarily for PR builds. Capacity on the AWS spot market is
    extremely cheap, but it requires OfBorg to support a dynamic build farm
    gracefully.

    I agree these should never make their way into the official binary cache for
    security reasons, of course.

  3. At the end of the day our humans are our most valuable resource, and probably
    the greatest risk to our ecosystem is that we burn out key maintainers by
    having them do too much repetitive, grindy work that should have been
    automated. To the extent that we can make their lives easier with heavy PR
    CI validation / nix derivation sanitizing we should prioritize it! No novel
    point here; just reiterating that I think improvements suggested in this
    thread are valuable and important for the long-term health of NixPkgs.

5 Likes

One could consider an opt-in auto-merge bot for some cases; then having great tests for a package would actually start paying off…

I mean, human review is also not infallible, and for some packages we could end up with comparable risk and lower effort (and tests-as-demos as a bonus).

1 Like

This is in the spirit of RFC 50 (https://github.com/NixOS/rfcs/pull/50), which enabled finer granularity of commit access (presumably through ofborg).

I don’t think we will completely get rid of the human element, but we could take enormous steps to ease the amount of involvement for the human.

1 Like

I don’t think we will completely get rid of the human element, but we could take enormous steps to ease the amount of involvement for the human.

In a sense, letting the package maintainer accept a set of tests as sufficient for full-auto updates is the easiest policy to agree on…

Of course, substantial expression improvements suggested by not-yet-trusted-enough humans need to pass a trusted human evaluation.

1 Like

I agree, and I think we do not need full auto-merging without human review.

It would be totally sufficient if for every PR, a committer looks at the diff to check that a bot didn’t screw up there, and sets my proposed nonmalicious-checked label.

This workflow is still extremely low overhead; it would probably come down to about 30 seconds per PR on average, allowing thousands of PRs per day.

2 Likes