Steps towards even more PR automation

Kicked off by this thread on “Quality control for packages”, I would like to present my long-term vision/wishes for the automation of package upgrades and QA.

I would especially like to hear what’s already planned from our great people working on build automation, but also from those who spend a lot of time reviewing/testing/merging PRs, and in general whether others share this vision.

Starting with a quote from the other thread:

That’s right. ofborg and r-ryantm are already very good steps in the right direction. I think we should go all-in on even more automation.

This is how I think we can master the pull request flood in the long term:

  • Fully automatic nix-review builds for all r-ryantm bot upgrades, so that all dependent packages will be automatically rebuilt. Also for mass rebuilds. This should be against master, even if the PR is against staging, because staging is often broken, and the master build will tell us most of what we need.
  • Fully automatic NixOS tests for them as well, with results posted on the PR. Right now we run NixOS tests on user demand only.
  • Automatic merging of r-ryantm bot point releases, e.g. 1.2.3 -> 1.2.4, after the above checks have passed.
  • Add NixOS tests for a lot more apps (both CLI and desktop).
  • Establish an evidence system for NixOS tests that says: “If this test passes / is accepted by a human, then the following packages X, Y, Z are probably working fine”. For example, we could say that if a test screenshotting Firefox showing various image file types and fonts looks OK (a human would make that statement on the PR), then libpng, libjpeg, and all the font libraries Firefox depends on are probably OK. Thus, the workflow would be that if a bot proposes an update to libpng, further automation would say “hey humans, I have built the dependent Firefox and Evince against this, here are the test outputs; if you say the screenshots of one of those look OK, I will auto-merge this PR”. (A minimal sketch of what this could look like follows below.)
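
    To make the evidence idea concrete, here is a minimal sketch in Nix using the existing passthru.tests convention. It is not actual nixpkgs code: the choice of libpng, Firefox and Evince and the firefox-vm attribute name are purely illustrative.

        { libpng, firefox, evince, nixosTests }:

        libpng.overrideAttrs (old: {
          passthru = (old.passthru or { }) // {
            tests = {
              # dependent applications whose successful builds count as evidence
              inherit firefox evince;
              # a VM test whose screenshots a human can sign off on in the PR
              firefox-vm = nixosTests.firefox;
            };
          };
        })

    A bot bumping libpng could then build everything listed under passthru.tests, attach the NixOS test output to the PR, and auto-merge once a human approves the screenshots.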

I believe this will allow us to remove most of the labour involved for upgrades, giving us more time to focus on the difficult stuff.

For security and performance, we should:

  • Add an explicit label nonmalicious-checked that committers can apply to tell automatic infrastructure that the code involved is safe to run. This is similar to the current @grahamcofborg build, but would be more explicit, not conflating this assertion with the actual build command as we do now, and it would not be limited to ofborg. Infrastructure should only rely on the commit that was labelled this way; force-pushes or new commits have to be re-labelled. This label would not pass judgment on whether the change is a good one to merge, just assert that it does not obviously contain code to abuse infrastructure.

  • Forbid PRs to be merged without this label. (This just makes current workflow explicit.)

  • Move to building all pre-merge stuff in throwaway VMs, so that we no longer need to care about malicious code for the purpose of testing PRs.

  • As an optimisation: the outputs of the throwaway VMs should be kept for a while. As soon as the corresponding change is labelled nonmalicious-checked, those contents can be unlocked and shared with e.g. Hydra, so that we don’t have to build twice.

  • Allow people to more easily contribute build power based on throwaway VMs. Everybody should be able to run a daemon on their machines that hooks into an ofborg-like system, accepts jobs from it, and then posts results to the PRs (edit for clarity: to provide evidence of whether builds or tests break; untrusted community builder outputs would not be put on the official binary cache, but would instead be re-built by trusted Hydra after merge; it’s the testing-while-iterating that’s expensive and needs community compute power, not the final Hydra build).

    I suspect we could easily increase the currently available Hydra+ofborg resources tenfold this way, making it easy enough to test even every mass-rebuild PR against master (as opposed to batching in staging). I also expect it’ll make it easier for companies to contribute spare server power to this task, as it is often easier to do this than to get a company to approve donations. Thinking this thought to an end, we could make it even easier for companies to chime in by permitting a mini-plug in the automated build result postings; for example, the bot post could read “All NixOS tests passing. These build results were contributed by: AprilTree – all you need for devops”. (This approach allows companies to support orgs from the marketing budget, instead of a “charitable causes” budget, which most companies don’t have.)

Finally, community-process-wise we should:

  • Use sharded-out sprints much more; they work so well. I was very impressed by how fast the migration of all Perl tests to Python was done, with ~265 tests being ported in record time by many contributors. We could do the same for changes like Qt wrapping, so that the vast majority of packages can be moved to a new approach swiftly and the new approach gets broad checks.
  • Make a sprint to give each unmaintained package a maintainer.

Opinions?

8 Likes

  • Fully automatic nixpkgs-review is definitely on my todo list for nixpkgs-update.
  • Fully automatic NixOS tests for packages in nixpkgs-update are already supported via passthru.tests. For example, gjs does it. (Side note: I had to blacklist gjs NixOS testing because the test was taking over 10 hours to run on a powerful machine; I haven’t looked into why it was so slow yet.)

The main problem I see with this is that we don’t have a good way to collect and aggregate changelog/news files. If we could do this, then we could maintain a list of changes since the last time a human “patrolled” the derivation, looking for new optional dependencies, runtime-breaking changes, etc.

4 Likes

With new Python PRs, I generally require that unit tests are run as part of the checkPhase. Breaking changes should be caught with a nix-review.

Edit: Instead of trying to capture breaking changes upstream, we could validate that the scenarios work downstream, which is also why NixOS tests are nice.
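
For reference, a minimal sketch of that convention in Nix; the package name, version and hash are placeholders, and the exact check-related attribute names in nixpkgs have varied over time:

    { lib, buildPythonPackage, fetchPypi, pytest, requests }:

    buildPythonPackage rec {
      pname = "example-package";   # placeholder
      version = "1.0.0";           # placeholder
      src = fetchPypi {
        inherit pname version;
        sha256 = lib.fakeSha256;   # placeholder hash
      };

      propagatedBuildInputs = [ requests ];

      # run the upstream unit tests at build time, so a dependency bump that
      # breaks this package shows up in a nix-review run
      checkInputs = [ pytest ];
      checkPhase = ''
        pytest
      '';
    }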

Yea, but that’s only if the package in question is bumped, right? What happens if a new PHP version is released and you then want to verify that Nextcloud still works correctly? I don’t think we aggregate downstream package tests, but I could be mistaken.

EDIT v2: @ryantm don’t take this harshly, r-ryantm is a great benefit to nixpkgs and I think we all appreciate that it’s able to keep large sections of nixpkgs more up-to-date. I think the tooling still has plenty of room to grow, and I’m just excited by the opportunities.

1 Like

I was thinking of making another bot, or extending r-ryantm, to listen to an RSS feed of PyPI updates. If the package bump caused <50 rebuilds, I would just do a nix-review, and if the review passed, create a PR which in most cases could be easily merged. If something failed the build, the review, or the rebuild-count check, I would log it similarly to the r-ryantm logs.

1 Like

I would love to have this! It makes sense at so many levels. However, I see some trust-related issues with this model.

We are currently using the Hydra builds to populate the binary cache. With this model in mind, distributing the builds means we would need to either:

  1. trust each and every builder involved in this distributed CI.
  2. have a way to verify the produced builds before sending them to the cache.

Point 1 obviously wouldn’t scale.

Point 2 is pretty tricky, especially given the fact that a lot of builds are still not binary-reproducible. Even if we assume we’ll only use this system on the reproducible subset of nixpkgs[1], we would still need to implement a whole infrastructure in charge of assessing the trust. I know that a CT-like system for assessing the trust we put in the Nix cache was discussed during the last NixCon; I don’t know how much progress has been made in that area since.

[1] I’m not even sure that finding said subset would be a feasible task.

2 Likes

With new Python PRs, I generally require that unit tests are run as part of the checkPhase. Breaking changes should be caught with a nix-review.

Edit: Instead of trying to capture breaking changes upstream, we could validate that the scenarios work downstream, which is also why NixOS tests are nice.

There are also some changes that get mentioned in the release notes and deserve attention, but concern corner cases likely to be missed by our tests.

So eventually looking at the changelogs is a good idea even if we do some bumps without it.

Yea, but that’s only if the package in question is bumped, right? What happens if a new PHP version is released and you then want to verify that Nextcloud still works correctly? I don’t think we aggregate downstream package tests, but I could be mistaken.

I guess the point is that maybe we should eventually add Nextcloud tests to PHP’s test list.
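
Something like the following (hypothetical) overlay would express that; the exact attribute paths are assumptions, but the idea is to attach the Nextcloud NixOS test to php’s passthru.tests so that tooling which runs passthru.tests on a php bump also exercises Nextcloud:

    final: prev: {
      php = prev.php.overrideAttrs (old: {
        passthru = (old.passthru or { }) // {
          tests = (old.passthru.tests or { }) // {
            # downstream scenario that should keep working after a php bump
            nextcloud = final.nixosTests.nextcloud;
          };
        };
      });
    }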

  • Move to building all pre-merge stuff in throwaway VMs, so that we no longer need to care about malicious code for the purpose of testing PRs.

Note that the fun part is not just container escape (or container + VM escape, these also happen from time to time!), but also the need for fine-grained network isolation of fixed-output derivations.

Although one could try to build all FODs on central infrastructure, as they are usually less CPU-intensive.
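
For readers unfamiliar with the term: a fixed-output derivation (FOD) declares its output hash up front, and in exchange the Nix sandbox grants the builder network access; the canonical examples are fetchers like fetchurl. That network access is exactly what needs careful isolation on untrusted builders, while the work itself is mostly I/O rather than CPU. A minimal example (URL and hash are placeholders):

    { fetchurl, lib }:

    # a fixed-output derivation: the hash is known in advance, so Nix allows
    # network access during the build and verifies the result against the hash
    fetchurl {
      url = "https://example.org/some-source-1.0.tar.gz";  # placeholder
      sha256 = lib.fakeSha256;                             # placeholder
    }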

  • Allow people to more easily contribute build power based on throwaway VMs. Everybody should be able to run a daemon on their machines that hooks into an ofborg-like system, accepts jobs from it, and then posts results to the PRs. I suspect we could easily increase the currently available Hydra+ofborg resources tenfold this way.

OfBorg previously ran on community computing hardware. It didn’t have a completely crazy amount of hardware anyway. And there is coordination overhead in all that (which I observed, as I had a builder). I think the move to central infrastructure was partially because communicating exact sandboxing guarantees and managing FOD network access on non-centralised builds is a nontrivial amount of coordination work (and isolation is technically easier in a fully controlled setup fully known to one person).

Thinking this thought to an end, we could make it even easier for companies to chime in by permitting a mini-plug in the automated build result postings; for example, the bot post could read “All NixOS tests passing. These build results were contributed by: AprilTree – all you need for devops”. (This approach allows companies to support orgs from the marketing budget, instead of a “charitable causes” budget, which most companies don’t have.)

And if we ever want to use third-party builds, we need to track that they reproduce on technically and administratively independent builders (otherwise a single break-in or a single deliberate attack could easily poison the cache), so at least a name + website link is necessary information anyway.

I also think it’s very important to give PR authors more control over their PRs’ metadata (for example through a tagging bot) and to implement something like RFC 30 (which I started but then had to abandon because of a lack of time).

The main thing is that there should be a single tag (“needs review” or similar) that marks all commits that are immediately actionable for any reviewer. That action could well be “ask follow-up questions” (and change the tag to “needs info”), “ping another reviewer with domain knowledge” (and change the tag, maybe also to “needs info”), “request changes” (“needs work”), etc.

1 Like

Correct! Probably we should extend nixpkgs-review to run NixOS tests too!

edit: Here’s the feature request for that: https://github.com/Mic92/nixpkgs-review/issues/77

1 Like

This is not exactly what I had in mind – I should have been a bit clearer on that. What I mean is much simpler.

There are 2 components of trust:

  1. Do we trust the inputs (i.e. are they nonmalicious-checked) not to mess with build infrastructure or introduce backdoors?
  2. Do we trust some infrastructure/builder to put things on the binary cache?

I do not propose that community-contributed builds from non-trusted infrastructure make it into the binary cache.

Instead I mean the following simple rules:

  • If something is nonmalicious-checked, any infrastructure can safely build it, even if it does not have VM-level isolation.
  • If something is nonmalicious-checked, AND it was built by anybody’s infrastructure (trusted or not), THEN it can be merged.
    If it was built on untrusted infrastructure, then trusted infrastructure will re-build it after the merge, as happens now.
  • If something is nonmalicious-checked, AND was built by trusted infrastructure, THEN it can make it onto the binary cache.

So in general, contributed untrusted infrastructure is only used to make the decision of whether any build or test breaks, not to produce code that users run.
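
Purely as an illustration (the attribute names below are invented for this sketch, not an existing API), the three rules above could be written down as a small Nix function:

    { nonmaliciousChecked, buildSucceeded, builderTrusted }:

    {
      # anyone may build it once a committer vouched that it is not malicious
      safeToBuildAnywhere = nonmaliciousChecked;
      # a green build from any builder, trusted or not, is enough evidence to merge
      mayMerge = nonmaliciousChecked && buildSucceeded;
      # only outputs from trusted infrastructure may reach the binary cache
      mayEnterBinaryCache = nonmaliciousChecked && buildSucceeded && builderTrusted;
    }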

In my proposed model, the biggest negative impact a malicious community-contributed builder could have is to claim that e.g. a Firefox upgrade “works fine” when in fact the build breaks (which is not a big problem: it is detected automatically after the merge when Hydra builds it, and we can exclude the lying builder afterwards); it cannot result in untrusted code on our binary caches.

I have updated the description accordingly now:

Everybody should be able to run a daemon on their machines that hooks into an ofborg-like system, accepts jobs from it, and then posts results to the PRs (edit for clarity: to provide evidence of whether builds or tests break; untrusted community builder outputs would not be put on the official binary cache, but would instead be re-built by trusted Hydra after merge; it’s the testing-while-iterating that’s expensive and needs community compute power, not the final Hydra build).

2 Likes