I would especially like to hear what’s already planned from our great people working on build automation, but also from those who spend a lot of time reviewing/testing/merging PRs, and in general whether others share this vision.
Starting with a quote from the other thread:
That’s right. ofborg and r-ryantm are already very good steps in the right direction. I think we should go all-in on even more automation.
This is how I think we can master the pull request flood in the long term:
Fully automatic nix-review builds for all r-ryantm bot upgrades, so that all dependent packages are automatically rebuilt. The same for mass rebuilds. This should be against master, even if the PR is against staging, because staging is often broken, and the master build will tell us most of what we need.
Fully automatic NixOS tests for them as well, with results posted on the PR. Right now we run NixOS tests only on user demand.
Automatic merging of r-ryantm bot point releases, e.g. 1.2.3 → 1.2.4, after the above checks have passed (a sketch of what could count as a point release follows below).
Add NixOS tests for a lot more apps (both CLI and desktop).
Establish an evidence system for NixOS tests that says: “If this test passes / is accepted by a human, then the following packages X, Y, Z are probably working fine.” For example, we could say that if a test screenshotting Firefox showing various image file types and fonts looks OK (a human would make that statement on the PR), then libpng, libjpeg, and all the font libraries Firefox depends on are probably OK. Thus, the workflow would be that if a bot proposes an update to libpng, further automation would say “hey humans, I have built the dependent Firefox and Evince against this, here are the test outputs; if you say the screenshots of one of those look OK, I will auto-merge this PR”.
I believe this will allow us to remove most of the labour involved for upgrades, giving us more time to focus on the difficult stuff.
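To make the point-release criterion above concrete, here is one way it could be expressed in Nix. This is only a sketch under the assumption that a simple “same major.minor, higher patch” rule is good enough; `isPointRelease` is a made-up helper name, and `lib` is nixpkgs’ lib.

```nix
{ lib }:  # nixpkgs' lib, used here for lib.take

let
  # Hypothetical helper: true when `new` is a patch-level bump of `old`,
  # i.e. the first two version components match and the version increased.
  isPointRelease = old: new:
    builtins.compareVersions new old > 0
    && lib.take 2 (builtins.splitVersion old) == lib.take 2 (builtins.splitVersion new);
in
{
  example1 = isPointRelease "1.2.3" "1.2.4";  # true  -> could be auto-merged after checks pass
  example2 = isPointRelease "1.2.3" "1.3.0";  # false -> leave to a human
}
```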
For security and performance, we should:
Add an explicit label nonmalicious-checked that committers can apply to tell automatic infrastructure that the code involved is safe to run. This is similar to the current @grahamcofborg build, but it would be more explicit, not conflate this assertion with the actual build command as we do now, and not be only for ofborg. Infrastructure should only rely on the commit that was labelled this way; force-pushes or new commits have to be re-labelled. This label would not pass judgment on whether the change is a good one to merge; it would just assert that the change does not obviously contain code to abuse infrastructure.
Forbid PRs to be merged without this label. (This just makes current workflow explicit.)
Move to building all pre-merge stuff in throwaway VMs, so that we no longer need to care about malicious code for the purpose of testing PRs.
As an optimisation: the outputs of the throwaway VMs should be kept for a while. As soon as the corresponding change is labelled nonmalicious-checked, those outputs would be unlocked and shared with e.g. Hydra, so that we don’t have to build twice.
Allow people to more easily contribute build power based on throwaway VMs. Everybody should be able to run a daemon on their machines that hooks into an ofborg-like system, accepts jobs from it, and then posts results to the PRs (edit for clarity: to provide evidence of whether builds or tests break; untrusted community builder outputs would not be put on the official binary cache, but would instead be re-built by trusted Hydra after the merge; it’s the testing-while-iterating that’s expensive and needs community compute power, not the final Hydra build).
I suspect this could easily multiply the currently available Hydra+ofborg resources tenfold, making it easy enough to test even every mass-rebuild PR against master (as opposed to batching in staging). I also expect it’ll make it easier for companies to contribute spare server power to this task, as it is often easier to do that than to get a company to approve donations. Thinking this through to the end, we could make it even easier for companies to chime in by permitting a mini-plug in the automated build-result postings; for example, the bot post could read “All NixOS tests passing. These build results were contributed by: AprilTree – all you need for devops”. (This approach allows companies to support orgs from their marketing budget, instead of a “charitable causes” budget which most companies don’t have.)
Finally, community-process-wise we should:
Use sharded-out sprints much more. They work so well. I was very impressed by how fast the migration of all Perl tests to Python was done, with ~265 tests being ported in record time by many contributors. We could do the same for changes like Qt wrapping, so that the vast majority of packages can be moved to a new approach swiftly and the new approach gets broad checks.
Make a sprint to give each unmaintained package a maintainer.
Fully automatic nixpkgs-review is definitely on my todo list for nixpkgs-update.
Fully automatic NixOS tests for packages are already supported in nixpkgs-update via passthru.tests. For example, gjs does it. (Side note: I had to blacklist gjs’s NixOS testing because the test was taking over 10 hours to run on a powerful machine; I haven’t looked into why it was so slow yet.)
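For readers who haven’t seen the mechanism: the hook is just an attribute on the package. Paraphrased from memory (not a verbatim copy of the gjs expression), it looks roughly like this:

```nix
# Fragment inside the gjs derivation; nixosTests is taken as a function argument.
passthru.tests = {
  # nixpkgs-update and nixpkgs-review can discover this and run the gjs
  # installed-tests NixOS test whenever the package is updated.
  installed-tests = nixosTests.installed-tests.gjs;
};
```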
The main problem I see with this is that we don’t have a good way to collect and aggregate changelog/news files. If we could do this, we could maintain a list of changes since the last time a human “patrolled” the derivation, looking for new optional dependencies, runtime-breaking changes, etc.
With new Python PRs, I generally require that unit tests are run as part of the checkPhase. Breaking changes should be caught with a nix-review.
Edit: Instead of trying to capture breaking changes upstream, we could validate that the scenarios work downstream, which is also why NixOS tests are nice.
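For context, “unit tests in the checkPhase” for a Python package looks roughly like the following sketch; the package name is hypothetical and the hash is a placeholder.

```nix
{ lib, buildPythonPackage, fetchPypi, pytest }:

buildPythonPackage rec {
  pname = "somepackage";   # hypothetical package
  version = "1.2.3";

  src = fetchPypi {
    inherit pname version;
    sha256 = lib.fakeSha256;  # placeholder
  };

  # Run the upstream unit tests at build time; an update that breaks them
  # fails the build and therefore shows up in a nix-review run.
  checkInputs = [ pytest ];
  checkPhase = "pytest";
}
```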
Yeah, but that’s only if the package in question is bumped, right? What happens if a new PHP version is released and you then want to verify that Nextcloud still works correctly? I don’t think we aggregate downstream package tests, but I could be mistaken.
EDIT v2: @ryantm, don’t take this harshly; r-ryantm is a great benefit to nixpkgs, and I think we all appreciate that it’s able to keep large sections of nixpkgs more up-to-date. I think the tooling still has plenty of room to grow, and I’m just excited by the opportunities.
I was thinking of making another bot, or extending r-ryantm, to listen to an RSS feed of PyPI updates. If the package bump caused <50 rebuilds, I would just do a nix-review and, if the review passed, create a PR which in most cases could be easily merged. If something failed the build, the review, or the rebuild-count check, I would log it similarly to the r-ryantm logs.
I would love to have this! It makes sense on so many levels. However, I see some trust-related issues with this model.
We currently use the Hydra builds to populate the binary cache. With this model in mind, distributing the builds means we would need to either:
trust each and every builder involved in this distributed CI, or
have a way to verify the produced builds before sending them to the cache.
Point 1 obviously wouldn’t scale.
Point 2 is pretty tricky, especially given that a lot of builds are still not binary-reproducible. Even if we assume we’ll only use this system on the reproducible subset of nixpkgs[1], we would still need to implement a whole infrastructure in charge of assessing the trust. I know that a CT-like system for assessing the trust we put in the Nix cache was discussed during the last NixCon; I don’t know how much progress has been made in that area since.
[1] I’m not even sure that finding the said subset would be a feasible task.
With new Python PRs, I generally require that unit tests are run as part of the checkPhase. Breaking changes should be caught with a nix-review.
Edit: Instead of trying to capture breaking changes upstream, we could validate that the scenarios work downstream, which is also why NixOS tests are nice.
There are also some changes that get mentioned in the release notes and deserve attention, but concern corner cases likely to be missed by our tests.
So eventually looking at the changelogs is a good idea even if we do some bumps without it.
Yeah, but that’s only if the package in question is bumped, right? What happens if a new PHP version is released and you then want to verify that Nextcloud still works correctly? I don’t think we aggregate downstream package tests, but I could be mistaken.
I guess the point is that maybe we should eventually add nextcloud tests to PHP’s test list.
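Concretely, “adding nextcloud tests to PHP’s test list” could look roughly like the following fragment in the php derivation. This is only a sketch; it assumes the Nextcloud NixOS tests are reachable as nixosTests.nextcloud and that nixosTests is passed in as an argument.

```nix
# Fragment inside the php derivation: pull in a downstream application's
# NixOS tests so that a PHP bump automatically exercises Nextcloud.
passthru.tests = {
  inherit (nixosTests) nextcloud;
};
```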
Move to building all pre-merge stuff in throwaway VMs, so that we no longer need to care about malicious code for the purpose of testing PRs.
Note that the fun part is not just container escape (or container + VM escape; these also happen from time to time!), but also the need for fine-grained network isolation of fixed-output derivations.
Although one could try to build all FODs on central infrastructure, as they are usually less CPU-intensive.
Allow people to more easily contribute build power based on throwaway VMs. Everybody should be able to run a daemon on their machines that hooks into an ofborg-like system, accepts jobs from it, and then posts results to the PRs. I suspect this could easily multiply the currently available Hydra+ofborg resources tenfold.
OfBorg previously ran on community computing hardware. It didn’t have a completely crazy amount of hardware anyway. And there is coordination overhead in all that (which I observed, as I ran a builder). I think the move to central infrastructure was partially because communicating exact sandboxing guarantees and managing FOD network access on non-centralised builds is a nontrivial amount of coordination work (and isolation is technically easier in a fully controlled setup fully known to one person).
Thinking this through to the end, we could make it even easier for companies to chime in by permitting a mini-plug in the automated build-result postings; for example, the bot post could read “All NixOS tests passing. These build results were contributed by: AprilTree – all you need for devops”. (This approach allows companies to support orgs from their marketing budget, instead of a “charitable causes” budget which most companies don’t have.)
And if we ever want to use third-party builds, we need to check that they reproduce on technically and administratively independent builders (otherwise a single break-in or a single deliberate attack could easily poison the cache), so at least a name + website link is necessary information anyway.
I also think it’s very important to give PR authors more control over their PRs’ metadata (for example through a tagging bot) and to implement something like RFC 30 (which I started but then had to abandon for lack of time).
The main thing is that there should be a single tag (“needs review” or similar) that marks all commits that are immediately actionable for any reviewer. That action could well be “ask follow-up questions” (and change the tag to “needs info”), “ping another reviewer with domain knowledge” (and change the tag, maybe also to “needs info”), “request changes” (“needs work”), etc.
This is not exactly what I had in mind – I should have been a bit clearer on that. What I mean is much simpler.
There are 2 components of trust:
Do we trust the inputs to be non-malicious, i.e. checked to not mess with build infrastructure or introduce backdoors?
Do we trust some infrastructure/builder to put things on the binary cache?
I do not propose that community-contributed builds from non-trusted infrastructure make it into the binary cache.
Instead I mean the following simple rules:
If something is nonmalicious-checked, any infrastructure can safely build it, even if it does not have VM level isolation.
If something is nonmalicious-checked, AND it was built by anybody’s infrastructure (trusted or not), THEN it can be merged.
If it was built on untrusted infrastructure, then trusted infrastructure will re-build it after the merge as happens now.
If something is nonmalicious-checked, AND was built by trusted infrastructure, THEN it can make it onto the binary cache.
So in general, contributed untrusted infrastructure is only used to decide whether any build or test breaks, not to produce code that users run.
In my proposed model, the biggest negative impact a malicious community-contributed builder could have is to claim that e.g. a Firefox upgrade “works fine” when in fact the build breaks (which is not a big problem: it is detected automatically after the merge when Hydra builds it, and we can exclude the lying builder afterwards); it cannot result in untrusted code on our binary caches.
I have updated the description accordingly now:
Everybody should be able to run a daemon on their machines that hooks into an ofborg-like system, accepts jobs from it, and then posts results to the PRs (edit for clarity: to provide evidence of whether builds or tests break; untrusted community builder outputs would not be put on the official binary cache, but would instead be re-built by trusted Hydra after the merge; it’s the testing-while-iterating that’s expensive and needs community compute power, not the final Hydra build).
First off I’d like to say I agree in spirit with most of this thread, and it’s
great to see others thinking about it. We need to lean heavily on automation and
“work smarter not harder” to keep up with the world and avoid burning our people
out.
On this front, I have a couple improvements I’d like to call out.
Aside from celebrating the concrete improvements already implemented-and-deployed,
the main reason to emphasize these is that they are building blocks for further
automation, so if you have ideas here it ought to be easier to implement them now!
A quick note on the philosophy of (1): this is not meant to replace treewide
PRs or CI checks. The purview of nixpkgs-update is still updating
individual package versions, but we often have some “not absolutely urgent but
still nice to have” improvements in the works that we’d like to slowly sweep
through NixPkgs, like RFC 45, or things like removing unnecessary rec, or
using fetchFromGitHub for release tarballs instead of fetchzip, etc. One way to
accomplish these is by having someone do a treewide PR with an ad-hoc
sed/hnix/editor-macro generating the diff, but these can be hard to code review
and create mass rebuilds + merge conflicts, and while appropriate for relatively
important/critical changes, they are less compelling for small cleanups. Moreover, they
don’t keep applying over time.
The above gives that hook to have the updater bot slowly drag in small
improvements over time, spread out as packages themselves get updates and
reviews/testing as part of their normal lifecycle.
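As a concrete example of the kind of small cleanup meant above, the fetchzip → fetchFromGitHub change in a package expression is usually just the following; the owner/repo names are hypothetical and the hashes are placeholders.

```nix
# Before: a raw GitHub-generated tarball URL
src = fetchzip {
  url = "https://github.com/example/project/archive/v${version}.tar.gz";
  sha256 = lib.fakeSha256;  # placeholder
};

# After: the structured fetcher, which is easier to review and for bots to
# rewrite mechanically on the next update
src = fetchFromGitHub {
  owner = "example";
  repo = "project";
  rev = "v${version}";
  sha256 = lib.fakeSha256;  # placeholder
};
```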
Some comments on the above thread:
Based on experience, I’d advocate against having auto-merge bots. Potential
security vulnerabilities and complexity aside, we’ve had multiple issues
where the nixpkgs-update bot generated garbage that still passed CI and was
only caught by human code review (huge thanks to @jonringer and @marsam for
reviewing+merging a ton of my PRs, btw). I’d argue we should focus the bulk
of our effort on PR generation / PR validation with CI, at least initially,
because it’s lower risk and higher RoI.
The ability to farm out temporary OfBorg capacity would be fantastic. Much of
the above came out of me initially writing my own scripts for the Rust application migration to fetchCargoTarball (NixOS/nixpkgs issue #79975). I burnt out the OfBorg bot by
going too fast, then learned from @grahamc how nixpkgs-update does rate
limiting and implemented it, which got me thinking about updating nixpkgs-update to make these kinds of updates easier in the future. Once I
added rate limiting, the script spent most of its time sleeping with 72 cores
idle waiting for OfBorg to catch up. It would have been nice to connect that
capacity temporarily for PR builds. Capacity on the AWS spot market is extremely cheap, but it requires OfBorg to support a dynamic build farm
gracefully.
I agree these should never make their way into the official binary cache for
security reasons, of course.
At the end of the day our humans are our most valuable resource, and probably
the greatest risk to our ecosystem is that we burn out key maintainers by
having them do too much repetitive, grindy work that should have been
automated. To the extent that we can make their lives easier with heavy PR
CI validation / nix derivation sanitizing we should prioritize it! No novel
point here; just reiterating that I think improvements suggested in this
thread are valuable and important for the long-term health of NixPkgs.
One could consider an opt-in auto-merge bot for some cases; then having great tests for a package would actually start paying off…
I mean, human review is also not infallible, and for some packages we could end up with comparable risk and lower effort (and tests-as-demos as a bonus).
I agree, and I think we do not need full auto-merging without human review.
It would be totally sufficient if for every PR, a committer looks at the diff to check that a bot didn’t screw up there, and sets my proposed nonmalicious-checked label.
This workflow is still extremely low-overhead; it would probably come down to around 30 seconds per PR on average, so it’d allow 1000s of PRs per day.
OK, from a first try today, the nixpkgs-reviewing bot is awesome.
I have merged 9 PRs in quick succession and approved or commented on another 7, all within 1 hour.
And that included proper process, that is, getting the built outputs on my machine, running some of them, reading changelogs (commit history for some) and checking the nixpkgs diffs.
It would have gone even faster had I only looked at the ones that the bot identified as obviously unproblematic, so I think it’s extra good for getting easy changes in.
I’ve put some requests to make it even smoother into the announcement thread.
Another update from the last couple weeks. The nixpkgs-update bot already
automatically detects that a package needs updating, automatically updates the
nix expression and recomputes shas, automatically makes the git commits and
sends PRs, automatically builds/tests the package and the full closure of its
reverse dependencies with nixpkgs-review, and automatically triggers the OfBorg CI
on other platforms to run any NixOS tests. So what’s left?
One of the remaining parts that I’ve found most time-consuming is going and
hunting down the package’s changelog to see if there’s anything I should
investigate more thoroughly before merging. To smooth that process, nixpkgs-update will now compute and link meta.changelog in the PR
description, like so:
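For packagers who want their updates to benefit from this, meta.changelog is an ordinary meta attribute; a minimal sketch follows, with a made-up URL pattern for illustration.

```nix
meta = with lib; {
  # nixpkgs-update can pick this up and link the release notes in its PR description.
  changelog = "https://github.com/example/project/blob/v${version}/CHANGELOG.md";
  # description, homepage, license, maintainers, etc.
};
```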