There seems to be a bottleneck in the nixpkgs PR workflow, caused by limited availability of aarch64-darwin builders: builds on that platform seem to take multiple days right now. I don’t have exact data to support this, as the monitoring Grafana linked from the NixOS/ofborg GitHub repository (https://monitoring.ofborg.org/dashboard/db/ofborg) seems to be empty, but I’m sure most people reading this are aware of the problem.
Since I couldn’t find any discussion related to this specific topic I’d like to start a thread about this issue. The initial questions I have would be:
What do the p95 queue times look like for each platform?
What does the current builder infrastructure look like?
Is there a pattern (hours/weekdays) for when most build requests are happening?
What would be the options for improving the situation? (Spin up Scaleway machines in order to do more builds, ask people to run builder nodes on their old M1 Mac Minis, etc.)
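For the hours/weekdays question, here is a rough sketch of how build requests could be bucketed, assuming someone is able to export the timestamps at which builds were requested. The function name and the input format (ISO 8601 strings) are made up for illustration; nothing here comes from ofborg itself.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def requests_by_weekday_hour(timestamps):
    """Count build requests per (weekday, hour) bucket.

    `timestamps` is a hypothetical export: an iterable of ISO 8601
    strings such as "2024-11-04T09:15:00".
    """
    counts = Counter()
    for ts in timestamps:
        t = datetime.fromisoformat(ts)
        counts[(DAYS[t.weekday()], t.hour)] += 1
    return counts
```

Calling `requests_by_weekday_hour(...).most_common(5)` would then list the five busiest weekday/hour slots.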
To me this seems like quite a high-priority issue, as solving it would cut the waiting time for checks to complete on PRs, possibly from multiple days to just hours.
(Adding a separate ping to @cole-h, as you seem to be quite active on the ofborg repo these days.)
There are 3 x86 and 2 arm64 builders. One way to help is marking more things on darwin as broken, because then stuff doesn’t unnecessarily get rebuilt again and again. Otherwise, the infra and ofborg team is looking for help. Check out the infrastructure Matrix channel.
Maybe move one of the hydra.nixos.org builders here? As hydra.nixos.org is bottlenecked by central pieces all the time, I don’t think we’ll see a significant change in throughput there.
So just to double check I’ve understood this correctly, is the solution for people with aarch64-darwin hanging builds (this one has been waiting for 2 days) to mark aarch64-darwin as broken and try again?
No, marking the supported platforms and the broken status accurately in more packages reduces the number of packages ofborg tries to build needlessly, thus reducing the queue.
https://zh.fail/ can be used to find those when it’s not a new package.
Basically, some people treat darwin as a blocker, but nothing currently enforces that. Darwin is not in a good state of support, and aarch64-darwin is in the worst state. So either you wait, or sometimes reviewers will ignore the aarch64-darwin build result. (Of course you can’t rely on this.) Still, I think now that ofborg is taking >3 days it’s less likely to be treated as a blocker.
I have planned to try and solve Darwin ofborg for the 25.05 cycle. I believe the resources are there to upgrade the hardware and make sure it’s not wasting all its cycles timing out on LLVM builds. However there is some preliminary discussion now about potentially redoing the ofborg structure entirely so I will await the outcome of that before moving forward with this.
In the meantime your best options are to not wait for Darwin results or, if you want someone to test it on the platform, ping @NixOS/darwin-maintainers or build it on the community builder.
Also, aarch64-darwin is in a much better state than x86_64-darwin… (thankfully that will also change in 25.05 thanks to the 11.3 bump)
In my experience (from a couple months ago), x86_64-darwin would take up to 1 day, certainly slow, but aarch64-darwin would take 3+ days. I don’t have concrete stats on this, I’ve no idea how to even collect that.
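On collecting stats: if the queued and started times of each check could be exported somehow, the p95 per platform is not hard to compute. The following is only a minimal sketch under that assumption. The record layout `(platform, queued_at, started_at)` and the function name are hypothetical, and it uses the simple nearest-rank definition of a percentile.

```python
from collections import defaultdict
from datetime import datetime


def p95_queue_hours(records):
    """Return {platform: p95 queue time in hours}.

    `records` is a hypothetical export: an iterable of
    (platform, queued_at, started_at) tuples with ISO 8601 timestamps.
    """
    waits = defaultdict(list)
    for platform, queued_at, started_at in records:
        delta = datetime.fromisoformat(started_at) - datetime.fromisoformat(queued_at)
        waits[platform].append(delta.total_seconds() / 3600)

    result = {}
    for platform, hours in waits.items():
        hours.sort()
        # Nearest-rank percentile: rank = ceil(0.95 * n), in integer math.
        rank = (95 * len(hours) + 99) // 100
        result[platform] = hours[rank - 1]
    return result
```

With only a handful of samples per platform the p95 is just the slowest build, so this only becomes meaningful over a few weeks of data.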
Good conversation so far, thanks everyone for participating. (I probably need to get a Matrix setup going at some point, seems like a lot of discussions are on that side these days. Though I do also like this being discussed on the public Nix Discourse, tbh.)
Some thoughts on the discussion so far:
Marking builds accurately as broken on platforms would indeed reduce stress on the build infra, but wouldn’t fix the underlying issue itself. Also, without good data to support the case, it’s hard to say how much it would even affect the queues.
The discussion about which of aarch64-darwin or x86_64-darwin is better supported is somewhat off-topic. I think the confusion arises because when people talk about a platform being supported they mean how many packages are able to build, how recent the software on the builders is, etc., while this thread is specifically about build queues. Before posting I went through roughly one week of PRs that are not yet merged, and the clear pattern was that while x86_64-darwin sometimes took almost a day to execute, pretty much all the PRs that still had checks waiting were waiting for aarch64-darwin and had been for days.
Merging a package while an aarch64-darwin build is still queued might work, but might even add to the problem. By not actually checking whether the package builds properly on a platform, you might’ve added one more broken package to the list of builds, exactly the opposite of what was suggested in the replies in this thread. It also makes things harder for maintainers merging PRs, as you can’t do simple things like look for approved PRs that have all checks passed.
Thankfully, it does sound like there are plans to improve the situation based on some of the replies. I do really like the Nix build ecosystem, but while a fast and reliable build system should make the PR process flow naturally for both contributors and maintainers, in its current state this specific issue is harming both sides.