OfBorg aarch64-darwin builds causing bottleneck

errnoh · October 31, 2024, 1:51pm

There seems to be a bottleneck in the nixpkgs PR workflow, caused by the limited availability of aarch64-darwin builders, whose checks seem to take multiple days to complete right now. I don’t have exact data to support this, as the monitoring Grafana linked from the NixOS/ofborg GitHub repository (https://monitoring.ofborg.org/dashboard/db/ofborg) seems to be empty, but I’m sure most people reading this are aware of the problem.

Since I couldn’t find any discussion related to this specific topic I’d like to start a thread about this issue. The initial questions I have would be:

What do the p95 queue times look like for each platform?
What does the current builder infrastructure look like?
Is there a pattern (hours/weekdays) for when most build requests are happening?
What would be the options for improving the situation? (Spin up scaleway machines in order to do more builds, ask people to run builder nodes on their old M1 Mac Minis, etc)
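
For the first question, here is a minimal sketch of how p95 queue times per platform could be computed, assuming queue data could be exported as (platform, queued-at, started-at) records. That record shape, the function names and the sample values are all made up for illustration; I don't know of any such export from ofborg today.

```python
import math
from collections import defaultdict
from datetime import datetime


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank into the sorted list
    return ordered[rank - 1]


def queue_p95_by_platform(records):
    """Return {platform: p95 queue time in hours}.

    `records` is a hypothetical iterable of
    (platform, queued_at, started_at) tuples with ISO 8601 timestamps.
    """
    waits = defaultdict(list)
    for platform, queued_at, started_at in records:
        delta = datetime.fromisoformat(started_at) - datetime.fromisoformat(queued_at)
        waits[platform].append(delta.total_seconds() / 3600)
    return {platform: p95(hours) for platform, hours in waits.items()}


# Made-up sample data, not real ofborg measurements.
sample = [
    ("aarch64-darwin", "2024-10-28T10:00:00", "2024-10-31T10:00:00"),
    ("aarch64-darwin", "2024-10-29T10:00:00", "2024-10-31T10:00:00"),
    ("x86_64-linux", "2024-10-31T10:00:00", "2024-10-31T10:30:00"),
]
print(queue_p95_by_platform(sample))
```

Grouping the same records by the hour or weekday of `queued_at` instead of by platform would give a rough answer to the question about request patterns.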

To me this seems like quite a high-priority issue, as solving it could cut the waiting time for checks to complete on PRs from multiple days to just hours.

(adding a separate ping to @cole-h, as you seem to be quite active on the ofborg repo these days)

Mic92 · October 31, 2024, 1:55pm

There are 3 x86 and 2 arm64 builders. One way to help is marking more things on darwin as broken, because then stuff doesn’t unnecessarily get rebuilt again and again. Otherwise, the infra and ofborg teams are looking for help. Check out the infrastructure Matrix channel.

vcunat · October 31, 2024, 2:03pm

Maybe move one of the hydra.nixos.org builders here? As hydra.nixos.org is bottlenecked by central pieces all the time, I don’t think we’ll see a significant change in throughput there.

hexa · October 31, 2024, 2:42pm

We are planning to move hydra to a bigger host. Just not before the release is out.

I got started here: build: init mimas by mweinelt · Pull Request #501 · NixOS/infra · GitHub

Sporeray · November 1, 2024, 2:51am

So just to double-check I’ve understood this correctly: is the solution for people with hanging aarch64-darwin builds (this one has been waiting for 2 days) to mark aarch64-darwin as broken and try again?

Artturin · November 1, 2024, 3:03am

No, marking the supported platforms and the broken status accurately in more packages reduces the number of packages ofborg tries to build needlessly, thus shortening the queue.

https://zh.fail/ can be used to find those when it’s not a new package.

Sporeray · November 1, 2024, 3:04am

Ah ok, so best course of action for now is to just wait? (I don’t currently know if it builds for aarch64-darwin)

waffle8946 · November 1, 2024, 1:43pm

Basically, some people treat darwin as a blocker, but nothing currently enforces that. It’s not in a good state of support, and aarch64-darwin is in the worst state. So either you wait, or sometimes reviewers will ignore the aarch64-darwin build result. (Of course you can’t rely on this.) Still, I think now that ofborg is taking >3 days it’s less likely to be treated as a blocker.

Atemu · November 1, 2024, 1:53pm

Darwin in general for “important” packages, yes, but outside of that only insofar as requiring packages to be marked as broken.

[citation needed]

IME aarch64-darwin is better supported than x86_64-darwin these days.

We’ve been ignoring it for months now. A queued aarch64-darwin ofBorg build is not signal, it’s noise.

emily · November 1, 2024, 2:22pm

I have planned to try and solve Darwin ofborg for the 25.05 cycle. I believe the resources are there to upgrade the hardware and make sure it’s not wasting all its cycles timing out on LLVM builds. However there is some preliminary discussion now about potentially redoing the ofborg structure entirely so I will await the outcome of that before moving forward with this.

In the meantime your best options are to not wait for Darwin results or, if you want someone to test it on the platform, ping @NixOS/darwin-maintainers or build it on the community builder.

Also, aarch64-darwin is in a much better state than x86_64-darwin… (thankfully that will also change in 25.05 thanks to the 11.3 bump)

waffle8946 · November 1, 2024, 3:28pm

Which build generally completes before a PR merges, and which one doesn’t?

Pretty hard to argue there’s good support for a platform that we’re merging unbuilt PRs against.

emily · November 2, 2024, 11:03pm

Neither? x86_64-darwin doesn’t schedule meaningfully better than aarch64-darwin in my experience. They’re both using the same Macs, AFAIK.

waffle8946 · November 3, 2024, 2:11am

In my experience (from a couple months ago), x86_64-darwin would take up to 1 day, certainly slow, but aarch64-darwin would take 3+ days. I don’t have concrete stats on this, I’ve no idea how to even collect that.

errnoh · November 3, 2024, 12:43pm

Good conversation so far, thanks everyone for participating. (I probably need to get a Matrix setup going at some point, as it seems like a lot of discussions are on that side these days. Though I do also like this being discussed on the public Nix Discourse, tbh.)

Some thoughts on the discussion so far:

Accurately marking builds as broken on platforms would indeed reduce stress on the build infra, but wouldn’t fix the underlying issue itself. Also, without good data to support the case, it’s hard to say how much it would even affect the queues.
The discussion about which of aarch64-darwin and x86_64-darwin is better supported is somewhat off-topic. I think the reason for that confusion might be that when talking about a platform being supported, people mean how many packages are able to build, how recent the software on the builders is, etc., while this thread is specifically about build queues. Before posting I went through roughly one week of PRs that are not yet merged, and the clear pattern was that while x86_64-darwin sometimes took almost a day to execute, pretty much all the PRs that still had checks waiting were waiting for aarch64-darwin and had been for days.
Merging a package while an aarch64-darwin build is still queued might work, but might even add to the problem. By not actually checking whether the package builds properly on a platform, you might have added one more broken package to the list of builds, which is the exact opposite of what was suggested in the replies in this thread. It also makes it harder for maintainers to merge PRs, as you can’t do simple things like look for approved PRs that have all checks passed.

Happily, it does sound like there are plans to improve the situation, based on some of the replies. I do really like the Nix build ecosystem, but while a fast and reliable build system should make the PR process flow naturally for both contributors and maintainers, in its current state this specific issue is harming both sides.

michaelglass · November 13, 2024, 10:19pm

Is https://nix.ci/stats.php up to date? Could it be that ofborg is hanging (like in this issue)?