I’m curious about the causes of mass aborts in hydra from time to time, as well as restarts that I tend to see happening. I haven’t seen a pattern to them exactly, so is it a manual decision? Or does something else trigger aborts of all the jobs in an evaluation?
This recent evalution of trunk-combined failed because of a mass abort: https://hydra.nixos.org/eval/1622920. From what I can tell, it likely would have succeeded otherwise.
Perhaps this it is to free up resources so other jobsets can proceed, but I’m just speculating and would appriciate any insight.
Thanks for the reply, I hadn’t seen that thread.
I guess the reason it’s maybe effecting
trunk-combined is that there are shared jobs between the various builds but cancelling an evaluation cancels the jobs in all the other evaluations.
Yes, this is a bit unfortunate on how the aborting is implemented.
I wonder, would it be desirable for hydra to automatically retry jobs that end up being shared when a new evaluation starts? So while you might cancel a staging evaluation, if there are still jobs that trunk-combined shares when it evaluates next, only those shared jobs would get queued again.
I imagine though if the cancellations are primarily about freeing up build resources, this might have opposite of the desired outcome.
I can’t make any promises about delivering, as I may be out of my depth, but I’d be happy to see if thats something I could contribue to hydra if it’s a desired behaviour.
Each evaluation has a button that restarts all aborted jobs (failed ones are unaffected).
These abortions you pointed out weren’t manual. They were a kind of failure, actually: https://github.com/NixOS/nixos-org-configurations/issues/128