I’m curious about what causes the mass aborts I see in Hydra from time to time, as well as the restarts that tend to follow. I haven’t noticed a clear pattern to them, so is it a manual decision? Or does something else trigger aborts of all the jobs in an evaluation?
This recent evaluation of trunk-combined failed because of a mass abort: https://hydra.nixos.org/eval/1622920. From what I can tell, it likely would have succeeded otherwise.
Perhaps this is to free up resources so other jobsets can proceed, but I’m just speculating and would appreciate any insight.
I guess the reason it may be affecting trunk-combined is that jobs are shared between the various jobsets, so cancelling one evaluation also cancels those jobs in every other evaluation that includes them.
I wonder, would it be desirable for Hydra to automatically retry cancelled jobs that turn out to be shared when a new evaluation starts? So while you might cancel a staging evaluation, if trunk-combined still shares some of those jobs when it next evaluates, only those shared jobs would get queued again.
I imagine, though, that if the cancellations are primarily about freeing up build resources, this might have the opposite of the desired outcome.
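To make the retry idea a bit more concrete, here is a rough Python sketch of the selection step I have in mind. It is not based on Hydra's actual code or data model: the function name `jobs_to_requeue`, the idea of identifying builds by derivation path, and the example store paths are all hypothetical and only there for illustration.

```python
def jobs_to_requeue(aborted_drvs, new_eval_drvs):
    """Return the aborted derivations that a new evaluation still needs.

    aborted_drvs:  derivation paths whose builds were cancelled earlier
    new_eval_drvs: derivation paths referenced by the new evaluation
    Both arguments are hypothetical inputs, not real Hydra structures.
    """
    # Only builds that were aborted AND are still wanted get queued again;
    # aborted builds that no current evaluation references stay cancelled.
    return sorted(set(aborted_drvs) & set(new_eval_drvs))


# Hypothetical example: a staging evaluation was cancelled, aborting three builds.
aborted = [
    "/nix/store/aaa-foo.drv",
    "/nix/store/bbb-bar.drv",
    "/nix/store/ccc-baz.drv",
]

# The next trunk-combined evaluation shares two of them and adds one new build.
trunk_combined = [
    "/nix/store/aaa-foo.drv",
    "/nix/store/ccc-baz.drv",
    "/nix/store/ddd-qux.drv",
]

requeued = jobs_to_requeue(aborted, trunk_combined)
print(requeued)
```

In this sketch only the `foo` and `baz` builds would be queued again; `bar` stays cancelled because nothing in the new evaluation needs it.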
I can’t make any promises about delivering, as I may be out of my depth, but I’d be happy to see if that’s something I could contribute to Hydra if it’s a desired behaviour.