What causes Hydra Aborts?

I’m curious about what causes the mass aborts that happen in Hydra from time to time, as well as the restarts I tend to see. I haven’t noticed a clear pattern to them, so is it a manual decision? Or does something else trigger aborts of all the jobs in an evaluation?

This recent evaluation of trunk-combined failed because of a mass abort: https://hydra.nixos.org/eval/1622920. From what I can tell, it likely would have succeeded otherwise.

Perhaps this is to free up resources so other jobsets can proceed, but I’m just speculating and would appreciate any insight.


Thanks for the reply, I hadn’t seen that thread.

I guess the reason it may be affecting trunk-combined is that jobs are shared between the various builds, so cancelling one evaluation also cancels those jobs in all the other evaluations.


Yes, it’s a bit unfortunate how the aborting is implemented.


I wonder, would it be desirable for Hydra to automatically retry jobs that turn out to be shared when a new evaluation starts? So while you might cancel a staging evaluation, if trunk-combined still shares some of those jobs when it next evaluates, only those shared jobs would get queued again.

I imagine, though, that if the cancellations are primarily about freeing up build resources, this might have the opposite of the desired outcome.

I can’t make any promises about delivering, as I may be out of my depth, but I’d be happy to see if that’s something I could contribute to Hydra if it’s a desired behaviour.
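
To make the retry idea concrete, here is a rough sketch in Python. It is only a toy model and not Hydra's actual code or data model: `Build`, `cancel_evaluation`, and `requeue_shared` are hypothetical names, evaluations are modelled as plain sets of derivation names, and the statuses are simplified to strings.

```python
from dataclasses import dataclass


@dataclass
class Build:
    """Toy stand-in for a build; status is 'queued' or 'aborted' (simplified)."""
    drv: str
    status: str = "queued"


def cancel_evaluation(eval_jobs, builds, eval_id):
    """Abort every queued build the evaluation references, shared or not.

    This models the behaviour described in the thread, where cancelling one
    evaluation also takes down builds that other evaluations depend on.
    """
    for drv in eval_jobs[eval_id]:
        if builds[drv].status == "queued":
            builds[drv].status = "aborted"


def requeue_shared(eval_jobs, builds, new_eval_id):
    """Proposed behaviour: when a new evaluation starts, requeue only the
    aborted builds that this evaluation still references."""
    requeued = []
    for drv in sorted(eval_jobs[new_eval_id]):
        if builds[drv].status == "aborted":
            builds[drv].status = "queued"
            requeued.append(drv)
    return requeued


# Hypothetical example data: two evaluations sharing b.drv and c.drv.
eval_jobs = {
    "staging": {"a.drv", "b.drv", "c.drv"},
    "trunk-combined": {"b.drv", "c.drv", "d.drv"},
}
builds = {drv: Build(drv) for jobs in eval_jobs.values() for drv in jobs}

cancel_evaluation(eval_jobs, builds, "staging")
requeued = requeue_shared(eval_jobs, builds, "trunk-combined")
print(requeued)
```

In this model, cancelling staging aborts its three builds, and the next trunk-combined evaluation requeues only the two it shares, while the staging-only build stays aborted. Whether that is desirable depends on why the cancellation happened in the first place.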


Each evaluation has a button that restarts all aborted jobs (failed ones are unaffected).

The aborts you pointed out weren’t manual. They were a kind of failure, actually; see “broken big-parallel feature on some Hydra machines”, Issue #128 in NixOS/infra on GitHub.
