What causes Hydra Aborts?

wkral · October 30, 2020, 10:52pm

I’m curious about the causes of mass aborts in hydra from time to time, as well as restarts that I tend to see happening. I haven’t seen a pattern to them exactly, so is it a manual decision? Or does something else trigger aborts of all the jobs in an evaluation?

This recent evaluation of trunk-combined failed because of a mass abort: https://hydra.nixos.org/eval/1622920. From what I can tell, it likely would have succeeded otherwise.

Perhaps this is done to free up resources so other jobsets can proceed, but I’m just speculating and would appreciate any insight.

ryneeverett · November 3, 2020, 5:08am

wkral · November 3, 2020, 5:54am

Thanks for the reply, I hadn’t seen that thread.

I guess the reason it’s affecting trunk-combined is that there are shared jobs between the various builds, and cancelling one evaluation cancels those jobs in all the other evaluations too.

FRidh · November 3, 2020, 3:08pm

Yes, this is an unfortunate consequence of how aborting is implemented.

wkral · November 3, 2020, 6:54pm

I wonder, would it be desirable for hydra to automatically retry jobs that end up being shared when a new evaluation starts? So while you might cancel a staging evaluation, if there are still jobs that trunk-combined shares when it evaluates next, only those shared jobs would get queued again.
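
To make the idea concrete, here is a minimal sketch in Python. It is not Hydra's actual code or data model: the `Evaluation` class, the `status` dictionary, the job names, and both functions are hypothetical and only illustrate the proposed behaviour.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Evaluation:
    """Hypothetical stand-in for a jobset evaluation and the jobs it needs."""
    name: str
    jobs: set[str]


def cancel_evaluation(evaluation: Evaluation, status: dict[str, str]) -> None:
    """Abort every queued job of the evaluation, including shared ones.

    This mirrors the behaviour discussed in this thread, where cancelling
    one evaluation also aborts jobs that other evaluations depend on.
    """
    for job in evaluation.jobs:
        if status.get(job) == "queued":
            status[job] = "aborted"


def requeue_shared(new_eval: Evaluation, status: dict[str, str]) -> set[str]:
    """When a new evaluation starts, re-queue only the aborted jobs it needs."""
    requeued = set()
    for job in new_eval.jobs:
        if status.get(job) == "aborted":
            status[job] = "queued"
            requeued.add(job)
    return requeued


# Illustrative job names only.
staging = Evaluation("staging", {"glibc", "gcc", "firefox"})
trunk = Evaluation("trunk-combined", {"gcc", "firefox", "nixos-tests"})

status = {job: "queued" for job in staging.jobs | trunk.jobs}
cancel_evaluation(staging, status)        # also aborts gcc and firefox
requeued = requeue_shared(trunk, status)  # brings back only the shared jobs
```

In this sketch the staging-only job stays aborted, while the jobs trunk-combined still needs go back into the queue.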

I imagine, though, that if the cancellations are primarily about freeing up build resources, this might have the opposite of the desired outcome.

I can’t make any promises about delivering, as I may be out of my depth, but I’d be happy to see if that’s something I could contribute to hydra if it’s a desired behaviour.

vcunat · November 3, 2020, 8:51pm

Each evaluation has a button that restarts all aborted jobs (failed ones are unaffected).

vcunat · November 3, 2020, 8:52pm

These aborts you pointed out weren’t manual. They were a kind of failure, actually: broken big-parallel feature on some Hydra machines · Issue #128 · NixOS/infra · GitHub