My org is outgrowing our single-node Hydra instance and is interested in exploring horizontal scaling options for it. The obvious thing to pursue seems to be the buildMachines config option, but we wouldn’t want to have to reconfigure/restart Hydra just to add or remove builders.
So my questions are:
- Is this indeed the expected model, that you have a static list of potential builders, and then turn them on and off in response to load (or according to a schedule) and let the Hydra leader machine just figure out not to send work to the machines that have gone AWOL?
- Is there a good way to put a NixOS machine into “lame duck” mode, where it will finish what it’s currently working on but not accept any new work? It would be nice not to have to wait until the whole system is completely idle to switch off an overprovisioned node, particularly if we’re bursting into short term AWS spot instances.
- Is there any existing work that’s gone into queue-monitoring hooks for Hydra, or do people pretty much just roll their own for that kind of thing?
There is also services.hydra.buildMachinesFiles.
Not sure if that can do reloading.
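For what it's worth, here is a rough sketch of how an external script could maintain such a file. It assumes the queue runner re-reads the files listed in buildMachinesFiles when they change, which this thread suggests but does not confirm. The `Builder` class, the helper names and the example host are made up for illustration; the line layout follows the Nix machines file format (URI, system, SSH key, max jobs, speed factor, supported features, with `-` for an unset field).

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class Builder:
    """One remote builder (hypothetical helper type, not part of Hydra)."""
    uri: str                       # e.g. "ssh://builder@10.0.0.12" (made-up host)
    system: str = "x86_64-linux"
    ssh_key: str = "-"             # "-" leaves the field unset
    max_jobs: int = 1
    speed_factor: int = 1
    features: tuple = ()           # supported features, e.g. ("kvm",)


def render_machines(builders):
    """Render builders as machines-file text, one builder per line."""
    lines = []
    for b in builders:
        feats = ",".join(b.features) or "-"
        lines.append(
            f"{b.uri} {b.system} {b.ssh_key} {b.max_jobs} {b.speed_factor} {feats}"
        )
    return "".join(line + "\n" for line in lines)


def write_machines(path, builders):
    """Atomically replace `path` so a reader never sees a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".machines-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(render_machines(builders))
        # mkstemp creates the file as 0600; the queue runner's user must read it.
        os.chmod(tmp, 0o644)
        # Rename within the same directory, which is atomic on POSIX filesystems.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

An autoscaling hook could call `write_machines` with the currently live builders each time an instance comes up or is about to go away, with the target path being one of the entries in buildMachinesFiles.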
I think so. At least I wouldn’t know how else to do it.
I assume this would be very slow, because the nix-daemon and presumably Hydra try all builders every time and don’t keep track of machines that cannot be reached.
I imagine you would like to have some kind of proxy service. Since https://nixbuild.net/ does scaling I suppose @rickynils implemented such a thing.
Do we know how this works on hydra.nixos.org? I would have thought the load on there would be quite bursty for dealing with the occasional massive rebuild. How does it handle builders going down occasionally as a high availability system?
@FRidh Thanks for the pointer on nixbuild.net. We briefly evaluated Hercules as well, but found it wasn’t a good fit for us on account of being heavily integrated with GitHub. Nixbuild seems much lower level, which I like a lot, though it not having a North American endpoint is likely a barrier for me.
hydra.nixos.org builders are quite static and don’t change very often.
They are managed through https://github.com/NixOS/nixos-org-configurations/blob/a02a620f56cee88299d479f51676ca3f2a6c4a82/delft/provisioner.nix#L41 and https://github.com/NixOS/nixos-org-configurations/blob/a02a620f56cee88299d479f51676ca3f2a6c4a82/hydra-packet-importer/config-example.json . I am not sure exactly how that works, but it could be working on the fly.
AFAIK builders are up all the time. If they are down for a short amount of time, Hydra just cannot connect to them and does not use them, but that should add some latency to build start times. I haven’t seen anything so far that would gracefully handle changing builders.
Late to the punch here, but thanks for these links. It does indeed look like buildMachinesFiles can be dynamic, as there’s further evidence here that hydra.nixos.org is (or was?) scaling up and down dynamically in AWS.