The Nix worker protocol has served us well enough for almost two decades, but I argue it has not scaled with the load we place on it today. I suggest investigating a replacement protocol that uses the Syndicated Actor model. I’m not yet proposing a timeline for implementation, nor am I announcing that I’m doing it myself. I want to publish my thoughts because I anticipate alternate implementations of Nix, and we should settle as a community on protocol improvements that are future-proofed for interoperability.
What is wrong with the worker protocol
- Custom protocol specified by working code.
- Point-to-point only. No broadcasting of job or machine status.
- The party requesting builds is responsible for selecting workers and job scheduling.
- Hydra is the only open-source worker pool manager.
- Frontends must authenticate to each and every worker, either by socket permissions or with SSH keys.
- All substitution (cache fetching) methods are builtin and cannot be implemented externally.
- Transient build failures not handled gracefully.
- Cannot add additional workers to running builds.
Syndicate as a replacement protocol
I propose the Syndicated Actor model as an alternative protocol. Syndicate is a relatively novel protocol with an epistemological model that I think matches well with what we need. I won’t explain Syndicate in detail, but in brief there are dataspaces that hold assertions of facts. Nix frontend tools assert facts to dataspaces such as “I want this store path available” or “I want a fresh build of this derivation”. Nix workers observe these facts in a dataspace and in turn assert facts like “I am building this derivation”, “I have this many CPUs and this much RAM”, or “I have this store path available at this URI”.
In this model the frontend utilities and workers would communicate using conversations rather than unidirectional streams of commands. This is more complicated in total but most of the work would be handled by a Syndicate library.
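To make the assert-and-observe idea concrete, here is a toy in-memory dataspace in Python. It is only a sketch of the concept and not the Syndicate API: the class `Dataspace`, its methods, and the fact labels `want-build` and `building` are all made up for illustration, and facts are plain tuples rather than schema-defined records.

```python
from collections import defaultdict


class Dataspace:
    """Toy dataspace (hypothetical, not Syndicate): a set of facts plus observers."""

    def __init__(self):
        self.facts = set()
        self.observers = defaultdict(list)  # fact label -> callbacks

    def observe(self, label, callback):
        """Call callback(event, fact) for facts whose first field is `label`."""
        self.observers[label].append(callback)
        # Replay facts that were asserted before this observer arrived.
        for fact in list(self.facts):
            if fact[0] == label:
                callback("asserted", fact)

    def assert_fact(self, fact):
        if fact not in self.facts:
            self.facts.add(fact)
            for callback in list(self.observers[fact[0]]):
                callback("asserted", fact)

    def retract(self, fact):
        if fact in self.facts:
            self.facts.remove(fact)
            for callback in list(self.observers[fact[0]]):
                callback("retracted", fact)


ds = Dataspace()
log = []


def worker(event, fact):
    # A worker answers demand by asserting that it is building,
    # and withdraws that assertion when the demand goes away.
    _, drv = fact
    if event == "asserted":
        ds.assert_fact(("building", "worker-1", drv))
    else:
        ds.retract(("building", "worker-1", drv))


ds.observe("want-build", worker)
ds.observe("building", lambda event, fact: log.append((event, fact)))

# The frontend states what it wants, then loses interest.
ds.assert_fact(("want-build", "example.drv"))
ds.retract(("want-build", "example.drv"))
```

The frontend never addresses the worker directly. It only states what it wants, and anything watching the `building` label (a monitor, a scheduler, another worker) sees the same conversation.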
By focusing on build scheduling I am glossing over what the protocol does in regard to store management, but scheduling is where I think the bottlenecks are.
Some benefits would be:
- Protocol defined by schema.
- Frontends and workers authenticate to a dataspace server rather than to each other (dataspace server would typically be machine-local).
- Monitoring of worker supply and demand.
- Hackable, hot-swappable worker pools, machine-local or remote.
- Work stealing. “A worker with half my resources is building something and I am idle so I will build it also.” Or: “A worker with twice my resources is building what I am building so I will abort.”
- Custom path substituters or fetchers that are independent of the Nix codebase.
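The work-stealing rule in the list above could be written as a small decision function that each worker runs over the facts it observes. This is a sketch under assumptions: the function name `decide`, the use of CPU count as the only measure of resources, and the exact 2x thresholds are illustrative choices taken from the two quoted examples, not part of any existing scheduler.

```python
def decide(my_cpus, my_job, demand, building):
    """Pick an action for one worker (hypothetical rule, for illustration).

    my_job   -- derivation this worker is building, or None if idle
    demand   -- list of derivations that frontends want built
    building -- dict: derivation -> CPU counts of the OTHER workers building it
    """
    if my_job is not None:
        # Abort if some other worker with at least twice my resources
        # is already building the same thing.
        if any(cpus >= 2 * my_cpus for cpus in building.get(my_job, [])):
            return ("abort", my_job)
        return ("continue", my_job)

    # Idle: prefer work that nobody has picked up yet.
    for drv in demand:
        if not building.get(drv):
            return ("build", drv)

    # Otherwise duplicate a job whose builders all have at most half my resources.
    for drv in demand:
        if all(2 * cpus <= my_cpus for cpus in building.get(drv, [])):
            return ("build", drv)

    return ("idle", None)


print(decide(8, None, ["a.drv"], {"a.drv": [4]}))     # idle 8-CPU worker steals from a 4-CPU one
print(decide(4, "a.drv", ["a.drv"], {"a.drv": [8]}))  # 4-CPU worker yields to an 8-CPU one
```

Because every worker sees the same assertions, a rule like this can run locally on each worker without a central scheduler. Whether duplicated builds are worth the wasted cycles is a policy question that this sketch does not try to answer.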
My interest
I found Syndicate because I am interested in operating systems and interprocess communication. I have implemented the protocol and DSL in Nim. Syndicate is funded by NLnet but my work with it is not. I am currently funded and working on ERIS, which I would like to integrate into Nix and Guix, but I would prefer to do this without touching the Nix codebase and implement it as a standalone substitution utility. This is why I am interested in a new protocol that would make extensible path-realization possible. The current ERIS project scope would not cover this amount of work.
Practicalities
I am interested in working on this, but only if there is some community consensus that it would be worthwhile and if I can find the time.
I figure that a stepwise transition would be to translate the worker protocol to a Syndicate protocol for both the evaluator and worker. A translator would listen for frontends at the standard Nix daemon socket location and assert and retract translated commands to a dataspace. Another utility would watch the dataspace and send commands to nix-daemon. Doing this via the current protocol and without touching the current tools would not be fun but it would make regressions obvious.
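The frontend-facing half of that translator might keep books roughly as follows. This is a minimal sketch under several assumptions: the operation names in `TRANSLATION` are placeholders rather than actual worker-protocol opcodes, the store path is a dummy, socket handling and wire parsing are left out entirely, and retracting everything when a frontend disconnects is my guess at reasonable behaviour rather than something the proposal specifies.

```python
from collections import Counter

# Hypothetical mapping from worker-protocol operations to assertion labels.
TRANSLATION = {
    "build-paths": "want-build",
    "ensure-path": "want-path",
}


class Translator:
    """Tracks which frontend connection asserted which translated fact."""

    def __init__(self):
        self.refcount = Counter()  # fact -> number of connections asserting it
        self.by_conn = {}          # connection id -> set of facts it asserted

    def facts(self):
        """Facts currently asserted to the (imaginary) dataspace."""
        return set(self.refcount)

    def on_command(self, conn, op, path):
        # Translate one decoded command into an assertion.
        fact = (TRANSLATION[op], path)
        held = self.by_conn.setdefault(conn, set())
        if fact not in held:
            held.add(fact)
            self.refcount[fact] += 1

    def on_disconnect(self, conn):
        # Retract what this frontend asserted, unless another frontend
        # still asserts the same fact.
        for fact in self.by_conn.pop(conn, set()):
            self.refcount[fact] -= 1
            if self.refcount[fact] == 0:
                del self.refcount[fact]


t = Translator()
t.on_command(1, "build-paths", "/nix/store/example.drv")  # dummy path
t.on_command(2, "build-paths", "/nix/store/example.drv")
t.on_disconnect(1)
print(t.facts())  # still wanted by connection 2
t.on_disconnect(2)
print(t.facts())  # nothing left asserted
```

The reference counting is there because two frontends can want the same path, and the fact should only disappear once the last of them goes away.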
The Syndicate dataspace server already exists and is packaged.
There is a Rust library for Syndicate which could be used to implement a translation layer. I don’t do Rust; I have been working in Nim and I have a partial implementation of the Nix worker protocol. If a translation layer via the worker protocol over sockets works as well as the current protocol then the Nim Syndicate library could be linked with the Nix C++ tools, but it would be messy. I don’t have a clear idea of what a final integration would look like.
I’m curious if any other protocol alternatives have been discussed, and I am happy to start the conversation regardless of the outcome.