A proposal for replacing the Nix worker protocol

The Nix worker protocol has served us well enough for almost two decades but I argue it has not scaled with the load we place on it today. I suggest investigating a replacement protocol that uses the Syndicated Actor model. I’m not yet proposing a timeline for implementation nor am I announcing that I’m doing it myself. I want to publish my thoughts because I anticipate alternate implementations of Nix and we should settle as a community on protocol improvements that are future-proofed for interoperability.

What is wrong with the worker protocol

  • Custom protocol specified by working code.
  • Point-to-point only. No broadcasting of job or machine status.
  • The party requesting builds is responsible for selecting workers and job scheduling.
  • Hydra is the only open-source worker pool manager.
  • Frontends must authenticate to each and every worker, either by socket permissions or with SSH keys.
  • All substitution (cache fetching) methods are builtin and cannot be implemented externally.
  • Transient build failures not handled gracefully.
  • Cannot add additional workers to running builds.

Syndicate as a replacement protocol

I propose the Syndicated Actor model as an alternative protocol. Syndicate is a relatively novel protocol with an epistemological model that I think matches well with what we need. I won’t explain Syndicate in detail, but in brief: there are dataspaces that hold assertions of facts. Nix frontend tools assert facts to dataspaces such as “I want this store path available” or “I want a fresh build of this derivation”. Nix workers observe these facts in a dataspace and in turn assert facts like “I am building this derivation”, “I have this many CPUs and this much RAM”, “I have this store path available at this URI”.
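To make the assert/observe flow concrete, here is a toy sketch of a dataspace in Python. This is not the real Syndicate API, and the fact shapes (`"want-build"` and so on) are hypothetical; it only illustrates the publish/observe conversation style described above.

```python
# Toy dataspace sketch -- NOT the real Syndicate API.
# Fact shapes like ("want-build", drv) are hypothetical illustrations.

from collections import defaultdict

class Dataspace:
    def __init__(self):
        self.assertions = set()
        self.observers = defaultdict(list)  # fact label -> callbacks

    def assert_fact(self, fact):
        """Publish a fact; notify everyone observing its label."""
        self.assertions.add(fact)
        for callback in self.observers[fact[0]]:
            callback(fact)

    def retract(self, fact):
        self.assertions.discard(fact)

    def observe(self, label, callback):
        """Register interest in facts with a given label; replay existing ones."""
        self.observers[label].append(callback)
        for fact in [f for f in self.assertions if f[0] == label]:
            callback(fact)

ds = Dataspace()
builds = []

# A worker observes build requests...
ds.observe("want-build", lambda fact: builds.append(fact[1]))
# ...and a frontend asserts one; the worker sees it immediately.
ds.assert_fact(("want-build", "/nix/store/abc123-hello.drv"))

print(builds)  # -> ['/nix/store/abc123-hello.drv']
```

A real dataspace also tracks who asserted what, so facts vanish when their asserter disconnects; that lifecycle is the part this toy omits.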

In this model the frontend utilities and workers would communicate using conversations rather than unidirectional streams of commands. This is more complicated in total but most of the work would be handled by a Syndicate library.

By focusing on build scheduling I am glossing over what the protocol does for store management, but that is where I think the bottlenecks are.

Some benefits would be:

  • Protocol defined by schema.
  • Frontends and workers authenticate to a dataspace server rather than to each other (dataspace server would typically be machine-local).
  • Monitoring of worker supply and demand.
  • Hackable, hot-swappable worker pools, machine-local or remote.
  • Work stealing. “A worker with half my resources is building something and I am idle so I will build it also.” - “A worker with twice my resources is building what I am building so I will abort.”
  • Custom path substituters or fetchers that are independent of the Nix codebase.
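The work-stealing heuristic above could be as simple as comparing advertised resources. A hypothetical sketch (the 2× thresholds and function names are mine, not part of any protocol):

```python
# Hypothetical work-stealing heuristic; thresholds are illustrative only.

def should_join(my_resources, peer_resources):
    """Join a running build when the current builder is much weaker than us."""
    return my_resources >= 2 * peer_resources

def should_abort(my_resources, peer_resources):
    """Abort our own build when a much stronger peer is building the same thing."""
    return peer_resources >= 2 * my_resources

assert should_join(my_resources=16, peer_resources=8)      # half my resources: build it too
assert should_abort(my_resources=8, peer_resources=16)     # twice my resources: back off
assert not should_abort(my_resources=8, peer_resources=8)  # comparable peers: keep going
```

The point is that because resource facts are visible in the dataspace, such policies can live in the workers themselves rather than in a central scheduler.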

My interest

I found Syndicate because I am interested in operating systems and interprocess communication. I have implemented the protocol and DSL in Nim. Syndicate is funded by NLnet but my work with it is not. I am currently funded and working on ERIS, which I would like to integrate into Nix and Guix, but I would prefer to do this without touching the Nix codebase and implement it as a standalone substitution utility. This is why I am interested in a new protocol that would make extensible path-realization possible. The current ERIS project scope would not cover this amount of work.

Practicalities

I am interested in working on this, but not unless there is some community consensus that it would be worthwhile, and only if I can find the time.

I figure that a stepwise transition would be to translate the worker protocol to a Syndicate protocol for both the evaluator and worker. A translator would listen for frontends at the standard Nix daemon socket location and assert and retract translated commands to a dataspace. Another utility would watch the dataspace and send commands to nix-daemon. Doing this via the current protocol and without touching the current tools would not be fun, but it would make regressions obvious.
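As a sketch of what the translator’s job amounts to: decode a worker-protocol operation and emit the corresponding dataspace assertions. The operation names below resemble real worker-protocol operations, but the framing and the fact shapes are illustrative assumptions, not a spec.

```python
# Illustrative translation from worker-protocol operations to dataspace
# assertions. Operation names and fact shapes are assumptions, not a spec.

def translate(op, args):
    """Translate one decoded worker-protocol operation into (action, fact) pairs."""
    if op == "BuildPaths":
        return [("assert", ("want-build", drv)) for drv in args]
    if op == "EnsurePath":
        return [("assert", ("want-path", args[0]))]
    if op == "QueryPathInfo":
        return [("assert", ("want-path-info", args[0]))]
    raise ValueError(f"unhandled operation: {op}")

facts = translate("BuildPaths", ["/nix/store/abc-hello.drv"])
print(facts)  # -> [('assert', ('want-build', '/nix/store/abc-hello.drv'))]
```

The companion utility would do the inverse: observe these facts, drive nix-daemon over the existing protocol, and assert result facts back for the translator to encode as replies.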

The Syndicate dataspace server already exists and is packaged.

There is a Rust library for Syndicate which could be used to implement a translation layer. I don’t do Rust; I have been working in Nim, and I have a partial implementation of the Nix worker protocol. If a translation layer via the worker protocol over sockets works as well as the current protocol, then the Nim Syndicate library could be linked with the Nix C++ tools, but it would be messy. I don’t have a clear idea of what a final integration would look like.

I’m curious whether any other protocol alternatives have been discussed, and I am happy to start the conversation regardless of the outcome.

9 Likes

Bazel’s REAPI may be a consideration too, as it would give some amount of interoperability with an existing ecosystem of build tools.

It’s in principle designed for build tooling, but it’s very Bazel-focused, so it might not be the best fit in the world. It has its own CAS-based caching protocol, for example, which I imagine would not work well with Nix. But perhaps the execution API would work by itself, with some clever management on the build-server end.

Does SAM interoperate nicely with one of the more common declarative proto spec DSLs, like for example protobuf?

Or does the implementation entirely rely on SDKs?

The answer to this might be significant to gauge the degree of practical interoperability.

EDIT: here is the other one: https://capnproto.org/

Does SAM interoperate nicely with one of the more common declarative proto spec DSLs, like for example protobuf?

SAM uses the Preserves data language, which is a self-describing protocol that also supports schemas. It’s a superset of Nix values, JSON, CBOR, and maybe protobuf. I don’t work at Google so I don’t know what protobuf is actually good at. The schemas serve as documentation and to generate types or procedures for the language you are using the DSL with, so that you have consistent message formatting.

Self-describing doesn’t mean that you can implement a complex SAM protocol by implementing the data language and then just doing some stuff with values you get off the wire. Most of the effort in implementation is in the DSL semantics and code-generators.


REAPI is new to me, looks interesting.

Thanks!

I actually have trouble parsing “DSL”, it doesn’t make sense in the classical use of the term, even when looking at Syndicate DSL Syntax and Semantics—Syndicated Actors

Does the project mean SDK or do they provide per-language transpilers that come from a common DSL? Or do they refer to DSL when they mean “SDK + Preserves”?

Preserves

So I kind of understand that Preserves serializes to whatever you want (e.g. whatever is fastest)? What serialization did SAM choose (if it did)?

Going through the upstream docs, I’m getting slightly concerned about documentation / knowledge accessibility. Is the project aware that its docs aren’t really accessible for adopters?

EDIT:

Since SAM prides itself on service discovery & scheduling, are there already any schedulers that would adopt it, in the likes of slurm / yunikorn / mesos / nomad / k8s?

I actually have trouble parsing “DSL”, it doesn’t make sense in the classical use of the term, even when looking at Syndicate DSL Syntax and Semantics—Syndicated Actors. Does the project mean SDK or do they provide per-language transpilers that come from a common DSL?

A Syndicate library comes with domain-specific verbs for reactive programming that are different from what you would normally do with a language. A reactive refactor of libnix would mean drastic changes, which is why I think translating the current protocol is a better initial strategy for a transition.

Going through the upstream docs, I’m getting slightly concerned about documentation / knowledge accessibility. Is the project aware that its docs aren’t really accessible for adopters?

In my experience everyone who has more code to write than they have time is aware that their documentation is bad.

2 Likes

Something that I neglected to mention is that it is possible to do authentication and message routing with the Syndicate server and its configuration language. This would offload much of the work of configuring worker pools. I use the scripting language on my machine for various things already.

1 Like

I share @blaggacao’s concern. Nix is already the too-much-code-not-enough-window-dressing system we are now attempting to make more approachable. Adopting an upstart system that hasn’t proven itself for something as fundamental as a protocol description sounds counter to the goals right now.

Having said that, improving and standardizing the protocol is, well, in line with those same goals, and maybe I’m missing some important background information needed to properly appraise your proposal.

3 Likes

I think it’s good to be really concrete on the short-term goals.

  1. https://github.com/NixOS/nix/pull/6223 separates the serialization of the multiple protocols we have today, so that the composable bits can handle versioning, or versioning can be done out of band. Something like this has to be merged.

    I would appreciate suggestions on how to deal with the remaining concerns; it might be best to just duplicate the code I share with C++, for example?!

  2. We need to deprecate the legacy protocol: https://github.com/NixOS/nix/issues/4665

  3. I agree the fancier stuff @ehmry is talking about probably shouldn’t live in tree from day one, but if we improve our modularity we can better allow experiments to live out of tree.

    Ultimately @ehmry is right: the current way of distributing builds is embarrassingly bad, and this is holding back Nix in institutional settings.

6 Likes

I really welcome discussions and proposals about the current and future Nix worker protocols! During my work on https://nixbuild.net, two major observations I’ve made around the worker protocols are these:

  • The short-term goals mentioned by @Ericson2314 (refactor/document the current serialisation code, remove the legacy protocol etc) are probably crucial to do before starting to work on implementing a better/smarter worker protocol inside Nix itself.

  • If you disregard the current lack of docs/specs, and the ad-hoc/organic feel that the current worker protocol(s) suffer from, you are actually not very limited at all by today’s protocol. As a proof of this, I would say nixbuild.net solves almost all of the things that @ehmry lists under “What is wrong with the worker protocol”, without changing anything in Nix.

@ehmry actually suggests an approach that is very similar to the one we’ve taken in nixbuild.net (we’re not using anything like Syndicate, though):

I figure that a stepwise transition would be to translate the worker protocol to a Syndicate protocol for both the evaluator and worker. A translator would listen for frontends at the standard Nix daemon socket location and assert and retract translated commands to a dataspace. Another utility would watch the dataspace and send commands to nix-daemon. Doing this via the current protocol and without touching the current tools would not be fun, but it would make regressions obvious.

Maybe a viable approach would be to clean up the current state of affairs, and keep the worker protocol simple but flexible enough to allow interesting work on scheduling, worker discovery etc etc happen externally to Nix? Or perhaps we then have to compromise too much?

10 Likes

Yeah I’ve talked to @rickynils before and we are basically in 100% agreement on this stuff.

Anyone else who cares about these small incremental improvements, please chime in on the associated issues and PRs. I need help convincing @edolstra and others that people care about this stuff, and not just the more directly user-facing things like CLI/Flakes/etc…

5 Likes

If I may contextualize this a bit, the worker protocol is something that users aren’t aware of until they are in the deep end of Nix, and that is how things should stay regardless of protocol iteration. Also, Nix was successful not because of documentation and usability, but because the people that read the NixOS thesis paper were able to solve problems at a faster rate than everyone else. An alternate public narrative might be useful but the internals are still the internals and the developers are still the developers.

Back on topic, REAPI should be a consideration. It would be practical because it does what we do and the protocol design and documentation is not our responsibility.

I see two options which are not mutually exclusive, a protocol that is explicitly for executing build jobs, and a protocol/language that is expressive enough for build jobs and comes with semantics that enable emergent features.

At the risk of over-complicating things, it would be nice to have documentation of an abstract model of where the separation between the frontends and the nix-daemon is, and how store paths are realized with substitutions or builds.

3 Likes

At the risk of over-complicating things, it would be nice to have documentation of an abstract model of where the separation between the frontends and the nix-daemon is, and how store paths are realized with substitutions or builds.

If you didn’t see it already, you might like https://github.com/NixOS/nix/pull/6877 which is about an “abstract model” of the store layer.

1 Like

A clarification I need.

Are we talking about the protocol or protocol + scheduling?

If scheduling is part of the concern, then vendoring some off-the-shelf scheduler solutions should be our first choice to evaluate.

I think it is equally bad for Nix adoption in institutional settings to perpetuate the “myth of the niche” (a covert form of NIH?). Interoperability is important and if we can at least shim in readily available schedulers, such as slurm / Apache YuniKorn * or others, that’s a valuable interface to produce.

* despite the marketing, it is my understanding that it’s not k8s-bound and can be shimmed onto other schedulers.

@blaggacao I think I agree, and I think @ehmry does too.

The Nix store layer is largely composed of:

  • Sandboxing, logic for an individual build
  • Scheduling
  • Storage
  • Networking / Exchanging data

For all 4 of these it should be easy to experiment with off-the-shelf replacements.

The better our layering is, the less NIH we get to be!

2 Likes

On a slight tangent: @Ericson2314 You keep repeating „layering“, which appears to have become idiomatic over the years, and it seems a more suitable term would be „well-defined architecture“. That is why I both like @ehmry‘s proposal as well as your brief list here, as it allows thinking separately of principle and implementation. And as long as it’s not written out or drawn anywhere, people will have to synthesize these concepts in their head from whatever the implementation happens to present itself as (and it may have a lot of arbitrary noise) - which may or may not succeed.

3 Likes

Yeah, I’ve long been in the habit of saying “layering”. It is not the perfect word for it, but it does nicely evoke the idea of “peeling up a layer on top” while the layers below still make sense in isolation.

Properly defining one’s architecture is certainly good, but doesn’t require modularity, does it? A convoluted cyclic-dependency thing could still be rigorously defined even if it isn’t modular.

I feel like the term “well defined architecture” encompasses more than just rigor, but also the cleanliness of the abstraction/interface. Could just be a personal quirk of mine, though.

1 Like

Coming a bit late to the party, but I’m very interested in this (as part of this roadmap item in particular). The remote build system is in need of a breath of fresh air imho, and a new protocol could bring a lot (if only by switching to a pull model rather than a push one. Please someone do that!).

For the protocol itself, I’d have a strong preference for REAPI, if only because it’s the closest thing to a standard for remote builds (bringing amongst other things a number of great server implementation and hosted services). And also because (based on some high-level discussions and readings, I could never find the time to look at it closely enough) it’s very nicely designed with performance and extensibility in mind, making it potentially a very good fit for Nix.

Regarding Syndicate, I’m curious how it behaves latency-wise. The “conversational” aspect makes it sound like transactions could take a non-trivial amount of time (which is already an issue with the current protocol afaict). I wonder whether there’s some cleverness to keep things fast.

1 Like

That depends on your definition of a trivial amount of time, but yes, it can take some ping-ponging to get stuff done. To trigger a build at a worker means creating a diff of the conversation state and submitting that to the dataspace server, which then sends a diff of what is relevant to the worker. A diff is not created for every update; changes are collected in the scope of a “turn” and then grouped together. That is all handled automatically, and I would say the tradeoff is that the behavior is more observable and predictable.
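The turn-based batching can be sketched roughly like this; the class and fact shapes are my own illustration of the idea, not the Syndicate library’s actual interface:

```python
# Illustrative sketch of turn-based batching: changes made within a "turn"
# are collected and flushed to the dataspace server as a single diff.
# Not the real Syndicate API; fact shapes are hypothetical.

class Turn:
    def __init__(self, send):
        self.send = send               # callback delivering one diff to the server
        self.added, self.removed = [], []

    def assert_fact(self, fact):
        self.added.append(fact)        # buffered, not sent immediately

    def retract(self, fact):
        self.removed.append(fact)

    def commit(self):
        """Flush all changes from this turn as one grouped diff."""
        if self.added or self.removed:
            self.send({"assert": self.added, "retract": self.removed})

diffs = []
turn = Turn(diffs.append)
turn.assert_fact(("building", "/nix/store/abc-hello.drv"))
turn.assert_fact(("cpu-load", 0.7))
turn.commit()

print(len(diffs))  # -> 1: both changes travel in a single diff
```

So the round trips are per turn, not per assertion, which bounds the chatter even when a worker updates many facts at once.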

All things considered I think the next protocol iteration should be more conservative than Syndicate but it’s still useful for thinking about what features we are missing out on.

Hosted by Flying Circus.