A proposal for replacing the Nix worker protocol

Bazel’s REAPI may be a consideration too, as it would give some amount of interoperability with an existing ecosystem of build tools.

It’s in principle designed for build tooling, but it’s very Bazel-focused, so it might not be the best fit in the world. It has its own CAS-based caching protocol, for example, which I imagine would not work well with Nix. But perhaps the execution API would work by itself, with some clever management on the build server end.

Does SAM interoperate nicely with one of the more common declarative proto spec DSLs, like for example protobuf?

Or does the implementation entirely rely on SDKs?

The answer to this might be significant to gauge the degree of practical interoperability.

EDIT: here is the other one: https://capnproto.org/

Does SAM interoperate nicely with one of the more common declarative proto spec DSLs, like for example protobuf?

SAM uses the Preserves data language, which is self-describing and also supports schemas. It’s a superset of Nix values, JSON, CBOR, and maybe protobuf. I don’t work at Google, so I don’t know what protobuf is actually good at. The schemas serve as documentation and are used to generate types or procedures for the language you are using the DSL with, so that you have consistent message formatting.

Self-describing doesn’t mean that you can implement a complex SAM protocol by implementing the data language and then just doing some stuff with values you get off the wire. Most of the effort in implementation is in the DSL semantics and code-generators.


REAPI is new to me, looks interesting.

Thanks!

I actually have trouble parsing “DSL”, it doesn’t make sense in the classical use of the term, even when looking at Syndicate DSL Syntax and Semantics—Syndicated Actors

Does the project mean SDK or do they provide per-language transpilers that come from a common DSL? Or do they refer to DSL when they mean “SDK + Preserves”?

Preserves

So I kind of understand that Preserves serializes to whatever you want (e.g. the fastest format)? Which serialization did SAM choose (if it did)?

Going through the upstream docs, I’m getting slightly concerned about documentation / knowledge accessibility. Is the project aware that their docs aren’t really accessible for adopters?

EDIT:

Since SAM praises itself for service discovery & scheduling, are there already any schedulers along the lines of slurm / yunikorn / mesos / nomad / k8s that would adopt it?

I actually have trouble parsing “DSL”, it doesn’t make sense in the classical use of the term, even when looking at Syndicate DSL Syntax and Semantics—Syndicated Actors. Does the project mean SDK or do they provide per-language transpilers that come from a common DSL?

A Syndicate library comes with domain-specific verbs for reactive programming that are different from what you would normally do with a language. A reactive refactor of libnix would mean drastic changes, which is why I think translating the current protocol is a better initial strategy for a transition.

Going through the upstream docs, I’m getting slightly concerned about documentation / knowledge accessibility. Is the project aware that their docs aren’t really accessible for adopters?

In my experience everyone who has more code to write than they have time is aware that their documentation is bad.

Something that I neglected to mention is that it is possible to do authentication and message routing with the Syndicate server and its configuration language. This would offload much of the work of configuring worker pools. I already use the scripting language on my machine for various things.

I share @blaggacao’s concern. Nix is already the too-much-code-not-enough-window-dressing system we are now attempting to make more approachable. Adopting an upstart system that hasn’t proven itself for something as fundamental as a protocol description sounds counter to the goals right now.

Having said that, improving and standardizing the protocol is, well, in line with those same goals, and maybe I’m missing some important background information needed to properly appraise your proposal.

I think it’s good to be really concrete on the short-term goals.

  1. Give `nix daemon` and `nix-store --serve` protocols separate serializers with version info by Ericson2314 · Pull Request #6223 · NixOS/nix · GitHub: this separates the serialization of the multiple protocols we have today so that the composable bits can handle versioning; until now, versioning had to be done out of band. Something like this has to be merged.

    I would appreciate suggestions on how to deal with the remaining concerns; it might be best to just duplicate the code I share with C++, for example?!

  2. We need to deprecate the legacy protocol: Deprecate legacy ssh store / `nix-store --serve` · Issue #4665 · NixOS/nix · GitHub

  3. I agree the fancier stuff @ehmry is talking about probably shouldn’t live in tree from day one, but if we improve our modularity we can better allow experiments to live out of tree.

    Ultimately @ehmry is right: the current way of distributing builds is embarrassingly bad, and this is holding back Nix in institutional settings.
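
The idea behind item 1 can be sketched roughly as follows. This is only an illustration of "each protocol gets its own serializer that carries the negotiated version"; the class name `ProtocolSerializer`, the `encode_path_info` method, the version numbers, and the framing are all hypothetical and are not taken from the PR or from the real Nix wire format.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtocolSerializer:
    """Hypothetical serializer bound to one protocol and one negotiated version."""

    protocol: str  # illustrative label, e.g. "daemon" or "serve"
    version: int   # negotiated once, then carried by the serializer itself

    def encode_string(self, value: str) -> bytes:
        # Length-prefixed UTF-8, zero-padded to a multiple of 8 bytes
        # (illustrative framing only).
        raw = value.encode("utf-8")
        padding = (-len(raw)) % 8
        return struct.pack("<Q", len(raw)) + raw + b"\x00" * padding

    def encode_path_info(self, path: str, nar_size: int, ca: str = "") -> bytes:
        out = self.encode_string(path) + struct.pack("<Q", nar_size)
        # A field introduced in a later (made-up) version is only written
        # when the negotiated version says the peer understands it.
        if self.version >= 2:
            out += self.encode_string(ca)
        return out


daemon_v1 = ProtocolSerializer("daemon", 1)
daemon_v2 = ProtocolSerializer("daemon", 2)

v1_bytes = daemon_v1.encode_path_info("/nix/store/example", 1024)
v2_bytes = daemon_v2.encode_path_info("/nix/store/example", 1024)
```

Because the version lives inside the serializer, callers never have to pass it alongside every read or write, which is the "out of band" handling the item wants to get rid of.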

I really welcome discussions and proposals about the current and future Nix worker protocols! During my work on https://nixbuild.net, two major observations I’ve made around the worker protocols are these:

  • The short-term goals mentioned by @Ericson2314 (refactor/document the current serialisation code, remove the legacy protocol etc) are probably crucial to do before starting to work on implementing a better/smarter worker protocol inside Nix itself.

  • If you disregard the current lack of docs/specs, and the ad-hoc/organic feel that the current worker protocol(s) suffer from, you are actually not very limited at all by today’s protocol. As a proof of this, I would say nixbuild.net solves almost all of the things that @ehmry lists under “What is wrong with the worker protocol”, without changing anything in Nix.

@ehmry actually suggests an approach that is very similar to the one we’ve taken in nixbuild.net (we’re not using anything like Syndicate, though):

I figure that a stepwise transition would be to translate the worker protocol to a Syndicate protocol for both the evaluator and worker. A translator would listen for frontends at the standard Nix daemon socket location and assert and retract translated commands to a dataspace. Another utility would watch the dataspace and send commands to nix-daemon. Doing this via the current protocol and without touching the current tools would not be fun but it would make regressions obvious.
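
To make the quoted translator idea a bit more concrete, here is a toy, in-memory sketch of the assert/retract flow. It is not the Syndicate API: the `Dataspace` class, the tuple-shaped facts, and the `sent_to_daemon` list standing in for nix-daemon are all assumptions made for illustration, and a real setup would involve actors, patterns, and a network transport.

```python
from typing import Callable, Hashable, List, Set, Tuple

Fact = Tuple[Hashable, ...]


class Dataspace:
    """Toy dataspace: a set of asserted facts plus observers notified of changes."""

    def __init__(self) -> None:
        self._facts: Set[Fact] = set()
        self._observers: List[Tuple[Callable[[Fact], None], Callable[[Fact], None]]] = []

    def observe(self, on_assert: Callable[[Fact], None],
                on_retract: Callable[[Fact], None]) -> None:
        self._observers.append((on_assert, on_retract))
        # New observers first learn about facts that are already asserted.
        for fact in list(self._facts):
            on_assert(fact)

    def assert_(self, fact: Fact) -> None:
        if fact not in self._facts:
            self._facts.add(fact)
            for on_assert, _ in self._observers:
                on_assert(fact)

    def retract(self, fact: Fact) -> None:
        if fact in self._facts:
            self._facts.remove(fact)
            for _, on_retract in self._observers:
                on_retract(fact)


# Stand-in for "another utility that watches the dataspace and sends
# commands to nix-daemon": here it only records what it would send.
sent_to_daemon: List[Fact] = []

ds = Dataspace()
ds.observe(
    on_assert=lambda fact: sent_to_daemon.append(("start",) + fact),
    on_retract=lambda fact: sent_to_daemon.append(("cancel",) + fact),
)

# Stand-in for the translator: a frontend command becomes an assertion,
# and the frontend disconnecting becomes a retraction.
request: Fact = ("build", "/nix/store/example.drv")
ds.assert_(request)
ds.retract(request)
```

The point of the sketch is that the translator and the daemon-facing utility never talk to each other directly; both only see the shared set of facts.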

Maybe a viable approach would be to clean up the current state of affairs, and keep the worker protocol simple but flexible enough to allow interesting work on scheduling, worker discovery, etc. to happen externally to Nix? Or perhaps we then have to compromise too much?

Yeah I’ve talked to @rickynils before and we are basically in 100% agreement on this stuff.

Anyone else who cares about these small incremental improvements, please chime in on the associated issues and PRs. I need help convincing @edolstra and others that people care about this stuff, and not just the more directly user-facing things like CLI/Flakes/etc…

If I may contextualize this a bit, the worker protocol is something that users aren’t aware of until they are in the deep end of Nix, and that is how things should stay regardless of protocol iteration. Also, Nix was successful not because of documentation and usability, but because the people that read the NixOS thesis paper were able to solve problems at a faster rate than everyone else. An alternate public narrative might be useful but the internals are still the internals and the developers are still the developers.

Back on topic, REAPI should be a consideration. It would be practical because it does what we do and the protocol design and documentation is not our responsibility.

I see two options, which are not mutually exclusive: a protocol that is explicitly for executing build jobs, and a protocol/language that is expressive enough for build jobs and comes with semantics that enable emergent features.

At the risk of over-complicating things, it would be nice to have documentation on an abstract model of where the separation between the frontends and the nix-daemon is and how store paths are realized with substitutions or builds.

At the risk of over-complicating things, it would be nice to have documentation on an abstract model of where the separation between the frontends and the nix-daemon is and how store paths are realized with substitutions or builds.

If you didn’t see it already, you might like Greatly expand architecture section, including splitting into abstract vs concrete model by Ericson2314 · Pull Request #6877 · NixOS/nix · GitHub which is about an “abstract model” of the store layer.

A clarification I need.

Are we talking about the protocol or protocol + scheduling?

If scheduling is part of the concern, then vendoring some off-the-shelf scheduler solutions should be our first choice to evaluate.

I think it is equally bad for Nix adoption in institutional settings to perpetuate the “myth of the niche” (a covert form of NIH?). Interoperability is important and if we can at least shim in readily available schedulers, such as slurm / Apache YuniKorn * or others, that’s a valuable interface to produce.

* despite the marketing, it is my understanding that it’s not k8s-bound and can be shimmed onto other schedulers.

@blaggacao I think I agree, and I think @ehmry does too.

The Nix store layer is largely composed of:

  • Sandboxing, logic for an individual build
  • Scheduling
  • Storage
  • Networking / Exchanging data

For all 4 of these it should be easy to experiment with off-the-shelf replacements.

The better our layering is, the less NIH we get to be!

On a slight tangent: @Ericson2314 You keep repeating “layering”, which appears to have become idiomatic over the years, and it seems a more suitable term would be “well-defined architecture”. That is why I like both @ehmry’s proposal and your brief list here, as they allow thinking separately of principle and implementation. And as long as it’s not written out or drawn anywhere, people will have to synthesize these concepts in their heads from whatever the implementation happens to present itself as (and it may have a lot of arbitrary noise), which may or may not succeed.

Yeah, I have long been in the habit of saying “layering”. It is not the perfect word for it, but it does nicely evoke the idea of “peeling up a layer on top” and the layers below still making sense in isolation.

Properly defining one’s architecture is certainly good, but doesn’t require modularity, does it? A convoluted cyclic dependency thing could still be rigorously defined even if it isn’t modular.

I feel like the term “well defined architecture” encompasses more than just rigor, but also the cleanliness of the abstraction/interface. Could just be a personal quirk of mine, though.

Coming a bit late to the party, but I’m very interested in this (as part of this roadmap item in particular). The remote build system is in need of a breath of fresh air imho, and a new protocol could bring a lot (if only by switching to a pull model rather than a push one; please someone do that!).
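
As a rough illustration of what a pull model could look like (one possible reading, not a description of any existing Nix or REAPI mechanism): instead of the client picking a machine and pushing a build to it, workers ask a shared queue for work when they have capacity. The `BuildQueue` class and the `.drv` names below are hypothetical.

```python
from collections import deque
from typing import Deque, List


class BuildQueue:
    """Hypothetical shared queue that workers pull build jobs from."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()

    def submit(self, drv: str) -> None:
        # The client only enqueues work; it does not choose a worker.
        self._pending.append(drv)

    def pull(self, max_jobs: int) -> List[str]:
        # Called by a worker, which reports how much capacity it has right now.
        taken: List[str] = []
        while self._pending and len(taken) < max_jobs:
            taken.append(self._pending.popleft())
        return taken


queue = BuildQueue()
for drv in ["a.drv", "b.drv", "c.drv"]:
    queue.submit(drv)

big_worker_jobs = queue.pull(max_jobs=2)    # a worker with two free slots
small_worker_jobs = queue.pull(max_jobs=1)  # a worker with one free slot
idle_worker_jobs = queue.pull(max_jobs=4)   # nothing left to hand out
```

In this arrangement load balancing falls out of the workers' own requests, so the client needs no knowledge of how busy each machine is.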

For the protocol itself, I’d have a strong preference for REAPI, if only because it’s the closest thing to a standard for remote builds (bringing amongst other things a number of great server implementations and hosted services). And also because (based on some high-level discussions and readings, I could never find the time to look at it closely enough) it’s very nicely designed with performance and extensibility in mind, making it potentially a very good fit for Nix.

Regarding Syndicate, I’m curious how it behaves latency-wise. The “conversational” aspect makes it sound like the transactions could take a non-trivial amount of time (which is already an issue with the current protocol afaict). I wonder whether there’s some cleverness to keep things fast.

That depends on your definition of a trivial amount of time, but yes, it can take some ping-ponging to get stuff done. Triggering a build at a worker means creating a diff for the conversation state and submitting that to the dataspace server, which then sends a diff of what is relevant to the worker. A diff is not created for every update; changes are collected in the scope of a “turn” and then grouped together. That is all handled automatically, and I would say the tradeoff is that the behavior is more observable and predictable.
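
The turn-based batching described above can be modelled with a small toy. This is only a sketch of the idea that changes within a turn are collected and sent as one net diff; the `Turn` class and the tuple-shaped facts are hypothetical and do not reflect how Syndicate implements turns.

```python
from typing import FrozenSet, Hashable, Iterable, Set, Tuple

Fact = Tuple[Hashable, ...]


class Turn:
    """Toy turn: collects assertions and retractions, yields one net diff on commit."""

    def __init__(self, state: Iterable[Fact]) -> None:
        self._before: FrozenSet[Fact] = frozenset(state)
        self._after: Set[Fact] = set(self._before)

    def assert_(self, fact: Fact) -> None:
        self._after.add(fact)

    def retract(self, fact: Fact) -> None:
        self._after.discard(fact)

    def commit(self) -> Tuple[Set[Fact], Set[Fact]]:
        # Only the net change leaves the turn; intermediate updates that
        # cancel each other out never reach the dataspace server.
        added = self._after - self._before
        removed = set(self._before) - self._after
        return added, removed


state = {("worker-idle", "w1")}
turn = Turn(state)
turn.assert_(("build", "example.drv"))
turn.assert_(("scratch", "tmp"))
turn.retract(("scratch", "tmp"))      # asserted and retracted in the same turn
turn.retract(("worker-idle", "w1"))
added, removed = turn.commit()

# The server would then forward only the part a given worker subscribed to;
# here a worker interested in "build" facts is assumed.
relevant_to_worker = {fact for fact in added if fact[0] == "build"}
```

The batching trades a little latency per turn for fewer, more meaningful messages, which matches the "observable and predictable" tradeoff mentioned in the post.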

All things considered I think the next protocol iteration should be more conservative than Syndicate but it’s still useful for thinking about what features we are missing out on.

I’ve started working on this. No timeline for achieving utility. - Syndicated Nix Actor
