Colmena: Yet another NixOS deployment tool

zhaofengli · February 13, 2021, 1:25am

Colmena is a simple, stateless NixOS deployment tool modeled after NixOps and morph, designed from the ground up to support parallel deployments.

$ colmena apply --on @tag-a
[INFO ] Enumerating nodes...
[INFO ] Selected 7 out of 45 hosts.
  (...) ✅ 0s Successfully built
  sigma 🕗 7s copying path '/nix/store/h6qpk8rwm3dh3zsl1wlj1jharzf8aw9f-unit-haigha-agent.service' to 'ssh://root@sigma.redacted'...
  theta ✅ 7s Activation successful
  gamma 🕘 8s Starting...
  alpha ✅ 1s Activation successful
epsilon 🕗 7s copying path '/nix/store/fhh4rfixny8b21l6jqzk7nqwxva5k20h-nixos-system-epsilon-20.09pre-git' to 'ssh://root@epsilon.redacted'...
   beta 🕗 7s removing obsolete file /boot/kernels/z28ayg10kpnlrz0s2qrb9pzv82lc20s2-initrd-linux-5.4.89-initrd
  kappa ✅ 2s Activation successful

It’s compatible with existing configurations written for NixOps (none backend) or morph with minimal modification, and differs from them in the following areas:

Entirely stateless
Supports parallel deployment
Supports deploying to the local machine
Supports per-node overrides of Nixpkgs pinning
Supports node tagging
Has a little tool (colmena introspect) to help you extract information out of your configurations

Also, it’s written in Rust, and that alone counts as a feature for some people.

But why another deployment tool?

I’ve been using NixOps and morph for a while, and have found that neither of them fits my use cases very well. NixOps is stateful, making collaboration hard for projects that allow multiple people to deploy. Morph is stateless but lacks any form of parallelism, which makes it painful to use with a large number of hosts, and the problem does not appear to be easily fixable.

Furthermore, I’m used to a nixos-rebuild-style workflow when it comes to managing my desktop hosts, but neither tool provides an easy way to deploy to the local machine.

What’s the state of Colmena?

I started Colmena late last year and have been using it to manage 40+ hosts running NixOS. There isn’t a stable release yet, but it’s on the horizon, and I’ve been working to iron out some wrinkles in the tool. I made a post about Colmena in the NixOS subreddit a couple of months ago, and a lot has changed since then.

I would also like to thank the following people:

Developers of NixOps and morph, for inspiring me to create this in the first place
@aanderse and @CitadelCore (GitHub), for testing and feedback
@justinas (GitHub), for help with the implementation of deployment.keys support

Try it out, and tell me what you think!

nh2 · February 13, 2021, 2:45am

Cool!

Could you elaborate a bit? The most interesting questions for me are:

How is the parallelism achieved compared to NixOps? If I remember correctly, for large NixOps networks of e.g. 100 nodes, you have to do 100 machine evaluations, which takes a long time. Does Colmena do something specific about that, or just parallelise?
Anything that NixOps can do that Colmena currently can’t?

zhaofengli · February 13, 2021, 2:58am

For evaluation, Colmena evaluates nodes in chunks, and the number of nodes in a chunk is automatically determined based on available RAM on the host. Evaluating all the machines (45 for me) at once takes too much memory; on the other hand, evaluating each node on its own wastes memory and CPU time (I did some comparisons). Evaluating several nodes at once seems to be a good compromise.
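
The chunking idea can be illustrated with a rough sketch. This is not Colmena's actual code (Colmena is written in Rust). The function names, the per-node memory estimate of 768 MiB, and the 8 GiB of available RAM are all assumptions made up for the example.

```python
def chunk_size(available_ram_mib, per_node_mib):
    """Nodes per evaluation chunk, derived from available RAM.

    per_node_mib is a hypothetical estimate of evaluator memory per node.
    At least one node is always evaluated, even on a tiny host.
    """
    return max(1, available_ram_mib // per_node_mib)


def split_into_chunks(nodes, size):
    """Split the node list into consecutive chunks of at most `size` nodes."""
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


# Hypothetical numbers: 45 nodes, 8 GiB free, ~768 MiB per evaluated node.
nodes = [f"node-{i:02d}" for i in range(45)]
size = chunk_size(available_ram_mib=8192, per_node_mib=768)
batches = split_into_chunks(nodes, size)

for index, batch in enumerate(batches):
    # Each batch would be handed to one evaluator invocation.
    print(f"chunk {index}: {len(batch)} nodes")
```

With these made-up numbers, each chunk holds up to 10 nodes, so the 45 nodes are evaluated in 5 chunks rather than in 1 (too much memory) or 45 (too much repeated work).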

Unlike NixOps, Colmena only deals with existing NixOS hosts, akin to the SSH-only “none” backend in NixOps. It does not provision machines on cloud providers or manage their entire lifecycle for you, which will inevitably introduce state.

Sandro · February 13, 2021, 6:13am

The db always annoyed me about nixops. I don’t use any of its features and regularly just wipe it if any problem pops up and repopulate it. I wish you could just skip that part without yet another tool.

NobbZ · February 13, 2021, 7:24am

This sounds interesting!

How do you deal with secrets?

What happens with them after a reboot?

Someone told me that they are only in a tmpfs with nixops and would be lost on reboot of a machine and require a redeployment. Because of that I never actually tried nixops (or any of the other tools) myself…

zhaofengli · February 13, 2021, 7:38am

Like NixOps, Colmena uploads secrets to /run/keys by default, which is a tmpfs that will not survive reboots. However, you can set the destination directory (deployment.keys.<key>.destDir) to anywhere you like that is persistent.

ryantm · February 13, 2021, 7:58am

You can: https://www.ryantm.com/blog/nixops-without-sharing/

fzakaria · February 13, 2021, 3:50pm

The proliferation of the same tools in Nix is quickly becoming my #1 pain point.

It’s making the space very fractured and hard to use.

Why don’t people just contribute and fix a single tool?
Nix has made it too easy to write software! People are feeling more empowered to write from scratch rather than commit upstream.

JosW · February 13, 2021, 5:37pm

Although I applaud the effort the OP has gone through to get to this stage, I do agree.
All the different twists on NixOps make it hard to keep track. And the yet-to-be-released NixOps 2.0 could perhaps benefit far more from these efforts under its 2.0 umbrella.

nrj · February 13, 2021, 9:14pm

Early in the process when people are still experimenting with what works best it seems fine to have a bunch of different competitors. Likely some tools will win out over time but before then I wouldn’t want to stifle exploration.

volth · February 13, 2021, 11:48pm

I ended up with a similar technique, evaluating 40-50 system closures at once on a laptop with 64GB RAM.
It seems that the Nix evaluator might be optimized for this case (as the closures have a lot in common).

Mic92 · February 15, 2021, 7:18am

I just shifted evaluation from my laptop to my servers by using krops. Krops first uploads the configuration and secrets to the server and then performs the upgrade process there. This is faster because the upload bandwidth on my laptop is usually not that great. I have to admit that krops is not able to update 50 servers in parallel, but I am wondering why other tools that are built with scalability in mind do not follow the same strategy. Who wants to store packages for 50 servers on their laptop?

aanderse · February 15, 2021, 6:47pm

I have a high powered machine which I do deploys from. I deploy to low powered machines. Some machines I deploy to are utilizing enough resources that a nixos-rebuild can almost OOM them. A tool like nixops or colmena suits this purpose well.

Mic92 · February 15, 2021, 7:08pm

This can also be achieved with krops: I deploy to my Raspberry Pis by uploading the configuration from my laptop to my server and then remotely starting nixos-rebuild: https://github.com/Mic92/dotfiles/blob/fc47b85eba034ae540ef9997d20da59183138dff/nixos/krops.nix#L77 In this case, too, my laptop’s bandwidth is not the bottleneck, while I can still edit files locally.

volth · February 15, 2021, 7:58pm

I do that for about 200 servers (in batches of 40-50 so that nix-instantiate fits in available memory).
The packages are mostly the same; all the difference is in config.networking.* and config.services.*

So it takes ~1.1 times the size of a single closure in the Nix store (actually 2.2, because there are westmere and skylake builds). The Nix evaluator could do similar deduplication.
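
As a back-of-the-envelope check of those figures, here is a small sketch. The model is an assumption for illustration only: one shared closure plus a small host-specific fraction per server, multiplied by the number of CPU-specific build variants. The function name and the derived per-host fraction are hypothetical.

```python
def store_footprint(hosts, unique_fraction, arch_variants=1):
    """Total store usage, measured in units of one system closure.

    Assumed model: one shared closure, plus `unique_fraction` of a closure
    of host-specific paths per server, repeated for each build variant.
    """
    return arch_variants * (1.0 + hosts * unique_fraction)


# If 200 servers together add about 0.1 closure of host-specific paths,
# each server contributes roughly 0.05% of a closure.
unique = 0.1 / 200

single = store_footprint(200, unique)                   # one build variant
dual = store_footprint(200, unique, arch_variants=2)    # westmere + skylake

print(f"one variant:  {single:.2f} closures")
print(f"two variants: {dual:.2f} closures")
```

Under this model the store cost grows very slowly with the number of servers, which is why keeping 200 system closures on one machine is feasible.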

Mic92 · February 15, 2021, 8:01pm

You can deduplicate evaluation by passing nixpkgs explicitly to every server that you evaluate; then you only evaluate nixpkgs once. The same does not work for the module system, though.

volth · February 15, 2021, 8:12pm

Evaluating Nix on target servers is not an option at all:

There could be (and actually are) servers which are so low power that they cannot evaluate their own closures.
Even if a server has a few gigs of RAM, starting a process that would consume 100% CPU and 1 GB of RAM might affect stability. The oom-killer might kill something, etc.
It raises security questions, because a private nixpkgs has credentials for private git repositories, src attributes pointing to local dev versions, SSL certificates for ALL the domains, … and similar stuff that should not leak onto rented cloud workers.

aanderse · February 15, 2021, 10:17pm

Indeed. I have a very low powered linode cloud server which chokes if I do a nixos-rebuild on it during normal operation.

I’m glad there are a variety of tools to use in this ecosystem. Different tools for different workflows. Both nixops and colmena suit my needs well. Thanks for your hard work on this great tool @zhaofengli!

kvtb · February 16, 2021, 7:33pm

I have the same problem. In my case, there is not enough disk space on my laptop to support multiple systems using nixops. It seems all packages are downloaded even if 90% of those are not required to build the configuration.
So krops is currently the only solution to this problem?

Mic92 · February 17, 2021, 7:27am

A very simple alternative to that is to just git pull the configuration on your servers and do a nixos-rebuild switch: https://github.com/Mic92/doctor-cluster-config/blob/89e892d5f5fae000bc02bc758e0ca290e3861e53/update-all.sh

With flakes you don’t need to manage nix channels. Servers can be pre-built in CI: https://drone.thalheim.io/Mic92/doctor-cluster-config/18