Npcnix - "pull-based" provisioning and configuration of NixOS systems

I finally wrapped up the approach I’m using into something coherent on GitHub.

To sum up: when managing my infra in AWS, I have a Git repository that contains both the Terraform configuration and a NixOS flake defining configurations for all the hosts I might need.
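The exact repository layout is up to you; as a minimal sketch (with hypothetical host names and module paths, not taken from the project), the per-host configurations are just ordinary nixosConfigurations flake outputs:

```nix
{
  # Hypothetical example; pin whatever nixpkgs branch you actually use.
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.05";

  outputs = { self, nixpkgs }: {
    nixosConfigurations = {
      # One attribute per host; each host later follows the published copy
      # of this flake and rebuilds itself from its own configuration.
      web-1 = nixpkgs.lib.nixosSystem {
        system = "x86_64-linux";
        modules = [ ./hosts/web-1.nix ./modules/common.nix ];
      };
      worker-1 = nixpkgs.lib.nixosSystem {
        system = "x86_64-linux";
        modules = [ ./hosts/worker-1.nix ./modules/common.nix ];
      };
    };
  };
}
```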

By running terraform apply I both create new infra (EC2 machines) and provision it: each machine’s configuration is passed to it via an S3 bucket. New hosts bootstrap automatically via a user-data init script; afterwards a system daemon monitors the desired configuration and rebuilds the host whenever it has been updated.

This has many operational benefits:

  • It integrates seamlessly with the Terraform workflow, so the desired NixOS configuration and the configuration actually deployed are handled exactly like any other Terraform state.
  • It avoids SSHing into machines to update them (which is sometimes undesirable due to network restrictions, or requires another set of privileges to take care of) - and I don’t have to touch my YubiKey (hardware SSH key) many times just to update a handful of machines at once.

I’ve been thinking about pull-based deployments lately and am very interested in this. Will this tool, perhaps using the image format, allow applying a configuration without local evaluation? Pre-building has multiple options, but evaluation still takes CPU time and, more importantly, memory. I’m interested in support for hosts with less memory than my flake can be evaluated on.

There are multiple ways to approach it.

I’m typically using it with a binary cache, so no building is done on the target host anyway. I push my code to CI, which builds and publishes all artifacts to Cachix and verifies that everything at least compiles; then I deploy to the machines.
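On the consuming side, the target hosts only need the cache configured as a substituter, so a rebuild just downloads pre-built store paths. A sketch with a placeholder cache name and key (only the cache.nixos.org entries below are real):

```nix
{
  nix.settings = {
    substituters = [
      "https://cache.nixos.org"
      "https://example.cachix.org"            # placeholder: your Cachix cache
    ];
    trusted-public-keys = [
      "cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
      "example.cachix.org-1:REPLACE_WITH_YOUR_CACHE_PUBLIC_KEY"   # placeholder
    ];
  };
}
```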

Another approach is remote builders (dedicate one host to serve as a build machine for the other hosts, etc.).
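A sketch of what that could look like on the small hosts, using the standard NixOS distributed-builds options (the builder address, SSH user and key path are placeholders):

```nix
{
  nix.distributedBuilds = true;
  nix.buildMachines = [{
    hostName = "builder.internal";       # placeholder: your dedicated build host
    system = "x86_64-linux";
    sshUser = "nixbuilder";
    sshKey = "/etc/nix/builder_ed25519";
    maxJobs = 4;
    supportedFeatures = [ "big-parallel" ];
  }];
  # Let the builder fetch dependencies from substituters itself
  # instead of copying them over SSH from the small host.
  nix.extraOptions = "builders-use-substitutes = true";
}
```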

Also, if you start from an already pre-baked AMI, you can avoid worrying about the bootstrap.

The bootstrap script will set up a temporary 2 GB swap: https://github.com/rustshop/npcnix/blob/295295e1b6309ba24b39412000f0a95954731281/flake.nix#L97 , and the initial configuration can then include a swap file as well (which is a good idea to have anyway on a machine low on memory, as it allows the kernel to free up some of the cold in-memory data).
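For the persistent part, a small sketch of declaring such a swap file in the host’s own configuration (path and size are just examples; NixOS creates the file if it doesn’t exist):

```nix
{
  swapDevices = [{
    device = "/var/swapfile";
    size = 2048;   # in MiB, i.e. 2 GiB
  }];
}
```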

My personal deployments are usually t3.nano EC2 instances, so very tight on resources, and it all works fine for me, despite lots of Rust code being used which needs a lot of resources to compile.

Nice project!

First, prepare a location that can store and serve files (an npcnix store) - e.g. an S3 bucket (or a prefix inside it).

Within it, publish compressed flakes under certain addresses (npcnix remotes) - e.g. keys in an S3 bucket.

I’m a bit confused by that part; it’s a bit sad to lose the direct source repository commit => machine closure mapping.

Why do you need this specific store? Can’t you fetch the NixOS closure definitions from the source git repository itself? I feel like I’m missing some context that led you in this direction.

It might be doable, though not exactly my use case.

First: what exactly would you point npcnix at as a reference to follow? A Git branch? This might work, but it would be largely inconvenient, IMO. In any larger infra it is useful to have some temporary control over deployment ordering - so you can update a set of hosts first, see if everything looks OK, and only then deploy all the others. Or deploy in some specific order for some reason, etc.

There’s also the problem of the notification system. Right now npcnix uses HTTP ETags and rather slow polling (avg. 1 minute) to detect when the reference (aka remote) has changed. This makes it very lightweight. With Git it would have to be a git fetch over a stateful repo or a shallow git clone (--depth 1). Both are much heavier.

If your Nix configuration is part of a much larger repository (which is my use case), packing only the needed Nix files also keeps the size of a remote much smaller.

My use case is: a bigger repository for all cloud management (Terraform, utilities for building AMIs, containers, k8s definitions, etc.), a small-to-medium-sized fleet of AWS hosts (let’s say 30-100), etc.

The model closely follows the Terraform model, where modifying the source code and deploying from source to become the “cloud state” are two different steps.
