Introducing bento, a NixOS deployment framework

Hi! :wave:t3:

I wrote https://github.com/rapenne-s/bento/ because I was dissatisfied with the various NixOS deployment tools around. They are way too strict, and either require flakes or are not compatible with them.

Bento aims at managing a large number of NixOS systems in the wild, so it has been designed with robustness in mind: it works for hosts behind firewalls, configurations can be built on the central management server and served to the clients when it is used as a substituter, and it is secure because each host can only access its own configuration files. All of this without sacrificing anything for the system administrators, who keep everything in one repository.
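
To illustrate the per-host isolation, here is a minimal sketch of how it can be done with plain OpenSSH, using one chrooted sftp-only account per managed host (the user and path names are made up, and this is not necessarily how bento implements it):

    # /etc/ssh/sshd_config on a hypothetical central server
    Match User kikimora
        ChrootDirectory /var/lib/bento/kikimora
        ForceCommand internal-sftp
        AllowTcpForwarding no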

I created a screencast to show the workflow to add a new host: How to add a new NixOS system to Bento deployment - asciinema.org

You can find the rationale behind Bento on my blog Solene'%

update 2022-09-09: bento 1.0.0 released! It’s now a single script

update: you can track the status of remote systems
https://asciinema.org/a/520504

update: you can track the version of the remote systems against what you have locally (thanks to reproducibility!)

   machine   local version   remote version              state                                     time
   -------       ---------      -----------      -------------                                     ----
  kikimora        996vw3r6      996vw3r6 💚    sync pending 🚩       (build 5m 53s) (new config 2m 48s)
       nas        r7ips2c6      lvbajpc5 🛑 rebuild pending 🚩       (build 5m 49s) (new config 1m 45s)
      t470        b2ovrtjy      ih7vxijm 🛑      rollbacked 🔃                           (build 2m 24s)
        x1        fcz1s2yp      fcz1s2yp 💚      up to date 💚                           (build 2m 37s)
17 Likes

This is a very interesting & integrating approach.

However, I think a self-updating fleet can’t live in production without a scheduling strategy that reacts to both liveness and readiness checks.

Or put differently, choreography unfortunately won’t do, it needs orchestration.

2 Likes

What do you mean by “liveness and readiness checks”?

1 Like

Readiness checks confirm whether a service became ready to serve its purpose after a start, while liveness checks confirm that a ready service remains alive and able to do its job.

If such checks fail a couple of times within a time window, the service is considered to be in a failed state and is restarted.
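
As a concrete (if simplistic) sketch: assuming a service that exposes an HTTP /health endpoint on port 8080 and a unit called myservice.service (both made up for the example), such a combined check could look like this:

    # probe every 10 seconds; three consecutive failures => consider it dead and restart it
    fails=0
    while sleep 10; do
        if curl -fsS --max-time 2 http://127.0.0.1:8080/health > /dev/null; then
            fails=0                # the service answered: it is alive and ready
        else
            fails=$((fails + 1))   # one more failed probe inside the window
        fi
        if [ "$fails" -ge 3 ]; then
            systemctl restart myservice.service
            fails=0
        fi
    done

systemd can express much of this natively with Restart=, WatchdogSec= and the StartLimit* settings, which is probably where NixOS-managed services would plug in.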

5 Likes

Thank you for your answer. It seems I could improve bento by receiving feedback after upgrades, so I know the current state of each client.

Unlike servers, they are not always connected, so it seems more complicated to follow all of them in “real time”.

I’m currently adding feedback: after an upgrade, a log is sent to tell whether it was successful or not. This is a first step towards knowing what’s happening on each remote host.
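
A minimal sketch of what that feedback could look like, assuming the client is also allowed to write back into its own directory on the central sftp server (host and path names are invented):

    # run the upgrade and remember whether it worked
    if nixos-rebuild switch; then
        result=success
    else
        result=failure
    fi
    echo "$(date -u +%FT%TZ) $(hostname) $result" > /tmp/bento_last_update
    # push the one-line report back to the central server over sftp
    echo "put /tmp/bento_last_update logs/$(hostname).log" | sftp -b - bento@central.example.org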

Thanks Solene!

For my growing fleet of notebooks at home, this approach looks like a nice relatively frictionless way forward.

For my use case, I suspect I could be making changes during a weekend that would propagate, through the polling, to notebooks/desktops the family might be using at that very moment. I guess that means there is a need for some client-side interaction to indicate that something is happening when an update is about to go ahead. Maybe not a must-have feature, but something that did cross my mind when looking at what you have done.

Nice job. Looking at the other comments, I guess you have a bit to think about regarding which use cases you want to cover for the tool/approach as you iterate on it.

2 Likes

A systray agent is already on the todo list; it could be used just to notify that a reboot is required (if you use nixos-rebuild boot instead of switch) or to ask users whether they want to update. :+1:t3:
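
For the “reboot required” part, a common NixOS idiom is to compare the generation the machine is running with the generation the system profile points at; a hypothetical agent could do something like this (notify-send stands in for whatever the systray agent would actually use):

    # after "nixos-rebuild boot" the system profile already points at the new
    # generation, but the machine keeps running the old one until the next reboot
    pending=$(readlink -f /nix/var/nix/profiles/system)
    running=$(readlink -f /run/current-system)
    if [ "$pending" != "$running" ]; then
        notify-send "bento" "A reboot is required to finish the last update"
    fi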

1 Like

Thanks @NobbZ for chiming in!

Indeed!

Hence, let’s think about readiness and liveness as two states that we need to assert.

In any case, since we’re observing systemd, we probably want to assess these states as a function of what systemd has to tell us.

Furthermore, we want to make a summary statement about a critical set of services (that we call “OS”) rather than about any specific service.

Although, if that OS hosts services for us, then we also want to assert that state for those services, so the semantics may expand. If I were running services reliably on a set of hosts, though, I wouldn’t try to do that with systemd, but rather with some sort of data-center scheduler. So we probably don’t need to conflate OS liveness & readiness (“the services required for the OS to function as desired”) with workload readiness (“the things I need uptime for”) at this point.
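
In shell terms, a first approximation of such an OS-level summary, using only what systemd already exposes, might be:

    # "running" means no units failed, "degraded" means at least one did
    systemctl is-system-running --wait
    # and the list of units that would turn the summary red
    systemctl --failed --no-legend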

For the scheduling strategy, n+1 would probably be ok-ish, but that n+1 might need compounding over a host class. Imagine you require n=3 “blue-class” hosts to satisfy your workload: then you’d have to run n+1 and can only ever cycle the one spare host until it is ready and live again.

But here comes the additional twist: when we have production workloads scheduled via a suitable distributed scheduler, we can only continue cycling if that external scheduler gives us the liveness and readiness green light for the relevant production services.

So strictly speaking, for an automated fleet update not to take down your system at uncontrolled points in time, it would additionally need a foreign scheduler interface that listens for a summary green light from that scheduler before cycling the next host of a given host class.

Not trivial. Especially not in a stateless manner (“choreography”). Such state would have to span the fleet, so we end up with distributed state. Enter etcd / consul? :slight_smile:
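
To make the orchestration idea concrete, here is a rough sketch for one host class. scheduler_green and cycle_host are placeholder functions for “ask the external workload scheduler whether the class is healthy” and “trigger the upgrade of one host”, and the fleet-wide lock (etcd/consul) is left out:

    # cycle the "blue" class one host at a time, only while the workload is healthy
    for host in blue-01 blue-02 blue-03 blue-04; do
        until scheduler_green blue; do sleep 30; done      # external green light
        cycle_host "$host"                                 # e.g. let the host pull its new config
        # wait until the freshly upgraded host reports itself live and ready again
        until ssh "$host" systemctl is-system-running --quiet; do sleep 10; done
    done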

This is really a bit of a hard nut to crack, especially since the entire NixOS ecosystem somehow tacitly acts as if “downtime is cool” in its implicit deployment and operating model. It’s a desktop OS, not a scheduler OS, I get it. But at the same time “Nix for Work” wants to be a thing. This is the big quest of our times for this community, imho.

2 Likes

I understand what you mean; it totally makes sense for servers.

However, I mostly target workstations: they just work without depending on other systems, so I feel it’s less of an issue if they don’t all update at the same time. They are all independent.

1 Like

I don’t know, but maybe it makes sense to start employing these clearly distinct semantics:

  • NixOS for Workstations
  • NixOS for Servers

Maybe even as a badge. I feel that the general conversation throughout the Nix Ecosystem (not this particular one), lacks clarity and category.

This would be a case for the Doc Team to evaluate / consider in the ongoing Nomenclature effort, though.

4 Likes

What’s the difference between NixOS for workstations and NixOS for servers? It’s just NixOS in the end :thinking:

Except that if your servers are interconnected, lazy updates are not suitable.

Ha, glad you asked! I just felt the need to embed this in a bigger context w.r.t. ecosystem semantics: NixOS for Workstations vs NixOS for Servers

Since I’m looking at Nix 90% through the “for Business” lens, I’ve somehow felt that mismatch between the idealized and the actual usage scenarios tacitly happening in the past, so I thought this may merit a different framing.

This was just a first take, but I hope that framing goes somewhat in the right direction and leads us to a marginal ecosystem improvement, if the creator wills it.

1 Like

This makes me realize it’s probably why I named bento “a deployment framework” in the thread title: it’s something you can build upon.

I wouldn’t expect any business to use it as-is :scream_cat: :scream_cat: but rather as a foundation to build something matching their requirements. They could throw away the code entirely and just keep the idea, if that is enough for them :smiley:

I added a way to track the state of remote systems
https://asciinema.org/a/519060

Bento currently relies on sending configuration files through sftp and running nixos-rebuild, with or without flakes; this involves a lot of conditionals, and I’d like to make things simpler.

In the current state, it’s also not possible to use a single flake to manage all hosts. This is going to change.

I found a way to transfer a NixOS configuration as a single derivation file. It will be transmitted over sftp, so I’ll be able to tell whether a client is running the same derivation as the one currently on the sftp server, which solves a lot of problems. I’ll also be able to get rid of nixos-rebuild, and the client won’t even have to know whether it’s using flakes or not.

Create a derivation file for the system, using flakes (I still need to figure out how to do this without flakes):

DRV=$(nix path-info --json --derivation .#nixosConfigurations.bento-machine.config.system.build.toplevel | jq '.[].path' | tr -d '"')

Make the result of $DRV available to the remote machine:
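
One possible way to do that (only a sketch; bento may end up doing something different) is to export the closure of the derivation into a single file, drop it into the host’s sftp directory, and import it on the client:

    # on the build server: serialise the derivation and everything it references
    nix-store --export $(nix-store --query --requisites "$DRV") > bento-machine.drvs
    # copy bento-machine.drvs into the host's sftp directory, then on the client:
    nix-store --import < bento-machine.drvs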

nix-build $DRV -A system   (or nix build $DRV)
sudo result/bin/switch-to-configuration switch (or boot)

edit: getting the derivation path without flakes:

nix-instantiate '<nixpkgs/nixos>' -A config.system.build.toplevel -I nixos-config=./configuration.nix

Now I just need to ensure the result only contains what’s required for this host.

Related: NixOS: switch-to-configuration script does not correctly add a boot entry when executed standalone · Issue #82851 · NixOS/nixpkgs · GitHub

1 Like

I may have been a bit too enthusiastic, because it doesn’t seem to do what I thought :sweat_smile:

That’s potentially a lot of headache avoided, thank you very much :star_struck:

Now featuring time since the last update and, if not up to date, time since the new configuration has been available.

1 Like

I’ve been hitting issues with nixos-rebuild; it’s interesting because the command doesn’t correctly report that it’s failing.

Bento now reports issues like running out of disk space, but ultimately I need it to report the current version of the system, to compare with what we have locally :+1:t3:
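
Since the builds are reproducible, one straightforward check (a sketch of the idea, not necessarily how bento will do it) is to ignore nixos-rebuild’s exit status entirely and compare the store path the machine is actually running with the toplevel built locally:

    expected="$1"                                # store path of the locally built toplevel
    running=$(readlink -f /run/current-system)   # what the machine actually runs right now
    if [ "$running" = "$expected" ]; then
        echo "up to date"
    else
        echo "rebuild pending or failed"
    fi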

https://github.com/NixOS/nixpkgs/issues/189966

1 Like