Introducing bento, a NixOS deployment framework

Hi! :wave:t3:

I wrote https://github.com/rapenne-s/bento/ because I was dissatisfied with the various NixOS deployment tools around. They are way too strict, and either require flakes or are not compatible with them.

Bento aims at managing a large number of NixOS systems in the wild, so it has been designed with robustness in mind: it works for hosts behind firewalls, configurations can be built on the central management server and served to the clients when it is used as a substituter, and it is secure because each host can only access its own configuration files. All of this without sacrificing anything for the system administrators, who keep everything in one repository.
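
To illustrate the per-host isolation, here is a minimal sketch of how it can be done with plain OpenSSH, using one chrooted sftp-only account per managed host (the user and path names are made up, and this is not necessarily how bento implements it):

    # /etc/ssh/sshd_config on a hypothetical central server
    Match User kikimora
        ChrootDirectory /var/lib/bento/kikimora
        ForceCommand internal-sftp
        AllowTcpForwarding no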

I created a screencast to show the workflow to add a new host: How to add a new NixOS system to Bento deployment - asciinema.org

You can find the rationale behind Bento on my blog Solene'%

update 2022-09-09: bento 1.0.0 released! It’s now a single script

update: you can track the status of remote systems
https://asciinema.org/a/520504

update: you can track the version of the remote systems against what you have locally (thanks to reproducibility!)

   machine   local version   remote version              state                                     time
   -------       ---------      -----------      -------------                                     ----
  kikimora        996vw3r6      996vw3r6 💚    sync pending 🚩       (build 5m 53s) (new config 2m 48s)
       nas        r7ips2c6      lvbajpc5 🛑 rebuild pending 🚩       (build 5m 49s) (new config 1m 45s)
      t470        b2ovrtjy      ih7vxijm 🛑      rollbacked 🔃                           (build 2m 24s)
        x1        fcz1s2yp      fcz1s2yp 💚      up to date 💚                           (build 2m 37s)
17 Likes

This is a very interesting & integrating approach.

However, I think a self-updating fleet can’t live in production without a scheduling strategy that reacts to both liveness and readiness checks.

Or put differently, choreography unfortunately won’t do, it needs orchestration.

2 Likes

What do you mean by “liveness and readiness checks”?

1 Like

Readiness checks confirm whether a service became ready to serve its purpose after a start, while liveness checks confirm that a ready service remains alive and able to do its job.

If such checks fail a couple of times within a time window, the service is considered to be in a failed state and is restarted.
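
As a concrete (if simplistic) sketch: assuming a service that exposes an HTTP /health endpoint on port 8080 and a unit called myservice.service (both made up for the example), such a combined check could look like this:

    # probe every 10 seconds; three consecutive failures => consider it dead and restart it
    fails=0
    while sleep 10; do
        if curl -fsS --max-time 2 http://127.0.0.1:8080/health > /dev/null; then
            fails=0                # the service answered: it is alive and ready
        else
            fails=$((fails + 1))   # one more failed probe inside the window
        fi
        if [ "$fails" -ge 3 ]; then
            systemctl restart myservice.service
            fails=0
        fi
    done

systemd can express much of this natively with Restart=, WatchdogSec= and the StartLimit* settings, which is probably where NixOS-managed services would plug in.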

5 Likes

Thank you for your answer. It seems I could improve bento by receiving feedback after upgrades, so I know the current state of each client.

Unlike servers, they are not always connected, so it seems more complicated to follow all of them in “real time”.

I’m currently adding feedback: after an upgrade, a log is sent to tell whether it was successful or not. This is a first step towards knowing what’s happening on each remote host.
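
A minimal sketch of what that feedback could look like, assuming the client is also allowed to write back into its own directory on the central sftp server (host and path names are invented):

    # run the upgrade and remember whether it worked
    if nixos-rebuild switch; then
        result=success
    else
        result=failure
    fi
    echo "$(date -u +%FT%TZ) $(hostname) $result" > /tmp/bento_last_update
    # push the one-line report back to the central server over sftp
    echo "put /tmp/bento_last_update logs/$(hostname).log" | sftp -b - bento@central.example.org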

Thanks Solene!

For my growing fleet of notebooks at home, this approach looks like a nice relatively frictionless way forward.

For my use case, I suspect I could be making changes during a weekend that would propagate, through the polling, to notebooks/desktops the family might be using at that very moment. I guess that means there is a need for some client-side interaction to indicate that something is happening when an update is about to go ahead. Maybe not a must-have feature, but something that did cross my mind when looking at what you have done.

Nice job. Looking at the other comments, I guess you have a bit to think about regarding which use cases you want to cover for the tool/approach as you iterate on it.

2 Likes

A systray agent is already on the todo list; it could be used just to notify that a reboot is required (if you use nixos-rebuild boot instead of switch) or to ask users whether they want to update. :+1:t3:
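
For the “reboot required” part, a common NixOS idiom is to compare the generation the machine is running with the generation the system profile points at; a hypothetical agent could do something like this (notify-send stands in for whatever the systray agent would actually use):

    # after "nixos-rebuild boot" the system profile already points at the new
    # generation, but the machine keeps running the old one until the next reboot
    pending=$(readlink -f /nix/var/nix/profiles/system)
    running=$(readlink -f /run/current-system)
    if [ "$pending" != "$running" ]; then
        notify-send "bento" "A reboot is required to finish the last update"
    fi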

1 Like

Thanks @NobbZ for chiming in!

Indeed!

Hence, let’s think about readiness and liveness as two states that we need to assert.

In any case, since we’re observing systemd, we probably want to assess these states as a function of what systemd has to tell us.

Furthermore, we want to make a summary statement about a critical set of services (that we call “OS”) rather than about any specific service.

Although, if that OS hosts services for us, then we also want to assert that state for those services, so the semantics may expand. If I were running services reliably on a set of hosts, though, I wouldn’t try to do that with systemd, but rather with some sort of data-center scheduler. So we probably don’t need to conflate OS liveness & readiness (“the services required for the OS to function as desired”) with workload readiness (“the things I need uptime for”) at this point.
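
In shell terms, a first approximation of such an OS-level summary, using only what systemd already exposes, might be:

    # "running" means no units failed, "degraded" means at least one did
    systemctl is-system-running --wait
    # and the list of units that would turn the summary red
    systemctl --failed --no-legend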

For the scheduling strategy, n+1 would probably be ok-ish, but that n+1 might need compounding over a host class. Imagine you require n=3 “blue-class” hosts to satisfy your workload: then you’d have to run n+1 and can only ever cycle the one spare host until it is ready and live again.

But here comes the additional twist: when we have production workloads scheduled via a suitable distributed scheduler, we can only continue cycling if that external scheduler gives us the liveness and readiness green light for the relevant production services.

So strictly speaking, for an automated fleet update not to take down your system at uncontrolled points in time, it would additionally need a foreign scheduler interface that listens for a summary green light from that scheduler before cycling the next host of a given host class.

Not trivial. Especially not in a stateless manner (“choreography”). Such state would have to span the fleet, so we end up with distributed state. Enter etcd / consul? :slight_smile:
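
To make the orchestration idea concrete, here is a rough sketch for one host class. scheduler_green and cycle_host are placeholder functions for “ask the external workload scheduler whether the class is healthy” and “trigger the upgrade of one host”, and the fleet-wide lock (etcd/consul) is left out:

    # cycle the "blue" class one host at a time, only while the workload is healthy
    for host in blue-01 blue-02 blue-03 blue-04; do
        until scheduler_green blue; do sleep 30; done      # external green light
        cycle_host "$host"                                 # e.g. let the host pull its new config
        # wait until the freshly upgraded host reports itself live and ready again
        until ssh "$host" systemctl is-system-running --quiet; do sleep 10; done
    done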

This is really a bit of a hard nut to crack, especially since the entire NixOS ecosystem somehow tacitly acts as if “downtime is cool” in its implicit deployment and operating model. It’s a desktop OS, not a scheduler OS, I get it. But at the same time “Nix for Work” wants to be a thing. This is the big quest of our times for this community, imho.

2 Likes

I understand what you mean; it totally makes sense for servers.

However, I mostly target workstations: they just work without depending on other systems, so I feel it’s less of an issue if they don’t all update at the same time. They are all independent.

1 Like

I don’t know, but maybe it makes sense to start employing these clearly distinct semantics:

  • NixOS for Workstations
  • NixOS for Servers

Maybe even as a badge. I feel that the general conversation throughout the Nix Ecosystem (not this particular one), lacks clarity and category.

This would be a case for the Doc Team to evaluate / consider in the ongoing Nomenclature effort, though.

4 Likes

What’s the difference between NixOS for workstations and NixOS for servers? It’s just NixOS in the end :thinking:

Except that if your servers are interconnected, lazy updates are not suitable.

Ha, glad you asked! I just felt the need to embed this in a bigger context w.r.t. ecosystem semantics: NixOS for Workstations vs NixOS for Servers

Since I’m looking at Nix 90% through the “for Business” lens, I’ve somehow felt that mismatch between the idealized and the actual usage scenarios tacitly happening in the past, so I thought this may merit a different framing.

This was just a first take, but I hope that framing goes somewhat in the right direction and leads us to a marginal ecosystem improvement, if the creator wills it.

1 Like

This makes me realize it’s probably why I named bento “a deployment framework” in the thread title: it’s something you can build upon.

I wouldn’t expect any business to use it as-is :scream_cat: :scream_cat: but rather as a foundation to build something matching their requirements. They could throw away the code entirely and just keep the idea, if that is enough for them :smiley:

I added a way to track the state of remote systems
https://asciinema.org/a/519060

Bento currently relies on sending configuration files through sftp and running nixos-rebuild, with or without flakes; this involves a lot of conditionals, and I’d like to make things simpler.

In the current state, it’s also not possible to use a single flake to manage all hosts. This is going to change.

I found a way to transfer a NixOS configuration as a single derivation file. It will be transmitted over sftp, so I’ll be able to tell whether a client is running the same derivation as the one currently on the sftp server, which solves a lot of problems. I’ll also be able to get rid of nixos-rebuild, and the client won’t even have to know whether it’s using flakes or not.

Create a derivation file for the system, using flakes (I still need to figure out how to do this without flakes):

DRV=$(nix path-info --json --derivation .#nixosConfigurations.bento-machine.config.system.build.toplevel | jq '.[].path' | tr -d '"')

Make the result of $DRV available to the remote machine:
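
One possible way to do that (only a sketch; bento may end up doing something different) is to export the closure of the derivation into a single file, drop it into the host’s sftp directory, and import it on the client:

    # on the build server: serialise the derivation and everything it references
    nix-store --export $(nix-store --query --requisites "$DRV") > bento-machine.drvs
    # copy bento-machine.drvs into the host's sftp directory, then on the client:
    nix-store --import < bento-machine.drvs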

nix-build $DRV -A system   (or nix build $DRV)
sudo result/bin/switch-to-configuration switch (or boot)

edit: getting the derivation path without flakes:

nix-instantiate '<nixpkgs/nixos>' -A config.system.build.toplevel -I nixos-config=./configuration.nix

Now I just need to ensure the result only contains what’s required for this host.

Related: NixOS: switch-to-configuration script does not correctly add a boot entry when executed standalone · Issue #82851 · NixOS/nixpkgs · GitHub

1 Like

I may have been a bit too enthusiastic, because it doesn’t seem to do what I thought :sweat_smile:

That’s potentially a lot of headache avoided, thank you very much :star_struck:

Now featuring time since the last update and, if not up to date, time since the new configuration has been available.

1 Like

I’ve been hitting issues with nixos-rebuild; it’s interesting because the command doesn’t correctly report that it’s failing.

Bento now reports issues like running out of disk space, but ultimately I need it to report the current version of the system, to compare with what we have locally :+1:t3:
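
Since the builds are reproducible, one straightforward check (a sketch of the idea, not necessarily how bento will do it) is to ignore nixos-rebuild’s exit status entirely and compare the store path the machine is actually running with the toplevel built locally:

    expected="$1"                                # store path of the locally built toplevel
    running=$(readlink -f /run/current-system)   # what the machine actually runs right now
    if [ "$running" = "$expected" ]; then
        echo "up to date"
    else
        echo "rebuild pending or failed"
    fi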

https://github.com/NixOS/nixpkgs/issues/189966

1 Like