Nixos-rebuild-ng: a nixos-rebuild rewrite

nixos-rebuild-ng: init by thiagokokada · Pull Request #354029 · NixOS/nixpkgs · GitHub.

Hello folks :wave:.

I am here to collect early feedback about the PR above, a work-in-progress rewrite of nixos-rebuild in Python. The objective is for it to eventually become a drop-in replacement, but first I want to know whether people think this is a good idea before I invest more time in its development.

Below is a copy of the PR description, including the technical reasons why I am using Python. By the way, to be clear, if there are people who want this in another language (e.g.: Rust looks like another good option), I am all for it and could close this PR, but I don’t have enough familiarity with the language to help there.


Opening a PR to collect early feedback before I sink more time in this. Keep in mind that while code reviews are welcome, this is not the focus for this PR yet.

The current state of nixos-rebuild is dire: it is one of the most critical pieces of code we have in NixOS, but it has tons of issues:

  • The code is written in Bash, and while this by itself is not necessarily bad, it means that refactoring is difficult due to the lack of tooling for the language
  • The code itself is a hacky mess. Changing even one line of code can cause issues that affect dozens of people
  • Lack of proper testing (we do have some integration tests, but no unit tests and coverage is probably pitiful)
  • The code predates some of the improvements Nix has had over the years, e.g.: it builds Flakes inside a temporary directory and reads the resulting symlink, since the code seems to predate the --print-out-paths flag
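
To make the last point concrete, here is a minimal sketch of reading the store path directly from `nix build --print-out-paths` instead of going through a result symlink in a temporary directory. The helper names (`build_flake_cmd`, `parse_out_path`, `build_flake`) are made up for illustration and are not taken from the PR; the attribute path assumes the usual `nixosConfigurations.<hostname>` flake layout.

```python
import subprocess


def build_flake_cmd(flake_ref: str, hostname: str) -> list[str]:
    """Assemble a `nix build` call that prints the store path on stdout."""
    attr = f'{flake_ref}#nixosConfigurations."{hostname}".config.system.build.toplevel'
    # --no-link: do not create a ./result symlink at all.
    return ["nix", "build", "--no-link", "--print-out-paths", attr]


def parse_out_path(stdout: str) -> str:
    """Return the single store path printed by --print-out-paths."""
    paths = [line.strip() for line in stdout.splitlines() if line.strip()]
    if len(paths) != 1:
        raise ValueError(f"expected exactly one output path, got {len(paths)}")
    return paths[0]


def build_flake(flake_ref: str, hostname: str) -> str:
    """Build the system and return its store path (needs Nix with flakes enabled)."""
    result = subprocess.run(
        build_flake_cmd(flake_ref, hostname),
        check=True,              # raise CalledProcessError if the build fails
        stdout=subprocess.PIPE,  # capture only stdout; build logs stay on stderr
        text=True,
    )
    return parse_out_path(result.stdout)
```

Splitting command construction and output parsing from the actual `subprocess.run` call keeps most of the logic unit-testable without a working Nix installation.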

Given all of the above, improvements to nixos-rebuild are difficult to make. A full rewrite is probably the easiest way to improve the situation, since it can be done in a separate package that will not break anything for anyone. So this is an attempt at that rewrite.

The language of choice here is Python. I mostly chose Python since it is the language I am most comfortable with, and I am open to other options. Still, I think Python is a good choice because:

  • It is the language of choice for many critical things inside nixpkgs, like the NixOSTestVM and systemd-boot activation scripts
  • It is a language with great tooling, e.g.: mypy for type checking, ruff for linting, pytest for unit testing
  • It is a scripting language that fits well with the scope of this project
  • Python’s standard library is great, which means we will probably need zero external dependencies for this project. For example, nixos-rebuild currently depends on jq for JSON parsing, while Python has json in the standard library
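
As a small illustration of the last point, the sketch below pulls one field out of a JSON document using only the standard library, roughly what a `jq -r .url` call would do in a shell script. The sample document is a hypothetical, trimmed-down stand-in in the style of `nix flake metadata --json` output, and the function name is invented for this example.

```python
import json


def flake_url_from_metadata(metadata_json: str) -> str:
    """Extract the "url" field, similar in spirit to `jq -r .url`."""
    metadata = json.loads(metadata_json)
    return metadata["url"]


# Hypothetical sample input; real metadata contains many more fields.
sample = '{"url": "git+file:///etc/nixos", "locked": {"type": "git"}}'
print(flake_url_from_metadata(sample))
```

A missing key raises `KeyError` and malformed input raises `json.JSONDecodeError`, so failures surface as exceptions instead of an empty string silently flowing through the rest of the script.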

I am aware of the current switch-to-configuration-ng rewrite, however I am not sure what the scope of that project is vs nixos-rebuild. If the idea is just a drop-in replacement, then both of those rewrites are orthogonal, since nixos-rebuild also includes some extra logic for e.g.: profile management. If the idea is to migrate more and more logic to switch-to-configuration-ng and eventually drop nixos-rebuild, I am happy to close this PR.

17 Likes

Haven’t read your PR carefully, but there’s an attempt to add a new tool as a middle ground between old nixos-rebuild and switch-to-configuration:

However, it was reverted because it had a bug that made nixos-rebuild fail to deploy, and it was too close to the release window. But there should be another attempt to reintroduce it after 24.11.

If you want to touch this part of the logic, I strongly recommend discussing it with @roberth, so that you may be able to get a better solution.

2 Likes

Without having read the PR yet: Bash to Python is an improvement. There were bugs in the past simply due to using Bash, because not many people are aware of its pitfalls compared to Python.

This is interesting, but I do think this approach has some issues. For example, the systemd-run thing was mostly a hacky workaround because the switch-to-configuration script inherits the TTY from the running shell, and this caused issues when people were switching configuration remotely via e.g.: SSH and network.target restarted, causing the connection to drop and the script to get stuck since it could not flush more data to stdout/stderr (at least this is my assumption).

I tried nixos-rebuild-ng in the same situation and this issue doesn’t happen. I got disconnected from my SSH session but the switch finished successfully. The reason, I assume, is that we are using subprocess, which doesn’t inherit the TTY, so there is no more need for those systemd-run shenanigans. Even if this is not the case, I have some other ideas about how we can handle this situation better than depending on systemd-run.

CC @roberth.

Keep in mind that I don’t think reading the PR in its current state is a good use of your time, at least not right now. I only implemented nixos-rebuild boot/switch, and so far I have only really tested nixos-rebuild switch --flake (that works, but I wouldn’t recommend doing so yet because I had a nasty bug earlier on that completely messed up my profile; thankfully I recovered it by running nix-collect-garbage -d).

I am mostly playing right now with:

  • Making sure that the code is safe to play with, e.g.: setting up mypy and ruff and making sure that the code is as safe as possible to modify
  • How to organize the code in a way that this doesn’t become another unmanageable mess
2 Likes

It would be great to converge to Rust in all these, but I think Python is a lot better than bash! Good work.

11 Likes

I appreciate the work done and modernization of the nixos-* script-suite.

Though I consider it a pity that the opportunity to extract them out of nixpkgs isn’t used.

Having them in a separate repository might make it easier to contribute to them in the long term, I think, as relevant PRs would no longer get buried under hundreds of other PRs per day.

As much as I generally think that it doesn’t make much sense to factor the NixOS modules out of nixpkgs, I think factoring out the related tooling really would benefit NixOS, nixpkgs, and potential contributors.

Especially as you could then assign more fine grained merge control.

Also I have to agree with Domen: standardizing all the tooling on a single language/runtime would be ideal. Whether we should prefer Rust or Python or Perl or Gleam, I do not really care, as long as there is an agreed-upon runtime, even if it only comes in the next iteration, when all the tooling is named *-ng-ng

10 Likes

relevant discussion:

2 Likes

I think I already made this clear, but I am not against someone writing this in Rust. It definitely will not be me, though, since I think that if I started this as a Rust project, considering that I have no prior experience with the language, one of these things would happen:

  • I will get stuck, lose enthusiasm and the project will never finish
  • The end result will be bad (I think it is difficult to be worse than the original, but you never know)

So if I am to be a big contributor to this rewrite, it will be in Python, just because it is the language I have the most experience with and it is one of the 3 big languages used in Nixpkgs for core things.

Also, I don’t think having a *-ng-ng is that bad either. The current state of nixos-rebuild (I welcome anyone that is in this thread to open the original code and try to understand what is happening) is bad enough that even if the code was rewritten in Bash itself it could be improved a lot. So if this rewrite cleared up the code enough that someone got enthusiasm to rewrite this in Rust, for example, it would already do its job.

4 Likes

The apply script won’t be a problem for a nixos-rebuild reimplementation. More than anything it is a bottom-up improvement to give a better interface to the toplevel, such that the operations on toplevel are self-contained. The benefits aren’t limited to systemd-run; also the management of the profile link is taken care of, and it improves the architecture / separation of concerns.

Specifically it simplifies the interface between the deployable operating system (toplevel) and deployment tools such as nixos-rebuild, especially when support for the “legacy” non-apply code path can be dropped from such tools.
All improvements to the “local” part of the switch operation can be decided by NixOS and picked up by any deployment tool. This includes swapping out systemd-run for something better, if it turns out we were mistaken.

Speaking of mistakes, the mistake I made in nixos-rebuild in “NixOS `apply` script” (roberth, Pull Request #344407, NixOS/nixpkgs) was a simple logic inversion (-n vs -z) that I didn’t spot and that wasn’t sufficiently tested, so your effort to better test the rewrite is much appreciated.
I expect that we can merge a fixed version after branch-off without much trouble and we might even end up backporting the apply script; maybe even the nixos-rebuild improvement as well.
nixos-rebuild-ng can implement this change at its own pace, because the addition of apply is not a breaking change.
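
For readers less familiar with Bash: `[ -n "$x" ]` is true when `$x` is non-empty, and `[ -z "$x" ]` is true when it is empty, so swapping one for the other silently flips a branch. The sketch below shows the same kind of check in Python with a tiny test that would catch an inversion. The function names and paths are hypothetical and are not the code from the PR mentioned above.

```python
def is_set(value: str) -> bool:
    """Python counterpart of Bash's `[ -n "$value" ]`: true for non-empty strings."""
    return value != ""


def is_unset(value: str) -> bool:
    """Python counterpart of Bash's `[ -z "$value" ]`: true for the empty string."""
    return value == ""


def pick_activation_cmd(apply_path: str, legacy_path: str) -> str:
    """Prefer the (hypothetical) apply script when a path for it was found."""
    # Using is_unset() here by mistake would invert the choice.
    return apply_path if is_set(apply_path) else legacy_path


def test_pick_activation_cmd() -> None:
    # Both branches are exercised, so an inverted condition fails immediately.
    assert pick_activation_cmd("/run/apply", "/run/legacy") == "/run/apply"
    assert pick_activation_cmd("", "/run/legacy") == "/run/legacy"


test_pick_activation_cmd()
```

A two-line test that covers both the empty and the non-empty case is enough to catch this class of bug.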


Regarding systemd-run, one of the goals is to persist into journald all logs that are produced as part of the switching operations.
The fact that some activations may succeed despite a broken SSH connection doesn’t mean that all of them will. For instance, if they log more, they may receive a SIGHUP.
It solves a real problem, using an appropriate tool (a transient systemd unit), so no “shenanigans”.
I’m happy to discuss alternative ways to make it more robust.
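
As a rough sketch of the transient-unit idea, the snippet below only assembles a `systemd-run` command line that a deployment tool could execute. The unit name and helper name are assumptions for illustration, and the exact set of flags used by nixos-rebuild or the apply script may differ.

```python
import shlex


def systemd_run_cmd(unit: str, command: list[str]) -> list[str]:
    """Wrap `command` in a transient systemd service.

    Without --pipe, the service's stdout/stderr are not tied to the caller's
    terminal, so output is handled by systemd (the journal by default).
    """
    return [
        "systemd-run",
        "--collect",            # unload the unit afterwards, even if it failed
        "--no-ask-password",    # never prompt for authentication interactively
        "--quiet",              # suppress systemd-run's own informational output
        "--service-type=exec",
        f"--unit={unit}",
        "--wait",               # block until the service exits, propagate its status
        *command,
    ]


# Hypothetical unit name; the activation path is the usual system profile location.
cmd = systemd_run_cmd(
    "example-switch-to-configuration",
    ["/nix/var/nix/profiles/system/bin/switch-to-configuration", "switch"],
)
print(shlex.join(cmd))
```

Because the activation runs as a service managed by systemd instead of as a child of the SSH session, a dropped connection does not take the switch down with it.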

2 Likes

Regarding Rust vs Python, I’d favor Python for a small tool like this. nixos-rebuild wouldn’t benefit from Rust’s low level capabilities or its extra capable type system; in fact, those may slow most contributors down.

If we have an overriding reason to switch to Rust, that can be done later. I’m not opposed to “converge on Rust” but “convergence” shouldn’t be the only reason.
I don’t think anyone is disagreeing about Python being good for this btw. :rocket:

22 Likes

Fair enough. Since this is going to be moved to this new apply script anyway I will not reimplement the systemd-run logic for activations in the rewrite.

1 Like

Has it been explored to move as much as possible of the logic out of nixos-rebuild and into switch-to-configuration/apply? I have a feeling that’s a good idea, but I know too little :smile: Thoughts?

nixos-rebuild has much more logic that wouldn’t make sense to move to apply script, for example the whole remote builder/deployment part (to be fair, it is not something that I will reimplement in the first version of nixos-rebuild-ng since this will take some effort).

I was taking a look at the code from nixos-rebuild to build/bootstrap nix and… Does it even make sense nowadays?

From what I understand reading the code, the idea is to bootstrap a newer version of nix because the code inside nixpkgs can include Nix code that is only supported by newer versions. But we have also had a policy for a long time that code inside nixpkgs should be compatible with Nix 2.3, so I imagine this is one of those pieces of code that made sense at some point in the past and no longer does.

Keep in mind that nixos-rebuild-ng will always be built with a reasonably up-to-date version of Nix, since I added it as a buildInput and it is wrapped inside the binary, so as long as we don’t suddenly jump to the newest version of nix it should be fine.

Anyway, I will probably put this in very low priority to reimplement for now.

Opening the PR for review. This should be a good first version since it has most of the basic features implemented and can be used to manage your system (with caveats, see the warning below). Also, it should have enough for other people to look at the code and start hacking (help is welcome to implement the remaining features, especially the remote part).

Of course there are lots of things to do yet:

However, this is still a good step. It is possible to use it as-is, so for early adopters that want to test and report issues I think this is a good start. You need to understand the limitations though:

  • For now we will install it in nixos-rebuild-ng path by default, to avoid conflicting with the current nixos-rebuild. This means you can keep both in your system at the same time, but it also means that a few things like bash completion are broken right now
  • _NIXOS_REBUILD_EXEC is not implemented yet, so unlike nixos-rebuild, this will use the current version of nixos-rebuild-ng in your PATH to build/set profile/switch, while nixos-rebuild builds the new version (the one that will be switched to) and re-execs into it instead. This means that in case of possible bugs the only way that you will get them is after you switch to a new version
  • nix bootstrap is also not implemented yet, so this means that you will eval with an old version of Nix instead of a newer one. This is unlikely to cause issues, because the build will happen in the daemon anyway (that is only changed after the switch), and unless you are using bleeding edge nix features you will probably have zero problems here. You can basically think that using nixos-rebuild-ng is similar to running nixos-rebuild --fast right now
  • Ignore any performance advantages of the rewrite right now, because of the 2 caveats above
  • --target-host and --build-host are not implemented yet and this is probably the thing that will be most difficult to implement. Help here is welcome
  • Bugs in the profile manipulation can cause corruption of your profile that may be difficult to fix, so right now I only recommend using nixos-rebuild-ng if you are testing in a VM or in a filesystem with snapshots like btrfs or ZFS. Those bugs are unlikely to be unfixable but the errors can be difficult to understand. If you want to go anyway, nix-collect-garbage -d and nix store repair are your friends
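
For context on the re-exec item above, here is a minimal sketch of how such a mechanism could look in Python. It assumes that _NIXOS_REBUILD_EXEC acts as a guard variable marking a process that has already re-executed; that reading, along with the function names, is an assumption for illustration and not the behaviour of the PR.

```python
import os

REEXEC_GUARD = "_NIXOS_REBUILD_EXEC"


def should_reexec(env: dict[str, str]) -> bool:
    """Re-exec at most once: a non-empty guard marks an already re-executed process."""
    return not env.get(REEXEC_GUARD)


def reexec(new_binary: str, args: list[str]) -> None:
    """Replace the current process with the freshly built tool (does not return)."""
    env = {**os.environ, REEXEC_GUARD: "1"}  # set the guard to avoid an exec loop
    os.execve(new_binary, [new_binary, *args], env)
```

A caller would first build the new version of the tool, then call `reexec()` with the path to the resulting binary only when `should_reexec(dict(os.environ))` is true.
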
1 Like