Eventually reproducible nix(os)-builds

I would like to add a new feature to Nix/NixOS, and want to brainstorm the idea with you folks :slight_smile:

Problem:

I’m running NixOS outside of its comfort zone: On multiple VPS with 1GB RAM and 10GB storage. This is only possible by

  • Using a remote builder → some packages take too much space when being built.
  • nixpkgs.url = github:NixOS/nixpkgs/master and updating hourly → when there are too many changes at once, the disk fills up while rebuilding
  • Running nix-collect-garbage -d hourly → otherwise the disk is full in a matter of hours
  • Having tweaked the store as much as possible (Hard-Linking, etc)
    • nix.settings.auto-optimise-store = true;
    • nix.settings.min-free = "${toString (100 * 1024 * 1024)}";
    • nix.settings.max-free = "${toString (1024 * 1024 * 1024)}";
    • nix.settings.max-jobs = 0;
  • Using Impermanence to have some “auto clean” on startup
  • Move Logfiles off the system very aggressively (hourly logrotate)
  • having /nix/store on a compressed btrfs filesystem
  • (Some optimizations I added in the past year and I’m probably forgetting)
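
For reference, the store-related settings from this list could be collected into a single module. This is a minimal sketch, not the poster's actual configuration: the `nix.gc` block is an assumed declarative stand-in for the hourly garbage-collection job, and the builder's host name and job count are placeholders.

```nix
{ ... }:
{
  nix.settings = {
    auto-optimise-store = true;                     # hard-link identical store files
    min-free = "${toString (100 * 1024 * 1024)}";   # start collecting below 100 MiB free
    max-free = "${toString (1024 * 1024 * 1024)}";  # stop once 1 GiB is free
    max-jobs = 0;                                   # never build locally
  };

  # Assumption: a declarative stand-in for the hourly garbage-collection job.
  nix.gc = {
    automatic = true;
    dates = "hourly";
    options = "--delete-old";
  };

  # Assumption: host name and maxJobs are placeholders for the real builder.
  nix.distributedBuilds = true;
  nix.buildMachines = [
    {
      hostName = "builder.example.org";
      system = "x86_64-linux";
      protocol = "ssh-ng";
      maxJobs = 4;
    }
  ];
}
```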

And still, sometimes the updates pile up and I have to change my config to remove all “non-vital” packages, rebuild the system in the minimal configuration, (reboot and reclaim the space by garbage collection) and then add the “non-vital” packages back.

Idea

Instead of building the whole system derivation at once, I would like to build it in partial steps. Say an upgrade includes new versions of bigPackageA and bigPackageB: the regular nixos-rebuild fails with “disk full”, because there is not enough space for four big packages (i.e. bigPackageA.old, bigPackageA.new, bigPackageB.old, bigPackageB.new).

I would like the new upgrade mechanism to:

  1. Try building the systemDerivation using the regular nixos-rebuild → if that works, I’d rather stick with the tried and tested.
  2. If the previous step fails due to a lack of storage space, build a new systemDerivation that sits between my old derivation and the new one, upgrading only one package and masking the other packages to their old versions. If the build of this intermediate derivation succeeds, the old one may be discarded from the Nix store. Repeat until the target systemDerivation is reached.
  3. When an error happens during the build, return to the initial systemDerivation building each previous systemDerivation until we’re back to the initial state. Somewhat like this:
    InitialState → build derivationWithA → build was successful, cleaning up old A → derivationWithAB → build was successful, cleaning up old B → build derivationWithABC (Errors) → clean up → build derivationA → build initialState

The builds would still be reproducible while having a smaller storage footprint.
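
One way the “masking” in step 2 might be expressed is an overlay that lets the system follow the new nixpkgs while one package is still held back. This is a rough sketch under assumptions: `pkgsOld` is a hypothetical module argument carrying the older package set (for example passed in via `specialArgs`), and `bigPackageB` is the placeholder name from the example above.

```nix
# Intermediate configuration: everything follows the new nixpkgs,
# except bigPackageB, which is still taken from the old package set.
{ pkgsOld, ... }:
{
  nixpkgs.overlays = [
    (final: prev: {
      # Hypothetical package name; drop this line in the next step
      # so that bigPackageB moves to the new version as well.
      bigPackageB = pkgsOld.bigPackageB;
    })
  ];
}
```

Note that an overlay also changes every package that depends on the masked one, so an intermediate system like this may require builds that no binary cache provides.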

Implementation

  • First, I believe I need some way of tagging a package/service as “non-vital”, so that the new upgrade mechanism knows which packages can be removed without breaking the update system. The tagging should not be a problem.
  • Also, I think having both build trees (currentState, targetState) would be beneficial. I will need to make small alterations to the currentState tree and then feed it back into Nix. These alterations would take nodes from the targetState tree and update the hashes of their parent nodes.
  • Pulling it all together would be a script that would automate the tree manipulation and nix calling parts.

Questions

  • Do you think this is a feasible project? Are there technical reasons that make this impossible? Is there a logical fallacy I stumbled into?
  • Where could I start getting the trees from?
    • I was able to get the abstract syntax tree (AST) from a Nix expression by using nix-instantiate, but it’s only for one file and not for the whole project.
    • I poked around in github:nixos/nix, looking for good entry points where I could hook into, but have found none.

Disclaimer

I’m aware that the whole idea of running NixOS on tiny machines is a bit ridiculous, or straight-up stupid. I don’t advise anyone to do such a thing, let alone in production.

But I believe that this feature might come in handy for someone, somewhere, someday.

You could base your solution on this. Instead of bumping this hourly, walk through nixpkgs commit-by-commit. That should reduce your update scope, and probably be much quicker than you expect since not every commit has to rebuild anything (in fact, if you stop using master you’d actually benefit from the hydra cache). This does mean writing your own updater, but that is relatively trivial.

Much more so than some crazy partial package deployment solution, at least; you’re basically writing your own distro at that point. That is viable, but it’d probably be easier to just start from scratch and not use the NixOS module system.

Hell, you’d probably do a bit better with non-flake nix and only pulling in post-eval paths, rather than potentially turning self-references into store paths that are required on the target.

I do this at work all the time. You just have to have realistic expectations. In fact our default root size is 8G.

  1. It’s really important to have swap with that little ram
  2. It’s really important before running nixos-rebuild to kill every service you can; a successful rebuild will bring them back up.
  3. It’s really important to nuke every generation and collect it
  4. It’s really important to have aggressive journald settings
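
Points 1 and 4 can be set declaratively. A minimal sketch, assuming a swap file at `/swapfile` on a filesystem that supports swap files; the sizes and retention values are illustrative assumptions, not the poster's settings.

```nix
{ ... }:
{
  # Assumption: a 2 GiB swap file; `size` is given in MiB.
  swapDevices = [
    { device = "/swapfile"; size = 2048; }
  ];

  # Assumption: illustrative limits for the persistent journal.
  services.journald.extraConfig = ''
    SystemMaxUse=256M
    MaxRetentionSec=1week
  '';
}
```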

EDIT: I’d also add, if you intend to have a stateful service that writes to disk you just can’t have an 8GiB root. You should have a separate partition for the state data (so you have built-in quota, you can recover the os in the worst case by killing it, etc etc).

EDIT2: I guess the other thing I should say is that machines this small are generally not updated that much; mostly we just blow them away when we are doing bumps to the underlying system. It’s NixOS after all :smiley:

[ben@i-046e6826ad9bf7de7:~]$ df -h
Filesystem                Size  Used Avail Use% Mounted on
devtmpfs                   92M     0   92M   0% /dev
tmpfs                     914M     0  914M   0% /dev/shm
tmpfs                     457M  7.6M  450M   2% /run
/dev/disk/by-label/nixos  8.6G  4.4G  3.8G  54% /
efivarfs                  128K  3.0K  126K   3% /sys/firmware/efi/efivars
tmpfs                     1.0M     0  1.0M   0% /run/credentials/systemd-resolved.service
tmpfs                     1.0M     0  1.0M   0% /run/credentials/systemd-networkd.service
tmpfs                     914M  2.2M  912M   1% /run/wrappers
/dev/nvme0n1p1            249M   84M  166M  34% /boot
tmpfs                     1.0M     0  1.0M   0% /run/credentials/systemd-journald.service
tmpfs                     183M  4.0K  183M   1% /run/user/1003

Why do you initiate updates from the small system? You’d have more success pushing updates to the small system.

Wouldn’t it be easier to remotely build (using Nix if you like) and deploy some kind of read-only system image? It looks like you’re already giving up on most of the benefits of NixOS: you can’t use nixos-rebuild, you can’t roll back because you’re doing hourly gc…

There are various image builders available for this exact purpose, fyi, see nixos-generators. There are some image formats for cloud targets in there, which is most likely what you’re looking for if you’re deploying to this kind of host.

I’m not sure why I assumed you’re pushing the build results with nixos-rebuild --target-host, if you’re deploying the results of a remote build locally… Well, this does smell a bit like an XY problem :wink:

My remote builder is caching everything too, so I’m faster than the Hydra cache, at the cost of building everything once; the other VPSes then hit that cache. When I wait for the Hydra cache (been there, using unstable), the storage bottleneck happens a lot more often.

I don’t think I’ll be able to do what nixpkgs does just on my own. It feels like I’m 90% where I want to be, and I have no idea how to bootstrap my own OS using Nix. Are there any guides? xD

Could you elaborate on that? I understand only half of it.

Is there a big difference other than downloading the tars to the vps and evaluating them locally, then call for the remotebuilder to build the package and send the result back? The .cache folder on root is cleared by a process after the build.

I don’t like giving up the “independence” of the vps that would come with the usage of a deployment server. It feels more decentralized the way I do it now. I could quickly go onto console and say --max-jobs 1 and it is independent of the remotebuilder. The deployment server seems to be more involved to switch…

My VPS doesn’t really support image uploading. I used nixos-anywhere to get NixOS there. They still think I’m using Debian… Or is there a way to deploy via SSH and replace everything? At some point, the ssh binary would be overwritten and then… everything dies?

This is in fact an XY problem:

  • I want to describe my whole Infrastructure via nix or similar tooling
  • I want as few inter-system dependencies as possible (best would be no external builders whatsoever)
  • I want machines with minimal footprint (cost efficiency) → ideally an Alpine level of minimalism

Nothing exists in this space, so I would have to build it myself. My idea was to hack it somehow into Nix/NixOS, to build the ultimate system while still staying close to the tried and tested.

Do you mind sharing? I never really found out how to do it the clean way…

  1. I haven’t seen any explicit discussion of limiting the number of generations for the bootloader. By default you may well be keeping too many of them which would prevent GC from collecting them, and contribute to the pile up.
  2. I am surprised to see such a requirement that you want to eval on a device with 1GB of RAM. I’m curious, can you elaborate?

I talk to a looooot of entities doing this kind of thing and they all prefer to eval+build on a trusted remote instead of doing that work on edge nodes. Especially since it seems you’re already leveraging a centralized cache.

services.journald.extraConfig = "SystemMaxUse=256M";

Is currently doing the business for us.

You’re speaking of something like boot.loader.grub.configurationLimit? I didn’t know that these side effects would occur. Thanks for the notice!
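
For illustration, such a limit could look like the fragment below. The value 3 is an arbitrary assumption; the option only limits how many generations get a boot menu entry, and the commented line is the counterpart for systemd-boot.

```nix
{ ... }:
{
  # Keep boot entries only for the three most recent generations (assumed value).
  boot.loader.grub.configurationLimit = 3;

  # If systemd-boot is used instead of GRUB:
  # boot.loader.systemd-boot.configurationLimit = 3;
}
```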

I can’t speak to the OP’s requirements, but I think there can be good reasons for doing eval on small nodes. We are using AWS autoscaling groups, which lend themselves quite nicely to a pull model where, on boot, the device receives its Nix config and evaluates to a target state without the need for any third-party involvement. If your actual service has modest requirements, it’s pointless to spin up a huge machine just for boot. On boot you have nothing else to compete with the nix-daemon, so with a smattering of swap a pretty small VM can work just fine.
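
A timer-based variant of such a pull model can be approximated with the stock `system.autoUpgrade` module. This is only a sketch and not the autoscaling setup described above; the flake reference is a hypothetical placeholder, and the node is assumed to have a matching `nixosConfigurations` entry for its host name.

```nix
{ ... }:
{
  system.autoUpgrade = {
    enable = true;
    # Hypothetical flake reference; each node pulls and evaluates its own config.
    flake = "github:example/infrastructure";
    dates = "hourly";
    allowReboot = false;
  };
}
```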

nixos-anywhere supports kexec, but it’s difficult to predict if that’ll work with your system and limited memory. It really seems like you would be better off either getting bigger VMs, or changing to a host where you can just boot a ready-baked image (or doing something something PXE).

You could give your NixOS configuration two different versions of nixpkgs (e.g. nixpkgs-old nixpkgs-new) and a “level” (e.g. build-level) with a range of rebuild stages (e.g. 1 - 10). Then decide which “level” is required for a package to use the new version. You can then rerun your builds by iterating over these stages. When using flakes you could expose them as different nixosConfigurations outputs. You can switch to and rebuild with stage 1 first, then stage 2, etc. until all your packages have reached their newest version.
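
A staged setup along these lines might be sketched as the flake below. Everything named here is an assumption for illustration: `OLD_REV` and `NEW_REV` are placeholders for revisions you would pin yourself, `bigPackageA` and `bigPackageB` are the placeholder package names from the original post, and `levels` is a hypothetical mapping from package name to the first stage that uses the new version.

```nix
{
  # Placeholders: pin these to real nixpkgs revisions.
  inputs.nixpkgs-old.url = "github:NixOS/nixpkgs/OLD_REV";
  inputs.nixpkgs-new.url = "github:NixOS/nixpkgs/NEW_REV";

  outputs = { self, nixpkgs-old, nixpkgs-new }:
    let
      system = "x86_64-linux";
      lib = nixpkgs-old.lib;
      pkgsNew = nixpkgs-new.legacyPackages.${system};

      # Hypothetical mapping: package name -> first stage using the new version.
      levels = {
        bigPackageA = 1;
        bigPackageB = 2;
      };

      # Overlay for one stage: packages whose level has been reached
      # are replaced by their counterparts from nixpkgs-new.
      overlayFor = stage: final: prev:
        lib.mapAttrs (name: _: pkgsNew.${name})
          (lib.filterAttrs (_: level: level <= stage) levels);

      # One NixOS system per stage, built on top of nixpkgs-old.
      mkStage = stage: lib.nixosSystem {
        inherit system;
        modules = [
          ./configuration.nix
          { nixpkgs.overlays = [ (overlayFor stage) ]; }
        ];
      };
    in {
      nixosConfigurations = {
        stage1 = mkStage 1;
        stage2 = mkStage 2;
      };
    };
}
```

Each stage would then be built and activated in turn, for example with `nixos-rebuild switch --flake .#stage1`, collecting garbage in between. In this sketch the base system stays on nixpkgs-old, so a final step that switches the whole configuration to nixpkgs-new is still needed, and packages pulled from nixpkgs-new bring their own dependency closure with them.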

Sure, but the XY problem is why do you want these things. If it’s just academic interest, fair enough, but at that point I’d argue you should be building your own cool nix-based distro.

You can get significantly further if you just drop your second requirement; at that point having a management server which deploys the others is almost trivial. It’s also the de-facto industry standard, it’s how all cloud deployment infrastructure works. You don’t explain why you have that second requirement. So… what @colemickens said :smiley:

It could be done with e.g. an A/B scheme where you switch between a runtime system and an upgrade system on reboot (or with kexec). Just ask your firmware to boot a different partition on next boot, and write your updated image to that partition. It’s how (well-designed) embedded firmware upgrades are done. That, or you simply tear down your node and boot another - maybe get a better cloud provider which actually supports programmatic node provisioning.

The openssh server (with all relevant configuration) would just be statically configured in the image, perhaps with a read-/writeable volume that isn’t overwritten on deployment if you don’t hardcode ssh keys in the configuration. Just wait for the service to go up again and everything works.

This can also be done in a pull-based fashion, but it’s less flexible (i.e., can’t swap out a specific service) since you’re deploying full images rather than generations. It’d be my personal preference for how to do resource-constrained cloud deployments with nix, though, since my interpretation of the cattle pattern implies that they should never be meddled with in individual fashion.

No need to evaluate the entirety of nixpkgs locally; given you’re resource constrained, that frees up a good 300MB of system memory. Maybe that would let you go for even cheaper nodes!

Evaluating a flake requires a copy of your repository in /nix/store, which makes your storage constraints slightly more problematic. I thought you were building and deploying from other hosts, but as-is you’re always doubling your actual repo size so self-references aren’t even relevant. If you didn’t use flakes, you’d eat the cost only once.

Anyway, the point is more that this felt like another small incentive to ditch using flake update semantics, especially since that’s most likely all you’re using them for, given this is about a NixOS deployment.

I’m a cheapskate and use VPSes at 1€/month/node. But I want systems that keep themselves updated, no matter what happens around them. That way, they are more resilient when parts of my infrastructure fall apart.

Edit: Is it possible to spin up (and afterwards remove) a swap file for certain packages via an overlay? Would that practically eliminate the memory problems?

I’m using NixOS only in my homelab; there are no professional incentives behind it.

I’m trying to explain, but maybe I’m hitting the language barrier? I want them all to be able to work independently, in order to make the system more resilient. When the central deployment server is down, the systems will not be updated. There are security implications if I’m unable to fix everything quickly.

In that case, when you want to update, you can kill the old one and build a new one. Don’t update in place. Have backups of the state. A VPS can die at any point, even in a world with live migrations. Commit to the bit!

i think i am missing something here… why is the nixos module system the problem for final closure size? nixos could be considered the culprit because it includes many packages in default system, but there are efforts to reduce this - i am sure i misunderstood something here, can anyone point out to me? maybe something obvious?

I don’t think I’ll be able to do what nixpkgs does just on my own. It feels like I’m 90% where I want to be, and I have no idea how to bootstrap my own OS using Nix. Are there any guides? xD

it isn’t nearly as difficult as you would expect! i run my own nix based os on my laptop as my daily driver for work and everything: GitHub - aanderse/finix: An experimental os, featuring finit as pid 1, to explore the NixOS design space

your use case sounds interesting to me so i would love to learn more
maybe a blog series on writing your own nix based os would be helpful to a wider audience…

I would subscribe to your newsletter.
