Nix-Infra Provisions and Orchestrates Your NixOS Private Cloud

Announcing nix-infra: create a private PaaS on Hetzner Cloud in minutes. It leverages NixOS and Nix Packages to build a reproducible and auditable private cloud.

The humble goal of nix-infra is to make managing your private PaaS so simple that Azure, AWS or other PaaS-providers become a waste of time and money.

I think this could appeal to NixOS users who want to avoid black-box services, have privacy concerns, or just want a predictable cloud bill.

Feedback is much appreciated! I have only spent a year in the Nix ecosystem, so there is much to learn. I have, however, previously created a similar automation tool that has been running in production for five years, so I know that the basic building blocks are robust. This implementation is a massive improvement. I will be migrating my own cluster to nix-infra, so it will at least be maintained for my own needs.

I have created a template project so you can easily try this out. Just head over to GitHub - jhsware/nix-infra-test: Minimal cluster setup for testing nix-infra and follow the 5 steps to create your private cloud. All you need is a Hetzner Cloud API key. The entire test takes 7-8 minutes and automagically does the following:

  1. provision nodes
  2. convert them to NixOS
  3. install and configure the cluster
  4. install user applications
  5. run tests to see that everything is working
  6. tear down the cluster

The actual tool is available in this repo: GitHub - jhsware/nix-infra: Create a private PaaS on Hetzner Cloud in minutes using nix-infra.

My hope is that this could allow users to create and share their own cluster setups in the same way people share their NixOS-configurations.

I am aware that there are a couple of other projects that provide similar functionality, but I wanted something that is easy to fork and maintain yet can still be distributed as a single, self-contained binary.

The tool is written in Dart, an approachable language that can be both interpreted and compiled. It is fast enough to run in interpreted mode without any noticeable performance penalty, which makes development a breeze. You can use nix-shell to set up the dev environment.

NOTE: I am a macOS user. There is a compiled Linux binary for x86, but my testing during the pre-release phase is on macOS.

I have now published a high-availability cluster configuration that you can use with nix-infra. The cluster consists of:

  • 3-node control plane
  • 3-node Elasticsearch cluster
  • 3-node KeyDB-cluster (Redis clone by Snap Inc.)
  • 3-node MongoDB-cluster
  • Test applications for each database
  • Connection strings passed as secrets via Systemd Credentials

This configuration only has a single ingress node, which would obviously be a single point of failure, but data is stored on multiple nodes.
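
As a rough illustration of the Systemd Credentials item in the list above, a NixOS service could receive a connection string along these lines. This is only a sketch under assumptions, not the actual nix-infra-ha-cluster configuration: the service name `example-app`, the credential ID `mongodb-uri`, the path `/run/secrets/mongodb-uri`, and the use of `pkgs.hello` as a stand-in binary are all hypothetical.

```nix
# Hypothetical NixOS module; all names and paths below are placeholders.
{ pkgs, ... }:
{
  systemd.services.example-app = {
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      # LoadCredential takes "ID:PATH". systemd makes the file available to
      # this service only, under $CREDENTIALS_DIRECTORY/ID.
      LoadCredential = [ "mongodb-uri:/run/secrets/mongodb-uri" ];
      DynamicUser = true;
    };
    script = ''
      # Read the connection string at start-up instead of baking it into
      # the world-readable Nix store.
      export MONGODB_URI="$(cat "$CREDENTIALS_DIRECTORY/mongodb-uri")"
      exec ${pkgs.hello}/bin/hello  # stand-in for the real application
    '';
  };
}
```

The point of this design is that the secret stays out of the Nix store and the unit file, and is only visible to the service that loads it.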

Building, testing and tearing down the cluster takes less than 10 minutes. The build succeeds approximately 80% of the time; if it fails, the cluster is automatically dismantled and you re-run the script.

Follow the instructions at nix-infra-ha-cluster to try this out.

This is a proof of concept and I had to take some shortcuts to get this done. The configuration is easy to modify, and the automation script is a good starting point for learning how to create your own private cloud.

Why is the success rate 80%? That doesn’t sound typical for managing infrastructure.

I am using nixos-infect to convert Ubuntu servers to NixOS, and sometimes that phase fails on the odd node. The orchestration itself doesn’t fail, and this has no impact once the provisioning and conversion of nodes to NixOS is complete.

I am not sure why nixos-infect fails. If you have any thoughts, I’d much appreciate the feedback!

nixos-infect is dead, use nixos-anywhere

At some point I am going to switch to an ISO-image instead, but nixos-infect has been both useful and easy to understand.

ISO images are inferior imo

Care to elaborate on why ISO-images are inferior?

I have now implemented retry-on-failure for nodes that fail to convert to NixOS on the first try. This reduces the failure rate from ~20% to ~0%. Each retry adds approximately 1 minute to the provisioning step.

I have tested 100 provisioning attempts of 7 nodes each: 100% success rate, with 13% of the attempts requiring retries.

nice! i should check this out a bit further. as a first question, how would you position this vs say doing the infra by terraform?

nix-infra does the provisioning and deployment of nodes imperatively because I have found that easier to reason about, especially given that stateful services may have local disk caches or data stores that need to be managed. The declarative part is scoped to node and app configuration.

Basically, you create an automation script that provisions your nodes and deploys declarative configuration. To support this, you build your cluster like an onion:

  1. provision nodes
  2. deploy control plane node configuration
  3. deploy worker node configuration
  4. deploy service app configuration (databases etc.)
  5. deploy worker app configuration (web apps etc.)
  6. deploy ingress configuration

thanks for your response! i might wanna try and draw some inspiration from your progress for my terraform tinkering.
given you noted the project is about privacy, do note hetzner does a mitm on their servers, so ensuring privacy may be a bit more involved.

Thanks for the heads up, I’ll look into that!

It’s always nice to see more options for Nix based infra. The flannel and wireguard use is pretty cool!

Since you asked for feedback, here’s a couple things I noticed looking around even though I’m not really the target audience since I prefer stateful infra management.

  • I didn’t see an explanation for what “a cluster” means, I know docs are WIP, but that’s a central concept already used a lot in the README.

  • I’m not convinced a provisioning tool should be installing and configuring NixOS things. I think this might just be a framing thing: the tool does infra and opinionated OS & mesh setup. To me, the name doesn’t express that.

  • The text templating could be replaced with a generated .nix the config imports, just like hardware-configuration.nix and networking.nix to avoid all the pitfalls of using text templating for structured data/code.
    In this case since you want to expose data to the configuration you can do something like:

    {
      options.infrastructure.self = lib.mkOption {
        type = lib.types.attrs;
        readOnly = true;
        default = { ... };
      };
    }
    

    And in the node config you’d use config.infrastructure.self.example instead of [%%example%%].
    You could use text templating to fill in the default, but you can avoid all the pitfalls by serializing the data to JSON and then parsing it: the only templated thing would be the JSON file path which you control.
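
    A minimal sketch of the JSON variant described above, assuming the provisioning tool writes a file named infrastructure.json next to the module (the file name and its contents are assumptions, not something nix-infra is known to generate):

    ```nix
    # Hypothetical module, e.g. infrastructure.nix, imported by the node config.
    { lib, ... }:
    {
      options.infrastructure.self = lib.mkOption {
        type = lib.types.attrs;
        readOnly = true;
        description = "Per-node data written by the provisioning tool.";
        # Parse the generated JSON instead of templating values into Nix code.
        # Only the file path is fixed here, and the tool controls that path.
        default = builtins.fromJSON (builtins.readFile ./infrastructure.json);
      };
    }
    ```

    With this in place, a key such as "example" in the JSON file would be read as config.infrastructure.self.example, and no node data is ever spliced into Nix source as text.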

  • build.sh in a Nix project is ironic :stuck_out_tongue:

  • guaranteed privacy – all data within your private walled garden

    That’s not really something you can guarantee, you’re better off making more precise claims

Don’t feel pressured to make any of these changes, I’m just commenting on what stood out to me while waiting for NixOS test runs :slight_smile:

Many thanks for your thoughtful feedback and I appreciate the humble tone! I really like your note on templating, that is definitely the way to go!

Released v0.9.6-alpha with support for specifying a placement group when provisioning a node. This instructs Hetzner to spread your nodes over several racks so that a single server failure does not cause complete disruption of service or data loss.

The HA cluster example has been updated to use this feature.