NixOS and GitOps

To avoid going off-topic in other threads, let’s move the discussion of the topic here.

For the general NixOS GitOps workflow I am quite happy with comin. Relying solely on monitoring for feedback makes it a little harder to use than it should be - but I think it’s a great start. The recently added post-deploy hook will also help with visibility.
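
For reference, the NixOS side of a comin setup is pretty small. A minimal sketch (assuming comin’s NixOS module is already imported from its flake; option names are from memory of the README and may differ between versions, and the repository URL is a placeholder):

```nix
{ config, ... }:
{
  # comin pulls the infra repo itself and switches when the branch moves,
  # so no push access to the machine is needed.
  services.comin = {
    enable = true;
    remotes = [
      {
        name = "origin";
        url = "https://git.example.com/infra/nixos-config.git";
        branches.main.name = "main";
      }
    ];
  };
}
```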

Now, we are still using OCI containers with NixOS - for the product that we are deploying, but also for some dependencies. There are a couple of reasons for this:

a) Devs

There is no way to onboard all devs to Nix without some casualties. I am still struggling with nix on Darwin myself sometimes. OCI images are what everyone knows, and at least they take away some of the ambiguities when it comes down to running the same code locally and on the servers.

b) Official Releases

For dependencies like databases, OCI images give easy control over which version to run - independent of the OS release cycle. While this is technically also true with nix, there is a benefit in using the exact same official container release by the original project. And reason a) applies here as well.
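
For what it’s worth, running such an image declaratively on NixOS is not much work either - the oci-containers module turns it into a systemd-managed unit. A rough sketch with placeholder tag, env file, and data path:

```nix
{
  virtualisation.oci-containers = {
    backend = "podman";
    containers.postgres = {
      # the "official" upstream image; pin a tag (or digest) of your choosing
      image = "postgres:16";
      ports = [ "127.0.0.1:5432:5432" ];
      volumes = [ "/var/lib/postgres-data:/var/lib/postgresql/data" ];
      # POSTGRES_PASSWORD etc. kept out of the nix store
      environmentFiles = [ "/run/secrets/postgres.env" ];
    };
  };
}
```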

c) Eval

Now, if we wanted to use comin to deploy our app (multiple times a day), it would always require a switch. Unfortunately the evaluation is expensive, and I imagine doing that on a busy server is … not … that great. Having another build server that copies the derivation over is … a lot more machinery.

By default we just update and switch during low-traffic times, once a week.

Oh, and I am also not a fan of a CI pipeline making commits. We usually use image tagging instead of having CI commit the SHA to the infra repo - which the comin approach would require.

d) Rollout

But what might be even worse: I am not aware of a way to roll out an update to a native service in a rolling or blue/green manner to avoid downtime. Neither is there a great concept for running multiple instances of a service. I know systemd has @-templating … but that does not feel like enough either.

And while I would love to ditch some of the container parts, it just feels like the most pragmatic approach is to keep using them with NixOS.

That said, even with k3s on NixOS the complexity is still just :pensive:.
Very much so when you look at Helm and all the painful YAML.

We have been running Nomad before, which was a nice in-between. But the community feels tiny, and IMO the operations experience isn’t particularly great either.

I have not found a solution that makes me happy TBH.


Yep, sounds like you’re missing some tooling. I think among these there are only two points I would really disagree with:

I think this somewhat boils down to superstition, unless you have some contract with the original upstream. If you do, nothing stops you from having a support contract with one of the nix consultancies.

Often upstream doesn’t even actually provide the images, e.g. the postgres docker container is unaffiliated with upstream.

This means that you don’t actually have a reproducible deployment. You need to collect the image tags that happened to be deployed at any one point to reproduce the system state.

The “NixOS way” is definitely to have deployment follow whatever is in git. That’s what declarative configuration means.

CI doesn’t actually need to create commits, either, but it’s very convenient to do so for automated version bumps. This is where a monorepo approach may be helpful, but yeah, I appreciate that would be a significant departure from your current workflow.


I don’t think you need to ditch the oci parts anyway, pragmatic solutions aren’t bad. My opinion comes from an idealistic perspective - oci containers as a concept don’t really serve any purpose that isn’t already provided by nix+systemd, and using them comes at an integration cost.

I’m also just not a huge fan of oci containers where they can be avoided, they’re .exes through the backdoor and therefore make it harder to control the deployed software. Swapping out a vulnerable library isn’t trivial.

In practice there are real benefits to doing what everyone else is doing, and NixOS tooling for gradual rollouts indeed isn’t in a great state. You basically need k8s, which, yeah, I’m not a huge fan of either.


That’s not really my point.

https://hub.docker.com/_/postgres/

is marked as the “official postgres image”.

When devs pull a docker image, that’s usually what they will get. This isn’t really about contracts - it’s about a convention. When we talk about “the” docker image for postgres, that’s what 90% of devs will probably have in mind.

Nix is still a much smaller crowd. That makes a difference.
There are many more people using that “official” image than the nix package.
And more people using it usually means a different situation when it comes down to support.

As awesome as the nix crowd might be - this is about numbers.
It’s just easier to find help, and docker and friends are more approachable.

Instead of one SHA we have two SHAs.
One for the application and one for the infra.
Calling that “not reproducible” feels like a bit of a purist stretch.

If no commit is necessary, how does the SHA find its way from the application to the infra repo? This certainly is a non-issue for a monorepo setup. Is that what you meant?

I know there are big proponents of monorepos.
While I see some benefits, I can’t say I am a fan.

My idealistic point of view is also very much in line with nix. But there still is a gap between what I need and what nix without containers offers.

I certainly wish that would be different.

tooling for gradual rollouts indeed isn’t in a great state

Is there any that isn’t container related?

I suppose I’ll throw in my two pennyworth:

if you are on AWS, a very pleasant experience (in my not at all humble opinion) is to use the official NixOS AMI + configuration.nix (or a shell script, whatever floats your boat) in user-data. Then you just have your CI/CD bump the launch template and kick off an instance refresh. Working at medium-size company X with ~100s of VMs, this works basically flawlessly for us (especially when you have your nix artifacts pinned to versioned s3 objects). I literally don’t think of k8s at all. And if I were in Google Cloud I’d do the same thing, and if I were on OpenStack I’d hook into cloud-init. I have absolutely zero need for containers anywhere. nixos + systemd + systemd hardening imo is 10x more pleasant than being dropped into some random bullshit container with no tools (I know there are workarounds, but why do that when you don’t have to).
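
To make that a bit more concrete: the stock NixOS AMI’s amazon-init should pick up a Nix expression in user-data and treat it as the machine’s configuration.nix. A rough sketch of what could go in there (cache URL, key, and the app’s service module are placeholders):

```nix
{ config, pkgs, ... }:
{
  imports = [ <nixpkgs/nixos/modules/virtualisation/amazon-image.nix> ];

  # substitute pre-built artifacts instead of building on the instance
  nix.settings = {
    substituters = [ "s3://my-nix-cache?region=eu-central-1" ];
    trusted-public-keys = [ "my-nix-cache:<public key>" ];
  };

  # hypothetical module for the application itself
  services.myapp.enable = true;
}
```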

Yes, by the company that hosts hub.docker.com - it’s “official”, but isn’t directly affiliated with the postgres folks. I.e., it’s just as official as the nixpkgs postgres package. If you talk about the “official” nix package for postgres, people will understand it to mean that one.

Not that I disagree with the point about oci images having a larger userbase - I agree this can be a benefit. Your statement from the first post is just factually wrong in the case of postgres (since you explicitly said the original project, which pretty clearly implies the postgres upstream), and your statement about what benefit this actually gives you was vague beyond “it’s official”.

Now you have restated it to be a repeat of your first point, which is fair, but I think you can see why I disagree with your original statement. “officialness”, especially with the broad definition in the context of oci containers, is not an inherent benefit.

This is fair, I’d assumed you weren’t setting the hash in a second repository, but relying on docker pull shenanigans in the field.

Yep! Though you can also have devs actually perform the version bumps in the infra repository by hand; automation isn’t required. What makes sense depends on your workflows.

I felt like this historically, but the more gitops projects I’m involved in the more I feel like monorepos are the way to go. But yep, opinions differ.

I don’t think there are any production-ready projects that do that, no. I recall seeing some experimentation somewhere, but can’t find it :expressionless:

That sounds pretty sweet. Can you elaborate on this?

especially when you have your nix artifacts pinned to versioned s3 objects

Unfortunately full hosting in the big clouds is not an option for us.
Some things would be considerably easier - but there are reasons :pensive:

I have absolutely zero need for containers anywhere

How do you run multiple instances of a service on one machine?
When you do an instance refresh, what’s the rollout story with regard to availability?

I mean, there isn’t much more to it: user-data is automatically run as a script on instance launch, which pulls the relevant nix code for that instance based on the ASG launch template. That script is just a templated, glorified nixos-rebuild switch.

We don’t? I’m not sure I see why I would want to do that. Anything we write is CPU-scalable and can use whatever resources on the box it needs. A box is a failure domain, so if I wanted more resiliency I’d add more boxes, not more service instances on the same box.

The service stays fully up. You can configure AWS to not destroy existing ASG instances until the new ones pass health checks (launch before terminate), so we are never unavailable during a deployment.


I agree that “original” and “officialness” can be argued. But my point stands:
It’s more widely used, and it’s easier for the average dev to get help with.

I am not saying I don’t wish it were different. But unless things change in that regard, I do see this as a benefit. The difference in team composition (to put this as nicely as possible) will also make a difference here.

I totally accept if you feel differently.

I might be pragmatic - but I am not a monster :upside_down_face:

We are using CD. “By hand” is not really something I see as an option.

But even with a monorepo and a single SHA, I am not sure what the deploy workflow would look like when I don’t want to build the app on the server.

I guess building on the CI server, caching and then doing the switch would work.
Is there a self-hosted cachix alternative?

Unfortunately it still would not help with the multi-instance and zero-downtime aspects.

But given that systemd does not really support this either - I wouldn’t even know how to bring this to NixOS natively.

Not quite the same for us. (node.js)

But maybe it could be worth pushing that kind of scaling into the app itself.

I am jealous of your simple setup.
Food for thought.

Oh I see. If this were me then I’d probably have the bootstrap script pull the number of CPU cores, and use that in my rebuild switch to run N instances of a templated systemd service. I think that would work quite nicely.
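
Roughly what that could look like on the NixOS side - a sketch that stamps out N near-identical units (package name and port scheme are made up, and the instance count is hard-coded here; in the setup above it would be derived from the core count at bootstrap time):

```nix
{ config, lib, pkgs, ... }:
let
  instances = 4; # e.g. number of CPU cores, passed in by the bootstrap script
in
{
  systemd.services = lib.listToAttrs (map (i: {
    name = "myapp-${toString i}";
    value = {
      description = "myapp instance ${toString i}";
      wantedBy = [ "multi-user.target" ];
      serviceConfig = {
        # hypothetical package; each instance listens on its own port
        ExecStart = "${pkgs.myapp}/bin/myapp --port ${toString (3000 + i)}";
        Restart = "on-failure";
        DynamicUser = true;
      };
    };
  }) (lib.range 1 instances));
}
```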


Serving nix caches is built-in functionality, that bit is pretty trivial: Serving a Nix store via HTTP - Nix 2.28.4 Reference Manual

There are some other projects which add bonus features like tenancy, but your everyday company binary cache has zero need for those advanced cachix features.

The S3 store that the official binary cache is hosted with is also built-in, though the bucket egress/ingress costs are higher than just hosting it on a random VPS from most cloud providers.
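
For completeness, a sketch of the cache host plus the consumer settings (host name and key paths are placeholders; the key pair comes from `nix-store --generate-binary-cache-key`):

```nix
{
  # Cache/build host: serve the local /nix/store over HTTP.
  services.nix-serve = {
    enable = true;
    port = 5000;
    secretKeyFile = "/var/keys/cache-priv-key.pem";
  };

  # On the consuming machines you would point nix at it instead:
  nix.settings = {
    substituters = [ "http://cache.internal:5000" ];
    trusted-public-keys = [ "cache.internal:<contents of cache-pub-key.pem>" ];
  };
}
```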

Right, fair, if you don’t have versioned releases this indeed makes no sense. CD is the exact scenario that convinced me of monorepos, and the reduced need for empty automation for a clean gitops flow is at least part of it, FWIW.

Yep, this definitely is what kills pure NixOS deployments in your case.

I’ve only seen this solved with k8s so far. You could probably build your own poor man’s k8s with systemd socket activated services or something, but it’d be a mess and reinventing some fairly complex logic.

NixOps is apparently being revived; maybe it’ll have a solution.

Wait - that’s more or less just serving the nix-store via HTTP?

Are there any options for http auth? The manual does not seem to cover that part.

It sure makes things a little easier.
But I feel the real friction is somewhere else.

We are currently using k3s. But even with k3s the complexity just stinks.

Yep. That’s basically what cachix does, too.

Nope. That’s where the third-party cache implementations come in. It shouldn’t really be necessary for private caches, though - just don’t host them on the public internet. Or put it behind nginx; you should be able to configure nix to provide user/pass via a netrc file.
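
A sketch of that combination (host names, paths, and credentials are placeholders):

```nix
{
  # Cache host: nginx with basic auth in front of nix-serve on port 5000.
  services.nginx = {
    enable = true;
    virtualHosts."cache.example.com" = {
      basicAuthFile = "/var/secrets/cache.htpasswd";
      locations."/".proxyPass = "http://127.0.0.1:5000";
    };
  };

  # Client side: let nix read the credentials from a netrc file containing
  #   machine cache.example.com
  #   login deploy
  #   password <secret>
  nix.settings.netrc-file = "/etc/nix/netrc";
}
```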

If configuring it with the server itself is a requirement, the ssh caches work fine, too. Which, yes, is basically just having a host with a nix store lying around somewhere. Nix’s caching is remarkably simple.