Incrementally salvaging stateVersion

Edit: Proposal on hold; see update.


system.stateVersion is widely acknowledged to be an imperfect thing. I’d like to propose incrementally augmenting it with a less imperfect thing.

Problems I want to address:

  1. Users, new and old, perennially want to update system.stateVersion. We can tell them not to until we’re blue in the face, but there’s a certain kind of OCD computer toucher who simply cannot allow anything that looks like a date in their system config to be more than a year old; and the NixOS community has, I would wager, more than the base rate of such users. Some have suggested we mollify these users by making stateVersion not look like a date, but I think that’s letting the tail wag the dog. It’s not entirely unreasonable to want to migrate your system state periodically to conform to current best practices. The biggest problem with actually doing so is that:
  2. The effects of system.stateVersion are opaque. The number of modules that use it is in the dozens — not impossible to manually review, especially given that many of those modules you can know are not relevant to your system, but enough to make it possible for something to slip through if you are less than maximally careful. And even if you are careful:
  3. Updating system.stateVersion is all-or-nothing. You have to migrate the data for every module that uses it at once, which increases both the risk of getting it wrong and the cost of doing it at all. AFAIK, there are no intermodule dependencies using system.stateVersion — no module assumes that another module will operate in a certain way because of the current value of system.stateVersion — so there’s no reason for it to be this way, other than avoiding the clutter of a lot of individual stateVersion definitions.

I’d like to get some support for doing the following:

  • Declare a new option system.moduleStateVersions of type attrsOf int. Nothing should depend on this option, and it’ll only be really useful once we’ve done more work, but we can create it now.
  • On a per-module basis, for modules that use or want to start using system.stateVersion, use the following pattern:
    • Declare a new stateVersion option for that module (services.whatever.stateVersion) of type int. It should take a default value that depends on system.stateVersion. Nothing else in the module may depend on system.stateVersion. Everything you want to version depends on the module-local stateVersion instead. (Edit: Original credit to @ElvishJerricco for seeding this idea in my brain a year ago, apparently.)
    • Inside the config = lib.mkIf cfg.enable block, set system.moduleStateVersions."services.whatever" = cfg.stateVersion;.
  • Put the above in the NixOS documentation.

Example module:

{ config, lib, pkgs, ... }:
let
  cfg = config.services.whatever;
in
{
  options.services.whatever = {
    enable = lib.mkEnableOption "whatever";
    # or: stateVersion = lib.mkStateVersionOption "the whatever service" config [ "23.11" "25.05" "26.05" ];
    stateVersion = lib.mkOption {
      description = ''
        Versions the format of persistent state used by the whatever
        service. Changing this value requires understanding the
        module well enough to be able to migrate this state by hand.
      '';
      type = lib.types.int;
      default = lib.lists.findFirstIndex (lib.versionOlder config.system.stateVersion) 3 [ "23.11" "25.05" "26.05" ];
      defaultText = ''
        Depends on system.stateVersion:
          if >= 26.05: 3
          if >= 25.05: 2
          if >= 23.11: 1
          otherwise: 0
      '';
    };
  };

  config = lib.mkIf cfg.enable {
    systemd.services.whatever = {
      # ...
      something = if cfg.stateVersion >= 3 then "newest" else "legacy";
      somethingElse = if cfg.stateVersion >= 1 then "modern-ish" else "really-old";
    };
    system.moduleStateVersions."services.whatever" = cfg.stateVersion;
  };
}
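
The example module assigns to system.moduleStateVersions, so that option has to be declared somewhere before any module can use the pattern. A minimal sketch of such a declaration follows; the description wording and the empty default are assumptions, not something the proposal specifies:

```nix
{ lib, ... }:
{
  options.system.moduleStateVersions = lib.mkOption {
    # One integer per enabled module, keyed by the module's option path,
    # e.g. { "services.whatever" = 3; }.
    type = lib.types.attrsOf lib.types.int;
    # Assumed default: empty until an enabled module registers itself.
    default = { };
    description = ''
      State versions reported by enabled modules. Informational only;
      no module should read this option to make decisions.
    '';
  };
}
```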

Doing this incrementally immediately solves problem 3 for the modules that have been incrementally updated. By setting services.whatever.stateVersion themselves, users have a way to locally migrate modules one at a time instead of being forced into doing it all at once. Clutter is still avoided in the default case because these options take their default from system.stateVersion.
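
As an illustration of migrating one module at a time, a user configuration might look roughly like the sketch below. It assumes the example module above, and that the user has already migrated the whatever service's state by hand:

```nix
{ ... }:
{
  # Left untouched, as today.
  system.stateVersion = "23.11";

  services.whatever.enable = true;
  # With system.stateVersion = "23.11" the example module would default
  # this to 1; setting it explicitly opts this one service into the
  # newest state format without affecting any other module.
  services.whatever.stateVersion = 3;
}
```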

Doing this for all modules solves problems 1 and 2 as well, via inspecting the attributes of system.moduleStateVersions. Since modules only add attributes to this option when they are enabled, printing this config value will show only the modules that are relevant to the user. Comparing this config value before and after a bump of system.stateVersion will show which modules need migration, and then users can incrementally migrate their module state by fixing the services.whatever.stateVersion options to their old values until they get around to migrating that specific service.
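
The before/after comparison could be done with a small Nix expression along these lines. The two attribute sets are made-up stand-ins for the value of system.moduleStateVersions evaluated before and after a bump, and services.other is a hypothetical second module:

```nix
let
  # Hypothetical values of config.system.moduleStateVersions.
  before = { "services.whatever" = 1; "services.other" = 2; };
  after = { "services.whatever" = 3; "services.other" = 2; };
in
# Names of modules whose state version would change with the bump.
builtins.filter
  (name: before.${name} != (after.${name} or null))
  (builtins.attrNames before)
# evaluates to [ "services.whatever" ]
```

Each name in the result is a module whose stateVersion option would need pinning to its old value until that service's state has been migrated.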

None of this precludes doing anything more clever in the future, either with system.stateVersion or with automating migrations in module code. This isn’t meant to be the perfect solution for persistent state; I think several folks have been looking for that for a while, and if it exists, it’s not likely to be implemented soon. It’s just less imperfect, and it’s a small enough proposal to be tried on a few modules to find any rough edges before going on a treewide crusade.

If this meets with general approval, then based on the conversation in this thread, I’d suggest starting with NixOS/nixpkgs Pull Request #500782 (“jellyseerr: rename jellyseerr to seerr”, by nicdumz), which very recently added a dependency on stateVersion to a module that previously didn’t use it. We can try this pattern out there and see how it works, and give the OP of that Discourse thread the opportunity to ‘preemptively’ migrate their config by setting services.seerr.stateVersion = 1;.

Thanks for starting this discussion back up!

I like this proposal. I believe the majority of desktop systems wouldn’t even need to define a stateVersion once all modules have been migrated.

So besides removing yet another footgun, for new users this would remove yet another piece of opaque magic, all without having to make any major changes to existing modules.


To keep context and threads linked, I suggested the non-version idea in this thread a little while ago.

This resulted, among other suggestions, in a proposal much like this one; it didn’t see much approval because this block of state versions for each module might grow rather big if it’s used by a lot of modules.

That said, in that thread we also concluded that a good number of uses of system.stateVersion are probably anti-patterns anyway: most of the time it’s used to pick “default” versions for software that cannot be upgraded by more than one version at a time. Since then, modules that do this have switched to the much more reasonable approach of making the now-ubiquitous package option mandatory.

So, to appease those fears, curb proliferation, and make future changes a bit less unachievable, could we add a step of defining specific use cases for which stateVersion is permissible, and perhaps even introduce review guidelines that’ll help enforce this?

Maybe we can also document what the stateVersion of each module is actually used for in its option?

Thanks so much for digging that up. I knew this idea wasn’t novel but I wasn’t finding the prior art.

So, importantly, in this iteration of the proposal (which is not quite represented in your final breakdown in the older thread, but is something of a partial move towards topics 2 and 3), system.stateVersion isn’t going to change. Users will still be expected to define it and, generally, not touch it. I could see a future proposal that pushes all the way to per-module stateVersion that gets rid of it, but I think that starting the migration of individual modules to per-module stateVersion is useful both as a baby step toward that goal and in the intermediate state as a way to make it less of a headache to touch, for those determined to touch it.

I think defining permissible use cases and review guidelines is a good idea, and independent of the narrow, technical thing I’m proposing here. We can try them in parallel?

Personally, I’m neutral on documenting what each module’s stateVersion is used for in its option. It’s probably a smell if even a relatively new Nix user can’t look at the module source and see what stateVersion does, and if that’s true then documentation doesn’t add much; what it costs is effectively maintaining a duplicate ‘implementation’ of that logic which is never machine-checked. I’m generally in favor of allowing the code to serve as documentation for implementation details (roughly, low-level information about what an option does as opposed to explanatory information about when, why, and how to use options) and I think this counts.

This is why users hate NixOS (myself included). Users need to migrate in order to change this option; making us wade in the code to figure out what needs migration is pretty rude IMO. Intentionally underdocumenting is even more rude.

Documenting new options should be a given, and I’m currently still convinced this just adds complexity the way that .package-as-version-proxy already does (having to manually keep track of the current package versions in nixpkgs vs the version in the config).

Why is code something you ‘wade’ in, while documentation is not?

I mean, maybe I’m being obtuse, but to me they’re both text. Documentation is useful if it tells me something the code doesn’t, or gets me to a truth about the code faster. Documentation that repeats the code is redundant.

Is wasting the reader’s time not also rude?

(This is not an ‘our documentation is good enough’ argument, to be clear! Nor is it a ‘documentation is not as important as code’ argument. Good documentation is essential, and we have plenty of room to improve. But good documentation IMO is not the same as maximalist documentation.)

Because the option doc is 3-4 sentences that you read linearly, and the code is probably 500 lines long and requires jumping around - which is pretty obvious, though, so I can’t help but assume bad faith on your part. I’m out.

Generally, documentation is more browsable. Being able to see at a glance in https://search.nixos.org that a module has conditionally changed its state directory location in a new NixOS version is nicer than having to locate its source (which is sometimes spread across multiple files, or even generated or nested in complex ways, see e.g. the prometheus exporters) and C-f for stateVersion (and keeping a local clone is often not very ergonomic, thanks to nixpkgs’ frankly ridiculous size).

That doesn’t mean it has to be documented. If it’s internal and shouldn’t be touched, then it shouldn’t be exposed in documentation. But then again, why are we bothering to expose the option if it’s not supposed to be understood?

I do also appreciate the concern for drift between code and its documentation.

I feel like this is a bit of an edge case. When in doubt, personally I favor more documentation; if nothing else clearly stating why you want to use stateVersion might help guide reasonable design decisions.

But in either case, it’s splitting hairs and tangential to the overall cause; details like this can be hashed out in a PR that changes contribution docs or something.

Most uses of system.stateVersion are bad ideas so I’d be hesitant to institutionalize it further with a new mechanism. We can pretty easily get to a state where it’s frozen forever and we use better alternatives, and none of them would match system.stateVersion.

The canonical use is “default version of a package that has incompatible state over time”. Those modules should just not have default versions. If we think it’s going to be fine but then an incompatible state change happens unexpectedly, then removing the default and adding an assertion/release note explaining the need to set one is a less invasive breaking change than ones that happen every release, and doesn’t leave bumping system.stateVersion as a permanent footgun for that module.
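
A minimal sketch of the "no default version" approach, assuming a hypothetical services.whatever module and a hypothetical pkgs.whatever_2 package attribute (neither is a real nixpkgs name):

```nix
{ config, lib, ... }:
let
  cfg = config.services.whatever;
in
{
  options.services.whatever = {
    enable = lib.mkEnableOption "whatever";
    # Deliberately no `default`: enabling the service without choosing a
    # package makes evaluation fail, instead of silently deriving a
    # version from system.stateVersion.
    package = lib.mkOption {
      type = lib.types.package;
      example = lib.literalExpression "pkgs.whatever_2";
      description = ''
        The whatever package to run. On-disk state is only compatible
        within one major version, so this must be chosen explicitly.
      '';
    };
  };

  config = lib.mkIf cfg.enable {
    environment.systemPackages = [ cfg.package ];
  };
}
```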

A not so great use is when there was a bad default that affects state but that we could easily migrate to the new one. For those, we should just do migration.

The other common case is stuff like “we included a bunch of stuff by default that we shouldn’t have, but removing it now will break setups”. Again this is a case where we realize there’s a better default for an option or it shouldn’t have a default at all, but it could cause issues for existing users if we switched it for everyone; the difference is that in this case it’s more painful to have no default at all, which motivates using system.stateVersion.

What we don’t have much of a mechanism for is gradual roll‐out of deliberate breaking changes, or experimental things that shouldn’t be meaningfully breaking but aren’t necessarily ready to migrate all existing setups to yet. Those are cases where opting in to the new stuff is plausible to do and not necessarily a mistake.

I think Rust has solved this problem pretty well with editions: breaking changes in defaults happen on version boundaries, and tooling lets you automatically migrate from one edition to the next while preserving the previous behaviour to the greatest extent possible. For us, that would look like bumping the edition of a configuration while inserting the option definitions necessary to keep any load‐bearing backwards‐incompatible behaviour the same.

But it’s honestly not that common to have a genuinely breaking defaults change where there’s no plausible migration path to the new one or way to detect setups that would break, and where we still really want a default of some kind, and when it happens it’s usually because an avoidable bad decision was made in the past.

To make it more concrete, in your example:

  1. If the state versions are tied to upstream releases that can only process one data format, I’d make it a mandatory package option.

  2. Otherwise, if it’s reflected in the upstream service configuration somehow, I’d just expose that upstream option and condition on it where necessary.

  3. Otherwise, if it’s something we did, I would question why we have our own complicated state‐handling logic rather than letting the upstream software handle that stuff (and working with them if it’s inadequate).

  4. However, if it’s something where it really does make sense for us to be handling that ourselves, then I’d try hard to make it seamless and do migration automatically where necessary, just like I’d hope for upstream software to.

  5. Failing all of that, I would add an explicit stateVersion like you have there, ideally when creating the module if this kind of breaking change can be anticipated, otherwise as soon as it becomes clear it’s necessary. But I’d make it mandatory, with no default.

When adding an option like that, I’d also add an assertion and/or release note saying that if you are doing a new installation you can set it to the maximum value and if you already had an installation you should set it to the minimum. Then there’s a one‐time migration step for people who already used the module before it turned out that we’d need to handle breaking state changes and can’t do it elegantly, which is fine. If we had editions, we could make it default to minimum on older editions, make it have no default on newer editions, and have a migration tool insert the explicit old default when the module is in use and you migrate to a new edition.
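
Option 5 combined with the assertion could be sketched as below. This is an illustration under assumptions, not an established nixpkgs pattern: the option is nullable with a null default so that the assertion can print a helpful message (a strict "no default" option would fail with a generic error instead), and the 0 to 3 range mirrors the example module from the first post.

```nix
{ config, lib, ... }:
let
  cfg = config.services.whatever;
in
{
  options.services.whatever = {
    enable = lib.mkEnableOption "whatever";
    stateVersion = lib.mkOption {
      # null stands in for "not set"; see the assertion below.
      type = lib.types.nullOr (lib.types.ints.between 0 3);
      default = null;
      description = ''
        Versions the format of persistent state used by the whatever
        service. Must be set explicitly.
      '';
    };
  };

  config = lib.mkIf cfg.enable {
    assertions = [
      {
        assertion = cfg.stateVersion != null;
        message = ''
          services.whatever.stateVersion must be set explicitly.
          New installations can use 3 (the newest format); installations
          that predate this option should use 0 and migrate by hand.
        '';
      }
    ];
  };
}
```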

Are these actually representative? I don’t claim full knowledge of all the places stateVersion is used, but I just spot-checked a few modules, and all three of them matched what I thought the canonical use of stateVersion was, which is: setting some default directory or database option that changed as a package evolved, such that it doesn’t really depend on the package version, but if the user peeks under the hood or is expecting some newer feature of the software to hold, they might be surprised if it’s stuck on the old behavior forever.

I hear you that in an ideal world, such things would be handled by migration scripts. But that RFC was opened and closed three years ago and we’re still introducing system.stateVersion uses to solve the above problem as of a week ago. So why not do it in a less imperfect way? As I said, this proposal doesn’t make it any harder for us to implement migration scripts later; probably the opposite, as it will more explicitly delimit the effects of system.stateVersion.

Let’s get really concrete: Please review #500782 and tell me which of these things you think useNewConfigLocation ought to be. I don’t think it’s 1 or 2. I don’t really understand the boundaries of 3 (everything in Nixpkgs is ‘something we did’ at some level?). This could be 4 but migration scripts for moving directories is not something we generally support yet, AFAIK. That leaves 5?

And 5 sucks. Adding an explicit stateVersion with no default every time you want to use a module for a package that at some point changed its name is a sad amount of bloat to accept. Having the per-module stateVersion default to a value that depends on system.stateVersion is a very intentional part of this proposal, to keep the mysterious clutter to a minimum, and to keep the cost of migrating to a smarter, migration-script-based solution (or whatever!) to a minimum.

If you’re planning on conjuring a PR for general-purpose migration scripts, then I’d say go for it, because sure, this fits nicely into that category, and probably so do a lot of other things, and I’ll get behind that instead. If you aren’t going to do that, though, I think we shouldn’t let perfect be the enemy of better, as long as better doesn’t take us any further away from perfect.

Well, people doing things in worse ways when there are better ways is a general problem; I expect people would add new direct uses of system.stateVersion if your proposal was implemented, too. The potential solutions are the same: review norms, automated linting, better documentation, and so on.

I think that if we put effort into making new functionality and trying to widely disseminate it among contributors and reviewers, that’s migration pain that should be expected to pay itself off by dealing with the more fundamental problems of stateVersion. But I don’t mean to say that we should wait on migration tooling before doing better things; in many circumstances you can avoid anything like this kind of mechanism at all, and even when you do need some edition that ratchets up, tooling is nice to have but not mandatory, since documentation can achieve the same thing with more toil.

Looking at the most recent addition of system.stateVersion, it seems like the kind of thing where we could have done an automatic migration of the data directory at the service level but didn’t. The one before that was making an existing package option have a system.stateVersion‐conditional default, which should instead just have dropped the default. (I don’t mean that the contributors should have known not to; system.stateVersion being a thing makes it inevitable that it’ll keep getting used for this kind of thing.)

(Given that EOL packages inevitably end up vulnerable, often end up broken, and usually end up removed, having a system.stateVersion‐conditional package default is basically just throwing an error when it’s unset but with extra steps.)

Modulo the StateDirectory= sandboxing this would just be ExecStartPre=-mv /var/lib/{jellyseerr,seerr} with a comment that it can be removed in a couple of releases’ time, I think. I’d have to double check exactly what incantation you’d want to do it properly with the /var/lib/private stuff that StateDirectory= uses, but if you’re going to alpha‐rename a state directory I think it’s reasonable to add a few lines to migrate existing configurations.
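
One possible shape for such a migration is sketched below. It swaps the ExecStartPre= line for a separate one-shot unit ordered before the service, so that the move happens before anything belonging to the service itself is set up. The unit name seerr-migrate-state-dir is hypothetical, the main unit is assumed to be called seerr.service, and the /var/lib/private case (DynamicUser=) and any ownership changes are deliberately not handled:

```nix
{ pkgs, ... }:
{
  # Hypothetical one-shot unit; not taken from PR #500782.
  systemd.services.seerr-migrate-state-dir = {
    description = "Move /var/lib/jellyseerr to /var/lib/seerr";
    # Pulled in by, and ordered before, the (assumed) main unit.
    requiredBy = [ "seerr.service" ];
    before = [ "seerr.service" ];
    # Run only if the old directory exists and the new one does not;
    # otherwise the unit is skipped and seerr.service starts as usual.
    unitConfig.ConditionPathExists = [
      "/var/lib/jellyseerr"
      "!/var/lib/seerr"
    ];
    serviceConfig = {
      Type = "oneshot";
      ExecStart = "${pkgs.coreutils}/bin/mv /var/lib/jellyseerr /var/lib/seerr";
    };
  };
}
```

As with the ExecStartPre= variant, this would carry a comment saying it can be dropped after a couple of releases.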

That ExecStartPre idea is pretty smart. If we can find a home in the documentation for collecting examples like that, along with whatever other advice we have for avoiding use of stateVersion, then I’m happy pointing module authors to that instead of building them a cleaner stateVersion interface to use.

I fear I might mess up what should be done for the StateDirectory= case unless I spent more time on it than I ought to right now, so I’ll leave that part to someone with one level higher systemd‐fu than I have 😅

FWIW, there are 45 non‐test Nix files out of ~2.4k in nixos/modules that reference stateVersion, and some of those are references in documentation strings etc. rather than direct uses. Some of those references are pretty load‐bearing, but it comes up a lot less than you might expect; I think moving beyond it entirely isn’t pie‐in‐the‐sky. But we will probably end up wanting some kind of mechanism that lets us evolve defaults without breaking new configurations in a few cases; it just brings a lot more pain than benefit to couple that to state handling.

Why not just get rid of it entirely? Like, what’s it actually do?

https://nixos.org/manual/nixos/stable/options#opt-system.stateVersion

Hope this helps

So the only real function it seems to serve is to give users something to screw up? (Internal to NixOS, it serves a reasonable purpose; it’s the exposure of the knob that I’m commenting on.)

That feels like something that should be buried elsewhere and not left on the table to tempt fate.

If I can help somehow or give specific advice on this service, feel free to contact me. It’s in my own interest to improve this.