[SOLVED] 23.11 broke systemd services running as lingering users

I’ve been using NixOS for about a year now, and for most of this time, I have been using rootless podman running as a podman user to persistently run some services for me. Unfortunately, with the upgrade to 23.11, I have found that nixos-rebuild switch will clobber all running containers, whereas 23.05 left them alone. I’ve been banging my head against this ever since I updated the other day, and I’m at a bit of a loss as to how to get this resolved…

Here is a snippet of some of the relevant configuration:

config = {
  virtualisation.podman.enable = true;
  users = {
    groups.podman = { gid = 31000; };
    users.podman = {
      uid = 31000;
      linger = true;
      group = "podman";
      home = "/home/podman";
      createHome = true;
      subUidRanges = [ { count = 65536; startUid = 615536; } ];
      subGidRanges = [ { count = 65536; startGid = 615536; } ];
    };
  };
  systemd.services = {
    "podman-compose@" = {
      enable = true;
      after = [ "podman.service" ];
      path = [ "/run/wrappers" ];
      serviceConfig = {
        ExecStart = [
          "${pkgs.podman-compose}/bin/podman-compose --podman-path ${pkgs.podman}/bin/podman --project-name %i up --detach --remove-orphans --build --force-recreate"
        ];
        ExecStop = [
          "${pkgs.podman-compose}/bin/podman-compose --podman-path ${pkgs.podman}/bin/podman --project-name %i down"
        ];
        RemainAfterExit = true;
        Type = "oneshot";
        User = "podman";
        Group = "podman";
        WorkingDirectory = "/etc/containers/compose/%i";
      };
    };
  };
};

Each of my compose files is set up with the following:

config = {
  environment.etc."containers/compose/heimdall/compose.yml".source = ./heimdall/compose.yml;
  systemd.services."podman-compose@heimdall" = {
    overrideStrategy = "asDropin";
    path = [ "/run/wrappers" ]; # https://github.com/NixOS/nixpkgs/issues/219013
    wantedBy = [ "machines.target" ];
  };
};

In my actual config, I have them more templated, but for this message, I’ve simplified it a bit so that it’s easier to glance at. This gives me the ability to have the containers all managed by an unprivileged user, allows me to have them run at boot, and allows me to manage them with something like systemctl status podman-compose@heimdall.service.
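For context, a rough sketch of how that templating might look. This is not my real config: the project names are made up, it assumes `lib` is in scope, and it assumes the compose files live next to the Nix file.

```nix
let
  # Hypothetical list of compose projects; each gets a compose.yml in /etc
  # and an asDropin instance of the podman-compose@ template.
  projects = [ "heimdall" "nextcloud" ];
in
{
  environment.etc = lib.listToAttrs (map (name: {
    name = "containers/compose/${name}/compose.yml";
    value.source = ./. + "/${name}/compose.yml";
  }) projects);

  systemd.services = lib.listToAttrs (map (name: {
    name = "podman-compose@${name}";
    value = {
      overrideStrategy = "asDropin";
      path = [ "/run/wrappers" ]; # https://github.com/NixOS/nixpkgs/issues/219013
      wantedBy = [ "machines.target" ];
    };
  }) projects);
}
```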

The error message that gets spit out varies sometimes, but this is one of the ones I have been seeing most persistently:

reloading user units for podman...
Failed to start nixos-activation.service: Transaction for nixos-activation.service/start is destructive (systemd-exit.service has 'start' job queued, but 'stop' is included in transaction).
See user logs and 'systemctl --user status nixos-activation.service' for details.
setting up tmpfiles

When this happens, all of the containers die and complain about losing their socket. I can go through and restart them all with something like systemctl restart podman-compose@heimdall.service, but it’s really annoying, since in 23.05 they worked without a problem: running nixos-rebuild switch there would leave them running, without so much as a restart.

I would be exceptionally grateful if anyone could help me figure out why nixos-activation.service demands that it kill my lingering session as of 23.11.

EDIT: It isn’t nixos-activation.service. I pulled out the switch-to-configuration Perl script and have been trying to narrow down when exactly it murders podman; it happens well before nixos-activation.service. I’m also realizing that I never listed the various things I tried. There isn’t much point in doing so right now, because everything I’ve tried besides this has been a red herring.

EDIT 2: It breaks on line 820 of switch-to-configuration:


In between then and line 940, any commands attempted by podman yell about not having crun (Error: default OCI runtime "crun" not found: invalid argument).

/nix/store/i0sdqs34r68if9s4sfmpixnnj36npiwj-systemd-254.6/bin/systemctl start -- basic.target cryptsetup.target getty.target local-fs.target machines.target multi-user.target network-interfaces.target network-online.target paths.target remote-fs.target slices.target sockets.target sound.target swap.target sysinit.target timers.target

EDIT 3: I got it!
23.11 added a new users.users.&lt;name&gt;.linger option for users that should linger. The activation script has a line which does the following:

ls /var/lib/systemd/linger | sort | comm -3 -1 /nix/store/pplsfrc0hqkqdfi7mj43z125ya0kdiy2-lingering-users -

That comm part strips out the users which have users.users.&lt;name&gt;.linger = true;, so they aren’t reset. When I throw that setting on my podman user, everything works exactly as it used to.
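To see what that comm invocation is doing, here is a small stand-alone demonstration (the file names and user names are made up; the real inputs are the lingering-users store file and the sorted listing of /var/lib/systemd/linger):

```shell
# comm compares two sorted inputs: column 1 is lines unique to the first,
# column 2 lines unique to the second, column 3 lines common to both.
# The activation script suppresses columns 1 and 3 (-1 -3), leaving users
# that are lingering on disk but NOT declared via users.users.<name>.linger —
# exactly the users whose linger state then gets reset.
printf 'alice\npodman\n' > declared   # stand-in for the lingering-users store file
printf 'bob\npodman\n'   > current    # stand-in for `ls /var/lib/systemd/linger | sort`
comm -3 -1 declared current           # prints: bob
```

So a podman user with the linger option set appears in the "declared" list and drops out of the reset set.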

EDIT 4: I updated the scripts at the top to what they should be for 23.11. If you want to see the original version for some reason, click the pencil in the top left of this post.


This isn’t mentioned as a breaking change in the release notes, wonder if that could be added retroactively?


I was wondering about that too, so much so that I went back and reread the breaking changes after I figured out what the issue was. I reasoned that because it was something hacky I did, which an option was later added for, and not an existing option whose behavior changed, it wasn’t a “breaking change.”


Also, after rebooting, I noticed that my lingering users were no longer working. I had fully removed my code that actually made the user linger at the system level. I added a tmpfiles rule for “f /var/lib/systemd/linger/podman”. Seems prudent to add that file if linger is specified. I will look into making a PR for it.
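For reference, a tmpfiles rule along those lines might look like this in a NixOS config. The mode and ownership values here are my guesses; an empty flag file at that path is all systemd-logind checks for.

```nix
systemd.tmpfiles.rules = [
  # Create the linger flag file for the podman user if it doesn't exist.
  # Format: type path mode user group age — "f" creates an empty file.
  "f /var/lib/systemd/linger/podman 0644 root root -"
];
```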

EDIT: Though, now that I say that out loud, I’m wondering about the purpose of the linger option. I’m going to dig into why that was added first.

I just upgraded to 23.11 and it broke things with podman…

Many thanks for pointing out the new users.users.&lt;name&gt;.linger = true option; it solved the error for me (I had been using activationScripts on 23.05).