Nixos-rebuild switch stuck after upgrade to nixos-24.11

I haven’t investigated much yet, and I don’t see much happening in the output:

nixos-system-zuse-24.11.20241203.b> building '/nix/store/48nzy6dx0wr8sgb3wb5wqhs05598pwhv-nixos-system-zuse-24.11.20241203.b681065.drv'
$ sudo nix-env -p /nix/var/nix/profiles/system --set /nix/store/0m6p25nsccv34b6nwnrf22ycim88rn76-nixos-system-zuse-24.11.20241203.b681065
$ sudo systemd-run -E LOCALE_ARCHIVE -E NIXOS_INSTALL_BOOTLOADER= --collect --no-ask-password --pipe --quiet --service-type=exec --unit=nixos-rebuild-switch-to-configuration --wait true
Using systemd-run to switch configuration.
$ sudo systemd-run -E LOCALE_ARCHIVE -E NIXOS_INSTALL_BOOTLOADER= --collect --no-ask-password --pipe --quiet --service-type=exec --unit=nixos-rebuild-switch-to-configuration --wait /nix/store/0m6p25nsccv34b6nwnrf22ycim88rn76-nixos-system-zuse-24.11.20241203.b681065/bin/switch-to-configuration switch
stopping the following units: accounts-daemon.service, ntpd-rs.service
NOT restarting the following changed units: libvirt-guests.service, libvirtd.service
activating the configuration...
[agenix] creating new generation in /run/agenix.d/2
[agenix] decrypting secrets...
decrypting '/nix/store/fg5i4wvcgmdlj7gg3g7qnhxmfdisrvz5-arr-api-key.age' to '/run/agenix.d/2/arr-api-key'...
decrypting '/nix/store/v8w5ljkavgkzqza79h841yam3b8c8370-cloak-accounts.age' to '/run/agenix.d/2/cloak-accounts'...
decrypting '/nix/store/803402ybg74xnabkxhg2j48znfqkhpdz-direnv-backup-garage.age' to '/run/agenix.d/2/direnv-backup-garage'...
decrypting '/nix/store/zgnkfy55609r5ih7q5lnhcvasnsin1hr-direnv-backup-rsync-net.age' to '/run/agenix.d/2/direnv-backup-rsync'...
decrypting '/nix/store/w2alhmbmqmv1fq9zqb5j4fdvbzk2v7j0-gist-cli.age' to '/run/agenix.d/2/gist-cli'...
decrypting '/nix/store/fl04xzb61frjsi9vqzz1b91mixgakzn4-grafana-password.age' to '/run/agenix.d/2/grafana-password'...
decrypting '/nix/store/6ck22lyy2nm24vifibmrg4bvpaszz4z2-grafana-secret.age' to '/run/agenix.d/2/grafana-secret'...
decrypting '/nix/store/8r5rsp1pbpghc81ksmdchqa6d2ajmjgi-nix-access-tokens-github.age' to '/run/agenix.d/2/nix-access-tokens-github'...
decrypting '/nix/store/al0z370hj25yal95fkqv3plqd079ajmy-openrc-fs.stackxperts.com.age' to '/run/agenix.d/2/openrc-fs.stackxperts.com'...
decrypting '/nix/store/awf55wy2pdrh4vzmxp5znl15xg6g5hld-restic-garage-credentials.age' to '/run/agenix.d/2/restic-garage-credentials'...
decrypting '/nix/store/i1iaq1a17qrazvfv55sdnm9lympi3c7n-restic-password.age' to '/run/agenix.d/2/restic-password'...
decrypting '/nix/store/4an2m9vry9683fp8bmpm0ypba69y08im-stack_baumann-cbxgate_cbxnet_de-password.age' to '/run/agenix.d/2/stack_baumann-cbxgate_cbxnet_de-password'...
decrypting '/nix/store/cws5wvn508qnrc6ghg8hl9qh0r6famad-stack_baumann-cbxgate.cbxnet.de.ovpn.age' to '/run/agenix.d/2/stack_baumann-cbxgate_cbxnet_de_ovpn'...
decrypting '/nix/store/dbqfjzqmkd18ayvsd425n1fbp26zgxnl-tailscale-key.age' to '/run/agenix.d/2/tailscale-key'...
decrypting '/nix/store/awfcl2aazx6y3wj9m1bil1f732x9d8h2-tilli-id_rsa.age' to '/run/agenix.d/2/tilli-id_rsa'...
[agenix] symlinking new secrets to /run/agenix (generation 2)...
[agenix] removing old secrets (generation 1)...
[agenix] chowning...
chown: invalid group: ‘0:media’
chown: invalid user: ‘grafana:0’
chown: invalid user: ‘grafana:0’
Activation script snippet 'agenixChown' failed (1)
remounting /etc...
Moving mount
Mounting beneath top mount
Attaching mount /tmp/nixos-etc.WQIXwoh5gD -> /etc
Moving single attached mount
Failed to run activate script
reloading user units for tilli...
restarting sysinit-reactivation.target
reloading the following units: dbus-broker.service
restarting the following units: polkit.service
starting the following units: accounts-daemon.service, ntpd-rs.service

It gets stuck at the last line. As far as I can tell, the process is waiting for a signal on a socket to systemd.

I don’t have a lot to go by. I was hoping that’s somehow a common issue.

nixos-rebuild boot works

ps axf

  13046 pts/2    S+     0:00          \_ sh -cu NIXPKGS_ALLOW_UNFREE=1 nixos-rebuild switch --flake '.#zuse' --impure --use-remote-sudo --log-format internal-json -v --show-trace  |& nom --json
  13047 pts/2    S+     0:00              \_ /nix/store/p6k7xp1lsfmbdd731mlglrdj2d66mr82-bash-5.2p37/bin/bash /nix/store/w4cpms0dk79d2cbkfai77vlrmy313x55-nixos-rebuild/bin/nixos-rebuild switch --flake .#zuse --impure --use-remote-sudo --log-format internal-json -v --show-trace
 104716 pts/2    S+     0:00              |   \_ sudo systemd-run -E LOCALE_ARCHIVE -E NIXOS_INSTALL_BOOTLOADER= --collect --no-ask-password --pipe --quiet --service-type=exec --unit=nixos-rebuild-switch-to-configuration --wait /nix/store/0m6p25nsccv34b6nwnrf22ycim88rn76-nixos-system-zuse-24.11.20241203.b681065/bin/switch-to-configuration switch
 104718 pts/3    Ss+    0:00              |       \_ sudo systemd-run -E LOCALE_ARCHIVE -E NIXOS_INSTALL_BOOTLOADER= --collect --no-ask-password --pipe --quiet --service-type=exec --unit=nixos-rebuild-switch-to-configuration --wait /nix/store/0m6p25nsccv34b6nwnrf22ycim88rn76-nixos-system-zuse-24.11.20241203.b681065/bin/switch-to-configuration switch
 104719 pts/3    S      0:00              |           \_ systemd-run -E LOCALE_ARCHIVE -E NIXOS_INSTALL_BOOTLOADER= --collect --no-ask-password --pipe --quiet --service-type=exec --unit=nixos-rebuild-switch-to-configuration --wait /nix/store/0m6p25nsccv34b6nwnrf22ycim88rn76-nixos-system-zuse-24.11.20241203.b681065/bin/switch-to-configuration switch
  13048 pts/2    Sl+    3:32              \_ nom --json

If you’re jumping an entire stable version you probably should be using boot instead of switch.

Ah I should have perhaps mentioned that. The actual switch was fine. I just can’t activate anything since.

Did you reboot after the aforementioned nixos-rebuild switch?

Of course. :smiley:
I’ve had the damn problem for a while now.

In that case, what isn’t activating once you boot?

Everything is fine when I do nixos-rebuild boot and reboot.

The problem is that switch gets stuck, so I can no longer activate a configuration in place. I guess I need to dig deeper. It’s hard to pin down: the activate command just waits on a socket to systemd, and the systemd status looks fine.

Found it.

systemctl list-jobs
JOB   UNIT                           TYPE  STATE
10777 tailscaled-autoconnect.service start running
10543 multi-user.target              start waiting
9630  graphical.target               start waiting

systemctl cancel 8797 # -> and it continues
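
For anyone hitting the same hang, here is a small sketch of how the job to cancel could be picked out of that listing. The function name `running_jobs` is made up for illustration, and it assumes the column layout shown above (job ID first, state fourth).

```shell
# Print the IDs of jobs that `systemctl list-jobs` reports as "running".
# Reads the listing on stdin, so it also works on saved output.
# Assumes the columns are: JOB UNIT TYPE STATE (as in the listing above).
running_jobs() {
  awk '$1 ~ /^[0-9]+$/ && $4 == "running" { print $1 }'
}
```

Something like `systemctl list-jobs | running_jobs` should then print the ID to pass to `systemctl cancel`. Jobs in the `waiting` state are usually just queued behind the running one, so they are skipped here.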

Not sure what’s broken there yet. But I guess I will find out.

// sudo /nix/store/kyb8926xa9hcqv7gdml21ij3mqi6w48h-tailscale-1.78.1/bin/tailscale status --json --peers=false
{
  "Version": "1.78.1",
  "TUN": true,
  "BackendState": "NoState",
  "HaveNodeKey": true,
  "AuthURL": "",
  "TailscaleIPs": null,
  "Self": {
    "ID": "",
    "PublicKey": "nodekey:0000000000000000000000000000000000000000000000000000000000000000",
    "HostName": "zuse",
    "DNSName": "",
    "OS": "linux",
    "UserID": 0,
    "TailscaleIPs": null,
    "Addrs": [],
    "CurAddr": "",
    "Relay": "",
    "RxBytes": 0,
    "TxBytes": 0,
    "Created": "0001-01-01T00:00:00Z",
    "LastWrite": "0001-01-01T00:00:00Z",
    "LastSeen": "0001-01-01T00:00:00Z",
    "LastHandshake": "0001-01-01T00:00:00Z",
    "Online": false,
    "ExitNode": false,
    "ExitNodeOption": false,
    "Active": false,
    "PeerAPIURL": null,
    "InNetworkMap": false,
    "InMagicSock": false,
    "InEngine": false
  },
  "Health": [
    "Tailscale is starting. Please wait.",
    "You are logged out."
  ],
  "MagicDNSSuffix": "",
  "CurrentTailnet": null,
  "CertDomains": null,
  "Peer": null,
  "User": null,
  "ClientVersion": null
}

Not sure how it gets into that state. But kicking it in the balls unblocks everything.

Now I wonder if the autoconnect task should wait indefinitely…
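
To illustrate what a bounded wait could look like, here is a rough sketch. It is not the script the NixOS module actually runs; the function names `backend_state` and `wait_for_running` are hypothetical, and it assumes `Running` is the state that means "connected" and that the JSON is pretty-printed with one field per line, as in the output above.

```shell
# Extract the BackendState value from `tailscale status --json` output on stdin.
# Assumes pretty-printed JSON with one field per line.
backend_state() {
  sed -n 's/^[[:space:]]*"BackendState":[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll a status command until it reports "Running" or the attempts run out.
# $1: maximum number of attempts
# $2: seconds to sleep between attempts
# remaining arguments: the command that prints the status JSON
wait_for_running() {
  max=$1; delay=$2; shift 2
  i=0
  state=""
  while [ "$i" -lt "$max" ]; do
    state=$("$@" | backend_state)
    if [ "$state" = "Running" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "gave up waiting, last state: ${state:-unknown}" >&2
  return 1
}
```

Usage would be along the lines of `wait_for_running 60 1 tailscale status --json --peers=false`, which gives up after roughly a minute instead of blocking the switch forever. At the systemd level, `TimeoutStartSec=` on the unit is the more usual way to cap how long a start job may run.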

tailscale status
[ snip ]

# Health check:
#     - Tailscale can't reach the configured DNS servers. Internet connectivity may be affected.

I set “Override local DNS” in Tailscale.