Colmena deployment corrupts the bootloader

Zurga · January 25, 2025, 11:11am

This is a duplicate of the issue I posted on the Colmena GitHub, but I thought maybe someone here has an idea why this is happening.

I have a server on Linode with the entire NixOS installation on a single partition, using GRUB as the boot loader. The Nix version is 2.24.11. This is the hardware-configuration:

# Do not modify this file!  It was generated by ‘nixos-generate-config’
# and may be overwritten by future invocations.  Please make changes
# to /etc/nixos/configuration.nix instead.
{ config, lib, pkgs, modulesPath, ... }:

{
  imports =
    [ (modulesPath + "/profiles/qemu-guest.nix")
    ];

  boot.initrd.availableKernelModules = [ "virtio_pci" "virtio_scsi" "ahci" "sd_mod" ];
  boot.initrd.kernelModules = [ ];
  boot.kernelModules = [ ];
  boot.extraModulePackages = [ ];
  boot = {
    kernelParams = [ "console=ttyS0,19200n8" ];
    loader = {
      grub = {
        forceInstall = true;
        extraConfig = ''
          serial --speed=19200 --unit=0 --word=8 --parity=0 --stop=1;
          terminal_input serial;
          terminal_output serial
        '';
        device = "nodev";
      };
      timeout = 10;
    };
  };

  fileSystems."/" =
    { device = "/dev/sda";
      fsType = "ext4";
    };

  swapDevices =
    [ { device = "/dev/sdb";}
    ];

  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
  # (the default) this is the recommended approach. When using systemd-networkd it's
  # still possible to use this option, but it's recommended to use it in conjunction
  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
  networking.useDHCP = lib.mkDefault true;
  # networking.interfaces.enp0s5.useDHCP = lib.mkDefault true;

  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}

I started deploying with Colmena on NixOS 24.05, and this week I updated the channel to 24.11. Now, after deploying with apply switch, the deployment hangs either at activating the system profile or at starting the systemd services I configured. After a long wait (2 hours) I Ctrl-C the deployment. When I reboot the server, the bootloader is corrupted. It seems that the kernel has been updated from 6.6.63 to 6.6.72 over a few configurations, but that did not cause an issue before…

I am shown this message:

kbd_mode: KDSKBMODE: Inappropriate ioctl for device
starting device mapper and LVM...
File descriptor 8 (/dev/console) leaked on lvm invocation. Parent PID 1: /nix/sh
File descriptor 9 (/dev/console) leaked on lvm invocation. Parent PID 1: /nix/sh
checking /dev/sda...
fsck (busybox 1.36.1)
[fsck.ext4 (1) -- /mnt-root/] fsck.ext4 -a /dev/sda
fsck.ext4: Bad magic number in super-block while trying to open /dev/sda
/dev/sda:
The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
 or 
    e2fsck -b 32768 <device>

/dev/sda contains a swap file system labelled 'linode-swap'
fsck on /dev/sda failed.

An error occurred in stage 1 of the boot process, which must mount the
root filesystem on `/mnt-root' and then start stage 2.  Press one
of the following keys:

  r) to reboot immediately
  *) to ignore the error and continue

After reinstalling the bootloader with the following commands:

for i in dev proc sys; do mount --rbind /$i /mnt/$i; done
chroot /mnt /nix/var/nix/profiles/system/bin/switch-to-configuration boot --install-bootloader

and then rebooting, everything seems to work fine.
Thinking this might be Colmena related, I will try deploying with deploy-rs to see if that makes a difference.

Is there a way that I can set the server up again so that deployments function like they used to?

waffle8946 · January 25, 2025, 1:15pm

Using device = "/dev/sda" is invalid: /dev/sda refers to the drive, not the partition. And in any case, you never want to use this format; use by-label or by-uuid instead.

https://wiki.archlinux.org/title/Persistent_block_device_naming

The error comes from trying to fsck something that is not a partition with a filesystem on it, naturally. So that has nothing to do with colmena.
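
For illustration, a by-label variant of the two entries from the hardware-configuration might look like the sketch below. The swap label linode-swap is taken from the boot log above; the root label nixos is purely hypothetical and would need to match whatever label the root filesystem actually carries (if it has one at all).

```nix
{
  # Root filesystem, referenced by filesystem label instead of /dev/sda.
  # "nixos" is a hypothetical label, not something confirmed in this thread.
  fileSystems."/" = {
    device = "/dev/disk/by-label/nixos";
    fsType = "ext4";
  };

  # Swap, referenced by the label reported in the stage 1 boot log.
  swapDevices = [
    { device = "/dev/disk/by-label/linode-swap"; }
  ];
}
```

Either form avoids depending on which disk the kernel happens to name sda on a given boot.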

cafkafk · January 25, 2025, 1:19pm

It may be worth noting that currently (or the last time I checked), the architecture of Colmena doesn’t easily support setting NIXOS_INSTALL_BOOTLOADER=1, which is the environment variable used by /nix/var/nix/profiles/system/bin/switch-to-configuration boot to indicate that the bootloader should be reinstalled.

Having to reinstall the bootloader is odd, and is likely a product of issues like the one waffle8946 mentioned.

Zurga · January 25, 2025, 2:49pm

Thanks for pointing it out. On my laptop I do not have it set like this; there it points to a partition. But this is what the Linode guide for installing NixOS on their VMs says to do, and it has worked for several months without any issue on NixOS 24.05.

Zurga · January 25, 2025, 2:55pm

Here is the content of the switch-to-configuration in the current system if that helps:

#! /nix/store/gwgqdl0242ymlikq9s9s62gkp5cvyal3-bash-5.2p37/bin/bash -e
export OUT='/nix/store/amdpmm3l4p841ks6hvgj480fajcvpwgg-nixos-system-nixos-24.11.713515.47addd76727f'
export TOPLEVEL='/nix/store/amdpmm3l4p841ks6hvgj480fajcvpwgg-nixos-system-nixos-24.11.713515.47addd76727f'
export DISTRO_ID='nixos'
export INSTALL_BOOTLOADER='/nix/store/nzfg7rgd4ls5wl10933n562nppqa89dl-install-grub.sh'
export PRE_SWITCH_CHECK='/nix/store/3gzlyv55pqrw2krjhls8ny2s07afcqdn-pre-switch-checks'
export LOCALE_ARCHIVE='/nix/store/yr4m7nrg1p1qbl7fr2n5h04qh3pnzbzh-glibc-locales-2.40-36/lib/locale/locale-archive'
export SYSTEMD='/nix/store/bl5dgjbbr9y4wpdw6k959mkq4ig0jwyg-systemd-256.10'
exec -a "$0" "/nix/store/amdpmm3l4p841ks6hvgj480fajcvpwgg-nixos-system-nixos-24.11.713515.47addd76727f/bin/.switch-to-configuration-wrapped"  "$@"

and then here is the grub_config.xml:

<?xml version='1.0' encoding='utf-8'?>
<expr>
  <attrs>
    <attr name="backgroundColor">
      <string value="#2F302F" />
    </attr>
    <attr name="bootPath">
      <string value="/boot" />
    </attr>
    <attr name="bootloaderId">
      <string value="NixOS-boot" />
    </attr>
    <attr name="canTouchEfiVariables">
      <bool value="false" />
    </attr>
    <attr name="configurationLimit">
      <int value="100" />
    </attr>
    <attr name="copyKernels">
      <bool value="false" />
    </attr>
    <attr name="default">
      <string value="0" />
    </attr>
    <attr name="devices">
      <list>
        <string value="nodev" />
      </list>
    </attr>
    <attr name="efiInstallAsRemovable">
      <bool value="false" />
    </attr>
    <attr name="efiSupport">
      <bool value="false" />
    </attr>
    <attr name="efiSysMountPoint">
      <string value="/boot" />
    </attr>
    <attr name="entryOptions">
      <string value="--class nixos --unrestricted" />
    </attr>
    <attr name="extraConfig">
      <string value="serial --speed=19200 --unit=0 --word=8 --parity=0 --stop=1;&#xA;terminal_input serial;&#xA;terminal_output serial&#xA;" />
    </attr>
    <attr name="extraEntries">
      <string value="" />
    </attr>
    <attr name="extraEntriesBeforeNixOS">
      <bool value="false" />
    </attr>
    <attr name="extraGrubInstallArgs">
      <list>
      </list>
    </attr>
    <attr name="extraPerEntryConfig">
      <string value="" />
    </attr>
    <attr name="extraPrepareConfig">
      <string value="" />
    </attr>
    <attr name="font">
      <string value="/nix/store/fhkj7sr6azdv0ga9hfjq3hk3vr2x5v76-grub-2.12/share/grub/unicode.pf2" />
    </attr>
    <attr name="forceInstall">
      <bool value="true" />
    </attr>
    <attr name="fsIdentifier">
      <string value="uuid" />
    </attr>
    <attr name="fullName">
      <string value="grub" />
    </attr>
    <attr name="fullVersion">
      <string value="2.12" />
    </attr>
    <attr name="gfxmodeBios">
      <string value="1024x768" />
    </attr>
    <attr name="gfxmodeEfi">
      <string value="auto" />
    </attr>
    <attr name="gfxpayloadBios">
      <string value="text" />
    </attr>
    <attr name="gfxpayloadEfi">
      <string value="keep" />
    </attr>
    <attr name="grub">
      <string value="" />
    </attr>
    <attr name="grubEfi">
      <string value="" />
    </attr>
    <attr name="grubTarget">
      <string value="" />
    </attr>
    <attr name="grubTargetEfi">
      <string value="" />
    </attr>
    <attr name="path">
      <string value="/nix/store/6wgd8c9vq93mqxzc7jhkl86mv6qbc360-coreutils-9.5/bin:/nix/store/yq39xdwm4z0fhx7dsm8mlpgvcz3vbfg3-gnused-4.9/bin:/nix/store/vniy1y5n8g28c55y7788npwc4h09fh7c-gnugrep-3.11/bin:/nix/store/r99d2m4swgmrv9jvm4l9di40hvanq1aq-findutils-4.10.0/bin:/nix/store/3sln66ij8pg114apkd8p6nr04y37q5z2-diffutils-3.10/bin:/nix/store/ymdz757zq8vs2lq0kvfxj8ny17w551ka-btrfs-progs-6.11/bin:/nix/store/98nx72xz9kigvqw0magiq1vixi68ik9p-util-linux-2.39.4-bin/bin:/nix/store/1nsp9dla4r3wfk13f6ws6plrx222zgpy-mdadm-4.3/bin" />
    </attr>
    <attr name="shell">
      <string value="/nix/store/gwgqdl0242ymlikq9s9s62gkp5cvyal3-bash-5.2p37/bin/bash" />
    </attr>
    <attr name="splashImage">
      <string value="/nix/store/zw7x0wgvwzph58ib5bf1f6gv916g65zz-simple-dark-gray-bootloader-2018-08-28/share/backgrounds/nixos/nix-wallpaper-simple-dark-gray_bootloader.png" />
    </attr>
    <attr name="splashMode">
      <string value="normal" />
    </attr>
    <attr name="storePath">
      <string value="/nix/store" />
    </attr>
    <attr name="subEntryOptions">
      <string value="--class nixos" />
    </attr>
    <attr name="theme">
      <string value="" />
    </attr>
    <attr name="timeout">
      <int value="10" />
    </attr>
    <attr name="timeoutStyle">
      <string value="menu" />
    </attr>
    <attr name="useOSProber">
      <bool value="false" />
    </attr>
    <attr name="users">
      <attrs>
      </attrs>
    </attr>
  </attrs>
</expr>

Zurga · January 25, 2025, 3:11pm

Looking at /boot/grub/grub.cfg it also seems to detect the correct drives by uuid:

menuentry "NixOS" --class nixos --unrestricted {
search --set=drive1 --fs-uuid 3ece7869-f315-4d4b-bb5e-41731976d08e
search --set=drive2 --fs-uuid 3ece7869-f315-4d4b-bb5e-41731976d08e
  linux ($drive2)/nix/store/x90qzpr5ni06i9pcly118qbf7fzm2vff-linux-6.6.72/bzImage init=/nix/store/amdpmm3l4p841ks6hvgj480fajcvpwgg-nixos-system-nixos-24.11.713515.47addd76727f/init console=ttyS0,19200n8 loglevel=4 net.ifnames=0
  initrd ($drive2)/nix/store/7clpda6v4j618rbzrmc0cq7mxv2lkdg9-initrd-linux-6.6.72/initrd
}

[root@nixos:/dev/disk/by-uuid]# lsblk -o UUID,FSTYPE,MOUNTPOINT
UUID                                 FSTYPE MOUNTP
3ece7869-f315-4d4b-bb5e-41731976d08e ext4   /
f1408ea6-59a0-11ed-bc9d-525400000001 swap   [SWAP]

A deployment now hangs at the following output:

inspection-backup | Activating system profile
inspection-backup | updating GRUB 2 menu...
inspection-backup | stopping the following units: inspection_prod.service
inspection-backup | activating the configuration...
inspection-backup | setting up /etc...
inspection-backup | reloading user units for root...
inspection-backup | restarting sysinit-reactivation.target
inspection-backup | starting the following units: inspection_prod.service

Zurga · January 25, 2025, 9:32pm

You were right, thank you

I changed the identifier to the UUID and it worked. I had never rebooted the server after installing it, so when a deploy failed for a different reason and I rebooted, I guessed that the failed deployment had caused the issues with the bootloader.
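
For anyone landing here later, a minimal sketch of what that change could look like, assuming the UUIDs from the lsblk output earlier in the thread (the exact configuration that ended up being deployed was not posted, so treat this as an approximation):

```nix
{
  # Root filesystem by UUID (the ext4 entry mounted at / in the lsblk output).
  fileSystems."/" = {
    device = "/dev/disk/by-uuid/3ece7869-f315-4d4b-bb5e-41731976d08e";
    fsType = "ext4";
  };

  # Swap by UUID (the swap entry in the same lsblk output).
  swapDevices = [
    { device = "/dev/disk/by-uuid/f1408ea6-59a0-11ed-bc9d-525400000001"; }
  ];
}
```

The stage 1 log reported that /dev/sda contained the linode-swap area at boot, which suggests the sda/sdb names were not assigned the way the old configuration assumed; UUID paths do not depend on that ordering.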