Colmena deployment corrupts the bootloader

This is a duplicate of the issue I posted on the Colmena GitHub, but I thought someone here might have an idea why this is happening.

I have a server on Linode with the entire NixOS installation on a single partition, using GRUB as the bootloader. The Nix version is 2.24.11. This is the hardware configuration:

# Do not modify this file!  It was generated by ‘nixos-generate-config’
# and may be overwritten by future invocations.  Please make changes
# to /etc/nixos/configuration.nix instead.
{ config, lib, pkgs, modulesPath, ... }:

{
  imports =
    [ (modulesPath + "/profiles/qemu-guest.nix")
    ];

  boot.initrd.availableKernelModules = [ "virtio_pci" "virtio_scsi" "ahci" "sd_mod" ];
  boot.initrd.kernelModules = [ ];
  boot.kernelModules = [ ];
  boot.extraModulePackages = [ ];
  boot = {
    kernelParams = [ "console=ttyS0,19200n8" ];
    loader = {
      grub = {
        forceInstall = true;
        extraConfig = ''
          serial --speed=19200 --unit=0 --word=8 --parity=0 --stop=1;
          terminal_input serial;
          terminal_output serial
        '';
        device = "nodev";
      };
      timeout = 10;
    };
  };

  fileSystems."/" =
    { device = "/dev/sda";
      fsType = "ext4";
    };

  swapDevices =
    [ { device = "/dev/sdb";}
    ];

  # Enables DHCP on each ethernet and wireless interface. In case of scripted networking
  # (the default) this is the recommended approach. When using systemd-networkd it's
  # still possible to use this option, but it's recommended to use it in conjunction
  # with explicit per-interface declarations with `networking.interfaces.<interface>.useDHCP`.
  networking.useDHCP = lib.mkDefault true;
  # networking.interfaces.enp0s5.useDHCP = lib.mkDefault true;

  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}

I started deploying with Colmena on NixOS 24.05, and this week I updated the channel to 24.11. Now, after deploying with apply switch, the deployment hangs either at activating the system profile or at starting the systemd services I configured. After a long wait (2 hours) I cancel the deployment with Ctrl-C. When I reboot the server, the bootloader is corrupted. It seems that the kernel has been updated from 6.6.63 to 6.6.72 over the last few configurations, but that did not cause an issue before…

I am shown this message:

kbd_mode: KDSKBMODE: Inappropriate ioctl for device
starting device mapper and LVM...
File descriptor 8 (/dev/console) leaked on lvm invocation. Parent PID 1: /nix/sh
File descriptor 9 (/dev/console) leaked on lvm invocation. Parent PID 1: /nix/sh
checking /dev/sda...
fsck (busybox 1.36.1)
[fsck.ext4 (1) -- /mnt-root/] fsck.ext4 -a /dev/sda
fsck.ext4: Bad magic number in super-block while trying to open /dev/sda
/dev/sda:
The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
 or 
    e2fsck -b 32768 <device>

/dev/sda contains a swap file system labelled 'linode-swap'
fsck on /dev/sda failed.

An error occurred in stage 1 of the boot process, which must mount the
root filesystem on `/mnt-root' and then start stage 2.  Press one
of the following keys:

  r) to reboot immediately
  *) to ignore the error and continue

After reinstalling the bootloader with the following commands:

for i in dev proc sys; do mount --rbind /$i /mnt/$i; done
chroot /mnt /nix/var/nix/profiles/system/bin/switch-to-configuration boot --install-bootloader

and then rebooting, everything seems to work fine.
Thinking this might be Colmena related, I will try deploying with deploy-rs to see if that makes a difference.

Is there a way that I can set the server up again so that deployments function like they used to?

Setting device = "/dev/sda" is invalid: /dev/sda refers to the drive, not the partition. And in any case, you never want to use this format; use by-label or by-uuid instead.

https://wiki.archlinux.org/title/Persistent_block_device_naming

Naturally, the error comes from trying to fsck something that is not a partition with a filesystem on it, so it has nothing to do with Colmena.
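
As a rough sketch of the by-label form, the configuration could look something like the following. The swap label linode-swap is taken from the boot log above; the root label nixos-root is purely hypothetical and would need to match whatever label the root filesystem actually carries (or one you assign yourself).

```nix
{
  # Root filesystem referenced by label instead of a kernel device name.
  # "nixos-root" is a hypothetical label, not taken from this thread.
  fileSystems."/" = {
    device = "/dev/disk/by-label/nixos-root";
    fsType = "ext4";
  };

  # The boot log reports the swap area as labelled 'linode-swap'.
  swapDevices = [
    { device = "/dev/disk/by-label/linode-swap"; }
  ];
}
```

Unlike /dev/sda and /dev/sdb, these paths do not depend on the order in which the kernel enumerates the disks.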

It may be worth noting that currently (or the last time I checked), the architecture of Colmena doesn't easily support setting NIXOS_INSTALL_BOOTLOADER=1, which is the environment variable used in /nix/var/nix/profiles/system/bin/switch-to-configuration boot to indicate reinstalling the bootloader.

Having to reinstall the bootloader is odd, and may well be a product of issues like the one waffle8946 mentioned.

Thanks for pointing it out. On my laptop I do not have it set like this; there it points to a partition. But this is how the Linode guide for installing NixOS on their VMs sets it up, and it has worked for several months without any issue on NixOS 24.05.

Here is the content of the switch-to-configuration in the current system if that helps:

#! /nix/store/gwgqdl0242ymlikq9s9s62gkp5cvyal3-bash-5.2p37/bin/bash -e
export OUT='/nix/store/amdpmm3l4p841ks6hvgj480fajcvpwgg-nixos-system-nixos-24.11.713515.47addd76727f'
export TOPLEVEL='/nix/store/amdpmm3l4p841ks6hvgj480fajcvpwgg-nixos-system-nixos-24.11.713515.47addd76727f'
export DISTRO_ID='nixos'
export INSTALL_BOOTLOADER='/nix/store/nzfg7rgd4ls5wl10933n562nppqa89dl-install-grub.sh'
export PRE_SWITCH_CHECK='/nix/store/3gzlyv55pqrw2krjhls8ny2s07afcqdn-pre-switch-checks'
export LOCALE_ARCHIVE='/nix/store/yr4m7nrg1p1qbl7fr2n5h04qh3pnzbzh-glibc-locales-2.40-36/lib/locale/locale-archive'
export SYSTEMD='/nix/store/bl5dgjbbr9y4wpdw6k959mkq4ig0jwyg-systemd-256.10'
exec -a "$0" "/nix/store/amdpmm3l4p841ks6hvgj480fajcvpwgg-nixos-system-nixos-24.11.713515.47addd76727f/bin/.switch-to-configuration-wrapped"  "$@" 

and then here is the grub_config.xml:

<?xml version='1.0' encoding='utf-8'?>
<expr>
  <attrs>
    <attr name="backgroundColor">
      <string value="#2F302F" />
    </attr>
    <attr name="bootPath">
      <string value="/boot" />
    </attr>
    <attr name="bootloaderId">
      <string value="NixOS-boot" />
    </attr>
    <attr name="canTouchEfiVariables">
      <bool value="false" />
    </attr>
    <attr name="configurationLimit">
      <int value="100" />
    </attr>
    <attr name="copyKernels">
      <bool value="false" />
    </attr>
    <attr name="default">
      <string value="0" />
    </attr>
    <attr name="devices">
      <list>
        <string value="nodev" />
      </list>
    </attr>
    <attr name="efiInstallAsRemovable">
      <bool value="false" />
    </attr>
    <attr name="efiSupport">
      <bool value="false" />
    </attr>
    <attr name="efiSysMountPoint">
      <string value="/boot" />
    </attr>
    <attr name="entryOptions">
      <string value="--class nixos --unrestricted" />
    </attr>
    <attr name="extraConfig">
      <string value="serial --speed=19200 --unit=0 --word=8 --parity=0 --stop=1;&#xA;terminal_input serial;&#xA;terminal_output serial&#xA;" />
    </attr>
    <attr name="extraEntries">
      <string value="" />
    </attr>
    <attr name="extraEntriesBeforeNixOS">
      <bool value="false" />
    </attr>
    <attr name="extraGrubInstallArgs">
      <list>
      </list>
    </attr>
    <attr name="extraPerEntryConfig">
      <string value="" />
    </attr>
    <attr name="extraPrepareConfig">
      <string value="" />
    </attr>
    <attr name="font">
      <string value="/nix/store/fhkj7sr6azdv0ga9hfjq3hk3vr2x5v76-grub-2.12/share/grub/unicode.pf2" />
    </attr>
    <attr name="forceInstall">
      <bool value="true" />
    </attr>
    <attr name="fsIdentifier">
      <string value="uuid" />
    </attr>
    <attr name="fullName">
      <string value="grub" />
    </attr>
    <attr name="fullVersion">
      <string value="2.12" />
    </attr>
    <attr name="gfxmodeBios">
      <string value="1024x768" />
    </attr>
    <attr name="gfxmodeEfi">
      <string value="auto" />
    </attr>
    <attr name="gfxpayloadBios">
      <string value="text" />
    </attr>
    <attr name="gfxpayloadEfi">
      <string value="keep" />
    </attr>
    <attr name="grub">
      <string value="" />
    </attr>
    <attr name="grubEfi">
      <string value="" />
    </attr>
    <attr name="grubTarget">
      <string value="" />
    </attr>
    <attr name="grubTargetEfi">
      <string value="" />
    </attr>
    <attr name="path">
      <string value="/nix/store/6wgd8c9vq93mqxzc7jhkl86mv6qbc360-coreutils-9.5/bin:/nix/store/yq39xdwm4z0fhx7dsm8mlpgvcz3vbfg3-gnused-4.9/bin:/nix/store/vniy1y5n8g28c55y7788npwc4h09fh7c-gnugrep-3.11/bin:/nix/store/r99d2m4swgmrv9jvm4l9di40hvanq1aq-findutils-4.10.0/bin:/nix/store/3sln66ij8pg114apkd8p6nr04y37q5z2-diffutils-3.10/bin:/nix/store/ymdz757zq8vs2lq0kvfxj8ny17w551ka-btrfs-progs-6.11/bin:/nix/store/98nx72xz9kigvqw0magiq1vixi68ik9p-util-linux-2.39.4-bin/bin:/nix/store/1nsp9dla4r3wfk13f6ws6plrx222zgpy-mdadm-4.3/bin" />
    </attr>
    <attr name="shell">
      <string value="/nix/store/gwgqdl0242ymlikq9s9s62gkp5cvyal3-bash-5.2p37/bin/bash" />
    </attr>
    <attr name="splashImage">
      <string value="/nix/store/zw7x0wgvwzph58ib5bf1f6gv916g65zz-simple-dark-gray-bootloader-2018-08-28/share/backgrounds/nixos/nix-wallpaper-simple-dark-gray_bootloader.png" />
    </attr>
    <attr name="splashMode">
      <string value="normal" />
    </attr>
    <attr name="storePath">
      <string value="/nix/store" />
    </attr>
    <attr name="subEntryOptions">
      <string value="--class nixos" />
    </attr>
    <attr name="theme">
      <string value="" />
    </attr>
    <attr name="timeout">
      <int value="10" />
    </attr>
    <attr name="timeoutStyle">
      <string value="menu" />
    </attr>
    <attr name="useOSProber">
      <bool value="false" />
    </attr>
    <attr name="users">
      <attrs>
      </attrs>
    </attr>
  </attrs>
</expr>

Looking at /boot/grub/grub.cfg, it also seems to detect the correct drives by UUID:

menuentry "NixOS" --class nixos --unrestricted {
search --set=drive1 --fs-uuid 3ece7869-f315-4d4b-bb5e-41731976d08e
search --set=drive2 --fs-uuid 3ece7869-f315-4d4b-bb5e-41731976d08e
  linux ($drive2)/nix/store/x90qzpr5ni06i9pcly118qbf7fzm2vff-linux-6.6.72/bzImage init=/nix/store/amdpmm3l4p841ks6hvgj480fajcvpwgg-nixos-system-nixos-24.11.713515.47addd76727f/init console=ttyS0,19200n8 loglevel=4 net.ifnames=0
  initrd ($drive2)/nix/store/7clpda6v4j618rbzrmc0cq7mxv2lkdg9-initrd-linux-6.6.72/initrd
}

[root@nixos:/dev/disk/by-uuid]# lsblk -o UUID,FSTYPE,MOUNTPOINT
UUID                                 FSTYPE MOUNTP
3ece7869-f315-4d4b-bb5e-41731976d08e ext4   /
f1408ea6-59a0-11ed-bc9d-525400000001 swap   [SWAP]

A deployment now hangs at the following output:

inspection-backup | Activating system profile
inspection-backup | updating GRUB 2 menu...
inspection-backup | stopping the following units: inspection_prod.service
inspection-backup | activating the configuration...
inspection-backup | setting up /etc...
inspection-backup | reloading user units for root...
inspection-backup | restarting sysinit-reactivation.target
inspection-backup | starting the following units: inspection_prod.service

You were right, thank you.

I changed the identifier to the UUID and it worked. I had never rebooted the server after installing it, so when a deployment failed for a different reason and I rebooted, I wrongly guessed that the failed deployment had caused issues with the bootloader.
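
For anyone landing here later, a minimal sketch of what that change might look like, using the UUIDs from the lsblk output above. The exact configuration used is not shown in this thread, so treat the layout (in particular also switching the swap device) as an assumption.

```nix
{
  # Root filesystem: UUID of the ext4 filesystem mounted at / in the lsblk output.
  fileSystems."/" = {
    device = "/dev/disk/by-uuid/3ece7869-f315-4d4b-bb5e-41731976d08e";
    fsType = "ext4";
  };

  # Swap: UUID of the swap area from the same lsblk output.
  swapDevices = [
    { device = "/dev/disk/by-uuid/f1408ea6-59a0-11ed-bc9d-525400000001"; }
  ];
}
```

With this in place, stage 1 should run fsck against the ext4 filesystem regardless of which disk ends up as /dev/sda, which is consistent with the "contains a swap file system labelled 'linode-swap'" message seen when the device names were swapped at boot.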