NixOS froze while running sudo nixos-rebuild switch --upgrade; Csums don't match and verify failed; superblock cannot be read; system cannot boot

nPrevail · February 9, 2024, 12:01am

For the most part, NixOS has been quite resilient for me… except when you’re in the middle of a nixos-rebuild.

I was in the middle of running sudo nixos-rebuild switch --upgrade when my system froze (I waited an hour with no response from the computer, then did a dirty forced shutdown). When I booted the system afterwards, I came across this error:

<<< NixOS Stage 1 >>>

loading module btrfs...
loading module dm_mod...
running udev...
Starting systemd-udevd version 254.6
Passphrase for /dev/disk/by-uuid/b90c2010-a53c-4f76-9802-95fdde248601 to appear.... - success
Verifying passphrase for /dev/disk/by-uuid/b90c2010-a53c-4f76-9802-95fdde248601... - success
starting device mapper and LVM...
Scanning for Btrfs filesystems
registered: /dev/mapper/luks-b90c2010-a53c-4f76-9802-95fdde248601
mounting /dev/disk/by-uuid/b90c2010-a53c-4f76-9802-95fdde248601 on /...
[    8.956086] BTRFS error (device dm-0): open_ctree failed
mount: mounting /dev/disk/by-uuid/b90c2010-a53c-4f76-9802-95fdde248601 on /mnt-root/ failed: Input/output error

An error occurred in stage 1 of the boot process, which must mount the
root filesystem on `/mnt-root' and then start stage 2.  Press one 
of the following keys:
  r) to reboot immediately
  *) to ignore the error and continue

When I opt for “ignore the error and continue,” I get the following:

Continuing...
BusyBox v1.36.1 () multi-call binary.

Usage: switch_root [-c CONSOLE_DEV] NEW_ROOT NEW_INIT [ARGS]
Free initramfs and switch to another root fs:
chroot to NEW_ROOT, delete all in /, move NEW_ROOT to /.
execute NEW_INIT.  PID must be 1. NEW_ROOT must be a mountpoint.

When I booted from a live disk and ran GNOME Disks, it gave me these errors when I tried to open the drive or run Repair Disk:

I think the worst part of the situation is that I’m unable to extract my Home files since they’re all saved on root. Normally, this has never been an issue for me.

Any idea on how I can fix the filesystem or at least extract these files?

TLATER · February 10, 2024, 8:32am

Well, I’ve seen this post pop up for a few days now. Hoped someone more familiar with btrfs data recovery would respond.

I’m not sure what exactly you mean by a “file system upgrade”. A freeze should normally not be devastating during a NixOS update, because the actual software changes are atomic (i.e., the update is complete before any changes to the running system occur).

nixos-rebuild switch will add new paths to /nix/store, and restart a bunch of systemd services. It will also change the kernel version that will be started on next boot by adding a boot entry for the new generation, but it does not attempt to kexec a new kernel or anything.

Since in Linux land file system implementations are in the kernel (assuming you’re not doing something crazy with a userland filesystem thing), only a kernel change could have any effect on what the underlying file system is doing, and that can only happen through kexec or a reboot.
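On a working NixOS system you can see this for yourself by comparing the kernel you booted with the kernel the newest generation would boot. A rough sketch, assuming the usual /run/booted-system and /nix/var/nix/profiles/system symlinks; kernel_changed is just a name I made up for illustration:

```shell
# Compare the kernel of the booted system with the kernel of the
# newest system generation. Both default paths are assumptions about
# a standard NixOS layout; pass other paths as $1 and $2 if needed.
kernel_changed() {
    booted=$(readlink -f "${1:-/run/booted-system/kernel}")
    next=$(readlink -f "${2:-/nix/var/nix/profiles/system/kernel}")
    echo "running:   $booted"
    echo "next boot: $next"
    # Succeeds (exit 0) only when the two store paths differ.
    [ "$booted" != "$next" ]
}
```

If kernel_changed succeeds after a rebuild, the new kernel is only registered for the next boot; the running kernel, and with it the btrfs code in use, stays the same until you reboot.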

So it’s a little hard to grok how an update could be the direct cause of file system breakage. That said, a physical hardware failure that happened to coincide with a system upgrade doesn’t seem that likely either, given a modern SSD (though it’s not impossible, since an upgrade does write a lot of data).

If you want safer updates in the future, consider using nixos-rebuild boot. This will not make any changes to the running system, and since NixOS effectively reinstalls the system from scratch on each reboot, you can be pretty sure that any resulting issues are caused by the new software versions (and just boot an old generation to fix them afterwards).
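Roughly, that workflow could look like the sketch below. The function names are made up for illustration; the nixos-rebuild invocations are the part that matters:

```shell
# Build the upgraded system and make it the default boot entry,
# without activating anything on the running system.
upgrade_for_next_boot() {
    sudo nixos-rebuild boot --upgrade
}

# After booting an older generation from the boot menu, make the
# previous generation the default again.
roll_back() {
    sudo nixos-rebuild switch --rollback
}
```

You would run upgrade_for_next_boot, reboot when convenient, and only reach for roll_back if the new generation misbehaves.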

Anyway, your file system looks rather broken. It’s unfortunate that the usual file system recovery methods haven’t just done their magic. Have you tried using the btrfs recovery tools manually? You’d probably start with btrfs check to figure out what’s actually going wrong, and move on to btrfs rescue when you’ve done some sleuthing.
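As a starting point, something like the following from a live ISO. This is a sketch, not a tested recovery procedure: the mapper name "rescue" and the function names are arbitrary, the device path is the UUID from your boot log, and rescue=usebackuproot assumes a reasonably recent kernel on the live system. Everything here is meant to be read-only with respect to the damaged disk:

```shell
# LUKS device from the boot log; "rescue" is an arbitrary mapper name.
DEV=/dev/disk/by-uuid/b90c2010-a53c-4f76-9802-95fdde248601
MAPPED=/dev/mapper/rescue

# Unlock the LUKS container without mounting anything.
unlock() { sudo cryptsetup open "$DEV" rescue; }

# Read-only consistency check: reports problems, does not repair.
check() { sudo btrfs check --readonly "$MAPPED"; }

# Try a read-only mount that falls back to an older tree root.
# $1 is the mount point (defaults to /mnt).
mount_ro() { sudo mount -o ro,rescue=usebackuproot "$MAPPED" "${1:-/mnt}"; }

# If mounting fails, btrfs restore can copy readable files to another
# disk without mounting. $1 is a directory on a *different* device.
list_restorable() { sudo btrfs restore -D -v "$MAPPED" "$1"; }  # -D: dry run
restore_to() { sudo btrfs restore -i -v "$MAPPED" "$1"; }       # -i: ignore errors
```

I would only move on to the btrfs rescue subcommands (super-recover, zero-log and so on) once the files are copied off or the device is imaged, since those write to the disk.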

Checking dmesg for messages about the failed mount could be helpful, too.
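For example, right after a failed mount attempt on the live system (btrfs_messages is just an illustrative name):

```shell
# Show only the kernel log lines that mention btrfs, case-insensitively.
btrfs_messages() {
    sudo dmesg | grep -i btrfs
}
```

The lines just before the open_ctree failure are usually the interesting ones, since they tend to say which tree or checksum the kernel gave up on.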

I think it might also be worth trying to boot an older generation, just in case this is some kind of issue caused by a change to btrfs between kernel versions (after all, unless the freeze happened before the switch - in which case this is much more likely to be hardware failure or a serious kernel bug - you should have rebooted into a new kernel).

nPrevail · February 10, 2024, 4:00pm

I was honestly not thinking when I wrote this part, haha. I rewrote that paragraph. Thanks for pointing it out!