How to boot btrfs RAID1 while degraded?

I have a btrfs root filesystem mirrored across two disks, like so:
mkfs.btrfs -f -L root -m raid1 -d raid1 /dev/disk/by-id/XXXXX-part3 /dev/disk/by-id/YYYYY-part3

I set up a fully working system, and now I want to test my recovery capabilities. I pulled one of the drives, and the system kept working in a degraded state. So far so good.

Next I rebooted… and I have a complete failure there. I see from searching that btrfs will not mount in a degraded state by default, but supposedly adding ‘degraded’ to the GRUB kernel flags works. However, I’m running systemd-boot, and so far nothing seems to work. I can’t confirm whether the degraded trick is expected to work only with GRUB, but appending rootflags=degraded to the boot arguments has no effect for me.
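For reference, here’s roughly what the boot entry looks like with the flag appended (paths abbreviated, generation number illustrative):

```
# /boot/loader/entries/nixos-generation-1.conf (illustrative)
title   NixOS
linux   /EFI/nixos/<hash>-linux-bzImage.efi
initrd  /EFI/nixos/<hash>-initrd.efi
options init=/nix/store/<hash>-nixos-system-.../init rootflags=degraded
```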

Anyone know how to make this work? So far I’m completely striking out.

The boot loader has nothing to do with it, and you can edit the kernel params in systemd-boot by pressing e, just like in GRUB. Unfortunately, NixOS ignores the rootflags= param.

As a side note, scripted stage 1 would probably work with the degraded mount option, but systemd stage 1 (boot.initrd.systemd.enable = true;), or file systems mounted in stage 2 (i.e. non-root file systems), probably wouldn’t, because of a udev problem. The devices in a btrfs file system are either all available or none available: the udev rules for btrfs mark each of them SYSTEMD_READY=0 until they’re all present, so if one never shows up, none of them ever do. Currently systemd has no solution to this.

This is done so that systemd doesn’t try to mount before all devices are present (which is actually a bug that can happen with scripted stage 1), but of course that’s a problem when you want to boot with a missing disk. That’s why I’ve never advocated making similar udev rules for bcachefs or ZFS, and why I recommend users of those file systems create explicit soft dependencies on all desired devices, so that they can time out and a degraded mount can be attempted.
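For the curious, the rule in question ships with systemd as 64-btrfs.rules. Paraphrased (exact contents vary by systemd version), it looks roughly like this:

```
# 64-btrfs.rules (paraphrased from systemd; not verbatim)
SUBSYSTEM!="block", GOTO="btrfs_end"
ACTION=="remove", GOTO="btrfs_end"
ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

# ask the kernel whether all devices of this filesystem are present yet
IMPORT{builtin}="btrfs ready $devnode"

# if not, hide the device from systemd until the set is complete
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

LABEL="btrfs_end"
```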

Unfortunately, NixOS ignores the rootflags= param.

Well, that is unfortunate. Based on this and the rest of your explanation, does that mean it’s literally not possible to boot a degraded btrfs root FS in NixOS?

I was actually just about to try switching to grub just for grins, but sounds like that wouldn’t buy me much.

I wouldn’t say “literally not possible”. You can enter a shell in stage 1 and mount the root FS manually. But yes, it’s a pain. I’d like to see some changes in systemd to make this better in systemd stage 1: A) I’d like systemd-fstab-generator not to ignore rootflags= when we do root=fstab like we do, and B) I’d like some way for btrfs devices to be marked SYSTEMD_READY=1 after a timeout. But absent those upstream improvements (and I am loath to make any similar improvements to the scripted stage 1, since we’d like to deprecate that in favor of the systemd stage 1 anyway), the best thing to do is just enter a shell and mount manually. In scripted stage 1 you can use boot.shell_on_fail; in systemd stage 1 you can add SYSTEMD_SULOGIN_FORCE=1 as a param to allow logging into the rescue shell, or use rd.systemd.debug_shell to start a shell on tty9.
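Concretely, a sketch of setting those params persistently in NixOS (these are the exact params mentioned above; whether you want them permanently enabled is a separate question):

```nix
{
  # Sketch: make a rescue shell reachable in systemd stage 1.
  # rd.systemd.debug_shell starts a root shell on tty9 during the initrd;
  # SYSTEMD_SULOGIN_FORCE=1 lets sulogin drop to a shell even without a
  # root password. Both are security tradeoffs if left on all the time.
  boot.kernelParams = [
    "rd.systemd.debug_shell"
    "SYSTEMD_SULOGIN_FORCE=1"
  ];
}
```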


Thanks, ElvishJerricco. Appreciate the insight.

I did make a new discovery - since you said the bootloader essentially ignored rootflags=degraded, I tried setting degraded as an option on the filesystem definitions:

        "/" = {
          device = "LABEL=root";
          fsType = "btrfs";
          options = [ "subvol=root" "compress=zstd:3" "degraded" ];
        };

With this, it actually booted. Mostly. It couldn’t complete the init process because /boot was also missing and isn’t RAID (my understanding is it’s a really bad idea to do that), but based on some earlier guidance I read, I keep a synced boot partition on the other disk:

    extraInstallCommands = ''
      if [ -d /boot2 ]; then
        echo "Syncing ESP to /boot2"
        ${pkgs.rsync}/bin/rsync -a --delete /boot/ /boot2/
      fi
    '';

It prompted me to enter a maintenance shell, and from there I was able to unmount /boot2 and mount that backup partition as /boot, and now I’m back up and running. Not exactly elegant, but it works in an emergency.

Now to make sure I have some very clear and very loud notifications when a disk fails. :slight_smile:

Well, to be clear, it’s not the boot loader. It’s NixOS’s stage 1, which runs after the boot loader.

The risk with adding degraded unconditionally is that it could mount degraded when it shouldn’t, so this problem remains thorny.

Yea, mirroring it isn’t great, but sync’ing it like this has basically the same problems as mirroring it: /boot is technically stateful, and sync’ing that state onto a partition that isn’t being booted is questionable (e.g. you’re likely reusing old random-seed files this way). The right thing to do is run the NixOS boot loader update script multiple times, once for each drive, but that has its own thorns, so we don’t currently support it.

This seems like it could be nicely fixed with specializations. If you make a specialization that mounts the root fs degraded, you have the option to do so from the bootloader in the rare event that you need to, but you normally don’t, and thus don’t take on the risks mentioned above.
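Something like this untested sketch (the specialisation name is arbitrary; list-typed options merge, so degraded gets appended to the normal mount options):

```nix
{
  specialisation.degraded-root.configuration = {
    # Merged into the regular mount options for "/" in this boot entry
    # only, so a degraded mount is attempted only when you deliberately
    # pick this entry from the boot menu.
    fileSystems."/".options = [ "degraded" ];
  };
}
```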

Roger, thanks for the clarification. I have to say I just ran into an additional problem, though: after restoring my drive, rebooting, and rebalancing, the system basically locked up. I can still ping it, but there’s zero available disk I/O, and the console keeps printing messages about a hung task timeout.

I was honestly okay with the hoop jumping up until this point, but if I can’t reliably restore my filesystem after a failure then it’s a non-starter. I know btrfs has some issues with raid5/6, but I thought raid1 was supposed to be pretty solid, so this is disappointing.

Back to trying to figure out how to make zfs work with kernel 6.18 again. Man, that’s a bummer.

I’ve never even heard of specializations before, but that’s a good tip. Going to read up further on that. Thanks!