NixOS disk images with ephemeral root volume cache/overlay?

Question

Suppose we have a system with the following block devices:

  • 1 slow persistent block device (e.g. network volume, HDD).
    • This is the boot device (i.e. has the boot, EFI, and system partitions).
  • 0+ fast ephemeral block devices (e.g. ephemeral SSD, RAM disk).

Is there a good method for declaring a NixOS disk image that:

  1. Treats the slow block device as the origin.
  2. If there are any fast block devices, treats them as:
    • A read cache (origin blocks copied to cache on read).
    • An ephemeral write overlay (writes are stored in the cache and never flushed to the origin, i.e. their presence makes the origin immutable).
  3. Auto-mounts this entire hybrid volume as the root system partition (i.e. /).
    • For ease of use (applications don’t need to write everything under a specific directory).

Background

AWS EC2 generally only allows the boot volume to be an EBS volume (i.e. a network block device that looks like a standard NVMe drive).

Some EC2 instance types, however, also come with instance stores (i.e. local NVMe SSDs) which often have significantly better latency and throughput than most EBS volume types.

Instance stores were usable as root volumes for a select few generation 1-3 EC2 instance types, but this isn't supported on newer generations and is effectively deprecated.

For example, here’s the lsblk --fs --noempty --topology --tree output for an m6id.large instance:

NAME         FSTYPE FSVER LABEL           UUID                                 FSAVAIL FSUSE% MOUNTPOINTS ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
nvme0n1                                                                                                           0   4096   4096    4096     512    0 none       63 128    0B
├─nvme0n1p1  ext4   1.0   cloudimg-rootfs dc3674d2-7170-41d8-af2b-a6e56486dda8   10.5G    27% /                   0   4096   4096    4096     512    0 none       63 128    0B
├─nvme0n1p14                                                                                                      0   4096   4096    4096     512    0 none       63 128    0B
├─nvme0n1p15 vfat   FAT32 UEFI            2F09-7168                              98.2M     6% /boot/efi           0   4096   4096    4096     512    0 none       63 128    0B
└─nvme0n1p16 ext4   1.0   BOOT            f58db4cd-64fe-4116-a0cb-190e4ed9997d  733.5M    10% /boot               0   4096   4096    4096     512    0 none       63 128    0B
nvme1n1                                                                                                           0    512      0     512     512    0 none      127 128    0B

nvme0n1 is a 16 GB boot EBS volume and nvme1n1 is the 64 GB instance store. The instance stores start unformatted.

Other instance types may have multiple instance stores. For example, an m6id.32xlarge has 4 x 2048 GB instance stores (e.g. nvme1n1, nvme2n1, nvme3n1, nvme4n1).

Note that the AWS documentation may state capacities like 59/1900 instead of 64/2048. They’re listing these in GiB (mislabeled as GB in some places).
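
For reference (taking GB = 10^9 bytes and GiB = 2^30 bytes), the conversion can be checked with shell arithmetic:

echo $(( 64 * 10**9 / 2**30 ))   # 59   (64 GB in GiB)
echo $(( 2048 * 10**9 / 2**30 )) # 1907 (2048 GB in GiB, i.e. the ~1900 listed)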

While EBS volumes are persistent, instance stores are ephemeral. Stateless services, however, don’t really mind this ephemeral nature.

For example, suppose we have a CI/CD fleet setup that creates ephemeral per-job EC2 instances running NixOS. We’re not really interested in persisting any disk state. We’d also like reads + writes to any directory (e.g. /nix, /home/{user}) to be accelerated by the instance store so job scripts don’t need to understand special mount points.

Options

There’s 2 general areas I’m considering:

  1. Caching/overlay tools.
  2. Disk setup tools.

Caching/Overlay Tools

Ideally, the origin (e.g. 8 GB boot EBS volume) can be smaller than the cache (e.g. 64+ GB instance stores). This lets us keep the boot EBS volume as small as possible to reduce costs.

Currently leaning towards block-level caching so ext4 or XFS can be used for performance.

Block-Level Caching

These tools cache at the block level. The main benefit is that these should work with arbitrary filesystems (e.g. ext4, XFS, BTRFS, ZFS).

:white_check_mark: LVM Cache or bcache

These seem to be the main options for block-level caching.

For LVM cache, we probably want some topology like this:

# Boot EBS volume (8-16 GB).
nvme0n1
├── nvme0n1p1
│   └── vg-origin
│       └── lv-origin # → /         (ext4 or XFS)
├── nvme0n1p14
├── nvme0n1p15        # → /boot/efi (FAT32)
└── nvme0n1p16        # → /boot     (ext4 or XFS)

# Instance store (2048 GB).
nvme1n1
└── vg-cache
    └── lv-cache      # → /         (cachevol)

# Instance store (2048 GB).
nvme2n1
└── vg-cache
    └── lv-cache      # → /         (cachevol)
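
One caveat: lvmcache requires the cache volume to live in the same volume group as the origin LV, so in practice the origin and cache PVs would share a single VG rather than the separate vg-origin/vg-cache drawn above, with the cache only attached at boot once an instance store appears. A hedged, untested sketch of the commands involved (device names follow the m6id layout):

# Image build time: origin LV on the EBS partition.
pvcreate /dev/nvme0n1p1
vgcreate vg-root /dev/nvme0n1p1
lvcreate -n lv-origin -l 100%FREE vg-root

# Boot time: pull the instance store into the VG and attach it as a cachevol.
pvcreate /dev/nvme1n1
vgextend vg-root /dev/nvme1n1
lvcreate -n lv-cache -l 100%PVS vg-root /dev/nvme1n1
lvconvert --yes --type cache --cachevol lv-cache vg-root/lv-origin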

Writeback from the cache to the origin seems like it can be disabled with the low and high watermark settings.

For bcache, writeback can be disabled by setting writeback_running to off.
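
As a rough, untested sketch (device names assumed; note that make-bcache writes its own superblock, so the backing device has to be set up as bcache before the root filesystem is created on it):

make-bcache -B /dev/nvme0n1p1   # EBS partition as the backing device
make-bcache -C /dev/nvme1n1     # instance store as the cache device
# Attach the cache set (UUID from bcache-super-show /dev/nvme1n1), then keep
# dirty blocks in the cache instead of ever flushing them to the origin.
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
echo writeback > /sys/block/bcache0/bcache/cache_mode
echo 0 > /sys/block/bcache0/bcache/writeback_running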

Filesystem-Level Caching

These tools cache at the filesystem level. They're essentially tied to a specific filesystem, which can reduce flexibility.

:cross_mark: overlayfs

I think this can’t function as a read cache (i.e. won’t cache files from the lower filesystem to the upper filesystem on read).

:warning: ZFS or bcachefs

Both of these generally trade some performance for higher durability. For stateless services, this may not be worth it since durability generally isn't a concern (e.g. any state is stored in a separate database).

Disk Setup Tools

These tools probably have to be run during the initrd/initramfs stage (i.e. before an actual disk is mounted as /).

NixOS has a few options for configuring this phase, like boot.initrd.systemd.units and boot.initrd.services.udev.rules.

For example, we might have a systemd unit run disko to create the initial LVM volume groups and logical volumes, along with udev rules or systemd.device units for the instance stores that add them to an lv-cache LVM logical volume on attach.
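
As a hedged sketch (untested; names assumed), the script run by such a unit or udev RUN hook might look like the boot-time half of the LVM example above, parameterized on the device that just appeared:

# Invoked with the new instance store device as $1 (e.g. /dev/nvme1n1).
dev="$1"
pvcreate "$dev"
vgextend vg-root "$dev"
lvcreate -n lv-cache -l 100%PVS vg-root "$dev"
lvconvert --yes --type cache --cachevol lv-cache vg-root/lv-origin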

Not sure if completely applicable, but in the past I've had great success with a setup like: 5.4.7. Creating LVM Cache Logical Volumes | Logical Volume Manager Administration | Red Hat Enterprise Linux | 6 | Red Hat Documentation

That was not on NixOS (it was either Ubuntu or Debian), but you can probably set up something like this with disko.

I did this purely for performance reasons back then, not with the goal of allowing no writes to the HDD. I am not sure whether that is possible, as you put both disks in the same logical volume.

I played a bit with ZFS on AWS EC2 machines with instance store (mostly trying to use various ZFS snapshot importing schemes, using the instance store drive as a separate ZFS pool for performance reasons). I just thought it may be interesting to try the following.

You can format the EBS image to contain a ZFS pool. The pool will have one of the EBS volume's partitions added as a top-level vdev (presumably another partition is still needed for EFI, so the whole volume can't be used as the vdev). On boot, you can add the ephemeral drive as a second top-level vdev, and then remove the EBS partition vdev from the pool.

zpool add tank /dev/nvme1n1 # add instance store
zpool remove tank /dev/nvme0n1p16 # remove EBS partition

All new writes will now go to the instance store. The removal also triggers ZFS's "data evacuation" of the to-be-removed EBS vdev, copying its data to the instance store vdev. After some time, only the instance store drive will be used for all data, and the EBS partition will be abandoned.

Of course, it is not exactly a read cache, we're just moving data from EBS to the instance store, but it is transparent to the system, ZFS's data evacuation happens in the background, and the amount of IO should be configurable with ZFS's resilver parameters. Also, if there are multiple instance stores, you can add them all as separate vdevs and ZFS will transparently distribute data among them.
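
For example, for the m6id.32xlarge layout above (same hypothetical pool name as before):

# Add all 4 instance stores as separate top-level vdevs; ZFS stripes data across them.
zpool add tank /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1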

Of course, this method will render the machine unbootable if the ephemeral drive is lost (e.g. during a power-down/power-up cycle). But for the use case where we only run ephemeral EC2 machines that are never rebooted, only powered down and deleted, it should be fine.

It looks like LVM cache writeback can be disabled by:

  1. Setting the dm-cache mode to writeback.
  2. Setting the dm-writecache settings to high_watermark=100, low_watermark=100, and writeback_jobs=0.
    • We may only need some of these settings. The documentation isn’t too clear on what permutations are valid so this needs testing.
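
For reference, a hedged guess at how these might be applied (untested; per lvmcache(7) these tunables belong to the dm-writecache target, and the VG/LV names here are hypothetical):

lvconvert --type writecache --cachevol lv-cache \
  --cachesettings 'high_watermark=100 low_watermark=100 writeback_jobs=0' \
  vg-root/lv-origin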

This looks interesting, though it sounds like data movement here is eager instead of lazy (unlike LVM cache or bcache, which act as lazy pull-through caches), unless ZFS's "data evacuation" still allows reads from the EBS vdev while migration is happening and prioritizes moving the blocks being accessed. That might be covered by the ZFS ARC instead, but I'm not too familiar with ZFS.

Assuming we wait for this eager data movement during the initrd/initramfs stage for whatever reason, boot time would mostly be limited by EBS volume throughput. For example, gp3 volumes have a baseline throughput of 125 MiB/s, which we can increase up to 1000 MiB/s at extra cost.

That means an 8 GB boot EBS gp3 volume (relatively small already) can take anywhere from ~61 seconds (at 125 MiB/s) down to ~7.6 seconds (at 1000 MiB/s) to completely copy to the instance store. This can be problematic for use cases where low scale-up time is critical (e.g. web services, CI/CD fleets).
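
Worked out with shell arithmetic (GB = 10^9 bytes, MiB = 2^20 bytes):

echo $(( 8 * 10**9 / (125 * 2**20) ))  # 61 seconds at 125 MiB/s
echo $(( 8 * 10**9 / (1000 * 2**20) )) # 7 (~7.6) seconds at 1000 MiB/s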

The lack of reboot support is also a bit problematic since it’s possible for a power outage to trigger an unintended reboot.

For example, let's say an availability zone (AZ) has a power outage. If a web service is balanced across 3 AZs in the region, there will be several minutes where ~1/3 of the instances are broken until the auto-scaling group (ASG) replaces them due to failing health checks.

My understanding is that ZFS allows reads from the EBS while moving data (it will transparently redirect reads to the EBS or the ephemeral disk depending on whether a specific wanted block has been moved already). So there's no need to delay the boot process by waiting for the removal to finish - it can continue in the background after boot (and the zpool remove command by default is instantaneous and does not wait). Of course, performance may take a hit due to the additional IO, but hopefully it can be configured to be minimal. However, I don't think ZFS will prioritize moving blocks which are being accessed, so yeah, it's not really a cache.

Unfortunately I don’t see an ideal solution with pure ZFS. I considered alternatives, but they are also not complete:

  1. We can add a cache vdev based on the ephemeral disk, but it would only be a read cache - writes would still go to the EBS.
  2. Or we can split the ephemeral disk into two partitions, making one a cache vdev and the other a special vdev, and force all new writes to go to the special vdev by setting special_small_blocks to a big value. This way we can direct new writes to the instance store drive and still have a read cache. Unfortunately, the read cache will try to cache reads from the special vdev too, which would be wasteful since they're on the same physical drive (it's unlikely that ZFS is smart enough to notice that and skip caching reads from the special vdev). So also not great.
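
A rough sketch of option 2, assuming the instance store has already been split into two partitions (device names hypothetical, untested):

# Half the instance store as an L2ARC read cache, half as a special vdev.
zpool add tank cache /dev/nvme1n1p1
zpool add tank special /dev/nvme1n1p2
# Blocks <= special_small_blocks land on the special vdev; 1M covers everything
# at the default 128K recordsize, so all new writes go to the instance store.
zfs set special_small_blocks=1M tank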

I think a setup like you’ve proposed makes sense for workloads that know they’re going to access most of the data from the boot EBS volume. In these cases, ZFS doing an eager asynchronous data migration doesn’t result in much wasted work.

The focus on laziness in these requirements comes from handling workloads where this isn't true. That seems to be the case for a lot of setups, making it advantageous to use lazy pull-through block caches, sparse block devices, and sparse filesystems.

For example, AWS Lambda uses this approach to support large container images (though it’s also doing fancier things for availability, efficiency, and security).

EBS does something similar under the hood, but even warm EBS volumes are a bit slow for some use cases, so we need to do the same ourselves with the instance store (e.g. via LVM cache or bcache).

Nix-based setups (especially NixOS-based setups) are likely better than typical approaches at keeping closures minimal (thus keeping the resulting disk or container image minimal), but even for NixOS there are likely significant portions of the Linux kernel (e.g. unloaded kernel modules or unused device drivers) and certain runtime dependencies (e.g. parts of language runtimes) that are never accessed at runtime.