Question
Suppose we have a system with the following block devices:
- 1 slow persistent block device (e.g. network volume, HDD).
  - This is the boot device (i.e. has the boot, EFI, and system partitions).
- 0+ fast ephemeral block devices (e.g. ephemeral SSD, RAM disk).
Is there a good method for declaring a NixOS disk image that:
- Treats the slow block device as the origin.
- If there are any fast block devices, treats them as:
  - A read cache (origin blocks copied to cache on read).
  - An ephemeral write overlay (writes are stored in the cache and never flushed to the origin, i.e. their presence makes the origin immutable).
- Auto-mounts this entire hybrid volume as the root system partition (i.e. /).
  - For ease of use (applications don’t need to write everything under a specific directory).
Background
AWS EC2 generally requires the boot volume to be an EBS volume (i.e. a network block device that presents as a standard NVMe drive).
Some EC2 instance types, however, also come with instance stores (i.e. local NVMe SSDs) which often have significantly better latency and throughput than most EBS volume types.
Instance stores were usable as root volumes for a select few generation 1-3 EC2 instance types, but this isn’t supported on newer generations and is effectively deprecated.
For example, here’s the `lsblk --fs --noempty --topology --tree` output for an m6id.large instance:

```
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE RA WSAME
nvme0n1 0 4096 4096 4096 512 0 none 63 128 0B
├─nvme0n1p1 ext4 1.0 cloudimg-rootfs dc3674d2-7170-41d8-af2b-a6e56486dda8 10.5G 27% / 0 4096 4096 4096 512 0 none 63 128 0B
├─nvme0n1p14 0 4096 4096 4096 512 0 none 63 128 0B
├─nvme0n1p15 vfat FAT32 UEFI 2F09-7168 98.2M 6% /boot/efi 0 4096 4096 4096 512 0 none 63 128 0B
└─nvme0n1p16 ext4 1.0 BOOT f58db4cd-64fe-4116-a0cb-190e4ed9997d 733.5M 10% /boot 0 4096 4096 4096 512 0 none 63 128 0B
nvme1n1 0 512 0 512 512 0 none 127 128 0B
```
`nvme0n1` is a 16 GB boot EBS volume and `nvme1n1` is the 64 GB instance store. The instance stores start unformatted.

Other instance types may have multiple instance stores. For example, an `m6id.32xlarge` has 4 x 2048 GB instance stores (e.g. `nvme1n1`, `nvme2n1`, `nvme3n1`, `nvme4n1`).

Note that the AWS documentation may state capacities like 59/1900 instead of 64/2048. They’re listing these in GiB (mislabeled as GB in some places).
While EBS volumes are persistent, instance stores are ephemeral. Stateless services, however, don’t really mind this ephemeral nature.
For example, suppose we have a CI/CD fleet setup that creates ephemeral per-job EC2 instances running NixOS. We’re not really interested in persisting any disk state. We’d also like reads + writes to any directory (e.g. /nix, /home/{user}) to be accelerated by the instance store so job scripts don’t need to understand special mount points.
Options
There are two general areas I’m considering:
- Caching/overlay tools.
- Disk setup tools.
Caching/Overlay Tools
Ideally, the origin (e.g. 8 GB boot EBS volume) can be smaller than the cache (e.g. 64+ GB instance stores). This lets us keep the boot EBS volume as small as possible to reduce costs.
Currently leaning towards block-level caching so ext4 or XFS can be used for performance.
Block-Level Caching
These tools cache at the block level. The main benefit is that these should work with arbitrary filesystems (e.g. ext4, XFS, BTRFS, ZFS).
LVM Cache or bcache
These seem to be the main options for block-level caching.
For LVM cache, the cache LV has to live in the same volume group as the origin LV, so the instance stores would join the root VG as additional PVs. We probably want some topology like this:

```
# Boot EBS volume (8-16 GB).
nvme0n1
├── nvme0n1p1          # PV in vg-root
│   └── vg-root
│       ├── lv-origin  # → / (ext4 or XFS), cached by lv-cache
│       └── lv-cache   # cachevol, allocated on the instance-store PVs
├── nvme0n1p14
├── nvme0n1p15         # → /boot/efi (FAT32)
└── nvme0n1p16         # → /boot (ext4 or XFS)

# Instance store (2048 GB).
nvme1n1                # PV in vg-root (holds lv-cache extents)

# Instance store (2048 GB).
nvme2n1                # PV in vg-root (holds lv-cache extents)
```
Writeback from the cache to the origin seems like it can be disabled with the low and high watermark settings (these appear to be `dm-writecache` settings; the `dm-cache` type instead exposes `migration_threshold`).
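As a rough, untested sketch of the attach step (assumptions: the instance store is `/dev/nvme1n1`, the `vg-root`/`lv-origin`/`lv-cache` names from the topology above, and that the LVM tools are available in the initrd, e.g. via `boot.initrd.services.lvm.enable`):

```nix
{
  boot.initrd.systemd.enable = true;
  boot.initrd.services.lvm.enable = true;
  boot.initrd.kernelModules = [ "dm-cache" "dm-cache-smq" ];

  boot.initrd.systemd.services.attach-lvm-cache = {
    # Assemble the cache before the root device is mounted.
    before = [ "initrd-root-device.target" ];
    requiredBy = [ "initrd-root-device.target" ];
    unitConfig.DefaultDependencies = false;
    serviceConfig.Type = "oneshot";
    script = ''
      # Pull the instance store into the root VG as an extra PV.
      lvm pvcreate /dev/nvme1n1
      lvm vgextend vg-root /dev/nvme1n1
      # Use all of its extents for the cache LV...
      lvm lvcreate --name lv-cache --extents 100%FREE vg-root /dev/nvme1n1
      # ...and attach it in front of the origin LV as a dm-cache cachevol.
      lvm lvconvert --yes --type cache --cachevol lv-cache \
        --cachemode writeback vg-root/lv-origin
    '';
  };
}
```

Note that dm-cache in writeback mode still migrates dirty blocks to the origin in the background, so "never flush" would hinge on the settings mentioned above actually pinning dirty data in the cache.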
For bcache, writeback can be disabled by setting writeback_running to off.
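A minimal sketch of that, assuming the assembled device shows up as `bcache0` (untested; for a root filesystem this would have to happen in the initrd instead):

```nix
{
  systemd.services.bcache-ephemeral-writes = {
    description = "Keep bcache dirty data in the cache, never flush to origin";
    wantedBy = [ "multi-user.target" ];
    serviceConfig.Type = "oneshot";
    script = ''
      # Writeback mode: writes are acknowledged once they hit the cache...
      echo writeback > /sys/block/bcache0/bcache/cache_mode
      # ...and with the writeback thread off, dirty data stays there.
      echo 0 > /sys/block/bcache0/bcache/writeback_running
    '';
  };
}
```

Dirty data pinned this way is lost along with the instance store, which is exactly the ephemeral-overlay behaviour wanted here.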
Filesystem-Level Caching
These tools cache at the filesystem level, so they’re essentially tied to a specific filesystem, which can reduce flexibility.
overlayfs
I think this can’t function as a read cache: overlayfs only copies files from the lower to the upper filesystem on write (copy-up), never on read.
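It could still provide the ephemeral write overlay half, though. A hypothetical sketch (all paths are illustrative) of an overlayfs root where writes land on the instance store but reads of unmodified files still hit the slow origin:

```nix
{
  # Assumes the EBS root is available read-only at /sysroot-ro and the
  # instance store is formatted and mounted at /instance-store beforehand.
  fileSystems."/" = {
    fsType = "overlay";
    device = "overlay";
    options = [
      "lowerdir=/sysroot-ro"            # slow, read-only EBS origin
      "upperdir=/instance-store/upper"  # fast ephemeral writes
      "workdir=/instance-store/work"    # overlayfs scratch space
    ];
  };
}
```

Since unmodified files are read from the lower layer on every access, this never accelerates reads, which is why overlayfs alone doesn’t satisfy the read-cache requirement.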
ZFS or bcachefs
Both of these generally trade reduced performance for higher durability. For stateless services, this may not be worth it, since durability generally isn’t a concern (e.g. any state is stored in a separate database).
Disk Setup Tools
These tools probably have to be run during the initrd/initramfs stage (i.e. before an actual disk is mounted as /).
NixOS has a few options for configuring this phase, like `boot.initrd.systemd.units` and `boot.initrd.services.udev.rules`.
For example, we might have a systemd unit run disko to create the initial LVM logical volumes and volume groups, along with udev rules or `systemd.device` units for the instance stores that add them to the `lv-cache` LVM logical volume on attach.
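For the udev half, a hypothetical sketch (the `ID_MODEL` match and the `add-cache-pv@` template service are illustrative assumptions; check `udevadm info` on a real instance):

```nix
{
  boot.initrd.services.udev.rules = ''
    # When an NVMe instance store appears, pull it into the cache via a
    # templated service (the model-string match is an assumption).
    ACTION=="add", SUBSYSTEM=="block", KERNEL=="nvme[0-9]n1", \
      ENV{ID_MODEL}=="Amazon EC2 NVMe Instance Storage", \
      TAG+="systemd", ENV{SYSTEMD_WANTS}+="add-cache-pv@%k.service"
  '';
}
```

where `add-cache-pv@.service` would run the pvcreate/vgextend/lvconvert steps from the earlier sketch against `/dev/%i`.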