ZFS expansion weirdness

While not specifically a NixOS thing - there are folks here who run with ZFS and may have some thoughts.

I’m playing with the recent support for RAIDZ expansion (zpool attach), which lets you add disks to an existing RAIDZ vdev.

I’m playing in a VM - where I can rapidly re-provision things. It’s super nice. So I set myself up with a basic NixOS install - and add 3 blank data drives of 10G each.

These drives appear as /dev/vda /dev/vdb /dev/vdc
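
A quick way to double-check which devices are the blank data drives (assuming the usual virtio naming, and that the system disk isn’t hiding in that list) is:

$ lsblk -d -o NAME,SIZE,TYPE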

So we can make a ZFS RAIDZ array by doing…

$ sudo zpool create dpool raidz vda vdb vdc
$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool            20G  128K   20G   1% /dpool

Awesome - 20G of storage on top of 3 x 10G drives; I’m giving up one drive’s worth of space for parity.
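
Worth noting: df and zpool list answer different questions here. For a RAIDZ pool, zpool list reports the raw size with parity included (roughly 30G in this setup), while df shows the deflated usable estimate (the 20G above). When the numbers start looking odd below, comparing the two views helps:

$ zpool list dpool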

Now, because I can quickly burn this to the ground and start again, I’ll do that and get myself back to having 3 x 10G blank drives… This time I’m going to create a very small ZFS RAIDZ with just 2 of the drives, then add a 3rd.

$ sudo zpool create dpool raidz vda vdb

$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool           9.5G  128K  9.5G   1% /dpool

No surprise here - but now we add the 3rd drive:

$ sudo zpool attach dpool raidz1-0 vdc

$ df -h /dpool
Filesystem      Size  Used Avail Use% Mounted on
dpool            15G  128K   15G   1% /dpool

What? Why only 15G and not 20G?
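
Before reading too much into that figure, it’s worth making sure the expansion has actually finished - the reflow runs in the background after zpool attach returns. Assuming OpenZFS 2.3’s activity name for it, something like this should show or await the reflow (on an empty 10G pool it completes almost instantly):

$ zpool status dpool
$ zpool wait -t raidz_expand dpool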

Interesting - if I move up to 4 drives, my results get a bit better:

Creating a RAIDZ1 with 4 drives gives me 29GB of space
Creating a RAIDZ1 with 3 drives, then attaching the 4th, gives me 26GB of space
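
A quick sanity check on those ratios, taking the reported figures at face value against ~40G of raw space:

$ echo "scale=3; 29/40; 26/40" | bc
.725
.650

So a native 4-drive RAIDZ1 reports ~72.5% of raw capacity, while the 3-grown-to-4 pool reports only ~65%.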

It still isn’t working like I expect… I suppose at least you CAN add a drive to a RAIDZ, but it appears you don’t get the same capacity as if you had built the array with that number of drives originally.

Maybe this comment is relevant? https://github.com/openzfs/zfs/pull/15022#issuecomment-1700753693

As for the “loss of space”: there is an excellent article on Ars Technica (ZFS fans, rejoice—RAIDz expansion will be a thing very soon) - you will understand what is meant if you look at the graphic. Essentially, the ratio of data-to-parity is not changed in the expansion process. This is different from RAID5/6 reshaping, where new parity blocks are calculated. With ZFS expansion, blocks are only moved, so the data-to-parity ratio of existing data does not improve when you add disks later on.

Therefore, it is best to start out with the largest number of drives you can afford. However, the ZFS calculator (ZFS Capacity Calculator - WintelGuy.com) shows that for a 6 drive raidz2, the remaining capacity is 63.93% (which would not change if you expanded it to, say, 8 drives), while for an 8 drive raidz2, it is 68.19%, which is not that much bigger anyway.

While this might explain what I’m seeing with my simple experiments, I still don’t get it - so if someone can explain this… that would be great

Ah… so maybe the amount of free space is just an estimate? And that estimate is based on the storage ‘efficiency’ of the ZFS filesystem as originally created.

A two-drive RAIDZ1 has ~48% storage efficiency due to the parity cost. So when I have an expanded 3-drive setup, it’s taking the 30GB of raw space and deciding that, due to the parity cost, I’m only going to get 15GB of space.
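
Rough arithmetic backs this up, if you treat df’s rounded 9.5G-out-of-20G as the exact efficiency (df -h rounds, so it only lands in the ballpark of the reported 15G):

$ echo "scale=3; 9.5/20; 30 * (9.5/20)" | bc
.475
14.250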

However, in actual fact, I can write MORE than 15GB of data to that pool:

head -c 9G /dev/urandom > /dpool/file1
head -c 9G /dev/urandom > /dpool/file2

should work… despite the fact that the pool “only” has 15GB of space free.
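
If you actually run that, comparing what df and zfs list claim before and after the writes shows just how far off the estimate is:

$ zfs list dpool
$ df -h /dpool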

Ok - I figured it out. (I think)

This article was key to understanding what was going on.

This is my mental model of what is happening - it may be wrong, but experimental evidence supports my conclusions.

When you create a RAIDZ1, there is always a usable-capacity percentage below 100%, because you are giving up space for parity. When you create a RAIDZ1 with only 2 drives, you only get ~48% usable capacity. Thus, if I take 2 x 10GB drives, ending up with only 9.5GB free on the resulting filesystem makes a lot of sense.

Due to how RAIDZ expansion works, existing blocks are reflowed onto the new, wider layout but are not re-written with a new data-to-parity ratio - so the efficiency of that storage stays at the ~48% level - but you do get more storage. Additionally, the amount of ‘free’ space is, I believe, an estimate based on that same ~48% efficiency number. Thus when we add the 3rd 10G drive, for a total of 30G of raw space, we end up with only 15G reported as usable, based on that guess.

If we start with 3 drives and expand to 4, the estimate is based on the 3-drive efficiency of roughly two-thirds, so we see 26G out of 40G of raw space instead of the 29G a native 4-drive array reports. This matches the experiments done above in this post.

Now, even though the free space guess says 15GB, in this naive example where the pool is empty we can actually store a lot more than 15GB in it. This means that ZFS expansion is actually not a bad deal, but it depends on how much data you had stored there before adding a drive (because all of that old data is less space efficient).

Let’s just do an example that should make the point clear. A RAIDZ1 with 4 x 10G drives will give us ~29G of storage. What if we build that same 4-drive array, starting with 2 drives and then growing to 4 total?

$ sudo zpool create dpool raidz vda vdb

$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool           9.5G  128K  9.5G   1% /dpool

$ sudo zpool attach dpool raidz1-0 vdc

$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool            15G  128K   15G   1% /dpool

$ sudo zpool attach dpool raidz1-0 vdd

$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool            20G  128K   20G   1% /dpool

Now let’s fill that pool up with random data (to avoid compression):

# head -c 35G /dev/urandom > /dpool/big_file
head: error writing 'standard output': No space left on device

# ls -l /dpool/
total 20079455
-rw-r--r-- 1 root root 30762213376 Oct 23 08:45 big_file

# ls -lh /dpool/
total 20G
-rw-r--r-- 1 root root 29G Oct 23 08:45 big_file

Neat - so I have a filesystem that claims 20GB of capacity, with a 29GB file on it. ZFS expansion works! You just get a bad estimate of how much free space you have.
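
For a less misleading view, zpool list reports allocation in raw bytes (parity included) rather than the deflated estimate that df works from:

$ zpool list -v dpool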

What if we did the same 4 x 10G setup, but avoided expansion?

$ sudo zpool create dpool raidz vda vdb vdc vdd

$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool            29G  128K   29G   1% /dpool

$ sudo -i

# head -c 35G /dev/urandom > /dpool/big_file
head: error writing 'standard output': No space left on device

# ls -l /dpool/
total 30058412
-rw-r--r-- 1 root root 30780301312 Oct 23 08:53 big_file

# ls -lh /dpool/
total 29G
-rw-r--r-- 1 root root 29G Oct 23 08:53 big_file

The same 29GB fits, but this time the filesystem reported 29GB free up front - the estimate matches reality.

If you look closely - you’ll notice the first file was 30762213376 bytes, and the second is 30780301312 - nearly the same… but not quite. My math says it is a 0.06% difference.
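
Checking that difference from the byte counts in the two ls listings above:

$ echo "scale=6; (30780301312 - 30762213376) * 100 / 30780301312" | bc
.058764

so ~0.06%, as stated.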

I’m certain that if I were to write, say, a 9G file to the original 2-disk pool before expanding, the difference would be bigger. Still, we really are getting a good portion of the drives’ capacity via ZFS expansion.

Good advice: if you can build the RAIDZ with the right number of disks up front, do so. Expansion does work, but you pay a small tax: the data-to-parity ratio improves as you add disks, and expansion doesn’t re-write previously written data to take advantage of the better ratio.

I’m still a bit concerned I don’t have this quite right - because parity is parity… why does having more disks mean less parity? (This is about surviving the loss of one disk: with only 2 disks, each piece of data effectively needs a full copy on the other disk, like a mirror. With 3 disks, one parity sector can protect two data sectors, so parity is a smaller fraction of each stripe - thus higher efficiency.)
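
Putting idealized numbers on that parenthetical: in a full RAIDZ1 stripe across n disks, one sector is parity, so the best-case data fraction is (n-1)/n. Real pools come in a bit lower due to padding and allocation overhead (hence the observed ~48% and ~72.5%):

$ for n in 2 3 4 5; do echo "scale=3; ($n - 1) / $n" | bc; done
.500
.666
.750
.800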

This sounds like a great question to ask over at Practical ZFS.

Thanks - I’ll try for confirmation of my findings there.

Thread on Practical ZFS where I cover the same question. tl;dr: yes, it’s just df lying to you.