ZFS expansion weirdness

While not specifically a NixOS thing - there are folks here who run with ZFS and may have some thoughts.

I’m playing with the recent support for RAIDZ expansion (zpool attach), which lets you add disks to an existing RAIDZ vdev.

I’m playing in a VM - where I can rapidly re-provision things. It’s super nice. So I set myself up with a basic NixOS install - and add 3 blank data drives of 10G each.

These drives appear as /dev/vda /dev/vdb /dev/vdc
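
A quick way to double-check which devices are the blank data drives (assuming the usual virtio naming, and that the system disk isn’t hiding in that list) is:

$ lsblk -d -o NAME,SIZE,TYPE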

So we can make a ZFS RAIDZ array by doing…

$ sudo zpool create dpool raidz vda vdb vdc
$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool            20G  128K   20G   1% /dpool

Awesome - 20G of storage on top of 3 x 10G drives; I’m giving up one drive’s worth of space for parity.
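
Worth noting: df and zpool list answer different questions here. For a RAIDZ pool, zpool list reports the raw size with parity included (roughly 30G in this setup), while df shows the deflated usable estimate (the 20G above). When the numbers start looking odd below, comparing the two views helps:

$ zpool list dpool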

Now, because I can quickly burn this to the ground and start again, I’ll do that and get myself back to having 3 x 10G blank drives… This time I’m going to create a very small ZFS RAIDZ with just 2 of the drives, then add a 3rd.

$ sudo zpool create dpool raidz vda vdb

$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool           9.5G  128K  9.5G   1% /dpool

No surprise here - but now we add the 3rd drive:

$ sudo zpool attach dpool raidz1-0 vdc

$ df -h /dpool
Filesystem      Size  Used Avail Use% Mounted on
dpool            15G  128K   15G   1% /dpool

What? Why only 15G and not 20G?
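
Before reading too much into that figure, it’s worth making sure the expansion has actually finished - the reflow runs in the background after zpool attach returns. Assuming OpenZFS 2.3’s activity name for it, something like this should show or await the reflow (on an empty 10G pool it completes almost instantly):

$ zpool status dpool
$ zpool wait -t raidz_expand dpool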

Interesting - if I move up to 4 drives, my results get a bit better:

Creating a RAIDZ1 with 4 drives gives me 29GB of space
Creating a RAIDZ1 with 3 drives, then attaching the 4th, gives me 26GB of space
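
A quick sanity check on those ratios, taking the reported figures at face value against ~40G of raw space:

$ echo "scale=3; 29/40; 26/40" | bc
.725
.650

So a native 4-drive RAIDZ1 reports ~72.5% of raw capacity, while the 3-grown-to-4 pool reports only ~65%.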

It still isn’t working like I expect… I suppose at least you CAN add a drive to a RAIDZ, but it appears you don’t get the same capacity as if you had built the array with that number of drives originally.

Maybe this comment is relevant? https://github.com/openzfs/zfs/pull/15022#issuecomment-1700753693

As for the “loss of space”: there is an excellent article on Ars Technica (ZFS fans, rejoice—RAIDz expansion will be a thing very soon) - you will understand what is meant if you look at the graphic. Essentially, the ratio of data-to-parity is not changed in the expansion process. This is different from RAID5/6 reshaping, where new parity blocks are calculated. With ZFS expansion, blocks are only moved, so the data-to-parity ratio of existing data does not improve when you add disks later on.

Therefore, it is best to start out with the largest number of drives you can afford. However, the ZFS calculator (ZFS Capacity Calculator - WintelGuy.com) shows that for a 6 drive raidz2, the remaining capacity is 63.93% (which would not change if you expanded it to, say, 8 drives), while for an 8 drive raidz2, it is 68.19%, which is not that much bigger anyway.

While this might explain what I’m seeing with my simple experiments, I still don’t get it - so if someone can explain this… that would be great

Ah… so maybe the amount of free space is just an estimate? And that estimate is based on the storage ‘efficiency’ of the ZFS filesystem as originally created.

A two-drive RAIDZ1 has ~48% storage efficiency due to the parity cost. So when I have an expanded 3-drive setup, it’s taking the 30GB of raw space and deciding that, due to the parity cost, I’m only going to get 15GB of space.
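
Rough arithmetic backs this up, if you treat df’s rounded 9.5G-out-of-20G as the exact efficiency (df -h rounds, so it only lands in the ballpark of the reported 15G):

$ echo "scale=3; 9.5/20; 30 * (9.5/20)" | bc
.475
14.250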

However, in actual fact, I can write MORE than 15GB of data to that pool:

head -c 9G /dev/urandom > /dpool/file1
head -c 9G /dev/urandom > /dpool/file2

should work… despite the fact that the pool “only” has 15GB of space free.
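
If you actually run that, comparing what df and zfs list claim before and after the writes shows just how far off the estimate is:

$ zfs list dpool
$ df -h /dpool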

Ok - I figured it out. (I think)

This article was key to understanding what was going on.

This is my mental model of what is happening - it may be wrong, but experimental evidence supports my conclusions.

When you create a RAIDZ1, there is always a usable-capacity percentage below 100%, because you are giving up space for parity. When you create a RAIDZ1 with only 2 drives, you only get ~48% usable capacity. Thus, if I take 2 x 10GB drives, ending up with only 9.5GB free on the resulting filesystem makes a lot of sense.

Due to how RAIDZ expansion works, existing blocks are reflowed onto the new, wider layout but are not re-written with a new data-to-parity ratio - so the efficiency of that storage stays at the ~48% level - but you do get more storage. Additionally, the amount of ‘free’ space is, I believe, an estimate based on that same ~48% efficiency number. Thus when we add the 3rd 10G drive, for a total of 30G of raw space, we end up with only 15G reported as usable, based on that guess.

If we start with 3 drives and expand to 4, the estimate is based on the 3-drive efficiency of roughly two-thirds, so we see 26G out of 40G of raw space instead of the 29G a native 4-drive array reports. This matches the experiments done above in this post.

Now, even though the free space guess says 15GB, in this naive example where the pool is empty we can actually store a lot more than 15GB in it. This means that ZFS expansion is actually not a bad deal, but it depends on how much data you had stored there before adding a drive (because all of that old data is less space efficient).

Let’s just do an example that should make the point clear. A RAIDZ1 with 4 x 10G drives will give us ~29G of storage. What if we build that same 4-drive array, starting with 2 drives and then growing to 4 total?

$ sudo zpool create dpool raidz vda vdb

$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool           9.5G  128K  9.5G   1% /dpool

$ sudo zpool attach dpool raidz1-0 vdc

$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool            15G  128K   15G   1% /dpool

$ sudo zpool attach dpool raidz1-0 vdd

$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool            20G  128K   20G   1% /dpool

Now let’s fill that pool up with random data (to avoid compression):

# head -c 35G /dev/urandom > /dpool/big_file
head: error writing 'standard output': No space left on device

# ls -l /dpool/
total 20079455
-rw-r--r-- 1 root root 30762213376 Oct 23 08:45 big_file

# ls -lh /dpool/
total 20G
-rw-r--r-- 1 root root 29G Oct 23 08:45 big_file

Neat - so I have a filesystem that claims 20GB of capacity, with a 29GB file on it. ZFS expansion works! You just get a bad estimate of how much free space you have.
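
For a less misleading view, zpool list reports allocation in raw bytes (parity included) rather than the deflated estimate that df works from:

$ zpool list -v dpool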

What if we did the same 4 x 10G setup, but avoided expansion?

$ sudo zpool create dpool raidz vda vdb vdc vdd

$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool            29G  128K   29G   1% /dpool

$ sudo -i

# head -c 35G /dev/urandom > /dpool/big_file
head: error writing 'standard output': No space left on device

# ls -l /dpool/
total 30058412
-rw-r--r-- 1 root root 30780301312 Oct 23 08:53 big_file

# ls -lh /dpool/
total 29G
-rw-r--r-- 1 root root 29G Oct 23 08:53 big_file

The same 29GB fits, but this time the filesystem reported 29GB free up front - the estimate matches reality.

If you look closely - you’ll notice the first file was 30762213376 bytes, and the second is 30780301312 - nearly the same… but not quite. My math says it is a 0.06% difference.
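
Checking that difference from the byte counts in the two ls listings above:

$ echo "scale=6; (30780301312 - 30762213376) * 100 / 30780301312" | bc
.058764

so ~0.06%, as stated.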

I’m certain that if I were to write, say, a 9G file to the original 2-disk pool before expanding, the difference would be bigger. Still, we really are getting a good portion of the drives’ capacity via ZFS expansion.

Good advice: if you can build the RAIDZ with the right number of disks up front, do so. Expansion does work, but you pay a small tax: the data-to-parity ratio improves as you add disks, and expansion doesn’t re-write previously written data to take advantage of the better ratio.

I’m still a bit concerned I don’t have this quite right - because parity is parity… why does having more disks mean less parity? (This is about surviving the loss of one disk: with only 2 disks, each piece of data effectively needs a full copy on the other disk, like a mirror. With 3 disks, one parity sector can protect two data sectors, so parity is a smaller fraction of each stripe - thus higher efficiency.)
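
Putting idealized numbers on that parenthetical: in a full RAIDZ1 stripe across n disks, one sector is parity, so the best-case data fraction is (n-1)/n. Real pools come in a bit lower due to padding and allocation overhead (hence the observed ~48% and ~72.5%):

$ for n in 2 3 4 5; do echo "scale=3; ($n - 1) / $n" | bc; done
.500
.666
.750
.800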

This sounds like a great question to ask over at Practical ZFS.

Thanks - I’ll try for confirmation of my findings there.

Thread on Practical ZFS where I cover the same question. tl;dr: yes, it’s just df lying to you.