Ok - I figured it out. (I think)
This article was key to understanding what was going on.
This is my mental model of what is happening - it may be wrong, but experimental evidence supports my conclusions.
When you create a RAIDZ1, there is always a usable-capacity percentage, because you give up space to parity. When you create a RAIDZ1 with only 2 drives, you get only ~48% usable capacity. Thus, if I take 2 x 10GB drives, ending up with only 9.5GB free on the resulting filesystem makes a lot of sense.
Due to how ZFS expansion works for RAIDZ1, it doesn’t re-write the existing data at the new data-to-parity ratio - so the efficiency of that storage stays at the ~48% level - but you do get more raw space. Additionally, the amount of ‘free’ space reported is, I believe, an estimate based on that same ~48% efficiency number. Thus when we add the 3rd 10G drive, for a total of 30G of raw space, we end up with only 15G reported as usable based on that guess.
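A quick way to see the two views side by side (these are standard OpenZFS commands; the exact numbers will differ on your system):

$ zpool list dpool   # raw pool size - all disks counted, parity included
$ zfs list dpool     # usable-space estimate - still scaled, I believe, by the original 2-disk ratio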
If we started with 3 drives and expanded to 4, the storage efficiency would instead be locked in at ~64%. This matches the experiments done above in this post.
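As a sanity check on those percentages: the ideal RAIDZ1 efficiency for n drives is (n-1)/n, and ZFS metadata and allocation overhead shave a few points off of that - which is roughly where the ~48% (2 drives) and ~64% (3 drives) figures come from. A quick back-of-the-envelope check:

$ for n in 2 3 4; do echo "scale=2; ($n - 1) * 100 / $n" | bc; done
50.00
66.66
75.00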
Now, just because the free-space guess says 15GB doesn’t mean that’s the limit: in this naive example where the volume is empty, we can actually store a lot more than 15GB in it. This means that ZFS expansion is actually not a bad deal, but it depends on how much data you had stored there before adding a drive (because all of that old data is less space efficient).
Let’s just do an example that should make the point clear. A RAIDZ1 with 4 x 10G drives will give us ~29G of storage. What if we build that same 4-drive array, starting with 2 drives and then growing to 4 total?
$ sudo zpool create dpool raidz vda vdb
$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool           9.5G  128K  9.5G   1% /dpool
$ sudo zpool attach dpool raidz1-0 vdc
$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool            15G  128K   15G   1% /dpool
$ sudo zpool attach dpool raidz1-0 vdd
$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool            20G  128K   20G   1% /dpool
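One note on the attach step: the expansion itself runs in the background, and the extra capacity only shows up once the reflow finishes (nearly instant on an empty pool like this, slower with real data). To keep an eye on it:

$ zpool status dpool                      # shows the raidz expansion progress while it runs
$ sudo zpool wait -t raidz_expand dpool   # newer OpenZFS can block until the expansion completes, I believe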
Now let’s fill that pool up with random data (to avoid compression):
# head -c 35G /dev/urandom > /dpool/big_file
head: error writing 'standard output': No space left on device
# ls -l /dpool/
total 20079455
-rw-r--r-- 1 root root 30762213376 Oct 23 08:45 big_file
# ls -lh /dpool/
total 20G
-rw-r--r-- 1 root root 29G Oct 23 08:45 big_file
Neat - so I have a filesystem reporting 20GB of capacity, with a 29GB file on it. ZFS expansion works! You just get a bad estimate of how much free space you have.
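You can also peek behind df’s estimate at the pool-level accounting (raw numbers include parity, so they won’t line up with df exactly):

$ zpool list -o name,size,allocated,free dpool   # raw accounting across all four disks
$ zfs list -o space dpool                        # the dataset-level view that df is reporting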
What if we did the same 4 x 10G setup, but avoided expansion?
$ sudo zpool create dpool raidz vda vdb vdc vdd
$ df -h /dpool/
Filesystem      Size  Used Avail Use% Mounted on
dpool            29G  128K   29G   1% /dpool
$ sudo -i
# head -c 35G /dev/urandom > /dpool/big_file
head: error writing 'standard output': No space left on device
# ls -l /dpool/
total 30058412
-rw-r--r-- 1 root root 30780301312 Oct 23 08:53 big_file
# ls -lh /dpool/
total 29G
-rw-r--r-- 1 root root 29G Oct 23 08:53 big_file
The same 29GB fits, but this time the filesystem reported the full 29GB as available from the start.
If you look closely - you’ll notice the first file was 30762213376 bytes, and the second is 30780301312 - nearly the same… but not quite. My math says it is a 0.06% difference.
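(For the record, that figure comes from a quick bc check - the exact value is a shade under 0.059%:)

$ echo "scale=3; (30780301312 - 30762213376) * 100 / 30780301312" | bc
.058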
I’m fairly certain that if I had written, say, a 9G file to the original 2-disk pool before expanding, the difference would be bigger, since that data would keep the old 2-disk parity ratio. Still, we really are getting a good portion of the drives via ZFS expansion.
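I haven’t run that variant, but the experiment would look roughly like this (old_data and big_file are just placeholder names); I’d expect the combined total that fits to come in noticeably under 29G:

$ sudo zpool create dpool raidz vda vdb
# head -c 9G /dev/urandom > /dpool/old_data     # written at the 2-disk ratio
$ sudo zpool attach dpool raidz1-0 vdc          # let each expansion finish before the next attach
$ sudo zpool attach dpool raidz1-0 vdd
# head -c 35G /dev/urandom > /dpool/big_file    # new data is written at the 4-disk ratio
# ls -lh /dpool/                                # old_data + big_file should total less than 29G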
Good advice: if you can build the RAIDZ with the right number of disks up front, do so. Expansion does work, but you pay a bit of a tax, because the data-to-parity ratio improves as you add disks, and expansion doesn’t fix that ratio for previously written data.
I’m still a bit concerned I don’t have this quite right - because parity is parity… why does having more disks mean less parity overhead? (It comes down to surviving the loss of any one disk: with only 2 disks, the second disk effectively has to hold a copy of the data, like a mirror. With 3, one parity sector can cover two data sectors, and any two disks are enough to reconstruct everything - thus higher efficiency.)
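The idealized per-stripe picture (ignoring padding and small-block effects) squares with those numbers - RAIDZ1 spends one parity sector per stripe row regardless of width:

  2 disks:  D P      -> 1 of 2 sectors is data = 50% (effectively a mirror)
  3 disks:  D D P    -> 2 of 3 sectors is data = 67%
  4 disks:  D D D P  -> 3 of 4 sectors is data = 75%

Real pools land a few points below these because of metadata and allocation overhead.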