Zfs dedup on /nix/store -- Is it worth it?

On the flip side, my review server’s store is plateauing in size, even without auto-optimise:

$ zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
nixstore  1.81T   553G  1.27T        -         -    31%    29%  2.35x    ONLINE  -

It took several days, but running nix-store --optimise brought the dedup ratio down from 3.02x to 1.62x, while the store size remained relatively constant.

$ zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
nixstore  1.81T   814G  1.02T        -         -    50%    43%  1.62x    ONLINE  -

I don’t think the space values are correctly reported with zfs:

$ nix-store --optimise
18900.37 MiB freed by hard-linking 593001 files

Running optimise on the nix store may make the dedup cache more effective? I’ve noticed my RAM usage went way up (50GB to 180GB idle), though this could also have been from the optimise command populating the cache.
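One way to see how much of that RAM is the dedup table itself, rather than ordinary ARC warming, is zpool status -D, which prints the DDT entry count and per-entry in-core size; entries × bytes-in-core roughly approximates the table’s memory footprint:

$ zpool status -D nixstore
(look for the “dedup: DDT entries N, size … on disk, … in core” line)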

Either way, there seems to be significant overlap in benefits between nix-store --optimise and zfs dedup.

2 Likes

NB: Apologies for the necrobump; this came up recently and is written for the sake of posterity rather than trying to directly tell Jon anything he doesn’t already know…


Yes, content that ZFS was previously deduping “behind the scenes” is now visibly dedup’d with filesystem hard links. Lots of extra work to move from one layer to another for no benefit.

No, not really. ZFS indexes the hash of every block written (recordsize=128k by default), even if there’s only one copy, so that it can dedup a second copy that arrives later. This kind of change won’t make the DDT smaller or more effective unless distinct data is removed. If anything, it makes it less effective, since there are fewer deduplicated blocks for the same DDT.
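The block size being indexed (and whether dedup is on at all) is a per-dataset property; a quick check, with illustrative names and values:

$ zfs get recordsize,dedup nixstore
NAME      PROPERTY    VALUE  SOURCE
nixstore  recordsize  128K   default
nixstore  dedup       on     local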

Yes, they do the same work, in a different way. zfs dedup is inline, with its cost amortised over each write, and has some metadata and seek overhead that’s almost invisible and entirely worthwhile, unless you’re particularly low on memory and have high-latency rotating media, at which point it can suddenly throw you off a cliff.

The hard-link approach does a whole lot of additional IO and CPU work to recalculate checksums (which zfs is already computing anyway), but that work can be scheduled for off hours and idle times. The cost is also proportional to the size of the store, even if it has been fully or recently optimised already, so it might throw you off a cliff repeatedly (and slowly, over the course of several days, if you’re like Jon).
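If you do keep the scheduled pass, one simple way (a sketch, not a prescription) to keep it from throwing you off that cliff is to run it at idle priority:

$ nice -n 19 ionice -c 3 nix-store --optimise
(ionice -c 3 is the idle IO class on Linux, so the pass only competes for the disks when nothing else wants them)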

There’s one other difference: zfs dedup works at the block level; nix store dedup works at the file level. This means that where there are only small changes within a file between revisions, the blocks before (and potentially after, if the alignment doesn’t shift) the changed region can still be dedup’d by zfs. 1.62x seems like a lot. Your use case of reviewing changes may well mean many more copies of slightly-different (and large) files than usual, and maybe there are parts of the store that the optimise doesn’t consider?
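If you want to estimate how much block-level dedup would help before (or without) enabling it, zdb can simulate a DDT over the existing data; it’s slow and read-heavy on a large pool, but harmless:

$ zdb -S nixstore
(walks the pool, builds a simulated dedup table, and prints a histogram plus an overall dedup ratio)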

My strong recommendation is to skip the work of doing both. ZFS is worthwhile generally, and on anything like a reasonable modern (not highly-constrained) system, if you have ZFS you might as well use dedup on the nix store. You will save yourself IO, including flash wear for large blocks.
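For concreteness, turning it on is a one-liner (dataset name illustrative); note that only data written afterwards gets dedup’d, existing blocks are not rewritten:

$ zfs set dedup=on nixstore/nix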

These days, or in the near future, I think the better comparison is the content-addressed store format.

1 Like

@uep note that Nix supports auto-optimisation, where each store path is individually replaced with hard links into /nix/store/.links as it is added to the store. This is very efficient because, if you have it enabled from the start, it never has to scan the entire store to optimise it; it simply adds or reuses hard links when each build / substitution completes.
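The setting is auto-optimise-store in nix.conf (nix.settings.auto-optimise-store on NixOS); a quick way to confirm it is enabled:

$ grep auto-optimise /etc/nix/nix.conf
auto-optimise-store = true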

This is substantially less overhead than the risks of zfs dedup, and it accomplishes most of the same thing at just about no cost, since you move the cryptographic hashing out of zfs (in favor of something like fletcher4) and into nix. In fact, because ZFS dedup has to recompute the expensive cryptographic hash on every single read (to verify the checksum), you will likely notice a significant CPU cost just from reading from the store. And it’s not necessarily true to say that it’s OK on SSDs, because, according to some TrueNAS docs:

Data rates of 50,000-300,000 4K I/O per second (IOPS) have been reported by the TrueNAS community for SSDs handling DDT

That is significant even for SSDs.

I very much do not recommend ZFS dedup on any system unless data storage is literally the only thing it does and your CPU can do a cryptographic hash extremely quickly.
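As an aside, the per-dataset cost is easy to inspect; the names and output below are illustrative. My understanding is that dedup=on forces a cryptographic checksum (sha256 by default) on that dataset’s data blocks even when the checksum property is left at its fletcher4 default, which is why the read-side hashing cost comes along with it:

$ zfs get checksum,dedup rpool/nix
NAME       PROPERTY  VALUE  SOURCE
rpool/nix  checksum  on     default
rpool/nix  dedup     off    default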

2 Likes

[auto optimisation]

That is not an option I was aware of; I only knew of the scheduled systemd timer. It certainly changes the tradeoffs, although it sounds like it still writes out files (during the build) and dedups them afterwards. Still, very useful on (say) a Pi3 without enough memory for zfs.

[TrueNAS IOPS]

A good example of the cliff. Applying dedup over too much data (like maybe an entire NAS media store), on insufficient hardware, can readily tip over. On a well-chosen, suitably specific set of data, it’s good. The nix store, with several generations of packages that mostly change only in their paths because of Merkle-tree hash cascades, is a very good use case. Almost an ideal one.

The same reasons make the nix-native option good too.

Note carefully, my recommendation was to avoid doing both.

although it sounds like it still writes out files (during build) and dedups later

I doubt that this is a significant cost at all. Files that aren’t dedup’d this way are simply moved into the .links directory, and files that are dedup’d are immediately deleted, so they likely never even make it to disk, since they probably weren’t written with synchronous writes.
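If you want to eyeball what optimise has done on a live store, the hard links are directly visible (a rough sketch, not an exact accounting):

$ find /nix/store -type f -links +1 | wc -l
(store files whose contents are shared through a hard link)
$ du -sh /nix/store/.links
(the pool of unique file contents the links point into)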

Also, I ran a fio random read test (not testing writes here; I want to know the impact on read performance due to cryptographic hashing). I set the size of the test to something that wouldn’t fit entirely in ARC on my machine, and I didn’t even enable dedup, because I want to show that the cost of the hashing alone is enormous.

fio --name=random-read --ioengine=posixaio --rw=randread --bs=1m --size=48g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1

It’s about half as fast with checksum=sha256 on my Threadripper 1950X + Samsung 960 Pro + 64G of memory (32G ARC limit, so even this test benefitted tremendously from ARC), and that’s without the extra overhead of dedup: there are significant performance costs to the checksumming alone. I bet this would have a fairly large effect on boot times, when the ARC is completely cold.
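If anyone wants to reproduce the comparison, two throwaway datasets are enough (pool and dataset names here are made up):

$ zfs create -o checksum=fletcher4 tank/fio-fletcher
$ zfs create -o checksum=sha256 tank/fio-sha256
(write the same test file into each, keep the zfs_arc_max limit in place, and run the fio command above against both)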

Yes, I have and want those checksums anyway; I might as well use them for dedup as well.

Most people don’t use cryptographic checksums with ZFS, precisely because they’re slow :stuck_out_tongue:

1 Like

This is a sidebar for folks who haven’t already settled on zfs, but I implemented a NixOS module for bees (userland-driven deduplication for btrfs) with specifically this use case in mind, and it’s worked well in the places where I needed it.

3 Likes

Sorry to necrobump, but after reading this thread I had the impression that I would need a monster machine with terabytes of RAM that had built all of nixpkgs in existence to see any benefit from zfs dedup for /nix.

But just for kicks, while building a new machine I turned dedup on for my /nix volume, and I keep it on a small partition to prevent it from getting too large.

I have been surprised that it does not use up terrible amounts of RAM (this machine has 32GB), and it increases the number of generations I can keep around without garbage collecting, which extends that particular superpower of NixOS and makes it even more interesting.

So if you were on the fence on this one I encourage you to try it out.

Here are a few stats. Upon enabling dedup and letting a few generations go through, I had over a 1.5x dedup ratio; after running an optimise (which only took a few minutes) I am down to a 1.08x ratio. Both machines are set up as desktops with Plasma/KDE and a normal-ish number of apps installed.

$ zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
nixdedup   190G  27.2G   163G        -         -     5%    14%  1.08x    ONLINE  -
nixpool   4.83T  6.69G  4.82T        -         -     0%     0%  1.00x    ONLINE  -

 
$ zpool status -D nixdedup
  pool: nixdedup
 state: ONLINE
  scan: scrub repaired 0B in 00:00:13 with 0 errors on Sun May 26 02:00:13 2024
config:

        NAME                                    STATE     READ WRITE CKSUM
        nixdedup                                ONLINE       0     0     0
          2b978b83-0e0a-4558-a8cf-687c57b55f18  ONLINE       0     0     0
          78b485a4-7ffb-4390-bfbd-90e20f4e5091  ONLINE       0     0     0
          27ac861c-e179-4e1b-abf9-4884a4d6b2ed  ONLINE       0     0     0

errors: No known data errors

 dedup: DDT entries 867328, size 330B on disk, 193B in core

bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1     797K   44.1G   22.8G   23.5G     797K   44.1G   22.8G   23.5G
     2    48.7K   4.91G   1.87G   1.89G    98.5K   9.95G   3.78G   3.83G
     4      780   97.1M   37.1M   37.1M    3.58K    456M    173M    174M
     8      108   13.0M   9.64M   9.65M    1.08K    133M    100M    101M
    16        3    384K     12K     12K       59   7.38M    236K    236K
    32        1    128K      4K      4K       58   7.25M    232K    232K
   128        1    128K      4K      4K      198   24.8M    792K    792K
    1K        1    512B    512B      4K    1.57K    803K    803K   6.27M
 Total     847K   49.1G   24.8G   25.5G     902K   54.6G   26.9G   27.6G
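(If I’m reading that histogram right, the zpool list figure falls straight out of the Total row: referenced DSIZE / allocated DSIZE = 27.6G / 25.5G ≈ 1.08x.)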

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           32006        7512       14941         729        9553       23324
Swap:          17024           0       17024

And on another machine with much more RAM, but nearly identical stats:

$ zpool list
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
nixdedup    214G  26.4G   188G        -         -     3%    12%  1.09x    ONLINE  -

$ zpool status -D nixdedup
  pool: nixdedup
 state: ONLINE
config:

        NAME                                      STATE     READ WRITE CKSUM
        nixdedup                                  ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            9cfe86dd-14e1-4b3a-98e8-f14acb1ab87e  ONLINE       0     0     0
            d6adadef-05                           ONLINE       0     0     0
            1a50fee7-08                           ONLINE       0     0     0
            cada0105-3850-4e77-95a3-e15b636491e9  ONLINE       0     0     0

errors: No known data errors

 dedup: DDT entries 695536, size 323B on disk, 195B in core

bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1     640K   31.8G   14.4G   15.9G     640K   31.8G   14.4G   15.9G
     2    39.0K   3.67G   1.38G   1.43G    78.6K   7.42G   2.78G   2.89G
     4      719   88.7M   35.6M   36.0M    3.26K    413M    164M    166M
     8       10    898K    222K    244K      106   10.3M   2.57M   2.76M
    16        3    384K     12K   17.4K       59   7.38M    236K    343K
    64        2    256K      8K   11.6K      153   19.1M    612K    889K
    1K        1    512B    512B   5.81K    1.06K    542K    542K   6.16M
 Total     679K   35.5G   15.8G   17.4G     723K   39.6G   17.4G   19.0G

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           80316        6744       71497         748        2074       72074
Swap:         132546           0      132546
1 Like