NB: Apologies for the necrobump; this came up recently and is written for the sake of posterity rather than trying to directly tell Jon anything he doesn’t already know…
Yes, content that ZFS was previously deduping “behind the scenes” is now visibly dedup’d with fs hard links. Lots of extra work to move the deduplication from one layer to another, for no benefit.
No, not really. ZFS indexes the hash of every block written (recordsize=128k by default), even if there’s only one copy, so that it can dedup a second copy that arrives later. Hard-linking duplicates at the file level won’t make the DDT smaller or more effective, unless distinct data is removed. If anything, it makes it less effective, since there are fewer deduplicated blocks to show for the same DDT overhead.
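As a rough sketch of that write path (not ZFS internals; SHA-256 and a Python dict stand in for the block checksum and the on-disk DDT):

```python
import hashlib

RECORDSIZE = 128 * 1024  # ZFS default recordsize

# Toy DDT: block hash -> reference count. Every unique block gets an
# entry too (refcount == 1), which is where the memory goes.
ddt = {}

def write(data: bytes) -> int:
    """Write data block by block; return how many blocks were dedup'd."""
    deduped = 0
    for i in range(0, len(data), RECORDSIZE):
        h = hashlib.sha256(data[i:i + RECORDSIZE]).digest()
        if h in ddt:
            ddt[h] += 1   # duplicate: bump refcount, allocate nothing
            deduped += 1
        else:
            ddt[h] = 1    # unique block still costs a DDT entry
    return deduped

# Four distinct blocks: nothing dedups, but the DDT grows by four.
data = b"".join(bytes([i]) * RECORDSIZE for i in range(4))
write(data)   # 0 blocks dedup'd, len(ddt) == 4
write(data)   # 4 blocks dedup'd, len(ddt) still 4
```

Hard-linking nix-store duplicates removes second copies before they’re written, so refcounts drop, but the unique-block entries (the bulk of the table) remain.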
Yes, they do the same work, in different ways. ZFS dedup is inline: its cost is amortised over each write, and it adds some metadata and seek overhead that’s almost invisible and entirely worthwhile, unless you’re particularly low on memory and have high-latency rotating media, at which point it can suddenly throw you off a cliff.
The hard-link approach does a whole lot of additional IO and CPU work to recalculate checksums (which ZFS is already computing anyway), but that work can be scheduled for off hours and idle times. The cost is also proportional to the size of the store, even if it has been fully or recently optimised already, so it might throw you off a cliff repeatedly (and slowly, over the course of several days, if you’re like Jon).
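In file-level terms it’s roughly this (a hypothetical sketch, not the actual `nix-store --optimise` implementation, which links through a `.links` directory and differs in detail):

```python
import hashlib
import os

def optimise(store: str) -> int:
    """Re-hash every file under `store` and hard-link duplicates.

    Every run re-reads the whole store, so the cost scales with store
    size even when a previous run already linked everything.
    """
    seen = {}      # content hash -> canonical path
    linked = 0
    for dirpath, _dirs, files in os.walk(store):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).digest()  # redundant IO+CPU
            canonical = seen.setdefault(digest, path)
            if (canonical != path
                    and os.stat(canonical).st_ino != os.stat(path).st_ino):
                os.unlink(path)             # drop the duplicate copy
                os.link(canonical, path)    # point it at the canonical file
                linked += 1
    return linked
```

Note the re-hash happens even for files that were already linked on a previous run, which is the “proportional to the size of the store” cost above.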
There’s one other difference: zfs dedup works at the block level; nix store dedup works at the file level. This means that where a file differs only slightly between revisions, the blocks before the change (and potentially after it, if the alignment doesn’t shift) can still be dedup’d by zfs. 1.62x seems like a lot. Your use case of reviewing changes may well mean many more copies of slightly-different (and large) files than usual, and maybe there are parts of the store that the optimise doesn’t consider?
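To make the block-vs-file distinction concrete (again with SHA-256 as a stand-in for the block checksum; the file size is an arbitrary assumption):

```python
import hashlib

RECORDSIZE = 128 * 1024

def block_hashes(data: bytes) -> list:
    return [hashlib.sha256(data[i:i + RECORDSIZE]).digest()
            for i in range(0, len(data), RECORDSIZE)]

old = b"".join(bytes([i]) * RECORDSIZE for i in range(8))  # 1 MiB, 8 blocks

# In-place one-byte change: file-level dedup sees a brand-new file,
# block-level dedup still shares 7 of the 8 blocks.
new = bytearray(old)
new[3 * RECORDSIZE + 10] ^= 0xFF
new = bytes(new)
shared = sum(a == b for a, b in zip(block_hashes(old), block_hashes(new)))
# shared == 7, even though the whole-file hashes differ

# A one-byte insertion shifts the alignment of everything after it:
# only the blocks before the change still match.
ins = old[:3 * RECORDSIZE] + b"\xff" + old[3 * RECORDSIZE:]
shared_ins = sum(a == b for a, b in zip(block_hashes(old), block_hashes(ins)))
# shared_ins == 3
```

Appending to a file keeps all the old blocks aligned, which is why log-like growth dedups well at the block level and in-place format changes don’t.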
My strong recommendation is to skip the work of doing both. ZFS is worthwhile generally, and on anything like a reasonably modern (not highly constrained) system, if you have ZFS you might as well use its dedup on the nix store. You will save yourself IO, including flash wear, for large blocks.
These days, or in the near future, I think the better comparison is the content-addressed store format.