Zfs dedup on /nix/store -- Is it worth it?

On the flip side, my review server’s store is plateauing in size, even without auto-optimise:

$ zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
nixstore  1.81T   553G  1.27T        -         -    31%    29%  2.35x    ONLINE  -

It took several days, but running nix-store --optimise brought the dedup ratio down from 3.02x to 1.62x, while the store size remained relatively constant.

$ zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
nixstore  1.81T   814G  1.02T        -         -    50%    43%  1.62x    ONLINE  -

I don’t think the space values are correctly reported with zfs:

$ nix-store --optimise
18900.37 MiB freed by hard-linking 593001 files

Running optimise on the nix store may make the dedup cache more effective? I’ve noticed my RAM usage went way up (50GB to 180GB idle), though this could also have been from the optimise command populating the cache.
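One way to see how much of that RAM is the dedup table itself, rather than ordinary ARC warming, is zpool status -D, which prints the DDT entry count and per-entry in-core size; entries × bytes-in-core roughly approximates the table’s memory footprint:

$ zpool status -D nixstore
(look for the “dedup: DDT entries N, size … on disk, … in core” line)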

Either way, there seems to be significant overlap in benefits between nix-store --optimise and zfs dedup.

2 Likes

NB: Apologies for the necrobump; this came up recently and is written for the sake of posterity rather than trying to directly tell Jon anything he doesn’t already know…


Yes, content that ZFS was previously deduping “behind the scenes” is now visibly dedup’d with filesystem hard links. Lots of extra work to move from one layer to another for no benefit.

No, not really. ZFS indexes the hash of every block written (recordsize=128k by default), even if there’s only one copy, so that it can dedup a second copy that arrives later. This kind of change won’t make the DDT smaller or more effective unless distinct data is removed. If anything, it makes it less effective, since there are fewer deduplicated blocks for the same DDT.
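The block size being indexed (and whether dedup is on at all) is a per-dataset property; a quick check, with illustrative names and values:

$ zfs get recordsize,dedup nixstore
NAME      PROPERTY    VALUE  SOURCE
nixstore  recordsize  128K   default
nixstore  dedup       on     local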

Yes, they do the same work, in a different way. zfs dedup is inline, with its cost amortised over each write, and has some metadata and seek overhead that’s almost invisible and entirely worthwhile, unless you’re particularly low on memory and have high-latency rotating media, at which point it can suddenly throw you off a cliff.

The hard-link approach does a whole lot of additional IO and CPU work to recalculate checksums (which zfs is already computing anyway), but that work can be scheduled for off hours and idle times. The cost is also proportional to the size of the store, even if it has been fully or recently optimised already, so it might throw you off a cliff repeatedly (and slowly, over the course of several days, if you’re like Jon).
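If you do keep the scheduled pass, one simple way (a sketch, not a prescription) to keep it from throwing you off that cliff is to run it at idle priority:

$ nice -n 19 ionice -c 3 nix-store --optimise
(ionice -c 3 is the idle IO class on Linux, so the pass only competes for the disks when nothing else wants them)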

There’s one other difference: zfs dedup works at the block level; nix store dedup works at the file level. This means that where there are only small changes within a file between revisions, the blocks before (and potentially after, if the alignment doesn’t shift) the changed region can still be dedup’d by zfs. 1.62x seems like a lot. Your use case of reviewing changes may well mean many more copies of slightly-different (and large) files than usual, and maybe there are parts of the store that the optimise doesn’t consider?
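If you want to estimate how much block-level dedup would help before (or without) enabling it, zdb can simulate a DDT over the existing data; it’s slow and read-heavy on a large pool, but harmless:

$ zdb -S nixstore
(walks the pool, builds a simulated dedup table, and prints a histogram plus an overall dedup ratio)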

My strong recommendation is to skip the work of doing both. ZFS is worthwhile generally, and on anything like a reasonable modern (not highly-constrained) system, if you have ZFS you might as well use dedup on the nix store. You will save yourself IO, including flash wear for large blocks.
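For concreteness, turning it on is a one-liner (dataset name illustrative); note that only data written afterwards gets dedup’d, existing blocks are not rewritten:

$ zfs set dedup=on nixstore/nix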

These days, or in the near future, I think the better comparison is the content-addressed store format.

1 Like

@uep note that Nix supports auto-optimisation, where each store path is individually replaced with hard links into /nix/store/.links as it is added to the store. This is very efficient because, if you have it enabled from the start, it never has to scan the entire store to optimise it; it simply adds or reuses hard links when each build / substitution completes.
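The setting is auto-optimise-store in nix.conf (nix.settings.auto-optimise-store on NixOS); a quick way to confirm it is enabled:

$ grep auto-optimise /etc/nix/nix.conf
auto-optimise-store = true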

This is substantially less overhead than the risks of zfs dedup, and it accomplishes most of the same thing at just about no cost, since you move the cryptographic hashing out of zfs (in favor of something like fletcher4) and into nix. In fact, because ZFS dedup has to recompute the expensive cryptographic hash on every single read (to verify the checksum), you will likely notice a significant CPU cost just from reading from the store. And it’s not necessarily true to say that it’s OK on SSDs, because, according to some TrueNAS docs:

Data rates of 50,000-300,000 4K I/O per second (IOPS) have been reported by the TrueNAS community for SSDs handling DDT

That is significant even for SSDs.

I very much do not recommend ZFS dedup on any system unless data storage is literally the only thing it does and your CPU can do a cryptographic hash extremely quickly.
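As an aside, the per-dataset cost is easy to inspect; the names and output below are illustrative. My understanding is that dedup=on forces a cryptographic checksum (sha256 by default) on that dataset’s data blocks even when the checksum property is left at its fletcher4 default, which is why the read-side hashing cost comes along with it:

$ zfs get checksum,dedup rpool/nix
NAME       PROPERTY  VALUE  SOURCE
rpool/nix  checksum  on     default
rpool/nix  dedup     off    default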

2 Likes

[auto optimisation]

That is not an option I was aware of; I only knew of the scheduled systemd timer. It certainly changes the tradeoffs, although it sounds like it still writes out files (during the build) and dedups them afterwards. Still, very useful on (say) a Pi3 without enough memory for zfs.

[TrueNAS IOPS]

A good example of the cliff. Applying dedup over too much data (like maybe an entire NAS media store), on insufficient hardware, can readily tip over. On a well-chosen, suitably specific set of data, it’s good. The nix store, with several generations of packages that mostly change only in their paths because of Merkle-tree hash cascades, is a very good use case. Almost an ideal one.

The same reasons make the nix-native option good too.

Note carefully, my recommendation was to avoid doing both.

although it sounds like it still writes out files (during build) and dedups later

I doubt that this is a significant cost at all. Files that aren’t dedup’d this way are simply moved into the .links directory, and files that are dedup’d are immediately deleted, so they likely never even make it to disk, since they probably weren’t written with synchronous writes.
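If you want to eyeball what optimise has done on a live store, the hard links are directly visible (a rough sketch, not an exact accounting):

$ find /nix/store -type f -links +1 | wc -l
(store files whose contents are shared through a hard link)
$ du -sh /nix/store/.links
(the pool of unique file contents the links point into)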

Also, I ran a fio random read test (not testing writes here; I want to know the impact on read performance due to cryptographic hashing). I set the size of the test to something that wouldn’t fit entirely in ARC on my machine, and I didn’t even enable dedup, because I want to show that the cost of the hashing alone is enormous.

fio --name=random-read --ioengine=posixaio --rw=randread --bs=1m --size=48g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1

It’s about half as fast with checksum=sha256 on my Threadripper 1950X + Samsung 960 Pro + 64G of memory (32G ARC limit, so even this test benefitted tremendously from ARC), and that’s without the extra overhead of dedup: there are significant performance costs to the checksumming alone. I bet this would have a fairly large effect on boot times, when the ARC is completely cold.
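If anyone wants to reproduce the comparison, two throwaway datasets are enough (pool and dataset names here are made up):

$ zfs create -o checksum=fletcher4 tank/fio-fletcher
$ zfs create -o checksum=sha256 tank/fio-sha256
(write the same test file into each, keep the zfs_arc_max limit in place, and run the fio command above against both)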

Yes, I have and want those checksums anyway; I might as well use them for dedup as well.

Most people don’t use cryptographic checksums with ZFS, precisely because they’re slow :stuck_out_tongue:

1 Like

This is a sidebar for folks who haven’t already settled on zfs, but I implemented a NixOS module for bees (userland-driven deduplication for btrfs) with specifically this use case in mind, and it’s worked well in the places where I needed it.

3 Likes

Sorry to necrobump, but after reading this thread I had the impression that I would need a monster machine with terabytes of RAM that had built all of nixpkgs in existence to see any benefit from zfs dedup for /nix.

But just for kicks, while building a new machine I turned dedup on for my /nix volume, and I keep it on a small partition to prevent it from getting too large.

I have been surprised that it does not use up terrible amounts of RAM (this machine has 32GB), and it increases the number of generations I can keep around without garbage collecting, which extends that particular superpower of NixOS and makes it even more interesting.

So if you were on the fence on this one I encourage you to try it out.

Here are a few stats. Upon enabling dedup and letting a few generations go through, I had over a 1.5x dedup ratio; after running an optimise (which only took a few minutes) I am down to a 1.08x ratio. Both machines are set up as desktops with Plasma/KDE and a normal-ish number of apps installed.

$ zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
nixdedup   190G  27.2G   163G        -         -     5%    14%  1.08x    ONLINE  -
nixpool   4.83T  6.69G  4.82T        -         -     0%     0%  1.00x    ONLINE  -

 
$ zpool status -D nixdedup
  pool: nixdedup
 state: ONLINE
  scan: scrub repaired 0B in 00:00:13 with 0 errors on Sun May 26 02:00:13 2024
config:

        NAME                                    STATE     READ WRITE CKSUM
        nixdedup                                ONLINE       0     0     0
          2b978b83-0e0a-4558-a8cf-687c57b55f18  ONLINE       0     0     0
          78b485a4-7ffb-4390-bfbd-90e20f4e5091  ONLINE       0     0     0
          27ac861c-e179-4e1b-abf9-4884a4d6b2ed  ONLINE       0     0     0

errors: No known data errors

 dedup: DDT entries 867328, size 330B on disk, 193B in core

bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1     797K   44.1G   22.8G   23.5G     797K   44.1G   22.8G   23.5G
     2    48.7K   4.91G   1.87G   1.89G    98.5K   9.95G   3.78G   3.83G
     4      780   97.1M   37.1M   37.1M    3.58K    456M    173M    174M
     8      108   13.0M   9.64M   9.65M    1.08K    133M    100M    101M
    16        3    384K     12K     12K       59   7.38M    236K    236K
    32        1    128K      4K      4K       58   7.25M    232K    232K
   128        1    128K      4K      4K      198   24.8M    792K    792K
    1K        1    512B    512B      4K    1.57K    803K    803K   6.27M
 Total     847K   49.1G   24.8G   25.5G     902K   54.6G   26.9G   27.6G
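(If I’m reading that histogram right, the zpool list figure falls straight out of the Total row: referenced DSIZE / allocated DSIZE = 27.6G / 25.5G ≈ 1.08x.)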

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           32006        7512       14941         729        9553       23324
Swap:          17024           0       17024

And on another machine with much more RAM, but nearly identical stats:

$ zpool list
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
nixdedup    214G  26.4G   188G        -         -     3%    12%  1.09x    ONLINE  -

$ zpool status -D nixdedup
  pool: nixdedup
 state: ONLINE
config:

        NAME                                      STATE     READ WRITE CKSUM
        nixdedup                                  ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            9cfe86dd-14e1-4b3a-98e8-f14acb1ab87e  ONLINE       0     0     0
            d6adadef-05                           ONLINE       0     0     0
            1a50fee7-08                           ONLINE       0     0     0
            cada0105-3850-4e77-95a3-e15b636491e9  ONLINE       0     0     0

errors: No known data errors

 dedup: DDT entries 695536, size 323B on disk, 195B in core

bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1     640K   31.8G   14.4G   15.9G     640K   31.8G   14.4G   15.9G
     2    39.0K   3.67G   1.38G   1.43G    78.6K   7.42G   2.78G   2.89G
     4      719   88.7M   35.6M   36.0M    3.26K    413M    164M    166M
     8       10    898K    222K    244K      106   10.3M   2.57M   2.76M
    16        3    384K     12K   17.4K       59   7.38M    236K    343K
    64        2    256K      8K   11.6K      153   19.1M    612K    889K
    1K        1    512B    512B   5.81K    1.06K    542K    542K   6.16M
 Total     679K   35.5G   15.8G   17.4G     723K   39.6G   17.4G   19.0G

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           80316        6744       71497         748        2074       72074
Swap:         132546           0      132546
1 Like