ZFS dedup on /nix/store -- Is it worth it?

According to “linux - How can I determine the current size of the ARC in ZFS, and how does the ARC relate to free or cache memory?” on Super User, you can read /proc/spl/kstat/zfs/arcstats for ARC metrics.

And it seems to align with my system (256 GB of RAM):

[12:00:03] jon@nixos ~
$ awk '/^size/ { print $1 " " $3 / 1048576 }' < /proc/spl/kstat/zfs/arcstats
size 41537.6
$ sudo zfs list
NAME             USED  AVAIL     REFER  MOUNTPOINT
nixstore         747G  1.03T       24K  /nixstore
nixstore/store   746G  1.03T      746G  legacy
tank             328G  6.70T      104K  /tank
tank/movies     7.09G  6.70T     7.09G  /tank/movies
tank/nixstore    112K  6.70T      112K  /tank/nixstore
tank/swap        272G  6.96T     5.55G  -
tank/torrents   48.4G  6.70T     48.4G  /tank/torrents

I use lz4 compression along with nix-store --optimise, but I may just switch to ZFS dedup if I do it again.
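
For reference, that setup is roughly the following (dataset name taken from the listing above; adjust to your own pool):

$ zfs set compression=lz4 nixstore/store   # transparent lz4 on the store dataset
$ zfs get compression,compressratio nixstore/store
$ nix-store --optimise                     # hard-link identical files in /nix/store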

1 Like

dedup property can be set at any time …
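
For example (dataset name from earlier in the thread; dedup only applies to blocks written after it is enabled):

$ zfs set dedup=on nixstore/store
$ zfs get dedup nixstore/store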

Yes, but I already set my nix store to optimise, and I think it may be largely duplicated work.

However, I may try it the next time I run nix-collect-garbage.

The big issue with dedup is that it uses quite a lot of memory, and it’s scattered randomly across the disk, making it slow to read (and AFAIK, the entire table has to be read just to import the pool). Like it can take hours just to import a large array made of HDDs.

For the nix store… Eh, for most people I guess it’s small enough to not be such an issue. I wouldn’t count on it being all that much better than auto-optimize though.

4 Likes

ZFS dedup actually works while writing data. It calculates the checksum of a new block and then checks its table to see whether a block with the same checksum is already on disk. If so, instead of writing the new block, ZFS just points to the old block in the metadata. This means that ZFS needs the full table of all blocks and their checksums in RAM while writing, plus a fast CPU. If it does not fit into the ZFS ARC, ZFS will happily re-read the missing part of the table from disk for each write. ZFS dedup is heavily biased towards servers with loads of RAM, and enabling it without calculating the required RAM and adjusting the ZFS ARC size for it may cause massive performance hits, either instantly or later on, when the block table becomes too large.
Due to the way the ZFS ARC size is set, you may not even see a memory increase, as ZFS happily uses ~50% of RAM for its own cache if it would otherwise be free. But that cache may shrink to make room for the block table it needs for dedup, while the ARC size remains the same.
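
If you want a rough idea of how big that table would be before enabling anything, zdb can simulate it against the existing data (pool name taken from earlier in this thread); multiplying the unique-block count it reports by the often-quoted ~320 bytes per DDT entry gives a rough in-core size estimate:

$ sudo zdb -S nixstore   # simulate dedup: prints a DDT histogram and an estimated ratio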

ZFS dedup is completely transparent during read, as it’s just a block pointer. If that block is used multiple times, it simply does not matter. For that reason it also should not affect pool import times.
Good thing is that you can just disable it at any time.

3 Likes

Careful with the zfs set dedup=off and zfs set atime=off commands; one of these very likely destroyed my ZFS partition (I was on a ZFS root and issued them after creation and installation).

Also, the link I gave in my first post touches on the RAM usage; it depends on your data size, and it’s not that bad …

EDITED: Sorry my original post was unclear.

EDIT2: I meant running these commands afterwards. I otherwise have great experience with a dedup=off pool which started as such from the beginning.

I’ve been using atime=off and haven’t had any issues. Essentially it just prevents another write when accessing files, and since the nix store doesn’t care about access times it seems like a good fit.
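
On a dedicated store dataset (name taken from earlier in the thread), that is simply:

$ zfs set atime=off nixstore/store
$ zfs get atime,relatime nixstore/store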

2 Likes

This might be just for my use case of doing a lot of reviews, but dedup is really helping.

$ zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
nixstore  1.81T   462G  1.36T        -         -    29%    24%  1.79x    ONLINE  -
$ zfs get all nixstore/store | grep compressra
nixstore/store  compressratio         1.85x                            -

Might be able to extend the usefulness of the 2 TB well past its original 2 TB.

1 Like

This is without store auto-optimize, right? Can someone with auto-optimize on post their store stats? Specifically, I’m looking for two things:

  1. du -sh on nix store (will take at least 15 minutes)
  2. optimise reported savings (typically at the end of garbage collection; see the sketch below)
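
Something like the following should do, assuming store optimisation is already in use; the hard-link savings usually show up as a note at the end of the garbage-collection output:

$ du -sh /nix/store   # slow: walks the whole store
$ nix-store --gc      # savings note printed at the end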

This is with auto-optimize, I believe; at least I had it on and I’m not aware of a way to disable it.

One thing to note is that my server now floats at a baseline of 100-180 GB of ZFS ARC + dedup table (out of 256 GB). However, I haven’t really suffered memory pressure, so it hasn’t affected performance too much.
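
If that baseline ever becomes a problem, the ARC can be capped; a sketch, assuming OpenZFS on Linux and a 64 GiB limit picked arbitrarily:

$ echo $((64 * 1024 * 1024 * 1024)) | sudo tee /sys/module/zfs/parameters/zfs_arc_max

or persistently via options zfs zfs_arc_max=<bytes> in modprobe.d.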

Can’t believe I’m telling you :slight_smile: … but what does nixos-option nix.autoOptimiseStore return?

I ask because 1.79 with auto-optimise on is very suspicious … There’s nothing zfs dedup should add over /nix/store optimisation (zfs does block-level dedup, but saying that 79 out of every 179 blocks are identical while the files are not doesn’t seem right to me).

Can you keep an eye out during your next garbage collection and see what savings it reports?

It’s false; I guess I did run nix-store --optimise once, but that was a one-time action, not a persistent setting.

1 Like

After botching my previous ZFS install, I reinstalled NixOS on ZFS, and this time (by accident) I had both ZFS dedup and nix.autoOptimiseStore on. Amazingly, ZFS still managed to give me a 1.35x dedup ratio. Substantial, but I’m not sure I’d want to pay the RAM cost.

Compression ratio went up to 1.95 (maybe just a difference of data …?)

On the flip side, my review server’s store is plateauing in size, even without auto-optimize:

$ zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
nixstore  1.81T   553G  1.27T        -         -    31%    29%  2.35x    ONLINE  -

It took several days, but running nix-store --optimise brought the dedup ratio down from 3.02x to 1.62x, while the store size remained relatively constant.

$ zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
nixstore  1.81T   814G  1.02T        -         -    50%    43%  1.62x    ONLINE  -

I don’t think the space values are correctly reported with ZFS:

$ nix-store --optimise
18900.37 MiB freed by hard-linking 593001 files

Running optimise on the nix store may make the dedup cache more effective? I’ve noticed my RAM usage went way up (50 GB to 180 GB idle). This could also have been from the optimise command populating the cache.

Either way, there seems to be significant overlap in benefits with nix store optimize vs zfs dedup.

2 Likes

NB: Apologies for the necrobump; this came up recently and is written for the sake of posterity rather than trying to directly tell Jon anything he doesn’t already know…


Yes, content ZFS was previously deduping “behind the scenes” is now visibly dedup’d with fs hard links. Lots of extra work to move from one layer to another for no benefit.

No, not really. It indexes the hash of every block written (recordsize=128k by default), even if there’s only one copy, so it can dedup a second copy that arrives later. This kind of change won’t make the DDT smaller or more effective, unless distinct data is removed. If anything, it makes it less effective, since there are fewer deduplicated blocks for the same DDT.
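
On a pool that already has dedup enabled, zdb shows that table directly (pool name from earlier in the thread):

$ sudo zdb -DD nixstore   # DDT entry counts, on-disk / in-core sizes, refcount histogram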

Yes, they do the same work, in a different way. zfs dedup is inline, costs amortised over each write, and has some metadata and seek overhead that’s almost invisible and entirely worthwhile, unless you’re particularly low on memory and have high latency rotating media, at which point it can suddenly throw you off a cliff.

The hard-links do a whole lot of additional IO and CPU work to recalculate checksums (that zfs is already using anyway), but that work can be scheduled for off hours and idle times. The cost is also proportional to the size of the store, even if it’s fully / recently been optimised already, so it might throw you off a cliff repeatedly (and slowly over the course of several days if you’re like Jon).

There’s one other difference: zfs dedup works at the block level; nix store dedup works at the file level. This means that where there are only small changes within a file between revisions, the blocks before (and potentially after, if alignment doesn’t change) the changed part can still be dedup’d by zfs. 1.62x still seems like a lot, though. Your use case of reviewing changes may well mean many more copies of slightly-different (and large) files than usual, and maybe there are parts of the store that the optimise doesn’t consider?

My strong recommendation is to skip the work of doing both. ZFS is worthwhile generally, and on anything like a reasonably modern (not highly constrained) system, if you have ZFS you might as well use dedup on the nix store. You will save yourself IO, including flash wear for large blocks.

These days, or in the near future, I think the better comparison is the content-addressed store format.

1 Like

@uep note that Nix supports auto-optimisation, where a store path is individually replaced with hard links into /nix/store/.links after it is added to the store. This is very efficient because it doesn’t have to scan the entire store to optimise it completely if you have it enabled from the start; it simply adds or reuses hard links when the build / substitution completes.
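
A quick way to check that it is on and doing something (the first option name is the older spelling used earlier in this thread; newer releases use nix.settings.auto-optimise-store):

$ nixos-option nix.autoOptimiseStore
$ ls /nix/store/.links | wc -l   # hard-link targets maintained by the optimiser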

This is substantially less overhead than the risks of zfs dedup and will accomplish the majority of the same thing at just about no cost, since it moves the cryptographic hashing out of zfs (in favor of something like fletcher4) and into nix. In fact, because ZFS dedup has to re-perform the expensive cryptographic hash on every single read, you will likely notice a significant CPU cost just from reading from the store. And it’s not necessarily true to say that it’s OK on SSDs, because, according to some TrueNAS docs:

Data rates of 50,000-300,000 4K I/O per second (IOPS) have been reported by the TrueNAS community for SSDs handling DDT

That is significant even for SSDs.

I very much do not recommend ZFS dedup on any system unless data storage is literally the only thing it does and your CPU can do a cryptographic hash extremely quickly.

2 Likes

[auto optimisation]

That is not an option I was aware of; I only knew about the scheduled systemd timer. It certainly changes the tradeoffs, although it sounds like it still writes out files (during the build) and dedups later. Still, very useful on (say) a Pi 3 without enough memory for zfs.

[TrueNas IOPS]

A good example of the cliff. Applying dedup over too much data (like maybe an entire NAS media store), on insufficient hardware, can readily tip over. On a well-chosen, suitably specific set of data, it’s good. The nix store, with several generations of packages that really mostly only change paths because of merkle-tree cascades, is a very good use case. Almost an ideal one.

The same reasons make the nix-native option good too.

Note carefully, my recommendation was to avoid doing both.

although it sounds like it still writes out files (during build) and dedups later

I doubt that this is a significant cost at all. Files that aren’t dedup’d this way are simply moved to the .links directory, and files that are dedup’d are immediately deleted, so they likely never even made it to disk, since they probably weren’t written with synchronous writes.

Also, I ran a fio random-read test (not testing writes here; I want to know the impact on read performance due to cryptographic hashing). I set the size of the test to something that wouldn’t fit entirely in ARC on my machine. And I didn’t even enable dedup, because I want to show that the cost of the hashing alone is enormous.

$ fio --name=random-read --ioengine=posixaio --rw=randread --bs=1m --size=48g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1

It’s about half as fast with checksum=sha256 on my Threadripper 1950X + Samsung 960 Pro + 64 GB of memory (32 GB ARC limit, so even this test benefited tremendously from ARC), and that is without the extra overhead of dedup. There are significant performance costs from the checksumming alone. I bet this would have a fairly large effect on boot times, when the ARC is completely cold.
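
For anyone who wants to repeat the comparison, a minimal sketch (the test dataset names are made up; tank is a pool from earlier in the thread); run the same fio command above inside each mountpoint:

$ zfs create -o checksum=sha256 tank/fio-sha256
$ zfs create -o checksum=fletcher4 tank/fio-fletcher4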

Yes, I have and want those checksums anyway; I might as well use them for dedup as well.