ZFS dedup on /nix/store -- Is it worth it?

I’ve recently been experimenting with several ZFS features and learned that ZFS can do block-level deduplication.
Has anyone tried it on /nix/store?
Since we already have nix-store --optimise, can I still benefit from block-level deduplication on top of that?
ZFS’s deduplication implementation lowers write performance, but /nix/store is read-only most of the time, so I think that doesn’t matter. Am I right about that?
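For concreteness, this is roughly what I have in mind (a sketch only; the pool name tank and a dedicated tank/nix dataset are assumptions, not an actual setup):

# assuming /nix already lives on its own dataset, e.g. tank/nix
$ sudo zfs set dedup=on tank/nix      # only affects data written from now on
$ sudo zpool get dedupratio tank      # later: check how much it is saving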

2 Likes

No answer, but a couple of points:

  1. Disable auto store optimisation for now · NixOS/nix@6c4ac29 · GitHub, because auto-optimise can cause performance degradation (not ZFS-related). My guess would be that on an enterprise filesystem like ZFS this won’t happen. (A config sketch for turning it off follows this list.)

  2. The Nix store is written to maybe 99% of the time from the binary cache, and my internet vs. disk speeds differ by more than an order of magnitude in favour of the disk. Unless deduplication slows writes down by more than 10x, it’s going to be helpful for me.
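For completeness, the option mentioned in point 1 (just a sketch of the NixOS option as I understand it; adjust to your own config):

# configuration.nix
{
  # hard-links identical files in /nix/store; disabled here per the issue linked above
  nix.autoOptimiseStore = false;
}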

Have you read this post:
ZFS: To Dedupe or Not to Dedupe… · Constant Thinking?
Also this one: https://www.oracle.com/technical-resources/articles/it-infrastructure/admin-o11-113-size-zfs-dedup.html

Hoping that infrequent writes would let ZFS evict the dedup table from RAM, I see dedup as an almost free saving.
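Before enabling it, it should also be possible to estimate what the dedup table would look like on existing data, along the lines the Oracle article describes (the pool name is just an example):

# simulate dedup on the pool's current data: prints a DDT histogram and a projected dedup ratio
$ sudo zdb -S tank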

1 Like

Dedup is almost never worth it, but the Nix store is significantly different from most other use cases.

I would, however, recommend using compression. I get compressratio values of around 1.6 to 2, which is definitely nice for users on spinning disks, as it essentially doubles your I/O bandwidth.
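If anyone wants to try this, it is a one-property change (the dataset name is just an example; compression applies only to data written after the change):

$ sudo zfs set compression=lz4 tank/nix
$ zfs get compressratio tank/nix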

5 Likes

Despite my /nix/store being optimized I still get ZFS compression ratios of over 1.5, so compression seems to be worth it. Don’t know about dedup.

Using auto-optimise and lz4 compression now; works great for me.

1 Like

Can you post the output of

sudo zfs get all <MYPOOLNAME> | grep compressratio and sudo zpool get all <MYPOOLNAME> | grep dedupratio

I don’t know how to check ZFS RAM usage, but would you happen to know how it changed after enabling dedup? (From skimming the linked article, I got the sense that a rule of thumb is about 1/200 of the used size of the dataset.)
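(Just to put a number on that rule of thumb: a hypothetical 100 GB store would need roughly 100 GB / 200 = 0.5 GB of RAM for the dedup table, which sounds tolerable; it is multi-terabyte datasets where it would start to hurt.)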

I just migrated my old installation to a new one on ZFS.

Old: ext4. New: ZFS pool with dedup=on and compression=on.

old:

du -sh /nix/store : 46GB

new (after copying only /nix/store, not the actual installation):

➤ sudo zfs get  all tank/nix  | grep compress                                                             
tank/nix  compressratio          1.84x                   -
tank/nix  compression            on                      local
tank/nix  refcompressratio       1.84x                   -

➤ sudo zpool get  all tank  | grep dedup                                                                 
tank  dedupditto                     0                              default
tank  dedupratio                     1.70x                          -

➤ zpool list tank                                                                                         
NAME             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank   113G  17.1G  95.9G        -         -    10%    15%  1.69x    ONLINE  -


It goes without saying that dedup is pretty useful (but only if you don’t use store optimisation; if you do, dedup gains you nothing on top). That said, my computer was unusable for 3 hours for just 46 GB. It’s an old machine, but I regularly get around 100 MB/s copying on my HDD. Since /nix/store is read-heavy, I guess I won’t mind the occasional slowdown …

Note also that copying the store directly is probably not the right approach, since it means your SQLite database is not rebuilt, which puts everything at risk of garbage collection. I did this only for testing … (see: Rebuild sqlite db from scratch? · Issue #3091 · NixOS/nix · GitHub)
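If you do want to migrate a store for real, the usual approach (as I understand it; a sketch, not something I tested here) is to carry the Nix database along with the store:

# on the old system: export the Nix database
$ sudo nix-store --dump-db > nix-db-dump
# copy /nix/store and nix-db-dump over, then on the new system:
$ sudo nix-store --load-db < nix-db-dump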

According to linux - How can I determine the current size of the ARC in ZFS, and how does the ARC relate to free or cache memory? - Super User, you can read /proc/spl/kstat/zfs/arcstats for ARC metrics.

And it seems to align with my system (256 GB of RAM):

[12:00:03] jon@nixos ~
$ awk '/^size/ { print $1 " " $3 / 1048576 }' < /proc/spl/kstat/zfs/arcstats
size 41537.6
$ sudo zfs list
NAME             USED  AVAIL     REFER  MOUNTPOINT
nixstore         747G  1.03T       24K  /nixstore
nixstore/store   746G  1.03T      746G  legacy
tank             328G  6.70T      104K  /tank
tank/movies     7.09G  6.70T     7.09G  /tank/movies
tank/nixstore    112K  6.70T      112K  /tank/nixstore
tank/swap        272G  6.96T     5.55G  -
tank/torrents   48.4G  6.70T     48.4G  /tank/torrents

I use lz4 compression with nix-store --optimise, but I may just switch to ZFS dedup if I do it again.

1 Like

The dedup property can be set at any time …

Yes, but I already set my Nix store to optimise, and I think it would be largely duplicated work.

However, I may try it the next time I run nix-collect-garbage.

The big issue with dedup is that it uses quite a lot of memory, and the table is scattered randomly across the disk, making it slow to read (and AFAIK the entire table has to be read just to import the pool). It can take hours just to import a large array made of HDDs.

For the nix store… Eh, for most people I guess it’s small enough to not be such an issue. I wouldn’t count on it being all that much better than auto-optimize though.

4 Likes

ZFS dedup actually works while writing data. It calculates the checksum of a new block and then checks its table to see whether a block with the same checksum is already on disk. If yes, instead of writing the new block, ZFS just points to the old block in the metadata. This means ZFS needs the full table of all blocks and their checksums in RAM while writing, plus a fast CPU. If the table does not fit into the ZFS ARC, ZFS will happily re-read the missing part of it from disk for each write. ZFS dedup is heavily biased towards servers with loads of RAM, and enabling it without calculating the required RAM and adjusting the ZFS ARC size accordingly may cause massive performance hits, either instantly or later on when the block table grows too large.
Due to the way the ZFS ARC size is set, you may not even see a memory increase, as ZFS happily uses ~50% of RAM for its own cache if that memory would otherwise be free. But that cache may shrink because of the block table it needs for dedup, while the ARC size stays the same.

ZFS dedup is completely transparent during reads, as a read just follows a block pointer; whether that block is referenced multiple times simply does not matter. For that reason it also should not affect pool import times.
The good thing is that you can disable it again at any time.
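If you want to see what the dedup table actually costs on a given pool, something like this should show it (the pool name is just an example):

# -D prints dedup table (DDT) statistics: entry counts plus on-disk and in-core sizes
$ sudo zpool status -D tank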

3 Likes

Careful with running zfs set dedup=off or zfs set atime=off on an existing pool; one of these very likely destroyed my ZFS partition (I was on ZFS root and issued them after creation and installation).

Also, the link I gave in my first post touches on the RAM usage; it depends on your data size and it’s not that bad …

EDITED: Sorry my original post was unclear.

EDIT2: I meant running these commands afterwards. I otherwise have had a great experience with a dedup=off pool that started that way from the beginning.

I’ve been using atime=off and haven’t had any issues. Essentially it just avoids an extra write when accessing files, and since the Nix store doesn’t care about access times, it seems like a good fit.
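For reference, this is all it takes (the dataset name is just an example; relatime is the gentler alternative if you still want occasional atime updates):

# stop recording access times on the store dataset
$ sudo zfs set atime=off tank/nix
# or, as a middle ground (relatime only matters while atime=on):
$ sudo zfs set relatime=on tank/nix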

2 Likes

This might just be my use case of doing a lot of reviews, but dedup is really helping.

$ zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
nixstore  1.81T   462G  1.36T        -         -    29%    24%  1.79x    ONLINE  -
$ zfs get all nixstore/store | grep compressra
nixstore/store  compressratio         1.85x                            -

I might be able to extend the usefulness of this 2 TB drive well past its original 2 TB.

1 Like

This is without store auto-optimise, right? Can someone with auto-optimise on post their store stats? Specifically I’m looking for two things:

  1. du -sh on the Nix store (will take at least 15 minutes)
  2. the savings that optimise reports (typically printed at the end of garbage collection); see the sketch below
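In case it’s useful, the commands I have in mind (the exact output wording is from memory, so treat it as approximate):

# apparent store size; slow on a large store
$ sudo du -sh /nix/store
# hard-link identical store files; reports something like "... freed by hard-linking ... files" at the end
$ sudo nix-store --optimise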

This is with auto-optimise, I believe; at least I had it on, and I’m not aware of a way to disable it.

One thing to note is that my server now floats at a baseline of 100-180 GB of ZFS ARC + dedup tables (out of 256 GB). However, I haven’t really suffered memory pressure, so it hasn’t affected performance too much.

Can’t believe I’m telling you :slight_smile: … but what does nixos-option nix.autoOptimiseStore return?

I ask because 1.79 with auto-optimise on is very suspicious … There isn’t much ZFS can dedup beyond what /nix/store optimisation already does (ZFS dedups at the block level, but saying that 79 out of 179 blocks are identical while the containing files are not doesn’t seem right to me).

Can you keep an eye out during your next garbage collection and see what savings it reports?

It returns false. I guess I did run nix-store --optimise once, but that was a one-time action, not a persistent setting.

1 Like

After botching my previous ZFS install, I reinstalled NixOS on ZFS, and this time (by accident) I had both ZFS’s dedup and nix.autoOptimiseStore on. Amazingly, ZFS still managed to give me a 1.35x deduplication ratio. Substantial, but I’m not sure I’d like to pay the RAM cost.

The compression ratio went up to 1.95 (maybe just a difference in the data …?)