Introducing Attic, a self-hostable Nix Binary Cache server

Though we only need to catch all the references, right? It’s conceivable but unlikely that there are store paths in the output that weren’t also specified in the inputs, so searching for any store path will be unlikely to yield more matches. Though I guess it would be best to actually test that hypothesis on the current nixpkgs.

Yeah I guess. I can’t come up with a situation where missing a store path would cause any harm, other than files not being deduplicated even though they could be.

1 Like

Right, just moving any string matching /nix/store/[^/]+ out of the files should improve deduplication, it doesn’t matter if those are real paths. After restoring the end result is the same.

4 Likes

I did a benchmark of Attic for deduplicating chromium derivations.

It includes comparisons with other dedup backup tools.

Could you do the benchmark after removing or zeroing out anything matching /+nix/+store/+[^/]*?

Removing ensures that length changes of the path don’t matter, but generally the lengths don’t change much.

How can this regex work?

[^/]* continues matching until / is found.

So if a binary file contains anywhere in the binary a string such as /nix/store/abc..123-myfile (without a trailing / since it’s a file), large parts of the binary will be zeroed out that have nothing to do with store paths.

Maybe you could use [^/\0]* to stop matching at the next null-byte.

But even that is not guaranteed to work, in case store paths are stored not as null-terminated strings, but with an associated length instead.


It seems impossible to detect full-length store paths in an arbitrary file.

The best you could do is finding the pattern

/nix/store/.{32}-

which matches

/nix/store/3c666h6rp892xvsl22bfahvmga8c3d9k-

After that, all bets are off.

2 Likes

Another thing I don’t understand: Why would length changes matter? The benchmarks are doing content-defined chunking (CDC). They deduplicate even in the presence of length changes of store paths.

Further, should zeroing out store paths bring a significant improvement of deduplication?

Most references to store paths should be in the section of binaries or .so files that are usef for dynamic linking, e.g. this for chromium:

# grep --text --byte-offset -R -P '/nix/store/.{32}-' /nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110 | cut -d: -f-2        
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libvk_swiftshader.so:6537
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libvulkan.so.1:15569
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libVkLayer_khronos_validation.so:8277
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libVkICD_mock_icd.so:4477
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chrome_crashpad_handler:0
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chrome_crashpad_handler:9905
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libEGL.so:8649
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libGLESv2.so:83361
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:20499185
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:21989989
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:23409090
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:32017337
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:267909201
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:359225482

Here we can see that the locations of store paths are compact, which means that under CDC, they will not contribute significantly to stored file size.

I think a key thing that works against CDC for such binaries is that absolute reference to labels, string constants, etc, are spread around the file, so if the position of a static string in the binary changes, this will result in changed pointers all across the binary, making the CDC find different chunks.

Thus I would not be surprised if “decompilation” into assembly, which turns addresses back into named labels, would improve the deduplication significantly. However, then it’s unclear how to get back to the original bit-identical binary.

1 Like

ah yes, good point. But the part after that still contains a name+version which highly likely differs between builds, so maybe it could be

/+nix/+store/+[a-z0-9]{32}-[a-zA-Z0-9._+?=-]{0,150}

And it doesn’t matter if it removes too much, since it would put back the exact same thing afterwards.

(this regex matches everything in my store, checked with

find /nix/store/ -maxdepth 1 -regextype posix-extended ! -regex '/+nix/+store/+[a-z0-9]{32}-[a-zA-Z0-9._+?=-]{0,150}

)

This is discussed and solved in elfshaker:

Discussed for NixOS in:

I from this README section and this manyclangs link, with elfshaker one does not store the final binary or .so, but instead the “pre-link object files” (.o), which are stable due to -ffunction-sections and -fdata-sections.

Then for getting a binary or .so out, you summon the .o files with elfshaker , and link them.

This means that for attic to use this much-more-effective type of binary deduplication, it needs to

  • run during the build process, where .o files are still available
  • perform reproducible linking for retrieval
1 Like

Unfortunately sliding window chunked deduplication isn’t magic. It works by calculating a rolling hash that removes a byte out front and adds a byte at the back, and then when that hash hits 0, that’s a chunk boundary and on average that gives a certain block size.

So it wouldn’t find the real boundaries, which are at the store paths, and also the data between the paths is too short to be a chunk. Shell scripts would surely never deduplicate (but of course they are tiny, relatively speaking).

By moving the data that we know will change out of the file, the rest of the file has a good chance of hitting rolling checksum duplicate blocks.

If there is other surely-changed data like that, it should be moved out before dedupe, too.

What I am saying is:

Doing this has negligible effect on size.

When sliding window chunked deduplication hits a changed store path, it causes at max 2*chunkSize differences (left and right of the difference), which for Attic is 512 KiB.

As shown in the post above the chromium binary has 6 matches, so that would be roughly 3 MB overhad, while the chromium binary is 350 MB.

So the impact of the proposed optimisation should be <= 1% for this test.

The .so files even have only 1 match each.

2 Likes

Very good points. Might indeed not be worth doing, even if that means scripts will never be deduped.

I’d like to be sure though, I’ll experiment with a script that determines offsets of store paths.

3 Likes

I finally got around to setting up Attic and it’s been working great! Definitely sped up my local deployments to several machines. Thank you @zhaofengli!

3 Likes

Is there any docs about how to deploy it like this?

Anyone aware on how to install attic client locally with flake?

This won’t works for me:

nix --extra-experimental-features flakes build -vvvv https://github.com/zhaofengli/attic/tarball/main#attic-client

Try github:zhaofengli/attic#attic-client instead of https://github.com/zhaofengli/attic/tarball/main#attic-client.

1 Like

Thanks! It works and installs attic binary as expected.

nix --extra-experimental-features flakes build github:zhaofengli/attic#attic-client