Introducing Attic, a self-hostable Nix Binary Cache server

https://github.com/zhaofengli/attic/issues/2


Given that the chunks are still quite big, and I suspect that many files only change by their inclusion of nix store paths that changed, how about pre-processing all stored files as follows:

The idea is to move the nix store paths (only up to the first part) into a separate list, and remove them from the file. So then you would replace a file F with a tuple (Fx, L). Fx is the binary contents of the file with every sequence matching /nix/store/[^/]+ removed, and L is a list of (position, path) tuples.

This can be encoded in a streaming manner, and decoded in a streaming manner provided you have access to the tuples L.

L can be compressed better by making position relative to the end of the last match, and making path an index into a list of found paths. So then we get Lrel, a list of (relPosition, pathIndex) tuples, and P, a list of paths, so F becomes (Fx, Lrel, P).
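To make the idea concrete, here is a hypothetical sketch of the proposed encoding in Python (the names Fx/Lrel/P follow the post; this is an illustration, not Attic code):

```python
import re

# Every sequence matching /nix/store/[^/]+ is cut out of the file,
# leaving (Fx, Lrel, P) as described above.
STORE_RE = re.compile(rb"/nix/store/[^/]+")

def encode(data: bytes):
    """Split data into (fx, lrel, paths): the cleaned bytes, the
    relative positions, and a deduplicated list of removed paths."""
    fx = bytearray()
    lrel, paths, index = [], [], {}
    last = 0
    for m in STORE_RE.finditer(data):
        fx += data[last:m.start()]
        path = m.group()
        if path not in index:
            index[path] = len(paths)
            paths.append(path)
        # position relative to the end of the previous match
        lrel.append((m.start() - last, index[path]))
        last = m.end()
    fx += data[last:]
    return bytes(fx), lrel, paths

def decode(fx: bytes, lrel, paths):
    """Inverse of encode: splice the store paths back in."""
    out = bytearray()
    pos = 0
    for rel, idx in lrel:
        out += fx[pos:pos + rel] + paths[idx]
        pos += rel
    out += fx[pos:]
    return bytes(out)
```

Both directions process the input front to back, so they can be done in a streaming fashion as the post suggests.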

This result should be way better at being chunked. I am hoping that many rebuilt files will have the same Fx and Lrel, and only P will differ.

For future-proofing etc, the /nix/store/ part should be configurable.

What do you think @zhaofengli ?


Content-addressable derivations use a similar pattern for hashing modulo self-references.
However, they specifically only look for the hash parts, to avoid missing split references (think something like [storePath, "dbeu123...-opencv-2.13.0"].join("/")). I guess you could do the same here: scan for the hashes of all dependencies, record their positions, replace them with a string of zeroes (the “cleaned” version of the file), and store a set of tuples of which hashes occurred at which positions, {(h, [p])}. If you then store a new file in this scheme, the “cleaned” version is equal, and the set is the exact same except for the hashes, you have a match and can map both files to the same on-disk representation, just with different maps.
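A minimal sketch of that scheme (hypothetical helper; assumes the dependency hashes are known up front, as they are when uploading a store path):

```python
def clean(data: bytes, dep_hashes):
    """Zero out every occurrence of the given dependency hashes,
    returning the cleaned bytes plus {hash: [offsets]}, so two files
    that differ only in hashes reduce to the same cleaned bytes."""
    cleaned = bytearray(data)
    positions = {}
    for h in dep_hashes:
        offsets = []
        start = 0
        while (i := data.find(h, start)) != -1:
            offsets.append(i)
            cleaned[i:i + len(h)] = b"0" * len(h)
            start = i + len(h)
        if offsets:
            positions[h] = offsets
    return bytes(cleaned), positions
```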


With the self-reference, you know exactly which string to search for; in this case we want to catch as many store paths as possible, and it’s not a big problem if we miss one.

But yes, very similar.


Though we only need to catch all the references, right? It’s conceivable but unlikely that there are store paths in the output that weren’t also specified in the inputs, so searching for any store path will be unlikely to yield more matches. Though I guess it would be best to actually test that hypothesis on the current nixpkgs.

Yeah, I guess. I can’t come up with a situation where missing a store path would cause any harm, other than files not being deduplicated even though they could be.


Right, just moving any string matching /nix/store/[^/]+ out of the files should improve deduplication; it doesn’t matter whether those are real paths. After restoring, the end result is the same.


I did a benchmark of Attic for deduplicating chromium derivations.

It includes comparisons with other dedup backup tools.

Could you do the benchmark after removing or zeroing out anything matching /+nix/+store/+[^/]*?

Removing ensures that length changes of the path don’t matter, but generally the lengths don’t change much.

How can this regex work?

[^/]* continues matching until / is found.

So if a binary file contains anywhere in the binary a string such as /nix/store/abc..123-myfile (without a trailing / since it’s a file), large parts of the binary will be zeroed out that have nothing to do with store paths.

Maybe you could use [^/\0]* to stop matching at the next null-byte.

But even that is not guaranteed to work, in case store paths are stored not as null-terminated strings, but with an associated length instead.


It seems impossible to detect full-length store paths in an arbitrary file.

The best you could do is finding the pattern

/nix/store/.{32}-

which matches

/nix/store/3c666h6rp892xvsl22bfahvmga8c3d9k-

After that, all bets are off.
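For example, with Python’s `re` module (using the post’s pattern verbatim), only the hash prefix is matched reliably:

```python
import re

# Only the "/nix/store/" prefix plus the 32-character hash and the
# trailing dash can be detected reliably in arbitrary binary data;
# the name after the dash has no delimiter we can count on.
HASH_PREFIX = re.compile(rb"/nix/store/.{32}-")

blob = b"\x00\x01/nix/store/3c666h6rp892xvsl22bfahvmga8c3d9k-openssl\xff"
match = HASH_PREFIX.search(blob)
```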


Another thing I don’t understand: Why would length changes matter? The benchmarks are doing content-defined chunking (CDC). They deduplicate even in the presence of length changes of store paths.

Further, would zeroing out store paths bring a significant improvement in deduplication?

Most references to store paths should be in the sections of binaries or .so files that are used for dynamic linking, e.g. this for chromium:

# grep --text --byte-offset -R -P '/nix/store/.{32}-' /nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110 | cut -d: -f-2        
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libvk_swiftshader.so:6537
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libvulkan.so.1:15569
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libVkLayer_khronos_validation.so:8277
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libVkICD_mock_icd.so:4477
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chrome_crashpad_handler:0
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chrome_crashpad_handler:9905
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libEGL.so:8649
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libGLESv2.so:83361
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:20499185
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:21989989
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:23409090
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:32017337
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:267909201
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:359225482

Here we can see that the locations of store paths are compact, which means that under CDC, they will not contribute significantly to stored file size.

I think a key thing that works against CDC for such binaries is that absolute references to labels, string constants, etc., are spread around the file, so if the position of a static string in the binary changes, this results in changed pointers all across the binary, making CDC find different chunks.

Thus I would not be surprised if “decompilation” into assembly, which turns addresses back into named labels, would improve the deduplication significantly. However, then it’s unclear how to get back to the original bit-identical binary.


Ah yes, good point. But the part after that still contains a name+version, which very likely differs between builds, so maybe it could be

/+nix/+store/+[a-z0-9]{32}-[a-zA-Z0-9._+?=-]{0,150}

And it doesn’t matter if it removes too much, since it would put back the exact same thing afterwards.

(this regex matches everything in my store, checked with

find /nix/store/ -maxdepth 1 -regextype posix-extended ! -regex '/+nix/+store/+[a-z0-9]{32}-[a-zA-Z0-9._+?=-]{0,150}'

)
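The same refined pattern as a Python bytes regex, for anyone who wants to test it against their own store (the 150-character bound on the name part is the heuristic from the post, not a Nix rule):

```python
import re

# Hash part is base-32 [a-z0-9]{32}; the name part uses the character
# set nixpkgs allows, and over-matching is harmless since the exact
# matched bytes are put back on restore.
REFINED = re.compile(rb"/+nix/+store/+[a-z0-9]{32}-[a-zA-Z0-9._+?=-]{0,150}")

blob = b"see /nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec"
match = REFINED.search(blob)
```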

This is discussed and solved in elfshaker:

Discussed for NixOS in:

From this README section and this manyclangs link, I gather that with elfshaker one does not store the final binary or .so, but instead the “pre-link object files” (.o), which are stable due to -ffunction-sections and -fdata-sections.

Then, to get a binary or .so out, you summon the .o files with elfshaker and link them.

This means that for Attic to use this much more effective type of binary deduplication, it needs to

  • run during the build process, where .o files are still available
  • perform reproducible linking for retrieval

Unfortunately, sliding window chunked deduplication isn’t magic. It works by calculating a rolling hash that removes a byte out front and adds a byte at the back, and when that hash (modulo some divisor) hits 0, that’s a chunk boundary, which on average gives a certain chunk size.

So it wouldn’t find the real boundaries, which are at the store paths, and also the data between the paths is too short to be a chunk. Shell scripts would surely never deduplicate (but of course they are tiny, relatively speaking).

By moving the data that we know will change out of the file, the rest of the file has a good chance of hitting rolling checksum duplicate blocks.

If there is other surely-changed data like that, it should be moved out before dedupe, too.
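The mechanism described above can be sketched with a toy polynomial rolling hash (made-up parameters; not Attic’s actual chunker):

```python
def chunk_boundaries(data: bytes, window: int = 48, mask: int = (1 << 13) - 1):
    """Return chunk boundary offsets. The hash adds each incoming byte
    and drops the byte leaving the window; when its low 13 bits are all
    zero, a boundary is declared, giving ~8 KiB chunks on average."""
    MOD = 1 << 32
    BASE = 257
    out_pow = pow(BASE, window, MOD)  # weight of the byte leaving the window
    h = 0
    boundaries = []
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD
        if i >= window:
            h = (h - data[i - window] * out_pow) % MOD  # drop the front byte
            if h & mask == 0:
                boundaries.append(i + 1)
    return boundaries
```

Because each boundary depends only on the last `window` bytes, boundaries re-synchronize shortly after a change, which is what makes CDC robust to insertions; but, as noted above, these boundaries have no relation to where the store paths actually sit.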

What I am saying is:

Doing this has negligible effect on size.

When sliding window chunked deduplication hits a changed store path, it causes at most 2 × chunkSize of differences (the chunks to the left and right of the change), which for Attic is 512 KiB.

As shown in the post above, the chromium binary has 6 matches, so that would be roughly 3 MB of overhead, while the chromium binary is 350 MB.

So the impact of the proposed optimisation should be <= 1% for this test.

The .so files even have only 1 match each.
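The back-of-envelope arithmetic, using the figures from this thread:

```python
# Per the post: at most 2 x chunkSize of waste around each changed
# store-path site, which for Attic is 512 KiB.
waste_per_site = 512 * 1024
sites = 6                      # store-path sites in the chromium binary above
binary_size = 350 * 1024 ** 2  # chromium binary, roughly 350 MB

overhead = sites * waste_per_site / binary_size  # fraction of the binary
```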


Very good points. Might indeed not be worth doing, even if that means scripts will never be deduped.

I’d like to be sure though, I’ll experiment with a script that determines offsets of store paths.


I finally got around to setting up Attic and it’s been working great! Definitely sped up my local deployments to several machines. Thank you @zhaofengli!


Are there any docs about how to deploy it like this?

Is anyone aware of how to install the attic client locally with a flake?

This doesn’t work for me:

nix --extra-experimental-features flakes build -vvvv https://github.com/zhaofengli/attic/tarball/main#attic-client

Try github:zhaofengli/attic#attic-client instead of https://github.com/zhaofengli/attic/tarball/main#attic-client.


Thanks! It works and installs attic binary as expected.

nix --extra-experimental-features flakes build github:zhaofengli/attic#attic-client