Though we only need to catch all the references, right? It’s conceivable but unlikely that there are store paths in the output that weren’t also specified in the inputs, so searching for any store path will be unlikely to yield more matches. Though I guess it would be best to actually test that hypothesis on the current nixpkgs.
Yeah I guess. I can’t come up with a situation where missing a store path would cause any harm, other than files not being deduplicated even though they could be.
Right, just moving any string matching /nix/store/[^/]+ out of the files should improve deduplication; it doesn’t matter whether those are real paths. After restoring, the end result is the same.
So if a binary contains a string such as /nix/store/abc..123-myfile anywhere (without a trailing / since it’s a file), large parts of the binary that have nothing to do with store paths will be zeroed out.
Maybe you could use [^/\0]* to stop matching at the next null-byte.
But even that is not guaranteed to work, in case store paths are stored not as null-terminated strings, but with an associated length instead.
It seems impossible to reliably detect the full extent of a store path in an arbitrary file.
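To make the over-matching concrete, here’s a quick sketch (not anything Attic actually does, and assuming the Rust regex crate’s bytes API, with \x00 as that crate’s spelling of the null byte): the greedy [^/]+ keeps matching past the end of a file-type store path until the next /, while [^/\x00]+ stops at the first null byte, if there happens to be one.

```rust
// Sketch only (assumes the `regex` crate's bytes API); illustrates the
// over-matching problem, not anything Attic actually does.
use regex::bytes::Regex;

// Zero out every match in place, preserving the file length.
fn zero_matches(data: &mut [u8], re: &Regex) {
    let ranges: Vec<_> = re.find_iter(data).map(|m| m.start()..m.end()).collect();
    for r in ranges {
        data[r].fill(0);
    }
}

fn main() {
    // A file-type store path embedded in a binary, followed by unrelated
    // bytes that contain no '/' until much later (made-up example data).
    let blob = b"\x7fELF..\x00/nix/store/abc..123-myfile\x00lots of unrelated data with no slash\x00/usr/share\x00".to_vec();

    let greedy = Regex::new(r"/nix/store/[^/]+").unwrap();
    let bounded = Regex::new(r"/nix/store/[^/\x00]+").unwrap();

    let mut a = blob.clone();
    zero_matches(&mut a, &greedy); // zeroes far past the store path, up to the next '/'
    let mut b = blob.clone();
    zero_matches(&mut b, &bounded); // stops at the terminating null byte (if there is one)

    let changed = |x: &[u8]| x.iter().zip(&blob).filter(|(l, r)| l != r).count();
    println!("greedy zeroed {} bytes, bounded zeroed {} bytes", changed(&a), changed(&b));
}
```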
Another thing I don’t understand: Why would length changes matter? The benchmarks are doing content-defined chunking (CDC). They deduplicate even in the presence of length changes of store paths.
Further, would zeroing out store paths bring a significant improvement in deduplication?
Most references to store paths should be in the section of binaries or .so files that is used for dynamic linking, e.g. this for chromium:
Here we can see that the store path references are clustered closely together, which means that under CDC they will not contribute significantly to the stored file size.
I think a key thing that works against CDC for such binaries is that absolute references to labels, string constants, etc. are spread around the file, so if the position of a static string in the binary changes, this results in changed pointers all across the binary, making CDC find different chunks.
Thus I would not be surprised if “decompilation” into assembly, which turns addresses back into named labels, would improve the deduplication significantly. However, then it’s unclear how to get back to the original bit-identical binary.
I gather from this README section and this manyclangs link that with elfshaker one does not store the final binary or .so, but instead the “pre-link object files” (.o), which are stable due to -ffunction-sections and -fdata-sections.
Then, to get a binary or .so out, you summon the .o files with elfshaker and link them.
This means that for attic to use this much-more-effective type of binary deduplication, it needs to
run during the build process, where .o files are still available
Unfortunately, sliding-window chunked deduplication isn’t magic. It works by calculating a rolling hash over a window that drops a byte at the front and adds a byte at the back; when that hash (or its low bits) hits zero, that’s a chunk boundary, which on average gives a certain block size.
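For illustration, here’s a toy version of that idea (not Attic’s actual chunker; the window size, seed, and mask below are made up): a Buzhash-style rolling hash over a fixed window, cutting a chunk whenever the low bits of the hash are zero, which on random data gives roughly mask+1 bytes per chunk on average.

```rust
// Toy content-defined chunking with a Buzhash-style rolling hash.
// Not Attic's chunker; WINDOW, the seed, and MASK are illustrative only.

const WINDOW: usize = 48;        // bytes in the rolling window (assumption)
const MASK: u64 = (1 << 16) - 1; // cut when the low 16 bits are zero -> ~64 KiB average chunks

// splitmix64: tiny deterministic PRNG used to fill the per-byte table.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

fn table() -> [u64; 256] {
    let mut t = [0u64; 256];
    let mut s = 0x243F_6A88_85A3_08D3u64; // arbitrary fixed seed
    for v in t.iter_mut() {
        *v = splitmix64(&mut s);
    }
    t
}

// Returns the positions where a chunk ends.
fn chunk_boundaries(data: &[u8]) -> Vec<usize> {
    let t = table();
    let mut boundaries = Vec::new();
    let mut h: u64 = 0;
    for i in 0..data.len() {
        if i < WINDOW {
            // Still filling the window: just add bytes at the back.
            h = h.rotate_left(1) ^ t[data[i] as usize];
        } else {
            // Roll: drop the byte falling out at the front, add the new byte at the back.
            h = h.rotate_left(1)
                ^ t[data[i - WINDOW] as usize].rotate_left(WINDOW as u32)
                ^ t[data[i] as usize];
        }
        // When the low bits of the hash are all zero, cut a chunk here.
        // On random data that happens once every MASK+1 bytes on average
        // (real chunkers also enforce minimum/maximum chunk sizes).
        if i + 1 >= WINDOW && (h & MASK) == 0 {
            boundaries.push(i + 1);
        }
    }
    boundaries
}

fn main() {
    // Chunk 4 MiB of pseudo-random data and report the average chunk size.
    let mut s = 1u64;
    let data: Vec<u8> = (0..4 << 20).map(|_| splitmix64(&mut s) as u8).collect();
    let cuts = chunk_boundaries(&data);
    println!(
        "{} boundaries in {} bytes, average chunk ~{} KiB",
        cuts.len(),
        data.len(),
        data.len() / cuts.len().max(1) / 1024
    );
}
```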
So it wouldn’t find the real boundaries, which are at the store paths, and also the data between the paths is too short to be a chunk. Shell scripts would surely never deduplicate (but of course they are tiny, relatively speaking).
By moving the data that we know will change out of the file, the rest of the file has a good chance of producing duplicate chunks under the rolling checksum.
If there is other surely-changed data like that, it should be moved out before dedupe, too.
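A minimal sketch of that “move it out, put it back” round trip (again, not how Attic works today; the match ranges would come from whatever store-path matcher is used, and the offsets below are made up): save (offset, original bytes) for each match, zero the bytes before chunking, and splice them back in on restore.

```rust
// Sketch of the "move known-to-change data out, chunk, then put it back" idea.
// Not Attic's implementation; `ranges` would come from a store-path matcher.

/// Replace each matched range with zeros and remember the original bytes.
fn extract(data: &mut [u8], ranges: &[std::ops::Range<usize>]) -> Vec<(usize, Vec<u8>)> {
    let mut saved = Vec::new();
    for r in ranges {
        saved.push((r.start, data[r.clone()].to_vec()));
        data[r.clone()].fill(0);
    }
    saved
}

/// Put the original bytes back; the result is bit-identical to the input.
fn restore(data: &mut [u8], saved: &[(usize, Vec<u8>)]) {
    for (start, bytes) in saved {
        data[*start..*start + bytes.len()].copy_from_slice(bytes);
    }
}

fn main() {
    // Made-up example data containing two fake store paths.
    let original = b"...\x00/nix/store/aaaa-foo\x00.../nix/store/bbbb-bar\x00...".to_vec();
    let ranges = vec![4..23, 27..46]; // byte offsets of the two store paths above

    let mut scrubbed = original.clone();
    let saved = extract(&mut scrubbed, &ranges);
    // `scrubbed` is what gets chunked/deduplicated; `saved` is stored alongside.

    let mut roundtrip = scrubbed.clone();
    restore(&mut roundtrip, &saved);
    assert_eq!(roundtrip, original);
}
```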
When sliding-window chunked deduplication hits a changed store path, it causes at most 2*chunkSize of differences (to the left and right of the change), which for Attic is 512 KiB.
As shown in the post above, the chromium binary has 6 matches, so that would be roughly 3 MB of overhead, while the chromium binary is 350 MB.
So the impact of the proposed optimisation should be <= 1% for this test.
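Spelling that estimate out with the numbers above: 6 matches × 512 KiB ≈ 3 MiB of chunks that may fail to deduplicate, and 3 MiB / 350 MB ≈ 0.9 %, which is where the ≤ 1 % figure comes from.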
I finally got around to setting up Attic and it’s been working great! Definitely sped up my local deployments to several machines. Thank you @zhaofengli!