Given that the other `attic` project hasn’t been maintained in eight years, why not just leave it as is?

There is no name clash in nixpkgs, and package names are usually taken on a first-come, first-served basis there. IMHO, `attic` is a good choice and I think you can keep it.
Since `attic` seems to support separate GC configuration per cache, it seems like you could get something similar by pushing releases to a dedicated cache without time-based GC.
I think renaming it would be good, since the existing `attic` backup tool’s key feature is also content-based deduplication. From the heading, I also immediately (and incorrectly) concluded that your tool was directly based on the existing `attic`.

I like the suggestion “`attix`”, which makes it quite clear that Nix is involved.
Given that the chunks are still quite big, and I suspect that many files change only because the Nix store paths they contain have changed, how about pre-processing all stored files as follows:
The idea is to move the Nix store paths (only up to the first path component) into a separate list and remove them from the file. So you would replace a file `F` with a tuple `(Fx, L)`. `Fx` is the binary contents of the file with every sequence matching `/nix/store/[^/]+` removed, and `L` is a list of `(position, path)` tuples.
This can be encoded in a streaming manner, and decoded in a streaming manner provided you have access to the tuples `L`. `L` can be compressed better by making `position` relative to the end of the last match, and by making `path` an index into a list of found paths. So then we get `Lrel`, a list of `(relPosition, pathIndex)` tuples, and `P`, a list of paths, so `F` becomes `(Fx, Lrel, P)`.
This result should be way better at being chunked. I am hoping that many rebuilt files will have the same `Fx` and `Lrel`, and only `P` will differ.
For future-proofing etc., the `/nix/store/` part should be configurable.
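A minimal sketch of this transform and its inverse (function names are mine, and a real implementation would stream rather than buffer the whole file):

```python
import re

# Matches a store path up to its first path component, as proposed above.
STORE_PATH_RE = re.compile(rb"/nix/store/[^/]+")

def split_store_paths(data: bytes):
    """Split data into (fx, lrel, paths): fx is the content with store paths
    removed, lrel is a list of (relPosition, pathIndex) tuples, and paths is
    the list of distinct matched paths."""
    fx = bytearray()
    lrel = []
    paths = []
    last_end = 0
    for m in STORE_PATH_RE.finditer(data):
        fx += data[last_end:m.start()]
        path = m.group()
        if path not in paths:
            paths.append(path)
        # position relative to the end of the previous match
        lrel.append((m.start() - last_end, paths.index(path)))
        last_end = m.end()
    fx += data[last_end:]
    return bytes(fx), lrel, paths

def join_store_paths(fx: bytes, lrel, paths) -> bytes:
    """Inverse of split_store_paths: splice the paths back in."""
    out = bytearray()
    pos = 0
    for rel, idx in lrel:
        out += fx[pos:pos + rel]
        out += paths[idx]
        pos += rel
    out += fx[pos:]
    return bytes(out)
```

Two rebuilds that differ only in their store-path hashes would then share `fx` and `lrel` and differ only in the small `paths` list.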
What do you think @zhaofengli ?
Content-addressable derivations use a similar pattern for hashing modulo self-references.
However, they specifically only look for the hash parts to avoid missing split references (think something like `[storePath, "dbeu123...-opencv-2.13.0"].join('/')`). I guess you could do this here as well; you’d have to scan for the hashes of all dependencies, record their positions, replace them with a string of zeroes (the “cleaned” version of the file), and store a set of tuples of which hashes occurred at which positions `{(h, [p])}`. If you then store a new file in this scheme and the “cleaned” version is equal and the set is exactly the same except for the hashes, you have a match and can map both files to the same on-disk representation, just with different maps.

With the self-reference, you’re sure that the string is what you gave it to find, but in this case we want to catch as many store paths as possible, and it’s not a big problem if we miss one.
But yes, very similar.
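The hash-scanning scheme described above could be sketched like this (illustrative only; `clean_hashes` and `dedupe_key` are my own names, not anything from Nix or Attic):

```python
def clean_hashes(data: bytes, dep_hashes):
    """Zero out every occurrence of the given dependency hashes and record
    which hash occurred at which byte positions: the {(h, [p])} set."""
    cleaned = bytearray(data)
    occurrences = {}
    for h in dep_hashes:
        start = 0
        while (pos := data.find(h, start)) != -1:
            cleaned[pos:pos + len(h)] = b"0" * len(h)
            occurrences.setdefault(h, []).append(pos)
            start = pos + len(h)
    return bytes(cleaned), occurrences

def dedupe_key(data: bytes, dep_hashes):
    """Two files get the same key when their cleaned bytes match and the
    hash positions match, regardless of which hashes sat at those positions."""
    cleaned, occurrences = clean_hashes(data, dep_hashes)
    return cleaned, sorted(occurrences.values())
```

Two rebuilds that differ only in their dependency hashes then produce identical keys and can share one on-disk representation, each keeping its own small hash map.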
Though we only need to catch all the references, right? It’s conceivable but unlikely that there are store paths in the output that weren’t also specified in the inputs, so searching for any store path is unlikely to yield more matches. Though I guess it would be best to actually test that hypothesis on the current nixpkgs.
Yeah I guess. I can’t come up with a situation where missing a store path would cause any harm, other than files not being deduplicated even though they could be.
Right, just moving any string matching `/nix/store/[^/]+` out of the files should improve deduplication; it doesn’t matter whether those are real paths. After restoring, the end result is the same.
I did a benchmark of Attic for deduplicating chromium derivations.
It includes comparisons with other dedup backup tools.
Could you do the benchmark after removing or zeroing out anything matching `/+nix/+store/+[^/]*`?

Removing ensures that length changes of the path don’t matter, but generally the lengths don’t change much.
How can this regex work? `[^/]*` continues matching until a `/` is found. So if a binary file contains anywhere a string such as `/nix/store/abc..123-myfile` (without a trailing `/`, since it’s a file), large parts of the binary that have nothing to do with store paths will be zeroed out.

Maybe you could use `[^/\0]*` to stop matching at the next null byte. But even that is not guaranteed to work, in case store paths are stored not as null-terminated strings but with an associated length instead.
It seems impossible to detect full-length store paths in an arbitrary file. The best you could do is find the pattern `/nix/store/.{32}-`, which matches `/nix/store/3c666h6rp892xvsl22bfahvmga8c3d9k-`. After that, all bets are off.
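A quick toy demonstration of the over-matching problem (made-up data; in a path without a trailing slash, `[^/]*` keeps matching past the store path into unrelated bytes until the next `/` or the end of the input):

```python
import re

# A fake store-path reference followed by unrelated binary content,
# with no '/' after the file name.
blob = b"/nix/store/" + b"a" * 32 + b"-myfile unrelated bytes here"

# The proposed pattern swallows everything up to the next '/' (here: the end).
greedy = re.search(rb"/nix/store/[^/]*", blob).group()

# The bounded pattern stops right after the 32-character hash and the dash.
bounded = re.search(rb"/nix/store/.{32}-", blob).group()
```

Here `greedy` matches the entire blob, including `" unrelated bytes here"`, while `bounded` matches only the prefix up to and including the dash.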
Another thing I don’t understand: why would length changes matter? The benchmarks use content-defined chunking (CDC), which deduplicates even when store-path lengths change.

Further, would zeroing out store paths bring a significant improvement in deduplication? Most references to store paths should be in the sections of binaries or `.so` files used for dynamic linking, e.g. this for chromium:
```
# grep --text --byte-offset -R -P '/nix/store/.{32}-' /nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110 | cut -d: -f-2
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libvk_swiftshader.so:6537
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libvulkan.so.1:15569
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libVkLayer_khronos_validation.so:8277
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libVkICD_mock_icd.so:4477
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chrome_crashpad_handler:0
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chrome_crashpad_handler:9905
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libEGL.so:8649
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/libGLESv2.so:83361
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:20499185
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:21989989
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:23409090
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:32017337
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:267909201
/nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110/libexec/chromium/chromium:359225482
```
Here we can see that the locations of store paths are compact, which means that under CDC, they will not contribute significantly to stored file size.
I think a key thing that works against CDC for such binaries is that absolute references to labels, string constants, etc. are spread around the file, so if the position of a static string in the binary changes, pointers change all across the binary, making the CDC find different chunks.
Thus I would not be surprised if “decompilation” into assembly, which turns addresses back into named labels, would improve the deduplication significantly. However, then it’s unclear how to get back to the original bit-identical binary.
Ah yes, good point. But the part after that still contains a name+version which very likely differs between builds, so maybe it could be `/+nix/+store/+[a-z0-9]{32}-[a-zA-Z0-9._+?=-]{0,150}`. And it doesn’t matter if it removes too much, since the exact same thing is put back afterwards.

(This regex matches everything in my store, checked with `find /nix/store/ -maxdepth 1 -regextype posix-extended ! -regex '/+nix/+store/+[a-z0-9]{32}-[a-zA-Z0-9._+?=-]{0,150}'`.)
This is discussed and solved in `elfshaker`:

Discussed for NixOS in:

If I understand correctly from this README section and this manyclangs link, with `elfshaker` one does not store the final binary or `.so`, but instead the “pre-link object files” (`.o`), which are stable thanks to `-ffunction-sections` and `-fdata-sections`. Then, to get a binary or `.so` out, you summon the `.o` files with `elfshaker` and link them.
This means that for `attic` to use this much-more-effective type of binary deduplication, it needs to:

- run during the build process, where `.o` files are still available
- perform reproducible linking for retrieval
Unfortunately, sliding-window chunked deduplication isn’t magic. It works by calculating a rolling hash that removes a byte at the front and adds a byte at the back; when that hash hits 0, that’s a chunk boundary, which on average gives a certain chunk size.

So it wouldn’t find the real boundaries, which are at the store paths, and the data between the paths is too short to form a chunk of its own. Shell scripts would surely never deduplicate (but of course they are tiny, relatively speaking).
By moving the data that we know will change out of the file, the rest of the file has a good chance of hitting rolling checksum duplicate blocks.
If there is other surely-changed data like that, it should be moved out before dedupe, too.
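The rolling-hash mechanism can be sketched like this (a toy chunker for illustration only, not the chunker Attic actually uses; window size, multiplier, and mask are arbitrary choices of mine):

```python
WINDOW = 16            # rolling-hash window in bytes
MASK = (1 << 10) - 1   # boundary when the low 10 bits are zero -> ~1 KiB avg
MOD = 1 << 32

def chunk_boundaries(data: bytes, window=WINDOW, mask=MASK):
    """Return end offsets of chunks found by a simple windowed rolling hash."""
    boundaries = []
    out_weight = pow(31, window, MOD)
    h = 0
    for i, b in enumerate(data):
        h = (h * 31 + b) % MOD                             # add byte at the back
        if i >= window:
            h = (h - data[i - window] * out_weight) % MOD  # drop the front byte
            if h & mask == 0:                              # low bits zero: boundary
                boundaries.append(i + 1)
    boundaries.append(len(data))
    return boundaries
```

Because the hash depends only on the last `window` bytes, boundaries re-synchronize shortly after an insertion, which is exactly why CDC survives length changes; but the boundaries land wherever the hash says, not at the store paths.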
What I am saying is:
Doing this has negligible effect on size.
When sliding-window chunked deduplication hits a changed store path, it causes at most `2 * chunkSize` of difference (the chunks to the left and right of the change), which for Attic is 512 KiB.

As shown in the post above, the `chromium` binary has 6 matches, so that would be roughly 3 MB of overhead, while the `chromium` binary is 350 MB. So the impact of the proposed optimisation should be <= 1% for this test.

The `.so` files even have only 1 match each.
Very good points. Might indeed not be worth doing, even if that means scripts will never be deduped.
I’d like to be sure though, I’ll experiment with a script that determines offsets of store paths.
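Such an experiment could look roughly like this (a hedged sketch; `store_ref_offsets` and the reference pattern are my own choices):

```python
import re
from pathlib import Path

# Anything that looks like a store-path reference: /nix/store/<32-char hash>-
REF = re.compile(rb"/nix/store/[a-z0-9]{32}-")

def store_ref_offsets(root):
    """Yield (file, size, offsets) for every regular file under root that
    contains at least one store-path reference, so the gaps between
    references (and hence the likely CDC impact) can be measured."""
    for f in Path(root).rglob("*"):
        if not f.is_file() or f.is_symlink():
            continue
        data = f.read_bytes()
        offsets = [m.start() for m in REF.finditer(data)]
        if offsets:
            yield f, len(data), offsets
```

Running this over a store output would reproduce the grep results shown earlier in the thread, but with every offset per file, making it easy to compute the distances between consecutive references.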
I finally got around to setting up Attic and it’s been working great! Definitely sped up my local deployments to several machines. Thank you @zhaofengli!