Optimise store while building/downloading from the cache

Hi,

I’m not sure whether this is a dumb or haphazard idea, and I don’t have much insight into the Nix codebase (that’s why I’m asking).

Since I have a rather slow Internet connection, most of the time for a Nix (re-)build on my desktop (using the unstable channel) is “wasted” downloading from the cache.
This is much less of an issue on servers, as they usually have a fast internet connection (and in my case they are less resource-intensive in terms of the Nix store, since they are optimized for a particular use case).

I’m wondering whether it is possible to avoid transferring the same files again and, in the process, directly create hard links to the existing copies (like nix-store --optimise, but already applied during the build/download itself), since a lot of redundant data is often transferred: the same files as in previous builds, just in a newly created derivation.

For this, maybe something like a key-value database mapping file hash -> location on disk could help: check whether the content already exists, and if so, only hard-link it into the derivation instead of writing a new copy (a sketch of what I mean is below).
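Purely as an illustration of the idea (this is not how Nix implements it; the pool directory and function names here are made up), a minimal sketch in Python:

```python
#!/usr/bin/env python3
"""Sketch: keep a pool of files named by their content hash, and when a
"new" file should be written, hard-link the existing copy instead."""

import hashlib
import os
import shutil

LINK_DIR = "/tmp/dedup-links"  # hypothetical pool, loosely analogous to /nix/store/.links


def file_hash(path: str) -> str:
    """SHA-256 of the file contents, streamed so large files are not loaded at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def materialise(src: str, dst: str) -> None:
    """Place `src` at `dst`, hard-linking to an existing identical copy if one is pooled."""
    os.makedirs(LINK_DIR, exist_ok=True)
    pooled = os.path.join(LINK_DIR, file_hash(src))
    if not os.path.exists(pooled):
        shutil.copy2(src, pooled)  # first time this content is seen: store exactly one copy
    os.link(pooled, dst)           # every later occurrence is just a hard link
```

With SHA-256 the collision risk is negligible in practice, which is also why I suspect the hash-collision worry below is not the real obstacle.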

Maybe I’m overlooking something (hash collisions, perhaps)?

There used to be manifests that allowed for diffing, but my understanding is that those were too complicated due to their stateful nature.

I understand that in some situations that’s not possible, but is there no way to improve your internet connection? I’d say that’s probably the easy way out.

Is there any link to these manifests? It would be interesting to read about them.

In my current situation it is difficult to upgrade my internet connection, but it isn’t that much of an issue for me.
I just thought that a lot of disk writes, bandwidth and download pressure on the servers could be avoided if only the data that has actually changed were transferred/written, via some kind of checksumming/hashing of the files (a tiny illustration of what I mean is below).
But I understand that the statefulness this introduces might make it difficult.
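Just to make the idea concrete (made-up names, not an existing Nix interface): the cache could publish per-file hashes, and the client would only request content it does not already have.

```python
"""Illustration only: diff a published manifest of per-file hashes against
the hashes already present locally, so only genuinely new content is fetched."""


def files_to_fetch(remote_manifest: dict[str, str], local_hashes: set[str]) -> list[str]:
    """remote_manifest maps file path -> content hash; local_hashes is the set of
    content hashes already on disk. Returns the paths whose content is missing."""
    return [path for path, digest in remote_manifest.items()
            if digest not in local_hashes]
```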

Two comments:

Some portion of your concern is being addressed by the content-addressable effort. Its granularity is at the package level, but it may help short-circuit the required builds.

Storage mechanism and cache:
I’ve thought it would be interesting to have a casync-based store (or cache): GitHub - systemd/casync: Content-Addressable Data Synchronization Tool. It would perform deduplication at a much more granular level, making syncing up to the cache as well as downloading much faster. It would still use S3 and narinfo, but each NAR would be a caidx file pointing to all the blobs needed (a toy sketch of that shape is below). I’m not sure how much work it would be to wrap the existing S3 store for this, but I wouldn’t mind trying if someone can point me in the right direction.
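Roughly the shape I have in mind, as a toy sketch (fixed-size chunks for brevity, whereas real casync picks chunk boundaries with a rolling hash; the dict-based `blob_store` just stands in for S3 or a local directory):

```python
"""Toy sketch of an "index file pointing to blobs" layout: store content as
hash-addressed chunks, and represent each archive as a list of chunk hashes."""

import hashlib

CHUNK_SIZE = 64 * 1024  # toy fixed size; content-defined chunking would dedup better


def make_index(data: bytes, blob_store: dict[str, bytes]) -> list[str]:
    """Split `data` into chunks, store any chunk not yet present,
    and return the list of chunk hashes (the caidx-like index)."""
    index = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        blob_store.setdefault(digest, chunk)  # only previously unseen content is stored
        index.append(digest)
    return index


def fetch(index: list[str], blob_store: dict[str, bytes]) -> bytes:
    """Reassemble the data from its index; a client that already has some
    chunks cached locally would only need to download the missing ones."""
    return b"".join(blob_store[d] for d in index)
```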

Some discussion: git tree object as alternative to NAR · Issue #1006 · NixOS/nix · GitHub


I was just reading a paper about Content-Addressable Data and IPFS 🙂
Yes, depending on how it is implemented (if it ever is), it should pretty much address these issues.

Also relevant: [RFC 0017] Intensional Store by wmertens · Pull Request #17 · NixOS/rfcs · GitHub

Also relevant: this interesting discussion with @ebkalderon Support hashing and serializing directory paths into store · Issue #3 · ebkalderon/merkle-tree-nix-store-thing · GitHub
