fetchFromGitHub and The Versioneer: fixing source reproducibility

I recently ran into a non-reproducible source for the msgspec package: the archive fetched from GitHub has definitely changed since the last version bump.

The root of the problem turned out to be that the msgspec source includes the magic line $Format:%d$, which is expanded by the git archive analog that GitHub uses to generate source archives. When the commit for which the archive is generated is the latest on the branch, that string is replaced with (HEAD -> main, tag: 0.19.0), but once the branch moves forward, the first part disappears and it becomes (tag: 0.19.0).
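To make the effect concrete, here is a minimal Python sketch of why the hash drifts. It only imitates the substitution for the `%d` placeholder; the file content, the variable name `git_refnames`, and the helper names are assumptions for illustration, not GitHub's or git's actual code.

```python
import hashlib

# Assumed example of a committed line containing the placeholder.
COMMITTED = 'git_refnames = "$Format:%d$"\n'


def export_subst(text: str, ref_decoration: str) -> str:
    """Imitate archive-time expansion of the %d placeholder only."""
    return text.replace("$Format:%d$", ref_decoration)


def sha256_hex(text: str) -> str:
    """Hash the file content the way a fixed-output fetch would see it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Archive generated while the tagged commit is still the branch tip.
at_branch_tip = export_subst(COMMITTED, "(HEAD -> main, tag: 0.19.0)")
# Archive generated for the same commit after the branch has moved on.
after_branch_moved = export_subst(COMMITTED, "(tag: 0.19.0)")

# Same commit, different bytes, therefore different hashes.
hashes_differ = sha256_hex(at_branch_tip) != sha256_hex(after_branch_moved)
```

The commit is identical in both cases; only the ref decoration known at archive time differs, which is enough to invalidate a pinned hash.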

I’ve updated the hash in NixOS/nixpkgs pull request #416464 ("python3Packages.msgspec: fix src hash"), but decided to dig a bit deeper into how we could prevent this in the future, to avoid such big rebuilds and make things more reproducible. I found a general issue on this topic, NixOS/nixpkgs issue #84312 ("fetchFromGitHub is not reproducible when export-subst is used"), but short of switching away from fetching GitHub sources, I can’t see a general solution for it. So I looked for something more specific to the msgspec case, and found that this $Format:%d$ string comes from The Versioneer, a tool that uses it to embed more detailed version information into package sources. This issue was raised in that project several years ago, with no resolution.

I checked how popular The Versioneer is in Nixpkgs and found ~60 packages that have a copy of it in their sources on GitHub, 16 of which have broken hashes in Nixpkgs right now. That seems like a significant enough number, so I wrote a small hook that detects The Versioneer configuration and patches out everything except the tag from that line. We can add it to the source derivations of all affected packages and never have broken hashes in them again.
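The idea behind such a hook can be sketched in Python as follows. This is not the hook from the PR: the `git_refnames` variable name, the regular expression, and the function names are assumptions chosen for illustration, and a real hook would run as part of the fetcher rather than as a standalone script.

```python
import re

# Assumed shape of the Versioneer line after archive-time substitution.
REFNAMES_LINE = re.compile(r'^(\s*git_refnames\s*=\s*")([^"]*)(")', re.MULTILINE)


def keep_only_tags(refnames: str) -> str:
    """Drop branch and HEAD decorations, keeping only 'tag: ...' entries."""
    if refnames.startswith("$Format"):
        return refnames  # placeholder was never expanded; leave it alone
    stripped = refnames.lstrip()
    leading = refnames[: len(refnames) - len(stripped)]
    refs = [ref.strip() for ref in stripped.strip("()").split(",")]
    tags = [ref for ref in refs if ref.startswith("tag: ")]
    return leading + "(" + ", ".join(tags) + ")" if tags else ""


def normalize_version_file(text: str) -> str:
    """Rewrite the refnames line so it no longer depends on branch state."""
    return REFNAMES_LINE.sub(
        lambda m: m.group(1) + keep_only_tags(m.group(2)) + m.group(3), text
    )


tip = 'git_refnames = " (HEAD -> main, tag: 0.19.0)"\n'
moved = 'git_refnames = " (tag: 0.19.0)"\n'
normalized_tip = normalize_version_file(tip)
normalized_moved = normalize_version_file(moved)
```

Both variants of the archive normalize to the same text, so the resulting source hash no longer depends on whether the branch has moved since the tag was created.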

For details and to review this hook, please look here:


Short answer: use forceFetchGit = true; and update the hash. IIUC the fetched contents will then contain the raw file, with the placeholder left unexpanded. You can script a suitable placeholder replacement in postPatch.

Long overengineered answer: You could also get a tarball without the git archive postprocessing by producing a URL for a tree hash instead of a commit hash. You can find the tree hash using the GitHub API or the git client.
This fetchTree issue has more details, although in the context of non-FOD fetching.


@roberth I find that just falling back to a plain Git checkout is insufficient in such cases. Taking The Versioneer as an example, these strings are used in --version output (or similar). That means we’d have to emulate export-subst for these strings if we wanted the package to produce the expected output.
Do you find the hook approach lacking in some way?

The hook is probably an ok workaround.
It’s not unsolvable, but I don’t know if you’re interested in a complete solution to the problem. You could disable export-subst processing using one of the methods I mentioned, and then we could have our own replacement for the export-subst filtering that takes no info from git, but only from the fetcher’s parameters. That would solve more problems that you may or may not run into.
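As a rough illustration of what a git-independent replacement for export-subst filtering might look like, here is a Python sketch that fills placeholders purely from values a fetcher already knows. The parameter names `rev` and `tag`, the set of supported placeholders, and the sample file content are all assumptions; real export-subst supports many more format codes.

```python
import re

FORMAT_PLACEHOLDER = re.compile(r"\$Format:([^$]*)\$")


def substitute_from_fetcher_params(text: str, rev: str, tag: str = "") -> str:
    """Expand $Format:...$ using only fetcher parameters, never git state."""
    values = {
        "%H": rev,  # full commit hash, as passed to the fetcher
        "%d": f" (tag: {tag})" if tag else "",  # decoration reduced to the tag
    }

    def expand(match: re.Match) -> str:
        body = match.group(1)
        for code, value in values.items():
            body = body.replace(code, value)
        # Leave the placeholder untouched if it uses an unsupported code.
        return match.group(0) if "%" in body else body

    return FORMAT_PLACEHOLDER.sub(expand, text)


# Assumed sample resembling a Versioneer-style version file.
raw = (
    'git_refnames = "$Format:%d$"\n'
    'git_full = "$Format:%H$"\n'
    'git_date = "$Format:%ci$"\n'
)
expanded = substitute_from_fetcher_params(raw, rev="0123abcd", tag="0.19.0")
```

Because the output is a pure function of the fetcher's own arguments, the same inputs always yield the same bytes, regardless of where any branch currently points.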


@roberth, I put together a more general workaround as you suggested, and it seems to be working for at least one package. Is this close to what you were talking about? Please take a look. The PR is still a bit raw and only demonstrates how this could be implemented.