Reproducibility of GitHub tarballs is not guaranteed

Please read https://github.blog/2023-02-21-update-on-the-future-stability-of-source-code-archives-and-hashes/

GitHub just announced that their generated tarballs will not be byte-for-byte reproducible at some point in the future.

AFAIK Nix and Nixpkgs rely deeply on this… we should have a way to future-proof both projects before it’s too late. Are there any plans or ideas?

Most of the time nixpkgs relies on the content, not the tarball itself.

Therefore this is a non-issue for most things.

See also Github archive checksums may change

I don’t think they were reproducible in the past, at least not over the long term.

fetchFromGitHub unpacks the tarballs before hashing, for exactly this reason.
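
A minimal sketch, assuming a hypothetical owner/repo and a placeholder hash: the `hash` below covers the unpacked source tree (its NAR serialization), not the tarball bytes, so a byte-level change in GitHub’s generated archive doesn’t invalidate it as long as the extracted contents stay identical.

```nix
# Hedged sketch; owner, repo and rev are made up, the hash is a placeholder.
{ lib, fetchFromGitHub }:

fetchFromGitHub {
  owner = "example-owner";
  repo = "example-repo";
  rev = "v1.2.3";
  # NAR hash of the unpacked tree, not of the .tar.gz itself.
  hash = lib.fakeHash; # replace with the real hash once known
}
```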

This was a very smart decision. I wonder who had the idea; I would send some kudos to them :clap:t3:

Is the same true for github:owner/repo flake inputs?

Yes, it applies to github: flake inputs as well.
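
For reference, a sketch of such an input (the flake below is only an example): the lock file pins it with a `narHash` of the unpacked source tree plus the commit `rev`, so here too it is the extracted contents, not the tarball bytes, that get verified.

```nix
# flake.nix sketch; the input is just an illustrative example.
{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";

  outputs = { self, nixpkgs }: {
    # consume nixpkgs here…
  };
}
```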

Actually…

There were valid concerns voiced in a Hacker News thread when the initial announcement happened…

What if there is a vulnerability in $decompressionTool that allows for an RCE? Depending on the details, that could even escape the sandbox if it actively targets Nix and its known weak points (not that I am aware of any; this is a hypothetical case). We currently do not have a way to check authenticity before unzipping, and the GH announcement made this even harder, even if we wanted to fix the issue…

I have to be honest though… Even if we confirm authenticity prior to unpacking, the hash could already have been created against the “defective”/“crafted” archive that exploits the vulnerability.

What is the scenario here?

  1. Some input configuration references a remote tar resource
  2. The input is downloaded and decompressed to compute the hash and validate that it matches the input
  3. The input is malicious and intends to exploit the decompressor
    a. The decompressor is safe and does not get exploited. (no problem)
    b. The decompressor is exploited and chains that vulnerability with another vulnerability in the nix sandbox to pwn the machine.
    c. The decompressor is exploited and the nix sandbox is intact, but the attacker-controlled decompressor exits after creating whatever files it wants that can somehow be derived using only the contents of the tar file (it shouldn’t be able to make additional network requests, etc).
  4. The next step hashes the file tree that was created by the decompressor
    a. The hash doesn’t match because it decompressed differently than the input config expected. Build fails.
    b. The hash does match, meaning that the original hash from the input config was already compromised.
  • Clearly 3a is ideal. Nix should try to ensure the integrity of all the processes in the build pipeline, but this isn’t more impactful than other sensitive code like the TCP handling, or the TLS library, or the HTTP client etc. It’s still useful to acknowledge that the decompression tool should be treated like a sensitive component that handles potentially malicious input.
  • Case 3b comes down to a separate vulnerability in the sandbox. Chained vulns are bad, so we should try to solve it on both sides, but that’s not news.
  • Case 3c splits into case 4a and 4b.
    • 4a: great! Nix correctly failed the build because the hash of the files didn’t match the input config, and at least the machine was not pwned. Now go back to 3a and fix the vulnerability there.
    • 4b: if the user is using a compromised config that generates compromised files, I’m not sure what can be done about that… but at least the exploit was reproducible? /s Really, if the user is using a config that references some compromised input files, then was exploiting the decompressor even necessary?

I’m not sure I see any glaring issues here.

I don’t believe the builtin fetchers run the decompression in a sandbox. Only FODs do that.

Since decompression happens in the FOD (for fetchFromGitHub), it can make network requests, and I’m not sure what all else.
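
To make that concrete, here is a hedged sketch of a hand-rolled fixed-output derivation, roughly the shape that fetchers like fetchFromGitHub boil down to (the URL and the hash are placeholders, not a recommendation): the builder gets network access because its output is pinned by a hash, and with outputHashMode = "recursive" that hash is taken over the resulting file tree rather than over the downloaded archive.

```nix
# Hedged sketch of a fixed-output derivation (FOD); URL and hash are placeholders.
{ pkgs ? import <nixpkgs> { } }:

pkgs.stdenvNoCC.mkDerivation {
  name = "example-source";
  nativeBuildInputs = [ pkgs.curl ];
  buildCommand = ''
    export SSL_CERT_FILE=${pkgs.cacert}/etc/ssl/certs/ca-bundle.crt
    mkdir -p "$out"
    # The download *and* the decompression run inside this derivation,
    # which is granted network access because its output is pinned.
    curl -L "https://example.org/archive.tar.gz" \
      | tar -xz --strip-components=1 -C "$out"
  '';
  # The pinned hash is the NAR hash of $out's file tree, not of the tarball.
  outputHashMode = "recursive";
  outputHash = pkgs.lib.fakeHash; # placeholder; the real hash goes here
}
```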

Purely hypothetical; it was mentioned as a possible threat and a danger of checking only the content.

Also, it is not my scenario. It was built by other commenters in another community, and I cannot find the link to that discussion anymore.

I see. Would it be reasonable to change builtin fetchers to run decompression in a sandbox?

It’s a fine hypothetical. But when broken down it doesn’t seem particularly alarming; since builtin fetchers were never sandboxed, any vulnerable code they use could result in pwning the system, including TLS, HTTP, etc., as well as decompressors, so this doesn’t seem like a big increase in surface area.

I guess the only remaining questions are: do the fetchers expect to handle intentionally malicious inputs, and do we trust them to do so correctly?

I don’t think there would be any fundamental problem with this, however it wouldn’t be easy. The sandboxing code is all built around derivations, which the builtin fetchers are intentionally not.

There are also probably quite a few environmental interactions with the builtin fetchers that people depend on in the wild. One fairly obvious example is downloading git+ssh flake URLs, as that depends on interacting with the user’s ssh config and a running ssh-agent. I have no idea how many other things like that there are.
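
For concreteness, a flake with such an input might look like this (the repository below is hypothetical); fetching it goes through the user’s own ssh setup rather than through a plain HTTPS download:

```nix
{
  # Hypothetical private repository, fetched via the user's ssh config
  # and a running ssh-agent rather than over plain HTTPS.
  inputs.private-lib.url = "git+ssh://git@github.com/example-owner/private-repo";

  outputs = { self, private-lib }: {
    # ...
  };
}
```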

I do think this question and the replies thus far deserve an insightful answer from someone deeply involved with Nix.
Maybe someone from the foundation or even @edolstra might be willing to provide more details about the process?

Which question specifically?

Well, the original question prompted some replies with assumptions and a hypothetical possibility, which at least for me calls for some more clarification.
I am nowhere near technical enough to assess the decompression-tool RCE scenario myself, and I would certainly be reassured to hear from someone who is that it is not, or cannot become, possible in any way.
No pun intended!