Nix sha256 is bug not feature. solution: a global /cas filesystem

problem: source files and build files are stored in the same CAS format

/nix/store/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-pname-source-1.2.3
/nix/store/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-pname-build-1.2.3

… which is a LOSSY transformation,
since /nix/store is messing with file permissions

solution: store source files in their original format,
to get a LOSSLESS local copy of the source files

/cas/nix/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
/cas/git/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
/cas/sha256/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3.tar.gz

focus!

nix should “do one thing and do it well”:
manage build instructions and build artifacts (“makefiles and buildfiles”)

cas = content addressable store (wikipedia)

/cas/git = content is in “git” format
/cas/nix = content is in “nix” format

existing solutions?

simple question: do we already have an implementation of a global /cas prefix for the filesystem hierarchy standard (FHS)? aka a “meta content-addressable storage, providing one interface to many CAS backends”

we have gitfs, but it does not have the interface that i would expect

but even if this “thing” does not exist yet, it “should” be easy to build, since all the parts exist already, and it's just a matter of “connecting the dots”. this could be called “nix light”, since we would use the same nix build system, and just get rid of the sha256 pinning “non-feature”. instead, we would pin source files ONLY by their native cas hash (git commit, sha256 of source tarball, …)

relevant xkcd comic:

we have 14 standards?? ridiculous!

we should create one universal standard, that covers EVERYone's use cases.

yeah!

soon: there are 15 competing standards.

concept for a global /cas filesystem

hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh = 40 char git COMMIT hash (not tree hash)

the first 2 chars are used as directory name, to make directory-listings smaller.
git uses the same layout for its loose objects (.git/objects/hh/…).

since hashes have high entropy (randomness),
this is a good way to partition many hashes into smaller groups of hashes.

tttttttt-tttttt = human-readable time of commit in UTC timezone, for example 20211031-084210

space and time

so we have location (hash) and time,
which are the two universal properties of any object.

pname = optional package name

1.2.3 = optional package version (“stable release? what is that? just shut up and give me the git HEAD!”)
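
a minimal sketch of how such a path could be assembled, assuming the naming scheme above. the function name `cas_path` and the `root` parameter are made up for illustration, nothing like this exists yet:

```python
from datetime import datetime, timedelta, timezone

def cas_path(backend, hash_hex, commit_time, pname=None, version=None, root="/cas"):
    """Build a path like /cas/git/hh/hhh...-YYYYMMDD-HHMMSS-pname-1.2.3 (hypothetical helper)."""
    # the first 2 hex chars become the directory name, the rest starts the file name
    shard, rest = hash_hex[:2], hash_hex[2:]
    # human-readable commit time, always converted to UTC
    stamp = commit_time.astimezone(timezone.utc).strftime("%Y%m%d-%H%M%S")
    parts = [rest, stamp]
    # pname and version are optional
    if pname:
        parts.append(pname)
    if version:
        parts.append(version)
    return f"{root}/{backend}/{shard}/" + "-".join(parts)

# a commit made at 09:42:10 in a +01:00 timezone is stored as 08:42:10 UTC
when = datetime(2021, 10, 31, 9, 42, 10, tzinfo=timezone(timedelta(hours=1)))
print(cas_path("git", "a" * 40, when, "pname", "1.2.3"))
```

the time is normalized to UTC only for the path name. the original timezone still has to be kept somewhere else, since it is part of the commit object.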

one interface, many backends

the global /cas filesystem gives access
to many different content addressable stores

for example

/cas/ipfs/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
/cas/sha256/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3.tar.gz
/cas/bittorrent/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
/cas/kad/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
/cas/bitcoin/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
/cas/nanocoin/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
/cas/onion/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
/cas/i2p/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
/cas/retroshare/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
/cas/pgp/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
/cas/tahoelafs/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
/cas/gitannex/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
/cas/perkeep/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
/cas/docker/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
/cas/oci/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
/cas/npm/hh/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-tttttttt-tttttt-pname-1.2.3
...

npm packages are pinned by their hash, see package-lock.json.
generally, anything that is “pinned by hash” can be integrated into this system
(“we are the borg … we incorporate all your stuff”)

all content must be explicitly added to the /cas store
so it is not possible to load unknown content only by its hash

when storing content, we must add at least one “remote”
from where the content can be fetched.
we can add multiple remotes, to use mirrors or p2p networks
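
the “explicit add, at least one remote” rule could look roughly like this. it is only a sketch: the class `CasIndex`, its methods and the example.org urls are assumptions for illustration, not an existing api:

```python
class CasIndex:
    """In-memory index: (backend, hash) -> list of remotes (hypothetical)."""

    def __init__(self):
        self._remotes = {}

    def add(self, backend, hash_hex, remotes):
        # content must be added explicitly, with at least one remote
        if not remotes:
            raise ValueError("at least one remote is required")
        known = self._remotes.setdefault((backend, hash_hex), [])
        # further remotes (mirrors, p2p networks) are appended, duplicates skipped
        for url in remotes:
            if url not in known:
                known.append(url)

    def remotes(self, backend, hash_hex):
        # unknown hashes are rejected: nothing is loaded "only by its hash"
        key = (backend, hash_hex)
        if key not in self._remotes:
            raise KeyError(f"unknown content: {backend}/{hash_hex}")
        return list(self._remotes[key])

index = CasIndex()
index.add("git", "a" * 40, ["https://example.org/pname.git"])
index.add("git", "a" * 40, ["https://mirror.example.org/pname.git"])
print(index.remotes("git", "a" * 40))
```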

what exactly is the problem with sha256?

the problem is the “reinvention of the wheel”.

most source files are content-addressable already.
by forcing these files into the “nix” format,
we create overhead by introducing additional hashes
(which probably give the impression of “more security”,
since we all “know” that “sha1 is unsafe” …)
in short: avoid collisions.

but there are much cheaper solutions
for the problem of “collision avoidance”:
simply add more metadata!
especially human-readable metadata,
which is easy to verify “in plain sight”.

every source file has a human-readable name and time.
so let's just use these “natural options”
to make our hashes more collision-safe.

security

if you say “now that megacorp has quantum-crypto,
megacorp can produce fake sha1 hashes and hack my system”
then you probably ignore the fact that many of your tools
have an “oh-so-convenient” autoupdate feature,
which in the old days we called computer-virus,
and you probably ignore the fact that your hardware,
especially “your” processor and network controller,
are closed-source machines with backdoors,
which already give megacorp full access to your digital privacy.

note: autoupdate was aggressively normalized
by such “trustworthy” players as microsoft and google …
so now, projects like the “brave” browser
(users must be “brave” to trust that piece of software)
can easily get away with their “autoupdate by default” dogma,
acting as if “there is a new zeroday exploit in SSL every day,
so we must update every day to be on the safe side …”

south park episode S17E02: Informative Murder Porn

Randy: our content is being blocked and we need it now!

Cable Guy: I’m sorry sir. If you need it now,
perhaps you should switch to another cable company.
[tauntingly] Ohhh there’s not another cable company, is there?
[begins to rub his nipples in circles]
Ohhh, that’s right, we’re the only one in town.

sha256 is overhead

we create overhead by introducing additional hashes, in this case, the infamous sha256 in nix files

in reality, this is just a useless pain in the ass.

consider how we update packages in nix:
we must change both the commit hash AND the sha256 of the source.
why? cos as we all “know”:
“sha1 is unsafe”, “sha1 is unsafe”, “sha1 is unsafe” …

my point is:
sha256 is just another version of “security by obscurity”.

real security would require AUDITING of source code, aka “peer review” in science.
but this is the same problem as with any “fine print”
(terms of use, end user license agreement, manmade laws in the legal system, …)
who the fuck actually reads all this crap?

most of this stuff was specifically designed to be unreadable (to hide backdoors),
and even if you can “read”, there will always be someone,
who will have a different interpretation of the same text
(keywords: class justice, unwritten laws)
in the domain of IT, closed source hardware represents the unwritten laws.

south park episode S15E01: HumancentiPad

Kyle is kidnapped after agreeing to an iTunes user agreement,
and forced to become part of a “revolutionary new product”.

This episode parodies reports
about tracking software built into Apple’s iPads and iPhones,
and also the tediously long end-user license agreements

south park episode S13E03: Margaritaville

“Margaritaville” reflected Parker and Stone’s belief
that most Americans view the economy in the same way as religion,
in that it is seldom understood [obscurity]
but seen as an important, elusive entity.

/nix/store is lossy

problem:

source files stored in /nix/store have different permissions
than the original source files,
so storing source files in /nix/store is a LOSSY transformation,
so it is not trivial to calculate the git TREE hash from the stored source files

solution:

preserve the original file permissions,
AND also store the raw commit object,
to get a LOSSLESS copy of the source files,
which later can be re-used (deduplication, sharing)
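
to see why file modes matter for the tree hash, here is a small sketch that computes git blob and tree ids by hand. it only handles regular files in a single directory (no subtrees, no symlinks), and the helper names are made up:

```python
import hashlib

def git_object_id(kind, body):
    # git hashes "<type> <size>\0" followed by the raw object body
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def tree_id(entries):
    """entries: list of (mode, name, content) for regular files in one directory."""
    body = b""
    # tree entries are sorted by name (plain byte order is enough for files)
    for mode, name, content in sorted(entries, key=lambda e: e[1].encode()):
        blob = bytes.fromhex(git_object_id("blob", content))
        # each entry is "<mode> <name>\0" followed by the 20 raw bytes of the blob id
        body += f"{mode} {name}".encode() + b"\0" + blob
    return git_object_id("tree", body)

script = b"#!/bin/sh\necho hello\n"
print(tree_id([("100755", "run.sh", script)]))  # executable file
print(tree_id([("100644", "run.sh", script)]))  # same content, different mode, different tree id
```

git only records a small set of modes for files (100644 and 100755 for regular files), so the stored copy has to keep at least that information to reproduce the tree hash.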

challenge: the raw commit object is NOT available in the github API.
the github API is lossy at this point,
cos the TIMEZONE of the commit time is missing!
potential workaround: use github’s graphQL API to get the timezone.

read only

to make the store “read only”,
we can use a virtual filesystem (FUSE)
to provide the files with their original metadata (permissions, attributes),
but we simply block ALL write operations to the filesystem.
(could be solved cheaper with a read-only bind mount)

  1. There are ongoing efforts to make the Nix store content-addressed.
  2. Not all sources are addressable by a hash like, e.g. Git or SVN and simply using the sha256 of the tree is a useful abstraction.

Nice blog post. Unsubscribe.

I very much do like the idea of trusting identities provided by tools like git, and namespacing like this could easily solve issues such as:

  1. Discoverability by end-user since it provides more information
  2. Secrets persistence
  3. Hash collision

But I got several concerns:
How does one generate “the first 2 chars are used as directory name, to make directory-listings smaller”? You have not noted it being content addressable. Why does one even require an additional namespace as such?

How does one solve issues like a git commit reference leading to two different results? For example one with the whole history, another with --depth 1. What if git sources are fetched as a tarball - do we identify it with the commit reference anyway?

Then we got the whole thing on “is it necessary to preserve permissions”? Sure, I had a couple of issues with it, nothing major. Might be a good idea to preserve permissions if there was a better solution than a fuse mount.

And what is the difference between /cas/{sha256,docker,oci} - as far as I know all of these are being addressed as sha256 hashes?

Also, may I note that the main issue being solved here is how we fetch sources, and there is a better idea suggested by @Ericson2314 - Use https://archive.softwareheritage.org/ since it already provides normalized content-addressable identifiers out of the box

yepp, fuse can be unstable : /
probably too unstable, so the core /cas filesystem would be just regular files,
with deduplication via hardlinks.
extra features could be implemented with a fuse-mount overlay
(problem with regular files: storing “a million small files” is a waste of inodes)
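
a rough sketch of “regular files plus hardlinks”, assuming a sha256-addressed store directory. `store_file` and `link_into` are hypothetical helpers, and the whole file is read into memory for brevity:

```python
import hashlib
import os
import tempfile

def store_file(store_dir, src_path):
    """Copy src_path into store_dir under its sha256 and return the stored path."""
    with open(src_path, "rb") as f:
        data = f.read()
    digest = hashlib.sha256(data).hexdigest()
    shard_dir = os.path.join(store_dir, digest[:2])
    os.makedirs(shard_dir, exist_ok=True)
    dest = os.path.join(shard_dir, digest[2:])
    if not os.path.exists(dest):
        # first copy of this content: write it once
        with open(dest, "wb") as out:
            out.write(data)
    return dest

def link_into(stored_path, link_path):
    # a hardlink is a second name for the same inode, so no data is copied
    os.link(stored_path, link_path)

with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("same content\n")
    first = store_file(os.path.join(tmp, "cas"), os.path.join(tmp, "a.txt"))
    second = store_file(os.path.join(tmp, "cas"), os.path.join(tmp, "b.txt"))
    print(first == second)  # identical content is stored only once
```

hardlinks only work inside one filesystem, so the store and all its consumers would have to live on the same mount.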

yepp: Content-addressed Nix − call for testers

the opposite of CAS is LAS (location-addressed storage).
for LAS, the additional sha256 is required to “pin the source”

but for CAS sources,
the additional sha256 does not give better security,
only more work for maintainers.
collision avoidance (“sha1 is unsafe”) is achieved by adding pname,
assuming that “local” collisions inside one pname have probability zero
(“what if pnames change? what if pnames collide?” - hmm …)

nope, how we STORE sources in the local filesystem - lossy or lossless

nice! yes, this is useful to fetch the sources

that's just an implementation detail …
/cas could be a virtual filesystem (fuse mount)
so we can implement “variable prefix listing” such as

ls /cas/git/abcd/

to list all known hashes with the prefix abcd
if we know the full hash, we can just say

ls /cas/git/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh/

or

ls /cas/git/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh-pname-1.2.3/

to additionally verify the pname and version of the source
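
the lookup logic behind such a virtual filesystem could be sketched like this. the index is just a dict from hash to (pname, version), and the function names are invented for this example:

```python
def list_prefix(index, prefix):
    """All known hashes that start with prefix, like `ls /cas/git/abcd/`."""
    return sorted(h for h in index if h.startswith(prefix))

def lookup(index, name):
    """Look up 'hash' or 'hash-pname-version' and verify the suffix if given."""
    hash_hex, _, suffix = name.partition("-")
    pname, version = index[hash_hex]  # KeyError for unknown hashes
    if suffix and suffix != f"{pname}-{version}":
        raise ValueError(f"{hash_hex} is {pname}-{version}, not {suffix}")
    return hash_hex

index = {
    "abcd" + "0" * 36: ("pname", "1.2.3"),
    "abcd" + "1" * 36: ("other", "0.1"),
    "ffff" + "2" * 36: ("third", "2.0"),
}
print(list_prefix(index, "abcd"))
print(lookup(index, "abcd" + "0" * 36 + "-pname-1.2.3"))
```

the linear scan in `list_prefix` is fine for a sketch. a real store would more likely use the hh/ directories or a sorted index.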

it could be useful to group git hashes by type, for example

/cas/git/tree/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh/
/cas/git/blob/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
/cas/git/commit/hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

… since these hashes are computed over different object types (the type is part of the hashed header)

the /cas/git filesystem (concept) can deduplicate git objects,
so the shallow-clone version (--depth 1) is part of the deep-clone version

practically, the tarball is fetched by the commit hash.
but this is a lossy operation:
from the source files, we can only compute the tree hash.
to compute the commit hash, we need the tree hash + commit metadata.
so, to validate the source files, we must fetch the commit metadata,
for example from the github API (or gitlab, gitea, cgit, …).
using only the tree hash is not practical (time is only stored in the commit, etc)
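
a sketch of the “tree hash + commit metadata = commit hash” step, for plain commits without extra headers (no gpgsig, no encoding). the function name and the example author are made up:

```python
import hashlib

def commit_id(tree_hex, parents, author, committer, message):
    """author/committer: 'Name <email> <unix time> <tz offset>' as git stores them."""
    lines = [f"tree {tree_hex}"]
    lines += [f"parent {p}" for p in parents]
    # an empty line separates the headers from the commit message
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode()
    # same scheme as for blobs and trees: "<type> <size>\0" + body
    header = f"commit {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

empty_tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
who_utc = "Some One <someone@example.org> 1635669730 +0000"
who_cet = "Some One <someone@example.org> 1635669730 +0100"
print(commit_id(empty_tree, [], who_utc, who_utc, "initial commit\n"))
print(commit_id(empty_tree, [], who_cet, who_cet, "initial commit\n"))  # only the timezone differs
```

the two ids differ although the unix time is the same, which is why a missing timezone makes it impossible to reproduce the commit hash.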

different hash algorithms. oci is pure sha256, so oci would be just an alias of sha256.
docker calls sha256 on the contents of the tar.gz files (command tarsum),
which is much slower than just hashing the tar.gz files

nixpkgs almost always fetches tarballs. Fetching Git is way too slow, and implementing and maintaining a gazillion backends for every content-addressed source delivery doesn’t scale. Also, mirroring tarballs is much easier than mirroring various VCS, which usually need a service running. I don’t want to dismiss these great ideas, but unless you can show an actual working implementation this just amounts to useless bikeshedding.

So I personally don’t have much interest in preserving file permissions since I do not see much value in that.

As for CAS - I like the concept and there is already quite a bit of work being done on that, but there is a distinction between CAS and the XYZ-provided content identifier which is mentioned here. The latter requires moving our fetcher implementations into nix itself (which I am all up for, but that is a topic for a different blogpost :wink: )

The thing that I very much do like is store namespaces (like the mentioned git, nix, sha256, oci, docker, etc). This by design reduces the chances of hash collisions and could be used on a per-derivation-builder basis (if a drv results in git sources: /nix/store/git, if it results in a python package: /nix/store/python, if built by trivial-builders.nix: /nix/store/trivial, etc).

And on top of such namespaces, one could quite easily implement secret management from within the store. Yes, the feature that Nix has been lacking since 2003.

Or at least these are my two cents
