Package URLs (purl) for Nix packages

raboof · May 7, 2023, 2:37pm

There are all kinds of distro-agnostic tools and file formats that try to talk about software packages and versions - recently, there’s a lot of activity around SBOM and security scanning.

Identifying software is a recurring problem for those. One promising emerging standard for this is package-url (purl). I think it is time we start defining how to refer to Nix packages using purls.

This has been discussed in a couple of places, such as Add guix and nix as package types · Issue #149 · package-url/purl-spec · GitHub, The future of the vulnerability roundups, Things to learn from tea.xyz, Add Nix cataloger by wagoodman · Pull Request #1696 · anchore/syft · GitHub and in the #slsa:nixos.org matrix channel (not sure if there’s any public history I can link to?).

To give a super-quick introduction to purl, a purl is typically of the structure scheme:type/namespace/name@version?qualifiers#subpath, where scheme is always pkg, type is a type from the registry at https://github.com/package-url/purl-spec/blob/c02b002f09bdc88a501f62259eec18761957828a/PURL-TYPES.rst, and namespace, name and qualifiers are type-specific. We could define a nix purl type and decide how to populate it.
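To make the structure above concrete, here is a minimal sketch of splitting a purl into its components. It is deliberately simplified and not a spec-complete parser; the `Purl` class and `parse_purl` function are names made up for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional
from urllib.parse import unquote


@dataclass
class Purl:
    """Components of scheme:type/namespace/name@version?qualifiers#subpath."""
    type: str
    namespace: Optional[str]
    name: str
    version: Optional[str] = None
    qualifiers: Dict[str, str] = field(default_factory=dict)
    subpath: Optional[str] = None


def parse_purl(text: str) -> Purl:
    """Split a purl string into its parts (hypothetical helper, simplified)."""
    scheme, _, rest = text.partition(":")
    if scheme != "pkg":
        raise ValueError("a purl always uses the 'pkg' scheme")
    # Peel off the optional parts from the right: subpath, qualifiers, version.
    rest, _, subpath = rest.partition("#")
    rest, _, query = rest.partition("?")
    version = None
    if "@" in rest:
        rest, version = rest.rsplit("@", 1)
    # What remains is type/namespace.../name; the namespace may be empty.
    segments = rest.strip("/").split("/")
    if len(segments) < 2:
        raise ValueError("a purl needs at least a type and a name")
    qualifiers = {}
    for pair in filter(None, query.split("&")):
        key, _, value = pair.partition("=")
        qualifiers[key] = unquote(value)
    return Purl(
        type=segments[0],
        namespace="/".join(segments[1:-1]) or None,
        name=unquote(segments[-1]),
        version=unquote(version) if version else None,
        qualifiers=qualifiers,
        subpath=subpath or None,
    )
```

For example, `parse_purl("pkg:deb/debian/curl@7.50.3-1?arch=i386&distro=jessie")` yields type `deb`, namespace `debian`, name `curl`, version `7.50.3-1` and two qualifiers.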

You’ll notice ‘the same’ software could be present in multiple types. This is intentional and useful: that way you can distinguish between information about ‘software X’ generally and information about ‘software X as packaged in NixOS’.

I think the nix purl type should be symbolic enough so tools have enough information to perform some level of ‘fuzzy matching’, but should also contain all the information needed to know exactly how to recreate that specific build of a package.

Of course, we already have a format to refer to Nix packages: flake URI’s. As a straw man to get the discussion started, I would like to propose a definition of the nix purl type as a sort of different representation of the flake URI (since they have slightly different rules). I came up with some rules for defaults to make it succinct to refer to nixpkgs packages, but keep things general enough to also use this type to refer to any 3rd-party nix package:

pkg:nix/[<org>/]<attr>?<qualifiers>

Where:

org defaults to NixOS when not specified
attr is the attribute path to the package

And the following qualifiers can be added:

type: corresponds to the Flake type. For the NixOS org, for now default to github (though we can reserve the right to change this default in the future, as long as history is kept across forges)
repo: the GitHub repo under the org. Defaults to nixpkgs when the org is NixOS, otherwise to (the first segment of) the attribute path
ref: tag or branch in the repo
rev: revision, which must be part of the ref tree
output: the derivation output, default to out

This leads to the following examples (purl and flake syntax side-by-side):

purl	flake
`pkg:nix/wget`	`github:NixOS/nixpkgs#wget`
`pkg:nix/wget@1.21.3?ref=nixos-unstable&rev=897876e4c484f1e8f92009fd11b7d988a121a4e7`	`github:NixOS/nixpkgs?rev=897876e4c484f1e8f92009fd11b7d988a121a4e7#wget`
`pkg:nix/tiiuea/sbomnix?type=github`	`github:tiiuea/sbomnix#sbomnix`
`pkg:nix/tiiuea/nixgraph?type=github&repo=sbomnix`	`github:tiiuea/sbomnix#nixgraph`
`pkg:nix/python3Packages.enamlx`	`github:NixOS/nixpkgs#python3Packages.enamlx`
`pkg:nix/eicas/omeka-s?type=git+https://codeberg.org&rev=bfe132f6540a175beb432c2c95472f929cbf310f`	`git+https://codeberg.org/eicas/omeka-s-flake?rev=bfe132f6540a175beb432c2c95472f929cbf310f#omeka-s`
`pkg:nix/grub2@2.06?output=doc&ref=nixos-unstable&rev=897876e4c484f1e8f92009fd11b7d988a121a4e7`	`github:NixOS/nixpkgs?rev=897876e4c484f1e8f92009fd11b7d988a121a4e7#grub2!out`
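The defaulting rules of this straw man can be sketched as a small function that turns the purl components into a flake reference. This is only an illustration under a few assumptions: `strawman_to_flake` is a hypothetical name, the `type` qualifier is treated as mandatory outside the NixOS org, `ref` is dropped (as in the rows above that pin a `rev`), and the `output` qualifier is ignored to keep the sketch short.

```python
from typing import Dict, Optional


def strawman_to_flake(attr: str, org: str = "NixOS",
                      qualifiers: Optional[Dict[str, str]] = None) -> str:
    """Map straw-man nix purl components to a flake reference (sketch only)."""
    q = dict(qualifiers or {})
    # type: defaults to github for the NixOS org only (assumption: required otherwise).
    if "type" in q:
        flake_type = q["type"]
    elif org == "NixOS":
        flake_type = "github"
    else:
        raise ValueError("type must be given for orgs other than NixOS")
    # repo: nixpkgs for NixOS, otherwise the first segment of the attribute path.
    repo = q.get("repo") or ("nixpkgs" if org == "NixOS" else attr.split(".")[0])
    if flake_type == "github":
        base = f"github:{org}/{repo}"
    else:
        # e.g. type "git+https://codeberg.org"
        base = f"{flake_type}/{org}/{repo}"
    # A pinned rev goes into the flake reference; ref is not represented here.
    if "rev" in q:
        base += f"?rev={q['rev']}"
    return f"{base}#{attr}"
```

With these rules, `strawman_to_flake("wget")` gives `github:NixOS/nixpkgs#wget`. Note that the codeberg row would, as far as I can tell, need an explicit `repo=omeka-s-flake` qualifier to arrive at the flake reference shown, since the default repo would be `omeka-s`.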

Now this is different from what’s being proposed in syft: they seem to just take the pname (?) and add the output hash. I can see how that is much easier for a filesystem scanning tool such as syft to discover, but it also seems much less useful: it is almost impossible from such a purl to ‘work backwards’ and find the exact derivation without additional context.

Should we ‘allow’ both ‘output-centric’ and ‘input-centric’ purls for the nix type? That seems like while it’d make ‘creating’ purls much easier for some cases, it also might make doing anything useful based on them much harder…

flokli · May 8, 2023, 8:01am

I’m not quite convinced the “pkg source code” makes up a good package identifier.

Multiple revisions of nixpkgs have exactly the same package recipe / literal .drv contents, so you end up with a lot of purls describing the same thing, so it’s not a unique identifier.
The “source” of a package is not clear. Usually, you start with a nixpkgs repo, and then build on top. You rarely bootstrap everything on your own. Normally you have a nixpkgs pin, and slightly override it inside your own override in your own repo. You probably don’t want to lose all references to the nixpkgs pin used, when building a static binary without any references.

The .drv hash, or the output hash(es), however, uniquely describe the package and allow tracing back to the build recipe. It’s just not very nice UX to look it up: cache.nixos.org seems to populate the Deriver field, but doesn’t upload the derivations themselves. IMHO, we should do that, and additionally work on some plugins/tooling to include auxiliary metadata into container images etc - but I wouldn’t want to abuse the purl for that.

raboof · May 8, 2023, 8:38am

Agreed: it is unique in that it precisely points to a particular version of the software, but indeed there would be many identifiers that differ only in the rev/ref fields and “point to the same thing”. I’m not sure that is necessarily a problem compared to using something like output hashes: after all, many changes that will change the output hash will be ‘irrelevant’ in the context of a given tool/use case, so they’ll need to deal with different purl’s for ‘essentially the same’ package anyway.

The “source” of a package is not clear. Usually, you start with a nixpkgs repo, and then build on top. You rarely bootstrap everything on your own. Normally you have a nixpkgs pin, and slightly override it inside your own override in your own repo. You probably don’t want to lose all references to the nixpkgs pin used, when building a static binary without any references

I agree that’s a valid use case: it is useful when the identifier can express not only ‘wget from nixpkgs rev xyz’, but also ‘wget as overridden by Alice in project X’. With this you’ve convinced me that we probably do indeed want to have a way to refer to things that don’t have an attribute path.

The .drv hash, or the output hash(es) however uniquely describe the package

Agreed

and allow tracing back to the build recipe

I’m not entirely convinced there: maybe that works(/can be made to work) for everything that’s in cache.nixos.org, but as you mention above we want to support people slightly overriding things and not lose references. Also I think the nix purl type should support pointing to things that are not in nixpkgs at all (similar to flakes).

IMHO, we should do that, and additionally work on some plugins/tooling to include auxiliary metadata into container images etc - but I wouldn’t want to abuse the purl for that.

I agree we will likely want to do work on plugins/tooling to ‘close the loop’. I’m not sure what exactly you mean by “abuse the purl for that”, I do think that it’s on-topic for this thread to discuss what’s feasible since that might inform what a useful purl format can look like.

wamserma · May 9, 2023, 6:41am

I’d prefer to move a bit from the qualifiers to the namespace and do away with some of the defaults in favour of explicitly stating things:

namespace = org : packageset

where

packageset = repo | repo : branch.

Example: pkg:nix/NixOS:nixpkgs:release-22.11/ponysay@3.0.4?rev=...

Multiple purls pointing to effectively the same package will also occur after release branch-off, not only by rev changing on a single branch. This is actually an important aspect w.r.t. BOMs, as this may signal a change in the transitive dependencies of a package and e.g. in the case of static linking, a difference between a vulnerable and patched version of a binary.
The output hash/part of the store path could still be adopted as an optional qualifier, which helps downstream consumers (e.g. syft) to conflate purls for vulnerability tracking. Plus: a store path can then (probably) be realized from a purl.
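As a small illustration of the `org : packageset` grammar suggested in this post, the namespace could be taken apart like this. The function name `parse_namespace` is hypothetical, and the sketch assumes that neither the org nor the repo contains a colon.

```python
from typing import Optional, Tuple


def parse_namespace(namespace: str) -> Tuple[str, str, Optional[str]]:
    """Split 'org:repo' or 'org:repo:branch' into its parts (sketch only)."""
    parts = namespace.split(":")
    if len(parts) == 2:
        org, repo = parts
        return org, repo, None
    if len(parts) == 3:
        org, repo, branch = parts
        return org, repo, branch
    raise ValueError("expected 'org:repo' or 'org:repo:branch'")
```

For the example above, `parse_namespace("NixOS:nixpkgs:release-22.11")` returns the org `NixOS`, the repo `nixpkgs` and the branch `release-22.11`, with nothing left to be filled in by defaults.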

blaggacao · May 9, 2023, 6:47am

There’s another slight way of looking at this that can become quite useful matching the currently scheduled work of the Nixpkgs Architecture Team.

purl	PkgFun / PkgMod
`pkg:nix/wget`	`wget` PkgMod w/ default version
`pkg:nix/wget@1.21.3`	`wget` PkgMod w/ `version = 1.21.3`

This perspective echoes this sentiment:

There is an emerging tension where the above concern is true for a Nixpkgs Style cataloguing repository, but where it is (mostly) not true for a flake style in-tree nix-based build pipeline.

I think that distinction may be important.

Thank you for getting the discussion going again!

raboof · May 9, 2023, 10:30am

I like the : notation. I’m not sure about removing some of the defaults, I found the succinct id’s for common cases pretty nice, but I’m not opposed to it. It would be nice not to be too GitHub-centric, so while for NixOS/nixpkgs we could default to github, this should probably remain mandatory to specify for other nix purls?

Yes, I’d say a change in transitive dependencies should definitely be expressible in the purl. @flokli’s criticism of using rev is that rev (unlike the output_hash) may change even when the transitive dependencies don’t change.

j-k · May 9, 2023, 12:31pm

thanks for raising this for discussion. I think it’s great we can flesh out some of the key decisions early before submitting to the PURL spec repo.

Some of the existing nixpkgs versioning discussion would come in useful here too

Is this something where content-addressed nix could come in really useful?

It might also be good to involve some of the core nix devs in its design

wamserma · May 9, 2023, 1:09pm

We should leave as little room for interpretation as possible (that is, no default based on values of other fields) as downstream consumers will misinterpret these.

That is why I suggested an output-hash qualifier. Aside from that, output hashes do not necessarily reflect the build tool/package definition used. A reproducible static binary may be produced by Nix or by OpenEmbedded (assuming same source repo and similar compiler versions/settings). Both just provide the build infrastructure, then delegate to make.

wamserma · May 9, 2023, 1:24pm

Not core nix devs, but people working on Nix + SBOMs:
@henrirosten (GitHub - tiiuae/sbomnix: A suite of utilities to help with software supply chain challenges on nix targets)

henrirosten · May 10, 2023, 8:48am

Quick comment from the vulnerability scanners’ viewpoint, which is one potential downstream use of this data:

Most current vulnerability scanners identify packages based on CPE since that’s what NVD supports.
OSV is one example that supports purl; however, the nix ecosystem is not currently supported in OSV.

IMHO, mapping nix packages to purl will be useful and this discussion is surely needed. However, CPE seems to be the most widely used currently, so the ability to map nix packages to CPEs accurately (as accurately as possible) would have more concrete benefits right now.

FRidh · May 10, 2023, 9:35am

I agree with @wamserma that we should not default to NixOS/Nixpkgs (or GitHub) and be explicit instead.

Would it be an option to have two purl types for Nix? One corresponding to an evaluation attribute, and another to the evaluated derivation?

That could be another purl type, entirely independent from Nix. pkg:ca/<hash>, or instead of ca the hash type used.

FRidh · August 19, 2023, 10:50am

For Python packaging there is now a PEP for describing external (native) dependencies, using purl. Discussion

balsoft · May 27, 2024, 11:36am

After reading this discussion and the PURL spec, I’ve come up with my own proposal.

The main consideration was that it’s not always possible to locate a given Nix package, since the provenance is not always recorded, as is the case with channels; and a given exact package might be present in multiple locations, such as different nixpkgs revisions or forks.

Also, I believe the “main” triplet of namespace/name@version must uniquely correspond to a particular derivation.

As such, I think we should disregard using namespace and instead rely on optional qualifiers to locate the package if possible.

Here is my proposal:

Let package be a Nix derivation output (e.g. nixpkgs#pkgs.hello.out);

type is a constant nix;
No namespace
name is the package name; precisely speaking, it is the output of (builtins.parseDrvName package.name).name (e.g. hello)
version is the file name of the package derivation file, sans the .drv extension; precisely it is builtins.concatStringsSep "." (lib.init (lib.splitString "." (builtins.baseNameOf package.drvPath))). (e.g. qb6j8v8z50shmrgsj2pk4fwrk2ff5jpn-hello-2.12.1)
Optional qualifier flakeRef is the url-encoded locked flake-ref of the flake from which this package was evaluated, if applicable&known. Mutually exclusive with channelUrl (e.g. github%3Anixos%2Fnixpkgs%2F074522643cc9ccbb871ca3b31ed599e9b1b7b5a2)
Optional qualifier channelUrl is the url-encoded absolute http(s) URL pointing to a tar.gz archive containing default.nix, from which package was evaluated, if applicable&known. Mutually exclusive with flakeRef (e.g. https%3A%2F%2Fgithub.com%2Fnixos%2Fnixpkgs%2Farchive%2F074522643cc9ccbb871ca3b31ed599e9b1b7b5a2.tar.gz)
Optional qualifier attrPath is the (fully-qualified for flakes) attribute path from which the package was evaluated, if known and applicable (e.g. legacyPackages.x86_64-linux.hello.out for flakes or hello.out for channels)
Required qualifier outputName is package.outputName (e.g. out)
Optional qualifier outPathCA is the url-encoded output path, only present if the package is content-addressed and the output path is relevant&known; precisely package.outPath
Optional qualifier substituter is a url-encoded URL pointing to a substituter (binary cache) in which the output path is present.

Versions should be considered opaque and non-ordered, as is the current practice in nixpkgs; as such, the only possible comparison is of version equality.

Here is the example for nixpkgs#hello.out in full: pkg:nix/hello@qb6j8v8z50shmrgsj2pk4fwrk2ff5jpn-hello-2.12.1?flakeRef=github%3Anixos%2Fnixpkgs%2F074522643cc9ccbb871ca3b31ed599e9b1b7b5a2&attrPath=legacyPackages.x86_64-linux.hello.out&outputName=out

And here is an example for nix-build '<nixpkgs>' -A hello: pkg:nix/hello@qb6j8v8z50shmrgsj2pk4fwrk2ff5jpn-hello-2.12.1?attrPath=hello&outputName=out

Notice how even though the source of the package is different, and in the second case it’s impossible for Nix to locate the package source on the internet, we still end up with the same first part of the PURL.

For cases when it is possible to locate the source (whether flake or not), we provide our consumers with a way to fetch it and evaluate the package.
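The construction described in this proposal can be sketched as follows. The helper names are made up for this illustration, and the `/nix/store/...drv` path used with it is an assumption (the proposal only gives the file name without the `.drv` extension).

```python
from urllib.parse import quote


def version_from_drv_path(drv_path: str) -> str:
    """File name of the derivation file without its final extension.

    Mirrors the Nix expression in the proposal: take the base name,
    split on '.', drop the last element and join again.
    """
    base = drv_path.rsplit("/", 1)[-1]
    return ".".join(base.split(".")[:-1])


def build_nix_purl(name: str, drv_path: str, output_name: str, **optional: str) -> str:
    """Assemble a purl per the proposal (hypothetical helper, sketch only)."""
    # Optional qualifiers (flakeRef, channelUrl, attrPath, ...) come first,
    # followed by the required outputName qualifier.
    qualifiers = dict(optional)
    qualifiers["outputName"] = output_name
    # safe="" makes quote() percent-encode ':' and '/' as in the examples.
    query = "&".join(f"{key}={quote(value, safe='')}" for key, value in qualifiers.items())
    return f"pkg:nix/{name}@{version_from_drv_path(drv_path)}?{query}"
```

Calling it with the assumed drv path `/nix/store/qb6j8v8z50shmrgsj2pk4fwrk2ff5jpn-hello-2.12.1.drv`, the `flakeRef` and the `attrPath` from above reproduces the first example string; passing only `attrPath="hello"` reproduces the second.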

raboof · May 27, 2024, 2:57pm

Thanks for your thoughtful input!

Notice how even though the source of the package is different, and in the second case it’s impossible for Nix to locate the package source on the internet, we still end up with the same first part of the PURL.

For cases when it is possible to locate the source (whether flake or not), we provide our consumers with a way to fetch it and evaluate the package.

I think those are sensible properties.

Also, I believe the “main” triplet of namespace/name@version must uniquely correspond to a particular derivation.

Why? I don’t think this is true for other purl types (i.e. pkg:deb/debian/curl@7.50.3-1?arch=i386&distro=jessie seems like it may be slightly different depending on when/where/how you build it).

I think including the derivation hash makes the version field overly specific: for instance, if I want to express, “hello 2.12.1 is vulnerable to CVE-2024-foo”, how would I do that? Enumerating all derivation hashes of hello 2.12.1 is unfeasible. It’s true that just saying “hello 2.12.1 is affected” is also imprecise, as some derivations of 2.12.1 may have a patch for CVE-2024-foo applied - but I think that should be solved by taking patch information into account like sbomnix and vulnix do - by no means perfect, but I don’t see another way, and I don’t see including the derivation hash in the version as helpful. For those cases where it’s useful, though I can’t think of any, you could still get it from the attribute.

(the above is my main response, what comes below is perhaps more nitpicky)

Optional qualifier flakeRef is the url-encoded locked flake-ref of the flake from which this package was evaluated, if applicable&known

I wonder if we should have this information in one field or split it out into its parts

Optional qualifier channelUrl is the url-encoded absolute http(s) URL pointing to a tar.gz archive containing default.nix, from which package was evaluated, if applicable&known

We should probably also allow pointing to other things than .tar.gz’s here, e.g. git repositories?

Optional qualifier attrPath is the (fully-qualified for flakes) attribute path from which the package was evaluated, if known and applicable (e.g. legacyPackages.x86_64-linux.hello.out for flakes or hello.out for channels)

I wonder if we should remove .out here (as it can be derived from the outputName), and perhaps similarly derive legacyPackages.x86_64-linux from system?

RaitoBezarius · May 27, 2024, 4:03pm

Not so sure about that, it doesn’t seem to be easy to normalize, is it? Should we take non-canonical representatives of the set of equivalent system classes?

balsoft · May 27, 2024, 7:42pm

@raboof thanks for the detailed feedback!

Yes, but AFAIU that’s mainly because Debian doesn’t provide a common “hash of all dependencies” in the same way as Nix does. For an example where there is such a hash, see OCI images (The version is the sha256:hex_encoded_lowercase_digest of the artifact and is required to uniquely identify the artifact.) and Docker images (The version should be the image id sha256 or a tag. Since tags can be moved, a sha256 image id is preferred.).

I think including the derivation hash makes the version field overly specific: for instance, if I want to express, “hello 2.12.1 is vulnerable to CVE-2024-foo”, how would I do that? Enumerating all derivation hashes of hello 2.12.1 is unfeasible.

Maybe something like *-2.12.1 would work? And then, for cases when only a small list of particular derivations is affected (e.g. a dependency update/patch quickly solved the issue without a version update), you have the ability to list those precisely.
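A consuming tool could implement such a pattern with ordinary glob matching. The sketch below is only an illustration: the wildcard is not something this thread establishes as part of purl, and `version_matches` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def version_matches(pattern: str, version: str) -> bool:
    """Case-sensitive glob match of a version against a pattern (sketch only)."""
    return fnmatchcase(version, pattern)
```

With this, `version_matches("*-2.12.1", "qb6j8v8z50shmrgsj2pk4fwrk2ff5jpn-hello-2.12.1")` is true, and an exact derivation can still be listed by using its full version string as the pattern. One caveat: a bare `*-2.12.1` also matches any other string that happens to end in `-2.12.1`, so the pattern is only meaningful together with the package name.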

For those cases where it’s useful, though I can’t think of any, you could still get it from the attribute.

Indeed, I agree it won’t often be useful for vulnerability scanning, but it is useful in the general context of purl, in that it identifies the package more precisely and uniquely.

I’ve spent some time deliberating on where exactly to split the full flakeRef. I think it makes more sense to keep the reference to the flake as a single field, because the flake-ref syntax is quite complex and has various URI “parts” depending on the “scheme”. Inheriting this complexity into purl can lead to the need for updates of the schema in the future, which I think we should avoid if possible. However, I’ve come to the conclusion that it’s better to split off the attrPath from the flakeRef to keep compatibility with channelUrl and lack of any source.

I think this should be something which nix accepts as a --file argument or part of NIX_PATH. AFAIR there’s no documentation on that, and it only accepts local files and http(s) URLs pointing to tar.gz files, but I might be wrong here.

I was considering the same, but I think it makes sense to keep the output here and also as outputName, since it’s not clear how to derive one from another and vice versa (consider a flake which re-exports zlib.dev as outputs.myPackageScheme.zlib-dev, overriding dev to be something else; then, just appending outputName to the attrPath would possibly break things, and it’s impossible to infer outputName from the path). Ditto for system: the flake output schema is not strict, there can easily be a custom output schema which doesn’t use ${system} in the attrPath at all.

balsoft · May 27, 2024, 7:43pm

I’m not sure I’m following; the system qualifier is simply copied from the system attribute of a derivation, which is part of the derivation hash. Why would we normalize it at all?

RaitoBezarius · May 27, 2024, 9:53pm

Well, maybe a more general question: does the purl format you propose have a canonical representation? That is, do we have purl(p) == purl(p') <=> p = p', where = is some reasonable equality on packages and == means canonicalize(purl(p)) = canonicalize(purl(p'))?

It seems to me that system is not canonical, so you do not have this property, and that seems undesirable (?).
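To illustrate what such a canonicalization step might look like, here is a minimal sketch that only normalizes the order of qualifiers. Both function names are hypothetical, and the sketch assumes a purl without a subpath. It does nothing about values that have several equivalent spellings, which is exactly the concern raised here.

```python
def canonicalize(purl: str) -> str:
    """Return the purl with its qualifiers sorted by key (sketch only)."""
    head, sep, query = purl.partition("?")
    if not sep:
        return purl
    pairs = sorted(query.split("&"), key=lambda pair: pair.partition("=")[0])
    return head + "?" + "&".join(pairs)


def same_purl(a: str, b: str) -> bool:
    """The == relation from the question: compare canonical forms."""
    return canonicalize(a) == canonicalize(b)
```

Two purls that differ only in qualifier order compare equal under `same_purl`, while two purls whose qualifier values are spelled differently but mean the same thing would still compare unequal unless those values are normalized as well.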

lordkekz · May 28, 2024, 12:10am

Hi, sorry to interrupt. As a relatively new Nix user (but one very interested in making SBOMs from Nix derivations), I’m not sure I understand.

I agree that it is desirable to have a canonical form for nix purls.

It wasn’t immediately obvious to me whether system is canonical. Could it be that x86_64-linux vs amd64-linux are both valid and refer to the same thing?

Why do we need system as a required qualifier? It’s not required to uniquely identify the package. It’s just needed to build the package. But with only the required qualifiers it’s still not enough to build the package (attribute Path and channel / flakeRef are missing). So system might as well be optional?

After all, we cannot hope to make a canonical purl which includes a flakeRef or attrPath, since there are likely many equivalent flakeRefs and attrPaths which cannot easily be determined.

ehmry · May 28, 2024, 4:13am

All these examples are torturous, ugly, and vague because Nix and purl are antithetical to each other. It’s never going to work.

If you are listing a Nix path in an SBOM then use a component with properties in a nix: namespace - Pre-RFC: CycloneDX BOM taxonomy.