Atemu
July 14, 2024, 9:53am
2
Follow-up analysis of the effect of other lockfiles and automatically generated files on the size of the Nixpkgs tarball:
opened 09:52AM - 14 Jul 24 UTC
6.topic: hygiene
9.needs: maintainer feedback
significant
# Introduction
The size of the Nixpkgs tarball places a burden on the internet connections and storage systems of every user. We should therefore strive to keep it small. Over the past years that I've been contributing, it has more than doubled in size.
In https://github.com/NixOS/nixpkgs/issues/327063 I discovered quite negative effects of `Cargo.lock` files in the Nixpkgs tree, with just 300 packages bloating the compressed Nixpkgs tarball by ~6MiB.
Here I'd like to document the status quo of sizes of lockfiles found in Nixpkgs and other automatically generated files of significant size.
# Methodology
- `ncdu --apparent-size` on the nixos-24.05 tree (a046c1202e11b62cbede5385ba64908feb7bfac4)
- Manual look through the tree
- Looked at everything where the directory is larger than your average `Cargo.lock` file (a few dozen KiB)
- Only considered files that were obviously auto-generated
- e.g. not kodi add-ons; they're all separate drvs updated separately
- Compressed sizes were measured using `gzip -9 < file | wc -c` or `tar -cf - files... | gzip -9 | wc -c`
- Lockfiles were measured either manually or with these commands:
Amounts:
```
$ for file in Cargo.lock composer.lock package-lock.json yarn.nix yarn.lock gemset.nix Gemfile.lock ; do echo -n "$file " ; fd -t f "^$file\$" | wc -l ; done
```
Sizes:
```
$ for file in Cargo.lock composer.lock package-lock.json yarn.nix yarn.lock gemset.nix Gemfile.lock ; do echo -n "$file " ; fd -t f "^$file\$" -x sh -c 'gzip -9 < {} | wc -c' | jq -s 'add' ; done
```
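The two loops above depend on `fd` and `jq`. A roughly equivalent sketch that only needs `find`, `gzip` and `awk` could look like the following. `compressed_total` is a hypothetical helper name, not something used for the original measurements, and `find` does not skip hidden or git-ignored files the way `fd` does by default, so the numbers may differ slightly.

```shell
# Sum the `gzip -9` compressed sizes of every file called NAME below DIR.
# Prints "<number of files> <total compressed bytes>".
# (Hypothetical helper; NAME is passed to `find -name`, so it is treated as a glob.)
compressed_total() {
  dir=$1
  name=$2
  find "$dir" -type f -name "$name" -exec sh -c '
    for f in "$@"; do gzip -9 < "$f" | wc -c; done
  ' sh {} + | awk '{ n++; total += $1 } END { printf "%d %d\n", n, total }'
}
```

Run from the root of a Nixpkgs checkout, `compressed_total . Cargo.lock` would print the count and the summed compressed size for that lockfile type on one line.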
# Results
Numbers for the lockfiles and patches are `(total bytes)` or `(total bytes / number of files = average per file)`
- Lockfiles
- Cargo.lock (5986458 / 316 = 18944.5)
- composer.lock (185411 / 14 = 13243.6)
- package-lock.json (923349 / 17 = 54314.6)
- info.json (51904 / 2 = 25952) (electron)
- yarn.nix (41356 / 1 = 41356)
- yarn.lock (661464 / 5 = 132293)
- gemset.nix (262092 / 141 = 1858.81)
- Gemfile.lock (86498 / 138 = 626.797)
- bazel_7 locks (105719 / 3 = 35239.7)
- nuget deps.nix (489003 / 67 = 7298.55)
- patches (2807106 / 3929 = 714.458)
- Particularly large patches:
- glibc patch
- terraform-docs
- hackage-packages (2435846)
- node2nix
- elm/packages (180222)
- node-packages (843310)
- netlify-cli (111438)
- cran-packages (1473653)
- lisp-modules (319315)
- android-env (255090)
- cuda-modules (109929)
- tree-sitter/grammars/ (20964)
- elisp-packages (1183087)
- jetbrains/{brokenplugins,idea_maven_artefacts}.json (273277)
- vim/plugins (174661)
- vscode extensions (36921)
- firefox-bin (33067)
- libreoffice (364120)
- kde (13013)
- gnome/extensions.json (613383)
- perl-packages.nix (2324210)
## Notable non-generated files
For comparison and out of interest I also recorded the compressed sizes of notable files that were made by hand:
- The almighty all-packages.nix (251060)
- python-packages.nix (76738)
- aliases.nix (24372)
- haskell-modules/configuration-hackage2nix/broken.yaml (75753)
- haskell-modules/configuration-common.nix (40859)
- maintainer-list.nix (143647)
- doc (317070)
- lib (233078)
- nixos (3595412)
# Analysis
Lockfiles contribute greatly to the compressed size of the Nixpkgs tarball. In total, 8793206 Bytes ~= 8.4MiB out of the ~41MiB (~20%) can be attributed to lockfiles used in individual packages. The biggest offender by far is Rust packages' `Cargo.lock` files, which are analysed in more detail in https://github.com/NixOS/nixpkgs/issues/327063.
The worst offenders in terms of Bytes per package are packages which lock their yarn dependencies, at ~130KiB/package. These are fortunately rare but still add up to ~600KiB.
The next worst appears to be `bazel_7` which single-handedly requires ~100KiB of compressed data.
Other notably bloated packages are those with a `package-lock.json`, at ~50KiB/package, and electron's two `info.json` files, which combine to ~50KiB.
Patches also present a significant burden for the compressed tarball size. Individually they're usually quite small, but they're very common, adding up to ~2.7MiB.
All automatically generated files discovered here (package lockfiles + set lock files) sum up to 19558712 Bytes ~= 18.6 MiB (compressed) which is about half the size of the Nixpkgs tarball.
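The conversions in this section can be re-checked with a few lines of shell. This is only a sketch: `to_mib` and `share_percent` are hypothetical helper names, and the ~41MiB tarball size is taken from the text above rather than measured here.

```shell
# Convert a byte count to MiB, rounded to two decimals.
to_mib() {
  LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1048576 }'
}

# Share of the tarball in percent; the tarball size is given in MiB.
share_percent() {
  LC_ALL=C awk -v b="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * b / (t * 1048576) }'
}

to_mib 8793206             # per-package lockfiles
to_mib 2807106             # patches
to_mib 19558712            # all generated files found
share_percent 8793206 41   # lockfiles relative to the ~41MiB tarball
share_percent 19558712 41  # all generated files relative to the tarball
```

This prints 8.39, 2.68 and 18.65 (MiB), followed by 20.5 and 45.5 (percent), which matches the rounded figures quoted above.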
# Discussion
- Should huge lockfiles continue to be allowed in Nixpkgs?
- Sometimes they might be the only option?
- Should we impose a Byte limit per package?
- Some packages are clearly out of hand, requiring >100KiB each
- If every package did that, the nixpkgs tarball would approach 10GiB in compressed size
- Even if you think hundreds of KiB is fine, would it be okay for a single package to use multiple MiB? Multiple dozen MiB?
# Solutions
There are a few measures that could be taken to reduce file size of generated files:
## Summarise hashes (e.g. vendorHash)
Rather than hashing a bunch of objects individually, hash a reproducible record of all objects. This is already the status quo for e.g. `buildGoModule`.
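To illustrate the idea only, here is a minimal shell sketch that collapses a whole directory of fetched dependencies into one digest by hashing a sorted manifest of per-file hashes. It is not how Nix computes `vendorHash` (Nix hashes a serialisation of the fixed-output derivation's result); `tree_hash` is a hypothetical name, and GNU `sha256sum` is assumed to be available.

```shell
# Print a single SHA-256 digest that covers every regular file below DIR.
# The manifest is sorted by path so the result does not depend on the order
# in which `find` happens to list the files.
# (Illustration only; hypothetical helper, assumes GNU coreutils.)
tree_hash() {
  (
    cd "$1" &&
      find . -type f -exec sha256sum {} + |
      LC_ALL=C sort -k 2 |
      sha256sum |
      cut -d ' ' -f 1
  )
}
```

However many files the directory contains, only the one 64-character digest would need to be recorded, which is where the space saving comes from. The flip side is that the digest is only useful if the directory contents are reproducible.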
## Record less info
Some info is not strictly necessary to record for the lock files to function. For each elisp package for instance, at least two commit ids and two hashes are recorded. Commit IDs could probably be dropped entirely here which would reduce the compressed file size by 1/3.
## Fetch files rather than vendoring them
Oftentimes, files required for some derivation are available from an online source. Fetching the file rather than vendoring it into the nixpkgs tree reduces the space required to a few dozen Bytes (~32 Bytes for the hash and a similar amount for the URL).
This is especially relevant for patches as those are frequently available elsewhere. Use `pkgs.fetchpatch2` in such cases.
## Lock an entire package set
Lockfiles usually represent a set of desired transitive dependency versions that some language-specific external SAT solver spat out. These are frequently duplicated because many separate packages use the same libraries but are often not exact duplicates due to differences in upstream-defined dependency constraints.
Instead, it is possible to record one large snapshot of the latest desirable versions of all packages in existence in some ecosystem and have dependent packages use the "one true version" instead of their externally locked versions.
It also provides efficiency gains as dependencies are only built once and brings us closer to what the purpose of a software distribution has traditionally been: Integrate one set of packages.
This approach is used quite successfully by e.g. `haskellPackages`, which measures just 133 Bytes per package.
This is not feasible for all ecosystems however, as just the names of all 3330720 npm packages (no hashes) are ~20MiB compressed and the hashes would be at least another 100MiB. Perhaps a subset approach could be used though: only accepting packages into the auto-generated set that are depended upon at least once in Nixpkgs.
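The npm estimate can be reproduced with a quick calculation. The sketch below assumes one raw 32-byte SHA-256 digest per package, which is effectively incompressible; the helper names are hypothetical.

```shell
# Raw size in MiB of COUNT hashes of HASH_BYTES bytes each.
hashes_mib() {
  LC_ALL=C awk -v n="$1" -v h="$2" 'BEGIN { printf "%.1f\n", n * h / 1048576 }'
}

# Average bytes per entry when TOTAL_MIB is spread over COUNT entries.
bytes_per_entry() {
  LC_ALL=C awk -v m="$1" -v n="$2" 'BEGIN { printf "%.1f\n", m * 1048576 / n }'
}

hashes_mib 3330720 32       # one SHA-256 digest per npm package
bytes_per_entry 20 3330720  # compressed bytes per package name
```

This prints 101.6 (MiB of hashes) and 6.3 (compressed bytes per name), so a full npm snapshot would come to roughly 36 Bytes per package before any version information is stored.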
# Future work
- Calculate and analyse bytes / package for package sets
- Some lockfile formats were perhaps not recognised as such or aren't actually lockfiles
IIRC, one of the reasons we switched away from `cargoHash` was that the output of `cargo vendor` was not guaranteed to remain stable (see e.g. "Recompute all cargoSha256/cargoHash", NixOS/nixpkgs issue #121994, and "`cargoHash` might be different on linux and darwin systems", NixOS/nixpkgs issue #308089). We would probably need to work with upstream to add more checks to ensure this cannot happen again.
As another example, the Composer vendor directory was not reproducible in the past until @drupol fixed it. I'm not sure how much Composer upstream cares about this use case or how well it is tested to prevent regressions, but so far the `vendorHash` introduced in `mkComposerRepository` appears to have been stable.
Also relevant is the now closed RFC109: Nixpkgs Generated Code Policy