Parallel fetching

I have a codebase that fetches many packages from GitHub using builtins.fetchGit. I just switched it over to using pkgs.fetchurl instead, pulling tarballs of the packages from a package registry, and it is so much slower. With fetchGit it would fetch as many things as it could in parallel, but after switching to fetchurl, it fetches the packages one at a time. Any idea why this is happening? Is there anything that can be done about it?

edit: Okay, something even more fishy is going on. I’m realizing that when I was using fetchGit, the total number of packages the derivation saw was all of them, but now it sees just the next 2 packages, adding to the total every time it downloads a new one. So it’s probably not the fetchers that are the issue, but maybe some lazy code or something.

Here is the code that is doing the new, slow fetching:

let
  inherit (args.src.registry) version;
  name' = args.src.registry.name or name;

  # Per-package metadata (including the tarball hash) from the registry checkout.
  metadata = l.importJSON "${registry}/metadata/${name'}.json";
  registry-url = "https://packages.registry.purescript.org/${name'}/${version}.tar.gz";

  # Fetches the tarball, verified against the hash recorded in the metadata.
  tarball = p.fetchurl {
    name = "${name'}-${version}.tar.gz";
    url = registry-url;
    inherit (metadata.published.${version}) hash;
  };

  # Unpacks the tarball into the store.
  unpacked = p.runCommand "${name'}-${version}" { } ''
    tar -xzf ${tarball}
    mv ${name'}-${version} $out
  '';

unpacked is the final derivation.

Could you maybe explain what you mean by the derivation “seeing” a package? A package isn’t really a Nix concept as far as I’m aware, since it’s derivations all the way down, so I don’t really understand what you’re trying to say.

fetchurl is only a fancy way of running curl in a derivation AFAIK.


Specifically, a fixed-output derivation.
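For illustration, here is a minimal sketch of what such a fixed-output fetcher boils down to (the URL and name are made up, and lib.fakeHash stands in for the real output hash):

let
  pkgs = import <nixpkgs> { };
in
pkgs.stdenvNoCC.mkDerivation {
  name = "example-1.0.0.tar.gz";
  nativeBuildInputs = [ pkgs.curl pkgs.cacert ];
  buildCommand = ''
    curl -L https://example.org/example-1.0.0.tar.gz -o $out
  '';
  # Declaring the output hash up front is what makes this "fixed-output";
  # it is also why the sandbox allows network access during this build.
  outputHashMode = "flat";
  outputHash = pkgs.lib.fakeHash; # placeholder; the real SRI hash goes here
}

Because the output hash is known in advance, Nix can happily schedule many of these builds at once, so the fixed-output derivations themselves shouldn’t be the bottleneck.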

@ursi are you doing any import-from-derivation? Please temporarily disable that feature using --option allow-import-from-derivation false to verify.

Yes, so what I mean is: as it builds the derivation that fetches all the packages, the total number of derivations that need to be built (the z in the x/y/z display) just goes up by 2 every time a new derivation is built, which is happening one package at a time. Before, that number accurately showed the total for all the packages that needed to be built.

Okay, as per your test, it looks like I have introduced some new IFD.

Okay, the problem is something I conveniently left out of the example code cuz I didn’t think it was relevant. For each package, I am grabbing some important metadata like this:

purs-json = l.importJSON "${unpacked}/purs.json";

This is apparently the IFD that is getting in the way. But why is this putting a bottleneck on everything?

Because eval is single-threaded. See Import From Derivation - Nix Reference Manual.
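Here is a small self-contained example of the problem (the file contents are made up): evaluation cannot proceed past the importJSON until the derivation has actually been built, and since evaluation is single-threaded, each such read blocks everything else.

let
  pkgs = import <nixpkgs> { };

  # Stand-in for `unpacked`: a derivation producing a purs.json.
  unpacked = pkgs.runCommand "ifd-demo" { } ''
    mkdir $out
    echo '{ "dependencies": { "prelude": ">=6.0.0 <7.0.0" } }' > $out/purs.json
  '';

  # Import from derivation: reading this file forces Nix to *build*
  # `unpacked` in the middle of evaluation before eval can continue.
  purs-json = pkgs.lib.importJSON "${unpacked}/purs.json";
in
purs-json.dependencies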


Right now the problem is that I process a list of packages by folding over them, recursively looking at each package’s dependencies (part of the metadata retrieved via the IFD) and building up the dependency closure. This means at each package I stop and do an IFD to get that package’s dependencies. Would it be possible to download the packages in the list strictly up front at each level of that recursion, using seq/deepSeq? I’m trying, but I can’t get it to work.

edit: after putting trace statements in my code, I think I’m wrong about how it’s being evaluated.
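For reference, a self-contained sketch of the deepSeq idea (using stub packages in place of the real fetches) shows why it can’t help: deepSeq does force every element, but the forcing itself is sequential, so each IFD build still runs strictly one after another.

let
  pkgs = import <nixpkgs> { };

  # Stub standing in for fetching and unpacking a real package.
  fakePkg = name: pkgs.runCommand name { } ''
    mkdir $out
    echo '{ "dependencies": {} }' > $out/purs.json
  '';

  # One IFD per package: each importJSON blocks evaluation until its
  # derivation has been built.
  metas = map (n: pkgs.lib.importJSON "${fakePkg n}/purs.json")
    [ "prelude" "effect" "console" ];
in
# deepSeq forces all three reads, but the single eval thread performs
# the builds strictly in order, so nothing overlaps.
builtins.deepSeq metas (builtins.length metas)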

I’m smelling an XY problem here. What are you trying to achieve in the end?

PureScript has a package registry you can download packages from, and they also have a git repo containing a list of all the package hashes. The problem is, the git repo does not have the dependencies of each package; that information is only available from a JSON file inside the package that you download from the registry. I have a program that takes a list of packages and traverses the dependency tree, eventually grabbing the whole dependency closure. The way it is currently done, essentially with centralized codegen, this is a very fast, parallelized process. Trying to switch away from the codegen and use the registry instead has resulted in the problem I have tried to outline above: the packages are being fetched one at a time.
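A self-contained sketch of that kind of traversal (with stub packages whose purs.json lists made-up dependencies) makes the bottleneck visible; every importJSON here is an IFD, so the walk is forced to proceed one build at a time:

let
  pkgs = import <nixpkgs> { };

  # Stub "registry": each package is a derivation whose purs.json names
  # its dependencies.
  depsOf = { a = [ "b" "c" ]; b = [ "c" ]; c = [ ]; };
  fakePkg = name: pkgs.runCommand name { } ''
    mkdir $out
    echo '{ "dependencies": ${builtins.toJSON depsOf.${name}} }' > $out/purs.json
  '';

  # Walk the dependency tree: each step reads purs.json via IFD, which
  # blocks evaluation until that one package has been built.
  go = seen: queue:
    if queue == [ ] then
      seen
    else
      let
        name = builtins.head queue;
        rest = builtins.tail queue;
      in
      if seen ? ${name} then
        go seen rest
      else
        let
          meta = pkgs.lib.importJSON "${fakePkg name}/purs.json";
        in
        go (seen // { ${name} = meta; }) (rest ++ meta.dependencies);
in
builtins.attrNames (go { } [ "a" ]) # => [ "a" "b" "c" ]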

Okay, that’s your attempted solution (X). What did you want to achieve with that (Y)?

I maintain a nix library for working with purescript. It includes a package set with hundreds of purescript packages. What I “want to achieve”, in the sense I think you’re asking, is to be able to download these packages quickly. But I have already achieved this. What I really want to achieve is to be able to do this using the purescript registry alone.

Is there any particular reason you’re attempting to do this within Nix?

This sounds to me like you could just run a tool written in a general-purpose programming language over the package registry, generate a pure-data lockfile of sorts, and then merely import that lockfile in your Nix expressions.

As you have noticed, Nix really isn’t suited to fetching things whose location depends on previously fetched data. It’s a domain-specific expression language, not a general-purpose programming language.
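Concretely, the lockfile approach could look something like this (the file name packages.lock.json and its shape are assumptions for the sketch): an external tool resolves the closure once and records every package’s version and hash, and Nix then turns that pure data into independent fixed-output fetches.

{ pkgs ? import <nixpkgs> { } }:
let
  # Produced ahead of time by a separate tool that walks the registry:
  # { "prelude": { "version": "6.0.1", "hash": "sha256-..." }, ... }
  lock = pkgs.lib.importJSON ./packages.lock.json;

  fetchPkg = name: attrs:
    pkgs.fetchurl {
      name = "${name}-${attrs.version}.tar.gz";
      url = "https://packages.registry.purescript.org/${name}/${attrs.version}.tar.gz";
      inherit (attrs) hash;
    };
in
# No IFD anywhere: every fetch is known up front, so Nix can download
# them all in parallel.
pkgs.lib.mapAttrs fetchPkg lock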

I want to do it with nix cuz nothing integrates with nix more easily than nix haha. I’ve had this project going for a while, since before the registry existed, and it has worked great. I’m only just now seeing this limitation of nix.

JSON also works well with nix. If you do some preprocessing with some other script to generate a JSON lockfile, you’re in business.
