Should data for spaCy be a Nix package?

spaCy is a Pip package on PyPI and a Nix package in nixpkgs containing the source code for a natural language processing library.

en_core_web_sm is a Pip package published on spaCy’s GitHub releases containing the data for one pipeline (English, trained on web text, small-ish).

Should en_core_web_sm and its friends be a Nix package in nixpkgs?

Are there other instances that discuss using Nix to manage “data packages”?

AFAIK, there is no clear rule to point to. But if the data is easy to download from within the library (e.g. the way you would just run nltk.download() for nltk), and/or the library is not used by a bunch of programs that would need patching to acquire or relocate the data, I think packaging it might do more harm than good. The binary cache currently stores all sources, in all versions. Normal source repositories are pretty small, so large corpora and other datasets like this would quickly eat an unreasonable chunk of the budget. It should probably be avoided if possible. But as mentioned above, avoiding it might be “impossible”, or at least not worth the effort, if it requires a bunch of patching and mitigations to do so.

That being said, we do store some kinds of data-like packages, like icon themes and wordlists, so I’m not sure where to put a clear threshold.

One trouble is that spaCy’s download(...) wants to use Pip to download its dataset, so spaCy should be patched to do that in a way that is compatible with Nix.

However, the data package need not be stored in Nix.

As a user, it’s quite cumbersome to have to download the data needed by a package yourself, especially since Nix often doesn’t play well with custom downloaders. E.g. I think nltk tries to download directly to the nltk Python package location, which is in the Nix store and therefore read-only.

I don’t think it necessarily has to be a separate package; it could also be a function, for example spacy.withModels (models: [models.en_core_web_sm]). Either way, it should be possible to make sure Hydra doesn’t build and cache it (see the Nixpkgs Reference Manual), so it can be fetched directly from the source by the user.
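
A minimal sketch of what such a function could look like. Note that withModels is hypothetical and not part of nixpkgs, and the sketch assumes the models are exposed as an attribute set at python3Packages.spacy-models:

```nix
{ pkgs ? import <nixpkgs> { } }:

let
  # Hypothetical helper, not an existing nixpkgs function.
  # `select` receives the set of model packages and returns a list of them.
  withModels = select:
    pkgs.python3.withPackages (ps:
      # Assumption: ps.spacy-models is an attribute set keyed by model name.
      [ ps.spacy ] ++ select ps.spacy-models);
in
# Evaluates to a Python environment containing spaCy plus the chosen models.
withModels (models: [ models.en_core_web_sm ])
```

This only covers the selection part; it does not by itself keep Hydra from building or caching the models.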


I found pkgs/development/python-modules/spacy/models.json, which gets imported by pkgs/development/python-modules/spacy/models.nix, but I don’t know where it goes after that.

How do I actually reference the spaCy models (already in Nixpkgs) from a Nix expression?

Not sure why it isn’t in the nixpkgs search, but it looks like it should be available as python3Packages.spacy-models.
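
For example, assuming spacy-models is an attribute set with one attribute per model (which is what models.nix appears to build from models.json), a shell.nix along these lines should give you a Python with spaCy and the small English pipeline:

```nix
{ pkgs ? import <nixpkgs> { } }:

pkgs.mkShell {
  packages = [
    (pkgs.python3.withPackages (ps: [
      ps.spacy
      # Assumption: each model is an attribute of the spacy-models set.
      ps.spacy-models.en_core_web_sm
    ]))
  ];
}
```

Inside that shell, spacy.load("en_core_web_sm") should then find the model as a regular Python package, without going through spaCy’s own download command.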