AFAIK, there is no clear rule to point to, but if the data is easy to download from within the library (e.g. how you would just run nltk.download() for nltk), and/or the library is not used by a bunch of programs that would need patching to acquire or relocate the data, I think packaging the data might do more harm than good. The binary cache currently stores all sources, for all versions. Considering normal source repositories are pretty small, large corpora and other datasets like this would quickly eat an unreasonable chunk of the budget. It should probably be avoided if possible. But as mentioned earlier, avoiding it might be considered “impossible”, or at least not worth the effort, if it requires a bunch of patching and mitigations to do so.
That being said, we do store some kinds of data-like packages, like icon themes and wordlists, so I’m not sure where to put any clear threshold.
One trouble is that spaCy’s download(...) wants to use pip to download its dataset, so spaCy should be patched to do that in a way that is compatible with Nix.
However, the data package need not be stored in Nix.
As a user, it’s quite cumbersome to have to download the data needed by a package, especially since Nix often doesn’t play well with custom downloaders. E.g. I think nltk tries to download directly to the nltk Python package location, which is in the Nix store.
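For what it’s worth, one workaround for the read-only store problem is to point the downloader at a writable per-user directory before the library is imported. Below is a rough sketch using only the standard library. The helper names (user_data_dir, is_in_nix_store, prepare_nltk_data_dir) are made up for illustration, and the XDG-style location is an assumption rather than anything nltk or Nixpkgs prescribes. It relies on nltk consulting the NLTK_DATA environment variable when building its search path, which is my understanding of how it behaves.

```python
import os
from pathlib import Path


def user_data_dir(app: str) -> Path:
    """Return a per-user data directory for `app` (hypothetical helper).

    Assumes an XDG-style layout: $XDG_DATA_HOME, else ~/.local/share.
    """
    base = os.environ.get("XDG_DATA_HOME") or os.path.join(
        os.path.expanduser("~"), ".local", "share"
    )
    return Path(base) / app


def is_in_nix_store(path) -> bool:
    """True if `path` lies under /nix/store, which is read-only at runtime."""
    return Path(path).parts[:3] == ("/", "nix", "store")


def prepare_nltk_data_dir() -> Path:
    """Create a writable data directory and export it as NLTK_DATA.

    Call this before importing nltk, since the search path is assumed
    to be built from NLTK_DATA at import time.
    """
    target = user_data_dir("nltk_data")
    if is_in_nix_store(target):
        raise RuntimeError(f"refusing to write into the Nix store: {target}")
    target.mkdir(parents=True, exist_ok=True)
    os.environ["NLTK_DATA"] = str(target)
    return target
```

After calling prepare_nltk_data_dir(), you could also pass the directory explicitly, e.g. nltk.download("punkt", download_dir=str(target)), so the download location does not depend on whatever default nltk picks.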
I don’t think it necessarily has to be a separate package; it could also, for example, be a function like spacy.withModels (models: [models.en_core_web_sm]). Either way, it should be possible to make sure Hydra doesn’t build and cache it, e.g. using the mechanism described in the Nixpkgs Reference Manual, so it can be fetched directly from the source by the user.