Data dump of all nixpkgs pull requests

During the OceanSprint I wrote a script to dump all nixpkgs pull requests (metadata only; you can use git to get the actual code content and correlate it with the git SHA references in each record).
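
For instance, each record carries the PR number, and GitHub publishes every pull request under `refs/pull/<number>/head`, so a sketch like the following can pull the matching code into a local nixpkgs clone (the `number` field name follows GitHub's REST API; check it against the actual dump):

```python
import json
import subprocess

# Hypothetical example: fetch the code behind the first PR record in the dump.
# Assumes a local nixpkgs clone in ./nixpkgs and a decompressed prs.jsonl.
def fetch_pr_head(repo_dir: str, pr_number: int) -> None:
    # GitHub exposes every pull request under refs/pull/<number>/head.
    subprocess.run(
        ["git", "fetch", "origin",
         f"pull/{pr_number}/head:refs/remotes/origin/pr/{pr_number}"],
        cwd=repo_dir, check=True,
    )

with open("prs.jsonl") as f:
    record = json.loads(f.readline())
fetch_pr_head("nixpkgs", record["number"])
```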

The script is designed so that it can be restarted if it is interrupted or the GitHub API fails. The same mechanism can also be used to update the file incrementally.
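
The basic pattern looks like this: page through the REST API, append each PR to the JSONL file, and persist the last fetched page so a rerun continues where it left off. A minimal sketch of that idea, not the actual script (it assumes the `requests` library and a `GITHUB_TOKEN` environment variable):

```python
import json
import os
import requests

API = "https://api.github.com/repos/NixOS/nixpkgs/pulls"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
STATE_FILE = "scrape_state.json"  # checkpoint: the next page to fetch

def load_page() -> int:
    try:
        with open(STATE_FILE) as f:
            return json.load(f)["page"]
    except FileNotFoundError:
        return 1  # fresh scrape

page = load_page()
while True:
    resp = requests.get(
        API,
        headers=HEADERS,
        params={"state": "all", "per_page": 100, "page": page},
        timeout=30,
    )
    resp.raise_for_status()  # if this raises, a rerun resumes at the saved page
    prs = resp.json()
    if not prs:
        break  # past the last page
    with open("prs.jsonl", "a") as out:
        for pr in prs:
            out.write(json.dumps(pr) + "\n")
    page += 1
    with open(STATE_FILE, "w") as f:
        json.dump({"page": page}, f)
```

A crash between writing a page and writing the checkpoint would duplicate that page on restart, so a real implementation should deduplicate by PR number when resuming.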

The code is here in case someone wants to make this a project: Scrape all nixpkgs pull requests · GitHub

A full scrape takes less than a day on a 1 Gbit/s connection. Here is the result: prs.jsonl.zstd - Google Drive (compressed with zstandard)

The file is in JSON Lines format: https://jsonlines.org/
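
Reading it back is one JSON object per line; for example (assuming the `zstandard` Python package):

```python
import io
import json
import zstandard

# Stream-decompress the dump and parse one PR record per line.
with open("prs.jsonl.zstd", "rb") as raw:
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        pr = json.loads(line)
        print(pr["number"], pr["title"])  # standard GitHub API fields
```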


That’s only 167M in size! I expected this to be much larger. Grabbing a copy for my archive.


Issues would be good too.

While discussing GitHub archiving with @RaitoBezarius on Matrix I found this tool, which looks pretty good for the job and even supports incremental updates:

Might be interesting to experiment with running that continuously on our infra.