Let’s not be too quick to treat GDPR as a blocker.
The only hurdle to clear is lawful processing, cf. Art. 6 GDPR (Lawfulness of processing).
One way to see why data about who is accessing the cache can count as lawful processing is the concept of “legitimate interest”, in particular:
processing is necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data, in particular where the data subject is a child.
The cache is a public good of the project, and knowing who is using it is important for ensuring fair access to it.
GitHub users come in two kinds: individuals and entities.
As a reminder, see “Do the data protection rules apply to data about a company?” from the European Commission (answer: no).
In fact, we do not care about cache usage by individuals per se; we only care about legal entities that are not individuals.
Given this, I believe the following points matter if this ever goes ahead:
- NixOS Foundation (or whoever is behind the collection) declares whether this data is collected on an opt-in or opt-out basis. I suggest opt-out, because otherwise we obviously won’t get much data.
- NixOS Foundation publishes a blurb that demonstrates why there is a legitimate interest in knowing, in a more fine-grained fashion, who downloads from the cache. It should explain every way to disable the data collection, what will happen to the data, and when it will be deleted if it is ever collected.
- It’s important to define the granularity of the collected data. Do we want a single total per entity:
<entity>: <total downloaded>
or something more fine-grained than that, such as a per-month figure?
<entity>: <download of month M>
Ideally, all of that data should always be collapsed into coarse-grained metrics.
- I would suggest first enabling this by default only for organizations rather than individual accounts on GitHub, then waiting a few months and deciding, based on the collected information, whether it makes sense to enable it for individuals at all (we can deduce the individual footprint as total footprint minus org footprint anyway).
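To make the granularity question and the "total minus org" deduction concrete, here is a minimal sketch. It assumes download records arrive as (entity, month, bytes) tuples; the function names and the record shape are hypothetical, not an existing cache log format.

```python
from collections import defaultdict


def aggregate(records):
    """Collapse (entity, month, bytes) records into coarse metrics.

    Returns two dicts: total bytes per entity, and bytes per (entity, month).
    The record shape is an assumption made for illustration.
    """
    per_entity = defaultdict(int)
    per_month = defaultdict(int)
    for entity, month, nbytes in records:
        per_entity[entity] += nbytes
        per_month[(entity, month)] += nbytes
    return dict(per_entity), dict(per_month)


def individual_footprint(total_bytes, per_entity):
    """Deduce the individuals' share as total footprint minus org footprint."""
    return total_bytes - sum(per_entity.values())
```

Only the aggregated dicts would need to be retained; the per-request records could be discarded once they are folded in, which is what keeps the stored data coarse.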
If we end up enabling it only for “legal entities”, we should still strive to handle the data under GDPR as we would for individuals, because some of those entities may turn out to be individuals anyway.
As for privacy-conscious individuals who care about the privacy of their CI: many other tools in play perform a similar “phone home” procedure. Ideally, a single environment variable would govern any kind of “phone home” mechanism, which is what https://do-not-track.dev/ proposes. We must honor this environment variable.
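Honoring such a variable could look like the sketch below. It assumes the variable is named `DO_NOT_TRACK` (the name used by the console Do Not Track proposal) and that any value other than empty, `0` or `false` means "do not collect"; the linked page is the reference for the exact semantics, and `tracking_allowed` is a hypothetical helper.

```python
import os


def tracking_allowed(environ=None):
    """Return False when the user has opted out via DO_NOT_TRACK.

    Assumption: an unset or empty variable, "0" or "false" means no
    opt-out was requested; any other value disables data collection.
    """
    env = os.environ if environ is None else environ
    value = env.get("DO_NOT_TRACK", "").strip().lower()
    return value in ("", "0", "false")
```

Any collection code path would check `tracking_allowed()` first and send nothing when it returns `False`.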