Tracking NixOS cache usage by User-Agent

zimbatm · November 2, 2023, 5:02pm

We discovered that most of our Fastly traffic is coming from North America, but we don’t know exactly from where. It would be interesting to find out!

This is a question that came out of the unofficial nix-archeologist team. We (I mean @flokli and @edef really) are doing some analysis on the nix-cache S3 bucket to find out more about its content and gain more insight on how to reduce its size. Adjacent to that are also all of the Fastly logs that help us determine the popularity of store paths.

One idea for answering this question is to configure Nix installers to send extra metadata through the user-agent-suffix config. That’s the proposal I made in Better cache analytics trough custom User-Agent · Issue #198 · cachix/install-nix-action · GitHub, for example.
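
To make the idea concrete, here is a rough sketch of what an installer step could compute. It is only an illustration: `user-agent-suffix` is the Nix setting mentioned above and `GITHUB_REPOSITORY_OWNER` is an environment variable GitHub Actions sets, but the `gh-org/<name>` token format and both helper names are hypothetical, not something the action does today.

```python
import os
import re


def build_suffix(env):
    """Return a User-Agent suffix naming the GitHub org, or "" if unknown."""
    owner = env.get("GITHUB_REPOSITORY_OWNER", "")
    # Keep only characters that are safe inside a User-Agent token.
    owner = re.sub(r"[^A-Za-z0-9._-]", "", owner)
    # "gh-org/" is a made-up prefix used for illustration only.
    return f"gh-org/{owner}" if owner else ""


def nix_conf_line(suffix):
    """Render the nix.conf line, or "" when there is nothing to add."""
    return f"user-agent-suffix = {suffix}\n" if suffix else ""


if __name__ == "__main__":
    print(nix_conf_line(build_suffix(os.environ)), end="")
```

Outside of GitHub Actions the variable is unset, so the sketch emits nothing and the default User-Agent stays untouched.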

One of the questions that came up is: do you think it’s a fair trade to capture which GitHub org is accessing the cache, knowing that we only keep the data for a limited amount of time and that it’s not available to the public? This could be useful information for the foundation when approaching companies about funding.

peterhoeg · November 3, 2023, 9:00am

In principle that’s great, but I think you will get a lot of push-back from privacy-concerned individuals and you might be hitting some GDPR issues as well.

It looks to me like one of those things that is very likely to blow up into a big issue.

zimbatm · November 3, 2023, 10:53am

You’re probably right.

Probably a better middle-ground would be to make it:

Opt-in for private repos
Opt-out for public ones

Technically speaking, the information is already public on public repos; it’s just more challenging for us to connect the dots.
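
As a sketch of that middle ground, the decision could look like the following. The function name and the three-state `user_choice` parameter are hypothetical; how the action would actually learn the repo visibility or the user's choice is left open.

```python
def should_report(is_private, user_choice=None):
    """Decide whether to attach the org metadata.

    user_choice is True (explicitly enabled), False (explicitly disabled)
    or None (the user did not configure anything).
    """
    if user_choice is not None:
        # An explicit choice always wins, in either direction.
        return user_choice
    # Nothing configured: private repos stay silent (opt-in),
    # public repos report by default (opt-out).
    return not is_private
```

With nothing configured, `should_report(True)` is `False` and `should_report(False)` is `True`.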

iFreilicht · November 3, 2023, 10:57am

I agree with @peterhoeg, you have to be mindful about how to handle this.

Something you could do for example is have unique IDs that you only use to accumulate the number of times one server/pc/whatever has hit the cache, but never associate them with identifiable information. This might give you some way of better understanding how much traffic an average user generates, how big the percentage of high-traffic users is etc.
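
Roughly what I have in mind, as a sketch only. The ID is a random value that is not derived from anything about the machine, and the analysis side only ever looks at aggregates. All names here are made up for illustration.

```python
import uuid
from collections import Counter


def new_client_id():
    """Random ID generated once per machine; derived from nothing identifying."""
    return uuid.uuid4().hex


def summarize(log_ids, heavy_threshold):
    """Count hits per ID and report only aggregate statistics."""
    hits = Counter(log_ids)
    if not hits:
        return {"clients": 0, "mean_hits": 0.0, "heavy_share": 0.0}
    # Clients at or above the threshold count as "high-traffic".
    heavy = sum(1 for n in hits.values() if n >= heavy_threshold)
    return {
        "clients": len(hits),
        "mean_hits": sum(hits.values()) / len(hits),
        "heavy_share": heavy / len(hits),
    }
```

The summary answers "how much traffic does an average user generate" and "what share are high-traffic users" without the IDs ever leaving the counting step.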

The problem is that when you store that data, you are able to misuse it, and as you can potentially combine it with stuff like the IP, which can be considered personally identifiable information, you probably have to have users opt-in or accept a privacy policy or something like that.

That seems like a lot of potential legal trouble.

RaitoBezarius · November 3, 2023, 1:17pm

Let’s not go too fast about GDPR as a potential blocker.

The only hurdle to clear is lawful processing, cf. Art. 6 GDPR – Lawfulness of processing - General Data Protection Regulation (GDPR).

One way to see why data about who is accessing the cache can fall under lawful processing is the concept of “legitimate interest”, in particular:

processing is necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data, in particular where the data subject is a child.

Knowing who is using the cache, which is a public good of the project, is important to be able to ensure fair access to the cache.

There are two types of orgs on GitHub: individuals and entities.

As a reminder: Do the data protection rules apply to data about a company? - European Commission (answer: no.)

In fact, we do not care about cache usage from individuals per se, we only care about legal entities that are not individuals.

With this information, I believe what is important is that, if it comes to pass:

NixOS Foundation (or whoever is behind the collection) declares whether this data is collected on an opt-in or opt-out basis. I suggest opt-out because otherwise we obviously won’t get any data in general.
NixOS Foundation publishes a blurb that demonstrates why there is a legitimate interest in knowing who downloads from the cache in a more fine-grained fashion, explaining all the ways to disable such data collection, what will happen to that data, and when it will disappear if it gets collected someday.
It’s important to define the granularity of the collected data: do we want <entity>: <total downloaded>, or something more fine-grained such as <entity>: <download of month M>? Ideally, all of that data should always aim to be collapsed into coarse-grained metrics.
I would suggest trying to enable this by default only for organizations rather than individual accounts on GitHub first, then waiting some months and seeing whether it makes sense to enable it for individuals at all, based on the collected information (we can deduce the individual footprint from total footprint - org footprint anyway).
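
To illustrate the two granularities and the "total - org" deduction, here is a small sketch. The record shape `(entity, date, bytes)` and the function names are assumptions for the example; the real Fastly logs will look different.

```python
from collections import defaultdict


def aggregate(records, by_month=False):
    """Collapse records of (entity, "YYYY-MM-DD", bytes_downloaded).

    by_month=False gives <entity>: <total downloaded>;
    by_month=True gives (<entity>, "YYYY-MM"): <download of that month>.
    """
    totals = defaultdict(int)
    for entity, day, nbytes in records:
        key = (entity, day[:7]) if by_month else entity
        totals[key] += nbytes
    return dict(totals)


def individual_footprint(total_bytes, org_totals):
    """Traffic not attributed to any org: total minus the sum over orgs."""
    return total_bytes - sum(org_totals.values())
```

Only the collapsed totals would need to be kept; the per-request records can be dropped once they are aggregated.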

If we end up enabling it only for “legal entities”, we should strive to manage GDPR like we would for individuals because they may end up being individuals anyway.

As for privacy-concerned individuals: if they care about the privacy of their CI, there are many more tools at play that perform similar “phone home” procedures. Ideally, we would have one environment variable to rule any kind of “phone home” mechanism, that is: https://do-not-track.dev/. We must honor this environment variable.
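
Honoring it could be as small as the check below. This is a sketch under two assumptions: that the variable is named `DO_NOT_TRACK`, and that any value other than empty, `0` or `false` means "do not track". Both should be checked against the linked page before anyone relies on this.

```python
import os


def tracking_disabled(env=None):
    """True when the user has asked not to be tracked."""
    env = os.environ if env is None else env
    # Assumption: the variable is called DO_NOT_TRACK.
    value = env.get("DO_NOT_TRACK", "").strip().lower()
    # Assumption: empty, "0" and "false" mean tracking is allowed.
    return value not in ("", "0", "false")
```

Any code that adds metadata to the User-Agent would call this first and skip the extra metadata when it returns `True`.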

iFreilicht · November 3, 2023, 1:59pm

Right, if it’s possible to differentiate between companies and individuals, activating it only for companies is not as much of an issue.

Also, I didn’t consider that this is exclusively about the cachix GitHub action, not about Nix or NixOS. That also means that even if the IDs were linked to IPs, timestamps or whatever else, that would only identify the GitHub server the action was running on, not any individual person.

This might be something to emphasize in the communication about this change.