This is a follow-up to the NixOS S3 Short Term Resolution! situation.
We held an ad-hoc meeting following the garbage-collection discussion we wanted to avoid in Garbage-collect cache.nixos.org · Issue #282 · NixOS/nixos-org-configurations · GitHub.
Here are the meeting notes: NixOS Cache GC Meeting - HedgeDoc, with as much detail as I could gather.
@zimbatm spent some time giving trusted community members read access to the S3 bucket, to unblock the data analysis needed to understand usage patterns.

There is also an older copy of the Hydra database, scrubbed of PII, which I was entrusted with and have shared with multiple contributors for various purposes.

As a reminder, we currently see ~400 GB of narinfos and ~500 TB of NARs.

Multiple initiatives are underway:
- Self-hosting bare-metal copies of the cache server outside of AWS, by @RaitoBezarius; stalled for lack of budget to procure a reasonable cache server.
- Moving to Cloudflare R2, which @ron can handle if necessary
- Deduplication system, led by @flokli and @edef, ongoing
- Garbage collection efforts, led by @edolstra, ongoing
Other initiatives may be happening; if you are working on one, please let me know its status and I will include it in the next meeting's update.
The bucket's growth per year is as follows:
```
┌─year─┬──TiB─┐
│ 2013 │  0.5 │
│ 2014 │  0.3 │
│ 2015 │  2.6 │
│ 2016 │ 30.9 │
│ 2017 │   43 │
│ 2018 │ 42.4 │
│ 2019 │ 51.7 │
│ 2020 │ 53.1 │
│ 2021 │ 66.6 │
│ 2022 │ 89.9 │
│ 2023 │ 90.2 │
└──────┴──────┘
```
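For anyone joining the data-analysis effort, per-year figures like the table above can in principle be reproduced from a bucket listing (e.g. an S3 Inventory export). A minimal sketch, assuming a hypothetical CSV layout of key, size in bytes, last-modified date — the column layout and file names here are made up for illustration:

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical inventory-style listing (key, size in bytes, date).
# Real S3 Inventory exports have a different, configurable schema.
listing = StringIO("""\
nar/abc.nar.xz,1073741824,2016-03-14
nar/def.nar.xz,536870912,2016-07-01
nar/ghi.nar.xz,2147483648,2017-01-20
""")

per_year_tib = defaultdict(float)
for key, size, date in csv.reader(listing):
    year = date[:4]
    per_year_tib[year] += int(size) / 2**40  # bytes -> TiB

for year in sorted(per_year_tib):
    print(f"{year}: {per_year_tib[year]:.6f} TiB")
```

The same grouping trivially extends to the "files bigger than 200 MB, bucketed per year" query by adding a size filter before accumulating.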
We will schedule a meeting for next week, on 2023-10-31 at 4 PM CEST.

By next week, we will answer the following (data) queries:
- Can narinfo URLs be fully qualified?
- Find out whether we can have a fallback for CDN 404s with no additional delay.
- Amount of bandwidth to S3, and amount of bandwidth spent on "dead" store paths
- Number of queries per second from Fastly to S3
- How many files bigger than 200 MB are there, bucketed per year?
- Mapping of channel bumps to store paths (and their sizes)
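The first query can be checked mechanically by extracting the URL field from each narinfo and testing whether it carries a scheme. A minimal sketch — the hash and sizes below are made up, but the field layout matches the narinfo format served by cache.nixos.org:

```python
from urllib.parse import urlparse

# Example narinfo; today the URL field is relative to the cache root.
narinfo = """\
StorePath: /nix/store/0000000000000000000000000000000000-example-1.0
URL: nar/0000000000000000000000000000000000000000000000000000.nar.xz
Compression: xz
FileSize: 1234
NarSize: 5678
"""

def narinfo_url(text: str) -> str:
    """Extract the URL field from a narinfo document."""
    for line in text.splitlines():
        if line.startswith("URL: "):
            return line[len("URL: "):]
    raise ValueError("no URL field")

def is_fully_qualified(url: str) -> bool:
    """A fully qualified URL carries a scheme (and normally a host)."""
    return urlparse(url).scheme != ""

url = narinfo_url(narinfo)
print(url, "fully qualified:", is_fully_qualified(url))
```

Running this over a sample of the ~400 GB of narinfos would tell us how uniform the URL fields are before any rewriting decision.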
@zimbatm will work on an EC2 instance and an S3 bucket to offer people working on this a development environment for processing the data without incurring too many costs.
What I gather is the following:
@edolstra is open to the idea of an archival binary cache, which could make use of the deduplication technology @flokli and @edef are working on. Until it is proven to have adequate performance characteristics, it is not meant to be used for the main binary cache. That said, it opens the question of wiring the archival binary cache behind the CDN as a fallback for 404s, which would offer good performance to end-users.
@zimbatm reminded us that we aim to reduce the binary cache size by the end of December; if we cannot land a nicer solution by then, we will have to delete data. An alternative is to clean up only files bigger than 200 MB, which may give us enough breathing room.
@edolstra will not delete anything before a plan is announced.
We are all curious and need to collect the reasons people need historical data, and whether there are concrete use cases for it (beyond the ones already mentioned in the meeting notes).
As the growth table above shows, the same question will eventually arise for more recent data; @vcunat proposed exploring moving data to archival storage on a rolling two-year basis, if we find that the technology works reasonably well.