This is a follow-up to the NixOS S3 Short Term Resolution! situation.
We held an ad-hoc meeting following the garbage-collection discussion we wanted to avoid in Garbage-collect cache.nixos.org · Issue #282 · NixOS/nixos-org-configurations · GitHub.
Here are the meeting notes: NixOS Cache GC Meeting - HedgeDoc, with as much detail as I could gather.
@zimbatm spent some time giving trusted community members read access to the S3 bucket, to unblock the data analysis needed to understand usage patterns.

There is also an older copy of the Hydra database, scrubbed of PII, which I was entrusted with and have shared with multiple contributors for various purposes.

As a reminder, we currently see ~400 GB of narinfos and ~500 TB of NARs.

Multiple initiatives are underway:
- Self-hosting bare-metal copies of the cache server outside of AWS, by @RaitoBezarius; stalled for lack of budget to procure a reasonable cache server.
- Moving to Cloudflare R2, which @ron can handle if necessary
- Deduplication system, led by @flokli and @edef, ongoing
- Garbage collection efforts, led by @edolstra, ongoing
Other initiatives may be happening; if you are working on one, please let me know its status and I will include it in the next meeting's update.
The bucket's growth per year is as follows:
```
┌─year─┬──TiB─┐
│ 2013 │  0.5 │
│ 2014 │  0.3 │
│ 2015 │  2.6 │
│ 2016 │ 30.9 │
│ 2017 │   43 │
│ 2018 │ 42.4 │
│ 2019 │ 51.7 │
│ 2020 │ 53.1 │
│ 2021 │ 66.6 │
│ 2022 │ 89.9 │
│ 2023 │ 90.2 │
└──────┴──────┘
```
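For anyone joining the data-analysis effort, per-year figures like the table above can in principle be reproduced from a bucket listing (e.g. an S3 Inventory export). A minimal sketch, assuming a hypothetical CSV layout of key, size in bytes, last-modified date — the column layout and file names here are made up for illustration:

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical inventory-style listing (key, size in bytes, date).
# Real S3 Inventory exports have a different, configurable schema.
listing = StringIO("""\
nar/abc.nar.xz,1073741824,2016-03-14
nar/def.nar.xz,536870912,2016-07-01
nar/ghi.nar.xz,2147483648,2017-01-20
""")

per_year_tib = defaultdict(float)
for key, size, date in csv.reader(listing):
    year = date[:4]
    per_year_tib[year] += int(size) / 2**40  # bytes -> TiB

for year in sorted(per_year_tib):
    print(f"{year}: {per_year_tib[year]:.6f} TiB")
```

The same grouping trivially extends to the "files bigger than 200 MB, bucketed per year" query by adding a size filter before accumulating.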
We will schedule a meeting for next week, on 2023-10-31 at 4 PM CEST.

By next week, we will answer the following (data) queries:
- Can narinfo URLs be fully qualified?
- Find out whether we can have a fallback for CDN 404s with no additional delay.
- Amount of bandwidth to S3, and amount of bandwidth spent on "dead" store paths
- Number of queries per second from Fastly to S3
- How many files bigger than 200 MB are there, bucketed per year?
- Mapping of channel bumps to store paths (and their sizes)
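The first query can be checked mechanically by extracting the URL field from each narinfo and testing whether it carries a scheme. A minimal sketch — the hash and sizes below are made up, but the field layout matches the narinfo format served by cache.nixos.org:

```python
from urllib.parse import urlparse

# Example narinfo; today the URL field is relative to the cache root.
narinfo = """\
StorePath: /nix/store/0000000000000000000000000000000000-example-1.0
URL: nar/0000000000000000000000000000000000000000000000000000.nar.xz
Compression: xz
FileSize: 1234
NarSize: 5678
"""

def narinfo_url(text: str) -> str:
    """Extract the URL field from a narinfo document."""
    for line in text.splitlines():
        if line.startswith("URL: "):
            return line[len("URL: "):]
    raise ValueError("no URL field")

def is_fully_qualified(url: str) -> bool:
    """A fully qualified URL carries a scheme (and normally a host)."""
    return urlparse(url).scheme != ""

url = narinfo_url(narinfo)
print(url, "fully qualified:", is_fully_qualified(url))
```

Running this over a sample of the ~400 GB of narinfos would tell us how uniform the URL fields are before any rewriting decision.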
@zimbatm will work on an EC2 instance and an S3 bucket to offer people working on this a development environment for processing the data without incurring too many costs.
What I gather is the following:
@edolstra is open to the idea of an archival binary cache, which could make use of the deduplication technology @flokli and @edef are working on. Until it is proven to have adequate performance characteristics, it is not meant to be used for the main binary cache. That said, it opens the question of wiring the archival binary cache behind the CDN as a fallback for 404s, which would offer good performance to end-users.
@zimbatm reminded us that we aim to reduce the binary cache size by the end of December; if we cannot land a nicer solution by then, we will have to delete data. An alternative is to clean up only files bigger than 200 MB, which may give us enough breathing room.
@edolstra will not delete anything before a plan is announced.
We are all curious and need to collect the reasons people need historical data, and whether there are concrete use cases for it (beyond the ones already mentioned in the meeting notes).
As the growth table above shows, the same question will eventually arise for more recent data; @vcunat proposed exploring moving data to archival storage on a rolling two-year basis, if we find that the technology works reasonably well.