2024-07-10 Long-term S3 cache solutions meeting minutes #7

I’ve spent time with @edef today to help everyone keep tabs on what’s happening. Here are some notes.

  • Very good news: got read-through caching to work on the test bucket after some initial hiccups
  • Added latency is negligible, on the order of 1-2 ms, probably dominated by the handshake
  • Side notes:
    • AWS logs effectively show which software is used, and how much (a rough log-parsing sketch follows at the end of these notes)
      • we could in theory use that to optimise maintenance efforts
      • slightly tricky to work with that sensitive data (IP addresses), but can be done
      • we even have a machine in there for just that
      • right now we’re doing remarkably little with the huge amount of data we have at our disposal
    • currently the “data/archivists” team is just @edef
      • mostly figuring out which questions to ask
      • lots of time goes into data cleaning
      • trying to analyse what we can do with the cache data
    • there’s a difficulty with mapping store path hashes to packages
      • if we have the hash in the store, we get a narinfo (see the narinfo lookup sketch at the end of these notes)
        • there’s a narinfo dump tool from last year
      • otherwise all we get is a store path
      • only a few % of all store derivations are cached
        • it’s not clear what the criterion for keeping them is, or whether we started saving all of them at some point
        • ~430k drvs as of end 2023, but 200M store paths
    • since recently we’ve been collecting very granular, long-term AWS cost data
      • there’s something to be gleaned from that for sure
      • e.g. we only serve ~1 TB/mo of traffic from the bucket directly, costing a bit under $100/mo (rough arithmetic check at the end of these notes)
  • since Tigris claims they copy asynchronously, this would mean we’d initially serve each object twice
  • the front-end for all this is Fastly
  • when moving the cache, we’d likely break the Tsinghua University cache replication mechanism
  • Next steps:
    • @edolstra @ron: we need a credit card to pay for the Tigris account
      • 5GB of free allowance, but that is obviously too little
    • ideally we’d not hit S3 for the 404 path
      • need to serve 404s very fast
      • currently we’re serving these from S3, which is bad
        • we should be able to do a lot better
        • there’s only 5 GiB of data required to answer whether to 404 (see the 404 fast-path sketch at the end of these notes)
        • we’re paying S3 for those requests, but fairly little, so cost is a secondary concern
      • narinfo is on the critical path for end-user experience
      • this is an optimisation for later, though
    • and we don’t want to hit Tigris with the narinfo workload yet
    • have to think about costs of uploading to Glacier
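
To make the log-analysis idea above a bit more concrete: a minimal sketch that tallies user agents from S3 server access logs. It assumes the standard space-delimited access-log format (where the request URI, referrer and user agent are the quoted fields) and a hypothetical local directory of downloaded log files; it is not our actual tooling.

```python
#!/usr/bin/env python3
"""Sketch: tally user agents from S3 server access logs."""
import gzip
import shlex
import sys
from collections import Counter
from pathlib import Path

# In the standard log layout, after shlex-splitting a line the user agent is
# token 17: the bracketed timestamp splits into two tokens, and each quoted
# field (request URI, referrer, user agent) counts as one token.
USER_AGENT_FIELD = 17

def user_agents(log_dir: Path) -> Counter:
    counts: Counter = Counter()
    for path in log_dir.glob("*"):
        if not path.is_file():
            continue
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", errors="replace") as fh:
            for line in fh:
                try:
                    counts[shlex.split(line)[USER_AGENT_FIELD]] += 1
                except (ValueError, IndexError):
                    continue  # malformed or truncated line
    return counts

if __name__ == "__main__":
    for agent, n in user_agents(Path(sys.argv[1])).most_common(20):
        print(f"{n:>10}  {agent}")
```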
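On the store-path-to-package mapping: a minimal sketch of the hash → narinfo lookup, assuming the standard binary-cache layout where `<hash>.narinfo` sits at the cache root (cache.nixos.org here, but a test bucket works the same way). The StorePath/Deriver fields are what let us tie a hash back to a package, when a Deriver was recorded at all.

```python
#!/usr/bin/env python3
"""Sketch: fetch the narinfo for a store path (or bare hash)."""
import sys
import urllib.error
import urllib.request

CACHE = "https://cache.nixos.org"

def narinfo_for(store_path: str) -> str | None:
    # /nix/store/<32-char hash>-<name>  ->  <hash>
    hash_part = store_path.removeprefix("/nix/store/").split("-", 1)[0]
    try:
        with urllib.request.urlopen(f"{CACHE}/{hash_part}.narinfo") as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # not in the cache
        raise

if __name__ == "__main__":
    info = narinfo_for(sys.argv[1])
    if info is None:
        print("not cached")
    else:
        for line in info.splitlines():
            # StorePath/Deriver are the fields that map the hash back to a
            # package / derivation.
            if line.startswith(("StorePath:", "Deriver:")):
                print(line)
```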
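A quick sanity check of the traffic cost figure, assuming the usual S3 internet egress price of roughly $0.09/GB for the first 10 TB and ignoring request charges (the price is an assumption, not taken from our bill):

```python
# ~1 TB/mo served directly from the bucket at an assumed ~$0.09/GB egress
print(f"~${1 * 1024 * 0.09:.0f}/mo")  # ~$92/mo, i.e. "a bit under $100/mo"
```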
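And for the 404 fast path: a minimal sketch of answering “does this narinfo exist?” from memory instead of forwarding the miss to S3. The hash-dump file and the integration point with Fastly are hypothetical; a Bloom filter could shrink the footprint further at the cost of rare false positives.

```python
#!/usr/bin/env python3
"""Sketch: answer narinfo existence checks without touching S3."""
from pathlib import Path

class NarinfoIndex:
    def __init__(self, hash_file: Path) -> None:
        # One 32-character store-path hash per line, e.g. produced by the
        # narinfo dump tool mentioned above (file name is hypothetical).
        self._hashes = frozenset(
            line.strip() for line in hash_file.read_text().splitlines() if line.strip()
        )

    def exists(self, request_path: str) -> bool:
        name = request_path.lstrip("/")
        if not name.endswith(".narinfo"):
            return True  # only short-circuit narinfo requests
        return name.removesuffix(".narinfo") in self._hashes

# Usage:
#   index = NarinfoIndex(Path("narinfo-hashes.txt"))
#   if not index.exists(request_path): answer 404 immediately, skip S3
```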