Celler: An Attic Fork

tl;dr: Celler is a fork of the Attic binary cache that aims to continue Attic's stalled development (7 months without activity).

Why Fork Attic?

Here at Cyberus Technology, we tried to set up a large Attic instance for CTRL-OS last year. We eventually switched back to a simpler solution. While much of the blame goes to the underlying blob storage provider (cough Hetzner cough), we noticed that for a production setup, there are some crucial features that are lacking. Most notably for our case, there is no good way to troubleshoot an Attic instance beyond instrumenting the code. Or to put it in more economic terms, while Attic roughly saves us 60% of storage costs, it also massively increases our maintenance costs. Not a win!

So while production is happily chugging along on a plain old S3 bucket for now, I would really love to switch back to Attic, both for the storage savings and because it offers a nice way to have per-developer or per-project caches on our own infrastructure. But upstream Attic development has stalled. So instead of letting our changes sit in PRs without getting merged, there is now Celler. (“Celler” is Catalan for wine cellar, but pronounce it as you please. :wine_glass:)

Our Plans

So what are the plans for Celler? Initially, I have been working on updating all dependencies. This was harder than planned, because Attic uses C++ FFI bindings to libnixstore, and these bindings would subtly break on updates. The first step was therefore getting rid of them. Instead, we now use native Rust code to talk to the Nix daemon (via gorgon’s nix-daemon crate) where possible and shell out to the Nix CLI where necessary.

With the updates largely done, we can look towards features that are crucial for production setups. My most important goals are to add proper (error) logging and metrics to the code base. Anyone operating Celler should see in their ops dashboard how things are going.

A bit more in the future, I’m looking towards OIDC integration. Using sudo on the server to create user credentials is not a workflow that scales!

What Are Your Ideas?

I’m genuinely curious what other people are missing from Attic. What features would you love to have? Which PR from the attic repository deserves to be merged?

Is this a drop in replacement at this time, or has the db schema changed?

I see this PR for loading tokens from files instead of environment variables. I would find this useful if it hasn’t been addressed yet.

I don’t think a fork of an attic is a celler. Garret would be a much better name given the project concept.

Can we have Celler expose a gRPC API for a generic CAS (content addressable storage) for other remote building / CI/CD systems? :slight_smile:

  • Especially long-lived JWTs are a footgun and Attic provides little remediation short of rotating signing keys. OIDC and short-lived tokens are the correct approach here. That would also allow integrating with Forgejo (Details), if I understand correctly.
  • We can inspect caches, if we know their name. Why can’t we list existing caches?
  • Fix the horrible CLI naming (attic is fine, atticd-atticadm isn’t)

I very much appreciate work going into this code base.

GC pinning (i.e. the software is released, so don’t purge it without pulling the pin), and potentially look at options to drop the database entirely. I mean, at that point it’s probably a rewrite, but I’d love to see a design that stores bulk metadata updates as a WAL in the blob store (with eventual compaction into an index), such that you can have a stateless cache server and a blob store and you are golden. I know S3 can do this safely with conditional writes; I just don’t know whether other S3-compatible (but not S3) blob stores can do that kind of thing. Anyway, a good GC pinning story would be incredible, and if there were a pot I would try to throw money into it for that feature.

EDIT: I mean maybe you could do it without conditional writes, but I think the story gets much more complicated. I’d also be prepared to run 3 servers instead of one with RDS (i.e. way cheaper to run 3 t4g micros than one HA RDS, and just replicate txn metadata into raft log written back to s3), but now I’ve strayed into @bme’s fantasy cache server.
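
To make the conditional-write idea above concrete, here is a toy sketch of appending to a write-ahead log stored as individual blobs. Everything in it is an assumption for illustration: `BlobStore` is an in-memory stand-in rather than a real S3 client, `put_if_absent` models a "create only if the key does not exist yet" write, and the `wal/<sequence>` key layout is made up. It is not how Attic or Celler store metadata.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for a blob store that supports conditional writes.
/// (Hypothetical type, only for illustration.)
#[derive(Default)]
struct BlobStore {
    objects: BTreeMap<String, Vec<u8>>,
}

impl BlobStore {
    /// Write `body` under `key` only if the key does not exist yet.
    /// Returns `true` on success and `false` if another writer got there first.
    fn put_if_absent(&mut self, key: &str, body: &[u8]) -> bool {
        if self.objects.contains_key(key) {
            return false;
        }
        self.objects.insert(key.to_owned(), body.to_vec());
        true
    }
}

/// Zero-padded keys so that lexicographic order equals sequence order.
fn wal_key(seq: u64) -> String {
    format!("wal/{seq:020}")
}

/// Append a record, starting at the sequence number the writer believes is
/// next. A failed conditional write means the slot is taken, so the writer
/// moves on to the following slot. Returns the sequence number actually used.
fn append(store: &mut BlobStore, mut next_seq: u64, record: &[u8]) -> u64 {
    loop {
        if store.put_if_absent(&wal_key(next_seq), record) {
            return next_seq;
        }
        next_seq += 1;
    }
}

/// Read all log records back in sequence order.
fn replay(store: &BlobStore) -> Vec<&[u8]> {
    store
        .objects
        .iter()
        .filter(|(key, _)| key.starts_with("wal/"))
        .map(|(_, body)| body.as_slice())
        .collect()
}
```

Two stateless writers that both believe slot 0 is free cannot overwrite each other here: the second conditional write fails and that writer lands on slot 1. Compaction into an index and the behaviour of non-S3 stores are left out entirely.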

Glad someone picked up the work.

I noticed that Prometheus was planned. I would go further (and simplify at the same time) and would suggest OpenTelemetry support instead! As it can be configured to be compatible with Prometheus exporters it should be the best of both worlds, while giving way more observability to the application.

What concrete problem are you looking to solve?

Consider generic CI/CD systems or local build systems that want to push artifacts directly to the cache, not necessarily Nix binary cache artifacts.

If Celler is this multi-tenant generic system for content-addressable artifacts, with a touch of Nix binary cache, we can reuse it as an interface in many other tools without locking ourselves into the HTTP Nix binary cache protocol, which is very naive.

The database schema is still identical. So you can migrate by pointing celler to your attic database. This is not tested, so please do so at your own risk. I’ve opened this ticket to document the process: Migration Guide for Attic · Issue #19 · blitz/celler · GitHub

Loading secrets via systemd LoadCredential= and friends would be really neat. It’s not done yet. I’ve opened: Use systemd credentials · Issue #18 · blitz/celler · GitHub
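
For readers unfamiliar with the mechanism: systemd places each loaded credential as a file in a directory whose path it exports as `$CREDENTIALS_DIRECTORY`. A minimal sketch of reading a secret that way could look like the following. The function names and the credential name `token-secret` are hypothetical, and none of this is existing Celler code.

```rust
use std::env;
use std::fs;
use std::io;
use std::path::Path;

/// Read the credential `name` from an explicit credentials directory.
/// Trailing whitespace (such as a final newline) is stripped.
fn read_credential_from(dir: &Path, name: &str) -> io::Result<String> {
    let raw = fs::read_to_string(dir.join(name))?;
    Ok(raw.trim_end().to_owned())
}

/// Read the credential `name` from the directory systemd exports in
/// `$CREDENTIALS_DIRECTORY`. Fails if the variable is not set, for example
/// when the process was not started by systemd with credentials configured.
fn read_credential(name: &str) -> io::Result<String> {
    let dir = env::var_os("CREDENTIALS_DIRECTORY").ok_or_else(|| {
        io::Error::new(io::ErrorKind::NotFound, "CREDENTIALS_DIRECTORY is not set")
    })?;
    read_credential_from(Path::new(&dir), name)
}
```

With a unit that declares a credential named `token-secret`, a call such as `read_credential("token-secret")` would return the secret without it ever appearing in the environment or on the command line.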

Totally agree with the long-lived JWTs. The way this is currently handled also leaves them in your shell history, making this extra bad.

Regarding listing caches, I’ve opened: Allow listing existing caches · Issue #20 · blitz/celler · GitHub

atticd-atticadm (and cellerd-celleradm) is just the result of lacking proper authentication support and should go away in this form once OIDC is supported.

I like the idea of adding GC roots. I’ve tracked this here: Per-Cache Garbage Collection Roots · Issue #21 · blitz/celler · GitHub

Regarding dropping the RDS from Celler: it sounds interesting, but that would require rearchitecting the codebase. The nice thing about the RDS approach is its conceptual simplicity.

That sounds like a brainstorming session in waiting. Let’s get back to that when we have an opportunity to chat.

I’m wondering if Celler should join forces with ncps (kalbasit/ncps on GitHub), a Nix binary cache proxy service with local caching and signing. I use both (well, ncps and Attic).

I think there is some feature overlap.

Support for uploading and fetching build logs would be useful.

Thank you for picking that up!
Could you please also release a Docker image, like the one for Attic (Package attic · GitHub)?

Thank you so much for the fork! I guess the law of open source never fails: if you have a problem with any given program, as long as you wait long enough the problem will solve itself :grinning_face_with_smiling_eyes:

I just migrated my Attic deployment to Celler without having to change anything or move any data. So far it works perfectly!

If you’re looking for a binary cache with GC and OIDC support, niks3 (Mic92/niks3 on GitHub), an S3-backed Nix binary cache with garbage collection, also exists. We found that uploads are much faster than anything else out there. It still leans on S3 for the read path, so niks3 itself doesn’t become a single point of contention there. It also doesn’t try to do de-duplication.

I’m glad I’m not the only one who thought a fork was in order. For me, I have two pain points with Attic:

First, it fails on uploading exceptionally large NARs. I have one that is about 120GB (it is a Wikipedia English Zim file, so it is a single file at that size). That one causes the client to crash, because it was reading the entire file into memory at 2 or 3 separate points during the upload, and my build machines do NOT have 360GB of RAM. I maintain a patch for that on the client side, applied as an override to my nixpkgs.
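
The general remedy for this class of problem is to stream the data through a fixed-size buffer instead of loading it whole. The sketch below shows that pattern with plain `std::io` traits. It is an illustration only, not the patch mentioned above and not Attic's client code, and `stream_copy` is a made-up name.

```rust
use std::io::{self, Read, Write};

/// Copy everything from `src` to `dst` through one reusable buffer, so peak
/// memory use is bounded by `buf_size` no matter how large the input is.
/// Returns the number of bytes copied.
fn stream_copy<R: Read, W: Write>(mut src: R, mut dst: W, buf_size: usize) -> io::Result<u64> {
    assert!(buf_size > 0, "buffer size must be non-zero");
    let mut buf = vec![0u8; buf_size];
    let mut total = 0u64;
    loop {
        let n = match src.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            // A read interrupted by a signal is retried.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        dst.write_all(&buf[..n])?;
        total += n as u64;
    }
    dst.flush()?;
    Ok(total)
}
```

Here `src` could be an open file and `dst` a network connection or a compressor. A 120GB file then needs only `buf_size` bytes of buffer at this stage.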

My second pain is that network speeds are dog slow. Uploads can get up to about 20MB/s tops and downloads struggle to stay at 10MB/s. This is on hardware that routinely transfers other files at well over 200MB/s using other protocols and is highly capable: server side is a Xeon v4, 128GB of RAM, backed with an 11-wide ZFS SAS array with a 10Gb connection and client is an i7 155H, 64GB of RAM, 2.5Gb ethernet, saving to an NVMe drive. Both hooked to the same switch with negligible latency. NFS, SMB, and Minio transfers are hella-fast between the two systems, but Attic is painfully slow.

Hi Greg!

I also observed performance issues. Out of curiosity, do you have chunking enabled? If so, a quick performance fix is probably to disable it or increase the thresholds to create much larger chunks. The suggested settings for chunking are way too small for good performance, especially with S3 as your backend, but likely also with local disks.
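
A quick back-of-the-envelope calculation shows why chunk size matters so much. The sketch assumes that each chunk ends up as its own object in the storage backend and that chunks are close to the configured average size. The 64 KiB and 16 MiB figures are illustrative values, not a statement about Attic's or Celler's defaults.

```rust
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;
const GIB: u64 = 1024 * MIB;

/// Rough number of chunks (and thus stored objects) for one NAR, assuming
/// every chunk is close to the configured average chunk size.
fn approx_chunks(nar_bytes: u64, avg_chunk_bytes: u64) -> u64 {
    assert!(avg_chunk_bytes > 0, "average chunk size must be non-zero");
    // Integer division rounded up.
    (nar_bytes + avg_chunk_bytes - 1) / avg_chunk_bytes
}
```

Under these assumptions, `approx_chunks(GIB, 64 * KIB)` is 16384 while `approx_chunks(GIB, 16 * MIB)` is 64. With a per-request latency in the tens of milliseconds, the difference in object count quickly dominates transfer time.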

That being said, the current focus for me is not performance optimization. My motivation right now is to get the infrastructure in place (logging, metrics) to get meaningful insights into the current performance of the celler instance. But I’m happy to look at and merge non-intrusive performance fixes.

Uploading large files should be possible. I opened an issue to investigate: Memory Use Issues for Large Store Paths · Issue #45 · blitz/celler · GitHub