Colmena deployment extremely slow

I’ve been trying to set up a deployment of my app to a DigitalOcean Droplet using colmena.
I set up my flake as described in the manual.
After executing colmena apply everything builds fine locally, but once the deployment reaches the stage where it copies files to the remote /nix/store/..., everything slows to an unacceptable crawl (after 5 hours it still hasn't deployed a small application).

It looks like this:

And then it more or less gets stuck copying some of the packages. To speed this up I cancel the deployment, ssh into the machine, run nix-store --repair-path /nix/store/<package>, and restart the deployment, but for hundreds of packages this is annoying at best.

Surprisingly, colmena has an option, deployment.buildOnTarget, and switching it to true speeds up the transfer so much that the deployment completes within 10 minutes.

Has anyone seen anything like this before?

Regarding the flake I use: it's pretty long, so below is a truncated version. Happy to post more if that would be helpful.

{
  inputs = {
    nixpkgs.url = "nixpkgs/nixos-23.11";    
    flake-parts.url = "github:hercules-ci/flake-parts";
    haskell-flake.url = "github:srid/haskell-flake";
  };

  outputs = inputs@{ self, nixpkgs, flake-parts, ... }:
    let
      system = "x86_64-linux";
      pkgs = nixpkgs.legacyPackages.${system};
    in
      flake-parts.lib.mkFlake { inherit inputs; } {
        systems = [ system ];
        imports = [
          inputs.haskell-flake.flakeModule
        ];

        flake = {
          colmena = {
            meta = {
              nixpkgs = import inputs.nixpkgs { inherit system; };
            };

            defaults = { pkgs, ... }: {
              environment.systemPackages = with pkgs; [
                curl sqlite htop neovim
              ];
              programs.zsh.enable = true;
            };

            digital-ocean-my-server = {
              imports = [ "${nixpkgs}/nixos/modules/virtualisation/digital-ocean-image.nix" ];
              
              system.stateVersion = "23.11";

              deployment = {
                targetHost = "<correct IP of my server>";
                targetPort = 22;
                # surprisingly, uncommenting this massively increases transfer speed
                #buildOnTarget = true;
              };

              time.timeZone = "Etc/UTC";

              networking = {
                hostName = "my-server";
                firewall = {
                  enable = true;
                  allowedTCPPorts = [ 22 80 443 ];
                  allowPing = true;
                };
              };
              # ... (rest of the configuration truncated)
            };
          };
        };
      };
}

I’m not at all sure if this is related but in my own deployment scripts I’ve found that if nix copy is called concurrently on one target ssh store, it will lock up. I’ve only had cases of calling 10 of those at the same time, maybe colmena does more and that leads to corruption. But I have no clue how colmena works internally.

(…) But I have no clue how colmena works internally.

I don't know that either, but I'd be happy to hear what your solution was in this case.

Oh sorry, I didn't mention it because it won't really help you: I used flock to limit the number of concurrent nix copy calls to one. So in my case terraform will dispatch them at the same time, but then n-1 will block, waiting for the one that won the race to finish.
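
A minimal sketch of the flock approach described above, assuming bash and the flock tool from util-linux. The function name run_serialized, the LOCK_DIR variable, and the root@my-server target in the usage comment are hypothetical and only for illustration; this is not the actual script from this thread, and it says nothing about how colmena works internally.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Directory for lock files (hypothetical default; override via LOCK_DIR).
LOCK_DIR="${LOCK_DIR:-/tmp}"

# run_serialized TARGET CMD [ARGS...]
# Runs CMD while holding an exclusive lock that is specific to TARGET.
# Concurrent callers for the same TARGET block until the lock is released,
# so at most one CMD per target runs at any time.
run_serialized() {
  local target="$1"
  shift
  # Replace characters that are awkward in file names with underscores.
  local lock_file="${LOCK_DIR}/nix-copy-${target//[^A-Za-z0-9_.-]/_}.lock"
  # flock creates the lock file if needed and releases the lock when CMD exits.
  flock "$lock_file" "$@"
}

# Usage (hypothetical host and store path, not run here):
# run_serialized my-server nix copy --to "ssh-ng://root@my-server" ./result
```

Because the lock is keyed on the target, copies to different machines can still run in parallel; only copies to the same machine queue up behind each other.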

I see, thanks for sharing this.

In the meantime I worked out one more thing: using an older version of colmena (0.3.2 in my case) works perfectly fine and copies all the files very quickly.

I noticed that it uses ssh://... instead of ssh-ng://... Is it possible that I misconfigured something on my side, which resulted in the very slow ssh-ng connection?

using flakes? what version of nix? i think this impacts what colmena uses IIRC…

Oh yeah, I'm also using ssh-ng, see terranix/lib/build_nixng_system.nix in ~magic_rb/dotfiles (master) on sourcehut. I wonder if ssh would work, though with flock it works perfectly fine.

Hi, yes, I'm using flakes, you're right.
Regarding the nix version:

> nix --version
nix (Nix) 2.18.2

I don't find that surprising. Without building the closure on the target host, you're building it on your local machine and then uploading it in its entirety from your local machine to the remote. You're always limited by this link, and your upload speed is almost certainly slower than that of cache.nixos.org, which is where most paths usually come from.

It shouldn’t be this slow unless you’re on some really whacky bamboo pipe or have horrendous peering issues on the way to that VPS though.
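
To put rough numbers on the upload argument, here is a small sketch that estimates how long pushing a closure takes at a given upload speed. The function name and the example figures are illustrative assumptions, not measurements from this thread.

```shell
#!/usr/bin/env bash
set -euo pipefail

# estimate_upload_seconds BYTES MBIT_PER_S
# Rough transfer time, in whole seconds, for BYTES over a link that
# sustains MBIT_PER_S megabits per second. Integer arithmetic only;
# ignores compression, latency, and protocol overhead.
estimate_upload_seconds() {
  local bytes="$1" mbit_per_s="$2"
  echo $(( bytes * 8 / (mbit_per_s * 1000000) ))
}

# Example with made-up numbers (not run here):
# a 1 GB closure over a 10 Mbit/s uplink.
# estimate_upload_seconds 1000000000 10   # prints 800
```

The closure size of a locally built system can be read with nix path-info -S ./result, assuming ./result is the build output. An estimate of minutes rather than hours would suggest the link itself is not the bottleneck.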

hmm i would have expected ssh-ng in that case
if you grep the code for ssh-ng i think you will easily find the logic to help you answer your question

Hang on, I’ve just realized that I create DigitalOcean NixOS machines using an image I had uploaded there ~2 years ago. And after opening an ssh connection I saw this:

my-server> # nix --version
nix (Nix) 2.3.16

I think we’ve got the reason.
@aanderse by any chance do you know where I can find a newer version of a DigitalOcean-compatible NixOS image?
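
A quick way to spot this kind of mismatch before deploying is to compare the Nix versions on both ends. The sketch below assumes bash and GNU sort (for sort -V); the helper names and the root@my-server host in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# parse_nix_version "nix (Nix) 2.18.2" prints "2.18.2"
# Takes the last space-separated field of the `nix --version` output.
parse_nix_version() {
  printf '%s\n' "${1##* }"
}

# version_lt A B
# Succeeds if version A sorts strictly before version B (via sort -V).
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Usage (hypothetical host, not run here):
# local_v=$(parse_nix_version "$(nix --version)")
# remote_v=$(parse_nix_version "$(ssh root@my-server nix --version)")
# if version_lt "$remote_v" "$local_v"; then
#   echo "remote Nix ($remote_v) is older than local Nix ($local_v)"
# fi
```

With the versions from this thread, version_lt 2.3.16 2.18.2 succeeds, which flags the remote as the older side.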

@justinas wrote a great article about digital ocean which includes generating your own image

there is also nixos-infect which is an option that works well on digital ocean

… and i suppose i should mention my own colmena related project teraflops which works well with digital ocean too

Thanks @aanderse . I’ll take a look at these later this week.
Regarding the article you mentioned, I had actually seen it before, but the image generation failed for me.
Looks like there’s still a lot for me to learn about nix.

good to know - do you mind filing an issue about this with the details required? not great for us to have this sort of bug…

Done: Error while creating a DigitalOcean image file · Issue #322588 · NixOS/nixpkgs · GitHub