Podman/Docker in a NixOS container (ideally an unprivileged one)?

I’m trying to run Docker (or Podman) in a NixOS container and I’m wondering whether someone has achieved that.

For now I’m able to run them in a privileged container if I manually remount /sys/fs/cgroup as read-write, and I can get the docker daemon running in an unprivileged container, though runc fails:

docker:

[root@docker:~]# docker run --rm hello-world
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to apply cgroup configuration: mkdir /sys/fs/cgroup/system.slice/docker-134a627882acf5b82da421ed2fb29c4de9cd9a9d737972967e0e32bad8be4076.scope: read-only file system:unknown.

podman:

[root@docker:~]# podman run --rm hello-world
WARN[0000] Failed to add conmon to systemd sandbox cgroup: Permission denied
Error: OCI runtime error: crun: clone: Invalid argument

I guess the problem is the same, a read-only cgroup filesystem, though in an unprivileged container I can’t even remount it manually.
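For reference, this is the manual remount I mean in the privileged case (just a sketch; it only succeeds while the container keeps CAP_SYS_ADMIN):

# check how the cgroup2 hierarchy is mounted, then flip it to read-write
grep cgroup2 /proc/mounts
mount -o remount,rw /sys/fs/cgroup   # this is the step that fails with EPERM in the unprivileged container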

My current config is:

{ config, pkgs, lib, ... }:

let stateVersion = config.systemStateVersion;
in {
  systemd.services."container@docker" = {
    environment = {
      SYSTEMD_NSPAWN_USE_CGNS = "0";
      SYSTEMD_NSPAWN_UNIFIED_HIERARCHY = "1";
    };
  };

  containers.docker = {
    autoStart = true;
    ephemeral = true;

    privateNetwork = true;
    hostBridge = "br-untrusted";
    extraFlags = [
      "--private-users=${toString (65536 * 28)}:65536"
      "--private-users-ownership=chown"
      "--system-call-filter=add_key"
      "--system-call-filter=keyctl"
      "--system-call-filter=bpf"
    ];

    bindMounts = {
      "/docker" = {
        hostPath = "/srv/nixos-containers/docker/work";
        isReadOnly = false;
      };
      "/ssl" = {
        hostPath = "/srv/nixos-containers/docker/ssl";
        isReadOnly = true;
      };
    };

    config = { config, pkgs, ... }: {
      imports = [ ../../shared-configs/roles/network.nix ];

      networking = { useHostResolvConf = false; };

      systemd.network = {
        networks = {
          "10-eth0" = {
            name = "eth0";
            DHCP = "ipv4";
            dhcpV4Config = {
              SendHostname = true;
              Hostname = "docker.testnet";
            };
          };
        };
      };

      virtualisation.containers = {
        storage.settings.storage = {
          driver = "vfs";
          graphroot = "/docker";
        };
      };
      virtualisation.podman = {
        enable = true;
        dockerCompat = true;
        dockerSocket.enable = true;

        networkSocket = {
          enable = true;
          server = "ghostunnel";
          openFirewall = true;
          tls.key = "/ssl/server-key.pem";
          tls.cacert = "/ssl/ca-cert.pem";
          tls.cert = "/ssl/server-cert.pem";
        };
      };
      users = { groups = { podman = { gid = 131; }; }; };
      system.stateVersion = stateVersion;
    };
  };
}

For docker it would be the same, just replace the podman section with a docker one:

      virtualisation.containerd.enable = true;
      virtualisation.docker = {
        enable = true;
        extraOptions = ''
          --containerd=/run/containerd/containerd.sock
        '';
        listenOptions = [ "/run/docker.sock" "0.0.0.0:2376" ];
        # "unix:///run/docker.sock" "tcp://0.0.0.0:2376"

        # storageDriver = "vfs";
        daemon.settings = {
          debug = true;
          log-level = "debug";
          data-root = "/docker";
          tls = true;
          tlsverify = true;
          tlscacert = "/ssl/ca-cert.pem";
          tlscert = "/ssl/server-cert.pem";
          tlskey = "/ssl/server-key.pem";
          storage-driver = "vfs";
          # exec-opts = [ "native.cgroupdriver=systemd" ];
        };
      };
      networking.firewall = { allowedTCPPorts = [ 2376 ]; };
      users = { groups = { docker = { gid = 131; }; }; };

For those who may want to reproduce my steps:

  1. /srv/nixos-containers/docker/work should be chowned to 1835139:1835139 (see the id arithmetic sketched right after this list)
  2. The host system has to have the following config:
    users.users.root = {
      subUidRanges = [
        {
          count = 65536;
          startUid = 65536 * 28; # 1835008, docker
        }
      ];
    };
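For reference, the arithmetic behind those numbers (my reading: 1835139 is container uid/gid 131, i.e. the podman group from the config above, shifted by the start of the subuid range):

echo $(( 65536 * 28 ))        # 1835008: host uid where the container's uid 0 starts
echo $(( 65536 * 28 + 131 ))  # 1835139: host id that container uid/gid 131 maps to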

I’ve also tried to create a systemd slice and start it:

  systemd.user.slices.nesteddocker = {
    enable = true;
  };

then chowned it to the docker uid/gid on the host and mapped it into the container, with a fake name just in case:

      "/sys/fs/cgroup/test.slice" = {
        hostPath = "/sys/fs/cgroup/nesteddocker.slice";
        isReadOnly = false;
      };

and configured docker to use it:

          cgroup-parent = "test.slice";

However this mount is not visible inside the container.
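A quick way to confirm that from inside the container (a sketch):

# prints the mount entry if the path is a mount point, exits non-zero otherwise
findmnt /sys/fs/cgroup/test.slice || echo "not mounted"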

A somewhat relevant discussion about read-only cgroup filesystems in containers: Cannot run systemd containers because of cgroup read-only · Issue #1336 · kubernetes-sigs/kind · GitHub

Another relevant topic: Can not bind `/sys/fs/cgroup` · Issue #11703 · systemd/systemd · GitHub

Another observation: there is in fact a writable part of the cgroup fs:

[root@docker:~]# cat /proc/mounts|grep cg
cgroup /sys/fs/cgroup cgroup2 ro,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/machine.slice/container@docker.service/payload cgroup2 rw,nosuid,nodev,noexec,relatime 0 0

The problem is that docker/podman are not aware of such a configuration, and it seems to be impossible to pass that path through the cgroup-parent argument.

Well, I’ve created an issue in the docker tracker, though I doubt anyone will care: docker is unable to run inside systemd-nspawn container · Issue #44402 · moby/moby · GitHub

I have managed to do this, albeit in a somewhat degraded way. The key was to enable cgroups v2 on both the host and in the container, allow some system calls, and tweak a few other settings:

# cgroups v2
systemd.enableUnifiedCgroupHierarchy = true;

containers.example = {
  ...
  enableTun = true;
  additionalCapabilities = ["all"];
  allowedDevices = [
    { node = "/dev/fuse"; modifier = "rwm"; }
    { node = "/dev/mapper/control"; modifier = "rwm"; }
  ];
  bindMounts.dev-fuse = { hostPath = "/dev/fuse"; mountPoint = "/dev/fuse"; };
  bindMounts.dev-fuse = { hostPath = "/dev/mapper"; mountPoint = "/dev/mapper"; };
};

# enable cgroups v2 in the container
systemd.services."container@example".environment.SYSTEMD_NSPAWN_UNIFIED_HIERARCHY = "1";

# allow syscalls via an nspawn config file, because arguments with spaces don't work well with containers.example.extraFlags
environment.etc."systemd/nspawn/example.nspawn".text = ''
  [Exec]
  SystemCallFilter=add_key keyctl bpf
'';

This is extracted from my current config, so I’m not sure whether all of it is needed, but it works for me with both podman and docker. The one thing that does not work is privileged containers; otherwise it seems to handle the containers I run.


This is set by default now.

Probably not; I’ve tried your config but I’m still getting the same error. Is your container privileged?

I just run a regular nspawn container, with the tweaks I posted.

Are you sure cgroups v2 is enabled on both the host and in the nspawn container? Do you have /sys/fs/cgroup/cgroup.controllers in both?
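A quick way to check, in case it helps (a sketch, run on the host and inside the nspawn container):

# present only when the unified (v2) hierarchy is in use
ls /sys/fs/cgroup/cgroup.controllers
# shows whether the hierarchy is mounted rw or ro
grep cgroup2 /proc/mounts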

Yeah, both are enabled and the file is there, but the filesystem is read-only.

I am running into this too; see Simulating a kubernetes cluster with containers - #5 by azazel75

I assume there are still no solutions to this? ^^

I don’t really understand whether the answers in the GitHub issues are relevant to NixOS?

I also can’t get @ndreas’ solution to work.

Edit: never mind, I got it to work, I just had a typo ^^
My working version is in this thread: Simulating a kubernetes cluster with containers - #6 by ZerataX

This broke for me recently on unstable, most likely with the upgrade of systemd to version 253. Both docker and podman started throwing EPERM errors. It took some time comparing Arch, where it worked, with NixOS unstable, and I had to resort to strace and reading up on capabilities and seccomp.
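For context, this is roughly the kind of strace invocation that narrows it down (a sketch, not my exact command):

# follow forks and watch the syscalls the runtime needs while setting up the container
strace -f -e trace=clone,clone3,unshare,bpf,keyctl,add_key \
  podman run --rm hello-world 2>&1 | grep -E 'EPERM|EINVAL'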

While testing I came up with a minimal flake that works for me:

{
  description = "lab";
  inputs = {
    unstable.url = "nixpkgs/nixos-unstable";
  };
  outputs = { self, unstable }: {
    nixosConfigurations.lab = unstable.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [
        ({ config, pkgs, lib, ... }: {
          imports = [
            ./hardware-configuration.nix
          ];

          nix.settings.experimental-features = [ "nix-command" "flakes" ];

          boot.loader.systemd-boot.enable = true;
          boot.loader.efi.canTouchEfiVariables = true;

          networking.hostName = "<hostname>";

          time.timeZone = "<timezone>";

          i18n.defaultLocale = "en_US.UTF-8";
          console = {
            font = "Lat2-Terminus16";
          };

          services.openssh.enable = true;
          services.openssh.settings.PermitRootLogin = "yes";

          networking.nat.enable = true;
          networking.nat.internalInterfaces = [ "ve-+" ];
          networking.nat.externalInterface = "enp1s0";

          containers.lab = {
            autoStart = true;
            privateNetwork = true;
            hostAddress = "192.168.100.10";
            localAddress = "192.168.100.11";
            enableTun = true;
            extraFlags = [ "--private-users-ownership=chown" ];
            additionalCapabilities = [
              # This is a very ugly hack to add the system-call-filter flag to
              # nspawn. extraFlags is written to an env file as an env var and
              # does not support spaces in arguments, so I take advantage of
              # the additionalCapabilities generation to inject the command
              # line argument.
              ''all" --system-call-filter="add_key keyctl bpf" --capability="all''
            ];
            allowedDevices = [
              { node = "/dev/fuse"; modifier = "rwm"; }
              { node = "/dev/mapper/control"; modifier = "rw"; }
              { node = "/dev/console"; modifier = "rwm"; }
            ];
            bindMounts.fuse = {
              hostPath = "/dev/fuse";
              mountPoint = "/dev/fuse";
              isReadOnly = false;
            };
            config = { config, pkgs, ... }: {
              boot.isContainer = true;
              system.stateVersion = "22.11";
              virtualisation.docker.enable = true;
              systemd.services.docker.path = [ pkgs.fuse-overlayfs ];
            };
          };

          system.stateVersion = "22.11";
        })
      ];
    };
  };
}

It may be that the --private-users-ownership=chown flag is not needed; I just found a change introduced in systemd 253 that may have caused the permission errors.

I also opted to hack the --system-call-filter flag directly into the start script, since extraFlags does not support spaces in arguments and using a file in /etc/systemd/nspawn seemed to have stopped working.

The addition of /dev/console to allowedDevices was necessary because some containers need to create that device, and failed on an EPERM.


For k3s on 23.11 I now also need /dev/kmsg:

 allowedDevices = [
  { node = "/dev/fuse"; modifier = "rwm"; }
  { node = "/dev/mapper/control"; modifier = "rwm"; }
  { node = "/dev/consotruele"; modifier = "rwm"; }
  { node = "/dev/kmsg"; modifier = "rwm"; }
];

bindMounts = {
  kmsg = {
    hostPath = "/dev/kmsg";
    mountPoint = "/dev/kmsg";
    isReadOnly = false;
  };
  fuse = {
    hostPath = "/dev/fuse";
    mountPoint = "/dev/fuse";
    isReadOnly = false;
  };
};

complete config: simulating a k3s cluster in nixos (gist.github.com)
