I’m trying to run docker (or podman) in a NixOS container and wondering whether anyone has managed that.
So far I can run them in a privileged container if I manually remount /sys/fs/cgroup as read-write. In an unprivileged container I can get the docker daemon running, but runc fails:
docker:
[root@docker:~]# docker run --rm hello-world
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to apply cgroup configuration: mkdir /sys/fs/cgroup/system.slice/docker-134a627882acf5b82da421ed2fb29c4de9cd9a9d737972967e0e32bad8be4076.scope: read-only file system: unknown.
podman:
[root@docker:~]# podman run --rm hello-world
WARN[0000] Failed to add conmon to systemd sandbox cgroup: Permission denied
Error: OCI runtime error: crun: clone: Invalid argument
I guess the problem is the same (a read-only cgroup filesystem), though in an unprivileged container I can’t remount it, even manually.
The problem is that docker/podman are not aware of such configuration and it seems to be impossible to pass that path through the cgroup-parent argument.
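For the privileged case, the manual remount of /sys/fs/cgroup mentioned in the question could be automated inside the container. The following is only a minimal, untested sketch: the unit name remount-cgroup-rw is hypothetical, and it assumes the remount succeeds exactly as it does when run by hand in a privileged container.

```nix
# Container-side module (hypothetical sketch, not taken from the thread).
{ pkgs, ... }:
{
  virtualisation.docker.enable = true;

  # Remount the cgroup filesystem read-write before docker starts.
  # "remount-cgroup-rw" is an illustrative name.
  systemd.services.remount-cgroup-rw = {
    description = "Remount /sys/fs/cgroup read-write for docker";
    wantedBy = [ "multi-user.target" ];
    before = [ "docker.service" ];
    path = [ pkgs.util-linux ];
    serviceConfig.Type = "oneshot";
    script = ''
      mount -o remount,rw /sys/fs/cgroup
    '';
  };
}
```

This does not help in an unprivileged container, where the remount itself is not permitted.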
I have managed to do this, albeit in a somewhat degraded way. The key was to enable cgroups v2 on both the host and the container, allow some system calls, and adjust a few other settings:
This is extracted from my current config, so I’m not sure all of it is needed, but it works for me with both podman and docker. The one thing that does not work is privileged containers, but otherwise it seems to handle the containers I run.
This broke for me recently on unstable, most likely with the upgrade of systemd to version 253. Both docker and podman started throwing EPERM errors. It took some time comparing Arch, where it still worked, against NixOS unstable, and I had to resort to strace and to reading up on capabilities and seccomp.
While testing I came up with a minimal flake that works for me:
{
  description = "lab";

  inputs = {
    unstable.url = "nixpkgs/nixos-unstable";
  };

  outputs = { self, unstable }: {
    nixosConfigurations.lab = unstable.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [
        ({ config, pkgs, lib, ... }: {
          imports = [
            ./hardware-configuration.nix
          ];

          nix.settings.experimental-features = [ "nix-command" "flakes" ];

          boot.loader.systemd-boot.enable = true;
          boot.loader.efi.canTouchEfiVariables = true;

          networking.hostName = "<hostname>";
          time.timeZone = "<timezone>";
          i18n.defaultLocale = "en_US.UTF-8";
          console = {
            font = "Lat2-Terminus16";
          };

          services.openssh.enable = true;
          services.openssh.settings.PermitRootLogin = "yes";

          networking.nat.enable = true;
          networking.nat.internalInterfaces = [ "ve-+" ];
          networking.nat.externalInterface = "enp1s0";

          containers.lab = {
            autoStart = true;
            privateNetwork = true;
            hostAddress = "192.168.100.10";
            localAddress = "192.168.100.11";
            enableTun = true;
            extraFlags = [ "--private-users-ownership=chown" ];
            additionalCapabilities = [
              # This is a very ugly hack to add the system-call-filter flag to
              # nspawn. extraFlags is written to an env file as an env var and
              # does not support spaces in arguments, so I take advantage of
              # the additionalCapabilities generation to inject the command
              # line argument.
              ''all" --system-call-filter="add_key keyctl bpf" --capability="all''
            ];
            allowedDevices = [
              { node = "/dev/fuse"; modifier = "rwm"; }
              { node = "/dev/mapper/control"; modifier = "rw"; }
              { node = "/dev/console"; modifier = "rwm"; }
            ];
            bindMounts.fuse = {
              hostPath = "/dev/fuse";
              mountPoint = "/dev/fuse";
              isReadOnly = false;
            };
            config = { config, pkgs, ... }: {
              boot.isContainer = true;
              system.stateVersion = "22.11";
              virtualisation.docker.enable = true;
              systemd.services.docker.path = [ pkgs.fuse-overlayfs ];
            };
          };

          system.stateVersion = "22.11";
        })
      ];
    };
  };
}
The --private-users-ownership=chown flag may not be needed. I added it after finding a change introduced in systemd 253 that may have caused the permission errors.
I also opted to hack the --system-call-filter flag directly into the start script. extraFlags does not support spaces in arguments, and using a file in /etc/systemd/nspawn seemed to have stopped working.
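A possibly cleaner way to avoid the space problem is to pass one --system-call-filter flag per system call through extraFlags, so that no single argument contains a space. This is an untested sketch, not the configuration used above. It assumes systemd-nspawn combines repeated --system-call-filter options into one list, as its man page describes, and that the container start script word-splits extraFlags as before.

```nix
# Host-side fragment (untested sketch; replaces the additionalCapabilities hack).
{ ... }:
{
  containers.lab = {
    # Plain capability list, no flag injection needed here.
    additionalCapabilities = [ "all" ];

    extraFlags = [
      "--private-users-ownership=chown"
      # One flag per syscall, so no argument contains a space.
      "--system-call-filter=add_key"
      "--system-call-filter=keyctl"
      "--system-call-filter=bpf"
    ];
  };
}
```

If the repeated flags are not merged on a given systemd version, the hack shown in the flake remains the fallback.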
Adding /dev/console to allowedDevices was necessary because some containers need to create that device and otherwise failed with EPERM.