Docker fails to start (repost)

[ reposting here instead of #Development as there were no replies within a week ]

I have a strange problem with starting docker. Two systems which are configured the same (as far as docker is concerned; the boxes have lots of other configuration differences) and one runs (box1) while the other does not (box2). I am hoping someone can point me in the right direction as I have been unable to figure out the problem.

Both boxes are at the same version of NixOS (19.09.2345.6d445f8398d):

[me@box1:~]$ cat /etc/issue
<elided>
<<< Welcome to NixOS 19.09.2345.6d445f8398d (\m) - \l >>>
<elided>
[me@box2:~]$ cat /etc/issue
<elided>
<<< Welcome to NixOS 19.09.2345.6d445f8398d (\m) - \l >>>
<elided>

But docker works on box1 but not on box2:

[me@box1:~]$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
[me@box2:~] $ docker ps
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

The docker configurations are exactly the same on both boxes (and in fact are from the same virtualization.nix cloned from a git repo):

[me@box1:~]$ grep -rni docker /etc/nixos/virtualization.nix
virtualization.docker.enable = true;
users.users.me.extraGroups = [ "docker" ];
[me@box2:~]$ grep -rni docker /etc/nixos/virtualization.nix
virtualization.docker.enable = true;
users.users.me.extraGroups = [ "docker" ];

On box1, docker is running fine:

[root@box1:~]# systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/nix/store/7qjn8j5cy2imizis9lys222ak7smql34-docker-19.03.5/etc/systemd/system/docker.service; enabled; vendor preset: enabled)
  Drop-In: /nix/store/liy275aqw412ggyd3xzfxj7cn235s1hk-system-units/docker.service.d
           └─overrides.conf
   Active: active (running) since Fri 2020-03-27 04:44:23 EDT; 5 days ago
     Docs: https://docs.docker.com
 Main PID: 21791 (dockerd)
       IP: 1.3G in, 1.4M out
    Tasks: 46 (limit: 4915)
   Memory: 227.7M
      CPU: 41min 57.026s
   CGroup: /system.slice/docker.service
           ├─21791 /nix/store/7qjn8j5cy2imizis9lys222ak7smql34-docker-19.03.5/libexec/docker/dockerd --group=docker --host=fd:// --log-driver=journald --live-restore
           └─21911 containerd --config /var/run/docker/containerd/containerd.toml --log-level info

Apr 01 16:31:34 box1 dockerd[21791]: time="2020-04-01T16:31:34.139669954-04:00" level=info msg="Layer sha256:5969d16c5b6921e0be2f921c0542404f748fd883fb3aef80146cd5a19e02c56a cleaned up"
Apr 01 16:31:42 box1 dockerd[21791]: time="2020-04-01T16:31:42.677811415-04:00" level=info msg="Layer sha256:cb5548570d4596dfd019ae24bd3821d44bd5b1cec116384fae127852fd6569dd cleaned up"
Apr 01 16:31:43 box1 dockerd[21791]: time="2020-04-01T16:31:43.944538688-04:00" level=info msg="Layer sha256:cb5548570d4596dfd019ae24bd3821d44bd5b1cec116384fae127852fd6569dd cleaned up"

While on box2 it does not:

[root@box2:~]# systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/nix/store/7qjn8j5cy2imizis9lys222ak7smql34-docker-19.03.5/etc/systemd/system/docker.service; enabled; vendor preset: enabled)
  Drop-In: /nix/store/3r2gm2ik5m393p2jq8zs42xz2kj2aa51-system-units/docker.service.d
           └─overrides.conf
   Active: failed (Result: exit-code) since Wed 2020-04-01 20:39:26 EDT; 1h 11min ago
     Docs: https://docs.docker.com
  Process: 13236 ExecStart=/nix/store/7qjn8j5cy2imizis9lys222ak7smql34-docker-19.03.5/bin/dockerd --group=docker --host=fd:// --log-driver=journald --live-restore (code=exited, status=1/FAILURE)
 Main PID: 13236 (code=exited, status=1/FAILURE)
       IP: 0B in, 0B out
      CPU: 96ms

Apr 01 20:39:26 box2 systemd[1]: docker.service: Scheduled restart job, restart counter is at 3.
Apr 01 20:39:26 box2 systemd[1]: Stopped Docker Application Container Engine.
Apr 01 20:39:26 box2 systemd[1]: docker.service: Start request repeated too quickly.
Apr 01 20:39:26 box2 systemd[1]: docker.service: Failed with result 'exit-code'.
Apr 01 20:39:26 box2 systemd[1]: Failed to start Docker Application Container Engine.

Comparing start up messages in the system journal (showing only the lines which differ, excluding differences in machine names, kernel versions, dates/times, etc), box1 succeeds:

[root@box1:~]# journalctl -u docker.service
<elided>
Mar 27 04:44:23 box1 dockerd[21791]: time="2020-03-27T04:44:23.472232856-04:00" level=warning msg="Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinpooldev` or use `man dockerd` to refer to dm.thinpooldev section." storage-driver=devicemapper
Mar 27 04:44:23 box1 dockerd[21791]: time="2020-03-27T04:44:23.495267796-04:00" level=warning msg="Base device already exists and has filesystem xfs on it. User specified filesystem  will be ignored." storage-driver=devicemapper
Mar 27 04:44:23 box1 dockerd[21791]: time="2020-03-27T04:44:23.531122236-04:00" level=info msg="[graphdriver] using prior storage driver: devicemapper"
Mar 27 04:44:23 box1 dockerd[21791]: time="2020-03-27T04:44:23.531164648-04:00" level=warning msg="[graphdriver] WARNING: the devicemapper storage-driver is deprecated, and will be removed in a future release"
Mar 27 04:44:23 box1 dockerd[21791]: time="2020-03-27T04:44:23.548394425-04:00" level=error msg="layer does not exist, not restoring image sha256:8e090d9629330f49c7b88def487ff984d708463d2022ac1994bb68e3a4b1ae19, sha256:c756f7785bb37606fb32e16ca04758c2a3ba0d53928aef7a1c0431f60ba794f4, linux"
Mar 27 04:44:23 box1 dockerd[21791]: time="2020-03-27T04:44:23.561426329-04:00" level=warning msg="Your kernel does not support cgroup rt period"
Mar 27 04:44:23 box1 dockerd[21791]: time="2020-03-27T04:44:23.561440099-04:00" level=warning msg="Your kernel does not support cgroup rt runtime"
Mar 27 04:44:23 box1 dockerd[21791]: time="2020-03-27T04:44:23.561628650-04:00" level=info msg="Loading containers: start."
Mar 27 04:44:23 box1 dockerd[21791]: time="2020-03-27T04:44:23.688613784-04:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Mar 27 04:44:23 box1 dockerd[21791]: time="2020-03-27T04:44:23.722178192-04:00" level=info msg="Loading containers: done."
Mar 27 04:44:23 box1 dockerd[21791]: time="2020-03-27T04:44:23.738616490-04:00" level=info msg="Docker daemon" commit=633a0ea838f10e000b7c6d6eed1623e6e988b5bc graphdriver(s)=devicemapper version=19.03.5
Mar 27 04:44:23 box1 dockerd[21791]: time="2020-03-27T04:44:23.738683060-04:00" level=info msg="Daemon has completed initialization"
Mar 27 04:44:23 box1 dockerd[21791]: time="2020-03-27T04:44:23.753080422-04:00" level=info msg="API listen on /run/docker.sock"
Mar 27 04:44:23 box1 systemd[1]: Started Docker Application Container Engine.

While box2 which does not:

[root@box1:~]# journalctl -u docker.service
<elided>
Apr 01 20:39:25 box2 dockerd[13236]: time="2020-04-01T20:39:25.154392684-04:00" level=error msg="[graphdriver] prior storage driver zfs failed: prerequisites for driver not satisfied (wrong filesystem?)"
Apr 01 20:39:25 box2 dockerd[13236]: time="2020-04-01T20:39:25.154712494-04:00" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
Apr 01 20:39:25 box2 dockerd[13236]: time="2020-04-01T20:39:25.154721394-04:00" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
Apr 01 20:39:26 box2 dockerd[13236]: failed to start daemon: error initializing graphdriver: prerequisites for driver not satisfied (wrong filesystem?)
Apr 01 20:39:26 box2 systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Apr 01 20:39:26 box2 systemd[1]: docker.service: Failed with result 'exit-code'.
Apr 01 20:39:26 box2 systemd[1]: Failed to start Docker Application Container Engine.
Apr 01 20:39:26 box2 systemd[1]: docker.service: Scheduled restart job, restart counter is at 3.
Apr 01 20:39:26 box2 systemd[1]: Stopped Docker Application Container Engine.

As noted above, the kernel versions are different, perhaps because I forgot to reboot after nixos-rebuild on box1.

[me@box1:~]$ uname -a
Linux box1 4.19.109 #1-NixOS SMP Wed Mar 11 13:15:13 UTC 2020 x86_64 GNU/Linux
[me@box2:~]$ uname -a
Linux box2 4.19.113 #1-NixOS SMP Wed Mar 25 07:06:15 UTC 2020 x86_64 GNU/Linux

Starting dockerd by hand rather than letting systemd do it gives the same “API listen” message as the successful run on box1:

[root@box2:~]# dockerd  # works
<elided>
INFO[2020-04-01T20:57:23.239202099-04:00] API listen on /var/run/docker.sock

And now docker works:

[me@box2:~]$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

Comparing the system journal output for dockerd run by hand vs run by systemd: starting by-hand has errors (related to storage-driver=zfs, sandbox not found and failure to unmount) but in the end it succeeds while starting through systemd fails.

$ diff -ubB <(cut -c44- /tmp/box2.by-hand.log) <(cut -c83- /tmp/box2.systemd.txt)
<elided>
-level=info msg="[graphdriver] using prior storage driver: zfs"
-level=warning msg="Your kernel does not support cgroup rt period"
-level=warning msg="Your kernel does not support cgroup rt runtime"
-level=info msg="Loading containers: start."
-level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
-level=error msg="Error unmounting /var/lib/docker/zfs/graph/78ef607577558ab46ba682c1b18bca91e15b72c61abb7cd4910946ec4288836d: invalid argument" storage-driver=zfs
-level=error msg="Error unmounting /var/lib/docker/zfs/graph/d438772fd056272c9921c7772d39bb89a4b5faaebf633f024744d93da6e9ba82: invalid argument" storage-driver=zfs
-level=warning msg="error locating sandbox id 9b6add6e012710a6421b7fa6bb244b9f507f0f5ab4382ed7e89db05d1fa14878: sandbox 9b6add6e012710a6421b7fa6bb244b9f507f0f5ab4382ed7e89db05d1fa14878 not found"
-level=warning msg="cc5f8114468997d3e70a7e0cce90c18b75d26a46811951e05d8bdf5760235249 cleanup: failed to unmount IPC: umount /var/lib/docker/containers/cc5f8114468997d3e70a7e0cce90c18b75d26a46811951e05d8bdf5760235249/mounts/shm, flags: 0x2: no such file or directory"
-level=warning msg="error locating sandbox id 3f7db32a07a4bffe2a720c3f0a3f639e16eafa57c307b9519ae045b64c9fdff1: sandbox 3f7db32a07a4bffe2a720c3f0a3f639e16eafa57c307b9519ae045b64c9fdff1 not found"
-level=warning msg="d6f798e9a09b19187a6e8a59bbb74bf1e32fa4e6f87ca8e783cb175ba6ed7f7f cleanup: failed to unmount IPC: umount /var/lib/docker/containers/d6f798e9a09b19187a6e8a59bbb74bf1e32fa4e6f87ca8e783cb175ba6ed7f7f/mounts/shm, flags: 0x2: no such file or directory"
-level=warning msg="Failed to unmount 78ef607577558ab46ba682c1b18bca91e15b72c61abb7cd4910946ec4288836d mount /var/lib/docker/zfs/graph/78ef607577558ab46ba682c1b18bca91e15b72c61abb7cd4910946ec4288836d: no such file or directory" storage-driver=zfs
-level=warning msg="Failed to unmount d438772fd056272c9921c7772d39bb89a4b5faaebf633f024744d93da6e9ba82 mount /var/lib/docker/zfs/graph/d438772fd056272c9921c7772d39bb89a4b5faaebf633f024744d93da6e9ba82: no such file or directory" storage-driver=zfs
-level=error msg="d6f798e9a09b19187a6e8a59bbb74bf1e32fa4e6f87ca8e783cb175ba6ed7f7f cleanup: failed to delete container from containerd: no such container"
-level=error msg="Failed to start container d6f798e9a09b19187a6e8a59bbb74bf1e32fa4e6f87ca8e783cb175ba6ed7f7f: error creating zfs mount: mount rpool/root/nixos/d438772fd056272c9921c7772d39bb89a4b5faaebf633f024744d93da6e9ba82:/var/lib/docker/zfs/graph/d438772fd056272c9921c7772d39bb89a4b5faaebf633f024744d93da6e9ba82: no such file or directory"
-level=error msg="cc5f8114468997d3e70a7e0cce90c18b75d26a46811951e05d8bdf5760235249 cleanup: failed to delete container from containerd: no such container"
-level=error msg="Failed to start container cc5f8114468997d3e70a7e0cce90c18b75d26a46811951e05d8bdf5760235249: error creating zfs mount: mount rpool/root/nixos/78ef607577558ab46ba682c1b18bca91e15b72c61abb7cd4910946ec4288836d:/var/lib/docker/zfs/graph/78ef607577558ab46ba682c1b18bca91e15b72c61abb7cd4910946ec4288836d: no such file or directory"
-level=info msg="Loading containers: done."
-level=info msg="Docker daemon" commit=633a0ea838f10e000b7c6d6eed1623e6e988b5bc graphdriver(s)=zfs version=19.03.5
-level=info msg="Daemon has completed initialization"
-level=info msg="API listen on /var/run/docker.sock"
+level=error msg="[graphdriver] prior storage driver zfs failed: prerequisites for driver not satisfied (wrong filesystem?)"
+level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
+level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby

At this point, I am at my wits end. Any help in identifying the problem will be greatly appreciated.

Looking at the logs, it seems that the second fails to start the “zfs” storage driver it is configured with. Check if you have a different fs layout between the two. It seems that the first settles on using the devicemapper

Thanks @azazel75. That got me looking in the right direction.

The problem was that dockerd must have been originally selected some other storage driver. Somewhere along the line, it changed its mind and decided to use zfs which because it hadn’t been configured broke. The reason why the other machine was working is because it choose the device-manager storage driver. I don’t know why the difference. Both machines are configured similarly.

I solved it by disabling docker and explicitly configuring for the zfs storage driver as discussed in Use the ZFS storage driver | Docker Docs as so:

# sed -i 's/virtualization.docker.enable=true/virtualization.docker.enable=false/' configuration.nix
# nixos-rebuild switch
# rm -rf /var/lib/docker  # no need to save; nothing I want to keep
# zfs create -o mountpoint=/var/lib/docker bigpool/docker
# sed -i 's/virtualization.docker.enable=false/virtualization.docker.enable=true;\nvirtualisation.docker.storageDriver = "zfs"/' configuration.nix
# nixos-rebuild switch
# docker version  # succeeds