How to debug a failing container?

I have a bunch of containers on my NixOS machine. I recently did a big machine-wide NixOS update, which completed successfully, but now one of the containers will no longer come up. The others are fine.

Can anyone advise me how to debug? I see:

# nixos-container start dd_db
Job for container@dd_db.service failed because the control process exited with error code.
See "systemctl status container@dd_db.service" and "journalctl -xe" for details.
/run/current-system/sw/bin/nixos-container: failed to start container

and

# systemctl status container@dd_db.service
● container@dd_db.service - Container 'dd_db'
   Loaded: loaded (/nix/store/7y0sr19lrq1zhgy2rkxpr0zl4q5m133q-unit-container-dd_db.service/container@dd_db.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2019-06-24 14:26:47 PDT; 8s ago
  Process: 12313 ExecStartPre=/nix/store/rlw1qr2qy0gg9bsiy3pddvnh8gqrbmdb-unit-script-container_dd_db-pre-start (code=exited, status=0/SUCCESS)
  Process: 12315 ExecStart=/nix/store/fyzw0brvamzdwiz60m3isfj0s4nislqg-unit-script-container_dd_db-start (code=exited, status=1/FAILURE)
 Main PID: 12315 (code=exited, status=1/FAILURE)
   Status: "Terminating..."

Jun 24 14:26:47 srvr systemd[1]: container@dd_db.service: Service RestartSec=100ms expired, scheduling restart.
Jun 24 14:26:47 srvr systemd[1]: container@dd_db.service: Scheduled restart job, restart counter is at 5.
Jun 24 14:26:47 srvr systemd[1]: Stopped Container 'dd_db'.
Jun 24 14:26:47 srvr systemd[1]: container@dd_db.service: Start request repeated too quickly.
Jun 24 14:26:47 srvr systemd[1]: container@dd_db.service: Failed with result 'exit-code'.
Jun 24 14:26:47 srvr systemd[1]: Failed to start Container 'dd_db'.

which really doesn’t tell me much, and the only line that looks interesting in the journal is:

Jun 24 14:26:47 srvr container dd_db[12315]: Invalid machine name: dd_db

which looks like it matters, but this container hasn’t changed since before I updated the system.

The container’s filesystem is intact and all its files are still there. What can I do to dig into this further? Anything obvious?

You shouldn’t have an underscore in your container name, see systemd/systemd#11765.
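If you want to sanity-check a name before creating a container, here is a minimal sketch (valid_container_name is a hypothetical helper, not part of nixos-container; the regex approximates systemd’s hostname-style machine-name rule, which rejects underscores):

```shell
# Hypothetical helper: accept only hostname-style names
# (letters, digits, hyphens; no leading/trailing hyphen, no underscores).
valid_container_name() {
  printf '%s' "$1" | grep -Eq '^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$'
}

valid_container_name "dd_db" && echo "dd_db: ok" || echo "dd_db: invalid"  # prints "dd_db: invalid"
valid_container_name "dd-db" && echo "dd-db: ok" || echo "dd-db: invalid"  # prints "dd-db: ok"
```

Renaming the container from dd_db to dd-db (or similar) is the fix here.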

For the future:

- To see the entire log of the container service itself, run journalctl -u container@foo.service on the host.
- To see the logs of a service running inside the container, run sudo journalctl -M foo -u bar, where foo is the container name and bar is the service name.
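Filled in for this thread (using the container renamed to dd-db after the fix; postgresql is just a hypothetical service name), echoed rather than executed so the exact invocations are easy to copy:

```shell
container=dd-db       # container name after dropping the underscore
service=postgresql    # hypothetical service running inside the container

echo "journalctl -u container@${container}.service"   # host-side log of the container unit
echo "sudo journalctl -M ${container} -u ${service}"  # log of one service inside the container
```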


You shouldn’t have an underscore in your container name

Wow! Thanks for this! Never would have figured that out. This fixed it.

To see the entire logs of the container service you can run journalctl -u container@foo.service

Oh, I was looking at that part already (it’s where I got the “Invalid machine name” error line). It really contains almost nothing at all, which confused me. But that makes sense, since systemd never even gets far enough to actually start the container.

Thanks again!

Thanks a bunch to both of you, I just had this same exact problem and solved it with a quick google for “debug nixos container”!

I can’t believe things like this are still a problem on modern linux. :confused:


That is indeed very useful information. However, in my case I cannot access the logs of the failing container with this method, maybe because it was never really alive in the first place.

I guess the container’s journald has to forward its logs to the host for that to work, which never happens if the container dies during startup.

Any other suggestions how I can peek into the container and see what caused the problem?
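One possible workaround, as a sketch: the container’s filesystem lives on the host (under /var/lib/nixos-containers on recent NixOS; older releases used /var/lib/containers), so if journald inside the container writes to persistent storage, its journal files can be read directly from the host with journalctl’s --directory flag, even when the container never got far enough to forward anything. The container name foo is a placeholder:

```shell
name=foo    # placeholder container name
dir="/var/lib/nixos-containers/$name/var/log/journal"

if [ -d "$dir" ]; then
  # journalctl can read a foreign journal directory directly.
  journalctl --directory="$dir" --no-pager
else
  echo "no journal directory at $dir"
fi
```

If the directory is missing, the container’s journald was likely using volatile storage, and the start script’s own error output on the host-side unit log is the only trace left.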