I set up a few declarative containers on a NixOS 20.09 server I manage, and I find that over time, they lose the ability to resolve host names.
If I check their /etc/resolv.conf shortly after a restart, I find a nameserver x.x.x.x entry, and ping google.com works.
e.g.
(host) # machinectl shell root@certs
(certs) # cat /etc/resolv.conf
# Generated by resolvconf
domain example.local
nameserver 192.168.70.1
options edns0
options edns0
If I check back a month or two later and review /etc/resolv.conf, I find the nameserver entry missing, and ping google.com fails to resolve the host name.
e.g.
(host) # machinectl shell root@certs
(certs) # cat /etc/resolv.conf
# Generated by resolvconf
options edns0
options edns0
I can echo 'nameserver 192.168.70.1' >> /etc/resolv.conf and restore DNS resolution inside the container, but I’d like to identify what is breaking my resolv.conf.
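To catch the broken state before a renewal fails, the manual check above can be scripted. This is only a sketch: has_nameserver and check_resolv_conf are hypothetical helper names, not part of NixOS or resolvconf, and the pattern assumes the usual "nameserver ADDRESS" line format.

```shell
# has_nameserver FILE: succeed if FILE contains at least one "nameserver" line.
# (Hypothetical helper, not provided by NixOS or resolvconf.)
has_nameserver() {
  grep -Eq '^[[:space:]]*nameserver[[:space:]]+[^[:space:]]+' "$1"
}

# check_resolv_conf [FILE]: report whether FILE (default /etc/resolv.conf)
# still has a nameserver entry; returns non-zero when it does not.
check_resolv_conf() {
  file=${1:-/etc/resolv.conf}
  if has_nameserver "$file"; then
    echo "ok: $file has a nameserver entry"
  else
    echo "broken: $file has no nameserver entry"
    return 1
  fi
}
```

Run inside the container as check_resolv_conf, for example from a periodic job, this could narrow down roughly when the entry disappears.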
One declarative container I use is:
let
  options = { /* Redacted */ };
in
containers.certs = {
  autoStart = true;
  bindMounts = {
    "${options.path}" = {
      hostPath = "/srv/certs";
      isReadOnly = false;
    };
  };
  config = { pkgs, ... }: {
    networking.firewall.enable = false;
    environment.systemPackages = [ pkgs.lego ];
    systemd = {
      services.certs = {
        description = "Renew certificates for hosts on the network.";
        after = [ "network-online.target" ];
        requires = [ "network-online.target" ];
        environment = envVars;
        path = [ pkgs.lego ];
        serviceConfig = {
          Type = "oneshot";
          User = "caddy";
          Group = "caddy";
        };
        script = concatStringsSep "\n" (map (cert: renewCert cert.domains) options.certs);
        preStart = ''
          if [[ ! -d "${options.path}/certificates" ]]; then
            mkdir -p "${options.path}/certificates"
            chmod 750 "${options.path}/certificates"
          fi
          ${concatStringsSep "\n" (map (cert: installCert cert.domains) options.certs)}
        '';
      };
      timers.certs = {
        description = "Try to renew certificates daily.";
        after = [ "network-online.target" ];
        requires = [ "network-online.target" ];
        wantedBy = [ "timers.target" ];
        timerConfig = {
          OnCalendar = "daily";
          Persistent = "true";
        };
      };
    };
    users.users.caddy = {
      uid = hostCfg.ids.uids.caddy;
      group = "caddy";
      createHome = false;
    };
    users.groups.caddy.gid = hostCfg.ids.gids.caddy;
  };
};
...
This basically just checks the certificates once a day and, when they are within 30 days of expiring, connects to Let’s Encrypt to renew them. So for 60 days, this container doesn’t really touch the network.
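The renewal window described above is simple arithmetic, sketched here for illustration only. The real decision is made by lego inside the redacted renewCert helper. needs_renewal is a hypothetical function, and the 90-day lifetime in the usage note is an assumption consistent with the 60 idle days mentioned.

```shell
# needs_renewal EXPIRY_EPOCH NOW_EPOCH [DAYS]
# Succeed when the certificate has DAYS (default 30) or fewer whole days left.
# (Hypothetical helper; lego performs its own check.)
needs_renewal() {
  expiry=$1
  now=$2
  days=${3:-30}
  remaining=$(( (expiry - now) / 86400 ))   # whole days until expiry
  [ "$remaining" -le "$days" ]
}
```

With a 90-day certificate, needs_renewal fails (no renewal needed) for about the first 60 days, so the daily timer has no reason to contact the network during that time.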
So far I’ve tried:
- Review journalctl -eb0 and journalctl -eb-1
  Only failures logged are related to my certificate service being unable to resolve host names.
- Review systemctl status and systemctl --failed
  System is degraded.
  Only certs.service failed.
- Identify resolve service(s).
  I have resolvconf.service. Status active (exited).
  I have nscd.service. Status active (running) (thawing). I’m unsure why it’s thawing…
- In a separate container also afflicted with bad-resolv-syndrome, I manually restarted resolvconf.service and nscd.service, and neither updated /etc/resolv.conf with a nameserver entry.
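One further check that may help is to look at when /etc/resolv.conf was last rewritten and then read the journal around that moment. The sketch below assumes GNU stat and that the journal still covers that time. file_mtime_epoch and journal_window_cmd are hypothetical helper names, and the function only prints the journalctl command rather than running it.

```shell
# file_mtime_epoch FILE: print FILE's last modification time in seconds
# since the epoch (relies on GNU stat's -c option).
file_mtime_epoch() {
  stat -c '%Y' "$1"
}

# journal_window_cmd FILE [MARGIN_SECONDS]: print a journalctl invocation
# covering MARGIN_SECONDS (default 300) before and after FILE's last change.
# (Hypothetical helper; it only prints the command.)
journal_window_cmd() {
  mtime=$(file_mtime_epoch "$1") || return 1
  margin=${2:-300}
  echo "journalctl --since @$((mtime - margin)) --until @$((mtime + margin))"
}
```

Running journal_window_cmd /etc/resolv.conf inside an affected container, and then the printed command on both the container and the host, might show which unit or network event coincided with the rewrite.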
Does anyone have any recommendations where I should look further to identify the culprit?