Okay, so I’ve been working on installing Kubernetes. I’m very new to it, so this might be something obvious to anyone experienced with it.
My configuration is really, really simple, so I don’t think there’s much room for error:
Controller 0:

```nix
services.kubernetes = {
  roles = [ "master" ];
  masterAddress = controller-0.hostname;
  apiserverAddress = "https://${controller-0.hostname}:${toString controller-0.port}";
  easyCerts = true;
  apiserver = {
    securePort = controller-0.port;
    advertiseAddress = controller-0.ip;
  };
  addons.dns.enable = true;
};
```
Worker 0 and 1:
```nix
services.kubernetes = let
  api = "https://${controller-0.hostname}:${toString controller-0.port}";
in {
  roles = [ "node" ];
  masterAddress = controller-0.hostname;
  easyCerts = true;
  kubelet.kubeconfig.server = api;
  apiserverAddress = api;
  addons.dns.enable = true;
};
```
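(In both snippets, `controller-0` is just a plain attribute set holding the controller’s hostname, IP, and apiserver port, and the same values are used on all three hosts. Roughly something like this sketch, where the hostname, IP, and port are placeholders:)

```nix
let
  # Placeholder values; the real ones live in my shared config.
  controller-0 = {
    hostname = "controller-0";
    ip = "10.0.0.10";
    port = 6443;  # kube-apiserver's default secure port
  };
in {
  # Example use, as in the snippets above:
  services.kubernetes.masterAddress = controller-0.hostname;
}
```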
Some details about it:
- Firewall is disabled
- All variables are the same across the three hosts
- They all have each other’s hostnames in `/etc/hosts` (see the sketch below)
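For reference, on NixOS those `/etc/hosts` entries can be declared with the `networking.hosts` option; a minimal sketch with placeholder IPs and hostnames:

```nix
{
  # Map each VM's IP (placeholders here) to its hostname so the three
  # machines can resolve each other without external DNS.
  networking.hosts = {
    "10.0.0.10" = [ "controller-0" ];
    "10.0.0.11" = [ "worker-0" ];
    "10.0.0.12" = [ "worker-1" ];
  };
}
```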
The problem is that pods are constantly getting restarted with `Pod sandbox changed, it will be killed and re-created.` Some of my findings about it:
- I’ve seen it mentioned around that this is a problem of insufficient resources allocated to a pod. However, all of the hosts are VMs with 4 GB of memory and 2 vCPUs.
- I made this same thread yesterday and promptly closed it: I managed to get `coredns` to work (that is, to stop being restarted constantly) just by going from 1 vCPU to 2. So it does look like a resource problem.
- However, the moment I deploy any other pod, it starts failing with the same error; only `coredns` keeps working. My other pods have no resource limits, and they’re pretty lightweight anyway.
- I’ve never seen this error with `kubeadm` or “The Hard Way”, so I think this might be something specific to the NixOS module?
- My configuration is so simple that I can’t imagine what might be wrong with it. I followed a guide but made it simpler (except for adding another node). Given that this is the bare minimum you need to run a cluster on NixOS, someone else must have done it successfully before, right?
Any ideas? Has anyone else faced the same problem? Any tips for debugging it? Literally any help is appreciated, because I’m out of ideas!
UPDATE
It seems this error stopped popping up just by rebooting the nodes, which is pretty surprising for Nix, but unsurprising for Kubernetes. I’m leaving this thread open because:
- Maybe someone will come along with an explanation for this
- Maybe someone with the same problem comes across this and realizes they need to reboot their Kubernetes nodes
- It’d be a shame to close two threads in a row about the same topic with such an easy solution