Kubernetes pods constantly restarted with `Pod sandbox changed`

Okay, so I’ve been working on installing Kubernetes. I’m very new to it, so this might be something obvious to anyone experienced with it.

My configuration is really, really simple, so I don’t think there’s much room for error:

Controller 0:

  # controller-0 is a set holding the controller's hostname, ip and port, defined elsewhere in my config
  services.kubernetes = {
    roles = ["master" ];
    masterAddress = controller-0.hostname;
    apiserverAddress = "https://${controller-0.hostname}:${toString controller-0.port}";
    easyCerts = true;
    apiserver = {
      securePort = controller-0.port;
      advertiseAddress = controller-0.ip;
    };

    addons.dns.enable = true;
  };

Workers 0 and 1:

  services.kubernetes = let
    api = "https://${controller-0.hostname}:${toString controller-0.port}";
  in
  {
    roles = [ "node" ];
    masterAddress = controller-0.hostname;
    easyCerts = true;

    kubelet.kubeconfig.server = api;
    apiserverAddress = api;

    addons.dns.enable = true;
  };

Some details about it:

  • Firewall is disabled
  • All variables are the same across the three hosts
  • They all have each other’s hostnames in /etc/hosts
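
For completeness, this is roughly how the setup can be sanity-checked from each node (the hostnames and the port are examples; substitute whatever controller-0.hostname and controller-0.port are set to):

  # Resolve the other nodes through /etc/hosts (hostnames are examples)
  getent hosts controller-0 worker-0 worker-1

  # Check that the API server port is reachable from a worker
  # (6443 is an assumption; use your configured securePort)
  nc -vz controller-0 6443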

The problem is that pods are constantly getting restarted with `Pod sandbox changed, it will be killed and re-created.` Some of my findings about it (I’ve put the relevant inspection commands after this list):

  • I’ve seen it mentioned in a few places that this error means insufficient resources are allocated to a pod. However, all of the hosts are running on VMs with 4 GB of memory and 2 vCPUs.
  • I made this same thread yesterday and promptly closed it: I managed to get coredns to work (that is, to stop being restarted constantly) just by increasing from 1 vCPU to 2, so it does look like a resource problem.
  • However, the moment I deploy any other pod… it starts failing with the same error; only coredns works. My other pods have no resource limits and are pretty lightweight anyway.
  • I’ve never seen this error with kubeadm or “The Hard Way”, so I think this might be something specific to the NixOS module?
  • My configuration is so simple that I can’t imagine what might be wrong with it… I followed an existing guide but made it simpler (except for adding another node). Given that this is the bare minimum you need to run a cluster on NixOS, someone else must have done it before successfully, right?
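
For reference, this is roughly where the error shows up and how to look at it (generic kubectl/systemd commands, nothing NixOS-specific; the pod name and namespace are placeholders):

  # Events for a failing pod (the sandbox message shows up here)
  kubectl describe pod <pod-name> -n <namespace>

  # Recent events across the cluster, oldest first
  kubectl get events -A --sort-by=.metadata.creationTimestamp

  # Kubelet logs on the node running the pod
  journalctl -u kubelet -f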

Any ideas? Has anyone else faced the same problem? Any tips for debugging it? Literally any help is appreciated because I’m out of ideas!

UPDATE
It seems this error stopped popping up just by rebooting the nodes, which is pretty surprising for Nix but unsurprising for Kubernetes. I’m leaving this thread open because:

  • Maybe someone comes up with an explanation for this
  • Maybe someone has the same problem, comes across this thread, and realizes the nodes need a reboot
  • It’d be a shame to close two threads in a row about the same topic with such an easy solution

I’ve never used nodes with such tight resources, but you can check the node “pressure” conditions (i.e., a status of True when the node is running short on a resource) with the command `kubectl describe nodes`.
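
Something like this shows the relevant conditions without digging through the full describe output (the jsonpath query is just one way to pull it out):

  # Full node details, including the Conditions table
  # (MemoryPressure, DiskPressure, PIDPressure, Ready)
  kubectl describe nodes

  # Just the MemoryPressure status per node
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}{end}'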

I’ve started seeing the same behavior on Rocky Linux 9.1 nodes. I know this is a NixOS-dedicated site, but your description matches exactly what I’m seeing.

The affected pods are always the calico-node and kube-proxy pods on every new node I provision.

My nodes have 4 CPUs and 16 GB of RAM, so I definitely don’t have a resource problem.

Rebooting the nodes seems to have resolved the constant restarts.

I checked the pod logs, the kubelet logs, and the containerd logs, and there’s nothing in them that explains the issue.
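
In case it helps anyone else digging into this, the sandbox state can also be inspected directly through the CRI; this is roughly what that looks like (the socket path is the containerd default, so adjust it if yours differs):

  # List pod sandboxes as containerd sees them
  crictl --runtime-endpoint unix:///run/containerd/containerd.sock pods

  # Inspect a specific sandbox (ID taken from the previous command)
  crictl --runtime-endpoint unix:///run/containerd/containerd.sock inspectp <pod-sandbox-id>

  # containerd logs over the same window
  journalctl -u containerd --since "1 hour ago"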

Anyway, a quick reboot post-provision won’t do any harm, so I guess I’ll add that to my k8s playbook.