Barry Coughlan

22 Feb 2020

Kubernetes workers are not immune to master failures

There is some consensus floating around that Kubernetes worker applications can continue to function when the master is down:

Stack Overflow: What happens when the Kubernetes master fails (August 2016)

It’s my understanding that the master runs the API, and … when it is offline … life for applications will continue as normal unless nodes are rebooted, or there is a dramatic failure of some sort during this time, because TCP/UDP services, load balancers, DNS, the dashboard, etc. should all continue to function.

Stack Overflow: Is kubeadm production ready now? (October 2018)

Also, keep in mind if your master(s) go down your workloads will keep running, you just won’t be able to make changes or schedule new pods until the master(s) comes back up.

This may lead you to conclude that the master node is not critical to application availability. For time, cost or complexity reasons you may think it’s ok to run a single master, but let’s see what happens when it goes down…

Breaking DNS

Kubernetes runs its own DNS server to provide service discovery. If you set up your cluster with kubeadm, it creates a CoreDNS Deployment with 2 replicas. On a single-master setup this means that the master will run one CoreDNS pod and a worker will run the other (except when it doesn’t):

$ kubectl --namespace=kube-system get pods -l k8s-app=kube-dns -owide
NAME                      READY   STATUS    IP          NODE    
coredns-fb8b8dccf-pnvnn   1/1     Running   10.1.0.1   master 
coredns-fb8b8dccf-xs9d9   1/1     Running   10.1.0.2   worker1 

If worker1 goes down, the scheduler on the master will kick in and schedule a new DNS pod on another worker node. All good. But if your master goes down, the scheduler goes with it, and you’ll be left with one DNS pod: only one node failure away from total failure.

But it’s far worse than that. If the master goes down, 50% of the DNS requests will fail. To see why, we have to look at how DNS works in the pods:

$ kubectl run -i --tty busybox --image=busybox --restart=Never -- cat /etc/resolv.conf
nameserver 10.2.3.4
...

Our pod’s DNS points at an IP that matches neither of the CoreDNS pod IPs. In fact, it points to the virtual IP of a Service, named kube-dns for legacy reasons:

$ kubectl --namespace=kube-system get service kube-dns
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)               
kube-dns   ClusterIP   10.2.3.4   <none>        53/UDP,53/TCP,9153/TCP

$ kubectl --namespace=kube-system get endpoints kube-dns -o custom-columns=IP:subsets[*].addresses[*].ip
IP
10.1.0.1,10.1.0.2

A Service of type ClusterIP provides load balancing for pod-to-pod communication. The key to its operation is kube-proxy. The kube-proxy DaemonSet runs on every node. It watches the kube-apiserver for changes to Services and Endpoints and updates iptables rules so that connections to the Service IP are transparently routed to pods using NAT.
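
Roughly, the NAT rules kube-proxy programs for the kube-dns Service look like the following. This is a hand-written simplification, not kube-proxy’s actual output: the real per-service chain names contain hashes, and only the Service and pod IPs are taken from the outputs above.

```shell
# Simplified sketch of kube-proxy's iptables NAT rules (run as root on a node).
# Traffic to the Service VIP 10.2.3.4:53 jumps to a per-service chain...
iptables -t nat -N KUBE-SVC-DNS-EXAMPLE
iptables -t nat -A KUBE-SERVICES -d 10.2.3.4/32 -p udp --dport 53 \
  -j KUBE-SVC-DNS-EXAMPLE
# ...which picks one of the two CoreDNS pod IPs with a 50/50 random split:
iptables -t nat -A KUBE-SVC-DNS-EXAMPLE -m statistic --mode random \
  --probability 0.5 -j DNAT --to-destination 10.1.0.1:53
iptables -t nat -A KUBE-SVC-DNS-EXAMPLE \
  -j DNAT --to-destination 10.1.0.2:53
```

Note that the DNAT targets are fixed pod IPs: nothing in these rules checks whether the destination pod is actually alive.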

kube-proxy considers the kube-apiserver to be the source of truth, and will not act on its own. So when the master goes down, kube-proxy keeps routing traffic to the now-unreachable CoreDNS pod on the master, causing roughly half of DNS requests to fail, and probably your applications along with them.
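
To make the failure mode concrete, here is a toy shell model (not kube-proxy itself) of what an even split across one live and one dead endpoint does to clients:

```shell
# Toy model: kube-proxy's iptables rules split connections evenly between
# the two CoreDNS endpoints; 10.1.0.1 (on the dead master) never answers.
total=1000
failed=0
i=0
while [ "$i" -lt "$total" ]; do
  # stand-in for the 50/50 pick the iptables statistic module makes
  if [ $((i % 2)) -eq 0 ]; then
    backend="10.1.0.1"   # master's CoreDNS pod: unreachable
  else
    backend="10.1.0.2"   # worker1's CoreDNS pod: healthy
  fi
  [ "$backend" = "10.1.0.1" ] && failed=$((failed + 1))
  i=$((i + 1))
done
echo "$failed of $total lookups failed"   # prints: 500 of 1000 lookups failed
```

The model alternates deterministically where the real rules choose at random, but the long-run result is the same: half of all lookups go to a dead pod.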

You could work around this problem by using taints and tolerations to keep CoreDNS off the master node, but then, with the scheduler gone, you can still only survive the loss of one of the 2 nodes (out of N) that happen to host the CoreDNS pods. No single point of failure, but hardly resilient for any N > 3.
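
One way to apply that workaround on a kubeadm cluster is to drop CoreDNS’s toleration for the master taint. The tolerations shown here are the kubeadm defaults at the time of writing; they vary by version, so inspect yours first:

```shell
# Check which taints CoreDNS currently tolerates:
kubectl --namespace=kube-system get deployment coredns \
  -o jsonpath='{.spec.template.spec.tolerations}'

# Replace the tolerations list, removing the one for the master's
# node-role.kubernetes.io/master:NoSchedule taint so the pods can
# no longer land on the master:
kubectl --namespace=kube-system patch deployment coredns --type=merge -p '
{"spec":{"template":{"spec":{"tolerations":[
  {"key":"CriticalAddonsOnly","operator":"Exists"}]}}}}'
```

With `--type=merge` the tolerations list is replaced wholesale, so make sure the new list still contains every toleration you want to keep.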

It’s not just DNS

If your applications use Services for communication, then without the master you are still one node failure away from an outage, and a data center outage can take down even a multi-AZ application. If you’re deploying a very small cluster, you may be tempted to remove the taints from the master so that it can also schedule workload pods; as with the DNS problem, this again leaves you vulnerable to a master failure.
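
For reference, the taint removal mentioned above is a one-liner on a kubeadm cluster (the node name “master” matches the earlier kubectl output; the taint key is the kubeadm default):

```shell
# Allow workload pods onto the master by removing its NoSchedule taint.
# The trailing "-" means "remove this taint".
kubectl taint nodes master node-role.kubernetes.io/master:NoSchedule-
```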

You need a multi-master setup, except when…

The only time it is safe to run a single Kubernetes master is when every pod required for your application runs on the same set of nodes. This is a very specific requirement, probably only worth the effort for very small clusters trying to save costs, for self-hosted HA applications that bundle Kubernetes, or for teams in the initial stages of migrating from a docker-compose setup. In any case, here’s how to do it:

  1. Install Kubernetes 1.17+ with the ServiceTopology feature enabled. Here is an example of how to do this with kubeadm.
  2. Patch all of your Service objects so that they prefer routing to a local pod when one is available. See this example from the Kubernetes docs.
  3. If using ingress-nginx, configure it to use the Service IP instead of proxying directly to the pods. See this example from the ingress-nginx docs.
  4. If you’re using kubeadm instead of kops, patch the kube-dns Service in the same way so that DNS lookups also prefer the local CoreDNS pod.
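
Steps 1 and 2 can be sketched as follows. The version numbers, feature-gate names and the Service name here are assumptions to verify against your cluster; in 1.17 the alpha ServiceTopology gate also requires the EndpointSlice gate, and both must be enabled on the API server and on kube-proxy:

```shell
# Step 1: kubeadm config enabling the ServiceTopology feature gate on the
# control plane and on kube-proxy (Kubernetes 1.17 alpha gates):
cat > kubeadm-config.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.17.0
apiServer:
  extraArgs:
    feature-gates: "ServiceTopology=true,EndpointSlice=true"
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
featureGates:
  ServiceTopology: true
  EndpointSlice: true
EOF
sudo kubeadm init --config kubeadm-config.yaml

# Step 2: a Service (hypothetical name "my-app") whose topologyKeys tell
# kube-proxy to prefer an endpoint on the same node, falling back to any
# endpoint ("*") only when no local pod exists:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
  ports:
  - port: 80
  topologyKeys:
  - "kubernetes.io/hostname"
  - "*"
EOF
```

The `"*"` fallback trades locality for availability: without it, a node that loses its local pod would get no endpoints at all.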

The folk wisdom was once true…

The belief that workers could tolerate master failures was true until March 2016, when Kubernetes 1.2 was released and iptables replaced the userspace proxy as the default kube-proxy mode. One key behavioral difference:

If kube-proxy is running in iptables mode and the first Pod that’s selected does not respond, the connection fails. This is different from userspace mode: in that scenario, kube-proxy would detect that the connection to the first Pod had failed and would automatically retry with a different backend Pod.

(from https://kubernetes.io/docs/concepts/services-networking/service/)