Safely Draining a Kubernetes Node

Hello Friends, hope you are doing great. In this post, I will go through the process of safely draining a Kubernetes node in detail.

Kubernetes is designed to be fault tolerant of worker node failures. If a node goes missing due to a hardware problem, an issue at the cloud provider's end, or because Kubernetes stops receiving heartbeat messages from the node for some reason, the K8s control plane (master nodes) is smart enough to handle such issues.

But please note that this does not mean it will be able to resolve every problem. If you would like to read about the components of Kubernetes, please click here.

Some common misconceptions are:

>> If enough resources are available, K8s will reschedule all the pods from the faulty/lost node to another node, so there is nothing to worry about.

>> Every pod will be rescheduled, and the autoscaler will add a new node if required.

Before draining a K8s node, we will first understand the meaning of draining in K8s.

What is Draining?

While performing maintenance, we may need to remove a K8s node from service. In other words, draining is a mechanism that allows us to gracefully move all containers from one node to another.

Since we have multiple nodes in our cluster, all of our applications should be able to continue running even if we take one node out of service.

We can temporarily remove the node, perform our maintenance, and then add it back to the cluster without any interruption. To do this, we need to drain the node.

Draining causes the containers running on the node to be gracefully terminated and potentially moved to another node. This prevents any interruption of service during the maintenance period.

We can drain a node very easily, simply by using the kubectl drain command followed by the node name.
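
For example, to drain the worker node named k8s-worker1 (the node we will use later in this post):

kubectl drain k8s-worker1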

Note: If availability is important for any applications that run or could run on the node(s) that you are draining, configure a PodDisruptionBudget first and then continue following the below guide.
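
As a rough sketch, a PodDisruptionBudget that keeps at least one replica of an application available during a drain could look like the following (the name my-app-pdb and the label app: my-app are placeholders, not objects created in this post; the policy/v1 API is available from Kubernetes 1.21, older clusters use policy/v1beta1):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app

Apply it with kubectl apply -f pdb.yaml, and the evictions triggered by kubectl drain will refuse to take the matching application below one available replica.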

Anatomy of a drain procedure in Kubernetes: the grace period

  1. When we execute kubectl drain for a node, Kubernetes first checks whether there are any pods with local data on disk. If there are, we need to add the below option to the drain command to override this check.

--delete-local-data=true

2. Also, a node may contain pods that are managed by a DaemonSet. If we have any DaemonSet pods running on the node and we just run kubectl drain followed by the node name, we are going to get an error message telling us that Kubernetes cannot drain the node because of those DaemonSets.

We can ignore DaemonSet-managed pods by passing the ignore-daemonsets flag (as mentioned below), and kubectl drain will go ahead and run.

--ignore-daemonsets=true

3. At last, kubectl also checks whether the node has any pods that are not managed by a Kubernetes controller (such as a DaemonSet, ReplicaSet, ReplicationController, StatefulSet, Job, etc.) or by any other operator.

If this is the case, the drain operation will be refused unless we use the below option with the drain command.

--force=true
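
Putting these checks together, a drain of a node that has pods with local data, DaemonSet pods, and unmanaged pods could combine all three overrides in one command (k8s-worker1 is the worker node we drain later in this post; note that newer kubectl versions rename --delete-local-data to --delete-emptydir-data):

kubectl drain k8s-worker1 --delete-local-data=true --ignore-daemonsets=true --force=true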

If all the above checks have been passed or overridden, K8s proceeds as follows:

All the pods on the node are notified about this event and are put into a Terminating state. During this time, the controllers or operators managing them have a chance to take some action. Although the pods receive the termination notice, they can keep running until the operator has removed all of their finalizers. This gives the operator a chance to sort things out (for example, moving data away from the pod in the case of StatefulSets).

However, there is a limit to this tolerance by K8s, and that is called the grace period. If the grace period is exceeded but the pod has not actually terminated, it is killed forcefully. If this happens, the operator has no choice but to remove the pod and drop its PersistentVolumeClaims and PersistentVolumes (data). This situation should not arise; otherwise it will lead to a failure incident in the pods of StatefulSets and must be handled by fail-over management.
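
For reference, the grace period of a pod is controlled by spec.terminationGracePeriodSeconds in its manifest (30 seconds by default), and kubectl drain can override it for all evicted pods with the --grace-period flag; the value below is only an example:

kubectl drain k8s-worker1 --grace-period=120

A negative value (the default, -1) means each pod's own setting is used.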

Draining a node:

To drain a node, we will first list the nodes in the K8s cluster by executing the below command:

kubectl get nodes

cloud_user@k8s-control:~$ kubectl get nodes
NAME          STATUS   ROLES                  AGE     VERSION
k8s-control   Ready    control-plane,master   2m34s   v1.21.0
k8s-worker1   Ready    <none>                 22s     v1.21.0
k8s-worker2   Ready    <none>                 11s     v1.21.0
cloud_user@k8s-control:~$

Also, we are going to create some objects to test this draining process in detail.

So first, we will create a pod. Use the below content to create a pod in the cluster.

cloud_user@k8s-control:~$ vim pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
  restartPolicy: OnFailure
cloud_user@k8s-control:~$ kubectl apply -f pod.yaml
pod/my-pod created
cloud_user@k8s-control:~$ kubectl get pods -o wide
NAME     READY   STATUS    RESTARTS   AGE   IP               NODE          NOMINATED NODE   READINESS GATES
my-pod   1/1     Running   0          15s   192.168.194.65   k8s-worker1   <none>           <none>
cloud_user@k8s-control:~$

Now, we will create a deployment with 2 replicas. Execute the below command to get the deployment file structure.

kubectl create deployment my-deployment --image=nginx --dry-run=client -o yaml

Note: If you use the older --dry-run form without =client, you will see a deprecation warning:

kubectl create deployment my-deployment --image=nginx --dry-run -o yaml
W0614 02:00:02.591942 21506 helpers.go:557] --dry-run is deprecated and can be replaced with --dry-run=client.

cloud_user@k8s-control:~$ kubectl create deployment my-deployment --image=nginx --dry-run=client -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: my-deployment
  name: my-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-deployment
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: my-deployment
    spec:
      containers:
      - image: nginx
        name: nginx
        resources: {}
status: {}
cloud_user@k8s-control:~$

Copy the above deployment structure and paste it into a deployment file, setting replicas to 2.

cloud_user@k8s-control:~$ vim deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: my-deployment
  name: my-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-deployment
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: my-deployment
    spec:
      containers:
      - image: nginx
        name: nginx
        resources: {}
status: {}
cloud_user@k8s-control:~$ kubectl apply -f deployment.yaml
deployment.apps/my-deployment created

Now run kubectl get pods with the -o wide option to see on which nodes our objects were created.

cloud_user@k8s-control:~$ kubectl get pods -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
my-deployment-57d86476b6-gdkfm   1/1     Running   0          53s     192.168.126.1    k8s-worker2   <none>           <none>
my-deployment-57d86476b6-krm8g   1/1     Running   0          52s     192.168.126.2    k8s-worker2   <none>           <none>
my-pod                           1/1     Running   0          2m39s   192.168.194.65   k8s-worker1   <none>           <none>
cloud_user@k8s-control:~$

I see that my deployment pods were created on the same worker node. To test the scenarios of points 2 and 3 (explained above), we need one replica of my-deployment to be scheduled on the other node (k8s-worker1). So we will delete one pod of my-deployment and, hopefully, the replacement pod will go to the other node.

cloud_user@k8s-control:~$ kubectl delete pod my-deployment-57d86476b6-krm8g
pod "my-deployment-57d86476b6-krm8g" deleted
cloud_user@k8s-control:~$

I have deleted one pod; now the new pod should be created on another node. Let’s check this.

cloud_user@k8s-control:~$ kubectl get pods -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
my-deployment-57d86476b6-855nv   1/1     Running   0          18s     192.168.194.66   k8s-worker1   <none>           <none>
my-deployment-57d86476b6-gdkfm   1/1     Running   0          2m2s    192.168.126.1    k8s-worker2   <none>           <none>
my-pod                           1/1     Running   0          3m48s   192.168.194.65   k8s-worker1   <none>           <none>
cloud_user@k8s-control:~$

I can see that a pod of my-deployment has been created on a different node. Now we will try to drain the node “k8s-worker1”.

cloud_user@k8s-control:~$ kubectl drain k8s-worker1
node/k8s-worker1 cordoned
error: unable to drain node "k8s-worker1", aborting command...

There are pending nodes to be drained:
 k8s-worker1
cannot delete Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): default/my-pod
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-bhmls, kube-system/kube-proxy-pcpdh
cloud_user@k8s-control:~$

You can see that we have encountered an error message. If you look closely at the error message, it says the node has a pod which is not managed by “ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet” (as explained in point 3).

Also, it cannot delete DaemonSet-managed pods (as explained in point 2). You can also see the DaemonSet-managed pods in the error message itself.

So to ignore/override these checks, we have to use the options suggested in the drain error message.

cloud_user@k8s-control:~$ kubectl drain k8s-worker1 --force=true --ignore-daemonsets=true
node/k8s-worker1 already cordoned
WARNING: deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: default/my-pod; ignoring DaemonSet-managed Pods: kube-system/calico-node-bhmls, kube-system/kube-proxy-pcpdh
evicting pod default/my-pod
evicting pod default/my-deployment-57d86476b6-855nv
pod/my-pod evicted
pod/my-deployment-57d86476b6-855nv evicted
node/k8s-worker1 evicted
cloud_user@k8s-control:~$

Note: You can use either --ignore-daemonsets or --ignore-daemonsets=true. Both have the same meaning, as the --ignore-daemonsets option defaults to true when specified. The same applies to the --force option.

What happens when the drain command executes successfully?

  1. The node is cordoned, which means that no new pods can be placed on this node. In Kubernetes terms, the taint node.kubernetes.io/unschedulable:NoSchedule is placed on the node, and most pods do not tolerate it.
  2. All pods, except the ones that belong to DaemonSets, are evicted; pods managed by a ReplicationController, ReplicaSet, Job, StatefulSet, or similar controller are hopefully rescheduled on another node, while unmanaged pods (like our my-pod) are simply deleted.
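
You can verify both effects after a drain by inspecting the node; the grep filter is just a convenience:

kubectl describe node k8s-worker1 | grep -i taint
kubectl get node k8s-worker1 -o jsonpath='{.spec.taints}'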

Now check the nodes by executing the below command.

cloud_user@k8s-control:~$ kubectl get nodes
NAME          STATUS                     ROLES                  AGE     VERSION
k8s-control   Ready                      control-plane,master   9m58s   v1.21.0
k8s-worker1   Ready,SchedulingDisabled   <none>                 7m46s   v1.21.0
k8s-worker2   Ready                      <none>                 7m35s   v1.21.0
cloud_user@k8s-control:~$

You can now see that my node “k8s-worker1” has SchedulingDisabled, which means no pods will get scheduled on this node until we uncordon it and bring it back to the Ready-for-scheduling state.

cloud_user@k8s-control:~$ kubectl get pods -o wide
NAME                             READY   STATUS    RESTARTS   AGE    IP              NODE          NOMINATED NODE   READINESS GATES
my-deployment-57d86476b6-24h55   1/1     Running   0          78s    192.168.126.3   k8s-worker2   <none>           <none>
my-deployment-57d86476b6-gdkfm   1/1     Running   0          5m4s   192.168.126.1   k8s-worker2   <none>           <none>
cloud_user@k8s-control:~$

Since scheduling is disabled on the node “k8s-worker1”, we can now do our maintenance task, and once it is done, we need to bring the node back to its normal condition.

To bring the node back, execute the below command:

cloud_user@k8s-control:~$ kubectl uncordon k8s-worker1
node/k8s-worker1 uncordoned
cloud_user@k8s-control:~$
cloud_user@k8s-control:~$ kubectl get nodes
NAME          STATUS   ROLES                  AGE     VERSION
k8s-control   Ready    control-plane,master   11m     v1.21.0
k8s-worker1   Ready    <none>                 9m14s   v1.21.0
k8s-worker2   Ready    <none>                 9m3s    v1.21.0
cloud_user@k8s-control:~$

We will again delete one pod of my-deployment and see if the new pod has been scheduled on the node “k8s-worker1”.

cloud_user@k8s-control:~$ kubectl delete pod my-deployment-57d86476b6-gdkfm
pod "my-deployment-57d86476b6-gdkfm" deleted
cloud_user@k8s-control:~$
cloud_user@k8s-control:~$ kubectl get pods -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP               NODE          NOMINATED NODE   READINESS GATES
my-deployment-57d86476b6-24h55   1/1     Running   0          46m   192.168.126.3    k8s-worker2   <none>           <none>
my-deployment-57d86476b6-gks2w   1/1     Running   0          44s   192.168.194.67   k8s-worker1   <none>           <none>
cloud_user@k8s-control:~$

Now we see that the new pod has been scheduled on the node “k8s-worker1”. With this, I am concluding the process of “Safely Draining a Kubernetes Node”. You can also configure a “PodDisruptionBudget” to ensure that your workloads remain available during maintenance.
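
If you have configured PodDisruptionBudgets (as sketched earlier in this post), you can check how many voluntary disruptions they currently allow with:

kubectl get poddisruptionbudgets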

Summary: In this post, we have learned how we can safely drain a Kubernetes node, perform our maintenance tasks, and attach it back to the cluster for scheduling. We have also learned about the different stages involved in the draining process.

Feel free to let me know in the comment section below if you still have any questions. I will be happy to answer them.

Have a great day and stay safe!! 🙂

References: Kubernetes Official Documentation, ArangoDB Documentation
