Hello friends, hope you are doing great. In this post, I will go through the process of Safely Draining a Kubernetes Node in detail.
Kubernetes is designed to be fault tolerant of worker node failures. If a node goes missing due to a hardware problem or an issue at the cloud provider's end, or if K8s stops receiving heartbeat messages from a node for some reason, the K8s control plane (master nodes) is smart enough to handle such situations.
But please note that this does not mean it will be able to resolve every problem. If you would like to read about the components of Kubernetes, please click here.
Two common misconceptions are:
>> If enough resources are available, K8s will reschedule all the pods from the faulty/lost node to other nodes, so there is nothing to worry about.
>> Every pod will be rescheduled, and the autoscaler will add a new node if required.
Before draining a K8s node, let us first understand what draining means in K8s.
What is Draining?
While performing maintenance, we may need to remove a K8s node from service. In other words, draining is a mechanism that allows us to gracefully move all containers from one node to another.
When we are performing maintenance on our Kubernetes cluster, we may sometimes need to remove a node from service. Since we have multiple nodes in the cluster, all of our applications should be able to keep running even if one node is removed.
We can temporarily remove the node, perform our maintenance, and then add it back to the cluster without any interruption. To do this, we need to drain the node.
Draining causes the containers running on the node to be gracefully terminated and, where possible, moved to another node, so that there is no interruption of service during the maintenance period.
We can drain a node very easily using the kubectl drain command followed by the node name.
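In its simplest form (the node name below is just a placeholder), that is:
kubectl drain <node-name>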
Note: If availability is important for any applications that run or could run on the node(s) you are draining, configure a PodDisruptionBudget first and then continue with the guide below.
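As a minimal sketch (the name and label selector here are illustrative and must match your own workload), a PodDisruptionBudget that keeps at least one replica available during voluntary disruptions such as a drain could look like this:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-deployment-pdb
spec:
  minAvailable: 1          # never let a voluntary disruption take the last available replica
  selector:
    matchLabels:
      app: my-deployment   # must match the labels of the pods you want to protect
Apply it with kubectl apply -f pdb.yaml before starting the drain.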
Anatomy of a drain procedure in Kubernetes: the grace period
1. When we execute kubectl drain for a node, Kubernetes first checks whether there are any pods using local data on disk (for example, emptyDir volumes). If there are, we need to add the below option to the drain command to override this check:
--delete-local-data=true
2. A node also runs pods that are managed by a DaemonSet. So if we have any DaemonSet pods running on the node and we just run kubectl drain followed by the node name, we are going to get an error message telling us that Kubernetes cannot drain the node because of those DaemonSets. We can ignore DaemonSet pods by passing the flag below, and kubectl drain will then go ahead and run:
--ignore-daemonsets=true
3. Finally, it is also checked whether the node has any pods that are not managed by Kubernetes itself (via a DaemonSet, ReplicaSet, ReplicationController, StatefulSet, Job, etc.) or by any other operator. If this is the case, the drain operation will be refused unless we use the below option with the drain command (a combined example follows this list):
--force=true
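Putting the three overrides together, a combined drain command would look like the sketch below. Use these flags deliberately: --force deletes unmanaged pods outright and --delete-local-data discards their local data (newer kubectl versions name this flag --delete-emptydir-data instead):
kubectl drain <node-name> --delete-local-data=true --ignore-daemonsets=true --force=true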
If all the above checks pass (or are explicitly overridden), then Kubernetes proceeds as follows:
All the pods on the node are notified about this event and are put into a terminating state. During this time, they (and the operator managing them) have a chance to take some action. Although the pods receive the termination notice, they can keep running until the operator has removed all finalizers. This gives the operator a chance to sort things out (for example, moving data away from the pod in the case of StatefulSets).
However, there is a limit to this tolerance, and it is called the grace period. If the grace period is exceeded but the pod has not actually terminated, it is killed forcefully. If this happens, the operator has no choice but to remove the pod and drop its persistent volume claim and persistent volumes (data). This situation should not be allowed to arise, or it will lead to a failure incident in the pods of StatefulSets that must be handled by fail-over management.
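To illustrate (the values below are arbitrary examples, not taken from this post), the grace period can be set per pod via terminationGracePeriodSeconds in the pod spec, and kubectl drain also accepts a --grace-period flag that overrides it for the duration of the drain:
spec:
  terminationGracePeriodSeconds: 120   # give the pod's containers up to 2 minutes to shut down cleanly
  containers:
  - name: nginx
    image: nginx:1.14.2

kubectl drain <node-name> --grace-period=60   # force a 60-second grace period during this drain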
Draining a node:
To drain a node, we will first list the nodes in the K8s cluster by executing the below command:
kubectl get nodes
cloud_user@k8s-control:~$ kubectl get nodes
NAME          STATUS   ROLES                  AGE     VERSION
k8s-control   Ready    control-plane,master   2m34s   v1.21.0
k8s-worker1   Ready    <none>                 22s     v1.21.0
k8s-worker2   Ready    <none>                 11s     v1.21.0
cloud_user@k8s-control:~$
Also, we are going to create some objects to test this draining process in detail.
So first, we will create a pod. Use the below content to create a pod in the cluster.
cloud_user@k8s-control:~$ vim pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
  restartPolicy: OnFailure
cloud_user@k8s-control:~$ kubectl apply -f pod.yaml
pod/my-pod created
cloud_user@k8s-control:~$ kubectl get pods -o wide
NAME     READY   STATUS    RESTARTS   AGE   IP               NODE          NOMINATED NODE   READINESS GATES
my-pod   1/1     Running   0          15s   192.168.194.65   k8s-worker1   <none>           <none>
cloud_user@k8s-control:~$
Now, we will create a deployment with 2 replicas. Execute the below command to get the deployment file structure.
kubectl create deployment my-deployment --image=nginx --dry-run=client -o yaml
Note: If you run kubectl create deployment my-deployment --image=nginx --dry-run -o yaml (without =client), you will see the warning below, because the bare --dry-run flag is deprecated:
W0614 02:00:02.591942   21506 helpers.go:557] --dry-run is deprecated and can be replaced with --dry-run=client.
cloud_user@k8s-control:~$ kubectl create deployment my-deployment --image=nginx --dry-run=client -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: my-deployment
  name: my-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-deployment
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: my-deployment
    spec:
      containers:
      - image: nginx
        name: nginx
        resources: {}
status: {}
cloud_user@k8s-control:~$
Copy the above deployment structure and paste it into a deployment file, changing replicas to 2.
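As an optional shortcut (the same command as above, just redirected), you could also write the generated YAML straight into the file and then only edit the replicas field:
kubectl create deployment my-deployment --image=nginx --dry-run=client -o yaml > deployment.yaml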
cloud_user@k8s-control:~$ vim deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: my-deployment
  name: my-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-deployment
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: my-deployment
    spec:
      containers:
      - image: nginx
        name: nginx
        resources: {}
status: {}
cloud_user@k8s-control:~$ kubectl apply -f deployment.yaml
deployment.apps/my-deployment created
Now run kubectl get pods with the -o wide option to see which nodes our objects were created on.
cloud_user@k8s-control:~$ kubectl get pods -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
my-deployment-57d86476b6-gdkfm   1/1     Running   0          53s     192.168.126.1    k8s-worker2   <none>           <none>
my-deployment-57d86476b6-krm8g   1/1     Running   0          52s     192.168.126.2    k8s-worker2   <none>           <none>
my-pod                           1/1     Running   0          2m39s   192.168.194.65   k8s-worker1   <none>           <none>
cloud_user@k8s-control:~$
I see that both my-deployment pods were created on the same worker node. To test the scenarios in points 2 and 3 (explained above), we need one replica of my-deployment to be scheduled on the other node (k8s-worker1). So we will delete one pod of my-deployment, and the replacement pod should end up on the other node.
cloud_user@k8s-control:~$ kubectl delete pod my-deployment-57d86476b6-krm8g
pod "my-deployment-57d86476b6-krm8g" deleted
cloud_user@k8s-control:~$
I have deleted one pod; the new pod should now be created on another node. Let's check this.
cloud_user@k8s-control:~$ kubectl get pods -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
my-deployment-57d86476b6-855nv   1/1     Running   0          18s     192.168.194.66   k8s-worker1   <none>           <none>
my-deployment-57d86476b6-gdkfm   1/1     Running   0          2m2s    192.168.126.1    k8s-worker2   <none>           <none>
my-pod                           1/1     Running   0          3m48s   192.168.194.65   k8s-worker1   <none>           <none>
cloud_user@k8s-control:~$
I can see that a pod of my-deployment has been created on a different node. Now we will try to drain the node “k8s-worker1”.
cloud_user@k8s-control:~$ kubectl drain k8s-worker1
node/k8s-worker1 cordoned
error: unable to drain node "k8s-worker1", aborting command...

There are pending nodes to be drained:
 k8s-worker1
cannot delete Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): default/my-pod
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-bhmls, kube-system/kube-proxy-pcpdh
cloud_user@k8s-control:~$
You can see that we have encountered an error message. If you look closely at it, it says the node has a pod which is not managed by a “ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet” (as explained in point number 3).
Also, it cannot delete DaemonSet-managed pods (as explained in point number 2). You can see the DaemonSet-managed pods listed in the error itself.
So to ignore/override these checks, we have to use the options suggested in the drain error message.
cloud_user@k8s-control:~$ kubectl drain k8s-worker1 --force=true --ignore-daemonsets=true
node/k8s-worker1 already cordoned
WARNING: deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: default/my-pod; ignoring DaemonSet-managed Pods: kube-system/calico-node-bhmls, kube-system/kube-proxy-pcpdh
evicting pod default/my-pod
evicting pod default/my-deployment-57d86476b6-855nv
pod/my-pod evicted
pod/my-deployment-57d86476b6-855nv evicted
node/k8s-worker1 evicted
cloud_user@k8s-control:~$
Note: You can use either --ignore-daemonsets or --ignore-daemonsets=true. Both mean the same thing, because the --ignore-daemonsets option defaults to true when it is passed. The same applies to the --force option.
What happens when the drain command executes successfully?
- The node is cordoned, which means that no new pods can be placed on it. In Kubernetes terms, a taint node.kubernetes.io/unschedulable:NoSchedule is placed on the node, which most pods do not tolerate.
- Pods, except the ones that belong to DaemonSets, are evicted and hopefully scheduled on another node. Pods that are not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet are simply deleted (when --force is used) and will not be rescheduled.
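You can verify the cordon yourself; for example (output abbreviated, and it will vary by cluster), the taint shows up in the node description, and kubectl cordon / kubectl uncordon apply and remove just this first step without evicting anything:
kubectl describe node k8s-worker1 | grep -i taints
# Taints: node.kubernetes.io/unschedulable:NoSchedule
kubectl cordon k8s-worker1     # mark the node unschedulable only
kubectl uncordon k8s-worker1   # make it schedulable again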
Now check the nodes by executing the below command.
cloud_user@k8s-control:~$ kubectl get nodes
NAME          STATUS                     ROLES                  AGE     VERSION
k8s-control   Ready                      control-plane,master   9m58s   v1.21.0
k8s-worker1   Ready,SchedulingDisabled   <none>                 7m46s   v1.21.0
k8s-worker2   Ready                      <none>                 7m35s   v1.21.0
cloud_user@k8s-control:~$
You can now see that the node “k8s-worker1” has SchedulingDisabled, which means no pods will get scheduled on it until we uncordon the node and bring it back to a schedulable (Ready) state.
cloud_user@k8s-control:~$ kubectl get pods -o wide
NAME                             READY   STATUS    RESTARTS   AGE    IP              NODE          NOMINATED NODE   READINESS GATES
my-deployment-57d86476b6-24h55   1/1     Running   0          78s    192.168.126.3   k8s-worker2   <none>           <none>
my-deployment-57d86476b6-gdkfm   1/1     Running   0          5m4s   192.168.126.1   k8s-worker2   <none>           <none>
cloud_user@k8s-control:~$
Since scheduling is disabled on the node “k8s-worker1”, we can now carry out our maintenance tasks (a hypothetical example follows); once they are done, we need to bring the node back to its normal state.
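As an illustration only (these commands are hypothetical examples for a Debian/Ubuntu based worker and are not part of the drain procedure itself), the maintenance might be OS patching followed by a reboot, run on k8s-worker1 directly:
sudo apt-get update && sudo apt-get -y upgrade
sudo reboot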
To bring the node back into service (uncordon it), execute the below command:
cloud_user@k8s-control:~$ kubectl uncordon k8s-worker1
node/k8s-worker1 uncordoned
cloud_user@k8s-control:~$
cloud_user@k8s-control:~$ kubectl get nodes
NAME          STATUS   ROLES                  AGE     VERSION
k8s-control   Ready    control-plane,master   11m     v1.21.0
k8s-worker1   Ready    <none>                 9m14s   v1.21.0
k8s-worker2   Ready    <none>                 9m3s    v1.21.0
cloud_user@k8s-control:~$
We will again delete one pod of my-deployment and see if the new pod has been scheduled on the node “k8s-worker1”.
cloud_user@k8s-control:~$ kubectl delete pod my-deployment-57d86476b6-gdkfm
pod "my-deployment-57d86476b6-gdkfm" deleted
cloud_user@k8s-control:~$
cloud_user@k8s-control:~$ kubectl get pods -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP               NODE          NOMINATED NODE   READINESS GATES
my-deployment-57d86476b6-24h55   1/1     Running   0          46m   192.168.126.3    k8s-worker2   <none>           <none>
my-deployment-57d86476b6-gks2w   1/1     Running   0          44s   192.168.194.67   k8s-worker1   <none>           <none>
cloud_user@k8s-control:~$
Now we can see that the new pod has been scheduled on the node “k8s-worker1”. With this, I am concluding the process of “Safely Draining a Kubernetes Node”. You can also configure a “PodDisruptionBudget” (as shown earlier) to ensure that your workloads remain available during maintenance.
Summary: In this post, we have learned how to safely drain a Kubernetes node, perform our maintenance tasks, and attach the node back to the cluster for scheduling. We have also learned about the different stages involved in the draining process.
Feel free to let me know in the comment section below if you still have any questions. I will be happy to answer them.
Have a great day and stay safe!!
Reference: Kubernetes Official Page, ArangoDB Doc
My name is Shashank Shekhar. I am a DevOps Engineer, currently working in one of the best companies in India. I have around 5 years of experience in Linux server administration and DevOps tools.
I love working in a Linux environment and love learning new things.