Automated Troubleshooting of Kubernetes (K8s) Pod Issues


In this blog post, I will show you how we automated the troubleshooting of K8s Pod restart issues. By writing and deploying a K8s custom controller, we automatically collect K8s Pod Restart Reasons, Logs, and Events and send them to a Slack channel, which makes the process much easier than before.

Overview of the Data Collected

First, let’s take a look at two screenshots of example Slack messages.

Brief Alert Message

[Screenshot: a brief alert message with a Show more button]

Detailed Alert Message

As shown below, by clicking Show more, we can see the Reason, Pod Status, Pod Events, Node Status and Events, and Pod Logs Before Restart.

[Screenshot: a detailed alert message with a Show less button]

The Pod Status section displays the Restart Count, State, Last State, Reason, and the container’s Limits and Requests settings.

According to the message above, the Pod was restarted because it was OOMKilled (exit code 137, which means the container was killed after exceeding its memory limit). We can also view the Pod logs before the restart and see that the Memory Limit is set to 1Gi.

It’s very clear, efficient, and time-saving, isn’t it?

Why We Built This

At Airwallex, we have thousands of Pods running on more than a hundred K8s clusters. In K8s, Pods are considered to be relatively ephemeral (rather than durable) resources. Pod restart events occur pretty often in clusters of this size.

Manually troubleshooting each K8s Pod issue is time-consuming and inefficient, and our engineers used to repeat the same steps again and again.

“Why can’t we automate the whole process?” one of our DevOps engineers asked in an alert review meeting. “Why not?” Two other DevOps engineers immediately showed interest and began to research.

The next two sections show the detailed steps and how to automate them.

Troubleshoot Pod Issues

In the past, when a Pod restarted, we had to run the following commands to analyze the context manually.

1. Check Pod RESTARTS, READY, and STATUS

$ kubectl get pod demoservice-56d5f9f7ff-slr7d
NAME                           READY   STATUS    RESTARTS   AGE
demoservice-56d5f9f7ff-slr7d   1/2     Running   2          164h13m57s

2. Check Pod Restart Reason, Last State, and resource settings. Pay special attention to the resource Limits and Requests when troubleshooting OOMKilled issues.

$ kubectl describe pod demoservice-56d5f9f7ff-slr7d
...
    Ready:          false
    Restart Count:  2
    State:          Running
      Started:      Wed, 10 Aug 2022 02:34:48 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 08 Aug 2022 07:28:33 +0000
      Finished:     Wed, 10 Aug 2022 02:34:46 +0000
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     20m
      memory:  500Mi
...

3. Check Pod Events via kubectl get events | grep <podName>

4. Check Pod logs before restart via kubectl logs --previous <podName>

5. Check Node status via kubectl get node <nodeName>

6. Check Node Events via kubectl get events | grep <nodeName>

Automate It

Here are two methods to automate the troubleshooting steps mentioned above.

Method #1: Writing a Collector with Bash Script and Kubectl

This is a very simple method. We run the kubectl get pod -A command periodically and compare each Pod’s RESTARTS count with the previous run. If the RESTARTS value has risen, the Pod restarted, and the kubectl commands above are then run in sequence to collect the related information.

Finally, the collected information will be posted to a Slack channel using Slack Incoming Webhooks.
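
Our original collector was a Bash script, but the polling idea is easy to show in a few dozen lines. Below is a minimal Go sketch of the same logic; the webhook URL, the one-minute interval, and the message format are placeholders, not our production values.

package main

import (
    "bytes"
    "fmt"
    "net/http"
    "os/exec"
    "strconv"
    "strings"
    "time"
)

const webhookURL = "https://hooks.slack.com/services/XXX/YYY/ZZZ" // placeholder

// listRestarts maps "namespace/pod" to its restart count by parsing
// `kubectl get pod -A --no-headers` output.
func listRestarts() (map[string]int, error) {
    out, err := exec.Command("kubectl", "get", "pod", "-A", "--no-headers").Output()
    if err != nil {
        return nil, err
    }
    counts := map[string]int{}
    for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
        // Columns: NAMESPACE NAME READY STATUS RESTARTS AGE
        f := strings.Fields(line)
        if len(f) < 5 {
            continue
        }
        n, _ := strconv.Atoi(f[4])
        counts[f[0]+"/"+f[1]] = n
    }
    return counts, nil
}

// notify posts a message to Slack via an Incoming Webhook.
func notify(msg string) {
    body := fmt.Sprintf(`{"text": %q}`, msg)
    resp, err := http.Post(webhookURL, "application/json", bytes.NewBufferString(body))
    if err == nil {
        resp.Body.Close()
    }
}

func main() {
    prev, _ := listRestarts()
    for range time.Tick(time.Minute) {
        cur, err := listRestarts()
        if err != nil {
            continue
        }
        for pod, n := range cur {
            if n > prev[pod] {
                // A restart happened: run the describe/logs/events
                // commands from the previous section and include
                // their output in the message.
                notify(fmt.Sprintf("Pod %s restarted (restart count: %d)", pod, n))
            }
        }
        prev = cur
    }
}

This works, but every cycle lists every Pod in the cluster, which is one reason we moved to the watch-based approach described next.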

Method #2: Writing a K8s Custom Controller Using client-go Library

The first method is straightforward but inefficient. Instead, we can write a Kubernetes custom controller using the client-go library to watch Pod changes and collect Pod Restart Reasons, Logs, and Events as soon as a Pod restarts.

A K8s controller is a control loop that watches the state of the cluster through the API server. The client-go library contains several useful utilities for accessing the API and creating custom controllers.
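
Below is a minimal sketch of how such a control loop can detect restart-count increases with a client-go shared informer. It is not our actual collector (that lives in the GitHub repo linked at the end); the 30-second resync period is arbitrary, and the Slack call is left as a print statement.

package main

import (
    "fmt"
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/cache"
)

// restarts sums the restart counts of all containers in a Pod.
func restarts(pod *corev1.Pod) int32 {
    var n int32
    for _, cs := range pod.Status.ContainerStatuses {
        n += cs.RestartCount
    }
    return n
}

func main() {
    cfg, err := rest.InClusterConfig() // assumes the controller runs inside the cluster
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // A shared informer watches Pods through the API server and
    // calls our handler on every update.
    factory := informers.NewSharedInformerFactory(client, 30*time.Second)
    podInformer := factory.Core().V1().Pods().Informer()

    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        UpdateFunc: func(oldObj, newObj interface{}) {
            oldPod := oldObj.(*corev1.Pod)
            newPod := newObj.(*corev1.Pod)
            if restarts(newPod) > restarts(oldPod) {
                // A container restarted: collect the Pod status, events,
                // node info, and previous logs here, then post to Slack.
                fmt.Printf("Pod %s/%s restarted\n", newPod.Namespace, newPod.Name)
            }
        },
    })

    stop := make(chan struct{})
    defer close(stop)
    factory.Start(stop)
    factory.WaitForCacheSync(stop)
    select {} // run forever
}

Because the informer pushes update events from the API server, the controller reacts to a restart within seconds instead of waiting for the next polling cycle.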

To build a K8s custom controller, please refer to Writing Controllers and client-go examples.
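
Inside such an update handler, each section of the Slack message maps to an API call. For example, the “Pod Logs Before Restart” section is the programmatic equivalent of kubectl logs --previous. Here is a sketch, assuming the clientset and imports from the example above plus the standard context package:

// fetchPreviousLogs returns the logs of the Pod's last terminated
// container instance, the equivalent of `kubectl logs --previous`.
func fetchPreviousLogs(client *kubernetes.Clientset, namespace, podName string) (string, error) {
    req := client.CoreV1().Pods(namespace).GetLogs(podName, &corev1.PodLogOptions{
        Previous: true, // logs from before the last restart
    })
    raw, err := req.DoRaw(context.TODO())
    if err != nil {
        return "", err // previous logs may have already been rotated away
    }
    return string(raw), nil
}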

Summary

It is efficient and time-saving to automate the troubleshooting process of K8s Pod issues. At Airwallex, we wrote a Kubernetes custom controller to collect and send K8s Pod Restart Reasons, Logs, and Events to Slack automatically. Now, troubleshooting Pod issues is very easy and has saved our engineers hundreds of hours.

GitHub Repo: https://github.com/airwallex/k8s-pod-restart-info-collector

Acknowledgments

Many thanks to Jetson Pan, Michael Liu, and Michelle Narayan for reviewing this blog.
