This is the second part of https://jrfilocao.medium.com/kubernetes-probes-do-not-make-this-mistake-0f5302f2ff8b.
In the previous article, we learned about best practices for readiness and liveness probes. Additionally, we reviewed a case study where a misconfigured liveness probe led to cascading failures in production.
Let's now reproduce the error on your local machine. Afterwards, we'll simulate and test a solution.
Reproducing the error
We need to simulate a scenario where the liveness and readiness probes encounter failures. To achieve this, we create a server with a health check endpoint. When called by the probes, this endpoint will intentionally respond with an error, replicating the failure conditions.
Step by Step
1) Clone the repository with files for simulating a server and configuring Kubernetes:
git clone https://github.com/jrfilocao/medium-kubernetes.git
2) Install Docker
3) Install kubectl:
- linux: https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/
- macos: https://kubernetes.io/docs/tasks/tools/install-kubectl-macos/
4) Install minikube and start it with Docker as the driver:
minikube start --driver=docker
5) Check that everything is all right with minikube:
minikube status
6) Point your shell to minikube's Docker daemon, so that Kubernetes can use the locally built image of the simulation server:
eval $(minikube -p minikube docker-env)
7) Ensure you are in the same directory as the cloned repository. Build the image for the simulation server, which is based on the code in https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/:
docker build -t go-server .
The endpoint /healthz alternates its response every 60 seconds between the HTTP 200 and HTTP 500 status codes. This simulates a server under heavy load: it is unavailable for 60 seconds after being available for 60 seconds.
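For reference, a handler with this alternating behavior might look like the sketch below. The names (`healthy`, `healthzHandler`) are assumptions for illustration; the actual code in the repository, adapted from the Kubernetes docs example, may differ.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

var started = time.Now()

// healthy reports whether the server should answer 200 at the given
// number of seconds since startup: up for 60s, down for 60s, repeating.
func healthy(elapsedSeconds int) bool {
	return (elapsedSeconds/60)%2 == 0
}

// healthzHandler answers 200 or 500 depending on the current phase.
func healthzHandler(w http.ResponseWriter, r *http.Request) {
	if healthy(int(time.Since(started).Seconds())) {
		w.WriteHeader(http.StatusOK)
		fmt.Fprint(w, "ok")
	} else {
		w.WriteHeader(http.StatusInternalServerError)
		fmt.Fprint(w, "unhealthy")
	}
}

func main() {
	// Use a test server so the example terminates instead of
	// blocking on http.ListenAndServe.
	srv := httptest.NewServer(http.HandlerFunc(healthzHandler))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.StatusCode) // 200 right after startup
}
```

The key point is that the probe outcome depends only on wall-clock time since startup, which is what makes the failure window predictable in the experiment below.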
8) Apply the Kubernetes configuration with the erroneous liveness probe:
kubectl apply -f readiness-liveness-failing.yaml
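The probe settings in that file are presumably along these lines (illustrative values only; check the repository's YAML for the actual ones). The liveness probe gives up quickly while the readiness probe gives up slowly, so Kubernetes restarts the container before it ever takes the pod out of rotation:

```yaml
# Illustrative container probe settings for the failing setup.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10      # checked every 10s
  failureThreshold: 3    # restart after ~30s of failures: too aggressive
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 30      # checked every 30s
  failureThreshold: 10   # not-ready only after ~300s: far too slow
```

Since the server stays unhealthy for 60 seconds at a time, the liveness probe always hits its failure threshold during that window and the container gets killed and restarted over and over.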
9) Check the status of the pod:
kubectl get pods -A
10) You should see many restarts.
11) To track what is happening under /healthz:
kubectl port-forward [name of your pod] 8080
/healthz shows the current HTTP status; /started shows the current second counter, which resets back to zero once it reaches 60:
watch -n 2 curl -v 127.0.0.1:8080/healthz
watch -n 2 curl -v 127.0.0.1:8080/started
Solving the issue
1) Delete the current Kubernetes configuration:
kubectl delete -f readiness-liveness-failing.yaml
2) Apply the configuration which fixes the issue:
kubectl apply -f readiness-liveness-solution.yaml
This configuration simply swaps the periodSeconds and failureThreshold values between the readiness and liveness probes. After being unavailable for 30 seconds, the pod is marked as not ready and no longer receives traffic. Only after 30*10 seconds (5 minutes) would the liveness probe fail and restart the container.
This configuration also sets the replica count to two, so that the remaining pod can serve requests while one pod is unavailable due to a failing readiness probe.
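The fixed settings are presumably along these lines (again illustrative values; the repository's YAML is authoritative). The readiness probe now reacts quickly while the liveness probe is patient, and the replica count lives in the workload spec:

```yaml
# Illustrative fixed settings: the periodSeconds/failureThreshold
# pairs are swapped relative to the failing configuration.
replicas: 2              # workload spec: a second pod keeps serving traffic
```

```yaml
# Container probe settings:
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10      # marked not-ready after ~30s of failures
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 30      # restarted only after ~300s (5 min) of failures
  failureThreshold: 10
```

With these numbers, a 60-second outage of /healthz takes the pod out of rotation but never triggers a restart.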
3) You should not see any restarts. One of the pods becomes temporarily not ready.
Then, once /healthz starts responding with HTTP 200 again, the pod becomes ready on the next successful readiness probe.