Kubernetes probes: do not make this mistake

jrfilocao
4 min read · Dec 30, 2023


What is the idea behind probes?

In general, liveness and readiness probes are essential mechanisms that help ensure fault tolerance and high availability of applications deployed in a Kubernetes cluster.

Readiness probes

They determine if a container is ready to receive incoming traffic. If a readiness probe fails, Kubernetes automatically removes the pod from the Service load balancer, preventing traffic from being directed to a container that is not ready to handle it.

Example of a readiness probe in action: the pod with one container is removed from the Service load balancer after failure. From https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-setting-up-health-checks-with-readiness-and-liveness-probes

Two scenarios can lead to a failing readiness probe:

  • The container is still in the process of starting up.
  • Even after the container has started, heavy load or other issues may prevent it from responding within the configured timeout.

In the latter scenario, readiness probes can be leveraged to implement a protective mechanism akin to a circuit breaker. For instance, if the probe fails X times consecutively (as determined by the failureThreshold), the container is promptly removed from the load balancer. Subsequently, after Y successful probes (as defined by successThreshold), the container’s status is updated to indicate readiness, allowing it to resume receiving traffic.
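
As a hedged illustration of this behavior, a readiness probe with such thresholds might look like the sketch below. The endpoint path, port, and all numbers are assumptions chosen for the example, not values from any configuration discussed in this article.

```yaml
# Sketch of a Pod whose readiness probe acts as a circuit breaker.
# Path, port, and all numbers are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: example-app              # hypothetical name
spec:
  containers:
    - name: app
      image: example/app:1.0     # hypothetical image
      readinessProbe:
        httpGet:
          path: /healthz         # assumed low-cost health endpoint
          port: 8080
        periodSeconds: 5         # probe every 5 seconds
        timeoutSeconds: 1        # count the check as failed if no answer within 1 second
        failureThreshold: 3      # X = 3 consecutive failures -> pod removed from the load balancer
        successThreshold: 2      # Y = 2 consecutive successes -> pod marked ready again
```

With these numbers, an overloaded pod stops receiving new traffic after roughly 15 seconds of failed checks and is added back to the load balancer once it answers two consecutive probes.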

Liveness probes

They determine if a container is running as expected. If a liveness probe fails, Kubernetes automatically restarts the container to attempt to recover it.

Example of a liveness probe in action: the pod with one container is restarted after failure. From https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-setting-up-health-checks-with-readiness-and-liveness-probes

Liveness probes can be essential if they truly indicate unrecoverable application failure, such as a deadlock. But they can also be dangerous if implemented incorrectly.

The Kubernetes docs describe a recommended approach for implementing liveness probes:

A common pattern for liveness probes is to use the same low-cost HTTP endpoint as for readiness probes, but with a higher failureThreshold. This ensures that the pod is observed as not-ready for some period of time before it is hard killed.
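
A minimal sketch of that pattern could look like the fragment below, which would sit inside a container spec. The paths, ports, and numbers are assumptions for illustration only.

```yaml
# Both probes share the same low-cost endpoint; the liveness probe simply
# tolerates more consecutive failures. All values are illustrative.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3       # marked not-ready (and out of the load balancer) after ~30s
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 10      # restarted only after ~100s, long after being marked not-ready
```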

Case study

Recently, I came across a probe configuration similar to the one below. Can you identify the problem?

Example of a Kubernetes configuration with an issue
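
The original snippet was shown as an image. Based solely on the values discussed below, a hedged reconstruction of the relevant part might look like this (the endpoint and port are assumptions):

```yaml
# Reconstruction of the problematic probe settings described in this post.
readinessProbe:
  httpGet:
    path: /healthz        # assumed endpoint
    port: 8080            # assumed port
  periodSeconds: 30
  failureThreshold: 10    # up to 300s before the pod is marked not-ready
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3     # the container is restarted after only 30s
```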

No? About once a month, the application became unstable, and users were unable to use it due to cascading failures.

Let’s take a closer look and break the configuration down:

As quoted above, a liveness probe's failureThreshold should be set higher than the readiness probe's, so that its total detection time (periodSeconds*failureThreshold) is longer.

In the given configuration, it takes up to 30*10 seconds (periodSeconds*failureThreshold => 300 seconds = 5 minutes) for the readiness probe to mark the pod as not-ready and remove it from the load balancer. The liveness probe, however, detects the failure much earlier and restarts the container after only 10*3 seconds (periodSeconds*failureThreshold => 30 seconds).

Clearly, the developer was unaware of the risks of a misconfigured liveness probe. Under high load, a slow container was not removed from the Service load balancer by the readiness probe; instead, it was restarted by the overly aggressive liveness probe.

To make matters worse, the ReplicaSet was configured with only one pod, so every restart took the whole service offline. The failing service then caused errors in other services, ultimately crashing the entire application.

Good Practices for Readiness and Liveness Probes

Let’s recap the definitions of some probe attributes:

  • periodSeconds: defines the interval between probe checks.
  • failureThreshold: specifies the number of consecutive failures before Kubernetes considers the probe failed.
  • successThreshold: specifies the number of consecutive successful probes required after failure to consider the probe successful.

With that in mind, here is a summary of good practices:

  1. Use the same low-cost HTTP endpoint for both readiness and liveness probes.
  2. Set different thresholds for liveness and readiness:
    The readiness probe should have a lower total of periodSeconds*failureThreshold than the liveness probe, so that an unhealthy container is first removed from the load balancer and given a chance to recover before it is restarted.
  3. Run the ReplicaSet (or Deployment) with at least two pods, so that a single restarting pod does not take the whole service down (see the sketch after this list).
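
Applied to the case study, a corrected configuration could look like the sketch below. It is only an illustration: the Deployment name, image, endpoint, and exact numbers are assumptions. What matters is that the readiness probe trips well before the liveness probe and that there is more than one replica.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                # hypothetical name
spec:
  replicas: 2                      # practice 3: at least two pods
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: example/app:1.0   # hypothetical image
          readinessProbe:
            httpGet:
              path: /healthz       # practice 1: same low-cost endpoint for both probes
              port: 8080
            periodSeconds: 10
            failureThreshold: 3    # out of the load balancer after ~30s
            successThreshold: 2    # back in after two consecutive successes
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
            failureThreshold: 12   # practice 2: restarted only after ~120s
```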

Sources

  • Kubernetes best practices: Setting up health checks with readiness and liveness probes (Google Cloud blog): https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-setting-up-health-checks-with-readiness-and-liveness-probes
  • Configure Liveness, Readiness and Startup Probes (Kubernetes documentation): https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/