
[K8s / devops] Reviewing the liveness probe endpoint #4383

Closed
hackintoshrao opened this issue Dec 9, 2019 · 1 comment
Labels
area/kubernetes Related to running Dgraph on K8s area/operations Related to operational aspects of the DB, including signals, flags, env vars, etc. kind/bug Something is broken.

Comments

@hackintoshrao
Contributor

What version of Dgraph are you using?

v1.1

The kubelet uses liveness probes to know when to restart a Container.
For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a Container in such a state can help to make the application more available despite bugs.

Liveness probes automate restarting the database in all the scenarios where it becomes unresponsive.
But an incorrect implementation would restart the pods when it's not necessary to do so.

We should review the current implementation and write a report covering two things:

  • Does the current liveness probe endpoint reliably detect an unresponsive database state?
  • Are there false positives, i.e. restarts triggered while the database is still operational?
@hackintoshrao hackintoshrao added kind/enhancement Something could be better. area/kubernetes Related to running Dgraph on K8s area/devops labels Dec 9, 2019
@fristonio
Contributor

fristonio commented Dec 9, 2019

Currently, we have both a readiness probe and a liveness probe in our helm configuration for dgraph alpha. We are using the same endpoint for both checks, which is a probable bug: when a node applies a raft snapshot, it sets its health status to unhealthy (here), and if probed at that moment Kubernetes will restart the container, which should not happen because the database is still operational.

Liveness Probe

The liveness probe tells us whether the container is dead: a live container may be temporarily unable to serve our business logic, but the process itself is still operational.

For liveness probes, most implementations assume that if our HTTP server can return any response, the application is live (it might not be ready to serve traffic, but it is live). In CockroachDB's case this endpoint simply returns details about the node.

Readiness probe

The readiness probe checks whether the application can serve our business logic, which in Dgraph's case means checking that we can process database transactions. For alphas, the readiness probe defined in the helm chart does something similar by looking at a globally updatable health status.

CockroachDB operates in a similar manner and behaves differently for the liveness and readiness probes.

Further improving the readiness probe would mean integrating health checks deeper into our source code, and deciding what exactly we consider a not-ready state for Dgraph.

@hackintoshrao hackintoshrao added kind/bug Something is broken. and removed kind/enhancement Something could be better. labels Dec 9, 2019
@danielmai danielmai added area/operations Related to operational aspects of the DB, including signals, flags, env vars, etc. and removed area/devops labels Dec 9, 2019