
liveness probe for etcd may cause database crash #2759

Closed
wjentner opened this issue Sep 21, 2022 · 14 comments
Labels
area/etcd priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@wjentner

wjentner commented Sep 21, 2022

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version: &version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.1", GitCommit:"3ddd0f45aa91e2f30c70734b175631bec5b5825a", GitTreeState:"clean", BuildDate:"2022-05-24T12:24:38Z", GoVersion:"go1.18.2", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.14", GitCommit:"bccf857df03c5a99a35e34020b3b63055f0c12ec", GitTreeState:"clean", BuildDate:"2022-09-14T22:36:04Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    bare metal
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a):
Linux k8s-test 5.10.0-15-amd64 #1 SMP Debian 5.10.120-1 (2022-06-09) x86_64 GNU/Linux
  • Container runtime (CRI) (e.g. containerd, cri-o):
    containerd 1.6.8
  • Container networking plugin (CNI) (e.g. Calico, Cilium):
    calico
  • Others:

What happened?

This was first reported in the etcd repository: etcd-io/etcd#14497

Kubeadm creates a manifest for etcd that uses the /health endpoint of etcd in the liveness probe.
When the etcd database exceeds a certain size, the NOSPACE alarm is triggered, putting etcd into maintenance mode and allowing only read and delete operations until the size is reduced and etcdctl alarm disarm is issued.
While the alarm is active, the /health check no longer returns a 200 response, so the etcd member goes into a CrashLoopBackoff.
Because this happens almost simultaneously on all members, the continuous crash loops eventually cause a fatal error after which etcd is no longer able to start up by itself.

The etcd maintainers mention that the /health endpoint should not be used for liveness probes.
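For illustration, you can query the endpoint the kubeadm-generated probe uses directly (a sketch; kubeadm normally exposes a plain-HTTP health/metrics listener on 127.0.0.1:2381 via --listen-metrics-urls, and the exact JSON shape of the response varies by etcd version):

# Query the same endpoint the liveness probe hits (verify the port in /etc/kubernetes/manifests/etcd.yaml).
curl -s http://127.0.0.1:2381/health
# healthy:      {"health":"true"}
# alarm active: {"health":"false","reason":"ALARM NOSPACE"}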

In our case, this caused all etcd members to run into this error:

etcd1: {"level":"warn","ts":"2022-09-16T17:53:40.235Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":893213589,"snapshot-file-path":"/var/lib/etcd/member/snap/00000000353d5b95.snap.db","error":"snap: snapshot file doesn't exist"}

etcd2: {"level":"warn","ts":"2022-09-16T17:47:06.327Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":893216556,"snapshot-file-path":"/var/lib/etcd/member/snap/00000000353d672c.snap.db","error":"snap: snapshot file doesn't exist"}

etcd5: {"level":"warn","ts":"2022-09-16T17:46:30.424Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":893201552,"snapshot-file-path":"/var/lib/etcd/member/snap/00000000353d2c90.snap.db","error":"snap: snapshot file doesn't exist"}

As you can see, all members are on different indices and are not able to recover from any snapshot, which was likely caused by the continuous restarts.

What you expected to happen?

etcd should not crash-loop and cause the database to go into an unrecoverable state.

How to reproduce it (as minimally and precisely as possible)?

An easy way to trigger this behavior is to follow the etcd docs: https://etcd.io/docs/v3.5/op-guide/maintenance/#space-quota

  1. Add the flag --quota-backend-bytes=16777216 (16 MB) to etcd.
  2. Fill the DB: $ while [ 1 ]; do dd if=/dev/urandom bs=1024 count=1024 | ETCDCTL_API=3 etcdctl put key || break; done
  3. Watch the status of etcd with etcdctl endpoint status --cluster -w table; etcdctl alarm list should show the NOSPACE alarm.
  4. Check the /health endpoint, which no longer returns 200.
  5. Observe that the etcd pod goes into a CrashLoopBackoff state.

Note: The error may only occur with multiple etcd nodes.
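For convenience, the steps above condense into roughly the following shell sketch (assuming a locally running etcd and etcdctl on PATH, as in the linked etcd docs):

# 1. Start etcd with a deliberately tiny backend quota (16 MB).
etcd --quota-backend-bytes=16777216 &

# 2. Fill the keyspace until writes start failing with "database space exceeded".
while true; do
  dd if=/dev/urandom bs=1024 count=1024 | ETCDCTL_API=3 etcdctl put key || break
done

# 3. The NOSPACE alarm should now be raised and /health stops reporting healthy.
etcdctl endpoint status --cluster -w table
etcdctl alarm list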

Anything else we need to know?

Without the --quota-backend-bytes flag, the alarm is raised at a DB size of around 2 GB. We have a moderately small cluster that has been running for almost three years and recently reached this size.

@neolit123
Member

@ahrtr @serathius
is this something we have to handle?

latest probe in kubeadm is here:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/etcd/local.go#L206

@neolit123 neolit123 added priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. area/etcd labels Sep 21, 2022
@neolit123 neolit123 added this to the v1.26 milestone Sep 21, 2022
@ahrtr
Member

ahrtr commented Sep 21, 2022

Sorry for that. It's a known issue, which was fixed in 3.5.5.

FYI. etcd-io/etcd#14419

@wjentner
Author

What is the recommended solution?
kubeadm 1.24.6 installs etcd 3.5.3.
It also does not contain the updated liveness probe from the main branch.

Should I force the etcd version 3.5.5?

@ahrtr
Member

ahrtr commented Sep 22, 2022

Kubernetes bumped etcd to 3.5.5 on the master branch in kubernetes/kubernetes#112489. I think etcd 3.5.5 should be bumped into previous stable releases as well. cc @dims @neolit123

@pacoxu
Member

pacoxu commented Sep 22, 2022

etcd-io/etcd#14382 (comment)
The issue seems to be critical.

#!/bin/bash
# Start etcd with a very low snapshot threshold so a snapshot is triggered quickly.
./bin/etcd --snapshot-count=5 &
# Run several health checks; in the linked issue these are enough to trigger a snapshot.
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
# Simulate an unclean shutdown (power off).
kill -9 %1
# On restart, etcd may fail with "failed to find [SNAPSHOT-INDEX].snap.db".
./bin/etcd --snapshot-count=5

@neolit123
Member

neolit123 commented Sep 22, 2022

> I think etcd 3.5.5 should be bumped into previous stable releases as well.

from what i've seen these backports to bump etcd in older k8s releases do not get merged. but if someone wants to try, go ahead.
your solution is to tell kubeadm what etcd image version to use. remember to manually handle upgrades
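For anyone looking for the concrete knob: pinning the etcd image version is done through the local etcd section of kubeadm's ClusterConfiguration (a sketch using the v1beta3 config API; the tag shown is just the version discussed in this thread):

# Sketch of a kubeadm ClusterConfiguration that pins the etcd image tag.
cat <<'EOF' > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
etcd:
  local:
    imageTag: "3.5.5-0"
EOF
# Passed to e.g. `kubeadm init --config kubeadm-config.yaml`; on an existing cluster the
# kubeadm-config ConfigMap and the etcd static pod manifest have to be updated manually,
# and future upgrades must account for the pinned tag.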

@ahrtr
Member

ahrtr commented Sep 22, 2022

@BenTheElder
Member

> from what i've seen these backports to bump etcd in older k8s releases do not get merged. but if someone wants to try, go ahead.

Though, in this case it should only be a patch bump and may be fine? We have been on 3.5.X since at least k8s 1.23

@neolit123
Member

etcd patch backports did not get attention and approval either. +1 from me if someone wants to try.

@wjentner
Author

> @wjentner FYI. https://github.com/ahrtr/etcd-issues/blob/d134cb8d07425bf3bf530e6bb509c6e6bc6e7c67/etcd/etcd-db-editor/main.go#L16-L28
>
> Please let me know whether it works or not.

I have manually recovered our DB using the snap/db file.

As a current workaround, I have set --quota-backend-bytes to 8 GB using the extraArgs in kubeadm so that the alarm is not raised.
Additionally, I have forced etcd version 3.5.5.
There seems to be no way to overwrite the livenessProbe other than manually editing the manifest directly.

I'm quite surprised that this problem has not affected more people yet, since the default DB size at which the alarm is raised is around 2 GB.
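For reference, that workaround maps onto the same ClusterConfiguration shape as above (a sketch; the quota value and image tag are the ones mentioned in this thread):

cat <<'EOF' > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
etcd:
  local:
    imageTag: "3.5.5-0"
    extraArgs:
      # 8 GiB backend quota instead of the ~2 GB default at which NOSPACE is raised.
      quota-backend-bytes: "8589934592"
EOF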

@neolit123
Member

> There seems to be no way to overwrite the livenessProbe other than manually editing the manifest directly.

check the kubeadm --patches functionality
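A sketch of what that could look like (the target name and file naming follow kubeadm's --patches conventions; the /livez path is only an example and requires an etcd version that actually serves it):

# Strategic-merge patch for the etcd static pod, applied via kubeadm's --patches directory.
mkdir -p /etc/kubernetes/patches
cat <<'EOF' > /etc/kubernetes/patches/etcd+strategic.yaml
spec:
  containers:
  - name: etcd
    livenessProbe:
      httpGet:
        host: 127.0.0.1
        port: 2381
        path: /livez
EOF
# Then point kubeadm at the directory, e.g. `kubeadm init --patches /etc/kubernetes/patches`
# (flag availability on the upgrade subcommands depends on the kubeadm version).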

@ahrtr
Member

ahrtr commented Sep 23, 2022

Thanks for the feedback.

> I have manually recovered our DB using the snap/db file.

I am curious how you did it.

@wjentner
Author

@ahrtr I basically followed the disaster recovery docs: https://etcd.io/docs/v3.5/op-guide/recovery/

  1. stopped etcd on all nodes
  2. On etcd1: etcdutl snapshot restore /var/lib/etcd/member/snap/db --name etcd1 --initial-cluster etcd1=http://host1:2380,etcd2=http://host2:2380,etcd3=http://host3:2380 --initial-cluster-token new-cluster --initial-advertise-peer-urls http://host1:2380 --skip-hash-check
  3. Started etcd1 matching the above flags plus --quota-backend-bytes=8589934592, --initial-cluster-state=existing, --initial-cluster-token new-cluster
  4. Made sure etcd1 start is successful (besides complaining that other etcd members are not answering)
  5. Create a new snapshot with etcdctl snapshot save
  6. Copy that snapshot to etcd2 and etcd3
  7. Use command as in (2) with adjusted names and without the --skip-hash-check flag
  8. Start etcd2 and etcd3 with the same flags as in (3).

The cluster synced successfully, and all members were healthy afterward.
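Condensed into commands, the core of those steps looks roughly like this (a sketch; the host names and plain-HTTP peer URLs are the placeholders used above, the real manifests use HTTPS peer URLs plus TLS flags, and --data-dir is added here for clarity):

# Step 2, on etcd1: rebuild a data dir from the live backend file, skipping the hash check.
etcdutl snapshot restore /var/lib/etcd/member/snap/db \
  --name etcd1 \
  --initial-cluster etcd1=http://host1:2380,etcd2=http://host2:2380,etcd3=http://host3:2380 \
  --initial-cluster-token new-cluster \
  --initial-advertise-peer-urls http://host1:2380 \
  --skip-hash-check \
  --data-dir /var/lib/etcd-restored

# Step 5, once etcd1 is healthy again: take a clean snapshot to seed the remaining members.
etcdctl snapshot save snapshot.db

# Step 7, on etcd2 (and analogously etcd3): restore from the clean snapshot, hash check enabled.
etcdutl snapshot restore snapshot.db \
  --name etcd2 \
  --initial-cluster etcd1=http://host1:2380,etcd2=http://host2:2380,etcd3=http://host3:2380 \
  --initial-cluster-token new-cluster \
  --initial-advertise-peer-urls http://host2:2380 \
  --data-dir /var/lib/etcd-restored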

@ahrtr
Member

ahrtr commented Sep 23, 2022

Thanks @wjentner for the feedback, which makes sense.

I just checked the source code; the key point why your steps work is that etcdutl updates the consistent index using the commitId. FYI: v3_snapshot.go#L272
