ETCD fails to start after performing alarm list operation and then power off/on #14382

tobgu · 2022-08-25T14:21:19Z

What happened?

With a cluster of three nodes setup using https://github.com/bitnami/charts/blob/master/bitnami/etcd/README.md it was noticed that the ETCD nodes failed to start after a forced reboot of the underlying worker nodes. A graceful shutdown will not result in this issue.

The logs indicated a mismatch in the raft log index between the v2 *.snap files and the v3 db file where the index of the snap files was higher than that of the v3 db file causing ETCD to look for a snap.db file that did not exist (see logs).

The index of the snap file was derived from the file name (eg. 0000000000000017-0000000000124f8c.snap) while the consistent_index of the v3 db was extracted using bbolt, bbolt get db meta consistent_index | hexdump => 0xb4903.

So far the issue looked very much like what is described in #11949. The "fix" described in that issue to get the cluster up and running again also worked, to remove/move the *.snap files.

Worth mentioning: This cluster had not had any writes to it for serveral weeks ahead of the reboot. The data in it is mostly read. Doing a proper write to the cluster will set the consistent_index of the v3 DB to an up-to-date value of the raft index.

After some investigation into why this index difference the between the snapshots and the v3 store occurred it was found that the health check executed regularly by Kubernetes was the reason for the version drift.

The health and readiness check regularly executes etcdctl endpoint health to determine if the cluster is healthy or not. In ETCD 3.4 this was a simple GET on the health key but since #12150 it also includes checking the alarm list to verify that it is empty. For some reason listing the alarms also triggers a write/apply (see attached logs). And for some reason this apply is only applied to the V2 store, not the V3 store. This cause the index in the V2 store to drift from the V3 store until a proper write is performed. I have not dug into the reason for why the write is performed and why it is missing from the V3 store.

The behaviour is only present in this form in 3.5 since the health check in 3.4 does not include listing the alarms.

The problem is easy to reproduce locally. See description.

What did you expect to happen?

I would always expect ETCD to be able to start properly regardless of how the shutdown was done.

How can we reproduce it (as minimally and precisely as possible)?

Locally:

# Use very low snapshot count to more quickly trigger the issue
$  ./etcd --snapshot-count=5 --log-level=debug

# In a different terminal, repeat six times to trigger creation of a snapshot file (you'll see a `triggering snapshot` log message in the server log when this happens)
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health

# Kill etcd server process
$ kill -9 <etcd PID>

# Try to start ETCD again, this will fail with the attached log message
$ ./etcd --snapshot-count=5 --log-level=debug

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.16.15
Go OS/Arch: darwin/amd64

$ etcdctl version
etcdctl version: 3.5.4
API version: 3.5

Etcd configuration (command line flags or environment variables)

See steps to reproduce

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

See steps to reproduce

Relevant log output

# Log printed upon startup failure
{"level":"info","ts":"2022-07-23T18:21:35.360Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/bitnami/etcd/data/member/snap/db","took":"174.548µs"}
{"level":"info","ts":"2022-07-23T18:21:36.259Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":1200012,"snapshot-size":"25 kB"}
{"level":"warn","ts":"2022-07-23T18:21:36.260Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":1200012,"snapshot-file-path":"/bitnami/etcd/data/member/snap/0000000000124f8c.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2022-07-23T18:21:36.261Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot
goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000506300, 0xc000532f40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000010fa0, 0x1234726, 0x2a, 0xc000532f40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/[email protected]/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0xc0000460ca, 0xd, 0x0, 0x0, 0x0, 0x0, 0xc0005150e0, 0x2, 0x2, 0xc00053c480, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc000104000, 0xc000104600, 0x0, 0x0)
/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc000104000, 0x12089be, 0x6, 0xc0004b7201, 0x2)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a1f0, 0x1, 0x1)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a1f0, 0x1, 0x1)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40 +0x13f


# Server debug logs when running `etcdctl endpoint health`
{"level":"debug","ts":"2022-08-25T15:21:32.425+0200","caller":"v3rpc/interceptor.go:182","msg":"request stats","start time":"2022-08-25T15:21:32.425+0200","time spent":"114.504µs","remote":"127.0.0.1:51728","response type":"/etcdserverpb.KV/Range","request count":0,"request size":8,"response count":0,"response size":28,"request content":"key:\"health\" "}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"etcdserver/server.go:2122","msg":"Applying entries","num-entries":1}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"etcdserver/server.go:2125","msg":"Applying entry","index":70,"term":5,"type":"EntryNormal"}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"etcdserver/server.go:2184","msg":"apply entry normal","consistent-index":60,"entry-index":70,"should-applyV3":true}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"etcdserver/server.go:2211","msg":"applyEntryNormal","raftReq":"header:<ID:7587864799024601626 > alarm:<> "}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"v3rpc/interceptor.go:182","msg":"request stats","start time":"2022-08-25T15:21:32.426+0200","time spent":"241.967µs","remote":"127.0.0.1:51728","response type":"/etcdserverpb.Maintenance/Alarm","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"info","ts":"2022-08-25T15:21:32.428+0200","caller":"zapgrpc/zapgrpc.go:174","msg":"[transport] transport: loopyWriter.run returning. connection error: desc = \"transport is closing\""}

The text was updated successfully, but these errors were encountered:

serathius · 2022-08-25T14:29:03Z

Easily reproducible with script:

#!/bin/bash
./bin/etcd --snapshot-count=5 &
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
./bin/etcdctl endpoint health
kill -9 %1
./bin/etcd --snapshot-count=5

serathius · 2022-08-25T14:41:36Z

Really great analysis and repro! @tobgu would you be willing to send PR with a fix?

tobgu · 2022-08-25T14:46:20Z

I would have to spend time trying to understand the V2/V3 sync logic for that and why it is failing in this case. If nobody else knows of a quick solution/fix to this I can do that but I was hoping that someone well versed in this area would pick it up. ;-)

serathius · 2022-08-25T16:45:29Z

Heh, only if I could say that I understand it better. From what I investigated problem is due to consistent index (CI) not being saved when applying Alarms entry. I managed to confirm that this behavior was introduced between v3.5.2 and v3.5.3. Meaning that this is a bug for in fix for data inconsistency issue :(.

As the result snapshot index is greater than CI, which during etcd bootstrap is interpreted that etcd crushed during process of receiving snapshot from leader. Etcd panics as it assumes that there should be a v3 snapshot received from leader.

serathius · 2022-08-25T16:56:24Z

Ok, looks like v3.5.3 assumes that performing applyV3 without an error means that consistency index was commited. However that's not true in case of Alarms, as it doesn't touch db, meaning it doesn't open any transactions, which would lead to executing backend hooks.

Backend hooks bad, again.

serathius · 2022-08-25T17:12:18Z

Some more details, issue is triggered when etcd crashes after snapshot that was followed by only Alarms entries.
It causes a transitory bad state of db, would prevent etcd to recover after crash. It's transitory as db will be fixed after any WAL entry that is not Alarms.

serathius · 2022-08-25T17:16:22Z

Impact: Low as the issue should be extremely rare.

It requires:

Snapshot triggered on Alarm Request entry
Hard crash of etcd before any non-alarm entries are added

Still it brings up the issue of how incomprehensible etcd apply code is. I think this will be third times we try to fix just this part of the code in v3.5.X

tobgu · 2022-08-25T18:44:14Z

I'm not sure I agree with the conclusion that this is an extremely rare case.

Given how the liveness and readiness probes are setup in the Bitnami Helm chart referred to above all that is needed for this to happen is to let time pass without any writes to the DB. Once enough time has passed for a snapshot to have been written (a couple of days) and the machine restarts, ETCD will not come up again. The restart could happen because of power outages, OS updates, fat fingers, what not...

My current workaround is to not use the default probes in the chart but rather hit the /health HTTP endpoint instead (which doesn't seem to suffer from the same problem).

serathius · 2022-08-26T07:37:01Z

Given how the liveness and readiness probes are setup in the Bitnami Helm chart referred to above all that is needed for this to happen is to let time pass without any writes to the DB. Once enough time has passed for a snapshot to have been written (a couple of days) and the machine restarts, ETCD will not come up again. The restart could happen because of power outages, OS updates, fat fingers, what not...

Can you link to how Bitnami does healthcheck?

serathius · 2022-08-26T09:36:14Z

cc @ahrtr @ptabor for their opinion

tobgu · 2022-08-26T11:23:15Z

Given how the liveness and readiness probes are setup in the Bitnami Helm chart referred to above all that is needed for this to happen is to let time pass without any writes to the DB. Once enough time has passed for a snapshot to have been written (a couple of days) and the machine restarts, ETCD will not come up again. The restart could happen because of power outages, OS updates, fat fingers, what not...

Can you link to how Bitnami does healthcheck?

Sure!

This is how the default liveness- and readiness probes and are setup (the ones I've now replaced with http probes against /health): https://github.com/bitnami/charts/blob/master/bitnami/etcd/templates/statefulset.yaml#L265-L289

And this is the shell script that is called by the above probes: https://github.com/bitnami/bitnami-docker-etcd/blob/master/3.5/debian-11/rootfs/opt/bitnami/scripts/etcd/healthcheck.sh

ahrtr · 2022-08-29T07:40:00Z

Thanks @tobgu for raising this issue ( a real issue)!

It turns out to be a regression introduced in 3.5.4 in #13854 (#13908). The alarm list is the only exception that doesn't move consistent_index forward. The reproduction steps are as simple as,

etcd --snapshot-count=5 &
for i in {1..6}; do etcdctl  alarm list; done 
kill -9 <etcd_pid>
etcd

Short term solution

Solution 1

Lock batch_tx in (*AlarmStore) Get, so that it calls the txPostLockInsideApplyHook, and accordingly move consistent_index forward.

Just need to add code something like below into (*AlarmStore) Get. It's the simplest change, but it looks ugly, because it doesn't make sense for alarmList to acquire the batchTx lock at all.

	tx := s.be.BatchTx()
	tx.LockInsideApply()
	defer tx.Unlock()

Solution 2

Change server.go#L1853-L1855 to something like below. I don't know why I did not do this previously. Will think about this more and get back if I recall something new.

			newIndex := s.consistIndex.ConsistentIndex()
			if newIndex < e.Index {
				s.consistIndex.SetConsistentIndex(e.Index, e.Term)
			}

Solution 3

Change the alarmList to use linearizableReadLoop, so that it doesn't go through the raft & applying workflow at all. Accordingly it will not advance the applyIndex at all, and the snapshot Index will not be advanced.

Long term solution

Get rid of the OnPreCommitUnsafe added in 3.5.0 and the txPostLockInsideApplyHook/LockInsideApply/LockOutsideApply added in main & 3.5.3 & 3.5.4.

I will take care of the long-term solution.

For short-term, solution 2 above looks the best for now. Anyone feel free to deliver a PR.

ahrtr · 2022-08-31T08:18:29Z

Is anyone working on this issue?

tobgu · 2022-09-01T20:01:10Z

I'm not working on this. If no one else picks it up I might be able to find some time for it in a couple of weeks. Right now my priorities do not allow it.
I'm happy with the current workaround (a change in the health check used) as a short term solution in terms of stability but would of course want this to be fixed as a proper long term solution in order to not have to worry about triggering the behaviour experienced in some other way.

Issues: 1. etcd-io#14402 fixed in 3.4 only; 2. etcd-io#14382 fixed in both 3.5 and main. Signed-off-by: Benjamin Wang <[email protected]>

ahrtr · 2022-09-22T07:14:46Z

In case anyone is interested, this is the workaround solution https://github.com/ahrtr/etcd-issues/blob/d134cb8d07425bf3bf530e6bb509c6e6bc6e7c67/etcd/etcd-db-editor/main.go#L16-L28

wcmjunior · 2022-10-10T18:22:47Z

In case anyone is interested, this is the workaround solution https://github.com/ahrtr/etcd-issues/blob/d134cb8d07425bf3bf530e6bb509c6e6bc6e7c67/etcd/etcd-db-editor/main.go#L16-L28

This solution fixed our problem in a production etcd configuration. Just commenting here so more people can follow. Thanks!

kycfeel · 2022-12-04T12:11:00Z

In case anyone is interested, this is the workaround solution https://github.com/ahrtr/etcd-issues/blob/d134cb8d07425bf3bf530e6bb509c6e6bc6e7c67/etcd/etcd-db-editor/main.go#L16-L28

@ahrtr You are a life saver!!!!! This solved my issue like magic. Thank you so much.

tobgu added the type/bug label Aug 25, 2022

ahrtr mentioned this issue Sep 2, 2022

Move consistent_index forward when executing alarmList operation #14419

Merged

ahrtr closed this as completed in #14419 Sep 5, 2022

ahrtr added a commit to ahrtr/etcd that referenced this issue Sep 6, 2022

Update changelog to cover some PRs

92ddc94

Issues: 1. etcd-io#14402 fixed in 3.4 only; 2. etcd-io#14382 fixed in both 3.5 and main. Signed-off-by: Benjamin Wang <[email protected]>

This was referenced Sep 6, 2022

Update changelog to cover some PRs #14430

Merged

[3.5] Move consistent_index forward when executing alarmList operation #14429

Merged

ahrtr changed the title ~~ETCD fails to start after power off/on~~ ETCD fails to start after performing alarm list operation and then power off/on Sep 6, 2022

gpl mentioned this issue Sep 20, 2022

etcd alarms seem to persist beyond the removal of the member they belong to; additionally cannot clear alarms #14379

Closed

pacoxu mentioned this issue Sep 22, 2022

liveness probe for etcd may cause database crash kubernetes/kubeadm#2759

Closed

serathius mentioned this issue Oct 10, 2022

failed to recover v3 backend from snapshot ,failed to find database snapshot file #14569

Closed

ptabor pinned this issue Oct 11, 2022

ptabor unpinned this issue Oct 11, 2022

serathius mentioned this issue Nov 14, 2022

ETCD-3.5.5 : panic: failed to recover v3 backend from snapshot #14749

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ETCD fails to start after performing alarm list operation and then power off/on #14382

ETCD fails to start after performing alarm list operation and then power off/on #14382

tobgu commented Aug 25, 2022 •

edited

Loading

serathius commented Aug 25, 2022 •

edited

Loading

serathius commented Aug 25, 2022 •

edited

Loading

tobgu commented Aug 25, 2022

serathius commented Aug 25, 2022

serathius commented Aug 25, 2022 •

edited

Loading

serathius commented Aug 25, 2022

serathius commented Aug 25, 2022 •

edited

Loading

tobgu commented Aug 25, 2022

serathius commented Aug 26, 2022

serathius commented Aug 26, 2022

tobgu commented Aug 26, 2022

ahrtr commented Aug 29, 2022 •

edited

Loading

ahrtr commented Aug 31, 2022

tobgu commented Sep 1, 2022

ahrtr commented Sep 22, 2022

wcmjunior commented Oct 10, 2022

kycfeel commented Dec 4, 2022

ETCD fails to start after performing alarm list operation and then power off/on #14382

ETCD fails to start after performing alarm list operation and then power off/on #14382

Comments

tobgu commented Aug 25, 2022 • edited Loading

What happened?

What did you expect to happen?

How can we reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

Etcd version (please run commands below)

Etcd configuration (command line flags or environment variables)

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

Relevant log output

serathius commented Aug 25, 2022 • edited Loading

serathius commented Aug 25, 2022 • edited Loading

tobgu commented Aug 25, 2022

serathius commented Aug 25, 2022

serathius commented Aug 25, 2022 • edited Loading

serathius commented Aug 25, 2022

serathius commented Aug 25, 2022 • edited Loading

tobgu commented Aug 25, 2022

serathius commented Aug 26, 2022

serathius commented Aug 26, 2022

tobgu commented Aug 26, 2022

ahrtr commented Aug 29, 2022 • edited Loading

Short term solution

Solution 1

Solution 2

Solution 3

Long term solution

ahrtr commented Aug 31, 2022

tobgu commented Sep 1, 2022

ahrtr commented Sep 22, 2022

wcmjunior commented Oct 10, 2022

kycfeel commented Dec 4, 2022

tobgu commented Aug 25, 2022 •

edited

Loading

serathius commented Aug 25, 2022 •

edited

Loading

serathius commented Aug 25, 2022 •

edited

Loading

serathius commented Aug 25, 2022 •

edited

Loading

serathius commented Aug 25, 2022 •

edited

Loading

ahrtr commented Aug 29, 2022 •

edited

Loading