ETCD fails to start after performing alarm list operation and then power off/on #14382
Comments
Easily reproducible with script:
|
Really great analysis and repro! @tobgu would you be willing to send PR with a fix? |
I would have to spend time trying to understand the V2/V3 sync logic for that and why it is failing in this case. If nobody else knows of a quick solution/fix to this I can do that but I was hoping that someone well versed in this area would pick it up. ;-) |
Heh, if only I could say that I understand it better. From what I investigated, the problem is due to the consistent index (CI) not being saved when applying As a result, the snapshot index is greater than the CI, which during etcd bootstrap is interpreted as etcd having crashed while receiving a snapshot from the leader. Etcd panics because it assumes that there should be a v3 snapshot received from the leader. |
Ok, looks like v3.5.3 assumes that performing applyV3 without an error means that the consistent index was committed. However, that's not true in case of Backend hooks bad, again. |
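The bootstrap reasoning described in the comment above can be sketched roughly as follows. This is a toy illustration only, not etcd's actual code: the type `bootState`, its fields, and the function `checkBoot` are hypothetical names introduced here, and the real bootstrap path panics rather than returning an error.

```go
package main

import "fmt"

// bootState is a hypothetical summary of what a member finds on disk at startup.
type bootState struct {
	snapshotIndex   uint64 // raft index of the newest *.snap file
	consistentIndex uint64 // consistent_index stored in the v3 db file
	hasSnapDB       bool   // whether a matching snap.db file exists
}

// checkBoot models the assumption described in the thread: if the snapshot
// index is ahead of the consistent index, the member concludes it crashed
// while receiving a snapshot and expects a snap.db file to be present.
func checkBoot(s bootState) error {
	if s.snapshotIndex <= s.consistentIndex {
		return nil // v3 db is at least as new as the snapshot
	}
	if s.hasSnapDB {
		return nil // a received v3 snapshot can be used to catch up
	}
	return fmt.Errorf("snapshot index %#x is ahead of consistent index %#x and no snap.db file exists",
		s.snapshotIndex, s.consistentIndex)
}

func main() {
	// Values taken from the issue description; no snap.db file was present.
	err := checkBoot(bootState{snapshotIndex: 0x124f8c, consistentIndex: 0xb4903})
	fmt.Println("bootstrap result:", err)
}
```

In this model, the failure reported in this issue corresponds to the last branch: the snapshot index is ahead, but no snapshot was ever received from a leader, so the expected file cannot exist.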
Some more details, issue is triggered when etcd crashes after snapshot that was followed by only Alarms entries. |
Impact: Low as the issue should be extremely rare. It requires:
Still, it brings up the issue of how incomprehensible the etcd apply code is. I think this will be the third time we try to fix just this part of the code in v3.5.X |
I'm not sure I agree with the conclusion that this is an extremely rare case. Given how the liveness and readiness probes are set up in the Bitnami Helm chart referred to above, all that is needed for this to happen is to let time pass without any writes to the DB. Once enough time has passed for a snapshot to have been written (a couple of days) and the machine restarts, ETCD will not come up again. The restart could happen because of power outages, OS updates, fat fingers, what not... My current workaround is to not use the default probes in the chart but rather hit the /health HTTP endpoint instead (which doesn't seem to suffer from the same problem). |
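For readers who want to script the HTTP-based workaround mentioned above, here is a minimal sketch of a probe that performs a GET on /health. It assumes that the endpoint answers with HTTP 200 and a JSON body whose "health" field is the string "true"; that body shape, the environment variable PROBE_URL, and the function names are assumptions made for this example and are not taken from the thread.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"time"
)

// healthBody models the JSON body assumed to be returned by /health.
type healthBody struct {
	Health string `json:"health"`
}

// parseHealth returns nil only if the body reports health == "true".
func parseHealth(body []byte) error {
	var h healthBody
	if err := json.Unmarshal(body, &h); err != nil {
		return fmt.Errorf("decode health body: %w", err)
	}
	if h.Health != "true" {
		return fmt.Errorf("member reports health=%q", h.Health)
	}
	return nil
}

// probe issues GET <baseURL>/health and checks the status code and body.
func probe(ctx context.Context, baseURL string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/health", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	return parseHealth(body)
}

func main() {
	// Stand-in server so the sketch runs without a cluster. Set the
	// hypothetical PROBE_URL variable to a member's client URL to probe it.
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, `{"health":"true"}`)
	}))
	defer stub.Close()

	baseURL := stub.URL
	if u := os.Getenv("PROBE_URL"); u != "" {
		baseURL = u
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := probe(ctx, baseURL); err != nil {
		fmt.Fprintln(os.Stderr, "unhealthy:", err)
		os.Exit(1)
	}
	fmt.Println("healthy")
}
```

Unlike etcdctl endpoint health, a probe of this kind only reads the HTTP endpoint, which is the property the workaround relies on.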
Can you link to how Bitnami does healthcheck? |
Sure! This is how the default liveness and readiness probes are set up (the ones I've now replaced with http probes against And this is the shell script that is called by the above probes: https://github.com/bitnami/bitnami-docker-etcd/blob/master/3.5/debian-11/rootfs/opt/bitnami/scripts/etcd/healthcheck.sh |
Thanks @tobgu for raising this issue (a real issue)! It turns out to be a regression introduced in 3.5.4 in #13854 (#13908). The
Short term solution
Solution 1
Lock batch_tx in (*AlarmStore) Get, so that it calls the Just need to add code something like below into (*AlarmStore) Get. It's the simplest change, but it looks ugly, because it doesn't make sense for
Solution 2
Change server.go#L1853-L1855 to something like below. I don't know why I did not do this previously. Will think about this more and get back if I recall something new.
Solution 3
Change the alarmList to use linearizableReadLoop, so that it doesn't go through the raft & applying workflow at all. Accordingly it will not advance the applyIndex at all, and the snapshot Index will not be advanced.
Long term solution
Get rid of the OnPreCommitUnsafe added in 3.5.0 and the I will take care of the long-term solution. For short-term, solution 2 above looks the best for now. Anyone feel free to deliver a PR. |
Is anyone working on this issue? |
I'm not working on this. If no one else picks it up I might be able to find some time for it in a couple of weeks. Right now my priorities do not allow it. |
Issues: 1. etcd-io#14402 fixed in 3.4 only; 2. etcd-io#14382 fixed in both 3.5 and main. Signed-off-by: Benjamin Wang <[email protected]>
In case anyone is interested, this is the workaround solution https://github.com/ahrtr/etcd-issues/blob/d134cb8d07425bf3bf530e6bb509c6e6bc6e7c67/etcd/etcd-db-editor/main.go#L16-L28 |
This solution fixed our problem in a production |
@ahrtr You are a life saver!!!!! This solved my issue like magic. Thank you so much. |
What happened?
With a cluster of three nodes set up using https://github.com/bitnami/charts/blob/master/bitnami/etcd/README.md, it was noticed that the ETCD nodes failed to start after a forced reboot of the underlying worker nodes. A graceful shutdown does not result in this issue.
The logs indicated a mismatch in the raft log index between the v2 *.snap files and the v3 db file: the index of the snap files was higher than that of the v3 db file, causing ETCD to look for a snap.db file that did not exist (see logs).
The index of the snap file was derived from the file name (e.g. 0000000000000017-0000000000124f8c.snap), while the consistent_index of the v3 db was extracted using bbolt: bbolt get db meta consistent_index | hexdump => 0xb4903.
So far the issue looked very much like what is described in #11949. The "fix" described in that issue to get the cluster up and running again, removing/moving the *.snap files, also worked.
Worth mentioning: this cluster had not had any writes to it for several weeks ahead of the reboot. The data in it is mostly read. Doing a proper write to the cluster will set the consistent_index of the v3 DB to an up-to-date value of the raft index.
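The comparison described above can be reproduced with a few lines of Go. This is a minimal sketch that assumes the snap file name consists of two 16-digit hexadecimal fields, term then index, followed by .snap; the function name `parseSnapName` is hypothetical, and the consistent_index value is the one quoted from the bbolt output above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSnapName extracts the two hexadecimal fields from a snap file name
// such as "0000000000000017-0000000000124f8c.snap". The first field is
// assumed to be the raft term and the second the raft index.
func parseSnapName(name string) (term, index uint64, err error) {
	if !strings.HasSuffix(name, ".snap") {
		return 0, 0, fmt.Errorf("%q does not end in .snap", name)
	}
	parts := strings.Split(strings.TrimSuffix(name, ".snap"), "-")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("%q does not have the form <term>-<index>.snap", name)
	}
	if term, err = strconv.ParseUint(parts[0], 16, 64); err != nil {
		return 0, 0, fmt.Errorf("parse term: %w", err)
	}
	if index, err = strconv.ParseUint(parts[1], 16, 64); err != nil {
		return 0, 0, fmt.Errorf("parse index: %w", err)
	}
	return term, index, nil
}

func main() {
	_, snapIndex, err := parseSnapName("0000000000000017-0000000000124f8c.snap")
	if err != nil {
		panic(err)
	}
	// consistent_index as read from the v3 db with bbolt in the report above.
	const consistentIndex uint64 = 0xb4903
	if snapIndex > consistentIndex {
		fmt.Printf("snapshot index %#x is ahead of consistent_index %#x\n", snapIndex, consistentIndex)
	}
}
```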
After some investigation into why this index difference between the snapshots and the v3 store occurred, it was found that the health check executed regularly by Kubernetes was the reason for the drift.
The health and readiness check regularly executes etcdctl endpoint health to determine if the cluster is healthy or not. In ETCD 3.4 this was a simple GET on the health key, but since #12150 it also includes checking the alarm list to verify that it is empty. For some reason listing the alarms also triggers a write/apply (see attached logs), and for some reason this apply is only applied to the V2 store, not the V3 store. This causes the index in the V2 store to drift from the V3 store until a proper write is performed. I have not dug into the reason for why the write is performed and why it is missing from the V3 store.
The behaviour is only present in this form in 3.5 since the health check in 3.4 does not include listing the alarms.
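The drift described above can be modelled with a small toy simulation. It is only an illustration of the observed behaviour, not etcd code: the type `member`, the entry kinds, and the rule that only a key-value write persists the consistent index are simplifying assumptions made for this sketch.

```go
package main

import "fmt"

// entryKind distinguishes the two kinds of applied entries in this toy model.
type entryKind int

const (
	kvWrite   entryKind = iota // a proper write to the key-value store
	alarmList                  // the alarm listing issued by the health check
)

// member is a hypothetical, heavily simplified view of one etcd member.
type member struct {
	appliedIndex    uint64 // advanced by every applied entry
	consistentIndex uint64 // persisted in the v3 db file
	snapshotIndex   uint64 // index recorded in the newest snap file
}

// apply advances the applied index for every entry, but in this model only a
// proper write brings the persisted consistent index up to date.
func (m *member) apply(kind entryKind) {
	m.appliedIndex++
	if kind == kvWrite {
		m.consistentIndex = m.appliedIndex
	}
}

// snapshot records the current applied index, as a snap file would.
func (m *member) snapshot() {
	m.snapshotIndex = m.appliedIndex
}

func main() {
	m := &member{}
	m.apply(kvWrite) // last real write
	for i := 0; i < 5; i++ {
		m.apply(alarmList) // periodic health checks, no writes
	}
	m.snapshot()
	fmt.Printf("snapshot=%d consistent=%d\n", m.snapshotIndex, m.consistentIndex)

	m.apply(kvWrite) // a proper write closes the gap again
	fmt.Printf("applied=%d consistent=%d\n", m.appliedIndex, m.consistentIndex)
}
```

A forced restart between the snapshot and the next proper write leaves the member in the state printed first, which is the state that prevents startup in this report.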
The problem is easy to reproduce locally. See description.
What did you expect to happen?
I would always expect ETCD to be able to start properly regardless of how the shutdown was done.
How can we reproduce it (as minimally and precisely as possible)?
Locally:
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
See steps to reproduce
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
See steps to reproduce
Relevant log output