etcd alarms seem to persist beyond the removal of the member they belong to; additionally cannot clear alarms #14379
Comments
Another weird thing we've noticed: a lot of actions seem to time out at 7 seconds, but it's unclear exactly how that timeout is set --
{"level":"warn","ts":"2022-08-25T01:23:24.880Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-08-25T01:23:17.880Z","time spent":"7.000622172s","remote":"[::1]:33658","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""}
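One plausible source for the 7-second figure, offered as an assumption rather than something confirmed in this thread: etcd's internal request timeout is roughly 5 s plus twice the election timeout, and with the default --election-timeout=1000 (ms) that works out to exactly 7 s. If that's right, raising the election timeout should move the cutoff:
# Assumed relationship: request timeout ~= 5s + 2 * election timeout.
# With defaults (--election-timeout=1000), that is 5s + 2*1s = 7s.
# A cluster started with, for example:
etcd --heartbeat-interval=250 --election-timeout=2500
# would be expected to abandon the same requests at ~10s instead.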
@gpl Could you provide detailed steps so that others can try to reproduce it locally?
Did you see an error log and panic similar to the one below (copied from 14382)? I suggest you provide detailed reproduction steps so that we can double-confirm.
The issue may already be resolved by #14419. Please reopen or raise a new ticket if you still see this issue in 3.5.5 or the main branch.
What happened?
A cluster ran out of space and was "rebuilt": one instance was restarted with --force-new-cluster, the other members were rejoined, and then that node itself was removed and rejoined to the resulting 3-member etcd cluster (a rough sketch of the sequence is shown after the log line below). Since the rebuild, we frequently see the following log line, and as a result kubelet will eventually kill the etcd instance (the error shows up on all nodes):
{"level":"warn","ts":"2022-08-25T00:57:44.423Z","caller":"etcdhttp/metrics.go:165","msg":"serving /health false due to an alarm","alarm":"memberID:<redacted, but mapped to a member that no longer exists> alarm:NOSPACE "}
Attempting to list or clear the alarms times out:
ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --endpoints=https://[]:2379 alarm list
{"level":"warn","ts":"2022-08-25T03:08:36.185+0200","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-736942c2-272c-41e8-9268-398c832b8752/[]:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
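For reference, the disarm invocation we would expect to clear the NOSPACE alarm (same certs and endpoint; the 30 s --command-timeout stands in for the longer timeouts mentioned below) fails the same way:
# Fails with the same "context deadline exceeded" error as alarm list above:
ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --endpoints=https://[]:2379 --command-timeout=30s alarm disarm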
Other commands work as expected; e.g.
member list
returns the expected list of members, none of which match the member ID the alarms are raised for. We tried setting the timeouts a fair bit higher, but no luck. The cluster seems to be stuck in a state where we can't create any keys even with a long timeout, yet endpoint health reports every member in the list healthy in under 5 ms. We see no issues with disk performance: the etcd_disk_wal_fsync_duration_seconds_bucket and etcd_disk_backend_commit_duration_seconds_bucket metrics from etcd both report ~800 µs on every node.
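For reference, a sketch of how those disk metrics can be sampled directly from the metrics endpoint (the localhost address is illustrative; the cert paths match the etcdctl invocation above):
curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key https://127.0.0.1:2379/metrics | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds'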
What did you expect to happen?
Ability to clear the alarms and not see alarms for etcd members that no longer exist.
How can we reproduce it (as minimally and precisely as possible)?
No response
Anything else we need to know?
No response
Etcd version (please run commands below)
3.5.4
Etcd configuration (command line flags or environment variables)
No response
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response