
etcd alarms seem to persist beyond the removal of the member they belong to; additionally cannot clear alarms #14379

Closed
gpl opened this issue Aug 25, 2022 · 5 comments

gpl commented Aug 25, 2022

What happened?

A cluster ran out of space and was "rebuilt": one instance was restarted with --force-new-cluster, the other members were joined to it, and then that node itself was removed and rejoined to the resulting 3-member etcd cluster. However, we now see the following line frequently in the log, and as a result kubelet eventually kills the etcd instance (the error shows up on all nodes):

{"level":"warn","ts":"2022-08-25T00:57:44.423Z","caller":"etcdhttp/metrics.go:165","msg":"serving /health false due to an alarm","alarm":"memberID:<redacted, but mapped to a member that no longer exists> alarm:NOSPACE "}

Attempting to list or clear the alarms times out:

ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --endpoints=https://[]:2379 alarm list

{"level":"warn","ts":"2022-08-25T03:08:36.185+0200","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-736942c2-272c-41e8-9268-398c832b8752/[]:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
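For context, the recovery sequence we were trying to drive through etcdctl is the usual NOSPACE one -- compact, defragment, disarm -- sketched here with clientv3 (client setup as in the snippet above); against this cluster the same calls simply hang until the context deadline:

```go
package recovery

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// clearNoSpace sketches the usual NOSPACE recovery: compact up to the
// current revision, defragment every member, then disarm the alarms.
// Client setup (endpoints, TLS) is as in the previous snippet.
func clearNoSpace(ctx context.Context, cli *clientv3.Client, endpoints []string) error {
	// Any read works here; we only need Header.Revision for the compaction.
	resp, err := cli.Get(ctx, "health-probe")
	if err != nil {
		return err
	}
	if _, err := cli.Compact(ctx, resp.Header.Revision); err != nil {
		return err
	}

	// Defragment each member to actually release backend space.
	for _, ep := range endpoints {
		if _, err := cli.Defragment(ctx, ep); err != nil {
			return err
		}
	}

	// An empty AlarmMember disarms all alarms (what `etcdctl alarm disarm` does).
	_, err = cli.AlarmDisarm(ctx, &clientv3.AlarmMember{})
	return err
}
```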

Other commands work as expected -- e.g. member list returns the expected list of members, none of which match the member ID the alarms are raised for.

I tried setting the timeouts a fair bit higher, but no luck. The cluster seems to be stuck in a state where I can't create any keys even with a long timeout, yet endpoint health returns healthy for all members in the list in <5ms. We see no issues with disk performance; the etcd_disk_wal_fsync_duration_seconds_bucket and etcd_disk_backend_commit_duration_seconds_bucket metrics from etcd both report ~800us on every node.
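Concretely, the write probe looked roughly like this (a sketch; the key, value, and 30s deadline are arbitrary):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder; TLS omitted
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Even with a generous 30s deadline, the write never completes, while
	// `etcdctl endpoint health` on the same members answers in <5ms.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	start := time.Now()
	_, err = cli.Put(ctx, "probe-key", "probe-value")
	fmt.Printf("put took %v, err=%v\n", time.Since(start), err)
}
```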

What did you expect to happen?

Ability to clear the alarms and not see alarms for etcd members that no longer exist.

How can we reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

No response

Etcd version (please run commands below)

3.5.4

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

gpl added the type/bug label Aug 25, 2022
gpl changed the title from "can't seem to clear alarms" to "etcd alarms seem to persist beyond the removal of the member they belong to; additionally cannot clear alarms" Aug 25, 2022

gpl commented Aug 25, 2022

Another odd thing we've noticed: a lot of requests seem to time out at exactly 7 seconds, but it's unclear where that timeout is set:

{"level":"warn","ts":"2022-08-25T01:23:24.880Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-08-25T01:23:17.880Z","time spent":"7.000622172s","remote":"[::1]:33658","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""}
{"level":"warn","ts":"2022-08-25T01:23:26.110Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-08-25T01:23:19.109Z","time spent":"7.001092567s","remote":"[::1]:33658","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""}
{"level":"warn","ts":"2022-08-25T01:23:26.805Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-08-25T01:23:19.804Z","time spent":"7.000708634s","remote":"[::1]:60342","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""}


ahrtr commented Sep 7, 2022

@gpl Could you provide detailed steps so that others can try to reproduce it locally?


gpl commented Sep 20, 2022

> @gpl Could you provide detailed steps so that others can try to reproduce it locally?

I was able to fix it after poking around in bbolt for a while, although restoring from a snapshot also seems to have worked.

This seems to be either the exact issue or closely related: #14382
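For anyone else hitting this, the bbolt poking was roughly the following (a sketch; the db path assumes a default kubeadm layout and the "alarm" bucket name is my reading of the storage schema, so treat both as assumptions). Run it only with etcd stopped:

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open the backend read-only and dump whatever is persisted under the
	// alarm bucket; each key is a marshaled AlarmMember, and the member ID
	// it refers to may no longer exist in the member list.
	db, err := bolt.Open("/var/lib/etcd/member/snap/db", 0o400, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("alarm"))
		if b == nil {
			fmt.Println("no alarm bucket")
			return nil
		}
		return b.ForEach(func(k, v []byte) error {
			fmt.Printf("alarm entry: %q\n", k)
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}
```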


ahrtr commented Sep 20, 2022

> This seems to be either the exact issue or closely related: #14382

Did you see an error log and panic similar to the one below (copied from #14382)? I suggest you provide detailed reproduction steps so that we can double-confirm.

{"level":"panic","ts":"2022-07-23T18:21:36.261Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot
goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000506300, 0xc000532f40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000010fa0, 0x1234726, 0x2a, 0xc000532f40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/[email protected]/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0xc0000460ca, 0xd, 0x0, 0x0, 0x0, 0x0, 0xc0005150e0, 0x2, 0x2, 0xc00053c480, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc000104000, 0xc000104600, 0x0, 0x0)
/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc000104000, 0x12089be, 0x6, 0xc0004b7201, 0x2)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a1f0, 0x1, 0x1)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a1f0, 0x1, 0x1)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40 +0x13f


ahrtr commented Sep 30, 2022

The issue may already be resolved by #14419. Please reopen this issue or raise a new ticket if you still see it in 3.5.5 or the main branch.
