
etcd alarms seem to persist beyond the removal of the member they belong to; additionally cannot clear alarms #14379

Closed
gpl opened this issue Aug 25, 2022 · 5 comments

gpl commented Aug 25, 2022

What happened?

A cluster ran out of space and was "rebuilt": one instance was restarted with --force-new-cluster, the other members were joined to it, and then that node itself was removed and rejoined to the resulting 3-member etcd cluster. However, we now see the following line frequently in the log, and as a result kubelet eventually kills the etcd instance (the error shows up on all nodes):

{"level":"warn","ts":"2022-08-25T00:57:44.423Z","caller":"etcdhttp/metrics.go:165","msg":"serving /health false due to an alarm","alarm":"memberID:<redacted, but mapped to a member that no longer exists> alarm:NOSPACE "}

Attempting to list or clear the alarms times out:

ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --endpoints=https://[]:2379 alarm list

{"level":"warn","ts":"2022-08-25T03:08:36.185+0200","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-736942c2-272c-41e8-9268-398c832b8752/[]:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
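For context, the recovery sequence we were trying to drive through etcdctl is the usual NOSPACE one -- compact, defragment, disarm -- sketched here with clientv3 (client setup as in the snippet above); against this cluster the same calls simply hang until the context deadline:

```go
package recovery

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// clearNoSpace sketches the usual NOSPACE recovery: compact up to the
// current revision, defragment every member, then disarm the alarms.
// Client setup (endpoints, TLS) is as in the previous snippet.
func clearNoSpace(ctx context.Context, cli *clientv3.Client, endpoints []string) error {
	// Any read works here; we only need Header.Revision for the compaction.
	resp, err := cli.Get(ctx, "health-probe")
	if err != nil {
		return err
	}
	if _, err := cli.Compact(ctx, resp.Header.Revision); err != nil {
		return err
	}

	// Defragment each member to actually release backend space.
	for _, ep := range endpoints {
		if _, err := cli.Defragment(ctx, ep); err != nil {
			return err
		}
	}

	// An empty AlarmMember disarms all alarms (what `etcdctl alarm disarm` does).
	_, err = cli.AlarmDisarm(ctx, &clientv3.AlarmMember{})
	return err
}
```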

Other commands work as expected -- e.g. member list returns the expected list of members, none of which match the member ID the alarms are raised for.

I tried setting the timeouts a fair bit higher, but no luck. The cluster seems to be stuck in a state where I can't create any keys even with a long timeout, yet endpoint health returns healthy for all members in the list in <5ms. We see no issues with disk performance; the etcd_disk_wal_fsync_duration_seconds_bucket and etcd_disk_backend_commit_duration_seconds_bucket metrics from etcd both report ~800us on every node.
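Concretely, the write probe looked roughly like this (a sketch; the key, value, and 30s deadline are arbitrary):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder; TLS omitted
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Even with a generous 30s deadline, the write never completes, while
	// `etcdctl endpoint health` on the same members answers in <5ms.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	start := time.Now()
	_, err = cli.Put(ctx, "probe-key", "probe-value")
	fmt.Printf("put took %v, err=%v\n", time.Since(start), err)
}
```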

What did you expect to happen?

Ability to clear the alarms and not see alarms for etcd members that no longer exist.

How can we reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

No response

Etcd version (please run commands below)

3.5.4

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

gpl added the type/bug label Aug 25, 2022
gpl changed the title from "can't seem to clear alarms" to "etcd alarms seem to persist beyond the removal of the member they belong to; additionally cannot clear alarms" Aug 25, 2022

gpl commented Aug 25, 2022

Another odd thing we've noticed: a lot of requests seem to time out at exactly 7 seconds, but it's unclear where that timeout is set:

{"level":"warn","ts":"2022-08-25T01:23:24.880Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-08-25T01:23:17.880Z","time spent":"7.000622172s","remote":"[::1]:33658","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""}
{"level":"warn","ts":"2022-08-25T01:23:26.110Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-08-25T01:23:19.109Z","time spent":"7.001092567s","remote":"[::1]:33658","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""}
{"level":"warn","ts":"2022-08-25T01:23:26.805Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-08-25T01:23:19.804Z","time spent":"7.000708634s","remote":"[::1]:60342","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""}


ahrtr commented Sep 7, 2022

@gpl Could you provide detailed steps so that others can try to reproduce it locally?


gpl commented Sep 20, 2022

> @gpl Could you provide detailed steps so that others can try to reproduce it locally?

I was able to fix it after poking around in bbolt for a while, although restoring from a snapshot also seems to have worked.

This seems to be either the exact issue or closely related: #14382
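For anyone else hitting this, the bbolt poking was roughly the following (a sketch; the db path assumes a default kubeadm layout and the "alarm" bucket name is my reading of the storage schema, so treat both as assumptions). Run it only with etcd stopped:

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open the backend read-only and dump whatever is persisted under the
	// alarm bucket; each key is a marshaled AlarmMember, and the member ID
	// it refers to may no longer exist in the member list.
	db, err := bolt.Open("/var/lib/etcd/member/snap/db", 0o400, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("alarm"))
		if b == nil {
			fmt.Println("no alarm bucket")
			return nil
		}
		return b.ForEach(func(k, v []byte) error {
			fmt.Printf("alarm entry: %q\n", k)
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}
```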


ahrtr commented Sep 20, 2022

> This seems to be either the exact issue or closely related: #14382

Did you see an error log and panic similar to the one below (copied from #14382)? I suggest you provide detailed reproduction steps so that we can double-confirm.

{"level":"panic","ts":"2022-07-23T18:21:36.261Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot
goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000506300, 0xc000532f40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000010fa0, 0x1234726, 0x2a, 0xc000532f40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/[email protected]/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0xc0000460ca, 0xd, 0x0, 0x0, 0x0, 0x0, 0xc0005150e0, 0x2, 0x2, 0xc00053c480, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc000104000, 0xc000104600, 0x0, 0x0)
/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc000104000, 0x12089be, 0x6, 0xc0004b7201, 0x2)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a1f0, 0x1, 0x1)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a1f0, 0x1, 0x1)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40 +0x13f


ahrtr commented Sep 30, 2022

The issue may already be resolved by #14419. Please reopen this issue or raise a new ticket if you still see it in 3.5.5 or the main branch.
