Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ETCD-3.5.5 : panic: failed to recover v3 backend from snapshot #14749

Closed
IamSatyaonline opened this issue Nov 14, 2022 · 18 comments
Closed

ETCD-3.5.5 : panic: failed to recover v3 backend from snapshot #14749

IamSatyaonline opened this issue Nov 14, 2022 · 18 comments
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release/v3.5 type/bug

Comments

@IamSatyaonline
Copy link

What happened?

We are trying to install a helm chart with three members of ETCD service. ETCD-3.5.5 is being used in our helm chart. Our deployment is getting failed with the error below as "panic: failed to recover v3 backend from snapshot". We are using latest version of ETCD 3.5.5.

panic: failed to recover v3 backend from snapshot
goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0002cc000, 0xc000536640, 0x1, 0x1)
/usr/local/google/home/siarkowicz/.gvm/pkgsets/go1.16.15/global/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc00065a050, 0x124a82a, 0x2a, 0xc000536640, 0x1, 0x1)
/usr/local/google/home/siarkowicz/.gvm/pkgsets/go1.16.15/global/pkg/mod/go.uber.org/[email protected]/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0xc00006a84a, 0x26, 0x0, 0x0, 0x0, 0x0, 0xc0001499e0, 0x1, 0x1, 0xc000149c20, ...)
/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdserver/server.go:516 +0x1676
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc0002c8700, 0xc00033a000, 0x0, 0x0)
/tmp/etcd-release-3.5.5/etcd/release/etcd/server/embed/etcd.go:243 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc0002c8700, 0x121e47b, 0x6, 0xc000533101, 0x2)
/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/etcd.go:228 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003c1f0, 0x1, 0x1)
/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/etcd.go:123 +0x24da
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003c1f0, 0x1, 0x1)
/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/main.go:40 +0x13f
main.main()
/tmp/etcd-release-3.5.5/etcd/release/etcd/server/main.go:32 +0x45

What did you expect to happen?

There should not be any issue in ETCD service at the time of installation and ETCD should come up without any issue.
Seems, This issue has been fixed but we are facing the same issue in ETCD-3.5.5.

How can we reproduce it (as minimally and precisely as possible)?

We don't know the steps to reproduce it as it is not easy to reproduce the issue.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
# paste output here

$ etcdctl version
# paste output here

We can't share the command's result as our deployment is not coming up.

Etcd configuration (command line flags or environment variables)

ETCD_AUTO_COMPACTION_RETENTION: 100
ETCD_CERT_FILE: /run/sec/certs/server/srvcert.pem
ETCD_PEER_AUTO_TLS: true
ETCD_INITIAL_CLUSTER: etcd-0=https
ETCD_INITIAL_CLUSTER_TOKEN: etcd
ETCD_LISTEN_CLIENT_URLS: https
ETCD_MAX_SNAPSHOTS: 3
ETCD_MAX_WALS: 3
ETCD_ADVERTISE:CLIENT_URLS: https
ETCD_ENABLE_V2 false:
ETCD_KEY_FILE: /run/sec/certs/server/srvprivkey.pem
ETCD_SNAPSHOT_COUNT: 5000
ETCD_AUTO_COMPACTION_MODE: revision
ETCD_CLIENT_CERT_AUTH: true
ETCD_ELECTION_TIMEOUT: 1000
ETCD_HEARTBEAT_INTERVAL: 100
ETCD_INITIAL_CLUSTER_STATE: new
ETCD_NAME: etcd-0
ETCD_QUOTA_BACKEND_BYTES: 268435456
ETCD_TRUSTED_CA_FILE: /data/combinedca/cacertbundle.pem
ETCD_DATA_DIR: /data
ETCD_INITIAL_ADVERTISE_PEER_URLS: https
ETCD_LISTEN_PEER_URLS: https
ETCD_METRICS: basic

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

We can't share the command's result as our deployment is not coming up.

Relevant log output

2022-11-08T23:17:38.073+00:00 | info | The current timezone is UTC
2022-11-08T23:17:38.364+00:00 | info | getopts cs option
2022-11-08T23:17:38.364+00:00 | info | case "${option}" in
2022-11-08T23:17:38.364+00:00 | info | CHECK=true
2022-11-08T23:17:38.364+00:00 | info | getopts cs option
2022-11-08T23:17:38.364+00:00 | info | case "${option}" in
2022-11-08T23:17:38.364+00:00 | info | SUICIDE=true
2022-11-08T23:17:38.364+00:00 | info | getopts cs option
2022-11-08T23:17:38.364+00:00 | info | mkdir -p /data/combinedca/
2022-11-08T23:17:38.364+00:00 | info | [[ 0 == 0 ]]
2022-11-08T23:17:38.364+00:00 | info | [[ ! -f /run/sec/cas/bootstrap/cacertbundle.pem ]]
2022-11-08T23:17:38.364+00:00 | info | [[ ! -f /run/sec/cas/clientca/client-cacertbundle.pem ]]
2022-11-08T23:17:38.364+00:00 | info | [[ -v SUICIDE ]]
22022-11-08T23:17:38.364+00:00 | info | ps -ef
22022-11-08T23:17:38.364+00:00 | info | grep watch_cert
22022-11-08T23:17:38.364+00:00 | info | grep -v grep
2022-11-08T23:17:38.364+00:00 | info | [[ -z '' ]]
2022-11-08T23:17:38.364+00:00 | info | rm -f /data/combinedca/cacertbundle.pem
2022-11-08T23:17:38.364+00:00 | info | nohup /usr/local/bin/scripts/watch_cert.sh /run/sec/cas/clientca /run/sec/cas/pmca /run/sec/cas/siptlsca
2022-11-08T23:17:38.364+00:00 | info | [[ ! -s /data/combinedca/cacertbundle.pem ]]
2022-11-08T23:17:38.364+00:00 | info | awk 1 /run/sec/cas/clientca/client-cacertbundle.pem /run/sec/cas/siptlsca/cacertbundle.pem
2022-11-08T23:17:38.364+00:00 | info | [[ ! -s /data/combinedca/cacertbundle.pem ]]
2022-11-08T23:17:38.364+00:00 | info | [[ ! -s /data/combinedca/cacertbundle.pem ]]
2022-11-08T23:17:38.364+00:00 | info | [[ -v CHECK ]]
2022-11-08T23:17:38.364+00:00 | info | [[ ! -f /data/combinedca/cacertbundle.pem ]]
Is bootstraping ? false
Setup peer certs for etcd-0 
Auto TLS enable , using ETCD auto generated certs for peer communication
2022-11-08T23:17:38.498+00:00 | info | [ /run/sec/cas/clientca ] Caught Certificate renewed event.
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_ADVERTISE_CLIENT_URLS","variable-value":"https://etcd-0.etcd.idunossautoapp01:2379"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_AUTO_COMPACTION_MODE","variable-value":"revision"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_AUTO_COMPACTION_RETENTION","variable-value":"100"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_CERT_FILE","variable-value":"/run/sec/certs/server/srvcert.pem"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_CLIENT_CERT_AUTH","variable-value":"true"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_DATA_DIR","variable-value":"/data"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_ELECTION_TIMEOUT","variable-value":"1000"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_HEARTBEAT_INTERVAL","variable-value":"100"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_INITIAL_ADVERTISE_PEER_URLS","variable-value":"https://etcd-0.etcd-peer.idunossautoapp01.svc.cluster.local:2380"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER","variable-value":"etcd-0=https://etcd-0.etcd-peer.idunossautoapp01.svc.cluster.local:2380"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER_STATE","variable-value":"new"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER_TOKEN","variable-value":"etcd"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_KEY_FILE","variable-value":"/run/sec/certs/server/srvprivkey.pem"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_LISTEN_CLIENT_URLS","variable-value":"https://0.0.0.0:2379"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_LISTEN_PEER_URLS","variable-value":"https://0.0.0.0:2380"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_LOG_LEVEL","variable-value":"info"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_LOGGER","variable-value":"zap"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_MAX_SNAPSHOTS","variable-value":"3"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_MAX_WALS","variable-value":"3"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_METRICS","variable-value":"basic"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_NAME","variable-value":"etcd-0"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_PEER_AUTO_TLS","variable-value":"true"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_QUOTA_BACKEND_BYTES","variable-value":"268435456"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_SNAPSHOT_COUNT","variable-value":"5000"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.505Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_TRUSTED_CA_FILE","variable-value":"/data/combinedca/cacertbundle.pem"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.506Z","caller":"etcdmain/etcd.go:73","message":"Running: ","args":["/usr/local/bin/etcd"]}
{"severity":"warning","timestamp":"2022-11-08T23:17:38.506Z","caller":"etcdmain/etcd.go:446","message":"found invalid file under data directory","filename":"auth_successful","data-dir":"/data"}
{"severity":"warning","timestamp":"2022-11-08T23:17:38.536Z","caller":"etcdmain/etcd.go:446","message":"found invalid file under data directory","filename":"combinedca","data-dir":"/data"}
{"severity":"warning","timestamp":"2022-11-08T23:17:38.536Z","caller":"etcdmain/etcd.go:446","message":"found invalid file under data directory","filename":"etcd.liveness","data-dir":"/data"}
{"severity":"warning","timestamp":"2022-11-08T23:17:38.536Z","caller":"etcdmain/etcd.go:446","message":"found invalid file under data directory","filename":"fixtures","data-dir":"/data"}
{"severity":"warning","timestamp":"2022-11-08T23:17:38.536Z","caller":"etcdmain/etcd.go:446","message":"found invalid file under data directory","filename":"lost+found","data-dir":"/data"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.537Z","caller":"etcdmain/etcd.go:116","message":"server has been already initialized","data-dir":"/data","dir-type":"member"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.537Z","caller":"embed/etcd.go:124","message":"configuring peer listeners","listen-peer-urls":["https://0.0.0.0:2380"]}
{"severity":"warning","timestamp":1667949458.5372853,"caller":"fileutil/fileutil.go:57","message":"check file permission","error":"directory \"/data/fixtures/peer\" exist, but the permission is \"dgrwxrwx---\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.537Z","caller":"embed/etcd.go:482","message":"starting with peer TLS","tls-info":"cert = /data/fixtures/peer/cert.pem, key = /data/fixtures/peer/key.pem, client-cert=/data/fixtures/peer/cert.pem, client-key=/data/fixtures/peer/key.pem, trusted-ca = , client-cert-auth = false, crl-file = ","cipher-suites":[]}
{"severity":"info","timestamp":"2022-11-08T23:17:38.560Z","caller":"embed/etcd.go:132","message":"configuring client listeners","listen-client-urls":["https://0.0.0.0:2379"]}
2022-11-08T23:17:38.588+00:00 | info | Restarting ETCD while reloading certificates
2022-11-08T23:17:38.638+00:00 | info | Killing ETCD process
{"severity":"info","timestamp":"2022-11-08T23:17:38.560Z","caller":"embed/etcd.go:306","message":"starting an etcd server","etcd-version":"3.5.5","git-sha":"19002cfc6","go-version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":8,"max-cpu-available":8,"member-initialized":true,"name":"etcd-0","data-dir":"/data","wal-dir":"","wal-dir-dedicated":"","member-dir":"/data/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":5000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://etcd-0.etcd-peer.idunossautoapp01.svc.cluster.local:2380"],"listen-peer-urls":["https://0.0.0.0:2380"],"advertise-client-urls":["https://etcd-0.etcd.idunossautoapp01:2379"],"listen-client-urls":["https://0.0.0.0:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-backend-bytes":268435456,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"revision","auto-compaction-retention":"100ns","auto-compaction-interval":"100ns","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{"severity":"warning","timestamp":1667949458.5611992,"caller":"fileutil/fileutil.go:57","message":"check file permission","error":"directory \"/data\" exist, but the permission is \"dgrwxrwxr-x\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
{"severity":"warning","timestamp":1667949458.5615969,"caller":"fileutil/fileutil.go:57","message":"check file permission","error":"directory \"/data/member/snap\" exist, but the permission is \"dgrwxrwx---\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
{"severity":"info","timestamp":"2022-11-08T23:17:38.561Z","caller":"etcdserver/backend.go:81","message":"opened backend db","path":"/data/member/snap/db","took":"323.992µs"}
/usr/local/bin/scripts/entrypoint.sh: line 63: kill: (8) - No such process
2022-11-08T23:17:43.108+00:00 | info | Killing the ETCD process
bash: line 1:    33 Terminated              /usr/local/bin/etcdctl defrag --endpoints=:${CLIENT_PORTS} --insecure-skip-tls-verify
Restarting ETCD
2022-11-08T23:17:43.239+00:00 | info | getopts cs option
2022-11-08T23:17:43.239+00:00 | info | case "${option}" in
2022-11-08T23:17:43.239+00:00 | info | CHECK=true
2022-11-08T23:17:43.239+00:00 | info | getopts cs option
2022-11-08T23:17:43.239+00:00 | info | case "${option}" in
2022-11-08T23:17:43.239+00:00 | info | SUICIDE=true
2022-11-08T23:17:43.239+00:00 | info | getopts cs option
2022-11-08T23:17:43.239+00:00 | info | mkdir -p /data/combinedca/
2022-11-08T23:17:43.239+00:00 | info | [[ 0 == 0 ]]
2022-11-08T23:17:43.239+00:00 | info | [[ ! -f /run/sec/cas/bootstrap/cacertbundle.pem ]]
2022-11-08T23:17:43.239+00:00 | info | [[ ! -f /run/sec/cas/clientca/client-cacertbundle.pem ]]
2022-11-08T23:17:43.239+00:00 | info | [[ -v SUICIDE ]]
22022-11-08T23:17:43.239+00:00 | info | ps -ef
22022-11-08T23:17:43.239+00:00 | info | grep watch_cert
22022-11-08T23:17:43.239+00:00 | info | grep -v grep
2022-11-08T23:17:43.239+00:00 | info | [[ -z 250422      49     1  0 23:17 ?        00:00:00 /bin/bash /usr/local/bin/scripts/watch_cert.sh /run/sec/cas/clientca /run/sec/cas/pmca /run/sec/cas/siptlsca ]]
2022-11-08T23:17:43.239+00:00 | info | rm -f /data/combinedca/cacertbundle.pem
2022-11-08T23:17:43.239+00:00 | info | [[ ! -s /data/combinedca/cacertbundle.pem ]]
2022-11-08T23:17:43.239+00:00 | info | awk 1 /run/sec/cas/clientca/client-cacertbundle.pem /run/sec/cas/siptlsca/cacertbundle.pem
2022-11-08T23:17:43.239+00:00 | info | [[ ! -s /data/combinedca/cacertbundle.pem ]]
2022-11-08T23:17:43.239+00:00 | info | [[ ! -s /data/combinedca/cacertbundle.pem ]]
2022-11-08T23:17:43.239+00:00 | info | [[ -v CHECK ]]
2022-11-08T23:17:43.239+00:00 | info | [[ ! -f /data/combinedca/cacertbundle.pem ]]
Is bootstraping ? false
Setup peer certs for etcd-0 
Auto TLS enable , using ETCD auto generated certs for peer communication
{"severity":"info","timestamp":"2022-11-08T23:17:43.292Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_ADVERTISE_CLIENT_URLS","variable-value":"https://etcd-0.etcd.idunossautoapp01:2379"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.292Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_AUTO_COMPACTION_MODE","variable-value":"revision"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.292Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_AUTO_COMPACTION_RETENTION","variable-value":"100"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.292Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_CERT_FILE","variable-value":"/run/sec/certs/server/srvcert.pem"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.292Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_CLIENT_CERT_AUTH","variable-value":"true"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.292Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_DATA_DIR","variable-value":"/data"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.292Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_ELECTION_TIMEOUT","variable-value":"1000"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.292Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_HEARTBEAT_INTERVAL","variable-value":"100"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.292Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_INITIAL_ADVERTISE_PEER_URLS","variable-value":"https://etcd-0.etcd-peer.idunossautoapp01.svc.cluster.local:2380"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.292Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER","variable-value":"etcd-0=https://etcd-0.etcd-peer.idunossautoapp01.svc.cluster.local:2380"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.292Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER_STATE","variable-value":"new"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.292Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER_TOKEN","variable-value":"etcd"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.292Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_KEY_FILE","variable-value":"/run/sec/certs/server/srvprivkey.pem"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.293Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_LISTEN_CLIENT_URLS","variable-value":"https://0.0.0.0:2379"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.293Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_LISTEN_PEER_URLS","variable-value":"https://0.0.0.0:2380"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.293Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_LOG_LEVEL","variable-value":"info"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.293Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_LOGGER","variable-value":"zap"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.293Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_MAX_SNAPSHOTS","variable-value":"3"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.293Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_MAX_WALS","variable-value":"3"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.293Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_METRICS","variable-value":"basic"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.293Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_NAME","variable-value":"etcd-0"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.293Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_PEER_AUTO_TLS","variable-value":"true"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.293Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_QUOTA_BACKEND_BYTES","variable-value":"268435456"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.293Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_SNAPSHOT_COUNT","variable-value":"5000"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.293Z","caller":"flags/flag.go:113","message":"recognized and used environment variable","variable-name":"ETCD_TRUSTED_CA_FILE","variable-value":"/data/combinedca/cacertbundle.pem"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.293Z","caller":"etcdmain/etcd.go:73","message":"Running: ","args":["/usr/local/bin/etcd"]}
{"severity":"warning","timestamp":"2022-11-08T23:17:43.293Z","caller":"etcdmain/etcd.go:446","message":"found invalid file under data directory","filename":"auth_successful","data-dir":"/data"}
{"severity":"warning","timestamp":"2022-11-08T23:17:43.293Z","caller":"etcdmain/etcd.go:446","message":"found invalid file under data directory","filename":"combinedca","data-dir":"/data"}
{"severity":"warning","timestamp":"2022-11-08T23:17:43.293Z","caller":"etcdmain/etcd.go:446","message":"found invalid file under data directory","filename":"etcd.liveness","data-dir":"/data"}
{"severity":"warning","timestamp":"2022-11-08T23:17:43.293Z","caller":"etcdmain/etcd.go:446","message":"found invalid file under data directory","filename":"fixtures","data-dir":"/data"}
{"severity":"warning","timestamp":"2022-11-08T23:17:43.293Z","caller":"etcdmain/etcd.go:446","message":"found invalid file under data directory","filename":"lost+found","data-dir":"/data"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.293Z","caller":"etcdmain/etcd.go:116","message":"server has been already initialized","data-dir":"/data","dir-type":"member"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.293Z","caller":"embed/etcd.go:124","message":"configuring peer listeners","listen-peer-urls":["https://0.0.0.0:2380"]}
{"severity":"warning","timestamp":1667949463.2936218,"caller":"fileutil/fileutil.go:57","message":"check file permission","error":"directory \"/data/fixtures/peer\" exist, but the permission is \"dgrwxrwx---\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.293Z","caller":"embed/etcd.go:482","message":"starting with peer TLS","tls-info":"cert = /data/fixtures/peer/cert.pem, key = /data/fixtures/peer/key.pem, client-cert=/data/fixtures/peer/cert.pem, client-key=/data/fixtures/peer/key.pem, trusted-ca = , client-cert-auth = false, crl-file = ","cipher-suites":[]}
{"severity":"info","timestamp":"2022-11-08T23:17:43.309Z","caller":"embed/etcd.go:132","message":"configuring client listeners","listen-client-urls":["https://0.0.0.0:2379"]}
{"severity":"info","timestamp":"2022-11-08T23:17:43.309Z","caller":"embed/etcd.go:306","message":"starting an etcd server","etcd-version":"3.5.5","git-sha":"19002cfc6","go-version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":8,"max-cpu-available":8,"member-initialized":true,"name":"etcd-0","data-dir":"/data","wal-dir":"","wal-dir-dedicated":"","member-dir":"/data/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":5000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://etcd-0.etcd-peer.idunossautoapp01.svc.cluster.local:2380"],"listen-peer-urls":["https://0.0.0.0:2380"],"advertise-client-urls":["https://etcd-0.etcd.idunossautoapp01:2379"],"listen-client-urls":["https://0.0.0.0:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-backend-bytes":268435456,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"revision","auto-compaction-retention":"100ns","auto-compaction-interval":"100ns","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{"severity":"warning","timestamp":1667949463.3107069,"caller":"fileutil/fileutil.go:57","message":"check file permission","error":"directory \"/data\" exist, but the permission is \"dgrwxrwxr-x\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
{"severity":"warning","timestamp":1667949463.3372207,"caller":"fileutil/fileutil.go:57","message":"check file permission","error":"directory \"/data/member/snap\" exist, but the permission is \"dgrwxrwx---\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.337Z","caller":"etcdserver/backend.go:81","message":"opened backend db","path":"/data/member/snap/db","took":"431µs"}
{"severity":"info","timestamp":"2022-11-08T23:17:43.546Z","caller":"etcdserver/server.go:509","message":"recovered v2 store from snapshot","snapshot-index":5001,"snapshot-size":"26 kB"}
{"severity":"warning","timestamp":"2022-11-08T23:17:43.547Z","caller":"snap/db.go:88","message":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":5001,"snapshot-file-path":"/data/member/snap/0000000000001389.snap.db","error":"snap: snapshot file doesn't exist"}
{"severity":"critical","timestamp":"2022-11-08T23:17:43.547Z","caller":"etcdserver/server.go:516","message":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdserver/server.go:516\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/embed/etcd.go:243\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/usr/local/google/home/siarkowicz/.gvm/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot

goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0002cc000, 0xc000536640, 0x1, 0x1)
	/usr/local/google/home/siarkowicz/.gvm/pkgsets/go1.16.15/global/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc00065a050, 0x124a82a, 0x2a, 0xc000536640, 0x1, 0x1)
	/usr/local/google/home/siarkowicz/.gvm/pkgsets/go1.16.15/global/pkg/mod/go.uber.org/[email protected]/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0xc00006a84a, 0x26, 0x0, 0x0, 0x0, 0x0, 0xc0001499e0, 0x1, 0x1, 0xc000149c20, ...)
	/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdserver/server.go:516 +0x1676
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc0002c8700, 0xc00033a000, 0x0, 0x0)
	/tmp/etcd-release-3.5.5/etcd/release/etcd/server/embed/etcd.go:243 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc0002c8700, 0x121e47b, 0x6, 0xc000533101, 0x2)
	/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/etcd.go:228 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003c1f0, 0x1, 0x1)
	/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/etcd.go:123 +0x24da
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003c1f0, 0x1, 0x1)
	/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/main.go:40 +0x13f
main.main()
	/tmp/etcd-release-3.5.5/etcd/release/etcd/server/main.go:32 +0x45
@serathius serathius added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 14, 2022
@serathius
Copy link
Member

serathius commented Nov 14, 2022

This looks like #14569 and #14382.
However logs confirm that issue is still present in affecting v3.5.5 release.

@serathius serathius mentioned this issue Nov 14, 2022
22 tasks
@IamSatyaonline
Copy link
Author

Hi @serathius

Thanks for your prompt reply.
Could you please help if any preventive fix or workaround is possible to fix this issue.

Thanks,
Satya

@serathius
Copy link
Member

cc @ahrtr

@ahrtr
Copy link
Member

ahrtr commented Nov 14, 2022

@IamSatyaonline previously when you ran into #14569, what did you do before upgrading to 3.5.5? Did you resolve the issue before upgrading?

@ahrtr
Copy link
Member

ahrtr commented Nov 14, 2022

Based on the logs and timestamp, looks like the user ran into the same revision inconsistency issue which was resolved in #14733. The de-fragmentation operation was somehow terminated. @IamSatyaonline Please double confirm this and clarify why the de-fragmentation operation was terminated.

FYI. https://github.com/ahrtr/etcd-issues/tree/master/issues/revision_inconsistency

/usr/local/bin/scripts/entrypoint.sh: line 63: kill: (8) - No such process
2022-11-08T23:17:43.108+00:00 | info | Killing the ETCD process
bash: line 1:    33 Terminated              /usr/local/bin/etcdctl defrag --endpoints=:${CLIENT_PORTS} --insecure-skip-tls-verify
Restarting ETCD
{"severity":"warning","timestamp":"2022-11-08T23:17:43.547Z","caller":"snap/db.go:88","message":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":5001,"snapshot-file-path":"/data/member/snap/0000000000001389.snap.db","error":"snap: snapshot file doesn't exist"}
{"severity":"critical","timestamp":"2022-11-08T23:17:43.547Z","caller":"etcdserver/server.go:516","message":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdserver/server.go:516\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/embed/etcd.go:243\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/usr/local/google/home/siarkowicz/.gvm/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot

@ahrtr
Copy link
Member

ahrtr commented Nov 14, 2022

@IamSatyaonline Please also do the following checks:

  1. Go to ${ETCD_DATA_DIR}/membe/snap, and then execute ls -lrt. You can get the index from the name, for example, it's 253 (0xfd) in the last file.
$ ls -lrt
total 33064
-rw-r--r--  1 wachao  staff      9012 Nov 10 20:39 0000000000000002-000000000000009a.snap
-rw-r--r--  1 wachao  staff      9012 Nov 10 20:39 0000000000000002-00000000000000a5.snap
-rw-r--r--  1 wachao  staff      9012 Nov 10 20:39 0000000000000002-00000000000000b0.snap
-rw-r--r--  1 wachao  staff      9012 Nov 10 20:39 0000000000000002-00000000000000bb.snap
-rw-r--r--  1 wachao  staff      9012 Nov 10 20:39 0000000000000002-00000000000000c6.snap
-rw-r--r--  1 wachao  staff      9360 Nov 10 20:39 0000000000000002-00000000000000d1.snap
-rw-r--r--  1 wachao  staff      9360 Nov 10 20:39 0000000000000002-00000000000000dc.snap
-rw-r--r--  1 wachao  staff      9360 Nov 10 20:39 0000000000000002-00000000000000e7.snap
-rw-r--r--  1 wachao  staff      9360 Nov 10 20:40 0000000000000002-00000000000000f2.snap
-rw-r--r--  1 wachao  staff      9360 Nov 10 20:40 0000000000000002-00000000000000fd.snap
-rw-------  1 wachao  staff  16834560 Nov 10 20:40 db
  1. Use etcd-dump-logs to analyze the WAL files. Follow example steps below,
# CD to directory etcd/tools/etcd-dump-logs
$ go build

# Please replace the directory with your real etcd data dir
$ ./etcd-dump-logs ~/tmp/etcd/infra1.etcd
Snapshot:
term=2 index=253 nodes=[8211f1d0f64f3269 91bc3c398fb3c146 fd422379fda50e48] confstate={"voters":[9372538179322589801,10501334649042878790,18249187646912138824],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=8211f1d0f64f3269 clusterID=ef37ad9dc622a7c4 term=2 commitIndex=261 vote=91bc3c398fb3c146
WAL entries: 9
lastIndex=262
term	     index	type	data
   2	       254	norm	header:<ID:3632580128977223734 > put:<key:"k1" value:"v247" > 
   2	       255	norm	header:<ID:3632580128977223735 > put:<key:"k1" value:"v248" > 
   2	       256	norm	header:<ID:3632580128977223736 > put:<key:"k1" value:"v249" > 
   2	       257	norm	header:<ID:3632580128977223737 > put:<key:"k1" value:"v250" > 
   2	       258	norm	header:<ID:3632580128977223738 > put:<key:"k1" value:"v251" > 
   2	       259	norm	header:<ID:3632580128977223739 > put:<key:"k1" value:"v252" > 
   2	       260	norm	header:<ID:3632580128977223740 > put:<key:"k1" value:"v253" > 
   2	       261	norm	header:<ID:3632580128977223741 > put:<key:"k1" value:"v254" > 
   2	       262	norm	header:<ID:3632580128977223742 > put:<key:"k1" value:"v255" > 

@IamSatyaonline
Copy link
Author

@IamSatyaonline previously when you ran into #14569, what did you do before upgrading to 3.5.5? Did you resolve the issue before upgrading?

Hi @ahrtr
This is the new deployment and we just installed the ETCD-3.5.5 and facing the issue.
We have not performed any upgrade so far on this deployment.

Thanks,
Satya

@ahrtr
Copy link
Member

ahrtr commented Nov 15, 2022

@IamSatyaonline
Copy link
Author

Based on the logs and timestamp, looks like the user ran into the same revision inconsistency issue which was resolved in #14733. The de-fragmentation operation was somehow terminated. @IamSatyaonline Please double confirm this and clarify why the de-fragmentation operation was terminated.

FYI. https://github.com/ahrtr/etcd-issues/tree/master/issues/revision_inconsistency

/usr/local/bin/scripts/entrypoint.sh: line 63: kill: (8) - No such process
2022-11-08T23:17:43.108+00:00 | info | Killing the ETCD process
bash: line 1:    33 Terminated              /usr/local/bin/etcdctl defrag --endpoints=:${CLIENT_PORTS} --insecure-skip-tls-verify
Restarting ETCD
{"severity":"warning","timestamp":"2022-11-08T23:17:43.547Z","caller":"snap/db.go:88","message":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":5001,"snapshot-file-path":"/data/member/snap/0000000000001389.snap.db","error":"snap: snapshot file doesn't exist"}
{"severity":"critical","timestamp":"2022-11-08T23:17:43.547Z","caller":"etcdserver/server.go:516","message":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdserver/server.go:516\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/embed/etcd.go:243\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/usr/local/google/home/siarkowicz/.gvm/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot

Based on the logs and timestamp, looks like the user ran into the same revision inconsistency issue which was resolved in #14733. The de-fragmentation operation was somehow terminated. @IamSatyaonline Please double confirm this and clarify why the de-fragmentation operation was terminated.

FYI. https://github.com/ahrtr/etcd-issues/tree/master/issues/revision_inconsistency

/usr/local/bin/scripts/entrypoint.sh: line 63: kill: (8) - No such process
2022-11-08T23:17:43.108+00:00 | info | Killing the ETCD process
bash: line 1:    33 Terminated              /usr/local/bin/etcdctl defrag --endpoints=:${CLIENT_PORTS} --insecure-skip-tls-verify
Restarting ETCD
{"severity":"warning","timestamp":"2022-11-08T23:17:43.547Z","caller":"snap/db.go:88","message":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":5001,"snapshot-file-path":"/data/member/snap/0000000000001389.snap.db","error":"snap: snapshot file doesn't exist"}
{"severity":"critical","timestamp":"2022-11-08T23:17:43.547Z","caller":"etcdserver/server.go:516","message":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdserver/server.go:516\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/embed/etcd.go:243\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/usr/local/google/home/siarkowicz/.gvm/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot

Hi @ahrtr
ETCD running in TLS enabled mode in our cluster. We reloads the TLS certificates once we get the updated certificates.
During relaod of the certificates , We kill the etcd service and restart the ETCD service.
That is the reason , defragmentation is being terminated while killing the ETCD service to relaod the certificates.

It's difficult for us to run the etcd-dump-logs as our deployment is not up due to the panic issue.

Thanks,
Satya

@IamSatyaonline
Copy link
Author

@IamSatyaonline Please also do the following checks:

  1. Go to ${ETCD_DATA_DIR}/membe/snap, and then execute ls -lrt. You can get the index from the name, for example, it's 253 (0xfd) in the last file.
$ ls -lrt
total 33064
-rw-r--r--  1 wachao  staff      9012 Nov 10 20:39 0000000000000002-000000000000009a.snap
-rw-r--r--  1 wachao  staff      9012 Nov 10 20:39 0000000000000002-00000000000000a5.snap
-rw-r--r--  1 wachao  staff      9012 Nov 10 20:39 0000000000000002-00000000000000b0.snap
-rw-r--r--  1 wachao  staff      9012 Nov 10 20:39 0000000000000002-00000000000000bb.snap
-rw-r--r--  1 wachao  staff      9012 Nov 10 20:39 0000000000000002-00000000000000c6.snap
-rw-r--r--  1 wachao  staff      9360 Nov 10 20:39 0000000000000002-00000000000000d1.snap
-rw-r--r--  1 wachao  staff      9360 Nov 10 20:39 0000000000000002-00000000000000dc.snap
-rw-r--r--  1 wachao  staff      9360 Nov 10 20:39 0000000000000002-00000000000000e7.snap
-rw-r--r--  1 wachao  staff      9360 Nov 10 20:40 0000000000000002-00000000000000f2.snap
-rw-r--r--  1 wachao  staff      9360 Nov 10 20:40 0000000000000002-00000000000000fd.snap
-rw-------  1 wachao  staff  16834560 Nov 10 20:40 db
  1. Use etcd-dump-logs to analyze the WAL files. Follow example steps below,
# CD to directory etcd/tools/etcd-dump-logs
$ go build

# Please replace the directory with your real etcd data dir
$ ./etcd-dump-logs ~/tmp/etcd/infra1.etcd
Snapshot:
term=2 index=253 nodes=[8211f1d0f64f3269 91bc3c398fb3c146 fd422379fda50e48] confstate={"voters":[9372538179322589801,10501334649042878790,18249187646912138824],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=8211f1d0f64f3269 clusterID=ef37ad9dc622a7c4 term=2 commitIndex=261 vote=91bc3c398fb3c146
WAL entries: 9
lastIndex=262
term	     index	type	data
   2	       254	norm	header:<ID:3632580128977223734 > put:<key:"k1" value:"v247" > 
   2	       255	norm	header:<ID:3632580128977223735 > put:<key:"k1" value:"v248" > 
   2	       256	norm	header:<ID:3632580128977223736 > put:<key:"k1" value:"v249" > 
   2	       257	norm	header:<ID:3632580128977223737 > put:<key:"k1" value:"v250" > 
   2	       258	norm	header:<ID:3632580128977223738 > put:<key:"k1" value:"v251" > 
   2	       259	norm	header:<ID:3632580128977223739 > put:<key:"k1" value:"v252" > 
   2	       260	norm	header:<ID:3632580128977223740 > put:<key:"k1" value:"v253" > 
   2	       261	norm	header:<ID:3632580128977223741 > put:<key:"k1" value:"v254" > 
   2	       262	norm	header:<ID:3632580128977223742 > put:<key:"k1" value:"v255" > 

Hi @ahrtr

We can't run the etcd-dump-logs in our deployment As our deployment is not up due to the panic issue.

Thanks,
Satya

@ahrtr
Copy link
Member

ahrtr commented Nov 15, 2022

We can't run the etcd-dump-logs in our deployment As our deployment is not up due to the panic issue.

Executing etcd-dump-logs doesn't require the etcd instance running. Please executels -lrtunder the ${data_dir}/member/snap, and also analyze the WAL files using etcd-dump-logs per my comment above

@IamSatyaonline
Copy link
Author

We can't run the etcd-dump-logs in our deployment As our deployment is not up due to the panic issue.

Executing etcd-dump-logs doesn't require the etcd instance running. Please executels -lrtunder the ${data_dir}/member/snap, and also analyze the WAL files using etcd-dump-logs per my comment above

Hi @ahrtr
This deployment is not available now and we are actively trying to reproduce this issue, so that we could follow the steps to analyze the WAL files.
@ahrtr
what is your view on this issue, is this the genuine issue in ETCD-3.5.5 ? could you please suggest a way to move forward to fix this panic issue ?

Thanks,
Satya

@ahrtr
Copy link
Member

ahrtr commented Nov 16, 2022

This deployment is not available now and we are actively trying to reproduce this issue, so that we could follow the steps to analyze the WAL files.

We do not need the deployment. Instead, we just need to data-directory of the etcd. Do you mean the data has already been removed? Or the storage supporting the dynamic volume consumed by the deployment has been removed?

what is your view on this issue, is this the genuine issue in ETCD-3.5.5 ?

Just as I mentioned in #14749 (comment), it's likely the same issue as #14733. But I need to double confirm, but unfortunately, you can't provide the info I requested.

we are actively trying to reproduce this issue

Is it easy to reproduce this issue?

@IamSatyaonline
Copy link
Author

IamSatyaonline commented Nov 22, 2022

We do not need the deployment. Instead, we just need to data-directory of the etcd. Do you mean the data has already been > removed? Or the storage supporting the dynamic volume consumed by the deployment has been removed?

We have cleanup the storage and all the data and reinstalled the new chart. So we don't have the snap and WAL data.

Is it easy to reproduce this issue?

No , it's not easy to reproduce this issue.

@ahrtr
Copy link
Member

ahrtr commented Nov 22, 2022

Thanks @IamSatyaonline for the feedback. I'd close this issue for now. Please feel free to reopen it or create a new one if you reproduce this issue again. Please keep the environment once you reproduce it. thx

@ahrtr ahrtr closed this as completed Nov 22, 2022
@IamSatyaonline
Copy link
Author

Hi @ahrtr

Just as I mentioned in #14749 (comment), it's likely the same issue as #14733. But I need to double confirm, but unfortunately, you can't provide the info I requested.

As you mentioned it's the same issue as 14733 , which was fixed in 3.5.5 but We are able to reproduce this issue in ETCD-3.5.5, unfortunately we don't have WAL and snap data. Do you still think that there is no issue in ETCD-3.5.5 what we have reported ?
I have shared the ETCD logs, are the logs not enough to troubleshoot the issue ?

Thanks,
Satya

@ahrtr
Copy link
Member

ahrtr commented Nov 22, 2022

As you mentioned it's the same issue as 14733 , which was fixed in 3.5.5

The issue is fixed in 3.5.6.

@rpstw
Copy link

rpstw commented Jul 25, 2024

Turnout to be #14382 and fixed by using https://github.com/ahrtr/etcd-issues/blob/b221ffdee411e9dc1715d329f5e67f41366012b3/etcd/etcd-db-editor/main.go#L16-L28.

I must say it is a life-saver otherwise we had to recreate the entire Kubernetes cluster.


Hi @ahrtr

We ran into this issue after deleting a large number of pods.

We use Etcd 3.5.3, which is the default version included in Kubernetes 1.24.

As https://github.com/ahrtr/etcd-issues/tree/master/issues/revision_inconsistency#how-to-workaround-this-issue said a workaround is to rejoin. Unfortunately all of our master node crashed in the same way before we found this issue page.

Are there any other methods that could be used to recover the cluster?

Also since we are having a down cluster, checks in #14749 (comment) could all be executed and I can check anything else needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release/v3.5 type/bug
Development

No branches or pull requests

4 participants