Hello,
We are facing a really weird issue in some of our clusters (2 of them at the moment, out of 10+), but we are unable to identify where it could be coming from... The issue usually shows up 36-48h after the process starts, so it has been a pain in the ass to replicate/investigate so far. Since I couldn't find any trace of a similar issue elsewhere, I assume it's more of a config/setup issue or some really weird edge case.
Information about the setup:
ETCD used in an RKE cluster
Version: {"etcdserver":"3.4.16","etcdcluster":"3.4.0"}
ETCD runs in Docker (5:20.10.18~3-0~ubuntu-focal).
OS that runs Docker: Ubuntu 20.04.5 LTS
This cluster is running 4 ETCD nodes, as you can see in the logs (I need to shrink that to 3; it's not supposed to be 4), but we also see the issue in a 3-node cluster.
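For context, this is roughly how we inspect the cluster from one of the etcd hosts. It's only a sketch: the container name ("etcd") and the certificate paths/filenames are the RKE defaults on our setup and are assumptions here, so adjust them to your environment.

# Run etcdctl inside the RKE etcd container (container name assumed to be "etcd";
# cert paths/filenames below are the usual RKE locations and may differ elsewhere).
docker exec etcd etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/ssl/kube-ca.pem \
  --cert=/etc/kubernetes/ssl/kube-etcd-10-252-145-26.pem \
  --key=/etc/kubernetes/ssl/kube-etcd-10-252-145-26-key.pem \
  member list -w table

# Same flags, but checking the status of every member at once:
docker exec etcd etcdctl \
  --endpoints=https://10.252.145.21:2379,https://10.252.145.22:2379,https://10.252.145.23:2379,https://10.252.145.26:2379 \
  --cacert=/etc/kubernetes/ssl/kube-ca.pem \
  --cert=/etc/kubernetes/ssl/kube-etcd-10-252-145-26.pem \
  --key=/etc/kubernetes/ssl/kube-etcd-10-252-145-26-key.pem \
  endpoint status -w table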
Here is the initial log from one of the nodes when the issue starts (the outputs below are a mix of docker logs output and logs taken directly from the server):
{"log":"2022-12-16 01:45:16.552453 I | mvcc: store.index: compact 214310696\n","stream":"stderr","time":"2022-12-16T01:45:16.552543649Z"}
{"log":"2022-12-16 01:45:16.640301 I | mvcc: finished scheduled compaction at 214310696 (took 84.289008ms)\n","stream":"stderr","time":"2022-12-16T01:45:16.64037625Z"}
{"log":"2022-12-16 01:48:54.214675 W | rafthttp: lost the TCP streaming connection with peer 588daae93a2c19a (stream MsgApp v2 reader)\n","stream":"stderr","time":"2022-12-16T01:48:54.214753271Z"}
{"log":"2022-12-16 01:48:54.214712 E | rafthttp: failed to read 588daae93a2c19a on stream MsgApp v2 (read tcp 10.252.145.26:49042-\u003e10.252.145.23:2380: i/o timeout)\n","stream":"stderr","time":"2022-12-16T01:48:54.214772241Z"}
{"log":"2022-12-16 01:48:54.214723 I | rafthttp: peer 588daae93a2c19a became inactive (message send to peer failed)\n","stream":"stderr","time":"2022-12-16T01:48:54.214775991Z"}
{"log":"2022-12-16 01:48:54.215370 W | rafthttp: lost the TCP streaming connection with peer 588daae93a2c19a (stream Message reader)\n","stream":"stderr","time":"2022-12-16T01:48:54.215408565Z"}
{"log":"2022-12-16 01:48:55.222284 W | rafthttp: lost the TCP streaming connection with peer f6b8db6bdca3e78f (stream MsgApp v2 reader)\n","stream":"stderr","time":"2022-12-16T01:48:55.22241806Z"}
{"log":"2022-12-16 01:48:55.222321 E | rafthttp: failed to read f6b8db6bdca3e78f on stream MsgApp v2 (read tcp 10.252.145.26:56916-\u003e10.252.145.21:2380: i/o timeout)\n","stream":"stderr","time":"2022-12-16T01:48:55.222441741Z"}
{"log":"2022-12-16 01:48:55.222330 I | rafthttp: peer f6b8db6bdca3e78f became inactive (message send to peer failed)\n","stream":"stderr","time":"2022-12-16T01:48:55.222448661Z"}
{"log":"2022-12-16 01:48:55.223055 W | rafthttp: lost the TCP streaming connection with peer f6b8db6bdca3e78f (stream Message reader)\n","stream":"stderr","time":"2022-12-16T01:48:55.223151068Z"}
{"log":"2022-12-16 01:48:55.592692 W | rafthttp: lost the TCP streaming connection with peer 5f28642951cdc00 (stream MsgApp v2 reader)\n","stream":"stderr","time":"2022-12-16T01:48:55.592806963Z"}
{"log":"2022-12-16 01:48:55.592726 E | rafthttp: failed to read 5f28642951cdc00 on stream MsgApp v2 (read tcp 10.252.145.26:51442-\u003e10.252.145.22:2380: i/o timeout)\n","stream":"stderr","time":"2022-12-16T01:48:55.592825554Z"}
{"log":"2022-12-16 01:48:55.592735 I | rafthttp: peer 5f28642951cdc00 became inactive (message send to peer failed)\n","stream":"stderr","time":"2022-12-16T01:48:55.592832124Z"}
{"log":"2022-12-16 01:48:55.803413 W | rafthttp: lost the TCP streaming connection with peer 5f28642951cdc00 (stream Message reader)\n","stream":"stderr","time":"2022-12-16T01:48:55.803500329Z"}
{"log":"raft2022/12/16 01:48:57 INFO: c06e6b105dac3be9 is starting a new election at term 8673\n","stream":"stderr","time":"2022-12-16T01:48:57.958429575Z"}
{"log":"raft2022/12/16 01:48:57 INFO: c06e6b105dac3be9 became candidate at term 8674\n","stream":"stderr","time":"2022-12-16T01:48:57.958491318Z"}
{"log":"raft2022/12/16 01:48:57 INFO: c06e6b105dac3be9 received MsgVoteResp from c06e6b105dac3be9 at term 8674\n","stream":"stderr","time":"2022-12-16T01:48:57.958519159Z"}
{"log":"raft2022/12/16 01:48:57 INFO: c06e6b105dac3be9 [logterm: 8673, index: 15115092] sent MsgVote request to 588daae93a2c19a at term 8674\n","stream":"stderr","time":"2022-12-16T01:48:57.958528949Z"}
{"log":"raft2022/12/16 01:48:57 INFO: c06e6b105dac3be9 [logterm: 8673, index: 15115092] sent MsgVote request to 5f28642951cdc00 at term 8674\n","stream":"stderr","time":"2022-12-16T01:48:57.958533919Z"}
{"log":"raft2022/12/16 01:48:57 INFO: c06e6b105dac3be9 [logterm: 8673, index: 15115092] sent MsgVote request to f6b8db6bdca3e78f at term 8674\n","stream":"stderr","time":"2022-12-16T01:48:57.958538209Z"}
{"log":"raft2022/12/16 01:48:57 INFO: raft.node: c06e6b105dac3be9 lost leader 5f28642951cdc00 at term 8674\n","stream":"stderr","time":"2022-12-16T01:48:57.95854638Z"}
The other nodes will report this:
2022-12-16 01:45:16.634468 I | mvcc: finished scheduled compaction at 214310696 (took 79.333679ms)
raft2022/12/16 01:48:57 INFO: f6b8db6bdca3e78f [logterm: 8673, index: 15115124, vote: 5f28642951cdc00] ignored MsgVote from c06e6b105dac3be9 [logterm: 8673, index: 15115092] at term 8673: lease is not expired (remaining ticks: 9)
2022-12-16 01:48:59.409160 I | embed: rejected connection from "10.252.145.26:48946" (error "read tcp 10.252.145.21:2380->10.252.145.26:48946: i/o timeout", ServerName "")
2022-12-16 01:49:00.221838 E | rafthttp: failed to heartbeat c06e6b105dac3be9 on stream MsgApp v2 (write tcp 10.252.145.21:2380->10.252.145.26:56916: write: connection reset by peer)
2022-12-16 01:49:00.221860 I | rafthttp: peer c06e6b105dac3be9 became inactive (message send to peer failed)
2022-12-16 01:49:00.222632 W | rafthttp: lost the TCP streaming connection with peer c06e6b105dac3be9 (stream MsgApp v2 writer)
2022-12-16 01:49:00.222989 I | embed: rejected connection from "10.252.145.26:48962" (error "read tcp 10.252.145.21:2380->10.252.145.26:48962: i/o timeout", ServerName "")
2022-12-16 01:49:00.223291 W | rafthttp: lost the TCP streaming connection with peer c06e6b105dac3be9 (stream Message writer)
2022-12-16 01:49:00.223617 I | embed: rejected connection from "10.252.145.26:48976" (error "read tcp 10.252.145.21:2380->10.252.145.26:48976: i/o timeout", ServerName "")
WARNING: 2022/12/16 01:49:03 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
raft2022/12/16 01:49:04 INFO: f6b8db6bdca3e78f is starting a new election at term 8673
raft2022/12/16 01:49:04 INFO: f6b8db6bdca3e78f became candidate at term 8674
raft2022/12/16 01:49:04 INFO: f6b8db6bdca3e78f received MsgVoteResp from f6b8db6bdca3e78f at term 8674
raft2022/12/16 01:49:04 INFO: f6b8db6bdca3e78f [logterm: 8673, index: 15115126] sent MsgVote request to 588daae93a2c19a at term 8674
raft2022/12/16 01:49:04 INFO: f6b8db6bdca3e78f [logterm: 8673, index: 15115126] sent MsgVote request to 5f28642951cdc00 at term 8674
raft2022/12/16 01:49:04 INFO: f6b8db6bdca3e78f [logterm: 8673, index: 15115126] sent MsgVote request to c06e6b105dac3be9 at term 8674
raft2022/12/16 01:49:04 INFO: raft.node: f6b8db6bdca3e78f lost leader 5f28642951cdc00 at term 8674
raft2022/12/16 01:49:04 INFO: f6b8db6bdca3e78f received MsgVoteResp from 5f28642951cdc00 at term 8674
raft2022/12/16 01:49:04 INFO: f6b8db6bdca3e78f has received 2 MsgVoteResp votes and 0 vote rejections
raft2022/12/16 01:49:04 INFO: f6b8db6bdca3e78f received MsgVoteResp from 588daae93a2c19a at term 8674
raft2022/12/16 01:49:04 INFO: f6b8db6bdca3e78f has received 3 MsgVoteResp votes and 0 vote rejections
raft2022/12/16 01:49:04 INFO: f6b8db6bdca3e78f became leader at term 8674
raft2022/12/16 01:49:04 INFO: raft.node: f6b8db6bdca3e78f elected leader f6b8db6bdca3e78f at term 8674
From what I can understand, when this happens, the etcd node that has the issue becomes completely unresponsive (there is no other issue visible on the server itself, no network errors reported or anything).
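For anyone wanting to reproduce the checks, something like the following can be used to confirm the stuck node is still reachable at the TCP level and that the host shows no network-level errors (a sketch of the kind of checks we did, not necessarily the exact commands):

# From a healthy peer, check raw TCP reachability of the stuck node's peer and client ports:
nc -vz 10.252.145.26 2380
nc -vz 10.252.145.26 2379

# On the stuck node itself, look for NIC/kernel-level errors and socket pressure:
dmesg | tail -n 50
ss -s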
If we go back to the node with the issue, we can see more connection lost/reset/broken pipe errors a few seconds/minutes later:
{"log":"2022-12-16 01:50:45.604683 I | embed: rejected connection from \"10.252.145.26:43132\" (error \"EOF\", ServerName \"\")\n","stream":"stderr","time":"2022-12-16T01:50:45.604700231Z"}
{"log":"2022-12-16 01:50:45.604970 I | embed: rejected connection from \"10.252.145.26:55730\" (error \"EOF\", ServerName \"\")\n","stream":"stderr","time":"2022-12-16T01:50:45.604992082Z"}
{"log":"2022-12-16 01:50:45.605214 I | embed: rejected connection from \"10.252.145.23:48328\" (error \"EOF\", ServerName \"\")\n","stream":"stderr","time":"2022-12-16T01:50:45.605239201Z"}
{"log":"2022-12-16 01:50:49.851470 W | rafthttp: health check for peer 5f28642951cdc00 could not connect: dial tcp 10.252.145.22:2380: i/o timeout\n","stream":"stderr","time":"2022-12-16T01:50:49.851565851Z"}
{"log":"2022-12-16 01:50:49.851495 W | rafthttp: health check for peer 588daae93a2c19a could not connect: dial tcp 10.252.145.23:2380: i/o timeout\n","stream":"stderr","time":"2022-12-16T01:50:49.851581932Z"}
{"log":"2022-12-16 01:50:49.861642 W | rafthttp: health check for peer f6b8db6bdca3e78f could not connect: dial tcp 10.252.145.21:2380: i/o timeout\n","stream":"stderr","time":"2022-12-16T01:50:49.861698248Z"}
{"log":"2022-12-16 01:50:50.466502 I | embed: rejected connection from \"10.252.145.23:60850\" (error \"read tcp 10.252.145.26:2380-\u003e10.252.145.23:60850: i/o timeout\", ServerName \"\")\n","stream":"stderr","time":"2022-12-16T01:50:50.466616173Z"}
{"log":"2022-12-16 01:50:50.557355 I | embed: rejected connection from \"10.252.145.21:44256\" (error \"read tcp 10.252.145.26:2380-\u003e10.252.145.21:44256: i/o timeout\", ServerName \"\")\n","stream":"stderr","time":"2022-12-16T01:50:50.55743635Z"}
{"log":"2022-12-16 01:50:50.560222 I | embed: rejected connection from \"10.252.145.21:44262\" (error \"read tcp 10.252.145.26:2380-\u003e10.252.145.21:44262: i/o timeout\", ServerName \"\")\n","stream":"stderr","time":"2022-12-16T01:50:50.560278459Z"}
{"log":"2022-12-16 01:50:50.565533 I | embed: rejected connection from \"10.252.145.22:35182\" (error \"read tcp 10.252.145.26:2380-\u003e10.252.145.22:35182: i/o timeout\", ServerName \"\")\n","stream":"stderr","time":"2022-12-16T01:50:50.565633914Z"}
And then it starts complaining about clock issues that don't exist. I checked in all the ETCD containers and they are all aligned time-wise (and I don't believe a node can randomly develop a 1m44s clock drift in a second). This makes me believe the probe call actually takes 1m44s to run, and by the time the process checks the time in the response, the times no longer match:
{"log":"2022-12-16 01:51:04.851948 W | rafthttp: health check for peer 588daae93a2c19a could not connect: dial tcp 10.252.145.23:2380: i/o timeout\n","stream":"stderr","time":"2022-12-16T01:51:04.852052392Z"}
{"log":"2022-12-16 01:51:04.851998 W | rafthttp: health check for peer 5f28642951cdc00 could not connect: dial tcp 10.252.145.22:2380: i/o timeout\n","stream":"stderr","time":"2022-12-16T01:51:04.852078963Z"}
{"log":"2022-12-16 01:51:04.862103 W | rafthttp: health check for peer f6b8db6bdca3e78f could not connect: dial tcp 10.252.145.21:2380: i/o timeout\n","stream":"stderr","time":"2022-12-16T01:51:04.862180948Z"}
{"log":"raft2022/12/16 01:51:07 INFO: c06e6b105dac3be9 is starting a new election at term 8690\n","stream":"stderr","time":"2022-12-16T01:51:07.958476992Z"}
{"log":"raft2022/12/16 01:51:07 INFO: c06e6b105dac3be9 became candidate at term 8691\n","stream":"stderr","time":"2022-12-16T01:51:07.958510763Z"}
{"log":"raft2022/12/16 01:51:07 INFO: c06e6b105dac3be9 received MsgVoteResp from c06e6b105dac3be9 at term 8691\n","stream":"stderr","time":"2022-12-16T01:51:07.958521173Z"}
{"log":"raft2022/12/16 01:51:07 INFO: c06e6b105dac3be9 [logterm: 8673, index: 15115092] sent MsgVote request to 588daae93a2c19a at term 8691\n","stream":"stderr","time":"2022-12-16T01:51:07.958530554Z"}
{"log":"raft2022/12/16 01:51:07 INFO: c06e6b105dac3be9 [logterm: 8673, index: 15115092] sent MsgVote request to 5f28642951cdc00 at term 8691\n","stream":"stderr","time":"2022-12-16T01:51:07.958553085Z"}
{"log":"raft2022/12/16 01:51:07 INFO: c06e6b105dac3be9 [logterm: 8673, index: 15115092] sent MsgVote request to f6b8db6bdca3e78f at term 8691\n","stream":"stderr","time":"2022-12-16T01:51:07.958560075Z"}
{"log":"2022-12-16 01:51:09.852125 W | rafthttp: health check for peer 5f28642951cdc00 could not connect: dial tcp 10.252.145.22:2380: i/o timeout\n","stream":"stderr","time":"2022-12-16T01:51:09.852230449Z"}
{"log":"2022-12-16 01:51:09.852168 W | rafthttp: health check for peer 588daae93a2c19a could not connect: dial tcp 10.252.145.23:2380: i/o timeout\n","stream":"stderr","time":"2022-12-16T01:51:09.852254689Z"}
{"log":"2022-12-16 01:51:09.859485 W | rafthttp: health check for peer f6b8db6bdca3e78f could not connect: dial tcp 10.252.145.21:2380: i/o timeout\n","stream":"stderr","time":"2022-12-16T01:51:09.859573759Z"}
{"log":"2022-12-16 01:51:09.859508 W | rafthttp: the clock difference against peer f6b8db6bdca3e78f is too high [1m44.107763308s \u003e 1s]\n","stream":"stderr","time":"2022-12-16T01:51:09.8595905Z"}
{"log":"2022-12-16 01:51:09.859519 W | rafthttp: health check for peer 5f28642951cdc00 could not connect: dial tcp 10.252.145.22:2380: i/o timeout\n","stream":"stderr","time":"2022-12-16T01:51:09.8595951Z"}
{"log":"2022-12-16 01:51:09.859526 W | rafthttp: the clock difference against peer 5f28642951cdc00 is too high [1m44.110592502s \u003e 1s]\n","stream":"stderr","time":"2022-12-16T01:51:09.85959833Z"}
After that, etcd just loops trying to hold a new election and keeps complaining about the clock issue.
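To rule out real clock drift, the comparison across nodes can be done with something like this (a sketch; it assumes ssh access to the hosts, that the etcd container is named "etcd" as in the RKE default, and that the container image ships a date binary):

# Print the epoch time on each host and inside its etcd container, back to back:
for host in 10.252.145.21 10.252.145.22 10.252.145.23 10.252.145.26; do
  ssh "$host" 'echo "$(hostname): host=$(date -u +%s) container=$(docker exec etcd date -u +%s)"'
done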
I did some tests directly: I could call the /version and /health endpoints of the other nodes without issue, but on the node with the problem they would hang. The /metrics call also hangs, but eventually replies.
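The endpoint tests were essentially direct HTTPS calls against the client port, roughly like this (a sketch; the cert paths are again the assumed RKE defaults, use whatever client cert/key your setup has):

# Query the etcd client port directly; the healthy nodes answer immediately,
# the stuck node hangs on these:
curl --cacert /etc/kubernetes/ssl/kube-ca.pem \
     --cert /etc/kubernetes/ssl/kube-etcd-10-252-145-26.pem \
     --key /etc/kubernetes/ssl/kube-etcd-10-252-145-26-key.pem \
     https://10.252.145.26:2379/version

# Same for /health and /metrics, with a timeout so a hung call eventually returns:
curl --max-time 10 \
     --cacert /etc/kubernetes/ssl/kube-ca.pem \
     --cert /etc/kubernetes/ssl/kube-etcd-10-252-145-26.pem \
     --key /etc/kubernetes/ssl/kube-etcd-10-252-145-26-key.pem \
     https://10.252.145.26:2379/health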
The last thing I did was attach strace to the main process, but as soon as I did that, everything went back to normal (logs below). I couldn't keep the initial strace output this time, as I did not expect it to be what would unblock the situation:
2022-12-16 08:26:10.710687 W | rafthttp: health check for peer 5f28642951cdc00 could not connect: dial tcp 10.252.145.22:2380: i/o timeout
2022-12-16 08:26:10.710710 W | rafthttp: the clock difference against peer 5f28642951cdc00 is too high [1m44.110592502s > 1s]
2022-12-16 08:26:10.710728 W | rafthttp: health check for peer 5f28642951cdc00 could not connect: dial tcp 10.252.145.22:2380: i/o timeout
2022-12-16 08:26:10.710740 W | rafthttp: health check for peer f6b8db6bdca3e78f could not connect: dial tcp 10.252.145.21:2380: i/o timeout
raft2022/12/16 08:26:10 INFO: c06e6b105dac3be9 [term: 11957] ignored a MsgVote message with lower term from f6b8db6bdca3e78f [term: 10418]
raft2022/12/16 08:26:10 INFO: c06e6b105dac3be9 received MsgVoteResp rejection from f6b8db6bdca3e78f at term 11957
raft2022/12/16 08:26:10 INFO: c06e6b105dac3be9 has received 1 MsgVoteResp votes and 1 vote rejections
raft2022/12/16 08:26:10 INFO: c06e6b105dac3be9 received MsgVoteResp rejection from 588daae93a2c19a at term 11957
raft2022/12/16 08:26:10 INFO: c06e6b105dac3be9 has received 1 MsgVoteResp votes and 2 vote rejections
raft2022/12/16 08:26:10 INFO: c06e6b105dac3be9 became follower at term 11957
2022-12-16 08:26:10.798322 I | embed: rejected connection from "10.252.145.22:50592" (error "EOF", ServerName "")
2022-12-16 08:26:10.798787 I | embed: rejected connection from "10.252.145.23:54458" (error "EOF", ServerName "")
2022-12-16 08:26:10.800674 I | embed: rejected connection from "10.252.145.22:47200" (error "EOF", ServerName "")
2022-12-16 08:26:10.801325 I | embed: rejected connection from "10.252.145.21:46966" (error "EOF", ServerName "")
2022-12-16 08:26:10.801757 I | rafthttp: peer f6b8db6bdca3e78f became active
2022-12-16 08:26:10.802020 I | rafthttp: established a TCP streaming connection with peer f6b8db6bdca3e78f (stream MsgApp v2 reader)
2022-12-16 08:26:10.802303 I | rafthttp: peer 588daae93a2c19a became active
2022-12-16 08:26:10.802536 I | rafthttp: established a TCP streaming connection with peer 588daae93a2c19a (stream Message reader)
2022-12-16 08:26:10.802900 I | rafthttp: established a TCP streaming connection with peer 588daae93a2c19a (stream MsgApp v2 reader)
2022-12-16 08:26:10.803688 I | rafthttp: peer 5f28642951cdc00 became active
2022-12-16 08:26:10.803971 I | rafthttp: established a TCP streaming connection with peer f6b8db6bdca3e78f (stream Message reader)
2022-12-16 08:26:10.804581 I | rafthttp: established a TCP streaming connection with peer 5f28642951cdc00 (stream MsgApp v2 reader)
2022-12-16 08:26:10.807723 I | embed: rejected connection from "10.252.145.23:60896" (error "EOF", ServerName "")
2022-12-16 08:26:10.808883 I | embed: rejected connection from "10.252.145.21:36974" (error "EOF", ServerName "")
2022-12-16 08:26:10.809938 I | embed: rejected connection from "10.252.145.22:47254" (error "EOF", ServerName "")
2022-12-16 08:26:10.810777 I | rafthttp: established a TCP streaming connection with peer 5f28642951cdc00 (stream Message reader)
raft2022/12/16 08:26:10 INFO: c06e6b105dac3be9 [term: 11957] ignored a MsgVoteResp message with lower term from 5f28642951cdc00 [term: 11956]
raft2022/12/16 08:26:11 INFO: c06e6b105dac3be9 no leader at term 11957; dropping index reading msg
raft2022/12/16 08:26:11 INFO: c06e6b105dac3be9 [term: 11957] received a MsgVote message with higher term from 588daae93a2c19a [term: 11958]
raft2022/12/16 08:26:11 INFO: c06e6b105dac3be9 became follower at term 11958
raft2022/12/16 08:26:11 INFO: c06e6b105dac3be9 [logterm: 8673, index: 15115092, vote: 0] cast MsgVote for 588daae93a2c19a [logterm: 10417, index: 15386430] at term 11958
raft2022/12/16 08:26:11 INFO: raft.node: c06e6b105dac3be9 elected leader 588daae93a2c19a at term 11958
2022-12-16 08:26:11.831387 I | rafthttp: receiving database snapshot [index: 15386431, from: 588daae93a2c19a, raft message size: 10 kB]
2022-12-16 08:26:12.462811 I | snap: saved database snapshot to disk [total bytes: 163368960]
2022-12-16 08:26:12.462900 I | rafthttp: successfully received and saved database snapshot [index: 15386431, from: 588daae93a2c19a, raft message size: 10 kB, db size: 163 MB, took: 632.410422ms]
raft2022/12/16 08:26:12 INFO: log [committed=15115092, applied=15115092, unstable.offset=15115093, len(unstable.Entries)=0] starts to restore snapshot [index: 15386431, term: 11958]
raft2022/12/16 08:26:12 INFO: c06e6b105dac3be9 switched to configuration voters=(398809010358370714 428552535081278464 13866138020755553257 17778200785354745743)
raft2022/12/16 08:26:12 INFO: c06e6b105dac3be9 [commit: 15386431, lastindex: 15386431, lastterm: 11958] restored snapshot [index: 15386431, term: 11958]
raft2022/12/16 08:26:12 INFO: c06e6b105dac3be9 [commit: 15386431] restored snapshot [index: 15386431, term: 11958]
2022-12-16 08:26:12.463840 I | etcdserver: applying snapshot at index 15055437...
2022-12-16 08:26:12.464806 I | etcdserver: raft applied incoming snapshot at index 15386431
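For completeness, attaching strace to the etcd process from the host looks roughly like this (a sketch; the PID lookup via docker inspect is an assumption about how to find the process, not necessarily the exact commands we used):

# Find the PID of the etcd process as seen from the host (container name assumed to be "etcd"):
ETCD_PID=$(docker inspect --format '{{.State.Pid}}' etcd)

# Attach strace to it and all of its threads, with timestamps, writing the trace to a file:
strace -f -tt -p "$ETCD_PID" -o /tmp/etcd.strace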
A restart of the process also works when this happens, and the cluster recovers almost instantly. (We already tried restarting the physical node hosting it; the issue still seems to occur randomly afterwards.)
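The restart itself is just a restart of the etcd container on the affected node (container name again assumed to be the RKE default):

# Restart the etcd container on the affected node and watch it rejoin the cluster:
docker restart etcd
docker logs -f --since 1m etcd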
If anyone has any idea of what is going on, or how to actually investigate and fix this, it would be greatly appreciated.