
Server joining Raft cluster might not catch up. #1193

Closed · srh opened this issue Jul 14, 2017 · 6 comments

Labels: kind/bug (Something is broken.)

srh commented Jul 14, 2017

I found this while investigating #1180, when one run produced an anomalous situation.

If you do the following:

  1. Create servers 1, 2, and 3 (all in Raft groups 0 and 1).
  2. Kill server 3.
  3. Run dgraphloader 1million.rdf.gz on server 1.
  4. Revive server 3.

In other words, do as described here: #1169 (comment)

It's possible that, at this point, server 3 ends up not receiving any predicates from 1 or 2. That is, it gets no Raft log entries and no MsgApp messages. (In #1180 the bug is that the messages arrive slowly.) The situation persists if you kill and revive server 3 again, or if you kill all the servers and revive them.

Here is the end of the log for server 3.

--my flag is provided without bindall, Did you forget to specify bindall?
2017/07/13 18:41:00 worker.go:108: Worker listening at address: 127.0.0.1:12347
2017/07/13 18:41:00 main.go:654: gRPC server started.  Listening on port 9082
2017/07/13 18:41:00 main.go:655: HTTP server started.  Listening on port 8082
----------------------------
Connection with "127.0.0.1:12345" healthy.
====== APPLYING MEMBERSHIP UPDATE: {NodeId:1 Addr:127.0.0.1:12345 Leader:true RaftIdx:10 PoolOrNil:0xc4200d07a0}
----------------------------
Group: 0. List: [{NodeId:1 Addr:127.0.0.1:12345 Leader:true RaftIdx:10 PoolOrNil:0xc4200d07a0}]
----------------------------
====== APPLYING MEMBERSHIP UPDATE: {NodeId:2 Addr:127.0.0.1:12346 Leader:false RaftIdx:10 PoolOrNil:0xc4202960c0}
----------------------------
Group: 0. List: [{NodeId:1 Addr:127.0.0.1:12345 Leader:true RaftIdx:10 PoolOrNil:0xc4200d07a0} {NodeId:2 Addr:127.0.0.1:12346 Leader:false RaftIdx:10 PoolOrNil:0xc4202960c0}]
----------------------------
====== APPLYING MEMBERSHIP UPDATE: {NodeId:1 Addr:127.0.0.1:12345 Leader:true RaftIdx:10 PoolOrNil:0xc4200d07a0}
----------------------------
Group: 0. List: [{NodeId:1 Addr:127.0.0.1:12345 Leader:true RaftIdx:10 PoolOrNil:0xc4200d07a0} {NodeId:2 Addr:127.0.0.1:12346 Leader:false RaftIdx:10 PoolOrNil:0xc4202960c0}]
Group: 1. List: [{NodeId:1 Addr:127.0.0.1:12345 Leader:true RaftIdx:10 PoolOrNil:0xc4200d07a0}]
----------------------------
====== APPLYING MEMBERSHIP UPDATE: {NodeId:2 Addr:127.0.0.1:12346 Leader:false RaftIdx:10 PoolOrNil:0xc4202960c0}
----------------------------
Group: 1. List: [{NodeId:1 Addr:127.0.0.1:12345 Leader:true RaftIdx:10 PoolOrNil:0xc4200d07a0} {NodeId:2 Addr:127.0.0.1:12346 Leader:false RaftIdx:10 PoolOrNil:0xc4202960c0}]
Group: 0. List: [{NodeId:1 Addr:127.0.0.1:12345 Leader:true RaftIdx:10 PoolOrNil:0xc4200d07a0} {NodeId:2 Addr:127.0.0.1:12346 Leader:false RaftIdx:10 PoolOrNil:0xc4202960c0}]
Last update is now: 10
Node with GroupID: 0, ID: 3
Connection with "127.0.0.1:12346" healthy.
Node with GroupID: 1, ID: 3
Found hardstate: {Data:[] Metadata:{ConfState:{Nodes:[] XXX_unrecognized:[]} Index:0 Term:0 XXX_unrecognized:[]} XXX_unrecognized:[]}
Group 0 found 0 entries
Restarting node for group: 0
raft2017/07/13 18:41:00 INFO: 3 became follower at term 1
raft2017/07/13 18:41:00 INFO: newRaft 3 [peers: [], term: 1, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
Found hardstate: {Data:[] Metadata:{ConfState:{Nodes:[] XXX_unrecognized:[]} Index:0 Term:0 XXX_unrecognized:[]} XXX_unrecognized:[]}
Group 1 found 0 entries
Restarting node for group: 1
raft2017/07/13 18:41:00 INFO: 3 became follower at term 1
raft2017/07/13 18:41:00 INFO: newRaft 3 [peers: [], term: 1, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2017/07/13 18:42:08 main.go:468: Got clean exit request
2017/07/13 18:42:08 main.go:594: Stopped taking more http(s) requests. Err: mux: listener closed
2017/07/13 18:42:08 main.go:598: All http(s) requests finished.
2017/07/13 18:42:08 main.go:594: Stopped taking more http(s) requests. Err: mux: listener closed
2017/07/13 18:42:08 main.go:598: All http(s) requests finished.
2017/07/13 18:42:08 main.go:581: gRpc server stopped : accept tcp 127.0.0.1:9082: use of closed network connection
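
For what it's worth, here's a rough sketch of why those empty-HardState lines matter, written against the etcd/raft API. This is my reading, not a confirmed diagnosis, and the storage wiring is illustrative rather than Dgraph's actual code:

package sketch

import "github.com/coreos/etcd/raft"

// bringUp shows the two start paths. On a fresh start the peer list is
// passed in explicitly; on restart, peers are recovered only from the
// persisted snapshot/ConfState in Storage. With "Group 0 found 0
// entries" and an empty HardState, node 3 has nothing to recover: it
// restarts with no peers and just waits for messages that never come.
func bringUp(fresh bool) raft.Node {
	cfg := &raft.Config{
		ID:              3,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         raft.NewMemoryStorage(), // illustrative; the real server persists to disk
		MaxSizePerMsg:   4096,
		MaxInflightMsgs: 256,
	}
	if fresh {
		// Fresh start: the node is told who its peers are.
		return raft.StartNode(cfg, []raft.Peer{{ID: 1}, {ID: 2}, {ID: 3}})
	}
	// Restart: no peer argument; membership must come from Storage,
	// which in the log above is empty.
	return raft.RestartNode(cfg)
}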

I saved the servers' output files and will get back to this later.

srh added the kind/bug label on Jul 14, 2017
janardhan1993 commented Jul 14, 2017

Going by the logs, it seems that node 3 was killed before it had joined the cluster. On restart, it recognised that it was a restarted server. The code in worker/draft.go assumes that a restarted node has already joined the cluster, so it doesn't call joinpeers. I think calling joinpeers even after a restart should fix the issue.
I also checked the raft code: doing proposeconfchange to add the same node twice shouldn't cause any issue.

This is from etcd/raft/raft.go:

func (r *raft) addNode(id uint64) {
	if _, ok := r.prs[id]; ok {
		// Ignore any redundant addNode calls (which can happen because the
		// initial bootstrapping entries are applied twice).
		return
	}
	// ... (rest of addNode elided)
}

@manishrjain @srh ^
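
To make that concrete, here's a sketch of the shape of the proposed change. The type and helper names (node, joinPeers, initAndStart) are stand-ins, not the actual identifiers in worker/draft.go:

// Sketch only; names are illustrative, not Dgraph's real ones.
type node struct {
	cfg  *raft.Config
	raft raft.Node
}

// joinPeers is a hypothetical stand-in for the helper that asks a
// healthy peer to propose the ConfChange adding this node to the group.
func (n *node) joinPeers() { /* ... */ }

func (n *node) initAndStart(restart bool, peers []raft.Peer) {
	if restart {
		// Today this path assumes membership was already established,
		// so joinpeers is skipped; that assumption fails if the process
		// died before the join completed.
		n.raft = raft.RestartNode(n.cfg)
	} else {
		n.raft = raft.StartNode(n.cfg, peers)
	}
	// Proposed: call joinPeers on the restart path too. This is safe
	// because the leader's addNode ignores a redundant ConfChange, as
	// the etcd snippet above shows.
	n.joinPeers()
}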

janardhan1993 self-assigned this on Jul 16, 2017
janardhan1993 added this to the v0.8Maybe milestone on Jul 17, 2017
manishrjain modified the milestones: v0.8, v0.8Maybe, v0.8.1 on Jul 26, 2017
manishrjain commented

Looks like a small change. @janardhan1993, maybe create the PR so @srh can check whether he still sees this issue.

janardhan1993 commented

Hey @srh, can you please try this on the issues/github branch?

srh commented Jul 27, 2017

It doesn't seem to work. I'll send you the data dir I made earlier, with which node 3 still doesn't catch up on restart.

srh commented Jul 28, 2017

The reason it seemed not to work is that go build ./... && go install ./... never ran the second command: go build ./... exits with a non-zero status (because of its "no buildable Go source files in .../client_test" complaint), and && short-circuits on failure. So I never actually tested the change on a fresh binary.

I also tried recreating the issue from scratch by killing node 3 very quickly, and couldn't reproduce it that way either. But that attempt was on the binary without the fix applied!

So I'd say let's make the fix, as long as we understand how it works.

srh commented Jul 28, 2017

It works when restarting from the rebased bug/raft branch (which avoids the badger incompatibility problems).

manishrjain added the kind/bug label on Mar 22, 2018