
Server joining Raft cluster might not catch up. #1193

Closed · srh opened this issue Jul 14, 2017 · 6 comments

Labels: kind/bug (Something is broken.)

srh commented Jul 14, 2017

I found this while investigating #1180, when one run produced an anomalous situation.

If you do the following:

  1. Create servers 1, 2, and 3 (all in Raft groups 0 and 1).
  2. Kill server 3.
  3. Run dgraphloader 1million.rdf.gz on server 1.
  4. Revive server 3.

In other words, do as described here: #1169 (comment)

It's possible that, at this point, server 3 ends up not receiving any predicates from 1 or 2. That is, it gets no Raft log entries and no MsgApp messages. (In #1180 the bug is that the messages arrive slowly.) The situation persists if you kill and revive server 3 again, or if you kill all the servers and revive them.

Here is the end of the log for server 3.

--my flag is provided without bindall, Did you forget to specify bindall?
2017/07/13 18:41:00 worker.go:108: Worker listening at address: 127.0.0.1:12347
2017/07/13 18:41:00 main.go:654: gRPC server started.  Listening on port 9082
2017/07/13 18:41:00 main.go:655: HTTP server started.  Listening on port 8082
----------------------------
Connection with "127.0.0.1:12345" healthy.
====== APPLYING MEMBERSHIP UPDATE: {NodeId:1 Addr:127.0.0.1:12345 Leader:true RaftIdx:10 PoolOrNil:0xc4200d07a0}
----------------------------
Group: 0. List: [{NodeId:1 Addr:127.0.0.1:12345 Leader:true RaftIdx:10 PoolOrNil:0xc4200d07a0}]
----------------------------
====== APPLYING MEMBERSHIP UPDATE: {NodeId:2 Addr:127.0.0.1:12346 Leader:false RaftIdx:10 PoolOrNil:0xc4202960c0}
----------------------------
Group: 0. List: [{NodeId:1 Addr:127.0.0.1:12345 Leader:true RaftIdx:10 PoolOrNil:0xc4200d07a0} {NodeId:2 Addr:127.0.0.1:12346 Leader:false RaftIdx:10 PoolOrNil:0xc4202960c0}]
----------------------------
====== APPLYING MEMBERSHIP UPDATE: {NodeId:1 Addr:127.0.0.1:12345 Leader:true RaftIdx:10 PoolOrNil:0xc4200d07a0}
----------------------------
Group: 0. List: [{NodeId:1 Addr:127.0.0.1:12345 Leader:true RaftIdx:10 PoolOrNil:0xc4200d07a0} {NodeId:2 Addr:127.0.0.1:12346 Leader:false RaftIdx:10 PoolOrNil:0xc4202960c0}]
Group: 1. List: [{NodeId:1 Addr:127.0.0.1:12345 Leader:true RaftIdx:10 PoolOrNil:0xc4200d07a0}]
----------------------------
====== APPLYING MEMBERSHIP UPDATE: {NodeId:2 Addr:127.0.0.1:12346 Leader:false RaftIdx:10 PoolOrNil:0xc4202960c0}
----------------------------
Group: 1. List: [{NodeId:1 Addr:127.0.0.1:12345 Leader:true RaftIdx:10 PoolOrNil:0xc4200d07a0} {NodeId:2 Addr:127.0.0.1:12346 Leader:false RaftIdx:10 PoolOrNil:0xc4202960c0}]
Group: 0. List: [{NodeId:1 Addr:127.0.0.1:12345 Leader:true RaftIdx:10 PoolOrNil:0xc4200d07a0} {NodeId:2 Addr:127.0.0.1:12346 Leader:false RaftIdx:10 PoolOrNil:0xc4202960c0}]
Last update is now: 10
Node with GroupID: 0, ID: 3
Connection with "127.0.0.1:12346" healthy.
Node with GroupID: 1, ID: 3
Found hardstate: {Data:[] Metadata:{ConfState:{Nodes:[] XXX_unrecognized:[]} Index:0 Term:0 XXX_unrecognized:[]} XXX_unrecognized:[]}
Group 0 found 0 entries
Restarting node for group: 0
raft2017/07/13 18:41:00 INFO: 3 became follower at term 1
raft2017/07/13 18:41:00 INFO: newRaft 3 [peers: [], term: 1, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
Found hardstate: {Data:[] Metadata:{ConfState:{Nodes:[] XXX_unrecognized:[]} Index:0 Term:0 XXX_unrecognized:[]} XXX_unrecognized:[]}
Group 1 found 0 entries
Restarting node for group: 1
raft2017/07/13 18:41:00 INFO: 3 became follower at term 1
raft2017/07/13 18:41:00 INFO: newRaft 3 [peers: [], term: 1, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2017/07/13 18:42:08 main.go:468: Got clean exit request
2017/07/13 18:42:08 main.go:594: Stopped taking more http(s) requests. Err: mux: listener closed
2017/07/13 18:42:08 main.go:598: All http(s) requests finished.
2017/07/13 18:42:08 main.go:594: Stopped taking more http(s) requests. Err: mux: listener closed
2017/07/13 18:42:08 main.go:598: All http(s) requests finished.
2017/07/13 18:42:08 main.go:581: gRpc server stopped : accept tcp 127.0.0.1:9082: use of closed network connection
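
For what it's worth, here's a rough sketch of why those empty-HardState lines matter, written against the etcd/raft API. This is my reading, not a confirmed diagnosis, and the storage wiring is illustrative rather than Dgraph's actual code:

package sketch

import "github.com/coreos/etcd/raft"

// bringUp shows the two start paths. On a fresh start the peer list is
// passed in explicitly; on restart, peers are recovered only from the
// persisted snapshot/ConfState in Storage. With "Group 0 found 0
// entries" and an empty HardState, node 3 has nothing to recover: it
// restarts with no peers and just waits for messages that never come.
func bringUp(fresh bool) raft.Node {
	cfg := &raft.Config{
		ID:              3,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         raft.NewMemoryStorage(), // illustrative; the real server persists to disk
		MaxSizePerMsg:   4096,
		MaxInflightMsgs: 256,
	}
	if fresh {
		// Fresh start: the node is told who its peers are.
		return raft.StartNode(cfg, []raft.Peer{{ID: 1}, {ID: 2}, {ID: 3}})
	}
	// Restart: no peer argument; membership must come from Storage,
	// which in the log above is empty.
	return raft.RestartNode(cfg)
}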

I saved the servers' output files and will get back to this later.

srh added the kind/bug label on Jul 14, 2017
janardhan1993 commented Jul 14, 2017

Going by the logs, it seems that node 3 was killed before it had joined the cluster. On restart, it recognised that it was a restarted server. The code in worker/draft.go assumes that a restarted node has already joined the cluster, so it doesn't call joinpeers. I think calling joinpeers even after a restart should fix the issue.
I also checked the raft code: doing proposeconfchange to add the same node twice shouldn't cause any issue.

This is from etcd/raft/raft.go:

func (r *raft) addNode(id uint64) {
	if _, ok := r.prs[id]; ok {
		// Ignore any redundant addNode calls (which can happen because the
		// initial bootstrapping entries are applied twice).
		return
	}
	// ... (rest of addNode elided)
}

@manishrjain @srh ^
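
To make that concrete, here's a sketch of the shape of the proposed change. The type and helper names (node, joinPeers, initAndStart) are stand-ins, not the actual identifiers in worker/draft.go:

// Sketch only; names are illustrative, not Dgraph's real ones.
type node struct {
	cfg  *raft.Config
	raft raft.Node
}

// joinPeers is a hypothetical stand-in for the helper that asks a
// healthy peer to propose the ConfChange adding this node to the group.
func (n *node) joinPeers() { /* ... */ }

func (n *node) initAndStart(restart bool, peers []raft.Peer) {
	if restart {
		// Today this path assumes membership was already established,
		// so joinpeers is skipped; that assumption fails if the process
		// died before the join completed.
		n.raft = raft.RestartNode(n.cfg)
	} else {
		n.raft = raft.StartNode(n.cfg, peers)
	}
	// Proposed: call joinPeers on the restart path too. This is safe
	// because the leader's addNode ignores a redundant ConfChange, as
	// the etcd snippet above shows.
	n.joinPeers()
}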

janardhan1993 self-assigned this on Jul 16, 2017
janardhan1993 added this to the v0.8Maybe milestone on Jul 17, 2017
manishrjain modified the milestones: v0.8, v0.8Maybe, v0.8.1 on Jul 26, 2017
manishrjain commented

Looks like a small change. @janardhan1993, maybe create the PR so @srh can check whether he still sees this issue.

janardhan1993 commented

Hey @srh, can you please try this on the issues/github branch?

srh commented Jul 27, 2017

It doesn't seem to work. I'll send you the data dir I made earlier, with which node 3 still doesn't catch up on restart.

srh commented Jul 28, 2017

The reason it seemed not to work is that go build ./... && go install ./... never ran the second command: go build ./... exits with a non-zero status (because of its "no buildable Go source files in .../client_test" complaint), and && short-circuits on failure. So I never actually tested the change on a fresh binary.

I also tried recreating the issue from scratch by killing node 3 very quickly, and couldn't reproduce it that way either. But that attempt was on the binary without the fix applied!

So I'd say let's make the fix, as long as we understand how it works.

srh commented Jul 28, 2017

It works when restarting from the rebased bug/raft branch (which avoids the badger incompatibility problems).

manishrjain added the kind/bug label on Mar 22, 2018