Server joining Raft cluster might not catch up. #1193
Comments
As per the logs, it seems that node 3 was killed before it joined the cluster. On restart, it recognised that this server had been restarted. The code in worker/draft.go assumes that a restarted node has already joined the cluster, so it doesn't call joinpeers. I think calling joinpeers even after a restart should fix the issue. The relevant upstream code is in etcd/raft/raft.go.
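Separately from the etcd code referenced above, the startup decision being discussed can be sketched as follows. This is a minimal illustration only, not the actual code in worker/draft.go: the `startupState` type and both functions are hypothetical names, and the sketch assumes that repeating a join request is harmless for a node that is already a member.

```go
package main

import "fmt"

// startupState is a hypothetical summary of what a server knows at startup.
type startupState struct {
	restarted bool // state from a previous run was found on disk
	hasPeers  bool // other servers are known for this group
}

// shouldJoinCurrent mirrors the behaviour described in the comment:
// a restarted node is assumed to be a cluster member already, so it
// never asks to join again.
func shouldJoinCurrent(s startupState) bool {
	return s.hasPeers && !s.restarted
}

// shouldJoinProposed mirrors the suggested fix: ask to join whenever
// peers exist, even after a restart. Assumption: a duplicate join
// request is safe for a node that had in fact joined before.
func shouldJoinProposed(s startupState) bool {
	return s.hasPeers
}

func main() {
	// Node 3 in this issue: killed before joining, then restarted.
	node3 := startupState{restarted: true, hasPeers: true}
	fmt.Println("current logic joins: ", shouldJoinCurrent(node3))
	fmt.Println("proposed logic joins:", shouldJoinProposed(node3))
}
```

Under these assumptions, the current logic skips the join for a node that was killed before it ever became a member, which would match the symptom of node 3 never receiving log entries.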
Looks like a small change. @janardhan1993 maybe create the PR, so @srh can check if he still sees this issue.
Hey @srh, can you please try this on the issues/github branch?
It seems not to work. I'll send you the data dir that I made before, with which node 3 still doesn't catch up at restart.
The reason it seemed not to work is that I also tried recreating the issue from scratch by killing node 3 very quickly. I couldn't do that either. But this was on the binary without a fix applied! So I'd say, let's make the fix, if we understand how it works.
It works when restarting from the rebased bug/raft branch (that avoids badger incompatibility problems).
I ran into this while investigating #1180. One time I ran into an anomalous situation.
If you do the following:
In other words, do as described here: #1169 (comment)
It's possible that, at this point, server 3 ends up not receiving any predicates from 1 or 2. That is, it gets no Raft log entries and no MsgApp messages. (In #1180 the bug is that the messages arrive slowly.) The situation persists if you kill and revive server 3, or if you kill all the servers and revive them.
Here is the end of the log for server 3.
I saved the servers' output files and will get back to this later.