-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implemented leader lease when quorum check is on. #5468
Conversation
@@ -565,15 +565,25 @@ func (r *raft) Step(m pb.Message) error { | |||
case m.Term > r.Term: | |||
lead := m.From | |||
if m.Type == pb.MsgVote { | |||
if r.state == StateFollower && r.checkQuorum && r.electionElapsed < r.electionTimeout { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is also true for leader? right? also let's put the conditional var checking first
so
if r.checkQuorum && r.state != StateCandidate && r.electionElapsed < r.electionTimeout
b.randomizedElectionTimeout = b.electionTimeout + 1 | ||
} | ||
|
||
for i := 0; i < b.electionTimeout; i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's just do 2 * b.electionTimeout and remove line 1274-1276?
@xiang90 PTAL Thanks |
f5c0fbb
to
2e0518a
Compare
|
||
nt := newNetwork(a, b, c) | ||
|
||
// Prevent campaigning from b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i do not think we need line 1239->1242.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or at least line 1240 is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this case will break without line 1240
i think we need to let b's elapsedTimeout reach to electionTimeout so that it can vote for a,
and also if randomizedElectionTimeout is exactly electionTimeout(that's gonna happen if the random part of timeout is 0) b is gonna start a vote, it affects the behavior of this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment for this? i think this is tricky actually.
// the reason causes a stuck candidate is because it might be rejected by other peers if | ||
// quorum check is on and those peers are still within leader lease, it ends up having a | ||
// higher term than the quorum | ||
func TestFreeStuckPeerWithCheckQuorum(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestFreeStuckCandidateWithCheckQuorum ensures that a candidate with a higher term can disrupt the leader even if the leader still "officially" holds the lease. The leader is expected to step down and adopt the candidate's term.
/cc @bdarnell Can you or anyone from cockroach team give this a final review? Thanks! |
@xiang90 thank you for all the help, PTAL thanks |
@swingbach There is one corner case we need to fix in this pull request. A new raft node is not promotable until itself is added into the peer list (a node can only be considered part of the cluster once it added itself into the cluster). If the node is not promotable, then it never increases it election (see https://github.com/coreos/etcd/blob/master/raft/raft.go#L427). Then it never votes to anyone. We need to somehow fix this... |
Also let's cover this by a new test case, a non-promotable node can vote for others. |
@@ -565,15 +565,25 @@ func (r *raft) Step(m pb.Message) error { | |||
case m.Term > r.Term: | |||
lead := m.From | |||
if m.Type == pb.MsgVote { | |||
if r.checkQuorum && r.state != StateCandidate && r.electionElapsed < r.electionTimeout { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment like: "Section 4.2.3: 'if a server receives a RequestVote request within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote.' This is tied to the checkQuorum feature because this is the only case in which leaders track the time elapsed since they last confirmed their leadership. In StateCandidate, r.electionElapsed is small, but by definition we have not recently heard from the current leader if we are in this state"
b.delProgress(2) | ||
|
||
if b.promotable() { | ||
t.Errorf("promotable = %v, want false", b.promotable()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t.Fatalf instead of errorf (we should stop testing if the assumption does not hold)
LGTM. Please squash commits. |
31f4ecf
to
337ef64
Compare
already fixed the assertion and squashed all the commits @xiang90 PTAL, Thanks |
@swingbach Merged. Thanks. |
port etcd leader lease. etcd-io/etcd#5468 etcd-io/etcd#5809
port etcd leader lease. etcd-io/etcd#5468 etcd-io/etcd#5809
port etcd leader lease. etcd-io/etcd#5468 etcd-io/etcd#5809
Make the check quorum mechanism behaves more like a leader lease, also solve stuck peer problem
see also pr 5451