Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: Avoid scanning raft log in becomeLeader #9073

Merged
merged 1 commit into from
Jan 9, 2018

Conversation

bdarnell
Copy link
Contributor

@bdarnell bdarnell commented Dec 30, 2017

Scanning the uncommitted portion of the raft log to determine whether
there are any pending config changes can be expensive. In
cockroachdb/cockroach#18601, we've seen that a new leader can spend so
much time scanning its log post-election that it fails to send
its first heartbeats in time to prevent a second election from
starting immediately.

Instead of tracking whether a pending config change exists with a
boolean, this commit tracks the latest log index at which a pending
config change could exist. This is a less expensive solution to
the problem, and the impact of false positives should be minimal since
a newly-elected leader should be able to quickly commit the tail of
its log.

Scanning the uncommitted portion of the raft log to determine whether
there are any pending config changes can be expensive. In
cockroachdb/cockroach#18601, we've seen that a new leader can spend so
much time scanning its log post-election that it fails to send
its first heartbeats in time to prevent a second election from
starting immediately.

Instead of tracking whether a pending config change exists with a
boolean, this commit tracks the latest log index at which a pending
config change *could* exist. This is a less expensive solution to
the problem, and the impact of false positives should be minimal since
a newly-elected leader should be able to quickly commit the tail of
its log.
@codecov-io
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (master@fdec12c). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #9073   +/-   ##
=========================================
  Coverage          ?   76.08%           
=========================================
  Files             ?      359           
  Lines             ?    29944           
  Branches          ?        0           
=========================================
  Hits              ?    22783           
  Misses            ?     5576           
  Partials          ?     1585
Impacted Files Coverage Δ
raft/node.go 89.73% <ø> (ø)
raft/rawnode.go 68.33% <ø> (ø)
raft/raft.go 91.5% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fdec12c...8d8f319. Read the comment docs.

@xiang90
Copy link
Contributor

xiang90 commented Jan 3, 2018

@siddontang

can you take a look at this PR first? I will take a look at it in a couple of days.

@siddontang
Copy link
Contributor

@xiang90 ok

@xiang90
Copy link
Contributor

xiang90 commented Jan 5, 2018

@bdarnell The overall ideal looks good to me. This will not affect etcd a lot since etcd is not reconfig heavy nor having issues with reading log tails (since we keep logs purely in mem). I would like @siddontang from tikv side to have a look before we merge this in.

@@ -682,12 +687,13 @@ func (r *raft) becomeLeader() {
r.logger.Panicf("unexpected error getting uncommitted entries (%v)", err)
}

nconf := numOfPendingConf(ents)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that using pendingConfIndex only reduce calling numOfPendingConf here, can this reduce the performance too much?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not have a significant negative impact on performance. The only thing affected is the ability to propose new config changes, and the impact is small. The worst case scenario is when you have one up-to-date follower and one follower that is behind, then the leader dies and the up-to-date follower becomes the new leader.

Before, the new leader could immediately propose a config change, but that config change wouldn't be applied until the other follower catches up (acknowledging the log entries, but not necessarily applying them)

With this change, the follower must catch up before any config change can be proposed. So this only adds one round trip to membership changes proposed immediately after an election.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it

@siddontang
Copy link
Contributor

LGTM

@xiang90
Copy link
Contributor

xiang90 commented Jan 9, 2018

@bdarnell thanks. merging.

@xiang90 xiang90 merged commit ed1ff9e into etcd-io:master Jan 9, 2018
@bdarnell bdarnell deleted the pending-conf-index branch January 9, 2018 19:59
bdarnell added a commit to cockroachdb/vendored that referenced this pull request Feb 10, 2018
bdarnell added a commit to bdarnell/cockroach that referenced this pull request Feb 10, 2018
bdarnell added a commit to bdarnell/cockroach that referenced this pull request Feb 10, 2018
bdarnell added a commit to cockroachdb/vendored that referenced this pull request Apr 17, 2018
bdarnell added a commit to bdarnell/cockroach that referenced this pull request Apr 18, 2018
Picks up a cherry-picked version of etcd-io/etcd#9073, to fix cockroachdb#18601

Release note (bug fix): Fixes potential cluster unavailability after
raft logs grow too large.
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Apr 19, 2018
24889: cherrypick-1.1: build: Update etcd r=bdarnell a=bdarnell

Picks up a cherry-picked version of etcd-io/etcd#9073, to fix #18601

Release note (bug fix): Fixes potential cluster unavailability after
raft logs grow too large.

Co-authored-by: Ben Darnell <[email protected]>
bdarnell added a commit to bdarnell/etcd that referenced this pull request Jun 26, 2018
I meant to do this in etcd-io#9073, but sent the PR before it was finished.
The last log index is known directly; there is no need to fetch any
entries here.
tbg added a commit to cockroachdb/vendored that referenced this pull request Jun 27, 2018
bdarnell added a commit to cockroachdb/etcd that referenced this pull request Jun 27, 2018
I meant to do this in etcd-io#9073, but sent the PR before it was finished.
The last log index is known directly; there is no need to fetch any
entries here.
@nvanbenschoten
Copy link
Contributor

@bdarnell did we mean to use r.pendingConfIndex in place of the following logic as well?

etcd/raft/raft.go

Lines 863 to 870 in 90c5968

ents, err := r.raftLog.slice(r.raftLog.applied+1, r.raftLog.committed+1, noLimit)
if err != nil {
r.logger.Panicf("unexpected error getting unapplied entries (%v)", err)
}
if n := numOfPendingConf(ents); n != 0 && r.raftLog.committed > r.raftLog.applied {
r.logger.Warningf("%x cannot campaign at term %d since there are still %d pending configuration changes to apply", r.id, r.Term, n)
return nil
}

Seems like we're still scanning the uncommitted portion of the raft log before campaigning.

@bdarnell
Copy link
Contributor Author

bdarnell commented Oct 9, 2018

@nvanbenschoten I think I missed that by accident, but I'm not sure we can use pendingConfIndex there.

  1. This is not the uncommitted portion, this is the committed-but-not-applied portion. This is normally much smaller, although it can still be large.
  2. pendingConfIndex is only set on the leader; followers and candidates don't track it.
  3. To avoid scanning the uncommitted log, we set pendingConfIndex to a value that is too high and allows false positives. This is OK when the consequence of a false positive is delaying any future ProposeConfChanges, but if it prevents nodes from campaigning or becoming leader, that's more of a problem. If the group has no leader and no one willing to campaign, no new entries will get committed and we'll be stuck. (That's not actually an issue here because we're talking about unapplied entries instead of uncommitted, and unapplied entries will apply even without a leader).
  4. Blocking before campaigning is much less of a problem than blocking after becoming leader. The latter would cause the new leader to immediately miss heartbeats and lose its leadership, but a follower that blocks for a while before deciding to campaign won't really hurt anything.

I'm not sure why this check is even here. It seems to be addressing an edge case: if there are unapplied config changes that we know are committed, we might as well apply them before starting a campaign so we send MsgVotes to the right nodes. But it's not needed for correctness; in the worst case if there is a quorum of live nodes in the new config but not in the old one the election will fail and need to be retried (which it will, after a timeout). But this is an expensive way to avoid that problem, and I think it would be better to just remove the pending conf check here entirely.

Copy link

@Nidaozi Nidaozi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

6 participants