Leadership transfer #306

Merged: 53 commits into master on May 17, 2019
Conversation

@hanshasselberg (Member) commented Jan 28, 2019:

This PR implements the leadership transfer extension described in chapter 3.10 of the thesis.

Background:

Consul performs some setup after acquiring leadership. It is possible for that setup to fail, but there is no good way to step down as leader. It is possible to use DemoteVoter as shown in hashicorp/consul#5247, but this is suboptimal because it relies on Consul's autopilot to promote the old leader back to a voter.
Since the thesis describes a perfectly good way to do this, the leadership transfer extension, we decided to implement that instead. Doing it this way also helps other teams, since it is more generic.

The necessary steps to perform are:

  1. Leader picks a target to transfer to
  2. Leader stops accepting client requests
  3. Leader makes sure its logs are replicated to the target
  4. Leader sends a TimeoutNow RPC request to the target
  5. Target receives the TimeoutNow request, which triggers an election
    6a. If the election is successful, a message with the new term makes the old leader step down
    6b. If the leadership transfer does not complete within the election timeout, the old leader resumes operation
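
From the caller's perspective, this boils down to requesting a transfer and waiting on the returned future. A minimal usage sketch, assuming the future-returning LeadershipTransfer() API this PR introduces (exact names may differ from the final code):

package main

import (
	"log"

	"github.com/hashicorp/raft"
)

// transferAway asks the current leader to hand leadership to the most
// up-to-date follower and waits for the result. This is a sketch, not the
// merged implementation.
func transferAway(r *raft.Raft) {
	future := r.LeadershipTransfer()
	if err := future.Error(); err != nil {
		// Step 6b from the caller's point of view: the transfer did not
		// complete (timeout, lost leadership, ...) and the old leader
		// resumes operation.
		log.Printf("leadership transfer failed: %v", err)
		return
	}
	log.Println("leadership transfer complete")
}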

Todo:

  • mention leadership transfer in the README
  • add a fuzzy test for leadership transfer
  • have the leader only partially respond to requests until the transfer either completes or fails
  • if the leadership transfer does not complete within roughly the election timeout, the old leader resumes operation
  • add a leadership transfer flag to RequestVote as described in chapter 4.2.3 (Disruptive servers)
  • expire LeaderLeaseTimeout before the leadership transfer as described in chapter 6.4.1 (Using clocks to reduce messaging for read-only queries)
  • are we calling the replication correctly?
  • add documentation in the code

Resources:

@mitchellh (Contributor):

Exciting! I know this is a WIP so you'd probably get to it later, but as a TODO I'd add clear documentation on all the exported methods at least, if not the internal ones too. 😄

@hanshasselberg (Member, Author):

@mitchellh Absolutely, I added the todos!

@hanshasselberg hanshasselberg changed the title [WIP] Transition leadership [WIP] Leadership transfer Jan 30, 2019
@hanshasselberg hanshasselberg changed the title [WIP] Leadership transfer Leadership transfer Jan 31, 2019
@banks (Member) commented Feb 1, 2019:

@tylertreat FYI. I know you were interested in this!

@banks (Member) left a comment:

This looks awesome, Hans.

I think we have a few correctness/timing/race issues to look at, as noted in the comments, but I think the tweaks to fix those aren't huge - just subtle!

The "how long do we wait for replication" one is the most subtle, I think.

raft.go (outdated):

}()

// Step 4: send TimeoutNow message to target server. Technically the
// leadership transfer is done now from the point of view of the leader.
Member:

I think this comment is a bit confusing given that the goroutine above is executing in parallel - the leader has done what it can, but it still needs to take responsibility if the transfer times out. Maybe just "transfer is done" is confusing because it's not really done until the election happens and the new leader takes over...

@mkeeler (Member) left a comment:

Paul covered most of it. Just adding my 2¢ in a few places.

@hanshasselberg (Member, Author):

I addressed the fact that the replication needs to be included in the timeout as well. I think it is much better now, because not only timeouts but also loss of leadership can stop the transfer at any point in time:

raft/raft.go, lines 540 to 558 in ded6ce1:

go r.leadershipTransfer(future.ID, future.Address, stopCh, doneCh)
select {
case <-timeout:
	err := fmt.Errorf("leadership transfer timeout")
	r.logger.Printf("[DEBUG] raft: %v", err)
	future.respond(err)
	stopCh <- err
	<-doneCh
case err := <-verify.errCh:
	r.logger.Printf("[DEBUG] raft: %v", err)
	stopCh <- err
	future.respond(err)
	<-doneCh
case err := <-doneCh:
	if err != nil {
		r.logger.Printf("[DEBUG] raft: %v", err)
	}
	future.respond(err)
}

@hanshasselberg (Member, Author):

I also defer resetting the leadership-transfer-in-progress flag now:

raft/raft.go, lines 735 to 736 in ded6ce1:

r.leaderState.leadershipTransferInProgress = true
defer func() { r.leaderState.leadershipTransferInProgress = false }()

@hanshasselberg (Member, Author):

OK, I am at a point where I have addressed the feedback except for these two points:

  1. protocol version: there is no point checking for version 3, because not all v3 raft builds have leadership transfer. Should we create v4 for it?
  2. which actions need to be blocked? The reason I did it this way is that the paper says "The prior leader stops accepting new client requests", which I interpreted as every request that comes in through the API. Since only the leader can transfer leadership, and the leader initiates every action, I think what I did is enough. I might be wrong.

@hanshasselberg (Member, Author):

In a call with @banks we decided that we won't cut a new version. I will make it clear in the docs that you have to run the latest code in order to use leadership transfer - that v3 is not sufficient.

With regards to blocking actions coming through commitCh, I don't think we should do that. The commitCh receives a message when an entry is committed, i.e. accepted by the majority of the followers. This doesn't change the raft state, but it changes the state of the leader. However, even if we chose not to process these messages, we would basically just be missing work while we are still the leader. If we happen to transfer to a follower that has this commit, everything is fine. If we transfer to a follower that doesn't, it won't win the election. Not processing commitCh doesn't change that.
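
As a rough sketch of what "stops accepting new client requests" means here, API-initiated operations arriving in the leader loop can be rejected while a transfer is in flight. The names getLeadershipTransferInProgress and ErrLeadershipTransferInProgress follow this PR's discussion; the merged code may differ in detail:

for r.getState() == Leader {
	select {
	case newLog := <-r.applyCh:
		if r.getLeadershipTransferInProgress() {
			// The prior leader stops accepting new client requests
			// while a leadership transfer is in flight.
			newLog.respond(ErrLeadershipTransferInProgress)
			continue
		}
		// ... dispatch the new log entry as usual ...
	case <-r.shutdownCh:
		return
	}
}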

@hanshasselberg (Member, Author) commented Feb 13, 2019:

If a leader attempts to transfer leadership to a follower that doesn't have that feature, it will fail gracefully on both the follower and the leader, and the old leader remains the leader. On the follower:

    2019/02/14 00:17:16 [ERR] raft-net: Failed to decode incoming command: unknown rpc type 3

and on the leader as well:

    2019/02/14 00:17:16 [WARN] consul: failed to transfer leadership: failed to make TimeoutNow RPC to dd9a2744-96ec-78fc-f69c-d7376e207946: EOF

If a node is asked to vote for another node that is the target of a leadership transfer, it will just vote for it as usual.
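
For reference, the voter-side handling of the flag from chapter 4.2.3 can be sketched like this (field and helper names are illustrative, not necessarily the merged code): a server that already knows a leader normally rejects vote requests to avoid disruption, unless the request was triggered by a leadership transfer.

// Reject the vote request if we already know a different leader, unless the
// candidate was explicitly asked to run via a leadership transfer.
candidate := r.trans.DecodePeer(req.Candidate)
if leader := r.Leader(); leader != "" && leader != candidate && !req.LeadershipTransfer {
	r.logger.Printf("[WARN] raft: rejecting vote request from %v since we already have a leader: %v",
		candidate, leader)
	return
}
// ... continue with the usual term and log up-to-dateness checks ...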

@mkeeler (Member) left a comment:

Looks good to me now.

@banks (Member) left a comment:

Phew! I think this looks good. There are a couple of minor things mentioned inline but not important enough to need another round of reviews!

I should also ask - the failing CI tests, are those same tests failing in master and/or passing locally? Sad as it is, I know the test suite for this is not very robust in CI so we probably can't hold up the PR waiting for that to happen but would be good to just confirm that you verified no failures could be relevant to these changes - even if they are intermittent.

raft.go (outdated):

	replState map[ServerID]*followerReplication
	notify    map[*verifyFuture]struct{}
	stepDown  chan struct{}
	lease     <-chan time.Time
Member:

Do we need to move this any more? I think you did it because originally you were waiting on it in the transfer until I pointed out that wasn't necessary.

I don't think it matters much so happy to leave it if that's easier but not sure it's necessary to pull it out of the leader loop method any more?

raft.go (outdated):

@@ -341,6 +356,8 @@ func (r *Raft) runLeader() {
	r.leaderState.replState = make(map[ServerID]*followerReplication)
	r.leaderState.notify = make(map[*verifyFuture]struct{})
	r.leaderState.stepDown = make(chan struct{}, 1)
	r.leaderState.lease = time.After(r.conf.LeaderLeaseTimeout)
Member:

Hmm, is this right? It seems to be a remnant of the original lease stuff that turned out not to be relevant?

raft_test.go (outdated):

func TestRaft_LeadershipTransferResetsLeaderLease(t *testing.T) {
	t.Skip("How do I test this?")
}
Member:

Shouldn't even be a test any more with lease stuff removed?

Member Author:

agreed!

raft_test.go (outdated):

func TestRaft_LeadershipTransferToUnresponsiveServer(t *testing.T) {
	t.Skip("How do I test this?")
}
Member:

Should we just remove these if we have no way to test them?

Member Author:

agreed!

raft.go (outdated):

// return early if this server is up to date
if state.nextIndex > target {
	return &server
}
Member:

Is this necessary?

The loop isn't long and would always find the most up-to-date follower anyway, so I don't think it's a big performance win. In some cases it might even work out to be less than optimal for transfer time, because there may have been 10 new logs recorded since our getLastIndex above, and the first server might be all 10 behind while another server in the list is only 1 behind.

Member Author:

Agreed. Fixed in 8deca55.
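
For context, a pickServer without the early return might look roughly like this (a sketch based on the discussion, not necessarily the code in 8deca55):

// pickServer returns the follower that is most up to date, i.e. the one
// whose replication state reports the highest next index. It is meant to be
// called only from the main leader loop goroutine.
func (r *Raft) pickServer() *Server {
	var pick *Server
	var current uint64
	for i := range r.configurations.latest.Servers {
		server := &r.configurations.latest.Servers[i]
		if server.ID == r.localID {
			continue
		}
		state, ok := r.leaderState.replState[server.ID]
		if !ok {
			continue
		}
		// Read atomically; the replication goroutine updates nextIndex concurrently.
		if nextIdx := atomic.LoadUint64(&state.nextIndex); nextIdx > current {
			current = nextIdx
			pick = server
		}
	}
	return pick
}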

if server.ID == r.localID {
	continue
}
state, ok := r.leaderState.replState[server.ID]
Member:

Is this OK? In general leaderState is expected to only be accessed by the single goroutine running the leader loop. Does this always run there? I think it does but worth checking. If so I suggest we update the comments on these new methods that assume they are part of the leader loop to make that explicit?

Member Author:

Great catch, it wasn't called from the leader loop! Plus I found another place where leaderState is accessed outside of the leader loop. Fixed both things.


@james-lawrence (Contributor):

Quoting @banks: "I should also ask - the failing CI tests, are those same tests failing in master and/or passing locally?"

@banks, there are a bunch of minor fixes (by freeekanayaka) in PR form already that improve the CI stability of master. You might want to consider merging those and rebasing on top.

@banks (Member) left a comment:

Super minor Q which may be wrong but could be a race. SOOOO close!

raft.go (outdated):

	return nil
}

// pickServer returns the follower that is most up to date.
Member:

Should we add a comment that this is only to be called from the leader loop?

Member Author:

done!

raft.go (outdated):

r.setLeadershipTransferInProgress(true)
defer func() { r.setLeadershipTransferInProgress(false) }()

for repl.nextIndex <= r.getLastIndex() {
Member:

This access of followerReplication.nextIndex made me wonder about thread safety. Presumably this is racy, since the replication thread is still running and could be writing to nextIndex concurrently?

Member Author:

Using atomic now to protect from the race. Great catch!
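
For illustration, the fix amounts to routing every read and write of nextIndex through sync/atomic so the leader loop and the replication goroutine don't race. A minimal sketch, with illustrative accessor names:

// getNextIndex atomically reads the next log index to send to this follower.
func (s *followerReplication) getNextIndex() uint64 {
	return atomic.LoadUint64(&s.nextIndex)
}

// setNextIndex atomically overwrites the next log index to send to this follower.
func (s *followerReplication) setNextIndex(index uint64) {
	atomic.StoreUint64(&s.nextIndex, index)
}

The wait loop shown above then becomes for repl.getNextIndex() <= r.getLastIndex() { ... }, and the replication goroutine uses the setter.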

@hanshasselberg hanshasselberg merged commit eba8343 into master May 17, 2019
@hanshasselberg hanshasselberg deleted the transition_leadership branch May 17, 2019 16:02