raft: Fix infinite election loop #1310
Conversation
@aaronlehmann looks like a real CI failure.
Tests passed locally for me. Looks like the failure in CI is because the test needs an adjustment now that the conditions for proposing a value are stricter. I'll work on fixing the test tomorrow. I think the random failures we saw in certain tests were related to this problem. To reproduce, set up a cluster of three nodes, then kill dockerd on all three, and restart the old leader and one of the followers at the same time. Repeat the restart of the two nodes until it gets stuck.
Force-pushed from e3031a9 to 8a89264
I think I fixed the tests. Let's see what happens in CI. I also added a commit that prevents a follower from sending a proposal, or the leader from accepting one. Only the leader should make proposals, so we avoid ever having conflicting proposals. There is already code to prevent this, but there are corner cases where the leader could lose its leadership status right while it's proposing a value.
It looks like the election loops on startup are caused by something proposing a value before all entries in the log have become committed. This would cause ApplyStoreActions to try to acquire a lock on the raft store while processInternalRaftRequest is holding the lock (since it's called by the memory store with the lock held). This is a deadlock: it blocks the loop in Node.Run and prevents further participation in the cluster.

To solve the issue, don't signal that the local node has become the leader until it has committed all the uncommitted entries from before. Also, block proposals until it is caught up.

Signed-off-by: Aaron Lehmann <[email protected]>
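To make that locking cycle concrete, here is a minimal, self-contained Go sketch of the same deadlock shape; a plain sync.Mutex and a channel stand in for the memory store's write lock and the commit notification, and none of the names below are swarmkit's actual code:

```go
package main

import "sync"

func main() {
	var storeLock sync.Mutex         // stands in for the memory store's write lock
	committed := make(chan struct{}) // would be closed when the proposal commits

	// The proposer (the processInternalRaftRequest role): it was called from a
	// store update, so it already holds the store lock.
	storeLock.Lock()

	// The raft run loop (the ApplyStoreActions role): it needs the same lock to
	// apply an earlier log entry, and only after that would it commit the new
	// proposal.
	go func() {
		storeLock.Lock() // blocks forever: the proposer holds the lock
		storeLock.Unlock()
		close(committed)
	}()

	// The proposer waits for its proposal to commit, which needs the run loop.
	// Neither side can make progress; running this prints the Go runtime's
	// "all goroutines are asleep - deadlock!" error, which is the point of the demo.
	<-committed
	storeLock.Unlock()
}
```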
Only the leader should make proposals, so we can guarantee that no two proposals will conflict. This commit prevents a follower from sending a proposal, or the leader from accepting one. We already have code to prevent this, but there are certain corner cases where a leader could become a follower while it's proposing a value.

Signed-off-by: Aaron Lehmann <[email protected]>
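A rough, self-contained sketch of that guard, using made-up names rather than swarmkit's actual types: the proposing path simply refuses to submit anything unless the node currently believes it is the leader.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

var errLostLeadership = errors.New("node is no longer the leader")

// node is a stand-in for the raft node wrapper; isLeader and submit are
// placeholders for the real leadership check and the underlying raft propose call.
type node struct {
	isLeader func() bool
	submit   func(ctx context.Context, data []byte) error
}

// propose only goes through when this node believes it is the leader, so a
// follower never originates a proposal and a demoted leader stops accepting them.
func (n *node) propose(ctx context.Context, data []byte) error {
	if !n.isLeader() {
		return errLostLeadership
	}
	return n.submit(ctx, data)
}

func main() {
	follower := &node{
		isLeader: func() bool { return false },
		submit:   func(context.Context, []byte) error { return nil },
	}
	fmt.Println(follower.propose(context.Background(), []byte("value")))
}
```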
Force-pushed from 8a89264 to 7e70556
Vendored this in Docker and it passes the integration tests.
Current coverage is 55.13% (diff: 52.17%)

@@             master      #1310    diff @@
===========================================
  Files            80         80
  Lines         12541      12558     +17
  Methods           0          0
  Messages          0          0
  Branches          0          0
===========================================
+ Hits           6919       6924      +5
- Misses         4678       4685      +7
- Partials        944        949      +5
ping @abronan
Removed the WIP and moved to code review. LGTM
LGTM |
PR moby#1310 ("Fix infinite election loop") solved a problem with election loops on startup by delaying new proposals until the leader has committed all its existing entries. This ensures that the state machine doesn't call ApplyStoreActions to commit a previous entry from the log while a new proposal is in process, since they both acquire a write lock over the memory store, which would deadlock.

Unfortunately, there is still a race condition which can lead to a similar deadlock. processInternalRaftRequest makes sure that proposals aren't started after the manager loses its status as the leader by first registering a wait for the raft request, then checking the leadership status. If the leadership status is lost before calling register(), then the leadership check should fail, since it happens afterwards. Conversely, if the leadership status is lost after calling register(), then cancelAll() in Run() will make sure this wait gets cancelled.

The problem with this is that the new code in PR moby#1310 calls cancelAll() *before* setting the leadership status. So it's possible that first we cancel all outstanding requests, then a new request is registered and successfully checks that we are still the leader, and then we set leader to "false". This request never gets cancelled, so it causes a deadlock. Nothing can be committed to the store until this request goes through, but it can't go through if we're not the leader anymore.

To fix this, swap the order of cancelAll so it happens after we change the leadership status variable. This means that no matter how the goroutines are interleaved, a new request will either cancel itself or be cancelled by Run when leadership is lost. I'm aware that this is ugly and I'm open to suggestions for refactoring or abstracting.

Also, out of extra caution, call cancelAll in the situation that would lead to a deadlock if there were any outstanding raft requests.

Signed-off-by: Aaron Lehmann <[email protected]>
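A small sketch of the reordering described here; the types, fields, and method names are assumptions for illustration, not the exact swarmkit code:

```go
package main

import "sync/atomic"

// waitRegistry is a stand-in for the table of in-flight raft requests;
// cancelAll cancels every wait that has been registered so far.
type waitRegistry struct{ cancels []func() }

func (w *waitRegistry) cancelAll() {
	for _, cancel := range w.cancels {
		cancel()
	}
	w.cancels = nil
}

type node struct {
	signalledLeadership uint32 // 1 while this node is acting as leader
	wait                waitRegistry
}

// onLoseLeadership flips the leadership flag *before* cancelling waits.
// Any request registered between the two steps either sees the flag already
// cleared (and bails out of its own leadership check) or is still registered
// when cancelAll runs, so no wait can be left dangling and deadlock the store.
func (n *node) onLoseLeadership() {
	atomic.StoreUint32(&n.signalledLeadership, 0)
	n.wait.cancelAll()
}

func main() {
	n := &node{}
	n.onLoseLeadership()
}
```

Combined with registering the wait before checking leadership on the request side, this ordering closes the race no matter how the goroutines interleave.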
It looks like the election loops on startup are caused by something proposing a value before all entries in the log have become committed. This would cause ApplyStoreActions to try to acquire a lock on the raft store while processInternalRaftRequest is holding the lock (since it's called by the memory store with the lock held). This is a deadlock: it blocks the loop in Node.Run and prevents further participation in the cluster.

To solve the issue, don't signal that the local node has become the leader until it has committed all the uncommitted entries from before. Also, block proposals until it is caught up.
WIP because we still need to test and confirm that this fixes the problem.
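For reference, a minimal sketch of that "wait until caught up" idea, under assumed names rather than the real swarmkit fields: the run loop only broadcasts leadership, and by extension starts accepting proposals, once everything that was already in the log has been applied.

```go
package main

import "fmt"

// node tracks how far the local state machine has applied relative to the log
// as it stood when this node won the election. Field names are illustrative.
type node struct {
	appliedIndex    uint64 // highest log index applied to the memory store
	leadershipIndex uint64 // last log index at the moment the election was won
	signalled       bool
}

// caughtUp reports whether every entry from before the election has been
// applied; until then the node neither signals leadership nor proposes.
func (n *node) caughtUp() bool {
	return n.appliedIndex >= n.leadershipIndex
}

// maybeSignalLeadership is called from the run loop after applying entries.
func (n *node) maybeSignalLeadership(broadcast func()) {
	if !n.signalled && n.caughtUp() {
		n.signalled = true
		broadcast()
	}
}

func main() {
	n := &node{appliedIndex: 5, leadershipIndex: 7}
	n.maybeSignalLeadership(func() { fmt.Println("too early") }) // nothing printed yet
	n.appliedIndex = 7
	n.maybeSignalLeadership(func() { fmt.Println("leadership signalled") })
}
```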