Fix raft failure from updated etcd/raft library
After 3.3.x, etcd made a small change to the raft library that broke
Swarmkit. It also, as it turns out, broke their raft example.

The core issue is that a snapshot embeds the ConfState from when the
snapshot was created. As it turns out, that is not the ConfState a
snapshot message is supposed to carry. It should carry the current
ConfState, the one in effect when the snapshot is sent.

When adding a new node to the quorum, the node must be caught up using a
snapshot. Previously, we were sending the snapshot exactly as it was
taken. However, because the snapshot predates the node's membership
in the cluster, the ConfState does not have the new node in it.

The change to the raft library is that it began checking the snapshot
ConfState and rejecting snapshots whose ConfState is missing the
receiving node. The fix, as mentioned above, is to overwrite the
ConfState in the snapshot with the current ConfState before sending.

Signed-off-by: Drew Erny <[email protected]>
dperny committed Mar 7, 2022
1 parent 01ba48a commit b7c49a6
Showing 1 changed file with 6 additions and 0 deletions.
6 changes: 6 additions & 0 deletions manager/state/raft/raft.go
@@ -607,6 +607,12 @@ func (n *Node) Run(ctx context.Context) error {
}

for _, msg := range rd.Messages {
	// if the message is a snapshot, before we send it, we should
	// overwrite the original ConfState from the snapshot with the
	// current one
	if msg.Type == raftpb.MsgSnap {
		msg.Snapshot.Metadata.ConfState = n.confState
	}
	// Send raft messages to peers
	if err := n.transport.Send(msg); err != nil {
		log.G(ctx).WithError(err).Error("failed to send message to member")
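The failure and the fix described in the commit message can be reproduced in isolation. The following is a minimal sketch, not the Swarmkit or etcd code: `ConfState`, `Snapshot`, `Message`, `accepts`, and `prepare` are hypothetical, simplified stand-ins for the `raftpb` types and for the membership check the commit message attributes to the updated raft library. It assumes a three-node cluster whose snapshot was taken before node 4 joined.

```go
package main

import "fmt"

// ConfState is a hypothetical stand-in for the cluster membership
// recorded in a snapshot (not the real raftpb.ConfState).
type ConfState struct {
	Nodes []uint64
}

// Snapshot is a hypothetical stand-in for a raft snapshot and its metadata.
type Snapshot struct {
	Index     uint64
	ConfState ConfState
}

// Message is a hypothetical stand-in for an outgoing raft message.
type Message struct {
	To       uint64
	IsSnap   bool
	Snapshot Snapshot
}

// accepts models the receiver-side check described in the commit message:
// a node rejects a snapshot whose ConfState does not list that node.
func accepts(nodeID uint64, snap Snapshot) bool {
	for _, id := range snap.ConfState.Nodes {
		if id == nodeID {
			return true
		}
	}
	return false
}

// prepare mirrors the fix: before a snapshot message is sent, the ConfState
// stored in the snapshot is overwritten with the current one.
func prepare(msg Message, current ConfState) Message {
	if msg.IsSnap {
		msg.Snapshot.ConfState = current
	}
	return msg
}

func main() {
	// The snapshot was taken while the cluster had only nodes 1, 2 and 3.
	stale := Snapshot{Index: 10, ConfState: ConfState{Nodes: []uint64{1, 2, 3}}}
	// Node 4 has since joined, so the current ConfState includes it.
	current := ConfState{Nodes: []uint64{1, 2, 3, 4}}

	msg := Message{To: 4, IsSnap: true, Snapshot: stale}

	fmt.Println("sent as taken:", accepts(4, msg.Snapshot))
	fmt.Println("sent with current ConfState:", accepts(4, prepare(msg, current).Snapshot))
}
```

In this sketch the snapshot sent as taken is rejected by node 4, while the same snapshot with its ConfState overwritten is accepted. Only the membership metadata changes; the snapshot index and data are left untouched.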
