
Infrakit recovery options when less than half of the managers are available #741

kaufers opened this issue Nov 1, 2017 · 3 comments


@kaufers
Contributor

kaufers commented Nov 1, 2017

We are testing some HA recovery scenarios and we killed 2 of the 3 managers; this resulted in the following:

WARN[10-31|22:13:13] Leadership status is uncertain           module=manager err="Error response from daemon: rpc error: code = 2 desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online." stack=[github.com/docker/infrakit/pkg/manager/manager.go:259] fn=github.com/docker/infrakit/pkg/manager.(*manager).Start.func2
INFO[10-31|22:13:13] Lost leadership                          module=manager stack="[github.com/docker/infrakit/pkg/manager/manager.go:337 github.com/docker/infrakit/pkg/manager/manager.go:224]" fn=github.com/docker/infrakit/pkg/manager.(*manager).onLostLeadership

At this point there is no leader and Infrakit will never attempt to recover the manager nodes because those operations only run on the leader (and without manual intervention a leader will never be elected).

In theory, if a temporary leader could be determined, then:

  • The group controller polling could run on that node and detect that the number of managers is below the desired number
  • Issue Provision requests to the instance provider to create the managers
  • Defer temporary leadership status back to the swarm leader

This seems to align with the self-healing goals of Infrakit.

This flow seems to be straightforward when only 1 of 3 managers is left (since the last manager would assume temporary leadership); however, it is not as clean when only 2 of 5 are left.

A thought on handling the 2/5 scenario would be to rely on the fact that all managers have a unique LogicalID. If each manager were aware of which other managers are still around, then each node could check whether it is the lowest remaining manager (based on a string sort of the LogicalIDs); if so, that node could self-elect. A sketch of this check follows.
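To make the rule concrete, here is a minimal sketch of the lowest-LogicalID check; the function and ID names are made up for illustration and are not Infrakit's actual API:

```go
package main

import (
	"fmt"
	"sort"
)

// shouldSelfElect reports whether this manager should assume temporary
// leadership: only if its own LogicalID sorts lowest among the managers
// it can still reach (including itself).
func shouldSelfElect(self string, reachable []string) bool {
	if len(reachable) == 0 {
		return false
	}
	ids := append([]string{}, reachable...)
	sort.Strings(ids)
	return ids[0] == self
}

func main() {
	// 2 of 5 managers left; only the lowest ID self-elects.
	remaining := []string{"mgr-4", "mgr-2"}
	fmt.Println(shouldSelfElect("mgr-2", remaining)) // true
	fmt.Println(shouldSelfElect("mgr-4", remaining)) // false
}
```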

@chungers
Contributor

chungers commented Nov 1, 2017

I am glad you're raising this issue. Clearly it's not sufficient to just have Infrakit give up when the backend it depends on loses quorum. For the purpose of discussion let's limit the scope to recovering the managers.... those in groups where logical IDs are specified. The goal here is to have any remaining manager nodes act in recovering the quorum, even if leadership isn't known at the moment.

To document the issue -- this is the current flow:

  • As part of initialization, each of the manager nodes checks for leadership when it comes up.
  • Each of the manager nodes continuously checks for leadership. The following actions are edge-triggered (see the sketch after this list):
    • On losing leader status, the manager calls "Free" on the Group controller it manages to pause monitoring.
    • On gaining leader status, the manager loads the user specs from the replicated store and performs a "Commit" to start monitoring.
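A rough sketch of that edge-triggered behavior, with hypothetical types standing in for the real manager and group controller (the actual plugin interfaces differ):

```go
package main

import "log"

// group models only the two calls discussed above.
type group interface {
	Free() error   // pause monitoring
	Commit() error // load specs and start monitoring
}

// watchLeadership reacts only to *changes* in leadership status
// (edge-triggered): Commit on gaining it, Free on losing it.
func watchLeadership(isLeader <-chan bool, g group) {
	var wasLeader bool
	for leader := range isLeader {
		if leader == wasLeader {
			continue // no edge; nothing to do
		}
		wasLeader = leader
		if leader {
			log.Println("gained leadership: committing stored specs")
			if err := g.Commit(); err != nil {
				log.Println("commit failed:", err)
			}
		} else {
			log.Println("lost leadership: freeing group")
			if err := g.Free(); err != nil {
				log.Println("free failed:", err)
			}
		}
	}
}

type noopGroup struct{}

func (noopGroup) Free() error   { return nil }
func (noopGroup) Commit() error { return nil }

func main() {
	ch := make(chan bool, 3)
	ch <- true
	ch <- true // duplicate status: not an edge, ignored
	ch <- false
	close(ch)
	watchLeadership(ch, noopGroup{})
}
```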

For N=3, I think this is fairly straightforward. When 1 of the 3 managers is left, the lone manager, even if not a leader, can take over and restore the quorum. There are some changes though:

  • For each manager that is not a leader, instead of being dormant, each node needs to continuously check the size of the group.
  • If size != 1, then no action.
  • If size == 1, then take over even if the leader status isn't known (sketched below).
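A sketch of what that non-leader polling could look like; describeGroupSize and takeOver are hypothetical hooks, not real Infrakit calls:

```go
package main

import (
	"fmt"
	"time"
)

// pollAsNonLeader runs on a manager that is not the leader. Instead of
// staying dormant, it periodically checks how many managers are left and
// takes over only when it is the lone survivor.
func pollAsNonLeader(describeGroupSize func() (int, error), takeOver func(), interval time.Duration, stop <-chan struct{}) {
	tick := time.NewTicker(interval)
	defer tick.Stop()
	for {
		select {
		case <-stop:
			return
		case <-tick.C:
			size, err := describeGroupSize()
			if err != nil {
				continue // can't tell; do nothing rather than act on bad data
			}
			if size == 1 {
				takeOver() // lone survivor: act even though leadership is unknown
				return
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	go pollAsNonLeader(
		func() (int, error) { return 1, nil }, // pretend only this manager is left
		func() { fmt.Println("taking over as temporary leader") },
		100*time.Millisecond,
		stop,
	)
	time.Sleep(300 * time.Millisecond)
	close(stop)
}
```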

As you pointed out, this won't work for the 2/5 scenario. In this case, the remaining managers (2) know they are not leaders... so either

  1. we'd have to pick 1 to act as leader, or

  2. each remaining node acts on its own.

Option 1 is tricky because now we need some kind of communication between the remaining managers. A possible network partition will get us into the split-brain problem here. So I wonder if we can explore option 2.

In the case where N-k managers are down, each of the k remaining managers can act; if each is responsible for starting up exactly 1 other node, then even in the worst case we will not over-provision (see the sketch below). As long as more nodes are coming back online, they are going to try to rejoin the quorum... and when a quorum can be established, a new leader will be elected. The new leader can always be the single actor and restore any missing nodes as necessary.
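A sketch of that "each survivor restores at most one peer" rule; the provisioner interface and spec argument are placeholders for whatever instance plugin call would actually be used:

```go
package main

import "fmt"

// provisioner is a stand-in for an instance plugin's Provision call.
type provisioner interface {
	Provision(spec string) error
}

// restoreOne issues at most a single Provision request per surviving
// manager, so k survivors never create more than k replacements in total.
func restoreOne(desired, current int, p provisioner, spec string) error {
	if desired-current <= 0 {
		return nil // group is full; nothing to restore
	}
	return p.Provision(spec)
}

type printProvisioner struct{}

func (printProvisioner) Provision(spec string) error {
	fmt.Println("provisioning one replacement manager with spec:", spec)
	return nil
}

func main() {
	// 2 of 5 managers left: each of the 2 survivors provisions exactly 1 node.
	_ = restoreOne(5, 2, printProvisioner{}, "manager-spec")
}
```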

What do you think? I haven't manually played around with swarm to see how this could work. Obviously this also assumes we are reattaching the /var/lib/docker volumes so we are not doing all kinds of swarm demote/join/leave operations that alter the original topology.

@kaufers
Contributor Author

kaufers commented Nov 1, 2017

@chungers I think that approach could work when there are only 2/5 left.

What about the case where there are 4/5? Since this is an even number, what is the state of the swarm? Is there a leader at this point? If not, then we also need to handle this case (and we wouldn't want all 4 to provision 1 additional node).

@chungers
Contributor

chungers commented Nov 3, 2017

I think for the 4/5 case we are ok as there's still a leader even with 4 active nodes: https://docs.docker.com/engine/swarm/admin_guide/#add-manager-nodes-for-fault-tolerance
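For reference, the arithmetic behind that: Raft requires a strict majority, i.e. floor(N/2) + 1 managers, to elect a leader. With N = 5 that is 3, so 4 surviving managers still exceed quorum and keep a leader; only at 2 of 5 (below 3) does election become impossible.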
