-
Notifications
You must be signed in to change notification settings - Fork 76
Description
Context
As of right now, quorum is calculated by evaluating each registered replica within the cluster and asking it who it thinks the current primary is.
We achieve two things by doing this:
- Confirm that the target members are reachable.
- We are building quorum by confirming that the majority of the cluster agree's on who the primary is.
The reason we evaluate every member within the cluster is because the primary member is unable to know how the PRIMARY_REGION
value is configured per Machine. The PRIMARY_REGION
environment variable represents the region holding the primary and dictates which members are eligible for promotion. This value should only be changed when a user needs to issue a regional failover, which requires the end-user to perform an update against every Machine in their cluster with the new PRIMARY_REGION
value. There are a number of ways this process could fail, which presents opportunities for inconsistency.
Example case:
This issue can be expressed in a 6 node cluster with members split evenly across two different regions.
Nodes A,B,C residing in ord with PRIMARY_REGION
set to ord
Nodes D,E,F residing in iad with PRIMARY_REGION
set to iad
If quorum only considered "in-region" nodes, then it would be possible for a split-brain to occur across regions with the presence of a misconfigured PRIMARY_REGION
environment variable.
The problem with considering every registered member
The problem with this approach is that we start to see an increase in connection timeouts when replicas are placed in regions considerably far away from the primary region. Connection timeouts are currently set to 5 seconds for each registered member. This issue is amplified if the replica is under load. When quorum cannot be reached, then the cluster will go read-only until quorum is restored. This can lead to a bad experience for many users who want to push many of their read-replicas into distant regions.