-
Notifications
You must be signed in to change notification settings - Fork 703
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NEW] Non-voting primaries, voting empty primaries, voting replicas #1600
Comments
@enjoy-binbin Are the voting empty primaries the same idea as your arbiter nodes? You say @madolson doesn't like the idea of arbiter nodes? @PingXie also invited to discuss. Do you see any obvious catches with this approach? |
Yes, it is the same idea, i would like to open source it if we accept.
madolson mention this thought in here #634 (comment). (maybe i got she wrong) The entire issue is consistent with my idea, i am going to take a moment to talk about it. We actually have three mode:
This is the last one, actually we are doing the about same thing, i was working on it last year and i have a demo branch in the internal fork, but then got stuck with other things and nerver finish it (haven't tested it widely). The main idea is, since we now have instance abiter, we will let arbiter node to ping other node to exchange the infomation. So that the other nodes does not need to ping other node, something like that, i may not describe it well, but the idea is similar to it. |
I agree with the general idea of adding non-voting notes into quorum. I'm concerned that getting these types of algorithms "correct" is really hard, and we should really use formal methods like TLA+ to verify the correctness. My main concern in the thread that was linked was that we should increase the epoch to change quorum (adding arbiters or other nodes), otherwise you can end up in split brains with divergent opinions about the cluster topology.
For example, there is no easy way to do this since quorum can change in the distributed system while polling everyone. We can introduce a mechanism to get everyone to indicate their status at an epoch and retry on failures. I was originally going to propose something like all nodes broadcast out "I support other nodes voting" and then a node can nominate for a new list of "additional voters" that include themselves. If a node in the "other voters" is marked as failed, nodes can vote to replace it. Nodes would get a new config for whether or not they are allowed to nominate themselves into the "other voters" list, or maybe can do it by default. If a conflicting failover happens, (we can still only vote for one failover per epoch) then we backoff and retry later.
This seems like a good idea as well to limit the message passed around in the ping/pong messages about state. |
This has been a general concern I have heard from multiple folks at AWS around getting the Raft implementation right. I'm currently reading a paper around gossip protocol for failure detection https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf which would likely reduce the ping-pong transfer across the clusterbus. Yet to apply the learnings to our implementation. After failure detection, I also think we need a smaller group as quorum compared to our existing quorum of all primaries in a cluster for failover. |
@enjoy-binbin With the arbiter mode approach, I have a few questions.
|
i think it would benefit in every scale, one shard cluster or some cluster cross multi AZ. I didn't consider the failover or vote time here, this is more about the architectural benefits.
The arbiter node does not handle user request, artbier does not own any slots, so arbiter only do the gossip thing. The end user, if they want to get high availability of a single-shard cluster, or high availability at the multi-AZ level, they can use arbiter deployment. They don't need anything extra. The arbiter does not need additional capacity to serve traffic (except for gossip overhead). The cpu, i looked at one cluster (redis5 128 shards, 128 replica, 256 arbiters, so 512 nodes in total), the arbiter CPU avg usage is < 1% (while the primary is 30% since it is a in-use cluster), the cluster-timeout is the default, that is 15s. |
The problem/use-case that the feature addresses
The cluster bus traffic overhead is huge for huge clusters because:
Description of the feature
Decouple right-to-vote from primary-serving-slots, so any node can be a voting or non-voting node.
This can allow non-voting primaries, voting replicas and voting empty nodes. Modifying these details may be easier than to rewrite the cluster protocol to Raft (#384).
By having only a few voting nodes (such as five) in a large cluster with hundreds of nodes, voting becomes faster.
Then we can also limit the ping-pong traffic between non-voting nodes. For example, non-voting nodes don't need to ping other nodes. They can just reply to pings from other nodes. This makes the voting nodes a team of leaders and the non-voters a team of followers.
The difficulty seems to be for all the nodes to agree about which nodes can vote and how to count the votes, and to do the switch in a rolling upgrade.
How to introduce it in a rolling upgrade
Idea: Once all the voting nodes in a cluster support the new rules, then automatically switch to the new voting rules. Add some flag bits to the cluster bus header:
When a node wins a failover election, it checks if all nodes in the current quorum support the new voting rules. If they do, the new node activates the new voting rules and sets the 3rd bit "New voting rules are activated". This is spread to all nodes using broadcast and gossip.
Alternatives you've considered
Raft consensus protocol
Additional information
This is just an idea. We need to consider edge cases...
The text was updated successfully, but these errors were encountered: