Reduce connect timeout to 5s and reduce reconnect timeout from 60s to 30s #165
This work reduces the time it takes to identify and fence a primary in the event of a network partition.
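When a network partition occurs, a few things need to happen:
1. The lost connection has to be detected, which is bounded by connect_timeout.
2. A child_node_disconnect event is raised.
3. The child_node_disconnect event is then processed and triggers a cluster state evaluation.

The split-brain detection window can be calculated using the following formula:
connect timeout + standby reconnect timeout + (registered standbys * connect timeout)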
For a typical 3-node cluster, we are looking at:
Connect timeout: 5s
Standby reconnect timeout: 30s
Registered standbys: (2 * 5s) = 10s
Total time: 45 seconds.
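As a sanity check on the figures above, here is a minimal Python sketch of the same arithmetic. The function and parameter names are hypothetical and do not come from this PR. It assumes each registered standby costs at most one connect timeout to evaluate, as the 3-node breakdown suggests.

```python
def split_brain_detection_window(
    connect_timeout_s: int,
    reconnect_timeout_s: int,
    registered_standbys: int,
) -> int:
    """Worst-case seconds to detect a partition (illustrative helper only)."""
    # Assumption: every registered standby is probed once during the cluster
    # state evaluation, and each probe can take up to the full connect timeout.
    evaluation_s = registered_standbys * connect_timeout_s
    return connect_timeout_s + reconnect_timeout_s + evaluation_s


# 3-node cluster: one primary plus two registered standbys.
window = split_brain_detection_window(
    connect_timeout_s=5,
    reconnect_timeout_s=30,
    registered_standbys=2,
)
print(f"Detection window: {window}s")
```

With the values from this PR (5s connect timeout, 30s reconnect timeout, 2 standbys), the sketch reproduces the 45-second total.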
There are some optimizations we can make here to cut down on time. E.g., we could get away with evaluating only a subset of the registered members and bail once we know quorum can't be met. I'll have to think more about this.
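A rough Python sketch of that early-bail idea follows; it is not what this PR implements. The names `quorum_met` and `is_reachable` are hypothetical. It also assumes quorum is a simple majority of all members and that the primary counts itself as reachable.

```python
from typing import Callable, Sequence, Tuple


def quorum_met(
    standbys: Sequence[str],
    is_reachable: Callable[[str], bool],
) -> Tuple[bool, int]:
    """Return (quorum reached?, number of standbys actually probed)."""
    total_members = len(standbys) + 1   # standbys plus the primary itself
    quorum = total_members // 2 + 1     # assumption: simple majority
    reachable = 1                       # assumption: the primary counts itself
    probed = 0

    for name in standbys:
        remaining = len(standbys) - probed
        if reachable >= quorum:
            return True, probed         # quorum already met, stop probing
        if reachable + remaining < quorum:
            return False, probed        # quorum can no longer be met, bail
        probed += 1
        if is_reachable(name):          # each probe may cost one connect timeout
            reachable += 1

    return reachable >= quorum, probed


# 5-member cluster where the primary is cut off from every standby.
ok, probes = quorum_met(["s1", "s2", "s3", "s4"], lambda _name: False)
print(ok, probes)
```

In this 5-member example the loop stops after three probes instead of four, which would save one connect timeout. In a 3-node cluster with both standbys unreachable, both still have to be probed, so the saving only appears in larger clusters or when quorum is confirmed early.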