vtorc: add support for dynamic enable/disable of ERS by keyspace/shard#17985
vtorc: add support for dynamic enable/disable of ERS by keyspace/shard#17985timvaillancourt merged 42 commits intovitessio:mainfrom
vtorc: add support for dynamic enable/disable of ERS by keyspace/shard#17985Conversation
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
c4ae4cb to
1999722
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #17985 +/- ##
==========================================
- Coverage 67.52% 67.50% -0.02%
==========================================
Files 1607 1607
Lines 263338 263518 +180
==========================================
+ Hits 177812 177892 +80
- Misses 85526 85626 +100 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
8044f70 to
74925c6
Compare
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
|
@shlomi-noach thanks for the review! I believe the code and changelog suggestions are now addressed 👍 |
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
|
@timvaillancourt I see more traffic on this PR, please let us know when it is ready for final review. |
@shlomi-noach 👋 yes, this should be ready for another pass of reviews now. Thanks! |
| utils.WaitForSuccessfulRecoveryCount(t, vtOrcProcess, logic.RecoverDeadPrimaryRecoveryName, keyspace.Name, shard0.Name, 0) | ||
| utils.WaitForSuccessfulERSCount(t, vtOrcProcess, keyspace.Name, shard0.Name, 0) |
There was a problem hiding this comment.
This is tricky. We wait for something to not happen and thus we provide 0 for expected count. But if you look at the code for utils.WaitForSuccessfulRecoveryCount and utils.WaitForSuccessfulERSCount, these functions just loop with timeout until they reach requested value, which is 0, which means they will exit immediately with success code.
Point being, if a failover was to happen, we'd still exit immediately, and this is wrong. I think in this case we need to add an imposed Sleep of a few seconds (you call the number) before calling these functions.
Or, otherwise, find a more formal way of knowing for sure that the recovery was ignored. But Sleep should be fine.
There was a problem hiding this comment.
@shlomi-noach great point, I missed that! 👍. Yes, a sleep would work
Another way to solve this would be to wait for another metric: I'm thinking the "# of recoveries" counter, to increase for the DeadPrimary recovery AND only then should the logic check if the desired value was reached
An alternative idea, which I pondered adding in this PR, was a "# of ERS-recoveries skipped" metric
I think I'll try this 2-dimensional metrics check and see where it goes! If a sleep approach is necessary, that's an option too
There was a problem hiding this comment.
I'm thinking the "# of recoveries"
Wouldn't that have the same behavior? You want that number to not incrase so the only way to check that would be to wait/sleep?
"# of ERS-recoveries skipped" metric
That would be good indeed.
There was a problem hiding this comment.
@shlomi-noach I've updated the e2e test to wait for the RecoverDeadPrimary recovery to be skipped, using a new metric: SkippedRecoveries
Once we know RecoverDeadPrimary was skipped, it should be safe to check for 0 for the stats you originally pointed out. Let me know how it looks! 🙇
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Description
This PR adds support for ERS to be enabled/disabled in VTOrc dynamically by keyspace or keyspace/shard using the topo
Summary:
vtctldclient/server RPCs:SetVtorcEmergencyReparent [--enable|-e] [--disable|-d] <keyspace> <shard>GetKeyspaceandGetShardcan be used to read the written recordEmergencyReparentShardDisabled) to track ERS-disabled shardsdisable_emergency_reparent tinyintfield in thevitess_keyspaceandvitess_shardtablesRelated Issue(s)
#17988
Checklist
Deployment Notes