EmergencyReparentShard: support reachable replica tablets w/mysqld down #18896

timvaillancourt wants to merge 23 commits into vitessio:main

Conversation
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Review Checklist: Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.
```diff
 var args []string
 if timeout != "" {
-	args = append(args, "--action_timeout", timeout)
+	args = append(args, "--action-timeout", timeout)
```
This resolves a deprecation warning: the underscore form `--action_timeout` is deprecated in favor of `--action-timeout`.
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff            @@
##             main   #18896     +/-   ##
=========================================
  Coverage   69.63%   69.63%
=========================================
  Files        1614     1614
  Lines      216395   216400       +5
=========================================
+ Hits       150687   150695       +8
+ Misses      65708    65705       -3
```

☔ View full report in Codecov by Sentry.
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
timvaillancourt changed the title from "EmergencyReparentShard: support reachable replicas w/mysqld down" to "EmergencyReparentShard: support reachable replica tablets w/mysqld down"
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
mattlord
left a comment
LGTM! Just some minor comments. Can you please add another test case to the `Test_stopReplicationAndBuildStatusMaps` unit test? Unless there's not really a good way to do that?
```go
// we prioritize completing the reparent (availability) for the common case. If this edge case were
// to occur, errant GTID(s) will be produced; if this happens often we should return UNAVAILABLE
// from vttablet using more detailed criteria (check the pidfile + running PID, etc).
```
I wonder if it isn't worth improving this case today? The lack of detail may have been in place simply because it did not impact any operations, but now we're building logic around the meaning we infer from the response. I'd say we should do this now, provided you have an idea of how to do it.
What do we get in the error? I wonder if it's not an SQLError we can extract that maps to one of these: https://dev.mysql.com/doc/refman/en/gone-away.html
You can see elsewhere in the code base where we look to see if the error contains an SQL error (a MySQL error, and if so, if the code matches 1 or more MySQL error codes). Actually... just below this here 😆
So, I'm trying to wrap my head around what this change effectively means. What if the replica that's down is the most advanced replica? You kinda hinted at it above, but does that mean we have a potential for data loss (i.e. changes that were acked by the replica that's now down and have not been replicated anywhere else might actually be lost)? The current behavior is to fail.
@arthurschreiber correct, unfortunately. My understanding is this is the same tradeoff that was made in old-Orchestrator; cc @shlomi-noach to confirm (if it's still in cache).

What is being seen in the wild is that the current behaviour prevents VTOrc (which uses ERS for many things) from remediating shards in partially-unhealthy states. So you basically end up with an indefinitely broken shard, due to a replica that is often unimportant. One could page a human here (we don't automate/document this), but I don't think that is state of the art/elegant, so to speak.

I'm glad you raise the point about manual intervention: erroring for a human to respond is viable on the …
I've been thinking about this some more, and I think there's some fundamental issues with …

Without semi-sync enabled (so, durability policy set to …), …

With a durability policy set, the requirements tighten a lot. I think we can combine the two points above in a generalized fashion. When performing …

One complication here is that we can't rely on the value of the durability policy at the time of failover, as the value could have been changed (e.g. from …).
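The generalized durability argument above can be sketched numerically. Assuming semi-sync requires some number of replica acks per write, every acknowledged write exists on at least that many replicas, so an ERS that ignores fewer than that many replicas is guaranteed (by pigeonhole) to still reach a replica holding every acked write. The helper below is illustrative only, not vitess code:

```go
package main

import "fmt"

// canIgnore reports whether ERS can ignore `unreachable` replicas and still be
// certain (under semi-sync with `requiredAcks` replica acks per write) that
// some reachable replica holds every acknowledged write: each acked write
// lives on at least requiredAcks replicas, so ignoring fewer than requiredAcks
// of them always leaves at least one copy reachable.
func canIgnore(unreachable, requiredAcks int) bool {
	return unreachable < requiredAcks
}

func main() {
	fmt.Println(canIgnore(0, 1)) // true: ignoring nobody is always safe
	fmt.Println(canIgnore(1, 1)) // false: the down replica may hold the only ack
	fmt.Println(canIgnore(1, 2)) // true: a second acker is still reachable
}
```

This is why the single-ack semi-sync default makes skipping even one replica a potential data-loss window, while higher ack counts leave slack.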
@arthurschreiber this is a good point. While it can't be 100% sure, this PR is relying on …
The codes that cause …
For context, these error codes to … While imperfect, the gaps where this isn't a good signal seem pretty small here. I'm curious in what scenario we would get these error codes above from MySQL, but semi-sync is still running? Without digging I would say (ignoring remote-tablets again) just …
One of the first steps the reparenting code does is fetch the running durability policy from the topo. From there it assumes that is the correct policy. That's "usually" right, but again imperfect. The edge case where I can see this not being true is the keyspace durability policy being changed seconds before/during a reparent. Are there any other cases? I think that's out of scope of this PR, but a very good point to address. If it's just the single scenario I mention, the risk of this occurring only exists for a handful of seconds, assuming VTOrc is successfully fixing things.
Yes, this would likely be the most accurate approach, and one I began RFC'ing at Slack for similar reasons: using the tablet record for this state. An overall problem with ERS is that it is used by both stateless VTCtlds and kind-of-stateful VTOrcs. VTOrc technically could store the state you refer to in its backend database, but VTCtld has no such backend database, so we're left with just the topo as something both can use. The topo is probably OK, but it has its inefficiencies at scale, and this would add calls and more reliance on the topo being up; that said, reparents already rely heavily on the topo.

But again, I see the few seconds where the durability policy may not be 100% accurate as kind of out of scope, or at least this PR doesn't make that better or worse. What do you think?
Co-authored-by: Arthur Schreiber <schreiber.arthur@googlemail.com> Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
This reverts commit 19996f8.
This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:
If no action is taken within 7 days, this PR will be closed.
📝 Documentation updates detected! New suggestion: Document EmergencyReparentShard handling of replicas with mysqld down
📝 Documentation updates detected! New suggestion: Add changelog entry for EmergencyReparentShard skipping replicas with MySQL down
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Description
A follow-up to #18565 (the 2nd half of the change), this PR provides `EmergencyReparentShard` the required context to skip candidate replicas that have `mysqld` crashed/down although `vttablet` remains up.

Today, ANY candidate tablet of a shard having `mysqld` crashed/down at the time of `EmergencyReparentShard` will break the reparent on: … 😱
This problem affects both VTCtld and VTOrc ERS operations. Let's make sure `EmergencyReparentShard` works in this kind of emergency!

This change raises the question: what if ALL candidates have `vttablet` up and `mysqld` down? The existing logic that checks we found enough valid candidates catches this scenario the same as now; we just don't break the entire ERS operation on single failures.

Another question this raises: what happens to the tablet with MySQL down? This PR doesn't really address/change that. A replica in this state would look unhealthy to VTOrc, but it has no way to fix it. This broken tablet should probably get replaced (Kube or other automation); on the Kube operator this is built-in.
Related Issue(s)

- `EmergencyReparentShard` fails when `mysqld` is down on any tablet in a shard #18528

Checklist
Deployment Notes
AI Disclosure