EmergencyReparentShard: support reachable replica tablets w/mysqld down #18896

timvaillancourt wants to merge 23 commits into vitessio:main

Conversation
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Review Checklist: Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.
```diff
 var args []string
 if timeout != "" {
-	args = append(args, "--action_timeout", timeout)
+	args = append(args, "--action-timeout", timeout)
```
This resolves a deprecation warning: the underscore form `--action_timeout` is deprecated in favor of `--action-timeout`.
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff            @@
##             main   #18896     +/-   ##
=========================================
  Coverage   69.63%   69.63%
=========================================
  Files        1614     1614
  Lines      216395   216400       +5
=========================================
+ Hits       150687   150695       +8
+ Misses      65708    65705       -3
```

☔ View full report in Codecov by Sentry.
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
timvaillancourt changed the title from "EmergencyReparentShard: support reachable replicas w/mysqld down" to "EmergencyReparentShard: support reachable replica tablets w/mysqld down"
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
mattlord
left a comment
LGTM! Just some minor comments. Can you please add another test case to the `Test_stopReplicationAndBuildStatusMaps` unit test? Unless there's not really a good way to do that?
```go
// we prioritize completing the reparent (availability) for the common case. If this edge case were
// to occur, errant GTID(s) will be produced; if this happens often we should return UNAVAILABLE
// from vttablet using more detailed criteria (check the pidfile + running PID, etc).
```
I wonder if it isn't worth improving this case today? The lack of detail may have been in place simply because it did not impact any operations, but now we're building logic around the meaning we infer from the response. I'd say we should do this now, provided you have an idea of how to do it.
What do we get in the error? I wonder if it's not an SQLError we can extract that maps to one of these: https://dev.mysql.com/doc/refman/en/gone-away.html
You can see elsewhere in the code base where we look to see if the error contains an SQL error (a MySQL error, and if so, if the code matches 1 or more MySQL error codes). Actually... just below this here 😆
So, I'm trying to wrap my head around what this change effectively means. What if the replica that's down is the most advanced replica? You kinda hinted at it above, but does that mean we have a potential for data loss (i.e. changes that were acked by the replica that's now down and have not been replicated anywhere else might actually be lost)? The current behavior is to fail.
@arthurschreiber correct, unfortunately. My understanding is this is the same tradeoff that was made in old-Orchestrator; cc @shlomi-noach to confirm (if it's still in cache).

What is being seen in the wild is that the current behaviour prevents VTOrc (which uses ERS for many things) from remediating shards in partially-unhealthy states. So you basically end up with an indefinitely broken shard, due to a replica that is often unimportant. One could page a human here (we don't automate/document this), but I don't think that is state of the art/elegant, so to speak.

I'm glad you raise the point about manual intervention: erroring for a human to respond is viable on the …
I've been thinking about this some more, and I think there's some fundamental issues with …

Without semi-sync enabled (so, durability policy set to …), …

With a durability policy set, the requirements tighten a lot. I think we can combine the two points above in a generalized fashion. When performing …

One complication here is that we can't rely on the value of the durability policy at the time of failover, as the value could have been changed (e.g. from …).
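The generalized durability argument above can be sketched numerically. Assuming semi-sync requires some number of replica acks per write, every acknowledged write exists on at least that many replicas, so an ERS that ignores fewer than that many replicas is guaranteed (by pigeonhole) to still reach a replica holding every acked write. The helper below is illustrative only, not vitess code:

```go
package main

import "fmt"

// canIgnore reports whether ERS can ignore `unreachable` replicas and still be
// certain (under semi-sync with `requiredAcks` replica acks per write) that
// some reachable replica holds every acknowledged write: each acked write
// lives on at least requiredAcks replicas, so ignoring fewer than requiredAcks
// of them always leaves at least one copy reachable.
func canIgnore(unreachable, requiredAcks int) bool {
	return unreachable < requiredAcks
}

func main() {
	fmt.Println(canIgnore(0, 1)) // true: ignoring nobody is always safe
	fmt.Println(canIgnore(1, 1)) // false: the down replica may hold the only ack
	fmt.Println(canIgnore(1, 2)) // true: a second acker is still reachable
}
```

This is why the single-ack semi-sync default makes skipping even one replica a potential data-loss window, while higher ack counts leave slack.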
@arthurschreiber this is a good point. While it can't be 100% sure, this PR is relying on …
The codes that cause …
For context, these error codes to … While imperfect, the gaps where this isn't a good signal seem pretty small here. I'm curious in what scenario we would get these error codes above from MySQL, but semi-sync is still running? Without digging I would say (ignoring remote-tablets again) just …
One of the first steps the reparenting code does is fetch the running durability policy from the topo. From there it assumes that is the correct policy. That's "usually" right, but again imperfect. The edge case where I can see this not being true is the keyspace durability policy being changed seconds before/during a reparent. Are there any other cases? I think that's out of scope of this PR, but a very good point to address. If it's just the single scenario I mention, the risk of this occurring only exists for a handful of seconds, assuming VTOrc is successfully fixing things.
Yes, this would likely be the most accurate approach, and one I began RFC'ing at Slack for similar reasons: using the tablet record for this state. An overall problem with ERS is that it is used by both stateless VTCtlds and kind-of-stateful VTOrcs. VTOrc technically could store the state you refer to in its backend database, but VTCtld has no such backend database, so we're left with just the topo as something both can use. The topo is probably OK, but it has its inefficiencies at scale, and this would add calls and more reliance on the topo being up; that said, reparents already rely heavily on the topo.

But again, I see the few seconds where the durability policy may not be 100% accurate as kind of out of scope, or at least this PR doesn't make that better or worse. What do you think?
Co-authored-by: Arthur Schreiber <schreiber.arthur@googlemail.com> Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
This reverts commit 19996f8.
This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:
If no action is taken within 7 days, this PR will be closed.
📝 Documentation updates detected! New suggestion: Document EmergencyReparentShard handling of replicas with mysqld down
📝 Documentation updates detected! New suggestion: Add changelog entry for EmergencyReparentShard skipping replicas with MySQL down
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Description
A follow-up to #18565 (the 2nd half of the change), this PR provides `EmergencyReparentShard` the required context to skip candidate replicas that have `mysqld` crashed/down although `vttablet` remains up.

Today, ANY candidate tablet of a shard having `mysqld` crashed/down at the time of `EmergencyReparentShard` will break the reparent on: … 😱
This problem affects both VTCtld and VTOrc ERS operations. Let's make sure `EmergencyReparentShard` works in this kind of emergency!

This change raises the question: what if ALL candidates have `vttablet` up and `mysqld` down? The existing logic that checks we found enough valid candidates catches this scenario the same as now; we just don't break the entire ERS operation on single failures.

Another question this raises: what happens to the tablet with MySQL down? This PR doesn't really address/change that. A replica in this state would look unhealthy to VTOrc, but it has no way to fix it. This broken tablet should probably get replaced (Kube or other automation); on the Kube operator this is built-in.
Related Issue(s)

- `EmergencyReparentShard` fails when `mysqld` is down on any tablet in a shard #18528

Checklist
Deployment Notes
AI Disclosure