SetReplicationSource: handle mysqld down in PRIMARY self-demotion #19624
timvaillancourt wants to merge 27 commits into vitessio:main
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Addresses a brief race in MySQL's close_connections() (mysqld.cc) where close_listener() removes the unix socket before end_slave() stops replication threads.
Refs: vitessio#19625
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
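A minimal sketch of the mitigation the PR description gives for this race: a best-effort, time-bounded STOP REPLICA issued before the shutdown. The `daemon` interface, `stopThenShutdown`, `fakeDaemon` and `runDemo` are hypothetical names for illustration, not the Vitess API, and the demo uses a short timeout where the PR states 3s.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// daemon is a hypothetical, minimal view of a MySQL daemon wrapper.
// It is NOT the real Vitess MysqlDaemon interface.
type daemon interface {
	StopReplication(ctx context.Context) error
	Shutdown(ctx context.Context) error
}

// stopThenShutdown makes a best-effort, time-bounded attempt to stop
// replication threads before asking mysqld to shut down.
func stopThenShutdown(ctx context.Context, d daemon, stopTimeout time.Duration) error {
	stopCtx, cancel := context.WithTimeout(ctx, stopTimeout)
	defer cancel()
	if err := d.StopReplication(stopCtx); err != nil {
		// A failed or timed-out stop must not block the shutdown itself.
		fmt.Println("stop replication failed, continuing:", err)
	}
	return d.Shutdown(ctx)
}

// fakeDaemon records call order; hang simulates a stop that never returns.
type fakeDaemon struct {
	calls []string
	hang  bool
}

func (f *fakeDaemon) StopReplication(ctx context.Context) error {
	f.calls = append(f.calls, "stop")
	if f.hang {
		<-ctx.Done() // wait until the bounded context expires
		return ctx.Err()
	}
	return nil
}

func (f *fakeDaemon) Shutdown(ctx context.Context) error {
	f.calls = append(f.calls, "shutdown")
	return nil
}

// runDemo drives the sketch with a short timeout so it finishes quickly.
func runDemo(hang bool) ([]string, error) {
	f := &fakeDaemon{hang: hang}
	err := stopThenShutdown(context.Background(), f, 20*time.Millisecond)
	return f.calls, err
}

func main() {
	calls, err := runDemo(true)
	fmt.Println(calls, err)
}
```

The stop gets its own derived context so that a hung replication stop can only delay the shutdown by the timeout, never prevent it.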
- IsLocalMySQLDown: return false if MysqlParams() fails or UnixSocket is empty
- TestMysqldIsLocalMySQLDown: add t.Cleanup to avoid leaking fakesqldb
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
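As a rough illustration of the kind of classification IsLocalMySQLDown is described as doing, the sketch below treats only a client-side errno 2002 on a unix-socket connection as "down", and treats too-many-connections and fd exhaustion as transient. Everything here (`sqlError`, `localMySQLLooksDown`, the exact rules) is an assumption for illustration; the real implementation is not shown in this page, and the socket file validation it mentions is not modelled.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

const (
	crConnectionError = 2002 // MySQL client error CR_CONNECTION_ERROR
	erConCountError   = 1040 // MySQL server error ER_CON_COUNT_ERROR (too many connections)
)

// sqlError is a hypothetical stand-in for a driver error carrying a MySQL errno.
type sqlError struct {
	Num   int
	Cause error
}

func (e *sqlError) Error() string { return fmt.Sprintf("mysql errno %d", e.Num) }
func (e *sqlError) Unwrap() error { return e.Cause }

// localMySQLLooksDown reports whether a failed probe suggests that a local
// mysqld is down, as opposed to busy or temporarily unreachable.
func localMySQLLooksDown(unixSocket string, probeErr error) bool {
	if probeErr == nil || unixSocket == "" {
		return false // probe succeeded, or MySQL is not local
	}
	// Our own process ran out of file descriptors: mysqld may be fine.
	if errors.Is(probeErr, syscall.EMFILE) || errors.Is(probeErr, syscall.ENFILE) {
		return false
	}
	var se *sqlError
	if !errors.As(probeErr, &se) {
		return false // unknown error shape: stay conservative
	}
	// Too-many-connections (1040) and any other errno fall through to false.
	return se.Num == crConnectionError
}

func main() {
	sock := "/tmp/mysql.sock" // hypothetical path
	fmt.Println(localMySQLLooksDown(sock, &sqlError{Num: crConnectionError}))
	fmt.Println(localMySQLLooksDown(sock, &sqlError{Num: erConCountError}))
	fmt.Println(localMySQLLooksDown(sock, &sqlError{Num: crConnectionError, Cause: syscall.EMFILE}))
	fmt.Println(localMySQLLooksDown("", &sqlError{Num: crConnectionError}))
}
```

Returning false on every uncertain path mirrors the commit's guard clauses: a false "down" verdict would skip MySQL work that could have succeeded, so the sketch errs toward "not down".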
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
- Fix IsLocalMySQLDown comment: GetDbaConnection only connects, no SELECT 1
- Check rlimit restore error in e2e fd exhaustion test
- Move curPrimary removal to outer defer in TestDownPrimary
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
The root cause was that convertBoolToSemiSyncAction queries MySQL before ChangeTabletType is ever called. When mysqld is down, it fails with errno 2002 and the topo is never updated.
Three changes:
- SetReplicationSource: skip convertBoolToSemiSyncAction when MySQL is local and down, use SemiSyncActionNone instead
- updateTypeAndPublish: skip updateLocked when MySQL is local and down, so publishStateLocked can update topo without connect() draining the context
- setReplicationSourceLocked: fix comment (no longer dead code)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
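The ordering problem described in this commit can be sketched with a toy model. The `tablet` type, `querySemiSync` and `demote` are hypothetical stand-ins, not the vttablet code; only the identifier SemiSyncActionNone and the errno 2002 failure come from the commit message. The `skipWhenDown` flag switches between the old behavior (the MySQL query fails and the topo record is never touched) and the new one (the query is skipped and the topo type still changes).

```go
package main

import (
	"errors"
	"fmt"
)

// SemiSyncAction is a stand-in enum; the values are assumptions.
type SemiSyncAction int

const (
	SemiSyncActionNone SemiSyncAction = iota
	SemiSyncActionSet
	SemiSyncActionUnset
)

// errConnRefused stands in for the errno 2002 failure seen when mysqld is down.
var errConnRefused = errors.New("errno 2002: cannot connect to local MySQL server")

// tablet is a hypothetical model holding only what the sketch needs.
type tablet struct {
	localMySQLDown bool
	topoType       string
}

// querySemiSync stands in for the step that must talk to MySQL.
func (t *tablet) querySemiSync(semiSync bool) (SemiSyncAction, error) {
	if t.localMySQLDown {
		return SemiSyncActionNone, errConnRefused
	}
	if semiSync {
		return SemiSyncActionSet, nil
	}
	return SemiSyncActionUnset, nil
}

// demote models the self-demotion path. With skipWhenDown=false it behaves
// like the failure described above; with true it applies the skip.
func (t *tablet) demote(semiSync, skipWhenDown bool) (SemiSyncAction, error) {
	action := SemiSyncActionNone
	if !(skipWhenDown && t.localMySQLDown) {
		var err error
		action, err = t.querySemiSync(semiSync)
		if err != nil {
			return SemiSyncActionNone, err // returns before the topo update
		}
	}
	t.topoType = "REPLICA" // stands in for the tablet type change
	return action, nil
}

func main() {
	before := &tablet{localMySQLDown: true, topoType: "PRIMARY"}
	_, err := before.demote(true, false)
	fmt.Println(before.topoType, err)

	after := &tablet{localMySQLDown: true, topoType: "PRIMARY"}
	action, err := after.demote(true, true)
	fmt.Println(after.topoType, action, err)
}
```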
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
- Remove unused mysqlDaemon interface and mysqld field from stateManager
- Pass localMySQLDown bool to setReplicationSourceLocked to avoid probing IsLocalMySQLDown twice in the same RPC
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Code Review
Correctness Issues
1. Comment references nonexistent recovery mechanism (Medium)
2. TOCTOU gap in
- Fix path.Join with leading '/' dropping VTDATAROOT prefix
- Fix misleading comment about GetDbaConnection running SELECT 1
- Use IsLocalMySQLDown instead of socket file stat after SIGKILL
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Description
When a PRIMARY tablet's mysqld goes down and vttablet receives a SetReplicationSource RPC to self-demote, it fails at convertBoolToSemiSyncAction (which queries MySQL) before ChangeTabletType is ever called. The topo is never updated and the tablet stays PRIMARY.
VTOrc's StaleTopoPrimary analysis (added in #19173) can eventually recover this, but this PR prevents the problem at the source.
Reproduced locally by killing mysqld_safe+mysqld on a primary (keeping vttablet running) and calling SetReplicationSource via grpcurl. On main, the tablet stays PRIMARY in topo. With this PR, it transitions to REPLICA.
This PR adds IsMySQLLocal() and IsLocalMySQLDown() to the MysqlDaemon interface. IsLocalMySQLDown() probes MySQL and uses heuristics to distinguish "MySQL is down" from transient errors (CRConnectionError/errno 2002, too-many-connections, fd exhaustion, socket file validation).
Three targeted skips when MySQL is local and down:
- SetReplicationSource: skip convertBoolToSemiSyncAction, use SemiSyncActionNone
- updateTypeAndPublish: skip updateLocked, let publishStateLocked update topo. retryTransition reconnects when MySQL comes back
- setReplicationSourceLocked: skip replication configuration. VTOrc or vttablet restart will repair replication later
Also includes:
- STOP REPLICA (3s timeout) before Mysqld.Shutdown(), to avoid a race in MySQL's close_connections() where close_listener() removes the unix socket before end_slave() stops replication threads (see Bug Report: Mysqld.Shutdown() can return while replication threads are still running #19625)
- process() retry loop now respects context cancellation
Related Issue(s)
Resolves #19623 and #19625
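The description notes that the process() retry loop now respects context cancellation but does not show the loop. A generic sketch of that pattern, assuming a fixed retry interval, could look like the following; `retryUntilDone` and `runDemo` are hypothetical names and this is not the code from the PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryUntilDone calls attempt until it succeeds or ctx is cancelled.
// Waiting in a select on ctx.Done() is what lets a caller abort the loop
// instead of sleeping through a cancelled context.
func retryUntilDone(ctx context.Context, interval time.Duration, attempt func() error) error {
	for {
		if err := attempt(); err == nil {
			return nil
		}
		timer := time.NewTimer(interval)
		select {
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		case <-timer.C:
			// interval elapsed: try again
		}
	}
}

// runDemo retries an operation that succeeds on the succeedOn-th call
// (0 means it never succeeds) under a 200ms deadline.
func runDemo(succeedOn int) (attempts int, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	err = retryUntilDone(ctx, 2*time.Millisecond, func() error {
		attempts++
		if succeedOn > 0 && attempts >= succeedOn {
			return nil
		}
		return errors.New("not ready")
	})
	return attempts, err
}

func main() {
	n, err := runDemo(3)
	fmt.Println(n, err)
	_, err = runDemo(0)
	fmt.Println(err)
}
```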
Deployment Notes
No flags or config changes required. The behavior is automatic for tablets with local mysqld (unix socket DBA connection).
AI Disclosure
Claude w/Opus 4.6 wrote tests, PR summaries and issues for me. It also helped address reviews from Copilot and helped reproduce the bug locally.