Skip to content

SetReplicationSource: handle mysqld down in PRIMARY self-demotion#19624

Open
timvaillancourt wants to merge 27 commits intovitessio:mainfrom
timvaillancourt:fix-serveNonPrimary-no-mysqld
Open

SetReplicationSource: handle mysqld down in PRIMARY self-demotion#19624
timvaillancourt wants to merge 27 commits intovitessio:mainfrom
timvaillancourt:fix-serveNonPrimary-no-mysqld

Conversation

@timvaillancourt
Copy link
Contributor

@timvaillancourt timvaillancourt commented Mar 11, 2026

Description

When a PRIMARY tablet's mysqld goes down and vttablet receives a SetReplicationSource RPC to self-demote, it fails at convertBoolToSemiSyncAction (which queries MySQL) before ChangeTabletType is ever called. The topo is never updated — the tablet stays PRIMARY.

VTOrc's StaleTopoPrimary analysis (added in #19173) can eventually recover this, but this PR prevents the problem at the source.

Reproduced locally by killing mysqld_safe + mysqld on a primary (keeping vttablet running) and calling SetReplicationSource via grpcurl. On main, the tablet stays PRIMARY in topo. With this PR, it transitions to REPLICA.

This PR adds IsMySQLLocal() and IsLocalMySQLDown() to the MysqlDaemon interface. IsLocalMySQLDown() probes MySQL and uses heuristics to distinguish "MySQL is down" from transient errors (CRConnectionError/errno 2002, too-many-connections, fd exhaustion, socket file validation).

Three targeted skips when MySQL is local and down:

  1. SetReplicationSource: skip convertBoolToSemiSyncAction, use SemiSyncActionNone
  2. updateTypeAndPublish: skip updateLocked, let publishStateLocked update topo. retryTransition reconnects when MySQL comes back
  3. setReplicationSourceLocked: skip replication configuration. VTOrc or vttablet restart will repair replication later

Also includes:

Related Issue(s)

Resolves #19623 and #19625

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

No flags or config changes required. The behavior is automatic for tablets with local mysqld (unix socket DBA connection)

AI Disclosure

Claude w/Opus 4.6 wrote tests, PR summaries and issues for me. It also helped address reviews from Copilot and helped reproduce the bug locally.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
…o-mysqld

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
…o-mysqld

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@timvaillancourt timvaillancourt self-assigned this Mar 11, 2026
@timvaillancourt timvaillancourt added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: VTTablet labels Mar 11, 2026
@github-actions github-actions bot added this to the v24.0.0 milestone Mar 11, 2026
@vitess-bot vitess-bot bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Mar 11, 2026
@vitess-bot
Copy link
Contributor

vitess-bot bot commented Mar 11, 2026

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@timvaillancourt timvaillancourt removed Component: TabletManager NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work labels Mar 11, 2026
Addresses a brief race in MySQL's close_connections() (mysqld.cc)
where close_listener() removes the unix socket before end_slave()
stops replication threads.

Refs: vitessio#19625

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
- IsLocalMySQLDown: return false if MysqlParams() fails or UnixSocket is empty
- TestMysqldIsLocalMySQLDown: add t.Cleanup to avoid leaking fakesqldb

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

- Fix IsLocalMySQLDown comment: GetDbaConnection only connects, no SELECT 1
- Check rlimit restore error in e2e fd exhaustion test
- Move curPrimary removal to outer defer in TestDownPrimary

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

timvaillancourt and others added 4 commits March 11, 2026 23:39
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
The root cause was that convertBoolToSemiSyncAction queries MySQL
before ChangeTabletType is ever called. When mysqld is down, it
fails with errno 2002 and the topo is never updated.

Three changes:
- SetReplicationSource: skip convertBoolToSemiSyncAction when MySQL
  is local and down, use SemiSyncActionNone instead
- updateTypeAndPublish: skip updateLocked when MySQL is local and
  down, so publishStateLocked can update topo without connect()
  draining the context
- setReplicationSourceLocked: fix comment (no longer dead code)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

- Remove unused mysqlDaemon interface and mysqld field from stateManager
- Pass localMySQLDown bool to setReplicationSourceLocked to avoid
  probing IsLocalMySQLDown twice in the same RPC

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@mattlord
Copy link
Member

Code Review

Correctness Issues

1. Comment references nonexistent recovery mechanism (Medium)

tm_state.go:264 — The comment says "retryTransition will handle it when MySQL comes back" but retryTransition doesn't exist anywhere in the codebase. The actual recovery path is VTTablet restarting and calling Open()updateLocked(). This is misleading and could cause someone to rely on a retry mechanism that doesn't exist.

2. TOCTOU gap in SetReplicationSource (Low)

rpc_replication.go:835localMySQLDown is probed once and threaded through. MySQL could come back online between the probe and setReplicationSourceLocked executing, causing us to skip replication configuration unnecessarily. In practice, VTOrc repairs this, and the window is tiny, so impact is low. But it's worth being aware of.

3. isFileDescriptorExhaustedProbe fragility with closed stdin (Low)

mysqld.go:396-403 — The probe does syscall.Dup(0) (duplicates stdin). If a vitess process has stdin closed (e.g., double-fork daemonization), Dup(0) returns EBADF — not EMFILE/ENFILE. The function would report "no fd exhaustion" even if fds are exhausted, because the error isn't the one it's looking for. Using syscall.Dup(1) or syscall.Dup(2) (stdout/stderr) would be more resilient since logging keeps those open, or you could try syscall.Socket(syscall.AF_UNIX, syscall.SOCK_STREAM, 0) as a more robust probe.

4. VTOrc test cleanup ordering (Low risk)

primary_failure_test.go:91-95 vs 117-124 — The deferred PermanentlyRemoveVttablet at line 93 calls KillTablets(curPrimary) at cleanup. But the new code at 117-124 restarts curPrimary and leaves it running. The defer will kill it at test end, which is fine for cleanup — but if a future test in the same function were added after line 124 expecting curPrimary to still be in CellInfos, the defer's removal would cause confusion. This is fine today but worth noting.


Performance Observations

5. Double MysqlParams() resolution in IsLocalMySQLDown (Minor)

mysqld.go:353 + 379GetDbaConnection at line 353 internally resolves MysqlParams(). Then line 379 resolves them again for the socket path check. You could resolve once at the top of the function and pass the socket path to the stat check, avoiding the duplicate work. Not hot-path, so negligible real-world impact.


Design Observations

6. localMySQLDown boolean parameter threading

rpc_replication.go:862 — Adding a boolean that causes the function to short-circuit at line 879 is a mild code smell. The caller (SetReplicationSource) knows MySQL is down but still calls setReplicationSourceLocked to get the ChangeTabletType side-effect. An alternative structure would be to extract the ChangeTabletType call to the caller and only call setReplicationSourceLocked when MySQL is up. That would eliminate the boolean parameter entirely:

if localMySQLDown {
    if tablet.Type == topodatapb.TabletType_PRIMARY {
        return tm.tmState.ChangeTabletType(ctx, topodatapb.TabletType_REPLICA, DBActionNone)
    }
    return nil
}
return tm.setReplicationSourceLocked(...)

This is a style/clarity point, not a bug.


Things Done Well

  • IsLocalMySQLDown layered validation — Connection attempt → error code check → fd exhaustion probe → socket stat is a solid defense-in-depth approach. Conservative default (return false when uncertain) avoids false positives.
  • STOP REPLICA in Shutdown — Best-effort with a tight 3-second timeout before the actual mysqladmin shutdown is the right tradeoff. The context inheritance (context.WithTimeout(ctx, 3*time.Second)) correctly takes the min of parent and 3s.
  • Schema tracker context-aware sleep (tracker.go:173-177) — Clean fix. The old time.Sleep(5*time.Second) ignored cancellation, blocking graceful shutdown. The select with ctx.Done() is the correct Go pattern.
  • E2E test in TestDownPrimary that validates the old primary rejoins as a replica with working replication is the right integration test for this change.
  • Unit test for TestShutdownStopsReplication — Good coverage of the new STOP REPLICA behavior without requiring a real MySQL instance.

Summary

The core design is sound: detect MySQL-down early, skip operations that require MySQL, still update topo, and let VTOrc repair replication later. The main actionable items are:

  1. Fix the misleading retryTransition comment (correctness)
  2. Consider hardening isFileDescriptorExhaustedProbe against closed stdin (defensive)
  3. Consider extracting the ChangeTabletType call to simplify the boolean threading (clarity)

- Fix path.Join with leading '/' dropping VTDATAROOT prefix
- Fix misleading comment about GetDbaConnection running SELECT 1
- Use IsLocalMySQLDown instead of socket file stat after SIGKILL

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: VTTablet Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

3 participants