Skip to content

vtorc: support analysis ordering, improve semi-sync rollout#19427

Merged
timvaillancourt merged 21 commits intovitessio:mainfrom
timvaillancourt:vtorc-analysis-refactor
Feb 24, 2026
Merged

vtorc: support analysis ordering, improve semi-sync rollout#19427
timvaillancourt merged 21 commits intovitessio:mainfrom
timvaillancourt:vtorc-analysis-refactor

Conversation

@timvaillancourt
Copy link
Copy Markdown
Contributor

@timvaillancourt timvaillancourt commented Feb 19, 2026

Description

This PR refactors how VTOrc prioritizes and executes recoveries in order to address #18712, which requires some problems to be executed in a defined order

The major changes this PR brings is:

  1. All tablets are now considered for all problems. Previous to this PR, a tablet matched the first problem found in the large switch statement, limiting options to solve this problem
  2. All problems are defined as a slice-of-struct in a new file: analysis_problem.go
    • Problems can have BeforeAnalysis and AfterAnalysis dependencies, to enforce dependencies
    • Problems can have a numeric priority to adjust their ordering
  3. Problems are now sorted with their priority and dependencies considered
  4. Recoveries are now executed per-shard (vs per-tablet in isolation) randomly. This is required to enforce the ordering of recoveries by-shard
    • Problems that must be ordered are executed first, serially
    • Problems that do not require ordering are executed concurrently
    • The randomness of shards is needed to avoid global topo lock contention when multiple VTOrcs are ran. The previous logic also enforced randomness for this reason, but in a simpler way
  5. A PrimarySemiSyncMustBeSet problem is only solved if >= semi-sync ackers has enabled semi-sync, and VTOrc sees it as enabled in subsequent pollings (via the FullStatus RPC VTOrc calls every X seconds)

End-to-end tests were added to validate the main issue #18712 is addressed. This e2e test checks that semi-sync states on replicas are fixed first, before a PRIMARY. Also various unit tests confirm the validity of new functions that were added

Backport Reason

As explained in #18712, the current recovery logic causes a stall in writes when semi-sync is enabled, from the perspective of a shard

Backporting this would avoid a solvable stall in writes for older Vitess versions. If this doesn't backport nicely, this may be a reason to not-backport

Related Issue(s)

Resolves #18712

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

AI Disclosure

Claude w/Opus used to move switch statement to new map. Also some test-failure debugging

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@github-actions github-actions bot added this to the v24.0.0 milestone Feb 19, 2026
@vitess-bot vitess-bot bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Feb 19, 2026
@vitess-bot
Copy link
Copy Markdown
Contributor

vitess-bot bot commented Feb 19, 2026

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@timvaillancourt timvaillancourt added Backport to: release-22.0 Needs to be backport to release-22.0 Backport to: release-23.0 Needs to be backport to release-23.0 and removed NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request labels Feb 19, 2026
@timvaillancourt timvaillancourt self-assigned this Feb 19, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Feb 19, 2026

Codecov Report

❌ Patch coverage is 82.24299% with 38 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.46%. Comparing base (70c7a72) to head (25751e6).
⚠️ Report is 18 commits behind head on main.

Files with missing lines Patch % Lines
go/vt/vtorc/logic/topology_recovery.go 0.00% 33 Missing ⚠️
go/vt/vtorc/inst/analysis_problem.go 97.36% 4 Missing ⚠️
go/vt/vtorc/inst/analysis.go 75.00% 1 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (70c7a72) and HEAD (25751e6). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (70c7a72) HEAD (25751e6)
1 0
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #19427       +/-   ##
===========================================
- Coverage   69.67%   46.46%   -23.21%     
===========================================
  Files        1614       24     -1590     
  Lines      216793     3736   -213057     
===========================================
- Hits       151044     1736   -149308     
+ Misses      65749     2000    -63749     
Flag Coverage Δ
partial 46.46% <82.24%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@timvaillancourt timvaillancourt added Component: VTOrc Vitess Orchestrator integration Type: Bug Type: Enhancement Logical improvement (somewhere between a bug and feature) and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Feb 23, 2026
@timvaillancourt timvaillancourt marked this pull request as ready for review February 23, 2026 17:51
@timvaillancourt
Copy link
Copy Markdown
Contributor Author

timvaillancourt commented Feb 24, 2026

@mattlord / @nickvanw I believe all PR comments/questions have been addressed. Thanks for the review ❤️

For the most part: I've made sure all problems have a priority and deprecated the HasShardWideAction bool that really just meant "top priority". Now there is a priority-0 const for this instead of a bool. I also cleaned up the code comments for the problems 👍

Copy link
Copy Markdown
Collaborator

@mhamza15 mhamza15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just some small comments.

@mattlord
Copy link
Copy Markdown
Member

@mattlord / @nickvanw I believe all PR comments/questions have been addressed. Thanks for the review ❤️

For the most part: I've made sure all problems have a priority and deprecated the HasShardWideAction bool that really just meant "top priority". Now there is a priority-0 const for this instead of a bool. I also cleaned up the code comments for the problems 👍

@timvaillancourt I think this is the only open comment from me: https://github.com/vitessio/vitess/pull/19427/files/2f3b04caca1ba12f4fbcb41baccc5e830d9eaca3#r2843704610 I'm not sure if the comment needs to be changed or that should return true?

Copy link
Copy Markdown
Member

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice work on this, @timvaillancourt ❤️

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@timvaillancourt
Copy link
Copy Markdown
Contributor Author

timvaillancourt commented Feb 24, 2026

@mhamza15 this PR seems to be hitting a race in TestDemotePrimaryHang

I debugged this with Claude and here was the summary:

⏺ Here's what I've found. The test has a race condition due to operation ordering.                                                                                                           
                                                                                                                                                                                                                   
  Root cause: The test sets read_only = OFF on the stalePrimary BEFORE updating the topo to make it look like a stale PRIMARY. Here's the problematic sequence:                                                    
                                                                                                                                                                                                                   
  1. SleepTablet on stalePrimary (blocks tablet action queue for 90s)                                                                                                                                              
  2. SET GLOBAL read_only = OFF on stalePrimary                                                                                                                                                                    
  3. Update topo: stalePrimary → PRIMARY with old timestamp                                                                                                                                                        
  4. WaitForDetectedProblems(StaleTopoPrimary, ..., 15s)

  The problem:
  - After step 2, VTOrc's instance poll (every 1s) discovers read_only=OFF on the stalePrimary
  - The topo type is still REPLICA at this point → VTOrc detects ReplicaIsWritable
  - VTOrc's fixReplica recovery tries setReadOnly on the tablet, which hangs because SleepTablet is blocking the action queue
  - CheckAndRecover blocks at wg.Wait() waiting for the hung recovery to complete
  - CheckAndRecover is non-reentrant (atomic guard at vtorc.go:313), so all subsequent calls are skipped
  - The topo update (step 3) happens, but VTOrc never runs a new analysis → StaleTopoPrimary is never detected
  - The 15s timeout in WaitForDetectedProblems expires → test fails

  The fix: Move the SET GLOBAL read_only = OFF to after the topo update (or remove it entirely, since isStaleTopoPrimary doesn't check read_only). Once the topo type is PRIMARY, ReplicaIsWritable can't match
  anymore (it requires topo.IsReplicaType(a.TabletType) which is false for PRIMARY type), and StaleTopoPrimary will be the detected problem instead.

I wanted to hear what you thought about the fix proposed, and if you'd prefer to run with it, or I add it here?

This reverts commit 91d3264.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@timvaillancourt
Copy link
Copy Markdown
Contributor Author

I wanted to hear what you thought about the fix proposed, and if you'd prefer to run with it, or I add it here?

@mhamza15 It sounds like this waitgroup broke this, so I reverted this change: 91d3264. I believe CI will pass after this

@timvaillancourt timvaillancourt enabled auto-merge (squash) February 24, 2026 22:11
@timvaillancourt timvaillancourt merged commit e7888df into vitessio:main Feb 24, 2026
172 of 173 checks passed
@timvaillancourt timvaillancourt deleted the vtorc-analysis-refactor branch February 24, 2026 23:04
timvaillancourt added a commit that referenced this pull request Feb 25, 2026
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
timvaillancourt added a commit that referenced this pull request Feb 25, 2026
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
timvaillancourt added a commit to timvaillancourt/vitess that referenced this pull request Feb 26, 2026
…#19427

- Breaking change: external decompressor no longer read from backup MANIFEST by default
- VTOrc: ordered recovery execution and semi-sync rollout improvements

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
timvaillancourt added a commit that referenced this pull request Feb 26, 2026
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
timvaillancourt added a commit that referenced this pull request Feb 27, 2026
Resolve merge conflicts from cherry-picking the analysis ordering
and semi-sync rollout PR onto release-23.0. Adapt the new
problem-matching system to the release-23.0 codebase by removing
IncapacitatedPrimary (not present on this branch) and keeping
fmt.Sprintf style consistent with existing code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
timvaillancourt added a commit that referenced this pull request Feb 27, 2026
Resolve merge conflicts from cherry-picking the analysis ordering
and semi-sync rollout PR onto release-23.0. Adapt the new
problem-matching system to the release-23.0 codebase by removing
IncapacitatedPrimary (not present on this branch) and keeping
fmt.Sprintf style consistent with existing code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
timvaillancourt added a commit that referenced this pull request Mar 5, 2026
…rollout (#19427) (#19472)

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Co-authored-by: Mohamed Hamza <mhamza@fastmail.com>
Co-authored-by: Tim Vaillancourt <tim@timvaillancourt.com>
timvaillancourt added a commit that referenced this pull request Mar 9, 2026
…rollout (#19427) (#19471)

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Co-authored-by: Tim Vaillancourt <tim@timvaillancourt.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Backport to: release-22.0 Needs to be backport to release-22.0 Backport to: release-23.0 Needs to be backport to release-23.0 Component: VTOrc Vitess Orchestrator integration Type: Bug Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug Report: vtorc can cause primary lock up when keyspace durability policy is changed

4 participants