Add semi-sync monitor to unblock primaries blocked on semi-sync ACKs #17763

Merged
GuptaManan100 merged 28 commits into vitessio:main from planetscale:semi-sync-watcher
Feb 24, 2025

Conversation

@GuptaManan100 GuptaManan100 commented Feb 12, 2025

Description

This PR introduces a new component to the vttablet binary that monitors the semi-sync status of primary vttablets. We've observed cases where a brief network disruption can cause the primary to get stuck indefinitely waiting for semi-sync ACKs. In rare scenarios, this can block reparent operations and render the primary unresponsive. More information can be found in issues #17709 and #17749.

To address this, the new component continuously monitors the semi-sync status. If the primary becomes stuck on semi-sync ACKs, it generates writes to unblock it. If this fails, VTOrc is notified of the issue and initiates an emergency reparent operation.
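
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `probe` type, its `IsBlocked` and `Write` fields, `checkAndUnblock`, `maxWrites`, and `errStillBlocked` are all hypothetical names chosen for the example, and each `Write` call is assumed to enforce its own timeout.

```go
package main

import "errors"

// probe bundles the two operations the sketch needs. Both fields are
// hypothetical stand-ins and are not part of the Vitess API.
type probe struct {
	// IsBlocked reports whether the primary currently has writes waiting
	// on semi-sync ACKs.
	IsBlocked func() (bool, error)
	// Write issues one small write that itself needs a semi-sync ACK. It is
	// assumed to return an error if no ACK arrives within its own timeout.
	Write func() error
}

// errStillBlocked signals that the generated writes did not unblock the
// primary, so the caller should escalate.
var errStillBlocked = errors.New("primary still blocked on semi-sync ACKs")

// checkAndUnblock runs a single monitoring pass. If the primary is not
// blocked it does nothing. Otherwise it issues up to maxWrites writes and
// returns nil as soon as one of them is acknowledged.
func checkAndUnblock(p probe, maxWrites int) error {
	blocked, err := p.IsBlocked()
	if err != nil {
		return err
	}
	if !blocked {
		return nil
	}
	for i := 0; i < maxWrites; i++ {
		if err := p.Write(); err == nil {
			return nil // an ACK came back, so the primary is unblocked
		}
	}
	return errStillBlocked
}
```

In this sketch, a caller would invoke `checkAndUnblock` periodically and, on `errStillBlocked`, report the stuck state so that VTOrc can initiate the emergency reparent mentioned above.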

A metric tracking the number of outstanding writes from the semi-sync monitor has also been added. Unfortunately, the problem is not easy to reproduce in an end-to-end test, because doing so requires port-forwarding rules (on Mac) or iptables changes (on Linux), both of which need sudo access. I've added a test, but it doesn't run on CI: it can only be run locally, and the root user's password has to be entered when prompted.

The semi-sync monitor is wired up so that users who aren't running semi-sync will never see it open. It runs only when semi-sync is turned on for the primary.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Manan Gupta <manan@planetscale.com>

vitess-bot bot commented Feb 12, 2025

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added the labels NeedsBackportReason (If backport labels have been applied to a PR, a justification is required), NeedsDescriptionUpdate (The description is not clear or comprehensive enough, and needs work), NeedsIssue (A linked issue is missing for this Pull Request), and NeedsWebsiteDocsUpdate (What it says) on Feb 12, 2025
@github-actions github-actions bot added this to the v22.0.0 milestone Feb 12, 2025

codecov bot commented Feb 12, 2025

Codecov Report

Attention: Patch coverage is 82.13058% with 52 lines in your changes missing coverage. Please review.

Project coverage is 67.47%. Comparing base (67d081a) to head (4373407).
Report is 15 commits behind head on main.

Files with missing lines                                 Patch %   Lines
.../vttablet/tabletmanager/semisyncmonitor/monitor.go    89.65%    24 Missing ⚠️
go/vt/vttablet/tabletserver/tabletenv/config.go          44.00%    14 Missing ⚠️
go/vt/vttablet/tabletmanager/rpc_replication.go          15.38%    11 Missing ⚠️
go/vt/vtorc/inst/instance_dao.go                         75.00%    1 Missing ⚠️
go/vt/vtorc/logic/topology_recovery.go                   0.00%     1 Missing ⚠️
go/vt/wrangler/testlib/fake_tablet.go                    0.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17763      +/-   ##
==========================================
+ Coverage   67.41%   67.47%   +0.06%     
==========================================
  Files        1592     1593       +1     
  Lines      258024   258498     +474     
==========================================
+ Hits       173948   174430     +482     
+ Misses      84076    84068       -8     

Signed-off-by: Manan Gupta <manan@planetscale.com>
…writes

Signed-off-by: Manan Gupta <manan@planetscale.com>
…c is blocked on the primary

Signed-off-by: Manan Gupta <manan@planetscale.com>
…d for too long and a add a few logs

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
…ger to only start it when required

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 GuptaManan100 added the Type: Enhancement and Component: Cluster management labels and removed the NeedsDescriptionUpdate, NeedsWebsiteDocsUpdate, NeedsIssue, and NeedsBackportReason labels on Feb 17, 2025
@GuptaManan100 GuptaManan100 changed the title from "Add semi-sync watcher to unblock primaries blocked on semi-sync ACKs" to "Add semi-sync monitor to unblock primaries blocked on semi-sync ACKs" on Feb 17, 2025

@mattlord mattlord left a comment

This is looking good! Just had some questions/comments. Let me know what you think.

Signed-off-by: Manan Gupta <manan@planetscale.com>

@mattlord mattlord left a comment

LGTM! I only had one minor suggestion. Nice work on this, @GuptaManan100 !

Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 (Contributor, Author) commented

Done, I've made that change too! 💕

…nds value

Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 GuptaManan100 merged commit 81ce29c into vitessio:main Feb 24, 2025
106 checks passed
@GuptaManan100 GuptaManan100 deleted the semi-sync-watcher branch February 24, 2025 11:23

Labels

Component: Cluster management
Type: Enhancement - Logical improvement (somewhere between a bug and feature)

Projects

None yet

3 participants