Topo lookup tablet type #18777

Merged
mattlord merged 14 commits into vitessio:main from sbaker617:topo-lookup-tablet-type
Nov 21, 2025

Conversation

@sbaker617
Contributor

@sbaker617 sbaker617 commented Oct 20, 2025

Description

This adds an optional --init-tablet-type-lookup startup flag for vttablet to implement functionality described in this issue.

When set, --init-tablet-type-lookup makes vttablet search the topology for a record of the tablet being started. If a record is found, vttablet uses the tablet type stored in it instead of the one passed with --init-tablet-type.

Related Issue(s)

Checklist

Deployment Notes

AI Disclosure

@vitess-bot
Contributor

vitess-bot bot commented Oct 20, 2025

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Oct 20, 2025
@github-actions github-actions bot added this to the v24.0.0 milestone Oct 20, 2025
@sbaker617 sbaker617 force-pushed the topo-lookup-tablet-type branch from 79a26cd to 4cfab07 Compare October 21, 2025 00:33
Contributor

@ejortegau ejortegau left a comment

A few comments.

@mattlord mattlord added Component: Cluster management Type: Feature and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Nov 4, 2025
@mattlord mattlord requested review from mattlord and timvaillancourt and removed request for rohit-nayak-ps and shlomi-noach November 4, 2025 16:28
@codecov

codecov bot commented Nov 4, 2025

Codecov Report

❌ Patch coverage is 95.65217% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 69.78%. Comparing base (79af4c1) to head (bc3087d).
⚠️ Report is 18 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| go/vt/vttablet/tabletmanager/tm_init.go | 95.65% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18777      +/-   ##
==========================================
+ Coverage   69.73%   69.78%   +0.04%     
==========================================
  Files        1608     1608              
  Lines      214776   214861      +85     
==========================================
+ Hits       149781   149942     +161     
+ Misses      64995    64919      -76     


@sbaker617
Contributor Author

A note on valid values for --init-tablet-type: commit 8665230 incorrectly added PRIMARY to the list of valid options in the flag's help text. The validation logic has remained unchanged since 31cb5f8. Current code:

	tabletType, err := topoproto.ParseTabletType(initTabletType)
	if err != nil {
		return nil, err
	}
	switch tabletType {
	case topodatapb.TabletType_SPARE, topodatapb.TabletType_REPLICA, topodatapb.TabletType_RDONLY:
	default:
		return nil, fmt.Errorf("invalid init-tablet-type %v; can only be REPLICA, RDONLY or SPARE", tabletType)
	}

Member

@mattlord mattlord left a comment

Thanks, @sbaker617 ! I do think that this logical behavior makes sense and we may end up making this behavior the default at some point in the future. But I also think that we should be very conservative here and start with it being experimental.

Let me know what you think. ❤️

@@ -105,7 +106,8 @@ func registerInitFlags(fs *pflag.FlagSet) {
 	utils.SetFlagStringVar(fs, &tabletHostname, "tablet-hostname", tabletHostname, "if not empty, this hostname will be assumed instead of trying to resolve it")
 	utils.SetFlagStringVar(fs, &initKeyspace, "init-keyspace", initKeyspace, "(init parameter) keyspace to use for this tablet")
 	utils.SetFlagStringVar(fs, &initShard, "init-shard", initShard, "(init parameter) shard to use for this tablet")
-	utils.SetFlagStringVar(fs, &initTabletType, "init-tablet-type", initTabletType, "(init parameter) tablet type to use for this tablet. Valid values are: PRIMARY, REPLICA, SPARE, and RDONLY. The default is REPLICA.")
+	utils.SetFlagStringVar(fs, &initTabletType, "init-tablet-type", initTabletType, "(init parameter) the tablet type to use for this tablet. Can be REPLICA, RDONLY, or SPARE. The default is REPLICA.")
+	fs.BoolVar(&initTabletTypeLookup, "init-tablet-type-lookup", initTabletTypeLookup, "(optional, init parameter) if enabled, look up the tablet type from the existing topology record on restart and use that instead of init-tablet-type. This allows tablets to maintain their changed roles (e.g., RDONLY/DRAINED) across restarts. If disabled or if no topology record exists, init-tablet-type will be used.")
Member

  1. I think that we need to mark this as Experimental: in the flag description. This has the potential to cause a lot of unexpected edge cases. So this should be opt-in and communicate to the user that is interested in this behavior that there may be unexpected and undesirable side effects. Then that gives users that are still interested the opportunity to try it out and report any issues for follow-up work.
  2. We should be more clear about what we're looking up here. We're looking up the tablet alias, which consists of:
    a. The cell the vttablet is starting in
    b. The --tablet-uid value passed to it
    In your deployment environment (AFAIK) both of those are static, but in many deployments (e.g. k8s with the operator) they are not. Only a running process has live state, so persisting more of that (and such a key part) across restarts for the next process is, I don't think, quite so cut and dry as it may initially seem. And the tablet type of a process is changed for a very wide variety of reasons and in various scenarios by various entities so this is why I think that both of these points are important.

    For example, "DRAINED" is documented as: A tablet that has been reserved by a Vitess background process (such as rdonly tablets for resharding). Should that state persist to a new process for the same logical tablet? I'm not sure. What if the same logical tablet of e.g. zone1-100 comes back up but with a different volume?

    RDONLY is certainly a simpler case. For most modern orchestrated deployments using the operator e.g., you define how many RDONLY tablets you want though and the system ensures you have N of them. So it could potentially start trying to add a new one as the old one is still coming back up or something.

    Anyway, lots of potential unexpected side effects which comes back to point 1. 🙂 Eventually we may well want to make this behavior the default, but I expect to have some follow-up work to address some edge/corner cases.

Contributor Author

Makes sense, I'm good with marking as Experimental, def more familiar with a setup where the aliases are tied to a specific instance.

Contributor Author

marked as Experimental and noted use of the tablet alias as the means of lookup. I kept the reference to RDONLY/DRAINED to highlight this would impact a range of types and might have unintended consequences... so it's a warning of sorts: cffa4e1

Contributor

> In your deployment environment (AFAIK) both of those are static, but in many deployments (e.g. k8s with the operator) they are not.

Agreed, this worries me a little. I'm assuming dynamic/changing tablet aliases is the most common style of Vitess deployment out there (in raw numbers, etc), so this feature could be a risk to the most common deployment style

We can document somewhere that it's only safe for static tablet aliases but I think that can easily be missed and I can't think of another way to warn/etc those users 🤔. Thoughts appreciated

Contributor Author

> We can document somewhere that it's only safe for static tablet aliases

I think marking as Experimental sends a strong signal that the flag shouldn't be enabled without proper testing/understanding of its functionality.
The current help message calls out that 'tablet alias' is used for the lookup:

(Experimental, init parameter) if enabled, uses tablet alias to look up the tablet type from the existing topology record on restart and use that instead of init-tablet-type. This allows tablets to maintain their changed roles (e.g., RDONLY/DRAINED) across restarts. If disabled or if no topology record exists, init-tablet-type will be used.

Hopefully that would give enough info for an env operator to know if this flag would be compatible with their deployment and would require testing before choosing to enable.

Member

@mattlord mattlord left a comment

LGTM! I only had minor comments and nits that you can address as you prefer.

Thanks, @sbaker617 !

Can you also please open a docs PR in the website repo? https://github.com/vitessio/website

We'll need to document this new flag in the 24.0 docs. Let me know if you need any help with that.

I'll then remove the "Needs Website Docs" label.

@sbaker617
Contributor Author

> Can you also please open a docs PR in the website repo? https://github.com/vitessio/website
>
> We'll need to document this new flag in the 24.0 docs. Let me know if you need any help with that.
>
> I'll then remove the "Needs Website Docs" label.

added doc PR here 👍🏻

@@ -173,7 +173,8 @@ Flags:
       --init-db-name-override string (init parameter) override the name of the db used by vttablet. Without this flag, the db name defaults to vt_<keyspacename>
       --init-keyspace string (init parameter) keyspace to use for this tablet
       --init-shard string (init parameter) shard to use for this tablet
-      --init-tablet-type string (init parameter) tablet type to use for this tablet. Valid values are: PRIMARY, REPLICA, SPARE, and RDONLY. The default is REPLICA.
+      --init-tablet-type string (init parameter) tablet type to use for this tablet. Valid values are: REPLICA, RDONLY, and SPARE. The default is REPLICA.
+      --init-tablet-type-lookup (Experimental, init parameter) if enabled, uses tablet alias to look up the tablet type from the existing topology record on restart and use that instead of init-tablet-type. This allows tablets to maintain their changed roles (e.g., RDONLY/DRAINED) across restarts. If disabled or if no topology record exists, init-tablet-type will be used.
Contributor

@sbaker617 I feel this comment should be updated so Vitess users understand this feature won't work on Vitess operator deployments, where tablet records are not left around in the topo

I think this current description may lead to those users trying this and believing there is a bug

Contributor Author

@timvaillancourt what would you like to see included in the message to best guide Vitess operator users ?

Member

@mattlord mattlord Nov 21, 2025

I'd rather not get into specifics like that. The operator's pruning behavior is configurable. There's a good chance that the replacement has a different alias (e.g. in another AZ) though too. But these are implementation details and examples of where the record would not be found and thus init-tablet-type is used.

@timvaillancourt
Contributor

@sbaker617 I feel this change needs an explanation in the v24 release notes as well: https://github.com/vitessio/vitess/blob/main/changelog/24.0/24.0.0/summary.md

This change, in my opinion, is kind of awkward for the project because it will only work in environments where topo records are left behind after tablets are removed, which is not the common case, for example in the Vitess operator

So a concern I have is this feature sounds like it will work for everyone, when in reality it only functions in a specific deployment scenario. Another general concern I have is how future changes/improvements to the logic may be hindered by having to support this split behaviour, but that is not a blocker per se

Signed-off-by: Stephen Baker <s.baker@slack-corp.com>
@sbaker617 sbaker617 force-pushed the topo-lookup-tablet-type branch from b2d925e to bc3087d Compare November 21, 2025 09:52
@sbaker617
Contributor Author

> @sbaker617 I feel this change needs an explanation in the v24 release notes as well: https://github.com/vitessio/vitess/blob/main/changelog/24.0/24.0.0/summary.md

@timvaillancourt , added note to changelog here. Docs PR also updated with note.

Contributor

@timvaillancourt timvaillancourt left a comment

LGTM, thanks @sbaker617


When enabled, the tablet uses its alias to look up the tablet type from the existing topology record on restart. This allows tablets to maintain their changed roles (e.g., RDONLY/DRAINED) across restarts without manual reconfiguration. If disabled or if no topology record exists, the standard `--init-tablet-type` value will be used instead.

**Note**: Vitess Operator–managed deployments generally do not keep tablet records in the topo between restarts, so this feature will not take effect in those environments.
Member

I'm not sure how accurate or helpful this is in the end. I'm fine leaving it though.

Member

@mattlord mattlord left a comment

Thanks, @sbaker617 ! ❤️

@mattlord mattlord removed the NeedsWebsiteDocsUpdate What it says label Nov 21, 2025
@mattlord mattlord merged commit 7906107 into vitessio:main Nov 21, 2025
103 of 105 checks passed
siddharth16396 pushed a commit to siddharth16396/postpone-complete that referenced this pull request Dec 3, 2025
Signed-off-by: Stephen Baker <s.baker@slack-corp.com>
Signed-off-by: siddharth16396 <siddharth16396@gmail.com>
Development

Successfully merging this pull request may close these issues.

Feature Request: Add ability to restore TabletType on restart based on topology.

4 participants