Skip to content

SidecarDB Init: don't fail on schema init errors#12328

Merged
rohit-nayak-ps merged 5 commits intovitessio:mainfrom
planetscale:rn-sidecar-init-ignore-upgrade-errors
Feb 22, 2023
Merged

SidecarDB Init: don't fail on schema init errors#12328
rohit-nayak-ps merged 5 commits intovitessio:mainfrom
planetscale:rn-sidecar-init-ignore-upgrade-errors

Conversation

@rohit-nayak-ps
Copy link
Member

@rohit-nayak-ps rohit-nayak-ps commented Feb 13, 2023

Description

The initial implementation of the new mechanism for initializing the sidecar database in #11520 causes a failure if there is an error while applying a DDL to create or alter even a single sidecardb table. Such a failure causes the tablet initialization to fail. Hence if there is an issue with one of the queries to create or upgrade a table, say due to an incompatible MySQL version we would end up with tablets not reaching the serving state.

Now, if this happens during a Vitess version upgrade, it would not be catastrophic because the upgrade process would be stopped when the first replica being upgraded fails to launch.

However if, for any reason, the entire cluster rolls over to a new Vitess version in the presence of these failures, we would have a catastrophic incident where we could not serve queries. This can also happen if we upgrade the MySQL version on all tablets for the same Vitess version.

For example, we recently encountered a user cluster where a schema upgrade (using withddl) had caused a cluster to fail when the underlying MySQL was upgraded. This was due to a data inconsistency: one of the sidecar tables had zero dates and a table alter which added an unrelated column failed on newer versions of MySQL.

While such issues might happen very rarely, the possible existence of such critical issues is a strong reason not to fail tablet launches if we encounter errors during the sidecar init process.

This PR ignores errors in the sidecardb module when we execute create or alter queries on sidecardb tables. Quietly failing errors or just logging is also not ideal since it means some modules or some functionality of modules can fail due to an incorrect schema of a table used by that module and debugging them can be time consuming (Note that we had the same issue the withDDL approach).

To surface these issues we have added two stats which are available via the debug/vars endpoint:

  • SidecarDBDDLErrorCount which is the number of errors encountered
  • SidecarDBDDLErrors which is a list of the corresponding golang errors

A future PR will add functionality to VTAdmin to create visual alerts based on these stats.

Related Issue(s)

#10133

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Documentation was added or is not required

@rohit-nayak-ps rohit-nayak-ps added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: VReplication Backport to: release-16.0 labels Feb 13, 2023
@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Feb 13, 2023
@vitess-bot
Copy link
Contributor

vitess-bot bot commented Feb 13, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be a documentation on top of the test to explain what the expected behavior is what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@rohit-nayak-ps rohit-nayak-ps removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Feb 13, 2023
@rohit-nayak-ps rohit-nayak-ps force-pushed the rn-sidecar-init-ignore-upgrade-errors branch from 821f109 to 97fdc86 Compare February 14, 2023 21:15
@rohit-nayak-ps rohit-nayak-ps marked this pull request as ready for review February 15, 2023 06:45
@deepthi deepthi added Component: TabletManager NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work labels Feb 15, 2023
Copy link
Collaborator

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a proper description and link to tracking issue / previous PR.

@rohit-nayak-ps rohit-nayak-ps removed the NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work label Feb 15, 2023
@rohit-nayak-ps
Copy link
Member Author

Needs a proper description and link to tracking issue / previous PR.

Duh, sorry, was multi-tasking and forgot to save the description in another tab.

@frouioui frouioui mentioned this pull request Feb 16, 2023
51 tasks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this key be a constant? I see we are using it in several places.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Member

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only had some minor nits and suggestions. I will let you make the final call on those.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but DDL is an initialism and the letters should be capitalized to reflect that (even though I think all readers know what DDL is).

Comment on lines 107 to 117
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be safe, we should check that it was a valid cast. We don't want to crash the vttablet process for new error handling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same note here about additional safety by checking for a valid cast.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same note here about capitalizing DDL.

Comment on lines 75 to 90
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's worth avoiding this imprecise string matching. We could instead parse it, walk the AST, have a case where the type is not SHOW, then if the table identifier == the table name in the test we return the error. I can help with this if you want.

@rohit-nayak-ps rohit-nayak-ps force-pushed the rn-sidecar-init-ignore-upgrade-errors branch from 5c27072 to 59e441f Compare February 22, 2023 10:04
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
@rohit-nayak-ps rohit-nayak-ps merged commit a5d0cdc into vitessio:main Feb 22, 2023
@rohit-nayak-ps rohit-nayak-ps deleted the rn-sidecar-init-ignore-upgrade-errors branch February 22, 2023 15:47
rohit-nayak-ps added a commit that referenced this pull request Feb 23, 2023
…) (#12431)

* Don't fail on schema init errors

Signed-off-by: Rohit Nayak <rohit@planetscale.com>

* Fix debug var attribute name

Signed-off-by: Rohit Nayak <rohit@planetscale.com>

* Fix typo in debug var

Signed-off-by: Rohit Nayak <rohit@planetscale.com>

* Address review comment

Signed-off-by: Rohit Nayak <rohit@planetscale.com>

* Address review comments

Signed-off-by: Rohit Nayak <rohit@planetscale.com>

---------

Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Co-authored-by: Rohit Nayak <rohit@planetscale.com>
@hmaurer hmaurer mentioned this pull request Mar 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: TabletManager Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants