SidecarDB Init: don't fail on schema init errors #12328
rohit-nayak-ps merged 5 commits into vitessio:main
deepthi left a comment
Needs a proper description and link to tracking issue / previous PR.
Duh, sorry, was multi-tasking and forgot to save the description in another tab.
Should this key be a constant? I see we are using it in several places.
mattlord left a comment
I only had some minor nits and suggestions. I will let you make the final call on those.
go/vt/sidecardb/sidecardb.go
Nit, but DDL is an initialism and the letters should be capitalized to reflect that (even though I think all readers know what DDL is).
go/vt/sidecardb/sidecardb.go
To be safe, we should check that it was a valid cast. We don't want to crash the vttablet process for new error handling.
go/vt/sidecardb/sidecardb.go
Same note here about additional safety by checking for a valid cast.
go/vt/sidecardb/sidecardb.go
Same note here about capitalizing DDL.
go/vt/sidecardb/sidecardb_test.go
IMO it's worth avoiding this imprecise string matching. We could instead parse it, walk the AST, have a case where the type is not SHOW, then if the table identifier == the table name in the test we return the error. I can help with this if you want.
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
…) (#12431)
* Don't fail on schema init errors
* Fix debug var attribute name
* Fix typo in debug var
* Address review comment
* Address review comments
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Co-authored-by: Rohit Nayak <rohit@planetscale.com>
Description
The initial implementation of the new mechanism for initializing the sidecar database in #11520 fails if there is an error while applying a DDL to create or alter even a single sidecardb table. Such a failure causes the tablet initialization to fail. Hence, if there is an issue with one of the queries to create or upgrade a table, say due to an incompatible MySQL version, we would end up with tablets not reaching the serving state.
Now, if this happens during a Vitess version upgrade, it would not be catastrophic because the upgrade process would be stopped when the first replica being upgraded fails to launch.
However if, for any reason, the entire cluster rolls over to a new Vitess version in the presence of these failures, we would have a catastrophic incident where we could not serve queries. This can also happen if we upgrade the MySQL version on all tablets for the same Vitess version.
For example, we recently encountered a user cluster where a schema upgrade (using `withddl`) had caused a cluster to fail when the underlying MySQL was upgraded. This was due to a data inconsistency: one of the sidecar tables had zero dates, and a table alter that added an unrelated column failed on newer versions of MySQL.

While such issues might happen very rarely, the possible existence of such critical issues is a strong reason not to fail tablet launches if we encounter errors during the sidecar init process.
This PR ignores errors in the `sidecardb` module when we execute `create` or `alter` queries on sidecardb tables. Silently ignoring errors, or only logging them, is also not ideal: some modules, or some functionality of modules, can fail due to an incorrect schema of a table used by that module, and debugging such failures can be time-consuming. (Note that we had the same issue with the `withDDL` approach.)

To surface these issues we have added two stats, which are available via the `debug/vars` endpoint:
* `SidecarDBDDLErrorCount`, which is the number of errors encountered
* `SidecarDBDDLErrors`, which is a list of the corresponding golang errors

A future PR will add functionality to VTAdmin to create visual alerts based on these stats.
Related Issue(s)
#10133