[docdb] TS crash due to 2 tablets in cache with the same partition_key_start #3669
Jira Link: DB-1920

TS crashed in MetaCache::ProcessTabletLocations() (meta_cache.cc:722) with the fatal log:

The issue can affect both CQL and SQL, because the failing MetaCache is used by the yb::client code to get tablet location info for every tablet read/write operation (in AsyncRpc, via TabletInvoker, via YBClient::LookupTabletById()).
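For context, here is a minimal sketch of why two tablets sharing a partition_key_start are fatal for such a cache: the MetaCache indexes a table's tablets by the start key of the partition each one owns, so partition starts must be unique within a table. The names below (RemoteTablet, MetaCacheSketch, AddTablet, tablets_by_partition_) are illustrative assumptions, not the actual meta_cache.cc code:

```cpp
#include <cstdlib>
#include <iostream>
#include <map>
#include <memory>
#include <string>

// Minimal stand-in for the per-table tablet index in the client MetaCache.
// Tablets are keyed by the start key of the partition they own, so the
// cache assumes partition starts are unique within a table.
struct RemoteTablet {
  std::string tablet_id;
  std::string partition_key_start;
};

class MetaCacheSketch {
 public:
  // Roughly models handling a master locations response
  // (cf. ProcessTabletLocations in the real code).
  void AddTablet(const std::shared_ptr<RemoteTablet>& tablet) {
    auto [it, inserted] =
        tablets_by_partition_.emplace(tablet->partition_key_start, tablet);
    if (!inserted && it->second->tablet_id != tablet->tablet_id) {
      // Two *different* tablets claim the same partition start: the invariant
      // is broken, and the real code hits a fatal check that kills the TS.
      std::cerr << "FATAL: tablets " << it->second->tablet_id << " and "
                << tablet->tablet_id << " have the same partition_key_start\n";
      std::abort();
    }
  }

 private:
  std::map<std::string, std::shared_ptr<RemoteTablet>> tablets_by_partition_;
};

int main() {
  MetaCacheSketch cache;
  cache.AddTablet(std::make_shared<RemoteTablet>(RemoteTablet{"t1", ""}));
  cache.AddTablet(std::make_shared<RemoteTablet>(RemoteTablet{"t2", ""}));  // aborts
}
```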
Log:

The failing yb::client code runs inside the TServer (in the CQL/SQL service), preparing and processing RPC requests from the TS to the Master. The TS gets the partition info for all tablets from the Master: per table_id, the partition info for each tablet_id.
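As a rough sketch of the shape of that cached data (the struct and map names below are illustrative assumptions, not the real protobuf or client types):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative shape of what the TS caches after asking the Master for
// tablet locations: for each table_id, the partition range owned by each
// tablet_id. (Names are assumptions; the real types live in the master
// protobufs and yb::client's MetaCache.)
struct TabletPartitionInfo {
  std::string tablet_id;
  std::string partition_key_start;  // inclusive lower bound
  std::string partition_key_end;    // exclusive upper bound
};

using TableId = std::string;
std::unordered_map<TableId, std::vector<TabletPartitionInfo>> partitions_by_table;
```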
A few other log warnings (which may or may not be related to the issue):

Note: the default Cassandra driver (not the YB one) was used. Related discussion in Slack: https://yugabyte-db.slack.com/archives/GRY7P7LTG/p1581602620108600
Additional logs:

The failure can happen when a new table is created. WORKAROUND: set the Master flag
ROOT CAUSE: It's possible to get a race on the tablet state during CreateTable, because the CREATING->RUNNING state change happens in one thread, while the state check (if the tablet is still in CREATING after the timeout, it must be replaced) happens in the CatalogManager background-tasks thread. So, if it happens in this order:

1. The tablet is created in the CREATING state.
2. Tablet creation takes longer than the replacement timeout.
3. The background task checks the state, sees CREATING, and decides the tablet must be replaced.
4. The state changes CREATING->RUNNING, but the background task goes on to create the replacement tablet anyway.

So, the race on steps [3] & [4] can cause such a crash in MetaCache.
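The interleaving can be replayed deterministically in a toy model. This is a sketch under assumed names (Tablet, State; the sequential steps stand in for the two real threads in CatalogManager); it only demonstrates the check-then-act gap between steps [3] and [4]:

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Toy model of the race (illustrative names, not the real CatalogManager
// code). The state check and the replacement are not atomic with respect
// to the CREATING->RUNNING transition, so the steps below replay the bad
// interleaving deterministically.
enum class State { kCreating, kRunning };

struct Tablet {
  std::string id;
  std::string partition_key_start;
  State state = State::kCreating;
};

int main() {
  std::vector<std::unique_ptr<Tablet>> tablets;
  auto t1 = std::make_unique<Tablet>();
  t1->id = "t1";
  t1->partition_key_start = "";  // start of the whole hash range
  tablets.push_back(std::move(t1));

  // [1] + [2]: the tablet sits in CREATING past the replacement timeout.
  // [3]: the background task reads the state and decides to replace.
  const bool should_replace = (tablets[0]->state == State::kCreating);

  // [4]: meanwhile, the create path succeeds: CREATING -> RUNNING.
  tablets[0]->state = State::kRunning;

  // The background task acts on its stale decision and creates a
  // replacement tablet covering the same partition range.
  if (should_replace) {
    auto t2 = std::make_unique<Tablet>();
    t2->id = "t2";
    t2->partition_key_start = tablets[0]->partition_key_start;
    t2->state = State::kRunning;
    tablets.push_back(std::move(t2));
  }

  // Result: two RUNNING tablets with the same partition_key_start.
  for (const auto& t : tablets) {
    std::cout << t->id << " partition_key_start='"
              << t->partition_key_start << "'\n";
  }
  return 0;
}
```

Both tablets end up RUNNING with the same partition_key_start, which is exactly the pair the TS later loads into its MetaCache and crashes on.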