Skip to content

Improve error messages in tablet gateway when primary tablets are not serving#125

Closed
austenLacy wants to merge 1 commit intov15.0.3-shopify-7from
austenlacy/vitess-project/improve-tablet-gateway-error
Closed

Improve error messages in tablet gateway when primary tablets are not serving#125
austenLacy wants to merge 1 commit intov15.0.3-shopify-7from
austenlacy/vitess-project/improve-tablet-gateway-error

Conversation

@austenLacy
Copy link

@austenLacy austenLacy commented Oct 12, 2023

Description

The error message returned from the tablet gateway is misleading. For example while a primary was not serving and down Vitess was returning current keyspace is being resharded to an application.

I swapped the tablet checks to check if the primary is serving before checking the resharding state.

For example, during a recent incident where a primary mysql was down Vitess was returning current keyspace is being resharded error messages back to an application which is very confusing. If we look at the tablet serving state however we can see that the primary is NOT_SERVING.

ro@cluster(replica)> show vitess_tablets;
+-------------+----------------------------------+-------+------------+-------------+------------------------+---------------+----------------------+
| Cell        | Keyspace                         | Shard | TabletType | State       | Alias                  | Hostname      | PrimaryTermStartTime |
+-------------+----------------------------------+-------+------------+-------------+------------------------+---------------+----------------------+
...
| us-east1    | keyspace1 | -10   | PRIMARY    | NOT_SERVING | us-east1-0000000097    | 247.4.117.26  | 2023-10-11T19:16:00Z |
| us-east1    | keyspace2 | -10   | REPLICA    | NOT_SERVING | us-east1-0000000102    | 247.2.54.58   |                      |
...
+-------------+----------------------------------+-------+------------+-------------+------------------------+---------------+----------------------+
68 rows in set (0.00 sec)

With this change we should at least return an error message of primary is not serving, there could be a reparent operation in progress which is more valuable than a "potential" resharding message.

… serving

Signed-off-by: Austen Lacy <austen.lacy@shopify.com>
Comment on lines +304 to 306
if kev.TargetIsBeingResharded(target) {
err = vterrors.Errorf(vtrpcpb.Code_CLUSTER_EVENT, "current keyspace is potentially being resharded")
continue

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this return false if it's not actually being resharded? Feel like we should address the bug at that level?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shanth96 I considered changing the TargetIsBeingResharded logic, but I'm also not familiar enough with the logic that determines if it's actually being resharded and if we would have enough information to determine that from the keyspace watch events. We can revisit that though if we think this change isn't as helpful.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough

@brendar
Copy link

brendar commented Oct 13, 2023

I have reservations about changing the order of error checks and their messages unless we can get this upstreamed. If we need community support for issues in the future, there's going to be confusion if we are mentioning error messages that don't exist in Vitess upstream.

@austenLacy
Copy link
Author

@brendar

I have reservations about changing the order of error checks and their messages unless we can get this upstreamed.

I'm happy to try upstreaming these changes to see what the maintainers think.

@brendar
Copy link

brendar commented Oct 16, 2023

I'm happy to try upstreaming these changes to see what the maintainers think.

Cool, let's try

@austenLacy
Copy link
Author

I found this PR that's in main that improves the heuristic for determining if a reshard is actually in progress. Should definitely help with this issue: vitessio#13856.

related issue: vitessio#13854

@austenLacy austenLacy closed this Nov 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants