[Backport 12.0] Handle unhealthy tablets more safely #8950

Merged
systay merged 1 commit into vitessio:release-12.0 from planetscale:Backport8943
Oct 7, 2021

@mattlord mattlord commented Oct 7, 2021

Description

If a tablet is not healthy, the TabletHealth.Stats variable may be nil. Let's check for that and show a replication lag value of -1 in that case. The intention is to make users and tooling aware of unhealthy REPLICA/RDONLY tablets rather than hide or omit them from the output.

Let's also add a fairly aggressive timeout for the HTTP calls that fetch the throttler status, so that one misbehaving tablet (call) doesn't cause the entire command to time out.

Example

mysql> show vitess_tablets;
+-------+----------------+-------+------------+-------------+------------------+---------------+----------------------+
| Cell  | Keyspace       | Shard | TabletType | State       | Alias            | Hostname      | PrimaryTermStartTime |
+-------+----------------+-------+------------+-------------+------------------+---------------+----------------------+
| zone1 | sourcekeyspace | 0     | PRIMARY    | SERVING     | zone1-0000000100 | 192.168.0.134 | 2021-10-06T19:24:48Z |
| zone1 | sourcekeyspace | 0     | REPLICA    | SERVING     | zone1-0000000101 | 192.168.0.134 |                      |
| zone1 | sourcekeyspace | 0     | REPLICA    | SERVING     | zone1-0000000102 | 192.168.0.134 |                      |
| zone1 | sourcekeyspace | 0     | RDONLY     | SERVING     | zone1-0000000103 | 192.168.0.134 |                      |
| zone1 | targetkeyspace | 0     | PRIMARY    | NOT_SERVING | zone1-0000000900 | 192.168.0.134 |                      |
| zone1 | targetkeyspace | 0     | REPLICA    | NOT_SERVING | zone1-0000000902 | 192.168.0.134 |                      |
| zone1 | targetkeyspace | 0     | REPLICA    | NOT_SERVING | zone1-0000000901 | 192.168.0.134 |                      |
| zone1 | targetkeyspace | 0     | RDONLY     | NOT_SERVING | zone1-0000000903 | 192.168.0.134 |                      |
+-------+----------------+-------+------------+-------------+------------------+---------------+----------------------+
8 rows in set (0.00 sec)

mysql> show vitess_replication_status;
ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql> 

Logs

W1006 20:13:17.129131   22543 executor.go:1072] Could not get throttler status from 192.168.0.134:15902: Get "http://192.168.0.134:15902/throttler/check?app=vtgate": dial tcp 192.168.0.134:15902: connect: connection refused
W1006 20:13:17.129328   22543 executor.go:1083] Could not get replication status from 192.168.0.134:15902: Code: UNAVAILABLE
no healthy tablet available for 'keyspace:"targetkeyspace" shard:"0" tablet_type:REPLICA'

target: targetkeyspace.0.replica
E1006 20:13:17.129752   22543 server.go:303] mysql_server caught panic:
runtime error: invalid memory address or nil pointer dereference
runtime/panic.go:221 (0x44bfe6)
runtime/signal_unix.go:735 (0x44bfb6)
vitess.io/vitess/go/vt/vtgate/executor.go:1101 (0xe2be6a)
vitess.io/vitess/go/vt/vtgate/executor.go:758 (0xe27a93)
vitess.io/vitess/go/vt/vtgate/executor.go:270 (0xe23368)
vitess.io/vitess/go/vt/vtgate/executor.go:220 (0xe22f51)
vitess.io/vitess/go/vt/vtgate/executor.go:182 (0xe22b04)
vitess.io/vitess/go/vt/vtgate/vtgate.go:371 (0xe58bbc)
vitess.io/vitess/go/vt/vtgate/plugin_mysql_server.go:222 (0xe3601a)
vitess.io/vitess/go/mysql/conn.go:1255 (0xbc5e95)
vitess.io/vitess/go/mysql/conn.go:1240 (0xbc5b1d)
vitess.io/vitess/go/mysql/conn.go:877 (0xbc2b6e)
vitess.io/vitess/go/mysql/server.go:474 (0xbe2c87)
vitess.io/vitess/go/mysql/server.go:286 (0xbe17de)
runtime/asm_amd64.s:1581 (0x468fc0)

Related Issue(s)

Backport of #8943

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

If the tablet is not healthy then the TabletHealth.Stats
variable is nil. Let's check for that and instead show a replication
lag value of -1 in that case to reflect the state.

Let's also add a fairly aggressive timeout for the HTTP calls
to get the throttler status so that one misbehaving tablet (call)
doesn't cause the entire command to timeout.

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord changed the title Handle unhealthy tablets more safely [Backport 12.0] Handle unhealthy tablets more safely Oct 7, 2021
@mattlord mattlord requested a review from deepthi October 7, 2021 15:57
@systay systay merged commit 4a9054a into vitessio:release-12.0 Oct 7, 2021
@systay systay deleted the Backport8943 branch October 7, 2021 17:46
