
Handle SQL thread crash in vt/vttablet/tabletserver/repltracker#7156

Closed
timvaillancourt wants to merge 5 commits into vitessio:master from timvaillancourt:repltracker-handle-sql-thread-error

Conversation

@timvaillancourt
Contributor

@timvaillancourt timvaillancourt commented Dec 10, 2020

Backport

NO

Status

DRAFT

Description

This PR causes vt/vttablet/tabletserver/repltracker/poller.go to return a replication SQL thread error when a replica's SQL thread has crashed on an unrecoverable error. This addresses delays in the system noticing that a replica is unhealthy (and potentially inconsistent) when the SQL thread has crashed.

The criteria for the error is:

  1. The SQL thread is not running
  2. The SQL thread error number is > 0
  3. sql_slave_skip_counter=0, i.e. SQL thread errors are not being skipped

The drawback to this change is that Last_SQL_Errno is not reset to zero unless a RESET MASTER or RESET SLAVE is run following a SQL thread crash.

Although somewhat unlikely, a user who manually resolves the SQL error and restarts the SQL thread without running RESET [MASTER|SLAVE] will continue to have a Last_SQL_Errno greater than zero. In this situation the new error could be returned if the SQL thread were later stopped. I don't have a good solution for this at the moment. Because restoring from a good backup is the norm in this situation (in my experience), I wonder whether this is an acceptable limitation or more logic is required 🤔
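
To make the limitation concrete, the sequence can be walked through with a small Go model. This is only a sketch: the `replica` type and its methods are hypothetical, and it encodes the assumption stated above that Last_SQL_Errno stays non-zero until a RESET is run.

```go
package main

import "fmt"

// replica is a hypothetical in-memory model of the relevant state.
// It is not a Vitess or MySQL type.
type replica struct {
	sqlRunning   bool
	lastSQLErrno int
	skipCounter  int
}

// crash models the SQL thread stopping on an error.
func (r *replica) crash(errno int) { r.sqlRunning = false; r.lastSQLErrno = errno }

// startSQLThread models a restart WITHOUT a RESET. Per the assumption
// above, the stale error number is left in place.
func (r *replica) startSQLThread() { r.sqlRunning = true }

// stopSQLThread models an operator stopping the SQL thread on purpose.
func (r *replica) stopSQLThread() { r.sqlRunning = false }

// reset models RESET [MASTER|SLAVE], which clears the error number.
func (r *replica) reset() { r.sqlRunning = false; r.lastSQLErrno = 0 }

// looksCrashed applies the three criteria from this PR description.
func (r *replica) looksCrashed() bool {
	return !r.sqlRunning && r.lastSQLErrno > 0 && r.skipCounter == 0
}

func main() {
	r := &replica{sqlRunning: true}

	r.crash(1062)               // example error number
	fmt.Println(r.looksCrashed()) // true: a genuine crash

	r.startSQLThread()          // manual fix, no RESET
	fmt.Println(r.looksCrashed()) // false: thread is running again

	r.stopSQLThread()           // deliberate stop, stale errno remains
	fmt.Println(r.looksCrashed()) // true: the false positive described above

	r.reset()
	fmt.Println(r.looksCrashed()) // false: errno cleared
}
```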

cc @sougou / @shlomi-noach / @deepthi / @tomkrouper / @drogart for thoughts on the above

Related Issue(s)

List related PRs against other branches:

Todos

  • Tests
  • Documentation

Deployment Notes

Notes regarding deployment of the contained body of work. These should note any
db migrations, etc.

Impacted Areas in Vitess

List general components of the application that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build

Signed-off-by: Tim Vaillancourt <timvaillancourt@github.com>
…com/timvaillancourt/vitess into repltracker-handle-sql-thread-error

Signed-off-by: Tim Vaillancourt <timvaillancourt@github.com>
@timvaillancourt timvaillancourt force-pushed the repltracker-handle-sql-thread-error branch from f95c8b2 to a2b378b Compare December 10, 2020 16:29
…com/timvaillancourt/vitess into repltracker-handle-sql-thread-error

Signed-off-by: Tim Vaillancourt <timvaillancourt@github.com>
@timvaillancourt
Contributor Author

Replacing with PR #7157 due to DCO CI problems

@timvaillancourt timvaillancourt deleted the repltracker-handle-sql-thread-error branch December 10, 2020 16:34