Don't abort restore if master is unreachable by deepthi · Pull Request #5254 · vitessio/vitess

deepthi · 2019-09-30T22:15:11Z

This is a follow-up to #5000. With the code introduced in that PR, a cluster that is being restored from backups can get into a crash loop. This PR fixes that.

Signed-off-by: deepthi <deepthi@planetscale.com>

enisoc · 2019-09-30T22:42:46Z

go/vt/vttablet/tabletmanager/restore.go

-		return vterrors.Wrap(err, "can't get master replication position")
+		// It is possible that though MasterAlias is set, the master tablet is unreachable
+		// Log a warning and let tablet restore in that case
+		// If we had instead considered this fatal, all tablets would crash-loop


Can we change one of the e2e test cases to take the master down before restoring one of the tablets? Would that have caught this?

I was able to reproduce with a unit test, and verified that the fix works.

enisoc · 2019-09-30T22:55:44Z

go/vt/vttablet/tabletmanager/restore.go

+		// If we had instead considered this fatal, all tablets would crash-loop
+		// until a master appears, which would make it impossible to elect a master.
+		log.Warningf("Can't get master replication position after restore: %v", err)
+		return nil


I can't leave a line comment down there, so leaving it here.

The loop on line 248 seems like it will hot-loop indefinitely if replication never starts. Could we add a 1s delay between retries, and check if the context has been cancelled before each iteration?

Good point. Done.

enisoc

LGTM other than some optional comments.

enisoc · 2019-10-01T18:00:33Z

go.mod

 	github.com/golang/mock v1.3.1
 	github.com/golang/protobuf v1.3.2
 	github.com/golang/snappy v0.0.0-20170215233205-553a64147049
+	github.com/google/btree v1.0.0 // indirect


Do these new entries persist after go mod tidy? I don't quite understand what's happening, but I've noticed the go tool adding some things that we don't actually need.

I hadn't intended to commit a new go.mod :(
But this one diff does persist after go mod tidy

enisoc · 2019-10-01T18:04:23Z

go/vt/mysqlctl/builtinbackupengine.go

-				newPos := status.Position
-				if !newPos.Equal(replicationPosition) {
-					break
+				select {


I don't think you need a select to check the context if you're not waiting on anything else at the same time. I usually just do:

if err := ctx.Err(); err != nil { return err }

Don't abort restore if master is unreachable

0d1ebf0

Signed-off-by: deepthi <deepthi@planetscale.com>

deepthi requested a review from sougou as a code owner September 30, 2019 22:15

deepthi requested a review from enisoc September 30, 2019 22:15

enisoc reviewed Sep 30, 2019

implement delay between retries of attempting to get mysql slave stat…

e6c4a09

…us, add unit test Signed-off-by: deepthi <deepthi@planetscale.com>

enisoc approved these changes Oct 1, 2019

cleanup per review comments

1e1ef87

Signed-off-by: deepthi <deepthi@planetscale.com>

deepthi merged commit 169331a into vitessio:master Oct 1, 2019

deepthi deleted the ds-fix-restore-crashloop branch October 1, 2019 21:43

spark4 mentioned this pull request Nov 12, 2019

Serry deploy tinyspeck/vitess#140

Closed

spark4 mentioned this pull request Nov 22, 2019

Slack sync upstream 2019 11 09.r0 tinyspeck/vitess#142

Merged

rafael mentioned this pull request Dec 11, 2019

Slack sync upstream 2019 12 11.r0 tinyspeck/vitess#143

Merged

deepthi merged 3 commits into vitessio:master from planetscale:ds-fix-restore-crashloop
