Don't abort restore if master is unreachable #5254

Merged
deepthi merged 3 commits into vitessio:master from planetscale:ds-fix-restore-crashloop on Oct 1, 2019
deepthi (Collaborator) commented Sep 30, 2019

This is a follow-up to #5000. With the code introduced in that PR, a cluster that is being restored from backups can get into a crash loop. This PR fixes that.

Signed-off-by: deepthi <deepthi@planetscale.com>
deepthi requested a review from sougou as a code owner on September 30, 2019 22:15
deepthi requested a review from enisoc on September 30, 2019 22:15
return vterrors.Wrap(err, "can't get master replication position")
// It is possible that though MasterAlias is set, the master tablet is unreachable
// Log a warning and let tablet restore in that case
// If we had instead considered this fatal, all tablets would crash-loop
Member

Can we change one of the e2e test cases to take the master down before restoring one of the tablets? Would that have caught this?

Collaborator Author

I was able to reproduce with a unit test, and verified that the fix works.

// If we had instead considered this fatal, all tablets would crash-loop
// until a master appears, which would make it impossible to elect a master.
log.Warningf("Can't get master replication position after restore: %v", err)
return nil
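
The behavior in this hunk can be sketched in isolation. This is a minimal illustration rather than the Vitess code: `masterPositionOrSkip`, `errMasterUnreachable`, and the `fetch` callback are hypothetical names, and the standard `log` package stands in for the project's logger. The point is only that a failed position lookup is logged and reported to the caller instead of aborting the restore.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errMasterUnreachable is a made-up error standing in for a failed call to the master.
var errMasterUnreachable = errors.New("master tablet unreachable")

// masterPositionOrSkip asks fetch (a hypothetical stand-in for the call that
// reads the master's replication position) for the position. On failure it
// logs a warning and returns ok=false instead of an error, so the caller can
// finish the restore rather than exit and crash-loop.
func masterPositionOrSkip(fetch func() (string, error)) (string, bool) {
	pos, err := fetch()
	if err != nil {
		log.Printf("warning: can't get master replication position after restore: %v", err)
		return "", false
	}
	return pos, true
}

func main() {
	unreachable := func() (string, error) { return "", errMasterUnreachable }
	if _, ok := masterPositionOrSkip(unreachable); !ok {
		fmt.Println("master unreachable; restore continues without the position check")
	}
}
```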
Member

I can't leave a line comment down there, so leaving it here.

The loop on line 248 seems like it will hot-loop indefinitely if replication never starts. Could we add a 1s delay between retries, and check if the context has been cancelled before each iteration?

Collaborator Author

Good point. Done.

…us, add unit test

Signed-off-by: deepthi <deepthi@planetscale.com>
enisoc (Member) left a comment

LGTM other than some optional comments.

github.com/golang/mock v1.3.1
github.com/golang/protobuf v1.3.2
github.com/golang/snappy v0.0.0-20170215233205-553a64147049
github.com/google/btree v1.0.0 // indirect
Member

Do these new entries persist after go mod tidy? I don't quite understand what's happening, but I've noticed the go tool adding some things that we don't actually need.

Collaborator Author

I hadn't intended to commit a new go.mod :(
But this one diff does persist after go mod tidy

newPos := status.Position
if !newPos.Equal(replicationPosition) {
break
select {
Member

I don't think you need a select to check the context if you're not waiting on anything else at the same time. I usually just do:

if err := ctx.Err(); err != nil {
  return err
}

Signed-off-by: deepthi <deepthi@planetscale.com>
deepthi merged commit 169331a into vitessio:master on Oct 1, 2019
deepthi deleted the ds-fix-restore-crashloop branch on October 1, 2019 21:43