Don't abort restore if master is unreachable#5254
Conversation
Signed-off-by: deepthi <deepthi@planetscale.com>
| return vterrors.Wrap(err, "can't get master replication position") | ||
| // It is possible that though MasterAlias is set, the master tablet is unreachable | ||
| // Log a warning and let tablet restore in that case | ||
| // If we had instead considered this fatal, all tablets would crash-loop |
There was a problem hiding this comment.
Can we change one of the e2e test cases to take the master down before restoring one of the tablets? Would that have caught this?
There was a problem hiding this comment.
I was able to reproduce with a unit test, and verified that the fix works.
| // If we had instead considered this fatal, all tablets would crash-loop | ||
| // until a master appears, which would make it impossible to elect a master. | ||
| log.Warningf("Can't get master replication position after restore: %v", err) | ||
| return nil |
There was a problem hiding this comment.
I can't leave a line comment down there, so leaving it here.
The loop on line 248 seems like it will hot-loop indefinitely if replication never starts. Could we add a 1s delay between retries, and check if the context has been cancelled before each iteration?
…us, add unit test Signed-off-by: deepthi <deepthi@planetscale.com>
enisoc
left a comment
There was a problem hiding this comment.
LGTM other than some optional comments.
| github.com/golang/mock v1.3.1 | ||
| github.com/golang/protobuf v1.3.2 | ||
| github.com/golang/snappy v0.0.0-20170215233205-553a64147049 | ||
| github.com/google/btree v1.0.0 // indirect |
There was a problem hiding this comment.
Do these new entries persist after go mod tidy? I don't quite understand what's happening, but I've noticed the go tool adding some things that we don't actually need.
There was a problem hiding this comment.
I hadn't intended to commit a new go.mod :(
But this one diff does persist after go mod tidy
| newPos := status.Position | ||
| if !newPos.Equal(replicationPosition) { | ||
| break | ||
| select { |
There was a problem hiding this comment.
I don't think you need a select to check the context if you're not waiting on anything else at the same time. I usually just do:
if err := ctx.Err(); err != nil {
return err
}Signed-off-by: deepthi <deepthi@planetscale.com>
This is a follow up to #5000. With the code introduced in that PR, it is possible for a cluster that is being restored from backups to get into a crash loop situation. This PR fixes that.
Signed-off-by: deepthi deepthi@planetscale.com