[15.0] Fix VTOrc to handle multiple failures#11489
Merged
deepthi merged 3 commits intovitessio:release-15.0from Oct 14, 2022
Merged
[15.0] Fix VTOrc to handle multiple failures#11489deepthi merged 3 commits intovitessio:release-15.0from
deepthi merged 3 commits intovitessio:release-15.0from
Conversation
… and fix it Signed-off-by: Manan Gupta <manan@planetscale.com>
Contributor
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
If a new flag is being introduced:
If a workflow is added or modified:
Bug fixes
Non-trivial changes
New/Existing features
Backward compatibility
|
deepthi
reviewed
Oct 13, 2022
Collaborator
deepthi
left a comment
There was a problem hiding this comment.
vtorc tests are failing :(
Signed-off-by: Manan Gupta <manan@planetscale.com>
GuptaManan100
commented
Oct 14, 2022
Comment on lines
644
to
+647
| if vttablet == tablet { | ||
| // remove this tablet since its mysql has stopped | ||
| cellInfo.ReplicaTablets = append(cellInfo.ReplicaTablets[:i], cellInfo.ReplicaTablets[i+1:]...) | ||
| cellInfo.RdonlyTablets = append(cellInfo.RdonlyTablets[:i], cellInfo.RdonlyTablets[i+1:]...) |
Contributor
Author
There was a problem hiding this comment.
I can't believe this was the issue 😵💫 ...
We were deleting the tablet from the wrong list. After finding the index of the tablet to remove from the rdonly list, we removed a tablet from the replica list 🙄
Contributor
Author
There was a problem hiding this comment.
This rdonly tablet which we don't remove causes issues later as it is not in a usable state.
…ple-failures Signed-off-by: Manan Gupta <manan@planetscale.com>
deepthi
approved these changes
Oct 14, 2022
Contributor
|
I was unable to forwardport this Pull Request to the following branches: |
GuptaManan100
added a commit
to planetscale/vitess
that referenced
this pull request
Oct 17, 2022
* feat: added test for vtorc not being able to handle mutliple failures and fix it Signed-off-by: Manan Gupta <manan@planetscale.com> * test: fix code to delete rdonly tablet from the correct list Signed-off-by: Manan Gupta <manan@planetscale.com> Signed-off-by: Manan Gupta <manan@planetscale.com>
3 tasks
GuptaManan100
added a commit
that referenced
this pull request
Oct 17, 2022
* feat: added test for vtorc not being able to handle mutliple failures and fix it Signed-off-by: Manan Gupta <manan@planetscale.com> * test: fix code to delete rdonly tablet from the correct list Signed-off-by: Manan Gupta <manan@planetscale.com> Signed-off-by: Manan Gupta <manan@planetscale.com> Signed-off-by: Manan Gupta <manan@planetscale.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR fixes the issue described in #11488. There were actually 2 underlying issues -
TestRecoverWithMultipleFailures. The key difference between the two situations is that, in the test we terminate the tablets too along with the MySQL instances, so the replica's return with the fillStatus output, before the grpc call to the tablets that are down times out. So the fail-fast code doesn't really matter, because the replicas have already finished.When the tablets are up, but MySQL down, then if the failed tablets return earlier than the running ones, then we end up cancelling the context.
This PR removes this context cancellation when we receive more than 1 failures. This allows ERS to work even in the cases where the multiple failures are safe to fix based on the durability policies.
Related Issue(s)
Checklist
Deployment Notes