
[15.0] Fix VTOrc to handle multiple failures#11489

Merged
deepthi merged 3 commits into vitessio:release-15.0 from planetscale:vtorc-multiple-failures
Oct 14, 2022

Conversation

@GuptaManan100
Contributor

@GuptaManan100 GuptaManan100 commented Oct 13, 2022

Description

This PR fixes the issue described in #11488. There were actually two underlying issues:

  1. We were using the incorrect value when trying to update the last_checked field in VTOrc. As a result, the field was never updated, and VTOrc treated the last information it had from the rdonly (which was already dead) as the latest. This led to the rdonly being marked as a valid replicating replica of the primary, even though it had long since died.
  2. We failed fast in ERS when we saw more than one error. This cancelled the context on the fillStatus calls to the other replicas, and ERS then errored out. Surprisingly, we already had a test for ERS with multiple failures, TestRecoverWithMultipleFailures. The key difference between the two situations is that the test terminates the tablets along with the MySQL instances, so the replicas return their fillStatus output before the gRPC calls to the downed tablets time out. The fail-fast code therefore doesn't matter there, because the replicas have already finished.
    When the tablets are up but MySQL is down, the failed tablets can return earlier than the running ones, and we end up cancelling the context.

This PR removes the context cancellation when we receive more than one failure. This allows ERS to work even when multiple failures occur, as long as they are safe to fix under the durability policies.

Related Issue(s)

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

… and fix it

Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 GuptaManan100 added the Type: Bug, Component: VTOrc (Vitess Orchestrator integration), and Forwardport to: main (This will forward port the PR to the main branch) labels Oct 13, 2022
@vitess-bot
Contributor

vitess-bot bot commented Oct 13, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@deepthi deepthi changed the title Fix VTOrc to handle multiple failures [15.0] Fix VTOrc to handle multiple failures Oct 13, 2022
Collaborator

@deepthi deepthi left a comment


vtorc tests are failing :(

Signed-off-by: Manan Gupta <manan@planetscale.com>
Comment on lines 644 to +647
  if vttablet == tablet {
      // remove this tablet since its mysql has stopped
-     cellInfo.ReplicaTablets = append(cellInfo.ReplicaTablets[:i], cellInfo.ReplicaTablets[i+1:]...)
+     cellInfo.RdonlyTablets = append(cellInfo.RdonlyTablets[:i], cellInfo.RdonlyTablets[i+1:]...)
Contributor Author


I can't believe this was the issue 😵‍💫 ...
We were deleting the tablet from the wrong list. After finding the index of the tablet to remove from the rdonly list, we removed a tablet from the replica list 🙄

Contributor Author

@GuptaManan100 GuptaManan100 Oct 14, 2022


Because this rdonly tablet was never removed from the list, it causes issues later, as it is not in a usable state.

@GuptaManan100 GuptaManan100 requested a review from deepthi October 14, 2022 05:55
…ple-failures

Signed-off-by: Manan Gupta <manan@planetscale.com>
@frouioui frouioui mentioned this pull request Oct 14, 2022
100 tasks
@deepthi deepthi merged commit e7a76cc into vitessio:release-15.0 Oct 14, 2022
@deepthi deepthi deleted the vtorc-multiple-failures branch October 14, 2022 16:17
@vitess-bot
Contributor

vitess-bot bot commented Oct 14, 2022

I was unable to forwardport this Pull Request to the following branches: main.

Contributor

@rsajwani rsajwani left a comment


LGTM

GuptaManan100 added a commit to planetscale/vitess that referenced this pull request Oct 17, 2022
* feat: added test for vtorc not being able to handle mutliple failures and fix it

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix code to delete rdonly tablet from the correct list

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>
GuptaManan100 added a commit that referenced this pull request Oct 17, 2022
* feat: added test for vtorc not being able to handle mutliple failures and fix it

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: fix code to delete rdonly tablet from the correct list

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>

Labels

Component: VTOrc (Vitess Orchestrator integration), Forwardport to: main (This will forward port the PR to the main branch), Type: Bug


3 participants