Skip to content

Online DDL, CancelMigration: distinguish user-issued vs. internally-issued cancellation#11011

Merged
deepthi merged 1 commit intovitessio:mainfrom
planetscale:onlineddl-fail-on-auto-cancel
Aug 15, 2022
Merged

Online DDL, CancelMigration: distinguish user-issued vs. internally-issued cancellation#11011
deepthi merged 1 commit intovitessio:mainfrom
planetscale:onlineddl-fail-on-auto-cancel

Conversation

@shlomi-noach
Copy link
Copy Markdown
Contributor

Description

#10900 introduced cancelled_timestamp column, which is populated when a migration is cancelled. Then, based on the value of this column, and assuming a problem in the migration, a migration transitions to either failed (if the column is NULL) or to cancelled (if the column is NOT NULL).

As quick recap, gh-ost, pt-osc and even vreplication itself, to some extent, act as 3rd party tools for Online DDL. To cancel a migration is to fail the migration; #10900 was made so that we can make a distinction between a "legitimately failed" migration to a "user cancelled" migration.

However, we left out a couple use cases, where the Online DDL executor itself may cancel a migration:

  • if the migration is stale
  • if there's an unrecoverable error in the migration flow

In these scenarios, the executor calls CancelMigration, but we expect the terminal state to be failed, not cancelled.

In this PR we make the distinction between a user-generated CANCEL (e.g. user issued a ALTER VITESS_MIGRATION ... CANCEL command), and an internal-generated cancellation. the latter now leads to failed state.

Related Issue(s)

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

…ssued cancellation

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@vitess-bot
Copy link
Copy Markdown
Contributor

vitess-bot bot commented Aug 15, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a new flag is being introduced, review whether it is really needed. The flag names should be clear and intuitive (as far as possible), and the flag's help should be descriptive.
  • If a workflow is added or modified, each items in Jobs should be named in order to mark it as required. If the workflow should be required, the GitHub Admin should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should either include a link to an issue that describes the bug OR an actual description of the bug and how to reproduce, along with a description of the fix.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.

Copy link
Copy Markdown
Collaborator

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
What about tests?

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

Mentioned in a private channel, tests are complex. I'm not sure right now. Reason is that the scenarios where a migration gets cancelled by the scheduler are a bit extreme and will be difficult to reproduce synthetically.

@deepthi
Copy link
Copy Markdown
Collaborator

deepthi commented Aug 15, 2022

Mentioned in a private channel, tests are complex. I'm not sure right now. Reason is that the scenarios where a migration gets cancelled by the scheduler are a bit extreme and will be difficult to reproduce synthetically.

You may need to build a unit test framework to simulate vreplication stream errors. We can let this through for now, because that will surely take time to do.

@deepthi deepthi merged commit cea4032 into vitessio:main Aug 15, 2022
@deepthi deepthi deleted the onlineddl-fail-on-auto-cancel branch August 15, 2022 19:05
systay pushed a commit to planetscale/vitess that referenced this pull request Aug 19, 2022
…ssued cancellation (vitessio#11011) (vitessio#962)

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
mattlord added a commit to planetscale/vitess that referenced this pull request Aug 19, 2022
This work makes the following changes:
  - Replaces any time.Sleep calls with a call to wait for the appropriate status(es)
  - Use WaitForStatus:Failed,Cancelled for cancel tests
  - Use CheckForStatus:Failed,Cancelled for cancel tests (instead of just Cancelled)
    - As we see both in the CI but we were requiring cancelled
    - See: vitessio#11011

Signed-off-by: Matt Lord <mattalord@gmail.com>
mattlord added a commit that referenced this pull request Aug 23, 2022
* Address additional causes of OnlineDDL test flakiness

This work makes the following changes:
  - Replaces any time.Sleep calls with a call to wait for the appropriate status(es)
  - Use WaitForStatus:Failed,Cancelled for cancel tests
  - Use CheckForStatus:Failed,Cancelled for cancel tests (instead of just Cancelled)
    - As we see both in the CI but we were requiring cancelled
    - See: #11011

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Always print status from the wait

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Require Cancelled state for for Cancel issued by user

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Remove unnecessary change

* return error if unable to set 'cancelled_timestamp'

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants