Online DDL: ready_to_complete race fix #12612

Merged
merged 4 commits into from
Mar 14, 2023

shlomi-noach
Contributor

Description

Fixes #12610

Two main changes in this PR:

  1. executeAlterDDLActionMigration() no longer creates goroutines to execute vitess, gh-ost, and pt-osc migrations. The calls to ExecuteWithVReplication, ExecuteWithGhost, and ExecuteWithPTOSC are now inlined and run synchronously.
  • This means these functions operate under the migrationMutex acquired by runNextMigration(), which calls isAnyConflictingMigrationRunning() to determine which migration can be executed without conflicting with running migrations.
  • As such, the three functions do not need to execute isAnyConflictingMigrationRunning() themselves, and that second check is removed, trivially solving the race condition reported in Bug Report: race condition in vreplication migration submission #12610.

The reason these migrations were called in a goroutine is historical, and I do not see the need for that anymore.
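To illustrate why checking and starting under one lock removes the race, here is a minimal sketch. It is a toy model, not the Vitess code: the `executor` type, the `running` map, the `runNext` method, and the "one migration per table" conflict rule are all assumptions made for the example. Only the idea (conflict check and start happen synchronously while the same mutex is held) comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// migration is a hypothetical, stripped-down stand-in for an Online DDL migration.
type migration struct {
	UUID  string
	Table string
}

var errConflict = errors.New("conflicting migration running")

// executor is a toy model (an assumption for this sketch), not the Vitess Executor.
type executor struct {
	mu      sync.Mutex        // plays the role of migrationMutex
	running map[string]string // table -> UUID of the migration running on it
}

func newExecutor() *executor {
	return &executor{running: map[string]string{}}
}

// runNext checks for a conflict and registers the migration while holding
// the same lock. Nothing is handed off to a goroutine, so no other caller
// can interleave between the check and the start, and no second check is needed.
func (e *executor) runNext(m migration) error {
	e.mu.Lock()
	defer e.mu.Unlock()

	// Simplified conflict rule (assumption): at most one migration per table.
	if uuid, ok := e.running[m.Table]; ok {
		return fmt.Errorf("%w: %s over table %s", errConflict, uuid, m.Table)
	}
	e.running[m.Table] = m.UUID
	return nil
}

func main() {
	e := newExecutor()
	fmt.Println(e.runNext(migration{UUID: "a", Table: "t1"})) // <nil>
	fmt.Println(e.runNext(migration{UUID: "b", Table: "t1"})) // conflicting migration running: a over table t1
	fmt.Println(e.runNext(migration{UUID: "c", Table: "t2"})) // <nil>
}
```

Had the start been deferred to a goroutine, the lock would be released before the migration registered itself, which is the window a second check was trying to cover.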

  2. We introduce a new column, ready_to_complete_timestamp, in the schema_migrations table. This column is updated any time the migration is ready_to_complete. It defaults to NULL.

It follows that if the column is NOT NULL, then the migration was ready_to_complete at some point in the past, even if it isn't right now. This condition, ready_to_complete_timestamp IS NOT NULL, is read into WasReadyToComplete.

In isAnyConflictingMigrationRunning(), we replace the ReadyToComplete check with a WasReadyToComplete check. It is OK to execute a new concurrent vitess migration if all existing migrations were at least once ready-to-complete, even if some of them are now not ready to complete. The main thing is that all existing migrations are done with the copy phase and are tailing the logs.
This change is compatible with the behavior of ready_to_complete of immediate operations (those are implicitly ready_to_complete and thereby their ready_to_complete_timestamp is set to a non-NULL value).
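The rule described above can be sketched as follows. This is a simplified illustration, not the actual isAnyConflictingMigrationRunning(): the `migrationState` type and `canStartConcurrently` function are hypothetical names, and the flags are modeled as booleans here, whereas the real struct uses int64 fields.

```go
package main

import "fmt"

// migrationState is a hypothetical view of one running migration.
type migrationState struct {
	UUID               string
	ReadyToComplete    bool // ready to complete right now
	WasReadyToComplete bool // models: ready_to_complete_timestamp IS NOT NULL
}

// canStartConcurrently reports whether a new vitess migration may start next
// to the running ones: every running migration must have been
// ready-to-complete at least once, i.e. it is past the copy phase.
func canStartConcurrently(running []migrationState) bool {
	for _, m := range running {
		if !m.WasReadyToComplete {
			return false
		}
	}
	return true
}

func main() {
	// "a" fell behind and is not ready right now, but it was ready earlier.
	running := []migrationState{
		{UUID: "a", ReadyToComplete: false, WasReadyToComplete: true},
		{UUID: "b", ReadyToComplete: true, WasReadyToComplete: true},
	}
	fmt.Println(canStartConcurrently(running)) // true

	// "c" is still copying rows and has never been ready to complete.
	running = append(running, migrationState{UUID: "c"})
	fmt.Println(canStartConcurrently(running)) // false
}
```

Using the "was ready" flag rather than the current one means a migration that momentarily lags while tailing the logs does not block new submissions.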

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

…ission condition; introduce ready_to_complete_timestamp and WasReadyToComplete

Signed-off-by: Shlomi Noach <[email protected]>
@shlomi-noach shlomi-noach added Type: Bug Component: Online DDL Online DDL (vitess/native/gh-ost/pt-osc) labels Mar 12, 2023
@shlomi-noach shlomi-noach requested a review from a team March 12, 2023 12:13
@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Mar 12, 2023
vitess-bot bot commented Mar 12, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be documentation on top of the test to explain what the expected behavior is and what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

Signed-off-by: Shlomi Noach <[email protected]>
if conflictFound, conflictingMigration := e.isAnyConflictingMigrationRunning(onlineDDL); conflictFound {
return vterrors.Wrapf(ErrExecutorMigrationAlreadyRunning, "conflicting migration: %v over table: %v", conflictingMigration.UUID, conflictingMigration.Table)
}

Contributor Author

No need for this double-verification, since we now run synchronously and under same mutex protection.

@shlomi-noach shlomi-noach removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Mar 12, 2023
Contributor Author

shlomi-noach commented Mar 12, 2023

vreplication_migrate_vdiff2_convert_tz is failing consistently, and I believe this to be the result of US daylight saving time, which started today, since the test's error is:

2023-03-12T14:00:25.2727232Z === NAME  TestMoveTablesTZ
2023-03-12T14:00:25.2727387Z     time_zone_test.go:175: 
2023-03-12T14:00:25.2727872Z         	Error Trace:	/home/runner/work/vitess/vitess/go/test/endtoend/vreplication/time_zone_test.go:175
2023-03-12T14:00:25.2728048Z         	Error:      	Not equal: 
2023-03-12T14:00:25.2728301Z         	            	expected: 28800
2023-03-12T14:00:25.2728525Z         	            	actual  : 25200
2023-03-12T14:00:25.2728718Z         	Test:       	TestMoveTablesTZ

The diff is 3600s == 1hr.

I'll try to restart the test tomorrow, 24hr after US daylight saving took effect, and see what happens.
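For reference, the one-hour shift can be reproduced with a small sketch. The log above does not name the time zone involved; `America/Los_Angeles` is an assumption chosen because its standard and daylight offsets (8h and 7h) match the 28800 and 25200 values in the failure. `offsetsAroundDST` is a hypothetical helper, not part of the test suite.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embed the tz database so the example does not depend on the host
)

// offsetsAroundDST returns the UTC offset (in seconds) of the assumed zone
// on the day before and the day after the US switch of March 12, 2023.
func offsetsAroundDST() (before, after int, err error) {
	// Assumption: the zone is not stated in the test output.
	loc, err := time.LoadLocation("America/Los_Angeles")
	if err != nil {
		return 0, 0, err
	}
	_, before = time.Date(2023, time.March, 11, 12, 0, 0, 0, time.UTC).In(loc).Zone()
	_, after = time.Date(2023, time.March, 13, 12, 0, 0, 0, time.UTC).In(loc).Zone()
	return before, after, nil
}

func main() {
	before, after, err := offsetsAroundDST()
	if err != nil {
		panic(err)
	}
	fmt.Println(before, after) // -28800 -25200
	fmt.Println(after - before) // 3600
}
```

A test that hardcodes the standard-time offset would see this 3600s difference for as long as daylight saving time is in effect, not only on the day of the switch.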

UUID string `json:"uuid,omitempty"`
Strategy DDLStrategy `json:"strategy,omitempty"`
Options string `json:"options,omitempty"`
RequestTime int64 `json:"time_created,omitempty"`
Contributor Author

@shlomi-noach shlomi-noach Mar 14, 2023

Found that this field (RequestTime) was unused.

@shlomi-noach shlomi-noach requested a review from a team March 14, 2023 06:09
TabletAlias string `json:"tablet,omitempty"`
Retries int64 `json:"retries,omitempty"`
ReadyToComplete int64 `json:"ready_to_complete,omitempty"`
WasReadyToComplete int64 `json:"was_ready_to_complete,omitempty"`
Contributor

Was going to ask if we can't use atomic.Int64 here (and for others where appropriate), but then found golang/go#54582.
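One practical obstacle to swapping these fields to atomic.Int64 can be shown with a short sketch. This is my own illustration and is not taken from the linked issue: `plainCounter`, `atomicCounter`, and `marshalBoth` are hypothetical names. The point is that atomic.Int64 is a struct with no exported fields and, as far as I know, no JSON marshaling methods, so encoding/json does not emit its numeric value.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync/atomic"
)

// plainCounter mirrors the style of the fields above: a plain int64 with a JSON tag.
type plainCounter struct {
	Retries int64 `json:"retries,omitempty"`
}

// atomicCounter is the hypothetical variant using atomic.Int64.
type atomicCounter struct {
	Retries atomic.Int64 `json:"retries,omitempty"`
}

// marshalBoth encodes the same value through both struct shapes.
// Pointers are used so the atomic value is never copied.
func marshalBoth(n int64) (plain, atom string, err error) {
	p := &plainCounter{Retries: n}
	a := &atomicCounter{}
	a.Retries.Store(n)

	pb, err := json.Marshal(p)
	if err != nil {
		return "", "", err
	}
	ab, err := json.Marshal(a)
	if err != nil {
		return "", "", err
	}
	return string(pb), string(ab), nil
}

func main() {
	plain, atom, err := marshalBoth(3)
	if err != nil {
		panic(err)
	}
	fmt.Println(plain) // {"retries":3}
	fmt.Println(atom)  // the stored value 3 does not appear in this output
}
```

Keeping plain int64 fields and guarding access with a mutex or the sync/atomic functions avoids changing the JSON shape.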

@shlomi-noach shlomi-noach merged commit e1a0fa0 into vitessio:main Mar 14, 2023
@shlomi-noach shlomi-noach deleted the ready-to-complete-race-fix branch March 14, 2023 07:43
Labels
Component: Online DDL Online DDL (vitess/native/gh-ost/pt-osc) Type: Bug
Development

Successfully merging this pull request may close these issues.

Bug Report: race condition in vreplication migration submission
3 participants