
Bug Report: race condition in vreplication migration submission #12610

Closed
shlomi-noach opened this issue Mar 12, 2023 · 0 comments · Fixed by #12612
Assignees
Labels
Component: Online DDL (vitess/native/gh-ost/pt-osc), Type: Bug


Overview of the Issue

There is a race condition in Online DDL vitess migration submission, where the scheduler decides to submit a migration after validating it does not conflict with any existing migration, and then that same migration fails to start because it conflicts with existing migrations.

The TL;DR root cause is the fact that we call isAnyConflictingMigrationRunning twice, and conditions may change between the two calls.

The first call is in runNextMigration():

if conflictFound, _ := e.isAnyConflictingMigrationRunning(onlineDDL); conflictFound {
	continue // this migration conflicts with a running one
}

The second call is from within ExecuteWithVReplication:

func (e *Executor) ExecuteWithVReplication(ctx context.Context, onlineDDL *schema.OnlineDDL, revertMigration *schema.OnlineDDL) error {
	// make sure there's no vreplication workflow running under same name
	_ = e.terminateVReplMigration(ctx, onlineDDL.UUID)
	if conflictFound, conflictingMigration := e.isAnyConflictingMigrationRunning(onlineDDL); conflictFound {
		return vterrors.Wrapf(ErrExecutorMigrationAlreadyRunning, "conflicting migration: %v over table: %v", conflictingMigration.UUID, conflictingMigration.Table)
	}

And the condition that may change in between the two calls is whether the migration is ready_to_complete.

Analysis

The deal with concurrent migrations is that we don't really run two ALTER TABLEs fully concurrently; there is concurrency, but we avoid overlapping "copy table" phases. That is, if one migration is still busy copying the original table's 100000 rows, we don't start another one.

But once that migration completes the copy and is only tailing the binlog events, mostly up to date, we can spin up the next migration.

The way we determine that is by looking at ready_to_complete for the running migration. A vitess ALTER TABLE begins with ready_to_complete=0, and once in a while it gets reviewed. If table copy is done, and binary log applier is up to date (only a few seconds behind), then we set ready_to_complete=1 for that migration.
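To illustrate that review logic, here is a minimal sketch; this is not the actual Vitess executor code, and the type, field names, and the lag threshold are all assumptions:

package main

import (
	"fmt"
	"time"
)

// migration is an illustrative stand-in for a running Online DDL migration's state.
type migration struct {
	rowCopyComplete bool          // table copy phase has finished
	replicationLag  time.Duration // how far the binlog applier is behind
	readyToComplete bool          // corresponds to ready_to_complete above
}

// assumed threshold for "only a few seconds behind"
const maxLagForReadyToComplete = 10 * time.Second

// reviewMigration mimics the periodic review: ready_to_complete is set when the
// copy is done and the applier is nearly caught up. Note that it can also flip
// back to false if the applier starts lagging again, which is what enables the race.
func reviewMigration(m *migration) {
	m.readyToComplete = m.rowCopyComplete && m.replicationLag <= maxLagForReadyToComplete
}

func main() {
	m := &migration{rowCopyComplete: true, replicationLag: 2 * time.Second}
	reviewMigration(m)
	fmt.Println(m.readyToComplete) // true: copy done, applier nearly caught up

	m.replicationLag = 30 * time.Second // workload increases, applier lags
	reviewMigration(m)
	fmt.Println(m.readyToComplete) // false again
}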

Periodically, we look for the next potential migration we might run, and check whether that migration conflicts with anything currently running. Our 1st migration has ready_to_complete=1, and so it's fine to now start a new vitess ALTER TABLE.

The race happens as we enter ExecuteWithVReplication. We run the check again. We do that because ExecuteWithVReplication runs from within a goroutine.

At this time, it is possible for our 1st migration to have ready_to_complete=0. This can happen if the workload has increased and vreplication began lagging while applying the binary logs.

Now our isAnyConflictingMigrationRunning check finds a conflict, and that fails our migration.
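To make the check-then-act race concrete, here is a minimal self-contained sketch. All names here are hypothetical, and the executor's real mutex and scheduling are simplified to a single flag, a lock, and a channel that stands in for "the goroutine gets scheduled later":

package main

import (
	"fmt"
	"sync"
)

// scheduler holds the only state that matters for this sketch: whether the
// currently running migration is ready_to_complete.
type scheduler struct {
	mu              sync.Mutex
	readyToComplete bool
}

// conflictsWithRunning mirrors isAnyConflictingMigrationRunning: a new vitess
// migration conflicts unless the running one is ready_to_complete.
func (s *scheduler) conflictsWithRunning() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return !s.readyToComplete
}

// submitNext mirrors the two-phase submission: a first check in the scheduler,
// then a second check inside the goroutine that actually starts the migration.
func (s *scheduler) submitNext(proceed <-chan struct{}, done chan<- string) {
	if s.conflictsWithRunning() { // 1st check, as in runNextMigration()
		done <- "not submitted"
		return
	}
	go func() {
		<-proceed // simulate the goroutine only getting scheduled later
		// 2nd check, as in ExecuteWithVReplication(): conditions may have
		// changed, because the lock was not held across both checks.
		if s.conflictsWithRunning() {
			done <- "failed: conflicting migration"
			return
		}
		done <- "migration started"
	}()
}

func main() {
	s := &scheduler{readyToComplete: true}
	proceed := make(chan struct{})
	done := make(chan string, 1)
	s.submitNext(proceed, done)

	// In between the two checks, the periodic review observes increased
	// replication lag and clears ready_to_complete.
	s.mu.Lock()
	s.readyToComplete = false
	s.mu.Unlock()

	close(proceed)
	fmt.Println(<-done) // "failed: conflicting migration"
}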

I suggest two paths:

  1. Re-evaluate why we run ExecuteWithVReplication (and, likewise, ExecuteWithGhost) in a goroutine. If these ran directly from runNextMigration(), we wouldn't need the 2nd check at all, because everything would be protected by the same mutex lock.
  2. Consider adding a first_ready_to_complete_timestamp TIMESTAMP DEFAULT NULL column. This column can be set once and never cleared. It indicates that the migration was, at least once, ready to complete (see the sketch after this list).
    This is required irrespective of option (1), because in the current logic we require all running migrations to be ready_to_complete at the same time; something that may be harder to achieve when there are multiple concurrent migrations running.

Either of the above would solve the race condition.
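As a rough sketch of what option (2) could look like (the DDL, the schema_migrations table name, and the helper names are illustrative assumptions, not the actual fix):

package main

import (
	"fmt"
	"time"
)

// Illustrative DDL for the suggested sticky column; the actual table and
// column placement may differ.
const addFirstReadyToCompleteColumn = `
ALTER TABLE schema_migrations
  ADD COLUMN first_ready_to_complete_timestamp TIMESTAMP NULL DEFAULT NULL
`

// runningMigration is an illustrative view of a running migration's state.
type runningMigration struct {
	readyToComplete        bool
	firstReadyToCompleteAt *time.Time // set once, never cleared
}

// markReadyToComplete records the first time the migration became ready to complete.
// The timestamp is sticky: later lag spikes may clear readyToComplete, but not this.
func (m *runningMigration) markReadyToComplete(now time.Time) {
	m.readyToComplete = true
	if m.firstReadyToCompleteAt == nil {
		m.firstReadyToCompleteAt = &now
	}
}

// conflictsWithNewMigration relies on the sticky timestamp rather than the current
// (revocable) ready_to_complete flag, so a transient lag spike between the two
// checks no longer fails an already-scheduled migration.
func (m *runningMigration) conflictsWithNewMigration() bool {
	return m.firstReadyToCompleteAt == nil
}

func main() {
	m := &runningMigration{}
	fmt.Println(m.conflictsWithNewMigration()) // true: never been ready to complete

	m.markReadyToComplete(time.Now())
	m.readyToComplete = false                  // a lag spike clears the live flag...
	fmt.Println(m.conflictsWithNewMigration()) // ...but false: it was ready at least once
}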

Reproduction Steps

Binary Version

-

Operating System and Environment details

-

Log Fragments

No response

@shlomi-noach shlomi-noach added the Type: Bug and Component: Online DDL (vitess/native/gh-ost/pt-osc) labels Mar 12, 2023
@shlomi-noach shlomi-noach self-assigned this Mar 12, 2023