
Bug Report: race condition in vreplication migration submission #12610

Closed
shlomi-noach opened this issue Mar 12, 2023 · 0 comments · Fixed by #12612
Assignees
Labels
Component: Online DDL (vitess/native/gh-ost/pt-osc), Type: Bug


Overview of the Issue

There is a race condition in Online DDL vitess migration submission, where the scheduler decides to submit a migration after validating it does not conflict with any existing migration, and then that same migration fails to start because it conflicts with existing migrations.

The TL;DR root cause is the fact that we call isAnyConflictingMigrationRunning twice, and conditions may change between the two calls.

The first call is in runNextMigration():

if conflictFound, _ := e.isAnyConflictingMigrationRunning(onlineDDL); conflictFound {
	continue // this migration conflicts with a running one
}

The second call is from within ExecuteWithVReplication:

func (e *Executor) ExecuteWithVReplication(ctx context.Context, onlineDDL *schema.OnlineDDL, revertMigration *schema.OnlineDDL) error {
	// make sure there's no vreplication workflow running under same name
	_ = e.terminateVReplMigration(ctx, onlineDDL.UUID)
	if conflictFound, conflictingMigration := e.isAnyConflictingMigrationRunning(onlineDDL); conflictFound {
		return vterrors.Wrapf(ErrExecutorMigrationAlreadyRunning, "conflicting migration: %v over table: %v", conflictingMigration.UUID, conflictingMigration.Table)
	}

And the condition that may change in between the two calls is whether the migration is ready_to_complete.

Analysis

The deal with concurrent migrations is that we don't really run two ALTER TABLEs fully concurrently; there is concurrency, but we avoid overlapping "copy table" phases. That is, if one migration is still busy copying the original table's 100000 rows, we don't start another one.

But once that migration completes the copy and is only tailing the binlog events, mostly up to date, we can spin up the next migration.

The way we determine that is by looking at ready_to_complete for the running migration. A vitess ALTER TABLE begins with ready_to_complete=0, and once in a while it gets reviewed. If table copy is done, and binary log applier is up to date (only a few seconds behind), then we set ready_to_complete=1 for that migration.
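To illustrate that review logic, here is a minimal sketch; this is not the actual Vitess executor code, and the type, field names, and the lag threshold are all assumptions:

package main

import (
	"fmt"
	"time"
)

// migration is an illustrative stand-in for a running Online DDL migration's state.
type migration struct {
	rowCopyComplete bool          // table copy phase has finished
	replicationLag  time.Duration // how far the binlog applier is behind
	readyToComplete bool          // corresponds to ready_to_complete above
}

// assumed threshold for "only a few seconds behind"
const maxLagForReadyToComplete = 10 * time.Second

// reviewMigration mimics the periodic review: ready_to_complete is set when the
// copy is done and the applier is nearly caught up. Note that it can also flip
// back to false if the applier starts lagging again, which is what enables the race.
func reviewMigration(m *migration) {
	m.readyToComplete = m.rowCopyComplete && m.replicationLag <= maxLagForReadyToComplete
}

func main() {
	m := &migration{rowCopyComplete: true, replicationLag: 2 * time.Second}
	reviewMigration(m)
	fmt.Println(m.readyToComplete) // true: copy done, applier nearly caught up

	m.replicationLag = 30 * time.Second // workload increases, applier lags
	reviewMigration(m)
	fmt.Println(m.readyToComplete) // false again
}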

Periodically, we look for the next potential migration we might run, and check whether that migration conflicts with anything currently running. Our 1st migration has ready_to_complete=1, and so it's fine to now start a new vitess ALTER TABLE.

The race happens as we enter ExecuteWithVReplication. We run the check again. We do that because ExecuteWithVReplication runs from within a goroutine.

At this time, it is possible for our 1st migration to have ready_to_complete=0. This can happen if the workload has increased and vreplication began lagging while applying the binary logs.

Now our isAnyConflictingMigrationRunning check finds a conflict, and that fails our migration.
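To make the check-then-act race concrete, here is a minimal self-contained sketch. All names here are hypothetical, and the executor's real mutex and scheduling are simplified to a single flag, a lock, and a channel that stands in for "the goroutine gets scheduled later":

package main

import (
	"fmt"
	"sync"
)

// scheduler holds the only state that matters for this sketch: whether the
// currently running migration is ready_to_complete.
type scheduler struct {
	mu              sync.Mutex
	readyToComplete bool
}

// conflictsWithRunning mirrors isAnyConflictingMigrationRunning: a new vitess
// migration conflicts unless the running one is ready_to_complete.
func (s *scheduler) conflictsWithRunning() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return !s.readyToComplete
}

// submitNext mirrors the two-phase submission: a first check in the scheduler,
// then a second check inside the goroutine that actually starts the migration.
func (s *scheduler) submitNext(proceed <-chan struct{}, done chan<- string) {
	if s.conflictsWithRunning() { // 1st check, as in runNextMigration()
		done <- "not submitted"
		return
	}
	go func() {
		<-proceed // simulate the goroutine only getting scheduled later
		// 2nd check, as in ExecuteWithVReplication(): conditions may have
		// changed, because the lock was not held across both checks.
		if s.conflictsWithRunning() {
			done <- "failed: conflicting migration"
			return
		}
		done <- "migration started"
	}()
}

func main() {
	s := &scheduler{readyToComplete: true}
	proceed := make(chan struct{})
	done := make(chan string, 1)
	s.submitNext(proceed, done)

	// In between the two checks, the periodic review observes increased
	// replication lag and clears ready_to_complete.
	s.mu.Lock()
	s.readyToComplete = false
	s.mu.Unlock()

	close(proceed)
	fmt.Println(<-done) // "failed: conflicting migration"
}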

I suggest two paths:

  1. Re-evaluate why we run ExecuteWithVReplication (and, likewise, ExecuteWithGhost) in a goroutine. If these ran directly from runNextMigration(), we wouldn't need the 2nd check at all, because everything would be protected by the same mutex lock.
  2. Consider adding a first_ready_to_complete_timestamp TIMESTAMP DEFAULT NULL column. This column can be set once and never cleared. It indicates that the migration was, at least once, ready to complete (see the sketch after this list).
    This is required irrespective of option (1), because in the current logic we require all running migrations to be ready_to_complete at the same time; something that may be harder to achieve when there are multiple concurrent migrations running.

Either of the above would solve the race condition.
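As a rough sketch of what option (2) could look like (the DDL, the schema_migrations table name, and the helper names are illustrative assumptions, not the actual fix):

package main

import (
	"fmt"
	"time"
)

// Illustrative DDL for the suggested sticky column; the actual table and
// column placement may differ.
const addFirstReadyToCompleteColumn = `
ALTER TABLE schema_migrations
  ADD COLUMN first_ready_to_complete_timestamp TIMESTAMP NULL DEFAULT NULL
`

// runningMigration is an illustrative view of a running migration's state.
type runningMigration struct {
	readyToComplete        bool
	firstReadyToCompleteAt *time.Time // set once, never cleared
}

// markReadyToComplete records the first time the migration became ready to complete.
// The timestamp is sticky: later lag spikes may clear readyToComplete, but not this.
func (m *runningMigration) markReadyToComplete(now time.Time) {
	m.readyToComplete = true
	if m.firstReadyToCompleteAt == nil {
		m.firstReadyToCompleteAt = &now
	}
}

// conflictsWithNewMigration relies on the sticky timestamp rather than the current
// (revocable) ready_to_complete flag, so a transient lag spike between the two
// checks no longer fails an already-scheduled migration.
func (m *runningMigration) conflictsWithNewMigration() bool {
	return m.firstReadyToCompleteAt == nil
}

func main() {
	m := &runningMigration{}
	fmt.Println(m.conflictsWithNewMigration()) // true: never been ready to complete

	m.markReadyToComplete(time.Now())
	m.readyToComplete = false                  // a lag spike clears the live flag...
	fmt.Println(m.conflictsWithNewMigration()) // ...but false: it was ready at least once
}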

Reproduction Steps

Binary Version

-

Operating System and Environment details

-

Log Fragments

No response

@shlomi-noach shlomi-noach added the Type: Bug and Component: Online DDL (vitess/native/gh-ost/pt-osc) labels Mar 12, 2023
@shlomi-noach shlomi-noach self-assigned this Mar 12, 2023