Skip to content

VReplication based traffic migrater#4981

Merged
sougou merged 17 commits intovitessio:masterfrom
planetscale:ss-migrater
Aug 4, 2019
Merged

VReplication based traffic migrater#4981
sougou merged 17 commits intovitessio:masterfrom
planetscale:ss-migrater

Conversation

@sougou
Copy link
Copy Markdown
Contributor

@sougou sougou commented Jul 8, 2019

This is the new VReplication based migrater that will eventually replace all existing MigrateServed (From and Types).

At a high level, the input for the migration is a list of vreplication streams. The migrate command will migrate the traffic for all the requested streams. This means that the higher level MigrateServed functions can be changed to reuse the new implementation instead. The main idea is that the resharding metadata that was previously crucial for traffic migration will become non-critical. We instead rely on the vreplication streams to be the source of truth.

The functionality is exposed as vtctl commands MigrateReads and MigrateWrites. One of the input arguments is a workflow name which can be used to identify a group of vreplication streams across all shards.

The new code has the following properties:

  • Readable: The steps of the migration are clearly readable with well defined code paths about when something can be canceled, and when it cannot.
  • Unified: A unified code path is used for both table and shard migration.
  • Versatile: The new code supports many kinds of migrations. For tables, it can support 1-1, 1-many, many-1, and many-many. For shards, it can support simultaneous splits and merges.
  • Safe: A well defined set of steps are executed to cancel migration if there are failures before the point of no return. All operations after the point of no return are idempotent.
  • Journaled: A _vt.resharding_journal table has been created to keep track of the important parameters and binlog positions of a migration. This can be used to reverse vreplication, and also for redirecting other vreplication streams.
  • Unit tested: There are now local unit tests that exercise all critical code paths, and that simulate all important failure conditions.

@sougou sougou requested review from deepthi and rafael July 8, 2019 05:16
@rafael
Copy link
Copy Markdown
Member

rafael commented Jul 10, 2019

Oh this is very cool. I'll take time to review this as I'm familiarized with the legacy MigrateServed.

Copy link
Copy Markdown
Collaborator

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! clean and well decomposed code 🥇

// If it's a bad table or db, it could be because _vt.vreplication wasn't created.
// In that case we can try creating it again.
// If it's a bad table or db, it could be because the vreplication tables weren't created.
// In that case we can try creating then again.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo then -> them

// This file was copied from testlib. All tests from testlib should be moved
// to the current directory. In order to move tests from there, we have to
// remove the circular dependency it causes (through vtctl dependence).
// The tests in this diectory call wrangler functions directory. So, there's
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// The tests in this diectory call wrangler functions directory. So, there's
// The tests in this directory call wrangler functions directly. So, there's

return err
}

// For reads, locking the source keysppace is sufficient.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// For reads, locking the source keysppace is sufficient.
// For reads, locking the source keyspace is sufficient.

}
defer unlock(&err)

if mi.migrationType == binlogdatapb.MigrationType_TABLES {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would you use TABLES vs SHARDS?

return targets, nil
}

// hashStreams produces a reproduceable hash based on the input parameters.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// hashStreams produces a reproduceable hash based on the input parameters.
// hashStreams produces a reproducible hash based on the input parameters.


//-------------------------------------------------------------------------------------------------------------------
// Single cell RDONLY migration.
err := tme.wr.MigrateReads(ctx, tme.targetKeyspace, "test", topodatapb.TabletType_RDONLY, []string{"cell1"}, DirectionForward)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put a checkCellRouting before this to show that there are only 4 entries for cell1 before we MigrateReads, so that it is obvious to anyone reading the test what changes happened because of the call to MigrateReads.

// Other cell REPLICA migration.
// The global routing already contains redirections for rdonly.
// So, adding routes for replica and deploying to cell2 will also cause
// cell2 to migrat rdonly. This is a quirk that can be fixed later if necessary.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// cell2 to migrat rdonly. This is a quirk that can be fixed later if necessary.
// cell2 to migrate rdonly. This is a quirk that can be fixed later if necessary.

// Other cell REPLICA migration.
// The global routing already contains redirections for rdonly.
// So, adding routes for replica and deploying to cell2 will also cause
// cell2 to migrat rdonly. This is a quirk that can be fixed later if necessary.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a TODO for this?

"ks1.t2@replica": {"ks2.t2"},
})
verifyQueries(t, tme.allDBClients)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will be nice to add one more backward migration here - of REPLICA

sougou added 16 commits August 3, 2019 19:52
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Since we don't support multiple source or target keyspaces, the
uid parameters can be simplified to only have shards as keys, and
there will be a separate targetKeyspace parameter that applies to
all shards.

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Specifying shards and uids wasn't user-friendly. Specifying
workflow names should better. However, it will be the user's
responsibilty to keep them unique.

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Copy link
Copy Markdown
Member

@rafael rafael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't review the tests in detail. I spent most of my time in migrater.go

This looks really good and way more readable that we have today. I added some comments.

Next steps would be to refactor the low level functions. I need to remember what was the plan we had for that.

}

func commandMigrateReads(ctx context.Context, wr *wrangler.Wrangler, subFlags *flag.FlagSet, args []string) error {
reverse := subFlags.Bool("reverse", false, "Moves the served tablet type backward instead of forward. Use in case of trouble")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe: Replace use in case of trouble. Use in case a rollback of a migration is needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed that entire statement (including places where I copied from). It doesn't add any value.

}

log.Info("Looks like _vt.vreplication table may not exist. Trying to recreate... ")
log.Info("Looks like the vreplcation tables may not exist. Trying to recreate... ")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot if this is this log, but there is one that as is similar to this one that in practice becomes noisy and confusing. I think we should remove and only log if there are errors.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the confusing comment a few weeks ago. It was in tablet manager. This should not spam and looks useful, because it will let us know when the tables got created.

if servedType != topodatapb.TabletType_REPLICA && servedType != topodatapb.TabletType_RDONLY {
return fmt.Errorf("tablet type must be REPLICA or RDONLY: %v", servedType)
}
mi, err := wr.buildMigrater(ctx, targetKeyspace, workflow)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't be better to pass servedType and direction to the migrator? That way this whole method could become way more succinct because:

  • The validation above could be push down to the mi.validate.
  • The logic to decide if it's migrateTableReads or migrateShardReads could also be pushed down there. So this whole method could be something like:
mi, err = buildReadMigrater(...)
...
if err := mi.validate(ctx); err != nil {
		return err
}
ctx, unlock, lockErr := wr.ts.LockKeyspace(ctx, mi.sourceKeyspace, "MigrateReads")
if lockErr != nil {
	return lockErr
}
return mi.migrate(ctx)

This implies that you have a migrator for reads and a migrator for writes. But I'm thinking that this is actually better as right now the paths are already different for the validations and we are passing a flag to distinguish this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried changing. It made some things better and some things worse. So, let's leave it the way it is for now until we add more functionality.

if err != nil {
return nil, err
}
if len(p3qr.Rows) < 1 {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is still not very clear. Maybe adding a concrete example will help visualize.

if err := mi.wr.saveRoutingRules(ctx, rules); err != nil {
return err
}
return mi.wr.ts.RebuildSrvVSchema(ctx, cells)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this for the tables case? This is not clear to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RebuildSrvVSchema also propagates routing rules to the cells. Routing rules are logically part of the vschema, except that they're not associated to a specific keyspace.

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Copy link
Copy Markdown
Member

@rafael rafael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sougou sougou merged commit 9682ee0 into vitessio:master Aug 4, 2019
@sougou sougou deleted the ss-migrater branch August 4, 2019 22:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants