Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --test-on-replica-manual-replication-control flag #174

Merged
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 8 additions & 7 deletions go/base/context.go
Original file line number Diff line number Diff line change
Expand Up @@ -82,13 +82,14 @@ type MigrationContext struct {
ServeSocketFile string
ServeTCPPort int64

Noop bool
TestOnReplica bool
MigrateOnReplica bool
OkToDropTable bool
InitiallyDropOldTable bool
InitiallyDropGhostTable bool
CutOverType CutOver
Noop bool
TestOnReplica bool
MigrateOnReplica bool
ManualReplicationControl bool
OkToDropTable bool
InitiallyDropOldTable bool
InitiallyDropGhostTable bool
CutOverType CutOver

TableEngine string
RowsEstimate int64
Expand Down
8 changes: 8 additions & 0 deletions go/cmd/gh-ost/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,7 @@ func main() {
flag.BoolVar(&migrationContext.SkipRenamedColumns, "skip-renamed-columns", false, "in case your `ALTER` statement renames columns, gh-ost will note that and offer its interpretation of the rename. By default gh-ost does not proceed to execute. This flag tells gh-ost to skip the renamed columns, i.e. to treat what gh-ost thinks are renamed columns as unrelated columns. NOTE: you may lose column data")

executeFlag := flag.Bool("execute", false, "actually execute the alter & migrate the table. Default is noop: do some tests and exit")
testOnReplicaWithManualReplicationControl := flag.Bool("test-on-replica-manual-replication-control", false, "Same as --test-on-replica, but waits for replication to be stopped, instead of stopping it automatically. (Useful in RDS.)")
flag.BoolVar(&migrationContext.TestOnReplica, "test-on-replica", false, "Have the migration run on a replica, not on the master. At the end of migration replication is stopped, and tables are swapped and immediately swap-revert. Replication remains stopped and you can compare the two tables for building trust")
flag.BoolVar(&migrationContext.MigrateOnReplica, "migrate-on-replica", false, "Have the migration run on a replica, not on the master. This will do the full migration on the replica including cut-over (as opposed to --test-on-replica)")

Expand Down Expand Up @@ -149,6 +150,13 @@ func main() {
if migrationContext.SwitchToRowBinlogFormat && migrationContext.AssumeRBR {
log.Fatalf("--switch-to-rbr and --assume-rbr are mutually exclusive")
}
if *testOnReplicaWithManualReplicationControl {
if migrationContext.TestOnReplica {
log.Fatalf("--test-on-replica-manual-replication-control and --test-on-replica are mutually exclusive")
}
migrationContext.TestOnReplica = true
migrationContext.ManualReplicationControl = true
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shlomi-noach , in my opinion the code reads more complicated to me when there are two flags that have the same core functionality, with this one slight variation.

I think the semantics would be more clear if we still used the --test-on-replica flag and had a separate flag that just customized this one aspect of its behavior. So if one wants to have manual replication control, one would use the args:

--test-on-replica --manual-replication-control

Without the extra flag, --test-on-replica would behave as it always has.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my original intention. The flag is good to be named --test-on-replica-skip-stop-replica or whatever, but it would be an addon to --test-on-replica. Not only are the flags not mutually exclusive, the --test-on-replica-skip-stop-replica is only applicable when --test-on-replica exists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I understand now. My apologies for the confusion. I'll update the logic.

switch *cutOver {
case "atomic", "default", "":
migrationContext.CutOverType = base.CutOverAtomic
Expand Down
37 changes: 32 additions & 5 deletions go/logic/applier.go
Original file line number Diff line number Diff line change
Expand Up @@ -566,14 +566,41 @@ func (this *Applier) StartSlaveSQLThread() error {
return nil
}

func (this *Applier) isReplicationStopped() bool {
query := `show slave status`
replicationStopped := false

err := sqlutils.QueryRowsMap(this.db, query, func(rowMap sqlutils.RowMap) error {
replicationStopped = rowMap["Slave_IO_Running"].String == "No" && rowMap["Slave_SQL_Running"].String == "No"
return nil
})

if err != nil {
return false
}
return replicationStopped
}

// StopReplication is used by `--test-on-replica` and stops replication.
func (this *Applier) StopReplication() error {
if err := this.StopSlaveIOThread(); err != nil {
return err
}
if err := this.StopSlaveSQLThread(); err != nil {
return err
if this.migrationContext.ManualReplicationControl {
for {
log.Info("Waiting for replication to stop...")
if this.isReplicationStopped() {
log.Info("Replication stopped.")
break
}
time.Sleep(5 * time.Second)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed the use of plugins. In such case the plugin would be called synchronously and would stop replication.
The way I see it there is no need to wait for replication to stop: I'm assuming your call to amazon to stop replication is also synchronous, i.e. when the plugin returns, replication is stopped (or there is an error).
In either case, waiting is not the desired behavior.

I would suggest the PR could be made much smaller: merely introducing the flag; and if it is set, we skip the StopSlave... operation.

Later on I will add the plugin injection.

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that approach, it keeps things clean and simple in the tool. I would probably block this PR until #62 is done, since simply skipping replication without the ability to stop it externally, would break the behavior of --test-on-replica.

I think we should also log a Warning or Info line explaining the above - eg. "You must use a plugin to stop replication." The wording would depend on how #62 turns out.

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the RDS calls to start/stop replication, they are indeed synchronous. If there's ever a case of an async operation, the plugin could block while it waits for the desired state.

} else {
if err := this.StopSlaveIOThread(); err != nil {
return err
}
if err := this.StopSlaveSQLThread(); err != nil {
return err
}
}

readBinlogCoordinates, executeBinlogCoordinates, err := mysql.GetReplicationBinlogCoordinates(this.db)
if err != nil {
return err
Expand Down