WIP: ApplySchema to run online schema changes via gh-ost #67

Closed
shlomi-noach wants to merge 102 commits into master from pov-gh-ost-tablet-rewrite2

Conversation

@shlomi-noach

WORK IN PROGRESS

This PR begins to explore online schema changes via gh-ost.

NOTE: baseline for this branch has been refactored halfway, this PR is created with conflicts which will later be resolved.

Consider the below:

$ vtctl -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global ApplySchema -online_schema_change -sql "alter table zzz modify id bigint not null" commerce

Notice the new -online_schema_change flag.

To date, ApplySchema would ask vtctld to run a full blown schema change. vtctld would:

  • Identify the shards
  • Identify the master/primary for each shard
  • Establish a MySQL connection on each primary
  • Run the ALTER TABLE ... statement on each primary
  • Return when all are complete
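That legacy synchronous flow can be sketched roughly as follows (a minimal illustration; `Shard` and `planApplySchema` are hypothetical stand-ins, not the actual vtctld types):

```go
package main

import "fmt"

// Shard pairs a shard name with the address of its primary (master) tablet.
// These types are illustrative stand-ins for the real topology structures.
type Shard struct {
	Name    string
	Primary string
}

// planApplySchema models the legacy ApplySchema flow: one DDL task per
// shard primary, each to be executed synchronously and in full.
func planApplySchema(shards []Shard, sql string) []string {
	var tasks []string
	for _, s := range shards {
		tasks = append(tasks, fmt.Sprintf("run %q on %s (shard %s)", sql, s.Primary, s.Name))
	}
	return tasks
}

func main() {
	shards := []Shard{{"-80", "tablet-101"}, {"80-", "tablet-201"}}
	for _, t := range planApplySchema(shards, "alter table zzz modify id bigint not null") {
		fmt.Println(t)
	}
}
```

A blocking ALTER on each primary is what makes this flow unsuitable for large tables, which is what motivates the online approach below.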

We wish to look into online schema changes. Online schema changes are non-blocking: they can run for hours without affecting ongoing production traffic. Online schema changes are available via dedicated tooling:

This PR offers an integration with gh-ost. The initial submission makes a lot of assumptions and has rough edges:

  • That the gh-ost binary exists in the path
  • That the MySQL servers have a gh-ost user account with proper privileges
  • We run gh-ost immediately upon request. This should not be the general case: running gh-ost should be orchestrated (e.g. if there's already a running gh-ost migration, wait for it to complete; if gh-ost fails, retry N times?)
  • We run gh-ost asynchronously but return no job ID to track progress
  • gh-ost does not yet report success/failure to anyone/anything
  • There's no way to check progress

All of the above needs to be completed; hence this is a draft PR. What this PR does have is mostly wiring:

  • vtctl ApplySchema to tell vtctld it wants an online schema change, via gRPC
  • vtctld to intercept an online schema change
  • vtctld to identify shards and primaries
  • vtctld to request online schema changes from the vttablets of the primaries
  • vttablet to analyze the request and spawn gh-ost

To be continued.

sougou added 30 commits July 19, 2020 22:47
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>

Introduce a new Seconds type with explicit conversions to and
from time.Duration.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
curl -k -L https://github.com/github/gh-ost/releases/download/v1.0.49/gh-ost-binary-linux-20200209110835.tar.gz -o /tmp/gh-ost.tar.gz
(cd /tmp/ && tar xzf gh-ost.tar.gz)
cp /tmp/gh-ost /usr/bin

Reviewer:

Just out of curiosity: I can see that from a security perspective, installing the binary on each tablet is preferred. What are the other tradeoffs of installing the binary on each tablet vs. remotely on a dedicated server/pod?

shlomi-noach (Author):

Right; the above is just the local docker setup, and so does not represent a production environment. The above is also likely temporary. In any case, an important observation is that gh-ost reads data from MySQL and then mostly writes it back. It reads the data both by connecting into the replication stream and via normal queries. What we've observed at GitHub is that latency between the machine where gh-ost runs and the MySQL servers is important; e.g. we would only run gh-ost in the same DC as the affected servers. But I think taking it to the next level and running gh-ost on the very same tablet+MySQL master servers can be beneficial.
Another thing about running gh-ost from dedicated servers is the number of migrations you'd be able to run concurrently. I don't have the numbers, but if you wanted to run 100 concurrent migrations (say you have 100 shards), then I suspect running 100 gh-ost instances on the same dedicated server is unlikely to perform well. Actual testing is needed, but that's my suspicion.
When you run gh-ost right on the master server, that problem implicitly doesn't exist.

Keyspace string `json:"keyspace,omitempty"`
Table string `json:"table,omitempty"`
SQL string `json:"sql,omitempty"`
UUID string `json:"uuid,omitempty"`

Reviewer:

will the UUID also be used as replica-server-id to support potential concurrent migration?

shlomi-noach (Author):

With the current design, each keyspace/shard will be handed this migration. All shards will use the same UUID as the "migration job ID". Perhaps I should rename the variable to JobID or something.
This will make it easier to investigate issues: if you have the UUID and you know all tables/shards use that same UUID, then all logs are in the same place, called by the same names, etc. on all tablets.

One key design decision is that a tablet/master will not run two concurrent migrations. Now, that's subtle. Running two long-running migrations concurrently is known to be slower than running the two sequentially. However, sometimes one will be running a 3-day-long migration and then also want to ALTER a very small table, which only takes 2 minutes to run. That's a valid use case for running concurrently. Right now I'm not looking into that, and I enforce serialization. If Vitess is opinionated and recommends a max of 250-300GB of data per shard, then migration time is also capped to a reasonable timeframe (a few hours) and we can afford to serialize everything.
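The "one migration at a time per tablet" rule can be enforced with a simple atomic guard, sketched here under assumptions (this `Executor` and its methods are hypothetical; the PR's actual mechanism may differ):

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var ErrMigrationInProgress = errors.New("a migration is already running on this tablet")

// Executor serializes migrations: at most one may run at a time.
type Executor struct {
	running int32 // 0 = idle, 1 = migration in flight
}

// beginMigration atomically reserves the single migration slot, or fails fast.
func (e *Executor) beginMigration() error {
	if !atomic.CompareAndSwapInt32(&e.running, 0, 1) {
		return ErrMigrationInProgress
	}
	return nil
}

// endMigration releases the slot when the migration completes or fails.
func (e *Executor) endMigration() {
	atomic.StoreInt32(&e.running, 0)
}

func main() {
	e := &Executor{}
	fmt.Println(e.beginMigration()) // first request acquires the slot
	fmt.Println(e.beginMigration()) // second concurrent request is rejected
	e.endMigration()
	fmt.Println(e.beginMigration()) // slot is free again after completion
}
```

Failing fast (rather than blocking) leaves the queueing decision — wait, retry N times, etc. — to the orchestration layer discussed earlier.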

Comment on lines +65 to +73
OnlineDDLStatusRequested OnlineDDLStatus = "requested"
OnlineDDLStatusReviewed OnlineDDLStatus = "reviewed"
OnlineDDLStatusCancelled OnlineDDLStatus = "cancelled"
OnlineDDLStatusQueued OnlineDDLStatus = "queued"
OnlineDDLStatusReady OnlineDDLStatus = "ready"
OnlineDDLStatusRunning OnlineDDLStatus = "running"
OnlineDDLStatusComplete OnlineDDLStatus = "complete"
OnlineDDLStatusFailed OnlineDDLStatus = "failed"
)

Reviewer:

Curious: in the case the gh-ost process hangs there in the "running" state, are we going to do proactive health checking and treat it as "failed"?

shlomi-noach (Author):

gh-ost supports hooks, which we utilize. One of those hooks is on-status, which fires every 1 minute. You can use that as a liveness indicator.
Back when I worked on skeefree, I used it as a liveness/health indicator, such that if I hadn't seen a report in the past 10 minutes, I assumed the migration to be dead.

In our current (still evolving) setup, things are somewhat simpler, because I know for a fact that gh-ost runs on the master tablet's host, and so I can further ask the tablet to communicate with it. That is, we will still use on-status, but the tablet may also check that the gh-ost process is running (by communicating with the OS), or forcibly kill -9 it, etc.

OnlineDDLStatusRequested OnlineDDLStatus = "requested"
OnlineDDLStatusReviewed OnlineDDLStatus = "reviewed"
OnlineDDLStatusCancelled OnlineDDLStatus = "cancelled"
OnlineDDLStatusQueued OnlineDDLStatus = "queued"

Reviewer:

Nice to see the queueing feature! So it queues in global etcd?

shlomi-noach (Author):

Both in global etcd as well as in local _vt.schema_migrations table. To be re-evaluated now that VExec is available. Gonna look into it.

keyspace string
shard string

mu sync.Mutex

Reviewer:

what does the mutex prevent here?

shlomi-noach (Author):

Renamed (I guess I didn't push yet) to initMutex, to serialize the Open()/Close() flow; I mostly copied+pasted this logic from other Executor implementations.
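The initMutex pattern described here — serializing Open()/Close() and making them idempotent — looks roughly like the following (a sketch modeled on the description above, not the PR's exact code):

```go
package main

import (
	"fmt"
	"sync"
)

// Executor's lifecycle is guarded by initMutex so that concurrent
// Open()/Close() calls cannot interleave and corrupt state.
type Executor struct {
	keyspace string
	shard    string

	initMutex sync.Mutex
	isOpen    bool
}

// Open initializes the executor; repeated calls are no-ops.
func (e *Executor) Open() {
	e.initMutex.Lock()
	defer e.initMutex.Unlock()
	if e.isOpen {
		return
	}
	// ... acquire connections, start background workers, etc.
	e.isOpen = true
}

// Close tears the executor down; repeated calls are no-ops.
func (e *Executor) Close() {
	e.initMutex.Lock()
	defer e.initMutex.Unlock()
	if !e.isOpen {
		return
	}
	// ... stop workers, release connections.
	e.isOpen = false
}

func main() {
	e := &Executor{keyspace: "commerce", shard: "0"}
	e.Open()
	e.Open() // idempotent: second call is a no-op
	fmt.Println(e.isOpen)
	e.Close()
	fmt.Println(e.isOpen)
}
```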

// Execute validates and runs a gh-ost process.
// Validation includes testing the backend MySQL server and the gh-ost binary itself.
// Execution runs first a dry run, then the actual migration.
func (e *Executor) Execute(ctx context.Context, target querypb.Target, alias topodatapb.TabletAlias, schema, table, alter string) error {

Reviewer:

Do we also want to check schema consistency (temp gh-ost tables), ongoing migrations (temp gh-ost flag files), and maybe replication health between master and replicas?

shlomi-noach (Author):

  • Replication health: definitely; gh-ost uses throttling, and we will want to supply gh-ost with the identity of a replica it will check for lag. Also, if we ever deploy/implement freno in Vitess, then gh-ost will use that as a standard throttling mechanism.

  • Ongoing migrations: some work is already in, possibly not pushed; as mentioned above, we will avoid concurrent migrations and only allow one migration at a time.

  • Schema consistency: can you explain?

fmt.Sprintf(`--port=%d`, mysqlPort),
`--user=gh-ost`,
`--password=gh-ost`,
`--allow-on-master`,

Reviewer:

So we are going to execute the OSC directly on the master; why not connect to a replica?

shlomi-noach (Author):

Somewhat elaborating on the above: I began this way because of its simplicity, and wanted to have a POC. Latency-wise, it's easier if we do that directly on the master. The biggest advantage to running this on a replica is that we'd get implicit throttling by that replica. I will look into our options.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach

There has been a lot of progress on this PR, and in fact it has been refactored to such extent that little is left from the original commits. The design has changed (e.g. now using async, decoupled, scheduled migration execution, as opposed to sync execution), and many of the original TODOs or limitations have been addressed.

I wish to open a new PR vs. vitessio/vitess rather than continue updating/commenting here, given all the changes. If that's acceptable, I'll proceed to do that, and I'll make sure to point back to this PR for documentation/progress reference.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

…plaintext

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach

closed in favor of vitessio#6547

@shlomi-noach shlomi-noach deleted the pov-gh-ost-tablet-rewrite2 branch August 9, 2020 13:14