Online DDL: support migration cut-over backoff and forced cut-over #14546
shlomi-noach merged 39 commits into vitessio:main
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…irst invocation Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…actions holding a lock on a table Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…illQueriesOnTable: kill queries on a table, and kill connections with transactions holding locks on table Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…r and Online DDL executor Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…and the effect on 'force_cutover' column' Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…ransaction holding lock Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…t, ForceCutOverSchemaMigrationResponse Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Documentation PR: vitessio/website#1641
…tess into onlineddl-cutover-backoff Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
mattlord
left a comment
Looks good! My only major concern is that we have no tests for the KILL portion, which is particularly sensitive. Am I missing them? It would be ideally covered by unit tests with a lot of test cases with mock mysql results. But we probably don't have the framework in place for that so it may be difficult. Let me know what you think.
require.NotNil(t, rs)
for _, row := range rs.Named().Rows {
	message := row.AsString("message", "")
	if strings.Contains(message, messageSubstring) {
Any reason not to make it case insensitive?
I wonder why we should? If the test knows what to expect, then it should expect the exact string.
// ForceCutOverSchemaMigration is part of the vtctlservicepb.VtctldServer interface.
func (s *VtctldServer) ForceCutOverSchemaMigration(ctx context.Context, req *vtctldatapb.ForceCutOverSchemaMigrationRequest) (resp *vtctldatapb.ForceCutOverSchemaMigrationResponse, err error) {
	span, ctx := trace.NewSpan(ctx, "VtctldServer.ForceCutOverSchemaMigrationResponse")
This should just be the function name, not the response part.
I don't follow, can you please explain?
The trace span name should be the function name:
span, ctx := trace.NewSpan(ctx, "VtctldServer.ForceCutOverSchemaMigration")
This is the last thing and it's minor. So I will approve and you can change this whenever you like.
go/vt/vttablet/onlineddl/executor.go
Outdated
	return nil
}

func (e *Executor) killQueriesOnTable(ctx context.Context, tableName string) error {
This isn't really covered by any tests, is it? This feels like the most critical aspect as we're killing things off in the production system.
Things like this are actually easier in unit tests as you can mock the mysql query responses.
This is covered in https://github.com/vitessio/vitess/pull/14546/files#diff-e1236744a601a269891381b0ed09b7151d86f5a9b001d8c4e954211c2ae04a8d ; there's a test that holds an open transaction, we attempt a completion, the migration does not complete; we then force cut-over, and we see that the migration completes.
Co-authored-by: Matt Lord <mattalord@gmail.com> Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
// ForceCutOverSchemaMigration is part of the vtctlservicepb.VtctldServer interface.
func (s *VtctldServer) ForceCutOverSchemaMigration(ctx context.Context, req *vtctldatapb.ForceCutOverSchemaMigrationRequest) (resp *vtctldatapb.ForceCutOverSchemaMigrationResponse, err error) {
	span, ctx := trace.NewSpan(ctx, "VtctldServer.ForceCutOverSchemaMigrationResponse")
@mattlord is saying this, essentially:
- span, ctx := trace.NewSpan(ctx, "VtctldServer.ForceCutOverSchemaMigrationResponse")
+ span, ctx := trace.NewSpan(ctx, "VtctldServer.ForceCutOverSchemaMigration")
	return nil, err
}

log.Info("Calling ApplySchema to force cut-over migration")
Should we include the UUID in this log?
proto/vtctlservice.proto
Outdated
// CompleteSchemaMigration completes one or all migrations executed with --postpone-completion.
rpc CompleteSchemaMigration(vtctldata.CompleteSchemaMigrationRequest) returns (vtctldata.CompleteSchemaMigrationResponse) {};
// ForceCutOverSchemaMigration marks a schema migration for forced cut-over.
rpc ForceCutOverSchemaMigration(vtctldata.ForceCutOverSchemaMigrationRequest) returns (vtctldata.ForceCutOverSchemaMigrationResponse) {};
nit: move this down below FindAllShardsInKeyspace (to keep them tidy)?
proto/vtctldata.proto
Outdated
  map<string, uint64> rows_affected_by_shard = 1;
}

message ForceCutOverSchemaMigrationRequest {
nit: move these down below FindAllShardsInKeyspaceRequest/Response (to keep them tidy)?
…itessio#14546) Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> Co-authored-by: Matt Lord <mattalord@gmail.com>
Description
Closes #14530
This PR implements two functionalities relating to schema migration cut-over, relevant to ALTER TABLE in vitess strategy only. Both relate to a scenario where the cut-over times out due to either excessive load on the migrated table, or due to some lock being placed on the table. The two functionalities are:
- Backoff: after a failed cut-over, the next attempt is made only after 1min has passed; the next one not before an additional 5min have passed, the next after 10min, then 30min, and from that point cut-overs are only attempted at 30min intervals. This is to avoid a scenario where the database, which is already under heavy load, needs to cope with frequently recurring cut-over attempts, which themselves put additional locks on tables.
- Forced cut-over: a cut-over that terminates queries, and transactions holding locks, on the migrated table.
Backoff
The backoff mechanism is implemented as-is, and is not configurable.
Force cut-over
Forced cut-over can be controlled in these ways.
--force-cut-over-after DDL strategy flag
Example:
It's possible to preconfigure the maximum duration for which we allow cut-overs to fail or time out due to pending queries/transactions. --force-cut-over-after, if nonzero, applies starting from the first cut-over attempt.
In the above example, --force-cut-over-after is set to 1 hour. The migration may run for as long as it needs, say 5 hours. From the first cut-over attempt, a 1 hour clock starts ticking. The cut-over may be successful, in which case all is well and nothing further happens. Or it may fail, in which case the backoff mechanism kicks in. The next attempt is made after 1m, then 5m, etc. But if these all keep failing, then 1h after the very first failed attempt, irrespective of backoff, and within a 1min resolution, the scheduler runs a cut-over with query and transaction termination. This is highly likely to succeed. But if it fails, then the scheduler continues to attempt forced cut-overs every minute.
ALTER VITESS_MIGRATION ... FORCE_CUTOVER ...
We introduce a new syntax:
ALTER VITESS_MIGRATION '9748c3b7_7fdb_11eb_ac2c_f875a4d24e90' FORCE_CUTOVER;
ALTER VITESS_MIGRATION FORCE_CUTOVER ALL;
The former forces cut-over for a specific migration, the latter for all pending migrations (queued, ready, running). All these commands do is set the new schema_migrations.force_cutover column value to 1, much like ALTER VITESS_MIGRATION ... COMPLETE ... does.
The scheduler picks up this force_cutover column value on its next review of running migrations. If it's 1, then any backoff state is ignored, and the scheduler attempts a forced cut-over, terminating queries and transactions.
vtctldclient OnlineDDL force-cutover
These matching options are added to the vtctldclient OnlineDDL command:
Notes
Only works on MySQL 8.0 (requires the performance_schema.data_locks table; 5.7 does not provide reliable information).
To note the obvious, a forced cut-over is only relevant when the migration is actually eligible for cut-over. If the migration is incomplete, then it won't attempt a cut-over. Issuing an ALTER VITESS_MIGRATION ... FORCE_CUTOVER ... will set the column to 1, but it will only become relevant when the migration becomes ready to cut over. Also, if the migration runs with --postpone-completion, then it will not be eligible to cut over; the user will first need to issue an ALTER VITESS_MIGRATION ... COMPLETE .... It's fine if the user runs an ALTER VITESS_MIGRATION ... FORCE_CUTOVER ... first, but this will only have the effect of setting force_cutover=1; it will not COMPLETE the migration.
Added unit and endtoend tests.
Documentation: vitessio/website#1641.
Related Issue(s)
Checklist
Deployment Notes