
vitess Online DDL atomic cut-over #11460

Conversation

shlomi-noach
Contributor

@shlomi-noach shlomi-noach commented Oct 11, 2022

Description

This is work in progress for an atomic RENAME (as opposed to the current two-step rename with its gaping hole) in the cut-over process of a vitess/vreplication Online DDL migration.

Currently, we run a two-step rename, where the original table is first renamed away, and a second rename then moves the vreplication table into its place. This is protected by a buffering rule on the primary, but replicas see two distinct renames, and there is a point in time on the replica where the table does not exist. During that window, queries against the table fail.
There is also a scenario, depicted in #11226, where even queries against the primary may fail.

This PR introduces an algorithm similar to the gh-ost cut-over, described in http://code.openark.org/blog/mysql/solving-the-non-atomic-table-swap-take-iii-making-it-atomic and in github/gh-ost#82

We still retain our query buffering logic, which protects us against existing logic failures in the gh-ost implementation, per github/gh-ost#887.

The new logic will be described in detail once the PR is validated and considered safe. Some noteworthy details in the meantime:

  • We introduce a sentry table, which means VReplication needs to know about it, even though its schema is immaterial. We therefore need to run VReplicationWaitForPos after creating the sentry table, which increases the overall cut-over time. This does not affect incoming traffic, as the creation of the table comes before the critical section of locking/blocking/renaming. It does, however, lock the executor's mutex, potentially adding a few more seconds to a process that already takes a few seconds. The impact is internal and of no real concern.
  • We remove the stowaway_table logic. The swap is atomic, and there is no need to recover a "stowaway scenario". We will keep the recovery logic for backwards compatibility for one full release, then remove it. The recovery logic is found here:
    if stowawayTable := row.AsString("stowaway_table", ""); stowawayTable != "" {
        // whoa
        // stowawayTable is an original table stowed away while cutting over a vrepl migration, see call to cutOverVReplMigration() down below in this function.
        // In a normal operation, the table should not exist outside the scope of cutOverVReplMigration.
        // If it exists, that means a tablet crashed while running a cut-over, and left the database in a bad state, where the migrated table does not exist.
        // Thankfully, we have tracked this situation and just realized what happened. Now, the first thing to do is to restore the original table.
        log.Infof("found stowaway table %s journal in migration %s for table %s", stowawayTable, uuid, onlineDDL.Table)
        attemptMade, err := e.renameTableIfApplicable(ctx, stowawayTable, onlineDDL.Table)
        if err != nil {
            // unable to restore table; we bail out, and we will try again next round.
            return countRunnning, cancellable, err
        }
        // success
        if attemptMade {
            log.Infof("stowaway table %s restored back into %s", stowawayTable, onlineDDL.Table)
        } else {
            log.Infof("stowaway table %s did not exist and there was no need to restore it", stowawayTable)
        }
        // OK good, table restored. We can remove the record.
        if err := e.updateMigrationStowawayTable(ctx, uuid, ""); err != nil {
            return countRunnning, cancellable, err
        }
    }
  • We introduce a stage column to _vt.schema_migrations. It serves multiple purposes:
    • we populate the stage column in every step of the cut-over logic, so that we can easily see where a failure took place
    • we audit every such change to stage in the logs, so we have a full history
    • most importantly, we use it to generate a GTID entry post LOCK TABLES and RENAME TABLES: the LOCK TABLES query does not generate a GTID entry. The RENAME TABLES query is blocked by LOCK TABLES and so does not generate a GTID entry either. We update the stage column immediately after issuing both, thus ensuring we have a post-lock/rename position for which we can wait in VReplicationWaitForPos, as sketched below.
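A minimal sketch of that trick in SQL (the stage value and the '...' placeholder are illustrative, not the executor's literal statements):

-- lock conn: generates no GTID entry
lock tables t_sentry write, t write;
-- rename conn: blocked behind the lock, so no GTID entry either
rename table t to t_sentry, t_vrepl to t, t_sentry to t_vrepl;
-- executor connection: a committed transaction that does generate a GTID entry,
-- necessarily positioned after t became blocked to writes
update _vt.schema_migrations set stage='after rename submitted' where migration_uuid='...';
-- VReplicationWaitForPos then waits for the position of that update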

Related Issue(s)

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@shlomi-noach shlomi-noach added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Query Serving labels Oct 11, 2022
@vitess-bot
Contributor

vitess-bot bot commented Oct 11, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@shlomi-noach
Contributor Author

A walkthrough to explain the new cut-over logic and why it is implemented the way it is.

First, to explain the current logic, also discussed in this blog post.

Current cut-over logic

Assume we modify table t. The changes actually take place on t_vrepl:

                 t             t_vrepl

When t_vrepl is in near sync with t, we place a query buffering rule on table t, such that new queries on t are buffered. While buffering takes place, we do a two-step rename:

  1. rename table t to t_placeholder, leaving a puncture in our database: t does not exist
t_placeholder    [ ]            t_vrepl
  2. complete any final changes to t_vrepl and then rename t_vrepl to t, t_placeholder to t_vrepl
                  t             t_vrepl

In essence, we have swapped t and t_vrepl. But in the process, there is a point in time where t does not exist. Queries should not be aware of that, thanks to the query buffering. However, there is no buffering on the replicas, and inconspicuous queries running on replicas, such as select * from t limit 1, might return a "table t does not exist" error.
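A sketch of this two-step cut-over in SQL (names as in the diagram above):

-- step 1: rename the original table away
rename table t to t_placeholder;
-- puncture: t does not exist at this point; replicas replay these
-- statements from the binary log with no buffering to mask the gap
-- step 2: swap the vreplication table into place
rename table t_vrepl to t, t_placeholder to t_vrepl;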

Renaming t away serves two purposes:

  • An implicit block while any pending queries on t complete execution
  • An absolute guarantee that no new queries operate on t afterwards

New cut-over logic in this PR

We want to avoid the puncture, such that replicas always have t available. Essentially, this means we must run something like rename table t to t_placeholder, t_vrepl to t (we in fact run something more elaborate). However, since vreplication online DDL is asynchronous, we can't just run this query. We have to first ensure we apply any remaining binary log entries that operate on t onto t_vrepl.

The query buffering should help us: activating the buffering means no queries should operate on t. However, it's not a full guarantee. The buffering only works for queries coming through vtgate, and does not apply to queries coming from elsewhere. "Elsewhere" could mean any local automation users may have in place, any monitoring/introspection queries, etc.

We therefore employ a variation of gh-ost's cut-over algorithm, which, together with the query buffering, offers stronger guarantees. We first describe a naive implementation, then point out where it can fail.

In the naive implementation:

  1. We place a lock on t via lock tables t write
  2. We apply any remaining entries from the binary log onto t_vrepl
  3. We follow with a rename table t to t_placeholder, t_vrepl to t.

The problem is that this won't work as expected. (1) and (3) can't both run from the same connection: if they did, (3) would fail, because the connection did not lock t_vrepl. But if the connection were to lock t_vrepl in (1), then vreplication would not be able to apply anything to that table. And we can't let vreplication own that connection, because the Online DDL executor operates in a different place than vreplication, even if both run on the same server.

So (1) and (3) must run on different connections. But this poses a risk: when we run (3), how do we ensure that (1) is still holding the lock? Maybe (1)'s connection was terminated, releasing the lock, meaning queries are actually able to run against t, so that running (3) would cause corruption.

This is how the gh-ost cut-over logic came to be. It places protection mechanisms so that the rename can't take place unless everything is in perfect order.

Non-naive, simplified implementation

  1. Create a sentry table t_sentry
  2. A connection, call it "lock conn", issues lock tables t_sentry write, t write
  3. A second connection, call it "rename conn", issues rename table t to t_sentry, t_vrepl to t. This query blocks behind the lock. However, if the lock connection is accidentally killed, the rename fails, because t_sentry exists.
  4. The rename does not block writes to t_vrepl, because the rename itself is blocked waiting on t's lock. vreplication is able to complete processing the backlog.
  5. We issue a VReplicationWaitForPos: the two tables are now in sync.
  6. The lock conn issues a drop table t_sentry.
  7. The lock conn issues an unlock tables
  8. This releases the rename query, which is then able to swap the tables
  9. Done. (A sketch of the full sequence, per connection, follows.)
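A sketch of the full sequence in SQL, annotated per connection with the step numbers above (table names as in the walkthrough; this is illustrative, not the executor's literal code):

-- setup, before the critical section:
create table t_sentry (id int primary key);     -- (1) schema immaterial
-- ...VReplicationWaitForPos, so that vreplication learns about t_sentry...

-- lock conn:
lock tables t_sentry write, t write;            -- (2) blocks writes to t

-- rename conn (blocks, waiting on the lock):
rename table t to t_sentry, t_vrepl to t, t_sentry to t_vrepl;   -- (3)
-- if the lock were accidentally released, this rename would fail,
-- because its first target, t_sentry, already exists

-- (4), (5): vreplication drains the backlog; VReplicationWaitForPos

-- lock conn:
drop table t_sentry;                            -- (6) clears the rename's target
unlock tables;                                  -- (7) releases the rename, which swaps the tables atomically (8), (9)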

Nuances

  • We create a sentry table. This confuses vreplication, because it does not know the table; it must be told about the table, because we lock it. We therefore make an initial VReplicationWaitForPos right after creating the table and before placing locks.
  • In step (3), we wait until the rename is found to be running inside MySQL (a sketch of this check follows the list). The rename takes precedence over any DML that may happen to be queued on the LOCK.
  • In step (5) we issue a VReplicationWaitForPos to ensure the tables are in sync. But which GTID do we wait for? The statement LOCK TABLES ... does not generate a GTID entry. The rename ... is blocked and therefore also does not generate a GTID entry. We therefore inject an intentional transaction after the rename is blocked: we update the _vt.schema_migrations stage column. It doesn't matter exactly what we update, as long as we generate a GTID. We know that this GTID comes after t is completely blocked to writes.
  • The rename is actually a full swap: rename table t to t_sentry, t_vrepl to t, t_sentry to t_vrepl
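One way such a check can be done: poll the processlist for the blocked rename, assuming its query text is distinctive (the commit history mentions searching the PROCESSLIST with LIKE; the exact predicate here is illustrative):

select id from information_schema.processlist
where info like 'rename table %t_sentry%';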

@shlomi-noach
Contributor Author

Seeing CI errors in the form

target: ks.1.primary: primary is not serving, there is a reparent operation in progress (errno 1105) (sqlstate HY000) during query: revert vitess_migration '5a4c0f1b_4ee5_11ed_8e91_000d3a36ac92'

I suspect something crashes the primary tablet. Of course this only happens in GitHub CI and not on local machine, so it's a bit of a chase to find out what happened.

@shlomi-noach
Contributor Author

Not even sure the CI issue is related to this PR. Still looking.

@shlomi-noach
Contributor Author

From an internal discussion it seems the aforementioned CI problem is colossal, and not specifically related to this PR.

@shlomi-noach shlomi-noach changed the title WIP: vitess Online DDL atomic cut-over vitess Online DDL atomic cut-over Oct 20, 2022
@shlomi-noach shlomi-noach marked this pull request as ready for review October 20, 2022 05:38
@shlomi-noach
Contributor Author

Ready to review! Please make sure to read the original comment and then also #11460 (comment), if you want to have a good grasp of what's going on. Hint: it's complex!

Member

@deepthi deepthi left a comment

This sounds good to me.
Do we have a test that reproduces the problem that is described in the PR, or is that too difficult (needs to be timed precisely)?
Let's get a review from @mattlord as well.

@deepthi deepthi requested a review from mattlord November 7, 2022 01:40
@shlomi-noach
Copy link
Contributor Author

Do we have a test that reproduces the problem that is described in the PR, or is that too difficult (needs to be timed precisely)?

Yeah, I think it's too difficult; and on top of this, because we do use query buffering, I'm not sure I can reproduce it.

Contributor

@mattlord mattlord left a comment

Looks good! This approach makes sense to me and seems sound.

I only had some minor comments/questions/suggestions. Thanks!

@@ -739,31 +765,99 @@ func (e *Executor) cutOverVReplMigration(ctx context.Context, s *VReplStream) er
return err
}
isVreplicationTestSuite := onlineDDL.StrategySetting().IsVreplicationTestSuite()
e.updateMigrationStage(ctx, onlineDDL.UUID, "starting cut-over")
Contributor

This can fail for a variety of reasons, e.g. deadlock. We should have some error handling and perhaps retry logic. Or am I missing something?

Contributor Author

This is nice-to-have but not strictly critical for the cut-over. It adds a level of auditing/logging. Not sure we should fail the cut-over over a failure of this action.

Comment on lines 980 to 981
e.updateMigrationStage(ctx, onlineDDL.UUID, "dropping sentry table")
dropTableQuery := sqlparser.BuildParsedQuery(sqlDropTable, sentryTableName)
Contributor

Shouldn't these be in the following code block? Or we could just remove the code block. I personally think it's less readable in this case.

Contributor Author

yes! done

if err != nil {
    return err
}
_, err = e.execQuery(ctx, query)
Contributor

This is the error that we're not handling.

Contributor Author

To summarize: we're ignoring it where it doesn't have a true effect on the cut-over, as we consider it more of a logging/debuggability feature.

@@ -81,6 +81,8 @@ const (
alterSchemaMigrationsComponentThrottled = "ALTER TABLE _vt.schema_migrations add column component_throttled tinytext NOT NULL"
alterSchemaMigrationsCancelledTimestamp = "ALTER TABLE _vt.schema_migrations add column cancelled_timestamp timestamp NULL DEFAULT NULL"
alterSchemaMigrationsTablePostponeLaunch = "ALTER TABLE _vt.schema_migrations add column postpone_launch tinyint unsigned NOT NULL DEFAULT 0"
alterSchemaMigrationsStage = "ALTER TABLE _vt.schema_migrations add column stage text not null"
Contributor

text seems a little excessive, no? In case we want to use this in query predicates and index it. Not a problem, just struck me.

Contributor Author

I believe we're running out of varchar space for 5.7. I may be mistaken but I have this recollection from a few months ago when I tried adding a varchar column.

Contributor

Ah, we're probably hitting the in-page row size limit (text can be moved off-page to secondary storage).

This is a good example of why we should probably start leveraging JSON columns more, otherwise we'll end up with hundreds of columns which may or may not be used at different times. We're using them now in VDiff2 FWIW.

Contributor

@mattlord mattlord left a comment

LGTM. Nice work on this!

@shlomi-noach
Contributor Author

Force merging: the only failing test is vttablet_prscomplex, which is irrelevant to this PR (and resolved in an as-yet-unmerged PR).

@shlomi-noach shlomi-noach merged commit a2b75c7 into vitessio:main Nov 9, 2022
@shlomi-noach shlomi-noach deleted the vrepl-online-ddl-cut-over-atomic-with-sentry-stage branch November 9, 2022 07:52
@earayu

earayu commented Feb 3, 2023

Hi, @shlomi-noach

1. We place a lock on t via lock tables t write (lock conn)
2. We apply any remaining entries from the binary log onto t_vrepl
3. We follow by a rename table t to t_placeholder, t_vrepl to t. 

(1) and (3) can't both run from the same connection: if they did, (3) would fail, because the connection did not lock t_vrepl
I don't quite understand this statement:
I think step 3 can be executed by lock conn. Won't step 2 release all locks on t_vrepl after it finishes applying all the remaining entries? Therefore, lock conn in step 3 would be free to acquire t_vrepl's lock.

@shlomi-noach
Contributor Author

I think step 3 can be executed by lock conn.

Unfortunately not. Try this at home:

create table t0 (id int primary key);
create table t1 (id int primary key);
lock tables t0 write;
rename table t0 to told, t1 to t0;

output:

ERROR 1100 (HY000): Table 't1' was not locked with LOCK TABLES

rename will only work if:

  • you have not locked any tables involved in the statement,
  • or, you have locked all tables involved in the statement.

But we really want to lock just the one table.
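For completeness, a sketch of the lock-all-tables variant, which does work (on MySQL 8.0.13 and above) but is unusable here, since locking t1 (standing in for t_vrepl) would block vreplication from applying changes:

lock tables t0 write, t1 write;
rename table t0 to told, t1 to t0;  -- succeeds: every pre-existing table involved is locked
unlock tables;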

For this reason we have to use a third connection, which complicates the entire logic.

See my suggestion for an upstream change: mysql/mysql-server#426, https://bugs.mysql.com/bug.php?id=108864

austenLacy pushed a commit to Shopify/vitess that referenced this pull request Jun 23, 2023
* cut-over with sentry table
* wait for rename via channel; write extra transaction post LOCK
* add stage info
* reduced wait-for-pos timeout. Improved stage logic
* cleanup
* more cleanup
* even more cleanup
* context timeout
* preserve stage by disabling deferred stage changes
* killing rename query upon error
* increase test timeout
* fix defer ordering
* log.info
* add and populate cutover_attempts column
* search PROCESSLIST with LIKE
* code comment
* making a variable more local
* literal string

Signed-off-by: Shlomi Noach <[email protected]>
austenLacy added a commit to Shopify/vitess that referenced this pull request Jul 5, 2023
Signed-off-by: Shlomi Noach <[email protected]>
Co-authored-by: Shlomi Noach <[email protected]>