Skip to content

VDiff2: Auto retry to continue on error#10639

Merged
mattlord merged 59 commits intovitessio:mainfrom
planetscale:vdiff2_retry
Jul 22, 2022
Merged

VDiff2: Auto retry to continue on error#10639
mattlord merged 59 commits intovitessio:mainfrom
planetscale:vdiff2_retry

Conversation

@mattlord
Copy link
Member

@mattlord mattlord commented Jul 6, 2022

Description

We now record and display errors for a VDiff on a per shard basis:

$ vtctlclient VDiff -- --v2 customer.commerce2customer show last

VDiff Summary for customer.p1c2 (c6e8698e-01b3-11ed-b76d-920702940ee0)
State:        error
              Error: (shard -80) vttablet: rpc error: code = Unknown desc = (errno 1213) (sqlstate 40001): Deadlock found when trying to get lock; try restarting transaction
              Error: (shard 80-) vttablet: rpc error: code = Unknown desc = (errno 1213) (sqlstate 40001): Deadlock found when trying to get lock; try restarting transaction
RowsCompared: 105
HasMismatch:  false
StartedAt:    2022-07-12 07:25:17

Use "--format=json" for more detailed output.


$ vtctlclient VDiff -- --v2 --format=json customer.commerce2customer show last
{
	"Workflow": "p1c2",
	"Keyspace": "customer",
	"State": "error",
	"UUID": "c6e8698e-01b3-11ed-b76d-920702940ee0",
	"RowsCompared": 106,
	"HasMismatch": false,
	"Shards": "-80,80-",
	"StartedAt": "2022-07-12 07:25:17",
	"Errors": {
		"-80": "vttablet: rpc error: code = Unknown desc = (errno 1213) (sqlstate 40001): Deadlock found when trying to get lock; try restarting transaction",
		"80-": "vttablet: rpc error: code = Unknown desc = (errno 1213) (sqlstate 40001): Deadlock found when trying to get lock; try restarting transaction"
	}
}

And we automatically retry those VDiffs (every 30 seconds) that ended with an error, when the auto-retry option has been set for it (now defaults to true) and we can see that it's an ephemeral error which warrants an immediate retry. For example (from a manual local test):

$ vtctlclient VDiff -- --v2 --format=json --auto-retry customer.commerce2customer show last
	"Workflow": "commerce2customer",
	"Keyspace": "customer",
	"State": "error",
	"UUID": "da4c5e14-fece-11ec-a217-920702940ee0",
	"RowsCompared": 2,
	"HasMismatch": false,
	"Shards": "0",
	"StartedAt": "2022-07-08 15:01:32",
	"Errors": {
		"0": "vttablet: rpc error: code = Unknown desc = (errno 1213) (sqlstate 40001): Deadlock found when trying to get lock; try restarting transaction"
	}
}

$ vtctlclient VDiff -- --v2 --format=json --auto-retry customer.commerce2customer show last
{
	"Workflow": "commerce2customer",
	"Keyspace": "customer",
	"State": "completed",
	"UUID": "da4c5e14-fece-11ec-a217-920702940ee0",
	"RowsCompared": 4,
	"HasMismatch": false,
	"Shards": "0",
	"StartedAt": "2022-07-08 15:02:31",
	"CompletedAt": "2022-07-08 15:02:32"
}

ℹ️ NOTE: The resume command lets you manually pick up a successfully completed VDiff where it left off (based on the last PK) and no longer resets the VDiff state (rows compared, diff reports, etc) when it starts. The old behavior of resetting that state was not documented and I can see now that it would be unintuitive in a number of ways, along with preventing reasonable progress updates. The new behavior makes resume align nicely with the automatic retry on error behavior added here, which picks the VDiff up where it left off (based on the last PK) — for a run that ended with an error — and retains the state from the run up to the point that it encountered an error.

We also add progress reporting to the VDiff show summary output when the State is started. For example:

$ vtctlclient VDiff -- --v2 --format=json customer.p1c2 show last
{
	"Workflow": "p1c2",
	"Keyspace": "customer",
	"State": "started",
	"UUID": "daf1f03a-03ed-11ed-9ab8-920702940ee0",
	"RowsCompared": 51,
	"HasMismatch": false,
	"Shards": "-80,80-",
	"StartedAt": "2022-07-15 03:26:03",
	"Progress": {
		"Percentage": 48.57,
		"ETA": "2022-07-15 03:26:10"
	}
}

$ vtctlclient VDiff -- --v2 --format=json customer.p1c2 show last
{
	"Workflow": "p1c2",
	"Keyspace": "customer",
	"State": "started",
	"UUID": "daf1f03a-03ed-11ed-9ab8-920702940ee0",
	"RowsCompared": 104,
	"HasMismatch": false,
	"Shards": "-80,80-",
	"StartedAt": "2022-07-15 03:26:03",
	"Progress": {
	"Percentage": 99.05,
		"ETA": "2022-07-15 03:26:11"
	}
}

$ vtctlclient VDiff -- --v2 --format=json customer.p1c2 show last
{
	"Workflow": "p1c2",
	"Keyspace": "customer",
	"State": "started",
	"UUID": "daf1f03a-03ed-11ed-9ab8-920702940ee0",
	"RowsCompared": 105,
	"HasMismatch": false,
	"Shards": "-80,80-",
	"StartedAt": "2022-07-15 03:26:03",
	"Progress": {
		"Percentage": 100,
		"ETA": "2022-07-15 03:26:12"
	}
}

$ vtctlclient VDiff -- --v2 --format=json customer.p1c2 show last
{
	"Workflow": "p1c2",
	"Keyspace": "customer",
	"State": "completed",
	"UUID": "daf1f03a-03ed-11ed-9ab8-920702940ee0",
	"RowsCompared": 105,
	"HasMismatch": false,
	"Shards": "-80,80-",
	"StartedAt": "2022-07-15 03:26:03",
	"CompletedAt": "2022-07-15 03:26:13"
}

Related Issue(s)

Checklist

mattlord added 2 commits July 6, 2022 10:04
An uncompleted vdiff should only be retried, not resumed.

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: VReplication release notes labels Jul 6, 2022
@vitess-bot
Copy link
Contributor

vitess-bot bot commented Jul 6, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a new flag is being introduced, review whether it is really needed. The flag names should be clear and intuitive (as far as possible), and the flag's help should be descriptive.
  • If a workflow is added or modified, each items in Jobs should be named in order to mark it as required. If the workflow should be required, the GitHub Admin should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should either include a link to an issue that describes the bug OR an actual description of the bug and how to reproduce, along with a description of the fix.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.

Signed-off-by: Matt Lord <mattalord@gmail.com>
mattlord added 3 commits July 6, 2022 14:16
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
mattlord added 7 commits July 7, 2022 22:59
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
We now have a goroutine that will do this periodically

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
mattlord added 4 commits July 8, 2022 18:29
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
mattlord added 2 commits July 8, 2022 22:17
Signed-off-by: Matt Lord <mattalord@gmail.com>
And avoid withDDL usage anywhere else.

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord marked this pull request as ready for review July 9, 2022 04:24
mattlord added 11 commits July 12, 2022 10:15
Signed-off-by: Matt Lord <mattalord@gmail.com>
And quickly timeout the receive attempted on the controller's
done channel when checking if it's active.

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
We'll never hit that as the first one will receive immediately as
the channel was closed or we'll hit the default.

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Add comment about safe usage of vde.addController()

Signed-off-by: Matt Lord <mattalord@gmail.com>
This is a test of the fix in:
  vitessio#10720

Signed-off-by: Matt Lord <mattalord@gmail.com>
mattlord added 7 commits July 17, 2022 20:11
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord changed the title Vdiff2: Auto retry to continue on error VDiff2: Auto retry to continue on error Jul 21, 2022
Copy link
Member

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm
Nice work!

Signed-off-by: Matt Lord <mattalord@gmail.com>
mattlord added a commit to vitessio/website that referenced this pull request Jul 22, 2022
Signed-off-by: Matt Lord <mattalord@gmail.com>
mattlord added 4 commits July 21, 2022 23:30
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
The VReplication engine does the same.

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord merged commit 73b4995 into vitessio:main Jul 22, 2022
@mattlord mattlord deleted the vdiff2_retry branch July 22, 2022 15:52
mattlord added a commit to vitessio/website that referenced this pull request Jul 22, 2022
Signed-off-by: Matt Lord <mattalord@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Progress field to Show output when the Status == started to display a % of work done and ETA for completion VDiff2: Automatic retry on error

3 participants