
resharding: safer MigrateServedTypes #4248

Merged: sougou merged 2 commits into vitessio:master from planetscale:resharding on Oct 10, 2018



@sougou commented Oct 7, 2018

MigrateServedTypes has been made idempotent: if it fails in
the middle, you can safely retry the operation. If the operation
has previously succeeded, retrying it will be a no-op (except
for master migration).
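To make the retry property concrete, here is a minimal Go sketch. It assumes a hypothetical in-memory `servingState` type, not the real Vitess topology API. Each step is written as "make it so" (add the destination shard, delete the source shard) rather than "check that it wasn't done yet", which is what lets a half-finished run be retried and a finished run be re-run as a no-op. Master migration is deliberately left out, since it is handled separately.

```go
package main

import (
	"fmt"
	"sort"
)

// servingState is a hypothetical, simplified record of which shards serve
// each tablet type. It is not the Vitess topology API.
type servingState struct {
	// servedBy maps a tablet type (e.g. "rdonly") to the set of shards serving it.
	servedBy map[string]map[string]bool
}

// migrateServedType moves servedType from the source shards to the
// destination shards. Each step converges on the desired end state instead
// of asserting a precondition, so re-running it after a partial failure, or
// after success, is safe.
func (s *servingState) migrateServedType(servedType string, sources, dests []string) {
	shards, ok := s.servedBy[servedType]
	if !ok {
		shards = make(map[string]bool)
		s.servedBy[servedType] = shards
	}
	for _, d := range dests {
		shards[d] = true // re-adding a destination is a no-op
	}
	for _, src := range sources {
		delete(shards, src) // deleting an absent source is a no-op
	}
}

// shardsFor returns the shards serving servedType, sorted for stable output.
func (s *servingState) shardsFor(servedType string) []string {
	var out []string
	for shard := range s.servedBy[servedType] {
		out = append(out, shard)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Simulate a run that failed after adding one destination but before
	// removing the source: "0" and "-80" are both listed.
	s := &servingState{servedBy: map[string]map[string]bool{
		"rdonly": {"0": true, "-80": true},
	}}

	// Retrying converges on the intended end state...
	s.migrateServedType("rdonly", []string{"0"}, []string{"-80", "80-"})
	fmt.Println(s.shardsFor("rdonly")) // [-80 80-]

	// ...and retrying after success changes nothing.
	s.migrateServedType("rdonly", []string{"0"}, []string{"-80", "80-"})
	fmt.Println(s.shardsFor("rdonly")) // [-80 80-]
}
```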

For master migration, a new Frozen field has been added to the
tablet control record. This field marks the point of no
return. If a migration fails before reaching this state, we
undo everything and re-enable the source shards. Once we
go past the 'frozen' state, you can only go forward: if there
are failures after the frozen state, the migration can be safely
retried until it succeeds. Once it has succeeded, a retry will return
an error saying that there's no resharding in progress.

This resulted in some code simplification. There were many sanity checks
that were more problematic than useful, because they would perpetually
block a failed Migrate from being retried: a failed Migrate will, by
definition, leave inconsistent state behind, and those checks rejected
exactly that state. I've deleted all that code.

The resharding end to end test has been updated to demonstrate
these behaviors.

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>

@sougou requested review from demmer and rafael on October 7, 2018 01:43
@rafael (Member) left a comment

@sougou this is a great improvement to MigrateServedTypes, and the logic looks correct to me. My only comments are about the validations: I don't understand why we are removing some of them.

Also, I think some tests are missing.

```go
	return fmt.Errorf("cannot safely alter DisableQueryService as BlacklistedTables is set")
}
if !tc.DisableQueryService {
	// This code is unreachable because we always delete the control record when we enable QueryService.
```
@rafael (Member):

Shouldn't we remove it in that case?

@sougou (Contributor Author):

I was tempted to. But some new code could change this behavior without realizing it, so it's good to leave the check here as a fail-safe.

@rafael (Member):

In that case, why don't we panic? If new code breaks this assumption, it should just fail hard.

@sougou (Contributor Author):

We've had some heated arguments about this in the past. The problem with a panic is that if tests miss the code path, the panic will happen in production.

I've now settled on returning an error, along with a comment clarifying that the code is unreachable.

The one exception is when the function doesn't return an error. In such cases, I still panic.
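The convention described here can be sketched as follows. `tabletControl`, `enableQueryService` and `isServing` are hypothetical simplifications modeled on the reviewed snippet, not the actual Vitess functions: the error-returning function reports its "unreachable" branch as an error, while the function with no error result falls back to a panic.

```go
package main

import (
	"errors"
	"fmt"
)

// tabletControl is a hypothetical, trimmed-down control record used only to
// illustrate the convention; it is not the Vitess topodata type. A record
// exists only while query service is disabled or tables are blacklisted.
type tabletControl struct {
	DisableQueryService bool
	BlacklistedTables   []string
}

// enableQueryService deletes the shard's control record. It returns an
// error, so its "can't happen" branch reports an error instead of panicking.
func enableQueryService(records map[string]*tabletControl, shard string) error {
	tc, ok := records[shard]
	if !ok {
		return nil // already enabled: retrying is a no-op
	}
	if len(tc.BlacklistedTables) > 0 {
		return errors.New("cannot safely alter DisableQueryService as BlacklistedTables is set")
	}
	if !tc.DisableQueryService {
		// Unreachable: records are deleted whenever query service is enabled.
		// Kept as a fail-safe in case future code breaks that invariant.
		return fmt.Errorf("unreachable: shard %s has a control record with query service enabled", shard)
	}
	delete(records, shard)
	return nil
}

// isServing has no error result, so the same kind of impossible state panics.
func isServing(records map[string]*tabletControl, shard string) bool {
	tc, ok := records[shard]
	if !ok {
		return true
	}
	if !tc.DisableQueryService && len(tc.BlacklistedTables) == 0 {
		panic(fmt.Sprintf("unreachable: empty control record for shard %s", shard))
	}
	return !tc.DisableQueryService
}

func main() {
	records := map[string]*tabletControl{"-80": {DisableQueryService: true}}
	fmt.Println(isServing(records, "-80"))          // false
	fmt.Println(enableQueryService(records, "-80")) // <nil>
	fmt.Println(isServing(records, "-80"))          // true

	// Simulate a future bug that leaves an impossible record behind: the
	// error-returning path reports it instead of crashing the process.
	records["80-"] = &tabletControl{}
	fmt.Println(enableQueryService(records, "80-"))
}
```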

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
@rafael (Member) commented Oct 10, 2018

Thank you for the clarifications. This makes sense to me. I added one more comment around the dead code, but I don't think it's a blocker for merging. This PR LGTM.

@sougou merged commit 5b84cb7 into vitessio:master on Oct 10, 2018
@sougou deleted the resharding branch on October 10, 2018 14:38

2 participants