resharding: safer MigrateServedTypes #4248
MigrateServedTypes has been made idempotent: if it fails in the middle, you can safely retry the operation. If the operation has previously succeeded, retrying it will be a no-op (except for master migration).

For master migration, a new Frozen field has been added to the tablet control record. This field signifies the point of no return. If a migrate fails before reaching this state, we undo everything and re-enable the source shards. Once we go past the 'frozen' state, you can only go forward. If there are failures after the frozen state, the migrate can be safely retried until successful. Once successful, a retry will return an error saying that there's no resharding in progress.

The resharding end to end test has been updated to demonstrate these behaviors.

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
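To make the retry semantics for master migration concrete, here is a minimal sketch in Go. It is not the Vitess implementation: `shardState`, `migrateMaster`, and the `prepare`/`finish` callbacks are hypothetical names invented for illustration, and the real tablet control record lives in the topology, not in a local struct. The sketch only models the three outcomes described above: rollback before the frozen state, forward-only retries after it, and an error once the migration has completed.

```go
package main

import (
	"errors"
	"fmt"
)

// shardState is a hypothetical stand-in for the tablet control record.
type shardState struct {
	sourceServing bool // source shards still serve master traffic
	frozen        bool // the point of no return has been crossed
	done          bool // the migration has completed
}

var (
	errNoResharding = errors.New("no resharding in progress")
	errTransient    = errors.New("transient failure")
)

// migrateMaster models one attempt at a master migration.
// prepare stands for the steps before the frozen state, finish for the steps after it.
func migrateMaster(s *shardState, prepare, finish func() error) error {
	if s.done {
		// A retry after success is reported as an error, not a no-op.
		return errNoResharding
	}
	if !s.frozen {
		s.sourceServing = false
		if err := prepare(); err != nil {
			// Failure before the frozen state: undo and re-enable the source.
			s.sourceServing = true
			return fmt.Errorf("failed before freeze, rolled back: %w", err)
		}
		s.frozen = true
	}
	// Past the frozen state we only go forward; a failure here leaves
	// the state untouched so the caller can simply retry.
	if err := finish(); err != nil {
		return fmt.Errorf("failed after freeze, retry to continue: %w", err)
	}
	s.done = true
	return nil
}

func main() {
	ok := func() error { return nil }
	fail := func() error { return errTransient }
	s := &shardState{sourceServing: true}

	fmt.Println(migrateMaster(s, fail, ok)) // early failure: rolled back
	fmt.Println(migrateMaster(s, ok, fail)) // late failure: stays frozen
	fmt.Println(migrateMaster(s, ok, ok))   // retry completes the migration
	fmt.Println(migrateMaster(s, ok, ok))   // nothing left to migrate
}
```

The design point the sketch tries to capture is that a retry never has to inspect how far the previous attempt got beyond the single `frozen` flag.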
        return fmt.Errorf("cannot safely alter DisableQueryService as BlacklistedTables is set")
    }
    if !tc.DisableQueryService {
        // This code is unreachable because we always delete the control record when we enable QueryService.
Shouldn't we remove it in that case?
I was tempted to. But it's possible some new code could change this without knowing. So, it's good to leave it here as fail-safe.
In that case, why don't we panic? If new code gets added, then it should just fail badly.
We've had some heated arguments in the past about this. The problem with panic is that if tests miss the code path, it will happen in production.
I've now settled on returning an error, and a clarifying comment that the code is unreachable.
The one exception is if the function doesn't return an error. In such cases, I still panic.
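A minimal Go sketch of that convention, for illustration only. The types and functions here (`tabletControl`, `checkControl`, `migrateState`) are hypothetical and trimmed down; they are not the code under review. The first function can return an error, so its unreachable branch returns one with a clarifying comment. The second has no error return, so its unreachable branch panics.

```go
package main

import (
	"errors"
	"fmt"
)

// tabletControl is a hypothetical, trimmed-down control record.
type tabletControl struct {
	DisableQueryService bool
}

// checkControl can report failure through its return value, so the
// "impossible" branch returns an error instead of panicking.
func checkControl(tc *tabletControl) error {
	if !tc.DisableQueryService {
		// This code is unreachable as long as the record is deleted whenever
		// QueryService is re-enabled. Kept as a fail-safe against future changes.
		return errors.New("unexpected control record with QueryService enabled")
	}
	return nil
}

// migrateState is a hypothetical enum used to show the panic case.
type migrateState int

const (
	stateServing migrateState = iota
	stateFrozen
)

// String has no error return, so an unknown value panics.
func (s migrateState) String() string {
	switch s {
	case stateServing:
		return "serving"
	case stateFrozen:
		return "frozen"
	}
	// Unreachable unless a new state is added without updating this switch.
	panic(fmt.Sprintf("unknown migrateState %d", int(s)))
}

func main() {
	fmt.Println(checkControl(&tabletControl{DisableQueryService: true}))
	fmt.Println(checkControl(&tabletControl{DisableQueryService: false}))
	fmt.Println(stateFrozen)
}
```

The trade-off is the one raised in the thread: an error on a path that tests missed degrades one operation in production, while a panic takes down the process.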
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Thank you for the clarifications. This makes sense to me. I added one more comment around the dead code, but I don't think that's a blocker for merging. This PR LGTM.
MigrateServedTypes has been made idempotent: if it fails in
the middle, you can safely retry the operation. If the operation
has previously succeeded, retrying it will be a no-op (except
for master migration).
For master migration, a new Frozen field has been added to the
tablet control record. This field signifies the point of no
return. If a migrate fails before reaching this state, then
we undo everything and re-enable the source shards. Once we
go past the 'frozen' state, you can only go forward. If there
are failures after the frozen state, the migrate can be safely
retried until successful. Once successful, a retry will return
an error saying that there's no resharding in progress.
This resulted in some code simplification. There were many sanity checks
that were more problematic than useful, because they would perpetually
block a failed Migrate from being retried; obviously, a failed Migrate will
have inconsistent state. I've deleted all that code.
The resharding end to end test has been updated to demonstrate
these behaviors.
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>