-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Make Workflow SplitDiff task parallel across Shards #4126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -119,11 +119,17 @@ func FindWorkerTablet(ctx context.Context, wr *wrangler.Wrangler, cleaner *wrang | |
| return nil, err | ||
| } | ||
|
|
||
| // We add the tag before calling ChangeSlaveType, so the destination | ||
| // vttablet reloads the worker URL when it reloads the tablet. | ||
| wr.Logger().Infof("Changing tablet %v to '%v'", topoproto.TabletAliasString(tabletAlias), topodatapb.TabletType_DRAINED) | ||
| shortCtx, cancel := context.WithTimeout(ctx, *remoteActionsTimeout) | ||
| err = wr.ChangeSlaveType(shortCtx, tabletAlias, topodatapb.TabletType_DRAINED) | ||
| cancel() | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
|
|
||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm... the tag comment right below this explicitly states that the intent is to put the tag on before changing the slave type, so either we need to rethink this part or at least change the comment now that it's no longer true.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh good catch. I think we can solve this by refreshing tablet state after adding this tag. Let's get @sougou input on this.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I looked at implementation of The refresh that @rafael talks about is a higher level one, and I think it's for reacting to change in serving state. But since this is a lower level function that others can use independent of serving state, we should preserve the original order. |
||
| ourURL := servenv.ListeningURL.String() | ||
| wr.Logger().Infof("Adding tag[worker]=%v to tablet %v", ourURL, topoproto.TabletAliasString(tabletAlias)) | ||
| shortCtx, cancel := context.WithTimeout(ctx, *remoteActionsTimeout) | ||
| shortCtx, cancel = context.WithTimeout(ctx, *remoteActionsTimeout) | ||
| _, err = wr.TopoServer().UpdateTabletFields(shortCtx, tabletAlias, func(tablet *topodatapb.Tablet) error { | ||
| if tablet.Tags == nil { | ||
| tablet.Tags = make(map[string]string) | ||
|
|
@@ -142,16 +148,17 @@ func FindWorkerTablet(ctx context.Context, wr *wrangler.Wrangler, cleaner *wrang | |
| defer wrangler.RecordTabletTagAction(cleaner, tabletAlias, "worker", "") | ||
| defer wrangler.RecordTabletTagAction(cleaner, tabletAlias, "drain_reason", "") | ||
|
|
||
| wr.Logger().Infof("Changing tablet %v to '%v'", topoproto.TabletAliasString(tabletAlias), topodatapb.TabletType_DRAINED) | ||
| // Record a clean-up action to take the tablet back to tabletAlias. | ||
| wrangler.RecordChangeSlaveTypeAction(cleaner, tabletAlias, topodatapb.TabletType_DRAINED, tabletType) | ||
|
|
||
| // We refresh the destination vttablet reloads the worker URL when it reloads the tablet. | ||
| shortCtx, cancel = context.WithTimeout(ctx, *remoteActionsTimeout) | ||
| err = wr.ChangeSlaveType(shortCtx, tabletAlias, topodatapb.TabletType_DRAINED) | ||
| cancel() | ||
| wr.RefreshTabletState(shortCtx, tabletAlias) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| cancel() | ||
|
|
||
| // Record a clean-up action to take the tablet back to rdonly. | ||
| wrangler.RecordChangeSlaveTypeAction(cleaner, tabletAlias, topodatapb.TabletType_DRAINED, topodatapb.TabletType_RDONLY) | ||
| return tabletAlias, nil | ||
| } | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this loop will help.
The context gets passed in, and the downstream calls are themselves looping waiting for context to expire. I see one difference in
waitForHealthyTabletswhere awaitForHealthyTabletsTimeoutgets added, but its default value is same asremoteActionsTimeout.Can you check the error that caused the race? Then we can identify the code path, and ideally fix it at the point where the error origintated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Sugu - I've validated in my local environment that this fixes the problem.
The loop by itself does not help. The fix is the loop + the change where now when you try to call
drainon a tablet that is already drained, you will get an error.Before this change, the race allow two vtworkers to use the same tablet and if I recall correctly one of the actual errors you get is when trying to synchronizeReplication. Both workers are trying to StopBlp and they get errors.