Shard started reroute high priority #137306
Conversation
We executed shard started at URGENT priority, but the subsequent reroute ran at NORMAL priority, so subsequent recoveries could come after other, less important actions. Now the reroute runs at HIGH priority.
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
Hi @henningandersen, I've created a changelog YAML for you.
ywangd
left a comment
I have a question
    rerouteService.reroute(
        "reroute after starting shards",
-       Priority.NORMAL,
+       Priority.HIGH,
Do we want to check whether there are unassigned shards before promoting it to HIGH priority? I'd be OK having it as URGENT if there are unassigned shards. But we can take one step at a time.
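The conditional-priority idea could be sketched roughly like this. This is a hypothetical illustration only: the `hasUnassignedShards` parameter and the standalone `Priority` enum are assumptions for the sketch, not Elasticsearch's actual API.

```java
// Hypothetical sketch of the conditional-priority suggestion; not
// Elasticsearch's real code. The idea: bump to URGENT only when there are
// unassigned shards (i.e. recoveries are actually pending), otherwise stay
// at HIGH to avoid over-prioritizing routine reroutes.
public class ReroutePrioritySketch {
    enum Priority { URGENT, HIGH, NORMAL }

    static Priority chooseReroutePriority(boolean hasUnassignedShards) {
        return hasUnassignedShards ? Priority.URGENT : Priority.HIGH;
    }
}
```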
I decided not to do so, for a couple of reasons:
- Simplicity.
- It seems safe enough: shard started is not that frequent and is batched, and so are the reroutes.
- Relocations off a shutting-down node could also run into this and thus delay vacating a node.
- Every shard initialization spends some time on the data node, during which the cluster can attend to other things before shard started comes back (assuming, of course, that things otherwise work well).
But happy to change this if you find it important.
I prefer to only go to HIGH to avoid bumping the priority too much until we have evidence we need it.
I am suggesting that mostly to see whether we can bump it higher, e.g. to URGENT, so that it does not get blocked by put-mapping requests. That is something we have observed a few times in production clusters. But if we are sticking to HIGH, the conditional priority is probably not entirely necessary.
For the record, I still think conditional urgent is useful. But we can iterate on this.
HIGH is sufficient to avoid getting blocked by a stream of other HIGH priority requests such as put-mapping ones, because all the HIGH tasks run in submission order.
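The submission-order behavior can be modeled with a small sketch. This is an illustrative model of a priority task queue, not Elasticsearch's actual MasterService implementation: tasks are ordered by priority first, with ties broken by a monotonically increasing submission sequence, so tasks at the same priority run FIFO and a HIGH-priority reroute cannot be starved by later HIGH-priority put-mappings.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative model of a priority task queue with FIFO ordering within a
// priority level; not Elasticsearch's actual MasterService code.
public class MasterQueueSketch {
    enum Priority { URGENT, HIGH, NORMAL } // lower ordinal runs first

    record Task(Priority priority, long seq, String name) {}

    // Drain tasks in execution order: by priority, then by submission sequence.
    static List<String> executionOrder(List<Task> submitted) {
        PriorityQueue<Task> queue = new PriorityQueue<>(
            Comparator.comparing(Task::priority).thenComparingLong(Task::seq));
        queue.addAll(submitted);
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll().name());
        }
        return order;
    }

    public static void main(String[] args) {
        List<Task> tasks = List.of(
            new Task(Priority.HIGH, 0, "put-mapping-1"),
            new Task(Priority.HIGH, 1, "put-mapping-2"),
            new Task(Priority.HIGH, 2, "reroute after starting shards"),
            new Task(Priority.HIGH, 3, "put-mapping-3"));
        // The reroute runs after the put-mappings submitted before it, but
        // ahead of any submitted after it, so it is never starved.
        System.out.println(executionOrder(tasks));
    }
}
```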
Err, I mis-remembered put-mapping as being URGENT. Thanks for explaining.
DaveCTurner
left a comment
Henning and I discussed this yesterday and he convinced me this is a reasonable change. It LGTM, but this does not mean we are setting a precedent for other master-task-priority tweaks.
We cannot go higher than HIGH here without regressing #44433, but I can see an argument that maybe we should have gone with HIGH rather than NORMAL in the first place.
Also the desired balance allocator means that #44433 should be less of a big deal now than it was at the time.
ywangd
left a comment
LGTM
* Shard started reroute high priority
We executed shard started at URGENT priority, but the subsequent reroute ran at NORMAL priority, so subsequent recoveries could come after other, less important actions. Now the reroute runs at HIGH priority.