Inline reroute with process of node join/master election #18938
Conversation
@ywelsch can you take a look?
```java
if (nodesChanged) {
    newState.nodes(nodesBuilder);
    final ClusterState tmpState = newState.build();
    RoutingAllocation.Result result = routingService.getAllocationService().reroute(tmpState, "node_join");
```
ZenDiscovery / NodeJoinController is only using AllocationService now, no need for `RoutingService`. We can directly use `AllocationService` as a dependency for ZD / NJC.
We should also do the same in LocalDiscovery as we do here.
We can then also remove getAllocationService from RoutingService.
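For illustration, a minimal sketch of the dependency change being discussed, using simplified stand-in types rather than the real Elasticsearch classes (the constructor shape is an assumption):

```java
// Simplified stand-ins, not the real Elasticsearch classes: the point is only that
// NodeJoinController / ZenDiscovery can depend on AllocationService directly,
// so RoutingService no longer needs to expose getAllocationService().
interface AllocationService { }

final class NodeJoinController {
    // before (sketch): private RoutingService routingService;
    //                  reroute went through routingService.getAllocationService()
    private final AllocationService allocationService; // after: direct dependency

    NodeJoinController(AllocationService allocationService) {
        this.allocationService = allocationService;
    }
}
```

The same wiring would then apply to LocalDiscovery, as noted above.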
Left 2 comments. I really like this change!
@ywelsch I pushed another commit addressing your comments
```diff
-    @Override
-    public void setRoutingService(RoutingService routingService) {
-        this.routingService = routingService;
+    public void setAllocationService(AllocationService allocationService) {
```
can you add the @Override back? (here and in all the other subclasses of Discovery)
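As an aside, a small sketch (stand-in interfaces, not the real Discovery API) of why keeping the `@Override` annotation on the renamed setter is worthwhile:

```java
// Stand-in interfaces, not the real Discovery / AllocationService API.
interface AllocationService { }

interface Discovery {
    void setAllocationService(AllocationService allocationService);
}

class LocalDiscovery implements Discovery {
    private AllocationService allocationService;

    // @Override documents that this implements Discovery#setAllocationService and
    // makes the compiler complain if the interface method is renamed or removed later.
    @Override
    public void setAllocationService(AllocationService allocationService) {
        this.allocationService = allocationService;
    }
}
```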
Left one minor, no need for another iteration. LGTM. Thanks @bleskes!
…8938) There are secondary issues with async shard fetch going out to nodes before they have a cluster state published to them that need to be solved first. For example:
- async fetch uses a transport node action that resolves nodes based on the cluster state (but it's not yet exposed by ClusterService since we inline the reroute)
- after a disruption, nodes will respond with an allocated shard (they haven't cleaned up their shards yet), which throws off decisions on the master side.
- nodes need the index meta data in question, but they may not have it if they didn't receive the latest cluster state.
I reverted this one due to secondary problems with async shard fetch. I'm working on fixing those before re-committing this. From the revert commit message:
* master: (416 commits)
  docs: removed obsolete information, percolator queries are not longer loaded into jvm heap memory.
  Upgrade JNA to 4.2.2 and remove optionality
  [TEST] Increase timeouts for Rest test client (#19042)
  Update migrate_5_0.asciidoc
  Add ThreadLeakLingering option to Rest client tests
  Add a MultiTermAwareComponent marker interface to analysis factories. #19028
  Attempt at fixing IndexStatsIT.testFilterCacheStats.
  Fix docs build.
  Move templates out of the Search API, into lang-mustache module
  revert - Inline reroute with process of node join/master election (#18938)
  Build valid slices in SearchSourceBuilderTests
  Docs: Convert aggs/misc to CONSOLE
  Docs: migration notes for _timestamp and _ttl
  Group client projects under :client
  [TEST] Add client-test module and make client tests use randomized runner directly
  Move upgrade test to upgrade from version 2.3.3
  Tasks: Add completed to the mapping
  Fail to start if plugin tries broken onModule
  Remove duplicated read byte array methods
  Rename `fields` to `stored_fields` and add `docvalue_fields`
  ...
…oth on master and non data nodes (#19044) #18938 has changed the timing in which we send out to nodes to fetch their shard stores. Instead of doing this after the cluster state resulting from the node's join was published, #18938 made it be sent concurrently with the publishing process. This revealed a couple of points where shard store fetching depends on the current state of affairs of the cluster state, both on the master and the data nodes. The problems discovered were already present without #18938, but required failures/extreme situations to make them happen. This PR tries to remove as many of these dependencies as possible, making shard store fetching simpler and paving the way to re-introduce #18938, which was reverted. These are the notable changes:
1) Allow TransportNodesAction (from which shard store fetching is derived) callers to supply concrete disco nodes, so it won't need the cluster state to resolve them (see the sketch after this list). This was a problem because the cluster state containing the needed nodes was not yet made available through ClusterService. Note that long term we can expect the REST layer to resolve node ids to concrete nodes, making this mode the only one needed.
2) The data node relied on the cluster state to have the relevant index meta data so it can find data when custom paths are used. We now fall back to reading the meta data from disk if needed.
3) The data node was relying on its own IndexService state to indicate whether the data it has corresponds to an existing allocation. This is of course something it cannot know until it has received (and processed) the new cluster state from the master. This flag in the response is now removed. This is not a problem because we used that flag to protect against double-assigning a shard to the same node, but we are already protected from that by the allocation deciders.
4) I removed the redundant filterNodeIds method in TransportNodesAction - if people want to filter they can override resolveRequest.
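As a rough illustration of change (1), here is a hedged sketch of a nodes-level request that can carry pre-resolved nodes; the types and method names (NodesRequestSketch, DiscoveryNodeSketch, resolve) are simplified stand-ins, not the actual TransportNodesAction API:

```java
import java.util.Arrays;
import java.util.Map;

// Simplified stand-ins, not the actual TransportNodesAction / DiscoveryNode classes.
final class DiscoveryNodeSketch {
    final String id;
    final String address;
    DiscoveryNodeSketch(String id, String address) { this.id = id; this.address = address; }
}

final class NodesRequestSketch {
    final String[] nodeIds;                     // resolved against the cluster state
    final DiscoveryNodeSketch[] concreteNodes;  // already-resolved nodes, no cluster state needed

    NodesRequestSketch(String... nodeIds) { this.nodeIds = nodeIds; this.concreteNodes = null; }
    NodesRequestSketch(DiscoveryNodeSketch... nodes) { this.nodeIds = null; this.concreteNodes = nodes; }
}

final class NodeResolverSketch {
    // If the caller supplied concrete nodes, use them; otherwise fall back to the
    // cluster-state-based lookup (which is exactly what is not yet available while
    // the join's cluster state is still being published).
    static DiscoveryNodeSketch[] resolve(NodesRequestSketch request,
                                         Map<String, DiscoveryNodeSketch> clusterStateNodes) {
        if (request.concreteNodes != null) {
            return request.concreteNodes;
        }
        return Arrays.stream(request.nodeIds)
                .map(clusterStateNodes::get)
                .toArray(DiscoveryNodeSketch[]::new);
    }
}
```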
We currently have a concurrency issue between the static methods on the Store class and store changes that are done via a valid open store. An example of this is async shard fetch, which can reach out to a node while a local shard copy is shutting down (the fetch does check whether we have an open shard and tries to use that first, but if the shard is shutting down, it will not be available from IndexService). Specifically, async shard fetching tries to read metadata from the store; concurrently, the shard that is shutting down commits to Lucene, changing the segments_N file. This causes a file-not-found exception on the shard fetching side, which in turn makes the master think the shard is unusable. In tests this can cause the shard assignment to be delayed (up to 1m), which fails tests. See https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+java9-periodic/570 for details. This is one of the things #18938 caused to bubble up.
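One way such a race can be handled in principle (a hedged illustration only, not necessarily how Store addresses it) is to retry the metadata read a bounded number of times when the commit point disappears underneath it:

```java
import java.io.FileNotFoundException;
import java.nio.file.NoSuchFileException;
import java.util.Map;
import java.util.concurrent.Callable;

// Hedged illustration only: readSnapshot stands in for whatever lists the shard's
// files and checksums starting from the current segments_N commit point.
final class MetadataSnapshotRetry {
    static Map<String, Long> readWithRetry(Callable<Map<String, Long>> readSnapshot, int attempts) throws Exception {
        if (attempts < 1) {
            throw new IllegalArgumentException("attempts must be >= 1");
        }
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return readSnapshot.call();
            } catch (FileNotFoundException | NoSuchFileException e) {
                // a concurrent Lucene commit replaced segments_N while we were reading; try again
                last = e;
            }
        }
        throw last;
    }
}
```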
In the past, we had the semantics where the very first cluster state a node processed after joining could not contain shard assignments to it. This was to make sure the node cleans up local / stale shard copies before receiving new ones that might confuse it. Since then, a lot of work has been done in this area, most notably the introduction of allocation ids and #17270. This means we don't have to be careful anymore and can just reroute in the same cluster state change in which we process the join, keeping things simple and following the same pattern we have in other places.
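Put as a sketch (simplified stand-ins, not the actual ZenDiscovery / AllocationService code), the change means the join handler builds the state with the new node and reroutes before publishing, rather than publishing first and scheduling a follow-up reroute:

```java
import java.util.HashSet;
import java.util.Set;

// Simplified stand-ins, not the actual Elasticsearch classes.
final class ClusterStateSketch {
    final Set<String> nodeIds;
    ClusterStateSketch(Set<String> nodeIds) { this.nodeIds = nodeIds; }
}

interface AllocationServiceSketch {
    // stand-in for AllocationService#reroute(ClusterState, String)
    ClusterStateSketch reroute(ClusterStateSketch state, String reason);
}

final class JoinHandlerSketch {
    // old flow: add the node, publish, then run a separate reroute task and publish again;
    // new flow: add the node and reroute here, and publish the combined result once.
    static ClusterStateSketch processJoin(ClusterStateSketch current, String joiningNodeId,
                                          AllocationServiceSketch allocationService) {
        Set<String> nodes = new HashSet<>(current.nodeIds);
        boolean nodesChanged = nodes.add(joiningNodeId);
        ClusterStateSketch withNode = new ClusterStateSketch(nodes);
        return nodesChanged ? allocationService.reroute(withNode, "node_join") : current;
    }
}
```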