Inline reroute with process of node join/master election #18938
Conversation
@ywelsch can you take a look?
```java
if (nodesChanged) {
    newState.nodes(nodesBuilder);
    final ClusterState tmpState = newState.build();
    RoutingAllocation.Result result = routingService.getAllocationService().reroute(tmpState, "node_join");
```
ZenDiscovery / NodeJoinController is only using AllocationService now, no need for `RoutingService`. We can directly use `AllocationService` as a dependency for ZD / NJC.
We should also do the same in LocalDiscovery as we do here.
We can then also remove getAllocationService from RoutingService.
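For illustration, a minimal sketch of the dependency change being discussed, using simplified stand-in types rather than the real Elasticsearch classes (the constructor shape is an assumption):

```java
// Simplified stand-ins, not the real Elasticsearch classes: the point is only that
// NodeJoinController / ZenDiscovery can depend on AllocationService directly,
// so RoutingService no longer needs to expose getAllocationService().
interface AllocationService { }

final class NodeJoinController {
    // before (sketch): private RoutingService routingService;
    //                  reroute went through routingService.getAllocationService()
    private final AllocationService allocationService; // after: direct dependency

    NodeJoinController(AllocationService allocationService) {
        this.allocationService = allocationService;
    }
}
```

The same wiring would then apply to LocalDiscovery, as noted above.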
Left 2 comments. I really like this change!
@ywelsch I pushed another commit addressing your comments
```diff
-    @Override
-    public void setRoutingService(RoutingService routingService) {
-        this.routingService = routingService;
+    public void setAllocationService(AllocationService allocationService) {
```
can you add the @Override back? (here and in all the other subclasses of Discovery)
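As an aside, a small sketch (stand-in interfaces, not the real Discovery API) of why keeping the `@Override` annotation on the renamed setter is worthwhile:

```java
// Stand-in interfaces, not the real Discovery / AllocationService API.
interface AllocationService { }

interface Discovery {
    void setAllocationService(AllocationService allocationService);
}

class LocalDiscovery implements Discovery {
    private AllocationService allocationService;

    // @Override documents that this implements Discovery#setAllocationService and
    // makes the compiler complain if the interface method is renamed or removed later.
    @Override
    public void setAllocationService(AllocationService allocationService) {
        this.allocationService = allocationService;
    }
}
```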
Left one minor, no need for another iteration. LGTM. Thanks @bleskes!
…8938) There are secondary issues with async shard fetch going out to nodes before they have a cluster state published to them that need to be solved first. For example:
- async fetch uses a transport node action that resolves nodes based on the cluster state (but it's not yet exposed by ClusterService since we inline the reroute)
- after a disruption, nodes will respond with an allocated shard (they haven't cleaned up their shards yet), which throws off decisions on the master side.
- nodes need the index meta data in question, but they may not have it if they didn't receive the latest cluster state.
I reverted this one due to secondary problems with async shard fetch. I'm working on fixing those before re-committing this. From the revert commit message:
* master: (416 commits)
  docs: removed obsolete information, percolator queries are not longer loaded into jvm heap memory.
  Upgrade JNA to 4.2.2 and remove optionality
  [TEST] Increase timeouts for Rest test client (#19042)
  Update migrate_5_0.asciidoc
  Add ThreadLeakLingering option to Rest client tests
  Add a MultiTermAwareComponent marker interface to analysis factories. #19028
  Attempt at fixing IndexStatsIT.testFilterCacheStats.
  Fix docs build.
  Move templates out of the Search API, into lang-mustache module
  revert - Inline reroute with process of node join/master election (#18938)
  Build valid slices in SearchSourceBuilderTests
  Docs: Convert aggs/misc to CONSOLE
  Docs: migration notes for _timestamp and _ttl
  Group client projects under :client
  [TEST] Add client-test module and make client tests use randomized runner directly
  Move upgrade test to upgrade from version 2.3.3
  Tasks: Add completed to the mapping
  Fail to start if plugin tries broken onModule
  Remove duplicated read byte array methods
  Rename `fields` to `stored_fields` and add `docvalue_fields`
  ...
…oth on master and non data nodes (#19044) #18938 has changed the timing in which we send out to nodes to fetch their shard stores. Instead of doing this after the cluster state resulting from the node's join was published, #18938 made it be sent concurrently with the publishing process. This revealed a couple of points where shard store fetching depends on the current state of affairs of the cluster state, both on the master and the data nodes. The problems discovered were already present without #18938, but required failures/extreme situations to make them happen. This PR tries to remove as many of these dependencies as possible, making shard store fetching simpler and paving the way to re-introduce #18938, which was reverted. These are the notable changes:
1) Allow TransportNodesAction (from which shard store fetching is derived) callers to supply concrete disco nodes, so it won't need the cluster state to resolve them (see the sketch after this list). This was a problem because the cluster state containing the needed nodes was not yet made available through ClusterService. Note that long term we can expect the REST layer to resolve node ids to concrete nodes, making this mode the only one needed.
2) The data node relied on the cluster state to have the relevant index meta data so it can find data when custom paths are used. We now fall back to reading the meta data from disk if needed.
3) The data node was relying on its own IndexService state to indicate whether the data it has corresponds to an existing allocation. This is of course something it cannot know until it has received (and processed) the new cluster state from the master. This flag in the response is now removed. This is not a problem because we used that flag to protect against double-assigning a shard to the same node, but we are already protected from that by the allocation deciders.
4) I removed the redundant filterNodeIds method in TransportNodesAction - if people want to filter they can override resolveRequest.
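As a rough illustration of change (1), here is a hedged sketch of a nodes-level request that can carry pre-resolved nodes; the types and method names (NodesRequestSketch, DiscoveryNodeSketch, resolve) are simplified stand-ins, not the actual TransportNodesAction API:

```java
import java.util.Arrays;
import java.util.Map;

// Simplified stand-ins, not the actual TransportNodesAction / DiscoveryNode classes.
final class DiscoveryNodeSketch {
    final String id;
    final String address;
    DiscoveryNodeSketch(String id, String address) { this.id = id; this.address = address; }
}

final class NodesRequestSketch {
    final String[] nodeIds;                     // resolved against the cluster state
    final DiscoveryNodeSketch[] concreteNodes;  // already-resolved nodes, no cluster state needed

    NodesRequestSketch(String... nodeIds) { this.nodeIds = nodeIds; this.concreteNodes = null; }
    NodesRequestSketch(DiscoveryNodeSketch... nodes) { this.nodeIds = null; this.concreteNodes = nodes; }
}

final class NodeResolverSketch {
    // If the caller supplied concrete nodes, use them; otherwise fall back to the
    // cluster-state-based lookup (which is exactly what is not yet available while
    // the join's cluster state is still being published).
    static DiscoveryNodeSketch[] resolve(NodesRequestSketch request,
                                         Map<String, DiscoveryNodeSketch> clusterStateNodes) {
        if (request.concreteNodes != null) {
            return request.concreteNodes;
        }
        return Arrays.stream(request.nodeIds)
                .map(clusterStateNodes::get)
                .toArray(DiscoveryNodeSketch[]::new);
    }
}
```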
We currently have a concurrency issue between the static methods on the Store class and store changes that are done via a valid open store. An example of this is async shard fetch, which can reach out to a node while a local shard copy is shutting down (the fetch does check whether we have an open shard and tries to use that first, but if the shard is shutting down, it will not be available from IndexService). Specifically, async shard fetching tries to read metadata from the store; concurrently, the shard that is shutting down commits to Lucene, changing the segments_N file. This causes a file-not-found exception on the shard fetching side, which in turn makes the master think the shard is unusable. In tests this can cause the shard assignment to be delayed (up to 1m), which fails tests. See https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+java9-periodic/570 for details. This is one of the things #18938 caused to bubble up.
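One way such a race can be handled in principle (a hedged illustration only, not necessarily how Store addresses it) is to retry the metadata read a bounded number of times when the commit point disappears underneath it:

```java
import java.io.FileNotFoundException;
import java.nio.file.NoSuchFileException;
import java.util.Map;
import java.util.concurrent.Callable;

// Hedged illustration only: readSnapshot stands in for whatever lists the shard's
// files and checksums starting from the current segments_N commit point.
final class MetadataSnapshotRetry {
    static Map<String, Long> readWithRetry(Callable<Map<String, Long>> readSnapshot, int attempts) throws Exception {
        if (attempts < 1) {
            throw new IllegalArgumentException("attempts must be >= 1");
        }
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return readSnapshot.call();
            } catch (FileNotFoundException | NoSuchFileException e) {
                // a concurrent Lucene commit replaced segments_N while we were reading; try again
                last = e;
            }
        }
        throw last;
    }
}
```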
In the past, we had the semantics where the very first cluster state a node processed after joining could not contain shard assignments to it. This was to make sure the node cleans up local / stale shard copies before receiving new ones that might confuse it. Since then, a lot of work has been done in this area, most notably the introduction of allocation ids and #17270. This means we don't have to be careful anymore and can just reroute in the same cluster state change in which we process the join, keeping things simple and following the same pattern we have in other places.
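Put as a sketch (simplified stand-ins, not the actual ZenDiscovery / AllocationService code), the change means the join handler builds the state with the new node and reroutes before publishing, rather than publishing first and scheduling a follow-up reroute:

```java
import java.util.HashSet;
import java.util.Set;

// Simplified stand-ins, not the actual Elasticsearch classes.
final class ClusterStateSketch {
    final Set<String> nodeIds;
    ClusterStateSketch(Set<String> nodeIds) { this.nodeIds = nodeIds; }
}

interface AllocationServiceSketch {
    // stand-in for AllocationService#reroute(ClusterState, String)
    ClusterStateSketch reroute(ClusterStateSketch state, String reason);
}

final class JoinHandlerSketch {
    // old flow: add the node, publish, then run a separate reroute task and publish again;
    // new flow: add the node and reroute here, and publish the combined result once.
    static ClusterStateSketch processJoin(ClusterStateSketch current, String joiningNodeId,
                                          AllocationServiceSketch allocationService) {
        Set<String> nodes = new HashSet<>(current.nodeIds);
        boolean nodesChanged = nodes.add(joiningNodeId);
        ClusterStateSketch withNode = new ClusterStateSketch(nodes);
        return nodesChanged ? allocationService.reroute(withNode, "node_join") : current;
    }
}
```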