Description
ES_VERSION: 7.6.0
JVM version : JDK1.8.0_112
OS version : linux
Description of the problem including expected versus actual behavior:
Updating the ClusterState on the master can take a long time, which is bad for the cluster. During a ClusterState update, when cluster.routing.allocation.total_shards_per_node > 0, ShardsLimitAllocationDecider iterates over all the shards on a node to count those belonging to the same index. This is very slow on large clusters.
In our production cluster there are 39 nodes, 2,000 indices and 50,000 shards, and a single cluster state update takes up to 3.4 minutes, which is intolerable.
To find out why updating the cluster state takes so long, I took a thread dump of the updateTask thread:
"[node-1][clusterService#updateTask][T#1]" #21 daemon prio=5 os_prio=0 tid=0x00007fc5c88fa800 nid=0x3369 runnable [0x00007fc58431a000]
java.lang.Thread.State: RUNNABLE
at java.util.Collections$UnmodifiableCollection$1.hasNext(Collections.java:1041)
at org.elasticsearch.cluster.routing.allocation.decider.ShardsLimitAllocationDecider.doDecide(ShardsLimitAllocationDecider.java:112)
at org.elasticsearch.cluster.routing.allocation.decider.ShardsLimitAllocationDecider.canAllocate(ShardsLimitAllocationDecider.java:88)
at org.elasticsearch.cluster.routing.allocation.decider.AllocationDeciders.canAllocate(AllocationDeciders.java:73)
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.decideMove(BalancedShardsAllocator.java:707)
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.moveShards(BalancedShardsAllocator.java:648)
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(BalancedShardsAllocator.java:123)
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:329)
at org.elasticsearch.cluster.routing.allocation.AllocationService.applyStartedShards(AllocationService.java:100)
at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardStartedClusterStateTaskExecutor.execute(ShardStateAction.java:438)
at org.elasticsearch.cluster.service.ClusterService.executeTasks(ClusterService.java:634)
at org.elasticsearch.cluster.service.ClusterService.calculateTaskOutputs(ClusterService.java:612)
I took the dump several times and got the same stack each time, so ShardsLimitAllocationDecider.doDecide appears to be where the time goes. The related code:
    if (indexShardLimit <= 0 && clusterShardLimit <= 0) {
        return allocation.decision(Decision.YES, NAME, "total shard limits are disabled: [index: %d, cluster: %d] <= 0",
            indexShardLimit, clusterShardLimit);
    }
    int indexShardCount = 0;
    int nodeShardCount = 0;
    for (ShardRouting nodeShard : node) {
        // don't count relocating shards...
        if (nodeShard.relocating()) {
            continue;
        }
        nodeShardCount++;
        if (nodeShard.index().equals(shardRouting.index())) {
            indexShardCount++;
        }
    }
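For each of the 50,000 shards, this loop walks the shards on a node (about 50,000 / 39 ≈ 1,282 on average), so it runs roughly 50,000 * 50,000 / 39 ≈ 64,000,000 times, which takes too long.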
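The estimate above can be reproduced with a few lines of Java. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, and it assumes each shard is checked once against a node holding an average share of the shards.

```java
// Back-of-the-envelope estimate using the figures from this report.
// IterationEstimate and estimate() are illustrative names, not Elasticsearch code.
class IterationEstimate {

    /** Inner-loop iterations if every shard is checked once against one average-sized node. */
    static long estimate(long totalShards, long nodes) {
        // (shards to decide) * (average shards per node), using integer division
        return totalShards * totalShards / nodes;
    }

    public static void main(String[] args) {
        System.out.println(estimate(50_000, 39)); // prints 64102564
    }
}
```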
There is room for optimization by avoiding the iteration over the node:
1. If indexShardLimit is -1 and clusterShardLimit > 0, we don't need to iterate to compute indexShardCount and nodeShardCount: nodeShardCount = node.size() - node.numberOfShardsWithState(ShardRoutingState.RELOCATING), and indexShardCount is not needed at all.
2. Could RoutingNode keep a count of the started shards of each index on each node, so that the iteration can be avoided entirely?
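To illustrate the second idea, here is a minimal sketch of per-index counters that are updated when shards are added to or removed from a node, so that both counts become constant-time lookups. NodeShardCounts and all of its methods are hypothetical stand-ins written for this issue; they are not RoutingNode or any existing Elasticsearch API, and a real change would also have to update the counters whenever a shard's state changes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for per-node bookkeeping; NOT Elasticsearch's RoutingNode.
class NodeShardCounts {
    // Number of non-relocating shards on this node, keyed by index name.
    private final Map<String, Integer> nonRelocatingPerIndex = new HashMap<>();
    private int total;       // all shards on this node
    private int relocating;  // shards currently relocating away

    /** Record a shard being added to this node. */
    void add(String index, boolean isRelocating) {
        total++;
        if (isRelocating) {
            relocating++;
        } else {
            nonRelocatingPerIndex.merge(index, 1, Integer::sum);
        }
    }

    /** Record a shard being removed from this node. */
    void remove(String index, boolean isRelocating) {
        total--;
        if (isRelocating) {
            relocating--;
        } else {
            // Drop the entry when the count reaches zero so the map stays small.
            nonRelocatingPerIndex.computeIfPresent(index, (k, v) -> v == 1 ? null : v - 1);
        }
    }

    /** Equivalent of nodeShardCount in the loop above, without iterating. */
    int nodeShardCount() {
        return total - relocating;
    }

    /** Equivalent of indexShardCount in the loop above, without iterating. */
    int indexShardCount(String index) {
        return nonRelocatingPerIndex.getOrDefault(index, 0);
    }

    public static void main(String[] args) {
        NodeShardCounts node = new NodeShardCounts();
        node.add("logs", false);
        node.add("logs", false);
        node.add("metrics", true); // relocating, so it is not counted
        System.out.println(node.nodeShardCount());        // prints 2
        System.out.println(node.indexShardCount("logs")); // prints 2
    }
}
```

The trade-off is a small amount of extra memory and bookkeeping on every routing change in exchange for removing the per-decision scan of the node's shards.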