Skip to content
Merged
Show file tree
Hide file tree
Changes from 13 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -705,12 +705,15 @@ private boolean balanceByWeights(NodeSorter sorter) {
highIdx = relevantNodes - 1;

if (routingNodes.getRelocatingShardCount() > 0) {
// ES-12955: Check routingNodes.getRelocatingShardCount() > 0 in case the first relocation is a THROTTLE.
// This should rarely happen since in most cases, we don't throttle unless there is an existing relocation.
// But it can happen in production for frozen indices when the cache is still being prepared. It can also
// happen in tests because we have decider like RandomAllocationDecider that can randomly return THROTTLE
// when there is no existing relocation.
shardBalanced = true;
} else {
// A THROTTLE decision can happen when not simulating
assert allocation.isSimulating() == false
: "unexpected THROTTLE decision (simulation="
+ allocation.isSimulating()
+ ") when balancing index ["
+ index
+ "]";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing the assert in a conditional branch makes this seem like certain paths are okay for THROTTLE. Suppose we were in a simulation and somehow got more than 1 relocating shard at once: that would be unsafe, right? Is there any reason we can't safely always have the assert in place? Like,

assert allocation.isSimulating() == false || routingNodes.getRelocatingShardCount() <= 1 : "....";
if (routingNodes.getRelocatingShardCount() > 0) {
    shardBalanced = true;
}
if (completeEarlyOnShardAssignmentChange && shardBalanced) {
....

}
if (completeEarlyOnShardAssignmentChange && shardBalanced) {
return true;
Expand Down Expand Up @@ -835,6 +838,18 @@ public boolean moveShards() {
} else if (moveDecision.isDecisionTaken() && moveDecision.canRemain() == false) {
logger.trace("[{}][{}] can't move", shardRouting.index(), shardRouting.id());
}

// A THROTTLE allocation decision can happen when not simulating
assert moveDecision.getAllocationDecision() != AllocationDecision.THROTTLED || allocation.isSimulating() == false
Copy link
Contributor

@DiannaHohensee DiannaHohensee Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be done right after the decideMove call? There's an early return path above on which we'd miss this assert.

: "unexpected allocation decision ["
+ moveDecision.getAllocationDecision()
+ "] (simulation="
+ allocation.isSimulating()
+ ") with "
+ (shardMoved ? "" : "no ")
+ "prior shard movements when moving shard ["
+ shardRouting
+ "]";
}

// If we get here, attempt to move one of the best not-preferred shards that we identified earlier
Expand Down Expand Up @@ -1268,9 +1283,15 @@ private boolean allocateUnassigned() {
assert allocationDecision.getAllocationStatus() == AllocationStatus.DECIDERS_THROTTLED;
final long shardSize = getExpectedShardSize(shard, ShardRouting.UNAVAILABLE_EXPECTED_SHARD_SIZE, allocation);
minNode.addShard(projectIndex(shard), shard.initialize(minNode.getNodeId(), null, shardSize));
// If we see a throttle decision in simulation, there must be other shards that got assigned before it.
// If we see a THROTTLE decision, it's either:
// 1. Not simulating
// 2. Or, there is shard assigned before this one
assert allocation.isSimulating() == false || shardAssignmentChanged
: "shard " + shard + " was throttled but no other shards were assigned";
: "unexpected THROTTLE decision (simulation="
+ allocation.isSimulating()
+ ") with no prior assignment when allocating unassigned shard ["
+ shard
+ "]";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than putting this in a conditional, can we assert that either isSimulating==false || decision is not throttle up on line 1272? Or better yet, right after the decideAllocateUnassigned call, outside another layer of if-else condition?

} else {
if (logger.isTraceEnabled()) {
logger.trace("No Node found to assign shard [{}]", shard);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -486,6 +486,13 @@ public DesiredBalance compute(
|| info.lastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_THROTTLED) : "Unexpected stats in: " + info;

if (hasChanges == false && info.lastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_THROTTLED) {
// Unassigned ignored shards must be based on the provided set of ignoredShards
assert ignoredShards.contains(discardAllocationStatus(shard))
|| ignoredShards.stream().filter(ShardRouting::primary).anyMatch(primary -> primary.shardId().equals(shard.shardId()))
: "ignored shard "
+ shard
+ " unexpectedly has THROTTLE status and no counterpart in the provided ignoredShards set "
+ ignoredShards;
// Simulation could not progress due to missing information in any of the deciders.
// Currently, this could happen if `HasFrozenCacheAllocationDecider` is still fetching the data.
// Progress would be made after the followup reroute call.
Comment on lines 488 to 498
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment about HasFrozenCacheAllocationDecider had me concerned for some time. Adding this assertion to make it clear that throttled status can only come from DesiredBalanceInput#ignoredShards, i.e. they are not produced by BalancedShardAllocator.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,8 @@ public Decision canRebalance(ShardRouting shardRouting, RoutingAllocation alloca
public Decision canRebalance(RoutingAllocation allocation) {
int relocatingShards = allocation.routingNodes().getRelocatingShardCount();
if (allocation.isSimulating() && relocatingShards >= 2) {
// This branch should no longer run after https://github.com/elastic/elasticsearch/pull/134786
assert false : "allocation simulation should have returned earlier and not hit throttling";
// BalancedShardAllocator is prone to perform unnecessary moves when cluster_concurrent_rebalance is set to high values (>2).
// (See https://github.com/elastic/elasticsearch/issues/87279)
// Above allocator is used in DesiredBalanceComputer. Since we do not move actual shard data during calculation
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -206,29 +206,33 @@ public RandomAllocationDecider(Random random) {

@Override
public Decision canRebalance(ShardRouting shardRouting, RoutingAllocation allocation) {
return getRandomDecision();
return getRandomDecision(allocation.isSimulating() == false || allocation.routingNodes().getRelocatingShardCount() > 0);
}

private Decision getRandomDecision() {
private Decision getRandomDecision(boolean canThrottle) {
if (alwaysSayYes) {
return Decision.YES;
}
return switch (random.nextInt(10)) {
case 9, 8, 7, 6, 5 -> Decision.NO;
case 4 -> Decision.THROTTLE;
case 4 -> canThrottle ? Decision.THROTTLE : Decision.YES;
case 3, 2, 1 -> Decision.YES;
default -> Decision.ALWAYS;
};
}

@Override
public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
return getRandomDecision();
return getRandomDecision(
allocation.isSimulating() == false
|| allocation.routingNodes().getIncomingRecoveries(node.nodeId()) > 0
|| allocation.routingNodes().getOutgoingRecoveries(node.nodeId()) > 0
);
}

@Override
public Decision canRemain(IndexMetadata indexMetadata, ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
return getRandomDecision();
return getRandomDecision(false); // throttle does not make sense for canRemain
}

}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,12 @@ public class HasFrozenCacheAllocationDecider extends AllocationDecider {
"value of [" + SHARED_CACHE_SIZE_SETTING.getKey() + "] on this node is not known yet"
);

private static final Decision NO_STILL_FETCHING = Decision.single(
Decision.Type.NO,
NAME,
"Shard movement is not allowed in simulation when value of [" + SHARED_CACHE_SIZE_SETTING.getKey() + "] on this node is not known"
);

private static final Decision HAS_FROZEN_CACHE = Decision.single(
Decision.Type.YES,
NAME,
Expand All @@ -48,6 +54,12 @@ public class HasFrozenCacheAllocationDecider extends AllocationDecider {
"there was an error fetching the searchable snapshot shared cache state from this node"
);

private static final Decision UNKNOWN_NODE = Decision.single(
Decision.Type.NO,
NAME,
"this node is unknown to the searchable snapshot shared cache state"
);

private final FrozenCacheInfoService frozenCacheService;

public HasFrozenCacheAllocationDecider(FrozenCacheInfoService frozenCacheService) {
Expand All @@ -56,25 +68,25 @@ public HasFrozenCacheAllocationDecider(FrozenCacheInfoService frozenCacheService

@Override
public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
return canAllocateToNode(allocation.metadata().indexMetadata(shardRouting.index()), node.node());
return canAllocateToNode(allocation.metadata().indexMetadata(shardRouting.index()), node.node(), allocation);
}

@Override
public Decision canRemain(IndexMetadata indexMetadata, ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
return canAllocateToNode(indexMetadata, node.node());
return canAllocateToNode(indexMetadata, node.node(), allocation);
}

@Override
public Decision canAllocate(IndexMetadata indexMetadata, RoutingNode node, RoutingAllocation allocation) {
return canAllocateToNode(indexMetadata, node.node());
return canAllocateToNode(indexMetadata, node.node(), allocation);
}

@Override
public Decision shouldAutoExpandToNode(IndexMetadata indexMetadata, DiscoveryNode node, RoutingAllocation allocation) {
return canAllocateToNode(indexMetadata, node);
return canAllocateToNode(indexMetadata, node, allocation);
}

private Decision canAllocateToNode(IndexMetadata indexMetadata, DiscoveryNode discoveryNode) {
private Decision canAllocateToNode(IndexMetadata indexMetadata, DiscoveryNode discoveryNode, RoutingAllocation allocation) {
if (indexMetadata.isPartialSearchableSnapshot() == false) {
return Decision.ALWAYS;
}
Expand All @@ -83,7 +95,8 @@ private Decision canAllocateToNode(IndexMetadata indexMetadata, DiscoveryNode di
case HAS_CACHE -> HAS_FROZEN_CACHE;
case NO_CACHE -> NO_FROZEN_CACHE;
case FAILED -> UNKNOWN_FROZEN_CACHE;
default -> STILL_FETCHING;
case FETCHING -> allocation.isSimulating() ? NO_STILL_FETCHING : STILL_FETCHING;
case UNKNOWN -> UNKNOWN_NODE;
Copy link
Member Author

@ywangd ywangd Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HasFrozenAllocationDecider does not differentiate between fetching node state (FETCHING) and no-such-node (UNKNOWN). The later can only happen when a node leaves the cluster which seems more justified to say NO. In any case, they do not really impact SearchableSnapshotAllocator which is what matters for allocating unassigned searchable snapshot shards.
Btw, I can also undo the split if it is considered unrelated.

};
}

Expand Down