Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -704,12 +704,18 @@ private boolean balanceByWeights(NodeSorter sorter) {
highIdx = relevantNodes - 1;

if (routingNodes.getRelocatingShardCount() > 0) {
// ES-12955: Check routingNodes.getRelocatingShardCount() > 0 in case the first relocation is a THROTTLE.
// This should rarely happen since in most cases, we don't throttle unless there is an existing relocation.
// But it can happen in production for frozen indices when the cache is still being prepared. It can also
// happen in tests because we have decider like RandomAllocationDecider that can randomly return THROTTLE
// when there is no existing relocation.
// Check routingNodes.getRelocatingShardCount() > 0 in case the first relocation is a THROTTLE.
shardBalanced = true;
} else {
// A THROTTLE decision can happen when:
// 1. Not simulating
// 2. Or, a partial searchable snapshot index due to HasFrozenCacheAllocationDecider
assert allocation.isSimulating() == false || indexMetadata.isPartialSearchableSnapshot()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if we could check that the THROTTLE decision is actually from the HasFrozenCacheAllocationDecider. But it's not really feasible. Instead, we can check it's a frozen index. Not really ideal, but seems to be the best we can do.

Copy link
Contributor

@nicktindall nicktindall Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of putting knowledge of specific deciders in here, as discussed, I wonder if HasFrozenAllocationDecider is abusing THROTTLE anyway. My reasoning was:

In the canAllocateUnassigned logic we'll wait for the outcome of a lower-weight THROTTLE when we have higher-weight YES so that would mean this decider can defer the allocation of a partial searchable snapshot while a non-frozen node was initialising doesn't it?
I would think we'd return NO until the setting was loaded?

It seems like returning THROTTLED while we wait to find out if the node can host frozen shards could mean we defer allocating the shard to another node that we KNOW can host frozen shards, while waiting to find out if an initialising node is a frozen node.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we defer allocating the shard to another node that we KNOW can host frozen shards, while waiting to find out if an initialising node is a frozen node.

That is true. I also wonder whether it is intentional since otherwise the shards will be assigned to an existing node only to relocate later momentarily. That said, after reading through code around HasFrozenAllocationDecider, I believe it is OK to return NO since it should not change the beahviour. That is because we always call allocateExistingUnassignedShards before calling into the allocator. This method uses SearchableSnapshotAllocator to allocate unassigne shards first. It also checks cache status and reject allocation when cache state is still being fetched. It in turns places the shard in the ignored list and BalancedShardAllocator ignores them as well. In short, the allocation decision for paritially mounted shards are made earlier in a different place so that it should not matter for the decider to return NO.

: "unexpected THROTTLE decision (simulation="
+ allocation.isSimulating()
+ ") when balancing index ["
+ index
+ "]";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing the assert in a conditional branch makes this seem like certain paths are okay for THROTTLE. Suppose we were in a simulation and somehow got more than 1 relocating shard at once: that would be unsafe, right? Is there any reason we can't safely always have the assert in place? Like,

assert allocation.isSimulating() == false || routingNodes.getRelocatingShardCount() <= 1 : "....";
if (routingNodes.getRelocatingShardCount() > 0) {
    shardBalanced = true;
}
if (completeEarlyOnShardAssignmentChange && shardBalanced) {
....

}
if (completeEarlyOnShardAssignmentChange && shardBalanced) {
return true;
Expand Down Expand Up @@ -834,6 +840,20 @@ public boolean moveShards() {
} else if (moveDecision.isDecisionTaken() && moveDecision.canRemain() == false) {
logger.trace("[{}][{}] can't move", shardRouting.index(), shardRouting.id());
}

// A THROTTLE allocation decision can happen when:
// 1. Not simulating
// 2. Or, there are prior shard movements
// 3. Or, a partial searchable snapshot index due to HasFrozenCacheAllocationDecider
assert moveDecision.getAllocationDecision() != AllocationDecision.THROTTLED
|| (allocation.isSimulating() == false || shardMoved || indexMetadata(index).isPartialSearchableSnapshot())
: "unexpected allocation decision ["
+ moveDecision.getAllocationDecision()
+ "] (simulation="
+ allocation.isSimulating()
+ ") with no prior shard movements when moving shard ["
+ shardRouting
+ "]";
}

// If we get here, attempt to move one of the best not-preferred shards that we identified earlier
Expand Down Expand Up @@ -1267,7 +1287,18 @@ private boolean allocateUnassigned() {
assert allocationDecision.getAllocationStatus() == AllocationStatus.DECIDERS_THROTTLED;
final long shardSize = getExpectedShardSize(shard, ShardRouting.UNAVAILABLE_EXPECTED_SHARD_SIZE, allocation);
minNode.addShard(projectIndex(shard), shard.initialize(minNode.getNodeId(), null, shardSize));
// If we see a throttle decision in simulation, there must be other shards that got assigned before it.
// If we see a THROTTLE decision, it's either:
// 1. Not simulating
// 2. Or, there is shard assigned before this one
// 3. Or, a partial searchable snapshot index due to HasFrozenCacheAllocationDecider
assert allocation.isSimulating() == false
|| shardAssignmentChanged
|| indexMetadata(index).isPartialSearchableSnapshot()
: "unexpected THROTTLE decision (simulation="
+ allocation.isSimulating()
+ ") with no prior assignment when allocating unassigned shard ["
+ shard
+ "]";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than putting this in a conditional, can we assert that either isSimulating==false || decision is not throttle up on line 1272? Or better yet, right after the decideAllocateUnassigned call, outside another layer of if-else condition?

assert allocation.isSimulating() == false || shardAssignmentChanged
: "shard " + shard + " was throttled but no other shards were assigned";
} else {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,8 @@ public Decision canRebalance(ShardRouting shardRouting, RoutingAllocation alloca
public Decision canRebalance(RoutingAllocation allocation) {
int relocatingShards = allocation.routingNodes().getRelocatingShardCount();
if (allocation.isSimulating() && relocatingShards >= 2) {
// This branch should no longer run after https://github.com/elastic/elasticsearch/pull/134786
assert false : "allocation simulation should have returned earlier and not hit throttling";
// BalancedShardAllocator is prone to perform unnecessary moves when cluster_concurrent_rebalance is set to high values (>2).
// (See https://github.com/elastic/elasticsearch/issues/87279)
// Above allocator is used in DesiredBalanceComputer. Since we do not move actual shard data during calculation
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -152,6 +152,9 @@ public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, Routing
// Allocating a shard to this node will increase the incoming recoveries
int currentInRecoveries = allocation.routingNodes().getIncomingRecoveries(node.nodeId());
if (currentInRecoveries >= concurrentIncomingRecoveries) {
// See also https://github.com/elastic/elasticsearch/pull/134786
assert allocation.isSimulating() == false || concurrentIncomingRecoveries == 0
: "allocation simulation should have returned earlier and not hitting throttling";
return allocation.decision(
THROTTLE,
NAME,
Expand All @@ -169,6 +172,9 @@ public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, Routing
}
int primaryNodeOutRecoveries = allocation.routingNodes().getOutgoingRecoveries(primaryShard.currentNodeId());
if (primaryNodeOutRecoveries >= concurrentOutgoingRecoveries) {
// See also https://github.com/elastic/elasticsearch/pull/134786
assert allocation.isSimulating() == false || concurrentOutgoingRecoveries == 0
: "allocation simulation should have returned earlier and not hitting throttling";
return allocation.decision(
THROTTLE,
NAME,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -206,29 +206,33 @@ public RandomAllocationDecider(Random random) {

@Override
public Decision canRebalance(ShardRouting shardRouting, RoutingAllocation allocation) {
return getRandomDecision();
return getRandomDecision(allocation.isSimulating() == false || allocation.routingNodes().getRelocatingShardCount() > 0);
}

private Decision getRandomDecision() {
private Decision getRandomDecision(boolean canThrottle) {
if (alwaysSayYes) {
return Decision.YES;
}
return switch (random.nextInt(10)) {
case 9, 8, 7, 6, 5 -> Decision.NO;
case 4 -> Decision.THROTTLE;
case 4 -> canThrottle ? Decision.THROTTLE : Decision.YES;
case 3, 2, 1 -> Decision.YES;
default -> Decision.ALWAYS;
};
}

@Override
public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
return getRandomDecision();
return getRandomDecision(
allocation.isSimulating() == false
|| allocation.routingNodes().getIncomingRecoveries(node.nodeId()) > 0
|| allocation.routingNodes().getOutgoingRecoveries(node.nodeId()) > 0
);
}

@Override
public Decision canRemain(IndexMetadata indexMetadata, ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
return getRandomDecision();
return getRandomDecision(false); // throttle does not make sense for canRemain
}

}
Expand Down