Skip to content
Merged
Show file tree
Hide file tree
Changes from 16 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -704,12 +704,14 @@ private boolean balanceByWeights(NodeSorter sorter) {
lowIdx = 0;
highIdx = relevantNodes - 1;

assert allocation.isSimulating() == false || routingNodes.getRelocatingShardCount() > 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have asked this in an earlier review round, but shouldn't we expect routingNodes.getRelocatingShardCount() to be exactly 1 when isSimulating is true? IIUC, tryRelocateShard will increment routingNodes.getRelocatingShardCount() when there's a YES answer but leave it unchanged for THROTTLE, per this code

canAllocateOrRebalance == Type.YES
/* only allocate on the cluster if we are not throttled */
? routingNodes.relocateShard(shard, minNode.getNodeId(), shardSize, "rebalance", allocation.changes()).v1()
: shard.relocate(minNode.getNodeId(), shardSize)

Combined with the one shard move and then exit logic added recently, the assert would then be

simulation is false
OR
relocatingShardCount == 1 (and implicitly simulation is true)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we can tighten it further see 6cdfdc1

: "unexpected THROTTLE decision (simulation="
+ allocation.isSimulating()
+ ") when balancing index ["
+ index
+ "]";

if (routingNodes.getRelocatingShardCount() > 0) {
// ES-12955: Check routingNodes.getRelocatingShardCount() > 0 in case the first relocation is a THROTTLE.
// This should rarely happen since in most cases, we don't throttle unless there is an existing relocation.
// But it can happen in production for frozen indices when the cache is still being prepared. It can also
// happen in tests because we have decider like RandomAllocationDecider that can randomly return THROTTLE
// when there is no existing relocation.
shardBalanced = true;
}
if (completeEarlyOnShardAssignmentChange && shardBalanced) {
Expand Down Expand Up @@ -821,6 +823,20 @@ public boolean moveShards() {
shardRouting,
bestNonPreferredShardMovementsTracker::shardIsBetterThanCurrent
);
// A THROTTLE allocation decision can happen when not simulating
assert moveDecision.isDecisionTaken() == false
|| moveDecision.getAllocationDecision() != AllocationDecision.THROTTLED
|| allocation.isSimulating() == false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opt nit: might flip isSimulating above != THROTTLED check, seems easier to read.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure see 1726061

: "unexpected allocation decision ["
+ moveDecision.getAllocationDecision()
+ "] (simulation="
+ allocation.isSimulating()
+ ") with "
+ (shardMoved ? "" : "no ")
+ "prior shard movements when moving shard ["
+ shardRouting
+ "]";

if (moveDecision.isDecisionTaken() && moveDecision.cannotRemainAndCanMove()) {
// Defer moving of not-preferred until we've moved the NOs
if (moveDecision.getCanRemainDecision().type() == Type.NOT_PREFERRED) {
Expand All @@ -835,6 +851,18 @@ public boolean moveShards() {
} else if (moveDecision.isDecisionTaken() && moveDecision.canRemain() == false) {
logger.trace("[{}][{}] can't move", shardRouting.index(), shardRouting.id());
}

// A THROTTLE allocation decision can happen when not simulating
assert moveDecision.getAllocationDecision() != AllocationDecision.THROTTLED || allocation.isSimulating() == false
Copy link
Contributor

@DiannaHohensee DiannaHohensee Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be done right after the decideMove call? There's an early return path above on which we'd miss this assert.

: "unexpected allocation decision ["
+ moveDecision.getAllocationDecision()
+ "] (simulation="
+ allocation.isSimulating()
+ ") with "
+ (shardMoved ? "" : "no ")
+ "prior shard movements when moving shard ["
+ shardRouting
+ "]";
}

// If we get here, attempt to move one of the best not-preferred shards that we identified earlier
Expand Down Expand Up @@ -1233,6 +1261,21 @@ private boolean allocateUnassigned() {
ShardRouting shard = primary[i];
final ProjectIndex index = projectIndex(shard);
final AllocateUnassignedDecision allocationDecision = decideAllocateUnassigned(index, shard);

assert allocationDecision.isDecisionTaken() : "decision not taken for unassigned shard [" + shard + "]";

// If we see a THROTTLE decision, it's either:
// 1. Not simulating
// 2. Or, there is shard assigned before this one
assert allocationDecision.getAllocationStatus() != AllocationStatus.DECIDERS_THROTTLED
|| allocation.isSimulating() == false
|| shardAssignmentChanged
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unassigned replica shards can returning THROTTLE because only the primary shard path is THROTTLE free during simulation. At least until I get ES-12942 complete. I take it that's the problem here?

Since we have more or less ignored the allocateUnassigned code path for not-preferred, feel free to skip this assert: I'm not sure what benefit it adds right now. Or you can add a TODO referencing ES-12942 and I expect I should be able to tighten this up. We'll revisit allocateUnassigned for not-preferred in ES-13279, too.

Copy link
Member Author

@ywangd ywangd Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unassigned replica shards can returning THROTTLE because only the primary shard path is THROTTLE free during simulation. At least until I get ES-12942 complete. I take it that's the problem here?

Yes ThrottlingAllocationDecider can still return throttling decision for unassigne replica shards. But ES-12942 is a separate issue. It's about early return in DesiredBalanceComputer instead of BalancedShardAllocator. It currently short-circuits only for primary shard assignement. Hence, changing ThrottlingAllocationDecider returning YES for replica shards is not really relevant since replica shards can already be assigned today but we don't publish the balance quickly enough after they are assigned.

In summary, let me raise a follow-up to fix ThrottlingAllocationDecider so that it does not return throttling in simulation. It's all part of the current JIRA ES-12955

: "unexpected THROTTLE decision (simulation="
+ allocation.isSimulating()
+ ") with no prior assignment when allocating unassigned shard ["
+ shard
+ "]";

final String assignedNodeId = allocationDecision.getTargetNode() != null
? allocationDecision.getTargetNode().getId()
: null;
Expand Down Expand Up @@ -1269,9 +1312,6 @@ private boolean allocateUnassigned() {
assert allocationDecision.getAllocationStatus() == AllocationStatus.DECIDERS_THROTTLED;
final long shardSize = getExpectedShardSize(shard, ShardRouting.UNAVAILABLE_EXPECTED_SHARD_SIZE, allocation);
minNode.addShard(projectIndex(shard), shard.initialize(minNode.getNodeId(), null, shardSize));
// If we see a throttle decision in simulation, there must be other shards that got assigned before it.
assert allocation.isSimulating() == false || shardAssignmentChanged
: "shard " + shard + " was throttled but no other shards were assigned";
} else {
if (logger.isTraceEnabled()) {
logger.trace("No Node found to assign shard [{}]", shard);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -486,6 +486,13 @@ public DesiredBalance compute(
|| info.lastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_THROTTLED) : "Unexpected stats in: " + info;

if (hasChanges == false && info.lastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_THROTTLED) {
// Unassigned ignored shards must be based on the provided set of ignoredShards
assert ignoredShards.contains(discardAllocationStatus(shard))
|| ignoredShards.stream().filter(ShardRouting::primary).anyMatch(primary -> primary.shardId().equals(shard.shardId()))
: "ignored shard "
+ shard
+ " unexpectedly has THROTTLE status and no counterpart in the provided ignoredShards set "
+ ignoredShards;
// Simulation could not progress due to missing information in any of the deciders.
// Currently, this could happen if `HasFrozenCacheAllocationDecider` is still fetching the data.
// Progress would be made after the followup reroute call.
Comment on lines 488 to 498
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment about HasFrozenCacheAllocationDecider had me concerned for some time. Adding this assertion to make it clear that throttled status can only come from DesiredBalanceInput#ignoredShards, i.e. they are not produced by BalancedShardAllocator.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -144,6 +144,8 @@ public Decision canRebalance(RoutingAllocation allocation) {
int relocatingFrozenShards = allocation.routingNodes().getRelocatingFrozenShardCount();
int relocatingShards = allocation.routingNodes().getRelocatingShardCount();
if (allocation.isSimulating() && relocatingShards >= 2) {
// This branch should no longer run after https://github.com/elastic/elasticsearch/pull/134786
assert false : "allocation simulation should have returned earlier and not hit throttling";
// BalancedShardAllocator is prone to perform unnecessary moves when cluster_concurrent_rebalance is set to high values (>2).
// (See https://github.com/elastic/elasticsearch/issues/87279)
// Above allocator is used in DesiredBalanceComputer. Since we do not move actual shard data during calculation
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -206,29 +206,33 @@ public RandomAllocationDecider(Random random) {

@Override
public Decision canRebalance(ShardRouting shardRouting, RoutingAllocation allocation) {
return getRandomDecision();
return getRandomDecision(allocation.isSimulating() == false || allocation.routingNodes().getRelocatingShardCount() > 0);
}

private Decision getRandomDecision() {
private Decision getRandomDecision(boolean canThrottle) {
if (alwaysSayYes) {
return Decision.YES;
}
return switch (random.nextInt(10)) {
case 9, 8, 7, 6, 5 -> Decision.NO;
case 4 -> Decision.THROTTLE;
case 4 -> canThrottle ? Decision.THROTTLE : Decision.YES;
case 3, 2, 1 -> Decision.YES;
default -> Decision.ALWAYS;
};
}

@Override
public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
return getRandomDecision();
return getRandomDecision(
allocation.isSimulating() == false
|| allocation.routingNodes().getIncomingRecoveries(node.nodeId()) > 0
|| allocation.routingNodes().getOutgoingRecoveries(node.nodeId()) > 0
);
}

@Override
public Decision canRemain(IndexMetadata indexMetadata, ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
return getRandomDecision();
return getRandomDecision(false); // throttle does not make sense for canRemain
}

}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,12 @@ public class HasFrozenCacheAllocationDecider extends AllocationDecider {
"value of [" + SHARED_CACHE_SIZE_SETTING.getKey() + "] on this node is not known yet"
);

private static final Decision NO_STILL_FETCHING = Decision.single(
Decision.Type.NO,
NAME,
"Shard movement is not allowed in simulation when value of [" + SHARED_CACHE_SIZE_SETTING.getKey() + "] on this node is not known"
);

private static final Decision HAS_FROZEN_CACHE = Decision.single(
Decision.Type.YES,
NAME,
Expand Down Expand Up @@ -62,26 +68,26 @@ public HasFrozenCacheAllocationDecider(FrozenCacheInfoService frozenCacheService

@Override
public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
return canAllocateToNode(allocation.metadata().indexMetadata(shardRouting.index()), node.node());
return canAllocateToNode(allocation.metadata().indexMetadata(shardRouting.index()), node.node(), allocation);
}

@Override
public Decision canRemain(IndexMetadata indexMetadata, ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
return canAllocateToNode(indexMetadata, node.node());
return canAllocateToNode(indexMetadata, node.node(), allocation);
}

@Override
public Decision canAllocate(IndexMetadata indexMetadata, RoutingNode node, RoutingAllocation allocation) {
return canAllocateToNode(indexMetadata, node.node());
return canAllocateToNode(indexMetadata, node.node(), allocation);
}

@Override
public Decision shouldAutoExpandToNode(IndexMetadata indexMetadata, DiscoveryNode node, RoutingAllocation allocation) {
return canAllocateToNode(indexMetadata, node);
return canAllocateToNode(indexMetadata, node, allocation);
}

// Package private for tests
Decision canAllocateToNode(IndexMetadata indexMetadata, DiscoveryNode discoveryNode) {
Decision canAllocateToNode(IndexMetadata indexMetadata, DiscoveryNode discoveryNode, RoutingAllocation allocation) {
if (indexMetadata.isPartialSearchableSnapshot() == false) {
return Decision.ALWAYS;
}
Expand All @@ -90,7 +96,8 @@ Decision canAllocateToNode(IndexMetadata indexMetadata, DiscoveryNode discoveryN
case HAS_CACHE -> HAS_FROZEN_CACHE;
case NO_CACHE -> NO_FROZEN_CACHE;
case FAILED -> UNKNOWN_FROZEN_CACHE;
case FETCHING -> STILL_FETCHING;
// TODO: considering returning NO as well for non-simulation https://elasticco.atlassian.net/browse/ES-13378
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a public repo: should we shorten to ES-13378 since the link won't work for external users?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Pushed 228c35d

case FETCHING -> allocation.isSimulating() ? NO_STILL_FETCHING : STILL_FETCHING;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed over the other PR. I am keeping throttle for non-simulation case for the time being. I added a link to the JIRA ticket for considering its removal.

case UNKNOWN -> NO_UNKNOWN_NODE;
};
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@

import org.elasticsearch.cluster.metadata.IndexMetadata;
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;
import org.elasticsearch.cluster.routing.allocation.decider.Decision;
import org.elasticsearch.index.IndexVersion;
import org.elasticsearch.test.ESTestCase;
Expand All @@ -27,6 +28,7 @@ public class HasFrozenCacheAllocationDeciderTests extends ESTestCase {

public void testCanAllocateToNode() {
final var frozenCacheService = mock(FrozenCacheInfoService.class);
final var allocation = mock(RoutingAllocation.class);
final var decider = new HasFrozenCacheAllocationDecider(frozenCacheService);

final var indexMetadata = IndexMetadata.builder(randomIdentifier())
Expand All @@ -35,7 +37,8 @@ public void testCanAllocateToNode() {

for (var nodeState : NodeState.values()) {
when(frozenCacheService.getNodeState(any(DiscoveryNode.class))).thenReturn(nodeState);
assertThat(decider.canAllocateToNode(indexMetadata, mock(DiscoveryNode.class)), equalTo(Decision.ALWAYS));
when(allocation.isSimulating()).thenReturn(randomBoolean());
assertThat(decider.canAllocateToNode(indexMetadata, mock(DiscoveryNode.class), allocation), equalTo(Decision.ALWAYS));
}

final var partialSearchableSnapshotIndexMetadata = IndexMetadata.builder(randomIdentifier())
Expand All @@ -47,12 +50,18 @@ public void testCanAllocateToNode() {

for (var nodeState : NodeState.values()) {
when(frozenCacheService.getNodeState(any(DiscoveryNode.class))).thenReturn(nodeState);
final Decision decision = decider.canAllocateToNode(partialSearchableSnapshotIndexMetadata, mock(DiscoveryNode.class));
final boolean isSimulating = randomBoolean();
when(allocation.isSimulating()).thenReturn(isSimulating);
final Decision decision = decider.canAllocateToNode(
partialSearchableSnapshotIndexMetadata,
mock(DiscoveryNode.class),
allocation
);
final Decision.Type expectedType;
if (nodeState == NodeState.HAS_CACHE) {
expectedType = Decision.Type.YES;
} else if (nodeState == NodeState.FETCHING) {
expectedType = Decision.Type.THROTTLE;
expectedType = isSimulating ? Decision.Type.NO : Decision.Type.THROTTLE;
} else {
expectedType = Decision.Type.NO;
}
Expand Down