Allow relocation to NOT_PREFERRED node for evacuating shards #140197

Merged
nicktindall merged 8 commits into elastic:main from nicktindall:fix_canMove_regression
Jan 6, 2026

Conversation

@nicktindall
Contributor

@nicktindall nicktindall commented Jan 6, 2026

A change in #137228 meant we sometimes return canMove=NOT_PREFERRED from BalancedShardsAllocator#decideMove, but we would still only execute a move when canMove=YES. As a result, we were unable to relocate shards from a shutting-down node to a not-preferred allocation.

This PR fixes the logic to execute the move when canMove is YES or NOT_PREFERRED.
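
For readers skimming the thread, the behaviour change described above can be sketched roughly as follows. This is a simplified illustration, not the actual Elasticsearch code: the class `EvacuationMoveSketch`, the enum `CanMove`, and both method names are hypothetical stand-ins, and the real decision carries more state than a boolean and an enum.

```java
// Illustrative sketch only: every name here is hypothetical,
// not taken from the Elasticsearch code base.
final class EvacuationMoveSketch {

    /** Simplified stand-in for the outcome of the can-move check. */
    enum CanMove { YES, NOT_PREFERRED, NO }

    /** Behaviour before the fix: a move is executed only when canMove is YES. */
    static boolean executesMoveBeforeFix(boolean canRemain, CanMove canMove) {
        return canRemain == false && canMove == CanMove.YES;
    }

    /** Behaviour after the fix: NOT_PREFERRED is also enough to execute the move. */
    static boolean executesMoveAfterFix(boolean canRemain, CanMove canMove) {
        return canRemain == false
            && (canMove == CanMove.YES || canMove == CanMove.NOT_PREFERRED);
    }
}
```

Under these assumptions, a shard that cannot remain on a shutting-down node and only has a NOT_PREFERRED target is left in place by the first method and moved by the second.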

Contributor

@DiannaHohensee DiannaHohensee left a comment

didn't look at test, but I think we should fix the other call-site?

final var shardRouting = storedShardMovement.shardRouting();
final var index = projectIndex(shardRouting);
final var moveDecision = refreshDecisionIfRequired(index, storedShardMovement, shardMoved);
if (moveDecision.isDecisionTaken() && moveDecision.cannotRemainAndCanMove()) {
Contributor

We've got to change this, too, I believe?

refreshDecisionIfRequired seems to wind down to the same decideMove method, and even if not-preferred didn't happen now, it could easily happen in future.

Contributor Author

Fixed in cb607f1


- if (moveDecision.isDecisionTaken() && moveDecision.cannotRemainAndCanMove()) {
+ if (moveDecision.isDecisionTaken()
+     && (moveDecision.cannotRemainAndCanMove() || moveDecision.cannotRemainAndNotPreferredMove())) {
Contributor

cannotRemainAndCanMove caught my eye a while ago, and I wanted to make it something like cannotRemainAndCanMoveYes. However, now, I wonder if we should just keep cannotRemainAndCanMove and implement that single method as

cannotRemain() && (canMoveDecision == AllocationDecision.YES || canMoveDecision == AllocationDecision.NOT_PREFERRED); ?

Both production callers of cannotRemainAndCanMove now need to check both, I think.
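
As a rough illustration of the single-method idea suggested here, a minimal self-contained sketch could look like the following. The class `MoveDecisionSketch` and its nested `AllocationDecision` enum are simplified, hypothetical stand-ins (the real types have more values and fields); only the method name `cannotRemainAndCanMove` and the boolean expression come from the comment above.

```java
// Illustrative sketch only: a cut-down stand-in for the real MoveDecision.
final class MoveDecisionSketch {

    /** Simplified stand-in enum; the real AllocationDecision has more values. */
    enum AllocationDecision { YES, NOT_PREFERRED, NO }

    private final boolean canRemain;
    private final AllocationDecision canMoveDecision;

    MoveDecisionSketch(boolean canRemain, AllocationDecision canMoveDecision) {
        this.canRemain = canRemain;
        this.canMoveDecision = canMoveDecision;
    }

    /** True when the shard is not allowed to stay on its current node. */
    boolean cannotRemain() {
        return canRemain == false;
    }

    /** Treats both YES and NOT_PREFERRED as "can move", so callers need a single check. */
    boolean cannotRemainAndCanMove() {
        return cannotRemain()
            && (canMoveDecision == AllocationDecision.YES
                || canMoveDecision == AllocationDecision.NOT_PREFERRED);
    }
}
```

With this shape, a caller would only need `moveDecision.cannotRemainAndCanMove()` and could not accidentally forget the NOT_PREFERRED case.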

Contributor Author

Yeah there will need to be some consideration of how to do this because it's used in other places too. I'm not sure what the tidiest approach is

Contributor

The only uses are these two callers in BalancedShardsAllocator and the MoveDecision printing, so it's pretty contained.

My preference would be to combine both under cannotRemainAndCanMove, or to rename cannotRemainAndCanMove to include '-Yes-' someplace, so that it's too obvious for this mistake to be made again. But it's not a commit blocker for me if you strongly prefer something else.

Contributor Author

Yep, good call. I think it makes more sense that way too. Changed in cb607f1

Contributor

@DiannaHohensee DiannaHohensee left a comment

testing looks fine with a couple superficial nits.

@nicktindall nicktindall marked this pull request as ready for review January 6, 2026 03:13
@elasticsearchmachine elasticsearchmachine added the needs:triage label Jan 6, 2026
Contributor

@DiannaHohensee DiannaHohensee left a comment

Approving, straightforward feedback. Pls fix the commit title before merging :)

@elastic elastic deleted a comment from elasticmachine Jan 6, 2026
@nicktindall nicktindall added the >bug and :Distributed/Allocation labels Jan 6, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (obsolete) label and removed the needs:triage label Jan 6, 2026
@elasticsearchmachine
Collaborator

Hi @nicktindall, I've created a changelog YAML for you.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@nicktindall nicktindall changed the title from "Fix canMove regression (hack)" to "Handle moveDecision returning NOT_PREFERRED correctly" Jan 6, 2026
@nicktindall nicktindall added the auto-backport and v9.3.1 labels Jan 6, 2026
@nicktindall nicktindall requested a review from ywangd January 6, 2026 06:14
Member

@ywangd ywangd left a comment

LGTM

Nit: Maybe the PR title can more directly say something like "allow relocation to NOT_PREFERRED node for evacuating shards"?

@nicktindall nicktindall changed the title from "Handle moveDecision returning NOT_PREFERRED correctly" to "Allow relocation to NOT_PREFERRED node for evacuating shards" Jan 6, 2026
Contributor

@henningandersen henningandersen left a comment

LGTM.

@nicktindall
Contributor Author

@elasticsearchmachine test this please

@nicktindall nicktindall merged commit 6630582 into elastic:main Jan 6, 2026
35 checks passed
@nicktindall nicktindall deleted the fix_canMove_regression branch January 6, 2026 12:51
nicktindall added a commit to nicktindall/elasticsearch that referenced this pull request Jan 6, 2026
@elasticsearchmachine
Collaborator

💚 Backport successful

Status Branch Result
9.3

Labels

auto-backport, >bug, :Distributed/Allocation, Team:Distributed Coordination (obsolete), v9.3.1, v9.4.0


5 participants