Skip to content

Add helper to pick node for reindex relocation#139081

Merged
PeteGillinElastic merged 8 commits intoelastic:mainfrom
PeteGillinElastic:reindex-relocation-node-picker
Dec 18, 2025
Merged

Add helper to pick node for reindex relocation#139081
PeteGillinElastic merged 8 commits intoelastic:mainfrom
PeteGillinElastic:reindex-relocation-node-picker

Conversation

@PeteGillinElastic
Copy link
Member

@PeteGillinElastic PeteGillinElastic commented Dec 4, 2025

This adds a helper class which can be used to pick a node to relocate a reindex task to on shutdown.

The plugin exposes an implementation of the
ReindexRelocationNodePicker interface as a component, so that in a future change it can be injected into the transport action, and from there passed into the Reindexer, and from there used to actually relocate the task on shutdown.

By default, it uses the DefaultReindexRelocationNodePicker. Another plugin may override this by extending the reindex plugin and making the implementation available via SPI.

Closes elastic/elasticsearch-team#2092

@PeteGillinElastic PeteGillinElastic added >non-issue :Distributed/Reindex Issues relating to reindex that are not caused by issues further down labels Dec 4, 2025
This adds a helper class which can be used to pick a node to relocate
a reindex task to on shutdown.

The plugin exposes an implementation of the
`ReindexRelocationNodePicker` interface as a component, so that in a
future change it can be injected into the transport action, and from
there passed into teh `Reindexer`, and from there used to actually
relocate the task on shutdown.

By default, it uses the `DefaultReindexRelocationNodePicker`. Another
plugin may override this by extending the `reindex` plugin and making
the implementation available via SPI.
@PeteGillinElastic PeteGillinElastic force-pushed the reindex-relocation-node-picker branch from 24788c9 to f86685f Compare December 9, 2025 13:03
@PeteGillinElastic PeteGillinElastic marked this pull request as ready for review December 9, 2025 13:08
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing (obsolete) Meta label for Distributed Indexing team. Obsolete. Please do not use. label Dec 9, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

);
return null;
}
List<String> eligibleDedicatedCoordinatingNodes = nodes.getCoordinatingOnlyNodes()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question for @Tim-Brooks: Is this the correct way to find what the doc describes as "dedicated coordinating nodes"? The name sounds right, but the implementation just filters out data, master, and ingest nodes — so nodes with roles like ml or voting_only would be included.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO - it is likely the case that this logic is bugged. A dedicated coordinating node should have no roles as outlined by the documentation: https://www.elastic.co/docs/deploy-manage/distributed-architecture/clusters-nodes-shards/node-roles#coordinating-only-node-role.

Looking at this logic I don't think you're introducing that issue. But you are changing this code (nodes.getCoordinatingOnlyNodes ) to be used in production opposed to just tests. So either I think the logic needs to be fixed here or you need to write your own function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's actually 100% true that DiscoveryNodes.getCoordinatingOnlyNodes() is used only in tests now. I think there are ways it is exposed to the user, although the only ways I've found are somewhat obscure. I created #139587 for this. I agree with you that, on the face of it, this is a bug which we should fix. But since that would be a behavioural change, I think it deserves its own issue, so that we can make sure we're aligned on it, and so that it can go in the release notes.

In the meantime, I simply won't use the method in this PR.

logger.debug("Chose dedicated coordinating node ID {} for relocating a reindex task from node {}", newNodeId, currentNodeId);
return newNodeId;
}
List<String> eligibleDataNodes = nodes.getDataNodes().keySet().stream().filter(id -> id.equals(currentNodeId) == false).toList();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question for @Tim-Brooks: The doc says "a valid node for index coordinating (data node, no dedicated master nodes)". Is this the correct interpretation of that? (I wasn't quite sure what "no dedicated master nodes" meant, as my understanding was that would mean nodes with only the master role, which therefore could never be a data node.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This looks fine to me (can contain data excluded dedicated master nodes).

Copy link
Contributor

@Tim-Brooks Tim-Brooks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the plugin infrastructure here looks good. But I do think we need to use the proper definition of coordinating only node which is nodes without any assigned roles.

logger.debug("Chose dedicated coordinating node ID {} for relocating a reindex task from node {}", newNodeId, currentNodeId);
return newNodeId;
}
List<String> eligibleDataNodes = nodes.getDataNodes().keySet().stream().filter(id -> id.equals(currentNodeId) == false).toList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This looks fine to me (can contain data excluded dedicated master nodes).

);
return null;
}
List<String> eligibleDedicatedCoordinatingNodes = nodes.getCoordinatingOnlyNodes()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO - it is likely the case that this logic is bugged. A dedicated coordinating node should have no roles as outlined by the documentation: https://www.elastic.co/docs/deploy-manage/distributed-architecture/clusters-nodes-shards/node-roles#coordinating-only-node-role.

Looking at this logic I don't think you're introducing that issue. But you are changing this code (nodes.getCoordinatingOnlyNodes ) to be used in production opposed to just tests. So either I think the logic needs to be fixed here or you need to write your own function.

Copy link
Member Author

@PeteGillinElastic PeteGillinElastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, PTAL.

);
return null;
}
List<String> eligibleDedicatedCoordinatingNodes = nodes.getCoordinatingOnlyNodes()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's actually 100% true that DiscoveryNodes.getCoordinatingOnlyNodes() is used only in tests now. I think there are ways it is exposed to the user, although the only ways I've found are somewhat obscure. I created #139587 for this. I agree with you that, on the face of it, this is a bug which we should fix. But since that would be a behavioural change, I think it deserves its own issue, so that we can make sure we're aligned on it, and so that it can go in the release notes.

In the meantime, I simply won't use the method in this PR.

@PeteGillinElastic
Copy link
Member Author

Thanks, PTAL.

Just in case, repeating this reply which I left but might have been stranded 'cos the code it was attached to went away:

I don't think it's actually 100% true that DiscoveryNodes.getCoordinatingOnlyNodes() is used only in tests now. I think there are ways it is exposed to the user, although the only ways I've found are somewhat obscure. I created #139587 for this. I agree with you that, on the face of it, this is a bug which we should fix. But since that would be a behavioural change, I think it deserves its own issue, so that we can make sure we're aligned on it, and so that it can go in the release notes.

In the meantime, I simply won't use the method in this PR.

Copy link
Contributor

@Tim-Brooks Tim-Brooks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

assuming the test failure is not related.

thanks for creating the issue.

@PeteGillinElastic PeteGillinElastic merged commit 76bfeb4 into elastic:main Dec 18, 2025
35 checks passed
@PeteGillinElastic PeteGillinElastic deleted the reindex-relocation-node-picker branch December 18, 2025 12:42
szybia added a commit to szybia/elasticsearch that referenced this pull request Dec 19, 2025
* upstream/main: (253 commits)
  Adds ST_SIMPLIFY geo spatial function (elastic#136309)
  Take control of max clause count verification in Lucene searcher (elastic#139752)
  [ML] Unmute Inference Test (elastic#139765)
  Parameterize the vector operation benchmark tests (elastic#139735)
  Fix node reduction pushdown tests for release tests (elastic#139548)
  Fix flakiness in TSDataGenerationHelper (elastic#139759)
  CPS: Copy existing resolved index expressions when constructing a new `SearchRequest` from an existing one (elastic#139596)
  Add release notes for v9.1.9 release (elastic#139674)
  Add lucene query for wildcards on high cardinality keyword fields. (elastic#139746)
  Suppress Tika entitlement warnings from AWT (elastic#139711)
  Check field storage when synthetic source is enabled, in tests (elastic#139715)
  Refactor VectorSimilarityType to know about its corresponding Function (elastic#139678)
  Merge fixes from patch branch back into main (elastic#139721)
  Define native bulk operations for vector square distance (elastic#139198)
  Use LongUpDownCounter for Linked Project Error Metrics (elastic#139657)
  ESQL: Add javadoc that explains version-aware planning (elastic#139706)
  Add helper to pick node for reindex relocation (elastic#139081)
  Fix auth serialization randomized version test (elastic#139182)
  ES|QL - Add parsing, preanalysis and analysis timing information to profile (elastic#139540)
  Mute org.elasticsearch.persistent.ClusterPersistentTasksCustomMetadataTests testMinVersionSerialization elastic#139741
  ...
PeteGillinElastic added a commit to PeteGillinElastic/elasticsearch that referenced this pull request Jan 7, 2026
This adds an implmentation of the helper which picks a node to
relocate a reindex task to on shutdown to be used in stateless.

This changes the mechanism for choosing the correct implementation
that was added in elastic#139081. The previous mechanism assumed that we'd
want *serverless* to behave differently, and so it defined a default
implementation and allowed for that to be swapped out via another
plugin. But we actually want *stateless* to behave differently, so we
can take the simpler approach of doing everything inside the `reindex`
plugin and just pick the correct implementation based on settings. The
implementation added previously has been renamed, because it is now
explicitly used in stateful rather than just being used by default.
PeteGillinElastic added a commit to PeteGillinElastic/elasticsearch that referenced this pull request Jan 7, 2026
This adds an implmentation of the helper which picks a node to
relocate a reindex task to on shutdown to be used in stateless.

This changes the mechanism for choosing the correct implementation
that was added in elastic#139081. The previous mechanism assumed that we'd
want *serverless* to behave differently, and so it defined a default
implementation and allowed for that to be swapped out via another
plugin. But we actually want *stateless* to behave differently, so we
can take the simpler approach of doing everything inside the `reindex`
plugin and just pick the correct implementation based on settings. The
implementation added previously has been renamed, because it is now
explicitly used in stateful rather than just being used by default.
PeteGillinElastic added a commit to PeteGillinElastic/elasticsearch that referenced this pull request Jan 7, 2026
This adds an implmentation of the helper which picks a node to
relocate a reindex task to on shutdown to be used in stateless.

This changes the mechanism for choosing the correct implementation
that was added in elastic#139081. The previous mechanism assumed that we'd
want *serverless* to behave differently, and so it defined a default
implementation and allowed for that to be swapped out via another
plugin. But we actually want *stateless* to behave differently, so we
can take the simpler approach of doing everything inside the `reindex`
plugin and just pick the correct implementation based on settings. The
implementation added previously has been renamed, because it is now
explicitly used in stateful rather than just being used by default.
PeteGillinElastic added a commit that referenced this pull request Jan 21, 2026
This adds an implmentation of the helper which picks a node to
relocate a reindex task to on shutdown to be used in stateless.

This changes the mechanism for choosing the correct implementation
that was added in #139081. The previous mechanism assumed that we'd
want *serverless* to behave differently, and so it defined a default
implementation and allowed for that to be swapped out via another
plugin. But we actually want *stateless* to behave differently, so we
can take the simpler approach of doing everything inside the `reindex`
plugin and just pick the correct implementation based on settings. The
implementation added previously has been renamed, because it is now
explicitly used in stateful rather than just being used by default.

Co-authored-by: Szymon Bialkowski <szybia@tuta.io>
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
spinscale pushed a commit to spinscale/elasticsearch that referenced this pull request Jan 21, 2026
This adds an implmentation of the helper which picks a node to
relocate a reindex task to on shutdown to be used in stateless.

This changes the mechanism for choosing the correct implementation
that was added in elastic#139081. The previous mechanism assumed that we'd
want *serverless* to behave differently, and so it defined a default
implementation and allowed for that to be swapped out via another
plugin. But we actually want *stateless* to behave differently, so we
can take the simpler approach of doing everything inside the `reindex`
plugin and just pick the correct implementation based on settings. The
implementation added previously has been renamed, because it is now
explicitly used in stateful rather than just being used by default.

Co-authored-by: Szymon Bialkowski <szybia@tuta.io>
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Reindex Issues relating to reindex that are not caused by issues further down >non-issue Team:Distributed Indexing (obsolete) Meta label for Distributed Indexing team. Obsolete. Please do not use. v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants