Determine remote cluster version #142494

Merged
elasticsearchmachine merged 30 commits into elastic:main from
joshua-adams-1:reindexing-plumb-pittable-hit-source
Feb 27, 2026

@joshua-adams-1
Contributor

Extends the reindexing flow to determine the remote cluster version (if applicable), before slicing.

Relates: https://github.com/elastic/elasticsearch-team/issues/2088

@joshua-adams-1 joshua-adams-1 self-assigned this Feb 13, 2026
@joshua-adams-1 joshua-adams-1 added >non-issue :Distributed/Reindex Issues relating to reindex that are not caused by issues further down labels Feb 13, 2026
@joshua-adams-1
Contributor Author

This is step 4.1 of https://github.com/elastic/elasticsearch-team/issues/2088.

The current reindexing flow is:

  1. Initialise and execute the reindexing task
  2. Slice the request (if applicable). For each slice, send a reindexing REST request and go back to step 1.
  3. Now that we're working with a single slice, we open a scroll for that slice and iterate through each batch of 1000 items, reindexing them into the destination index. If we're reindexing from remote, then for each slice we determine the remote cluster version and modify our scroll requests to comply with backwards compatibility (bwc). Note that if there are N slices, we make N calls to the remote cluster to get the same information (this will be refactored later).
  4. We close each scroll
  5. Once all slices are finished, we finish the task

To support PIT we need to:

  1. Open a PIT
  2. Search using the PIT
  3. Close the PIT.
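
As a rough illustration of the three steps above, here is a minimal synchronous sketch in which one PIT is opened, searched once per slice, and always closed. The `Pit` class, the `pit-1` id, and the event log are hypothetical stand-ins, not Elasticsearch classes, and the real reindexing flow is asynchronous.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types are Elasticsearch classes.
final class PitLifecycleSketch {

    /** Records lifecycle events so the ordering is observable. */
    static final List<String> events = new ArrayList<>();

    /** A stand-in PIT handle that try-with-resources closes automatically. */
    static final class Pit implements AutoCloseable {
        final String id;

        Pit(String id) {
            this.id = id;
            events.add("open " + id);
        }

        void search(int slice) {
            events.add("search " + id + " slice " + slice);
        }

        @Override
        public void close() {
            events.add("close " + id);
        }
    }

    /** Opens one PIT, searches it once per slice, then closes it. */
    static List<String> run(int slices) {
        events.clear();
        try (Pit pit = new Pit("pit-1")) {
            for (int slice = 0; slice < slices; slice++) {
                pit.search(slice);
            }
        }
        return new ArrayList<>(events);
    }
}
```

Using try-with-resources here is only one way to guarantee the close step runs even if a search throws; an asynchronous implementation would need to close the PIT from its completion and failure callbacks instead.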

Since one PIT is shared between all slices, the PIT must be opened before we slice (i.e. a step 1.5 in the above flow). Before we open a PIT we need to decide whether we should. To decide, we need to check whether we're reindexing from a remote cluster. If we are, the remote version must be high enough to support PIT; otherwise we fall back to scroll. This change adds the logic to determine the remote cluster version. It does not USE the remote version.
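
The decision described above might look roughly like the following sketch. Everything in it is an assumption made for illustration: the `Version` record is not `org.elasticsearch.Version`, the minimum version is a placeholder (this PR does not state the threshold), and treating a local reindex as PIT-capable is a guess about later changes.

```java
// Hypothetical sketch of the "should we use PIT?" check; not code from this PR.
final class PitDecisionSketch {

    /** Minimal version value for illustration; not org.elasticsearch.Version. */
    record Version(int major, int minor) implements Comparable<Version> {
        @Override
        public int compareTo(Version other) {
            int byMajor = Integer.compare(major, other.major);
            return byMajor != 0 ? byMajor : Integer.compare(minor, other.minor);
        }
    }

    // ASSUMPTION: placeholder threshold; the PR does not say which version is required.
    static final Version HYPOTHETICAL_MIN_PIT_VERSION = new Version(7, 10);

    static boolean shouldUsePit(boolean featureFlagEnabled, boolean remote, Version remoteVersion) {
        if (!featureFlagEnabled) {
            return false; // flag off: keep using scroll
        }
        if (!remote) {
            return true; // ASSUMPTION: a local reindex can always use PIT
        }
        // Remote reindex: an unknown (null) or too-old version falls back to scroll.
        return remoteVersion != null && remoteVersion.compareTo(HYPOTHETICAL_MIN_PIT_VERSION) >= 0;
    }
}
```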

A couple of things to note:

  1. This logic already exists, but I am moving it higher by extracting it into a utils class.
  2. Since the new code is hidden behind a feature flag, remote reindexing requests incur no additional overhead in the interim (the overhead would otherwise be calculating the remote version twice).
  3. Once we use PIT, it is redundant to then redetermine the remote cluster version on every slice, so I shall be removing that code completely. However, to do this, the remote version must be passed from leader to worker task, and I didn't want to obfuscate these changes.

Next Steps:

  1. Now we have the remote version, we can use it inside BulkByScrollParallelizationHelper.executeSlicedAction to open the PIT before slicing
  2. We can extend LeaderBulkByScrollTaskState to store the remote cluster version, and then remove the redundant check from inside RemoteScrollableHitSource

* while it is under development.
*/
// TODO - DELETE. Only needed for local development
static boolean REINDEX_PIT_SEARCH_ENABLED = new FeatureFlag("reindex_pit_search_enabled").isEnabled();
Contributor Author

Added this as a placeholder until #142035 is merged

* Verifies that lookupRemoteVersion correctly parses historical and
* forward-compatible main action responses.
*/
public void testLookupRemoteVersion() throws Exception {
Contributor Author

This test, testLookupRemoteVersionFailsWithoutContentType and testWrapExceptionToPreserveStatus were adapted from RemoteScrollableHitSourceTests. I added the rest to plug testing gaps.

Member

@PeteGillinElastic PeteGillinElastic left a comment

Thanks, this looks like a good start. I have one comment about a potential change of approach. It may not work, but it might be cleaner if it does. (I have a few tiny nits as well, sorry...)

I haven't looked at the tests yet; I figured it would be worth converging on the overall approach first.

public class RemoteReindexingUtils {

public static void lookupRemoteVersion(RejectAwareActionListener<Version> listener, ThreadPool threadPool, RestClient client) {
execute(new Request("GET", ""), MAIN_ACTION_PARSER, listener, threadPool, client);
Member

Micro-nit: I would consider "/" marginally more readable than "". Obviously they're equivalent, but (a) I think that / is the canonical form which the empty path would be normalized into, and (b) I think it's more obvious to the reader what the parameter is if we use "/". (I realize you're just moving this code, but no harm in improving it!)

RestClient client
) {
// Preserve the thread context so headers survive after the call
java.util.function.Supplier<ThreadContext.StoredContext> contextSupplier = threadPool.getThreadContext().newRestorableContext(true);
Member

Nit: We can import j.u.f.Supplier here. (We couldn't in the place you're moving it from because that's importing a different Supplier.)

private final ScriptService scriptService;
private final ReindexSslConfig reindexSslConfig;
private final ReindexMetrics reindexMetrics;
Version remoteVersion;
Member

I would prefer not to add this mutable state to Reindexer, if we can avoid it. It makes it a bit harder to reason about, because whenever we're looking at any code we'll have to think carefully to figure out whether this thing will have been initialized yet or not.

Can we make lookupRemoteVersion take an ActionListener<Version> instead of ActionListener<Void>, and plumb the version through as an additional method parameter to initTask and then BulkByScrollParallelizationHelper.initTaskState, and then through to the LeaderBulkByScrollTaskState or WorkerBulkByScrollTaskState?

I haven't tried that, so I could be wrong, but it feels like it should be possible.

If you end up having to have this mutable state, please can we make it private, and add a getter if needed, so that at least it can only be set from within this class.
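
To make the suggestion concrete, here is a small sketch of passing the version along as a callback value and method parameter rather than storing it in a mutable field. The types and the `start` method are hypothetical simplifications (plain `Consumer` callbacks instead of `ActionListener`, a made-up version value), not the actual Reindexer code.

```java
import java.util.function.Consumer;

// Hypothetical sketch; names mirror the discussion but are not the real classes.
final class PlumbVersionSketch {

    record Version(int major, int minor) {}

    /** Stand-in for task state that receives the version at construction time. */
    record TaskState(Version remoteVersion) {}

    /** Stand-in for the asynchronous lookup: delivers the version to the listener. */
    static void lookupRemoteVersion(Consumer<Version> listener) {
        listener.accept(new Version(8, 12)); // made-up value for illustration
    }

    /** The version arrives as a parameter, so no field has to be mutated later. */
    static TaskState initTask(Version remoteVersion) {
        return new TaskState(remoteVersion);
    }

    /** Chains lookup and initialisation; the version only lives in the call chain. */
    static void start(Consumer<TaskState> onReady) {
        lookupRemoteVersion(version -> onReady.accept(initTask(version)));
    }
}
```

With this shape, any code holding a `TaskState` knows the version was already resolved (or deliberately null) when the state was created, which is the reasoning benefit the comment is after.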

Member

Actually, come to think of it: Is there a reason why you trigger the lookupRemoteVersion call in the transport action, as a wrapper around initTask, rather than triggering it inside the implementation of initTask? Again, I haven't tried it — but, if you can do the latter, you're reducing the amount of the internals of Reindexer that you're exposing to the transport action, right?

protected void doStart(RejectAwareActionListener<Response> searchListener) {
lookupRemoteVersion(RejectAwareActionListener.withResponseHandler(searchListener, version -> {
remoteVersion = version;
if (remoteVersion != null) {
Contributor Author

When the feature flag is turned on, the remote version will be non-null, and we'll save the cost of the second lookupRemoteVersion call.
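
The skip described here can be sketched as follows. This is a simplified, hypothetical model (plain `Consumer` callbacks, a counter to make the lookups visible, a made-up version value), not the code in this PR.

```java
import java.util.function.Consumer;

// Hypothetical sketch of skipping the lookup when a version was resolved upstream.
final class SkipLookupSketch {

    record Version(int major, int minor) {}

    private final Version initialRemoteVersion; // null when nothing was resolved upstream
    int lookups = 0; // counts simulated remote calls

    SkipLookupSketch(Version initialRemoteVersion) {
        this.initialRemoteVersion = initialRemoteVersion;
    }

    /** Stand-in for the remote call; the returned version is made up. */
    private void lookupRemoteVersion(Consumer<Version> listener) {
        lookups++;
        listener.accept(new Version(8, 0));
    }

    /** Uses the pre-resolved version if present, otherwise performs the lookup. */
    void doStart(Consumer<Version> onVersion) {
        if (initialRemoteVersion != null) {
            onVersion.accept(initialRemoteVersion);
        } else {
            lookupRemoteVersion(onVersion);
        }
    }
}
```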

client,
listener.delegateFailure((l, v) -> executeSlicedAction(task, request, action, l, client, node, workerAction))
listener.delegateFailure(
(l, v) -> executeSlicedAction(task, request, action, l, client, node, null, version -> workerAction.run())
Contributor Author

@joshua-adams-1 joshua-adams-1 Feb 24, 2026

This is called exclusively by update-by-query and delete-by-query (which have no concept of remote versions since they're executed on the local node). Therefore, setting the remote version parameter to null has no effect. Subsequent changes to introduce PIT will be behind the REINDEX_PIT_SEARCH_ENABLED feature flag anyway, but in the future, once that flag is lifted, this will fail the 'use pit' check and default to scroll.

reindexSslConfig,
request,
wrapWithMetrics(listener, reindexMetrics, startTime, request.getRemoteInfo() != null),
remoteVersion
Contributor Author

This is unchanged, except that I pass the remoteVersion downstream.

sslConfig
);
this.destinationIndexIdMapper = destinationIndexMode(state).idFieldMapperWithoutFieldData();
this.remoteVersion = remoteVersion;
Contributor Author

I store it as a state variable so it can be referenced by buildScrollableResultSource below.

remoteInfo,
searchRequest
searchRequest,
remoteVersion
Contributor Author

If the PIT feature flag is enabled but we're reindexing from a remote node whose version predates the introduction of PIT, we need to fall back to scroll. However, since we've already done the remote version lookup, we may as well pass this value in.

If this is a scrollable workflow, then remoteVersion is null.

* Creates a RemoteScrollablePaginatedHitSource with a pre-resolved initial remote version so that doStart skips the version lookup.
* The mock client serves only the given response paths (one request = one path when using initial version).
*/
private RemoteScrollablePaginatedHitSource sourceWithInitialRemoteVersion(Version initialRemoteVersion, String... paths)
Contributor Author

This is only called by testDoStartSkipsVersionLookupWhenInitialRemoteVersionSet.

@joshua-adams-1 joshua-adams-1 marked this pull request as ready for review February 24, 2026 12:18
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team. label Feb 24, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

RemoteInfo remoteInfo = request.getRemoteInfo();
assert reindexSslConfig != null : "Reindex ssl config must be set";
RestClient restClient = buildRestClient(remoteInfo, reindexSslConfig, task.getId(), synchronizedList(new ArrayList<>()));
RejectAwareActionListener<Version> rejectAwareListener = new RejectAwareActionListener<>() {
Contributor

Currently the remote version lookup has retries as part of starting the search. It'd be nice to add retries here too?

Contributor Author

Good catch - will add now
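
For readers unfamiliar with the pattern, a retry with exponential backoff around a lookup could be sketched like this. It is a synchronous, hypothetical simplification: it retries on any `RuntimeException`, whereas the real code is asynchronous and built on the listener and backoff types shown in the diff context.

```java
import java.util.function.Supplier;

// Hypothetical sketch of retrying a lookup with exponential backoff.
final class RetryLookupSketch {

    /**
     * Calls the lookup until it succeeds or maxRetries retries have been used,
     * doubling the delay after each failed attempt.
     */
    static <T> T withRetries(Supplier<T> lookup, int maxRetries, long initialDelayMillis) {
        long delay = initialDelayMillis;
        for (int attempt = 0; ; attempt++) {
            try {
                return lookup.get();
            } catch (RuntimeException e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted: surface the last failure
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while backing off", ie);
                }
                delay *= 2;
            }
        }
    }
}
```

A production version would typically retry only on rejections or other transient failures and would schedule the next attempt on a thread pool rather than sleeping.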

try {
client.performRequestAsync(request, new ResponseListener() {
@Override
public void onSuccess(org.elasticsearch.client.Response response) {
Contributor

Fully qualified name seems unnecessary, just import the class? It's easier to read. Same for other places.

exponentialBackoff(request.getRetryBackoffInitialTime(), request.getMaxRetries()),
threadPool,
restClient,
task.getWorkerState()::countSearchRetry,
Contributor

I think task.getWorkerState() would throw if the task is not a worker. Leader task would also call lookupRemoteVersionAndExecute, right?

Contributor

I guess it's fine as long as slicing is not enabled for remote source, but it does seem a bit trappy.

Contributor

Actually, I think we could just remove this runnable. countSearchRetry increments the searchRetries counter, which is eventually reported in the task response. It seems inaccurate to count the retries for fetching the remote version towards search retries. Even though the old behaviour does that, it feels unintentional.

Contributor

@samxbr samxbr left a comment

LGTM, just have some non-blocking minor comments.

@joshua-adams-1 joshua-adams-1 added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Feb 27, 2026
@elasticsearchmachine elasticsearchmachine merged commit 0da8e46 into elastic:main Feb 27, 2026
35 checks passed
@joshua-adams-1 joshua-adams-1 deleted the reindexing-plumb-pittable-hit-source branch February 27, 2026 11:35
PeteGillinElastic pushed a commit to PeteGillinElastic/elasticsearch that referenced this pull request Feb 27, 2026
Extends the reindexing flow to determine the remote cluster version (if
applicable), before slicing.

Relates: elastic/elasticsearch-team#2088
szybia added a commit to szybia/elasticsearch that referenced this pull request Feb 27, 2026
…cations

* upstream/main: (35 commits)
  Create ARM bulk sqrI8 implementation (elastic#142461)
  Rework get-snapshots predicates (elastic#143161)
  Refactor downsampling fetchers and producers (elastic#140357)
  ESQL: Unmute test and add extra logging to generative test validation (elastic#143168)
  Fix metadata fields being nullified/loaded by unmapped_fields setting (elastic#143155)
  Determine remote cluster version (elastic#142494)
  Populate failure message for aborted clones (elastic#143206)
  Allow kibana_system role to read and manage logs streams (elastic#143053)
  Mute org.elasticsearch.xpack.esql.CsvIT test {csv-spec:eval.DocsLength} elastic#143224
  Mute org.elasticsearch.xpack.esql.CsvIT test {csv-spec:eval.DocsByteLength} elastic#143223
  Mute org.elasticsearch.xpack.esql.CsvIT test {csv-spec:docs.DocsBitLength} elastic#143222
  Fix FloatVectorScorerSupplier bulkScore bug (elastic#143211)
  ESQL: Add data node execution for external sources (elastic#143209)
  [ESQL] Cleanup commands docs (elastic#143058)
  [ML]Fix latest transforms disregarding updates when sort and sync fields are non-monotonic (elastic#142856)
  Mute org.elasticsearch.index.mapper.IpFieldMapperTests testSyntheticSourceInObject elastic#143212
  Tests: Fix StoreDirectoryMetricsIT (elastic#143084)
  ESQL: Add distribution strategy for external sources (elastic#143194)
  CSV IT spec (elastic#142585)
  Fix VectorScorerOSQBenchmark.score to read corrections properly (elastic#143137)
  ...
tballison pushed a commit to tballison/elasticsearch that referenced this pull request Mar 3, 2026
Extends the reindexing flow to determine the remote cluster version (if
applicable), before slicing.

Relates: elastic/elasticsearch-team#2088

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed/Reindex Issues relating to reindex that are not caused by issues further down >non-issue Team:Distributed Meta label for distributed team. v9.4.0

4 participants