
[ML] Fix latest transforms disregarding updates when sort and sync fields are non-monotonic #142856

Merged: valeriy42 merged 12 commits into elastic:main from valeriy42:fix/is-90643 on Feb 27, 2026

@valeriy42 (Contributor) commented Feb 23, 2026

Continuous latest transforms could overwrite a newer document with an older one when the sort field and the sync field did not increase together. Checkpoint N queried only documents whose sync time fell in `[lastCheckpoint, nextCheckpoint)`, so earlier documents for the same key were invisible and `top_hits` could select the wrong document. The fix introduces two-phase change detection: phase 1 finds the updated keys via a composite aggregation over the checkpoint window; phase 2 runs the full query with a filter on those keys, so `top_hits` picks the latest document. The changes are confined to `LatestChangeCollector` and do not affect the API or the schema, so existing transforms pick up the corrected behavior after upgrade.
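
To make the failure mode concrete, here is a minimal in-memory sketch. It is not the actual `LatestChangeCollector` code: `Doc`, `windowOnlyCheckpoint`, and `twoPhaseCheckpoint` are hypothetical names, `top_hits` is modeled as "highest sort value per key", and the sketch assumes phase 2 considers every document whose sync time is below the checkpoint's upper bound.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LatestTwoPhaseSketch {

    /** Hypothetical source document: unique key, sort value, sync (ingest) time. */
    static final class Doc {
        final String key;
        final long sort;
        final long sync;

        Doc(String key, long sort, long sync) {
            this.key = key;
            this.sort = sort;
            this.sync = sync;
        }
    }

    /** Stand-in for top_hits (size 1, sort descending): highest-sort doc per key. */
    static Map<String, Doc> latestPerKey(List<Doc> docs) {
        Map<String, Doc> latest = new HashMap<>();
        for (Doc d : docs) {
            Doc current = latest.get(d.key);
            if (current == null || d.sort > current.sort) {
                latest.put(d.key, d);
            }
        }
        return latest;
    }

    /** Docs whose sync time falls in the half-open window [from, to). */
    static List<Doc> inWindow(List<Doc> source, long from, long to) {
        List<Doc> out = new ArrayList<>();
        for (Doc d : source) {
            if (d.sync >= from && d.sync < to) {
                out.add(d);
            }
        }
        return out;
    }

    /** Old behavior: the latest doc is chosen from the checkpoint window only. */
    static void windowOnlyCheckpoint(List<Doc> source, Map<String, Doc> dest, long from, long to) {
        dest.putAll(latestPerKey(inWindow(source, from, to)));
    }

    /** Two-phase behavior: find changed keys, then recompute them over history. */
    static void twoPhaseCheckpoint(List<Doc> source, Map<String, Doc> dest, long from, long to) {
        // Phase 1: collect the keys that received any document in the window.
        Set<String> changedKeys = new HashSet<>();
        for (Doc d : inWindow(source, from, to)) {
            changedKeys.add(d.key);
        }
        // Phase 2: look at all docs seen so far (assumption: sync < to) for those keys.
        List<Doc> candidates = new ArrayList<>();
        for (Doc d : source) {
            if (d.sync < to && changedKeys.contains(d.key)) {
                candidates.add(d);
            }
        }
        dest.putAll(latestPerKey(candidates));
    }

    public static void main(String[] args) {
        // Same key, but sort and sync move in opposite directions.
        List<Doc> source = Arrays.asList(new Doc("k1", 200, 100), new Doc("k1", 100, 200));

        Map<String, Doc> oldDest = new HashMap<>();
        windowOnlyCheckpoint(source, oldDest, 0, 150);
        windowOnlyCheckpoint(source, oldDest, 150, 250);

        Map<String, Doc> newDest = new HashMap<>();
        twoPhaseCheckpoint(source, newDest, 0, 150);
        twoPhaseCheckpoint(source, newDest, 150, 250);

        System.out.println("window-only keeps sort=" + oldDest.get("k1").sort);
        System.out.println("two-phase keeps sort=" + newDest.get("k1").sort);
    }
}
```

With two documents sharing key `k1` (sort 200 / sync 100, then sort 100 / sync 200) and checkpoints `[0, 150)` and `[150, 250)`, the window-only path ends with the sort-100 document in the destination, while the two-phase path keeps the sort-200 document.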

The new behavior aligns `latest` with the pattern already used by `pivot` transforms. Pivot employs `CompositeBucketsChangeCollector`, which runs in two phases driven by `TransformIndexer`. In **IDENTIFY_CHANGES**, it runs a composite aggregation over the checkpoint window (restricted by the sync range) and records the changed buckets using collectors such as `TermsFieldCollector` and `DateHistogramFieldCollector`. In **APPLY_RESULTS**, it builds the pivot query and narrows it with filters from those collectors. `latest` now mirrors this at the unique-key level: phase 1 is a composite aggregation over the unique key fields, and phase 2 filters by the collected key values and runs over the full history. The key difference is that pivot’s “changed buckets” are the group-by dimensions, while latest’s are the unique key values for recomputing

Performance impact is limited: phase 1 adds one extra search per checkpoint (composite aggregation only, no `top_hits`), and phase 2 processes only the changed unique keys, not the whole dataset. The fix adds no Painless scripts, no per-document GET/UpdateRequest calls, and no new destination fields.

Tests: unit tests cover `LatestChangeCollector` (`buildChangesQuery`, `processSearchResponse`, `buildFilterQuery`, `clear`, single-field and multi-field unique keys, null buckets). A Java REST test reproduces the non-monotonic scenario (two docs, same key, different sort/sync order) and asserts that the destination keeps the doc with the higher sort value after checkpoint 2. YAML REST tests assert latest preview and batch behavior with non-monotonic data.

Fixes #90643

valeriy42 added the >bug, :ml/Transform, auto-backport, v9.4.0, v9.3.2, v8.19.13, and v9.2.7 labels on Feb 23, 2026
valeriy42 requested a review from Copilot on February 23, 2026, 13:44
@elasticsearchmachine (Collaborator) commented

Hi @valeriy42, I've created a changelog YAML for you.

Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes a bug in continuous latest transforms where documents could be incorrectly overwritten when the sort field and sync time field don't increase monotonically together. The fix implements a two-phase change detection mechanism similar to pivot transforms, ensuring the destination always contains the document with the highest sort value for each unique key.

Changes:

  • Implemented two-phase change detection in LatestChangeCollector to correctly handle non-monotonic sort/sync field alignment
  • Updated Latest.buildChangeCollector to pass the unique key list to the change collector
  • Added comprehensive unit tests for the new change collector behavior
  • Added integration test reproducing the non-monotonic scenario
  • Added YAML REST tests validating preview and batch behavior with non-monotonic data

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| LatestChangeCollector.java | Implements two-phase change detection with composite aggregation and filtering logic |
| Latest.java | Passes unique key to change collector constructor |
| LatestChangeCollectorTests.java | Comprehensive unit tests for the new change collector implementation |
| TransformLatestRestIT.java | Integration test verifying correct behavior with non-monotonic sort/sync fields |
| transforms_latest.yml | YAML REST tests for preview and batch operations with non-monotonic data |
| 142856.yaml | Changelog entry documenting the bug fix |

}
}

public void testProcessSearchResponseClearsPreviousPageKeys() throws IOException {
Copilot AI Feb 23, 2026

The test name 'testProcessSearchResponseClearsPreviousPageKeys' is misleading. The test actually verifies that calling processSearchResponse replaces (not clears) the previous page's keys with new keys from the current page, as evidenced by the assertion checking for only 'page2-id'. Consider renaming to 'testProcessSearchResponseReplacesKeysFromPreviousPage' to better reflect the actual behavior.

Suggested change:
- public void testProcessSearchResponseClearsPreviousPageKeys() throws IOException {
+ public void testProcessSearchResponseReplacesKeysFromPreviousPage() throws IOException {

valeriy42 marked this pull request as ready for review on February 25, 2026, 05:36
@elasticsearchmachine (Collaborator) commented

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine added the Team:ML label on Feb 25, 2026
valeriy42 requested a review from prwhelan on February 25, 2026, 16:35
@valeriy42 (Contributor, Author) commented

@prwhelan I addressed your comments. Please take another look.

valeriy42 requested a review from prwhelan on February 26, 2026, 09:50
valeriy42 merged commit a745964 into elastic:main on Feb 27, 2026
36 of 37 checks passed
valeriy42 deleted the fix/is-90643 branch on February 27, 2026, 08:52
valeriy42 added a commit to valeriy42/elasticsearch that referenced this pull request Feb 27, 2026
…lds are non-monotonic (elastic#142856)

@elasticsearchmachine (Collaborator) commented

💚 Backport successful

Branches: 9.3, 8.19, 9.2

elasticsearchmachine pushed a commit that referenced this pull request Feb 27, 2026
…lds are non-monotonic (#142856) (#143215)

elasticsearchmachine pushed a commit that referenced this pull request Feb 27, 2026
…lds are non-monotonic (#142856) (#143213)

elasticsearchmachine pushed a commit that referenced this pull request Feb 27, 2026
…ync fields are non-monotonic (#142856) (#143214)

* [ML]Fix latest transforms disregarding updates when sort and sync fields are non-monotonic (#142856)


* Fix build error
PeteGillinElastic pushed a commit to PeteGillinElastic/elasticsearch that referenced this pull request Feb 27, 2026
…lds are non-monotonic (elastic#142856)

szybia added a commit to szybia/elasticsearch that referenced this pull request Feb 27, 2026
…cations

neilbhavsar pushed a commit that referenced this pull request Feb 27, 2026
…ync fields are non-monotonic (#142856) (#143214)

* [ML]Fix latest transforms disregarding updates when sort and sync fields are non-monotonic (#142856)


* Fix build error
tballison pushed a commit to tballison/elasticsearch that referenced this pull request Mar 3, 2026
…lds are non-monotonic (elastic#142856)

Labels: auto-backport, >bug, :ml/Transform, Team:ML, v8.19.13, v9.2.7, v9.3.2, v9.4.0

Linked issue (may be closed by merging this pull request): [Transform] Latest transforms may disregard some document updates.

4 participants