[ML]Fix latest transforms disregarding updates when sort and sync fields are non-monotonic by valeriy42 · Pull Request #142856 · elastic/elasticsearch

valeriy42 · 2026-02-23T13:43:44Z

Continuous latest transforms could overwrite newer documents if sort and sync fields didn't increase together. Checkpoint N only queried documents in [lastCheckpoint, nextCheckpoint), making some documents invisible and causing top_hits to select the wrong document. The fix introduces two-phase change detection: phase 1 finds updated keys via a composite aggregation; phase 2 runs a full query with a filter for those keys, ensuring top_hits picks the latest document. Changes are within LatestChangeCollector with no impact on API or schema, ensuring correct behavior after upgrade.
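To make the failure mode concrete, here is a minimal sketch in plain Java. It uses no Elasticsearch classes; the `Doc` record, its field names, and all helper methods are hypothetical and only model the behaviour described above (a per-checkpoint window on the sync field versus two-phase change detection). It is an illustration under those assumptions, not the code in `LatestChangeCollector`.

```java
import java.util.*;
import java.util.stream.*;

public class LatestTwoPhaseSketch {
    // Hypothetical source document: unique key, sort field value, sync field value.
    record Doc(String key, long sort, long sync) {}

    // Highest-sort document per key, modelling what top_hits selects.
    static Map<String, Doc> latestPerKey(Stream<Doc> docs) {
        return docs.collect(Collectors.toMap(Doc::key, d -> d, (a, b) -> a.sort() >= b.sort() ? a : b));
    }

    // Old behaviour as described: only documents with sync in [from, to) are visible.
    static Map<String, Doc> windowedLatest(List<Doc> source, long from, long to) {
        return latestPerKey(source.stream().filter(d -> d.sync() >= from && d.sync() < to));
    }

    // Phase 1: collect the unique keys that changed inside the checkpoint window.
    static Set<String> changedKeys(List<Doc> source, long from, long to) {
        return source.stream()
            .filter(d -> d.sync() >= from && d.sync() < to)
            .map(Doc::key)
            .collect(Collectors.toSet());
    }

    // Phase 2: recompute those keys over the full history, filtered by key only.
    static Map<String, Doc> twoPhaseLatest(List<Doc> source, long from, long to) {
        Set<String> keys = changedKeys(source, from, to);
        return latestPerKey(source.stream().filter(d -> keys.contains(d.key())));
    }

    public static void main(String[] args) {
        // Same key; the document with the higher sort value arrives first on the sync axis.
        List<Doc> source = List.of(new Doc("k1", 200, 5), new Doc("k1", 100, 15));

        // Windowed approach: checkpoint 2 overwrites the newer document.
        Map<String, Doc> windowedDest = new HashMap<>(windowedLatest(source, 0, 10));
        windowedDest.putAll(windowedLatest(source, 10, 20));
        System.out.println(windowedDest.get("k1").sort()); // prints 100

        // Two-phase approach: checkpoint 2 re-evaluates k1 over all documents.
        Map<String, Doc> twoPhaseDest = new HashMap<>(twoPhaseLatest(source, 0, 10));
        twoPhaseDest.putAll(twoPhaseLatest(source, 10, 20));
        System.out.println(twoPhaseDest.get("k1").sort()); // prints 200
    }
}
```

The two documents mirror the scenario the REST test is described as reproducing: same key, with sort and sync order disagreeing.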

The new behavior aligns latest with the pattern used by pivot transforms. Pivot employs CompositeBucketsChangeCollector, which runs two phases via TransformIndexer. In IDENTIFY_CHANGES, it performs a composite aggregation over the checkpoint window with the sync range, recording changed buckets using collectors like TermsFieldCollector and DateHistogramFieldCollector. In APPLY_RESULTS, it builds the pivot query, narrowing it with filters from these collectors. Latest now mirrors this at the unique-key level: phase 1 is a composite aggregation over the unique key fields, and phase 2 filters by the collected key values and runs over the full history. The key difference is that pivot’s “changed buckets” are the group-by dimensions, while latest’s are the unique key values whose latest document must be recomputed.
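The two request shapes for latest can be sketched as Elasticsearch query DSL bodies built from plain Java maps. This is only an illustration of the description above, not the code in LatestChangeCollector: the class, the method names, the aggregation name `changed_keys`, and the example field names are all hypothetical. It also assumes that a multi-field unique key is filtered with one `terms` clause per field, which may match more key combinations than actually changed.

```java
import java.util.*;

public class LatestQuerySketch {
    // Phase 1 body: composite aggregation over the unique key fields,
    // restricted to the checkpoint window on the sync field. No top_hits.
    static Map<String, Object> changesRequest(String syncField, long from, long to,
                                              List<String> uniqueKey, int pageSize) {
        List<Object> sources = new ArrayList<>();
        for (String field : uniqueKey) {
            Map<String, Object> terms = Map.of("terms", Map.of("field", field));
            sources.add(Map.of(field, terms));
        }
        Map<String, Object> window = Map.of("gte", from, "lt", to);
        Map<String, Object> range = Map.of("range", Map.of(syncField, window));
        Map<String, Object> compositeBody = new LinkedHashMap<>();
        compositeBody.put("size", pageSize);
        compositeBody.put("sources", sources);
        Map<String, Object> composite = Map.of("composite", compositeBody);

        Map<String, Object> body = new LinkedHashMap<>();
        body.put("size", 0);
        body.put("query", range);
        body.put("aggs", Map.of("changed_keys", composite));
        return body;
    }

    // Phase 2 filter: one terms clause per unique-key field, built from the
    // values collected in phase 1 (assumed shape, see the note above).
    static Map<String, Object> keyFilter(Map<String, Set<String>> changedValuesByField) {
        List<Object> clauses = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : changedValuesByField.entrySet()) {
            // Sort the values so the generated filter is deterministic.
            List<String> values = new ArrayList<>(new TreeSet<>(e.getValue()));
            Map<String, Object> termsQuery = Map.of(e.getKey(), values);
            clauses.add(Map.of("terms", termsQuery));
        }
        Map<String, Object> bool = Map.of("filter", clauses);
        return Map.of("bool", bool);
    }

    public static void main(String[] args) {
        System.out.println(changesRequest("event.ingested", 1000L, 2000L, List.of("host.id"), 500));
        System.out.println(keyFilter(Map.of("host.id", Set.of("host-b", "host-a"))));
    }
}
```

Because the phase 2 filter carries no range on the sync field, the sort-based selection sees every document for a changed key, which is the property the fix depends on.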

Performance impact is limited: one extra search per checkpoint in phase 1 (composite aggregation only, no top_hits), and phase 2 processes only changed unique keys, not the whole dataset. No Painless scripts, per-document GET/UpdateRequest, or new destination fields. Unit tests cover LatestChangeCollector (buildChangesQuery, processSearchResponse, buildFilterQuery, clear, single and multi-field unique key, null buckets); a Java REST test reproduces the non-monotonic scenario (two docs, same key, different sort/sync order) and asserts the destination keeps the doc with higher sort value after checkpoint 2; YAML REST tests assert latest preview and batch behavior with non-monotonic data.

Fixes #90643

elasticsearchmachine · 2026-02-23T13:44:09Z

Hi @valeriy42, I've created a changelog YAML for you.

Copilot

Pull request overview

This PR fixes a bug in continuous latest transforms where documents could be incorrectly overwritten when the sort field and sync time field don't increase monotonically together. The fix implements a two-phase change detection mechanism similar to pivot transforms, ensuring the destination always contains the document with the highest sort value for each unique key.

Changes:

- Implemented two-phase change detection in LatestChangeCollector to correctly handle non-monotonic sort/sync field alignment
- Updated Latest.buildChangeCollector to pass the unique key list to the change collector
- Added comprehensive unit tests for the new change collector behavior
- Added integration test reproducing the non-monotonic scenario
- Added YAML REST tests validating preview and batch behavior with non-monotonic data

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| LatestChangeCollector.java | Implements two-phase change detection with composite aggregation and filtering logic |
| Latest.java | Passes unique key to change collector constructor |
| LatestChangeCollectorTests.java | Comprehensive unit tests for the new change collector implementation |
| TransformLatestRestIT.java | Integration test verifying correct behavior with non-monotonic sort/sync fields |
| transforms_latest.yml | YAML REST tests for preview and batch operations with non-monotonic data |
| 142856.yaml | Changelog entry documenting the bug fix |

Copilot · 2026-02-23T13:45:49Z

...est/java/org/elasticsearch/xpack/transform/transforms/latest/LatestChangeCollectorTests.java

+        }
+    }
+
+    public void testProcessSearchResponseClearsPreviousPageKeys() throws IOException {


The test name 'testProcessSearchResponseClearsPreviousPageKeys' is misleading. The test actually verifies that calling processSearchResponse replaces (not clears) the previous page's keys with new keys from the current page, as evidenced by the assertion checking for only 'page2-id'. Consider renaming to 'testProcessSearchResponseReplacesKeysFromPreviousPage' to better reflect the actual behavior.

Suggested change

-    public void testProcessSearchResponseClearsPreviousPageKeys() throws IOException {
+    public void testProcessSearchResponseReplacesKeysFromPreviousPage() throws IOException {

elasticsearchmachine · 2026-02-25T05:37:15Z

Pinging @elastic/ml-core (Team:ML)

x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/transform/transforms_latest.yml

...src/main/java/org/elasticsearch/xpack/transform/transforms/latest/LatestChangeCollector.java

valeriy42 · 2026-02-26T09:49:58Z

@prwhelan I addressed your comments. Please take another look.

elasticsearchmachine · 2026-02-27T08:54:05Z

💚 Backport successful

Status	Branch	Result
✅	9.3
✅	8.19
✅	9.2

Fix Latest Transform Non-Monotonic Sort Issue

f485c3d

valeriy42 added the >bug, :ml/Transform, auto-backport, v9.4.0, v9.3.2, v8.19.13, and v9.2.7 labels Feb 23, 2026

valeriy42 requested a review from Copilot February 23, 2026 13:44

Update docs/changelog/142856.yaml

cac0305

Copilot AI reviewed Feb 23, 2026

valeriy42 added 2 commits February 25, 2026 06:33

update comments

30e0611

Merge branch 'fix/is-90643' of https://github.com/valeriy42/elasticse…

642acda

…arch into fix/is-90643

valeriy42 marked this pull request as ready for review February 25, 2026 05:36

elasticsearchmachine added the Team:ML label Feb 25, 2026

valeriy42 added 2 commits February 25, 2026 15:47

fix html charecter

e750df0

Merge branch 'main' into fix/is-90643

3e08e0d

valeriy42 requested a review from prwhelan February 25, 2026 16:35

[CI] Auto commit changes from spotless

bb87bde

prwhelan reviewed Feb 25, 2026

review comments

b789e7f

valeriy42 requested a review from prwhelan February 26, 2026 09:50

valeriy42 added 3 commits February 26, 2026 11:06

Merge branch 'main' into fix/is-90643

08bb63b

fix unit test to match the pagination optimization

c7696a5

Merge branch 'fix/is-90643' of https://github.com/valeriy42/elasticse…

5fcb1dc

…arch into fix/is-90643

prwhelan approved these changes Feb 26, 2026

remove optimization

bd319c2

valeriy42 merged commit a745964 into elastic:main Feb 27, 2026
36 of 37 checks passed

valeriy42 deleted the fix/is-90643 branch February 27, 2026 08:52

valeriy42 mentioned this pull request Feb 27, 2026

[9.3] [ML]Fix latest transforms disregarding updates when sort and sync fields are non-monotonic (#142856) #143213

Merged

valeriy42 mentioned this pull request Feb 27, 2026

[8.19] [ML]Fix latest transforms disregarding updates when sort and sync fields are non-monotonic (#142856) #143214

Merged

valeriy42 mentioned this pull request Feb 27, 2026

[9.2] [ML]Fix latest transforms disregarding updates when sort and sync fields are non-monotonic (#142856) #143215

Merged

prwhelan mentioned this pull request Feb 27, 2026

[Transform] Clean up internal tests #143246

Merged

valeriy42 merged 12 commits into elastic:main from valeriy42:fix/is-90643

4 participants
