
[8.19] [ML]Fix latest transforms disregarding updates when sort and sync fields are non-monotonic (#142856) #143214

Merged
elasticsearchmachine merged 2 commits into elastic:8.19 from valeriy42:backport/8.19/pr-142856
Feb 27, 2026

@valeriy42
Contributor

Backports the following commits to 8.19:

…lds are non-monotonic (elastic#142856)

Continuous latest transforms could overwrite newer documents if sort and sync fields didn't increase together. Checkpoint N only queried documents in `[lastCheckpoint, nextCheckpoint)`, making some documents invisible and causing `top_hits` to select the wrong document. The fix introduces two-phase change detection: phase 1 finds updated keys via a composite aggregation; phase 2 runs a full query with a filter for those keys, ensuring `top_hits` picks the latest document. Changes are within `LatestChangeCollector` with no impact on API or schema, ensuring correct behavior after upgrade.
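
The failure mode and the two-phase fix can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the code in `LatestChangeCollector`: the field names `key`, `sort`, and `sync`, the helper names, and the integer checkpoint bounds are all hypothetical, and the sketch assumes phase 2 is still bounded above by the next checkpoint.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]


def top_hit_per_key(docs: List[Doc]) -> Dict[str, Doc]:
    """Emulate `top_hits` with size 1: keep the doc with the highest sort value per key."""
    best: Dict[str, Doc] = {}
    for doc in docs:
        key = doc["key"]
        if key not in best or doc["sort"] > best[key]["sort"]:
            best[key] = doc
    return best


def checkpoint_window_only(source: List[Doc], dest: Dict[str, Doc],
                           last_cp: int, next_cp: int) -> None:
    """Pre-fix behavior: only docs with sync in [last_cp, next_cp) are visible."""
    visible = [d for d in source if last_cp <= d["sync"] < next_cp]
    dest.update(top_hit_per_key(visible))


def checkpoint_two_phase(source: List[Doc], dest: Dict[str, Doc],
                         last_cp: int, next_cp: int) -> None:
    """Two-phase change detection as described above."""
    # Phase 1: find the unique keys that changed inside the checkpoint window.
    changed = {d["key"] for d in source if last_cp <= d["sync"] < next_cp}
    # Phase 2: look at the full history for those keys only.
    # Assumption: the query is still bounded above by the next checkpoint.
    candidates = [d for d in source if d["key"] in changed and d["sync"] < next_cp]
    dest.update(top_hit_per_key(candidates))


# Two docs with the same key: the doc with the higher sort value is synced first.
SOURCE: List[Doc] = [
    {"key": "host-1", "sort": 200, "sync": 10},
    {"key": "host-1", "sort": 100, "sync": 20},
]


def run(step: Callable[[List[Doc], Dict[str, Doc], int, int], None]) -> Dict[str, Doc]:
    """Run checkpoint 1 over [0, 15) and checkpoint 2 over [15, 25)."""
    dest: Dict[str, Doc] = {}
    step(SOURCE, dest, 0, 15)
    step(SOURCE, dest, 15, 25)
    return dest
```

With the window-only step, checkpoint 2 sees only the second doc and overwrites the destination with the lower sort value. With the two-phase step, checkpoint 2 re-reads both docs for `host-1` and keeps the one with the higher sort value.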

The new behavior aligns `latest` with the pattern used by `pivot` transforms. Pivot employs `CompositeBucketsChangeCollector`, which runs two phases via `TransformIndexer`: in **IDENTIFY_CHANGES**, it performs a composite aggregation over the checkpoint window with sync range, recording changed buckets using collectors like `TermsFieldCollector` and `DateHistogramFieldCollector`. In **APPLY_RESULTS**, it builds the pivot query, narrowing it with filters from these collectors. `Latest` now mirrors this at the unique-key level: phase 1 is a composite over unique key fields, and phase 2 filters by collected key values to run over full history. The key difference is that pivot’s “changed buckets” are the group-by dimensions, while latest’s are the unique key values for recomputing

Performance impact is limited: phase 1 adds one extra search per checkpoint (composite aggregation only, no `top_hits`), and phase 2 processes only the changed unique keys, not the whole dataset. The fix adds no Painless scripts, no per-document GET/UpdateRequest calls, and no new destination fields. Unit tests cover `LatestChangeCollector` (`buildChangesQuery`, `processSearchResponse`, `buildFilterQuery`, `clear`, single and multi-field unique keys, null buckets). A Java REST test reproduces the non-monotonic scenario (two docs, same key, different sort/sync order) and asserts that the destination keeps the doc with the higher sort value after checkpoint 2. YAML REST tests assert latest preview and batch behavior with non-monotonic data.
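
For readers unfamiliar with the query shapes involved, the following sketch shows roughly what the two phases could look like in Elasticsearch query DSL, written as Python dicts. It is an illustration under assumptions, not the actual Java implementation: the function names only echo the method names listed in the unit tests, the aggregation name `changed_keys`, the page size, and the numeric checkpoint bounds are hypothetical, and the per-field `terms` filter for multi-field keys is an assumed simplification.

```python
from typing import Any, Dict, List


def build_changes_query(unique_key: List[str], sync_field: str,
                        last_cp: int, next_cp: int, page_size: int = 500) -> Dict[str, Any]:
    """Phase 1: composite aggregation over the unique key fields, limited to the window."""
    return {
        "size": 0,  # only buckets are needed, no hits and no top_hits
        "query": {"range": {sync_field: {"gte": last_cp, "lt": next_cp}}},
        "aggs": {
            "changed_keys": {  # hypothetical aggregation name
                "composite": {
                    "size": page_size,
                    "sources": [{f: {"terms": {"field": f}}} for f in unique_key],
                }
            }
        },
    }


def process_search_response(response: Dict[str, Any],
                            unique_key: List[str]) -> Dict[str, List[Any]]:
    """Collect the changed values per unique key field from the composite buckets."""
    changed: Dict[str, List[Any]] = {f: [] for f in unique_key}
    for bucket in response["aggregations"]["changed_keys"]["buckets"]:
        for f in unique_key:
            value = bucket["key"].get(f)
            # Simplification: null bucket values are skipped in this sketch.
            if value is not None and value not in changed[f]:
                changed[f].append(value)
    return changed


def build_filter_query(changed: Dict[str, List[Any]]) -> Dict[str, Any]:
    """Phase 2: restrict the full-history latest query to the collected key values."""
    return {
        "bool": {
            "filter": [{"terms": {f: values}} for f, values in changed.items() if values]
        }
    }
```

Filtering each field independently can match key combinations that did not change. In this sketch that only costs extra work, because phase 2 recomputes the top document for every key it matches.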

Fixes elastic#90643
@valeriy42 added the :ml/Transform, >bug, auto-merge-without-approval, backport, and Team:ML labels on Feb 27, 2026
@elasticsearchmachine merged commit d00780e into elastic:8.19 on Feb 27, 2026
28 checks passed
@valeriy42 deleted the backport/8.19/pr-142856 branch on February 27, 2026 11:16
neilbhavsar pushed a commit that referenced this pull request Feb 27, 2026
…ync fields are non-monotonic (#142856) (#143214)

* [ML]Fix latest transforms disregarding updates when sort and sync fields are non-monotonic (#142856)

Fixes #90643

* Fix build error

Labels

auto-merge-without-approval, backport, >bug, :ml/Transform, Team:ML, v8.19.13