Fix offsets not recording duplicate values #125354
Merged
jordan-powers merged 4 commits into elastic:main on Mar 21, 2025
Conversation
Collaborator
Pinging @elastic/es-storage-engine (Team:StorageEngine)
martijnvg (Member) approved these changes on Mar 21, 2025 and left a comment:
Good catch! One small comment - LGTM 👍
(Review comment on: public abstract long toSortableLong(Number value);)
Member:
Now we have a good reason to have this method :).
Maybe add documentation to explain why we need this method?
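For background on why such a method matters here: a sortable-long mapping encodes each value at the field's storage precision, so two values that are indistinguishable at that precision get the same key. Below is a purely illustrative Python sketch of the idea for half floats; it is not the actual `toSortableLong` implementation, and the helper name is ours:

```python
import struct

def to_sortable_long(value: float) -> int:
    # Illustrative only: encode the value to IEEE-754 half precision
    # ('e' format) and use the raw 16-bit pattern, adjusted so that
    # integer order matches numeric order.
    bits = struct.unpack('<H', struct.pack('<e', value))[0]
    # Negative numbers: flip all bits; positive numbers: set the sign bit.
    return (bits ^ 0xFFFF) if (bits & 0x8000) else (bits | 0x8000)

# Two doubles that differ in full precision map to the same sortable key,
# which is exactly why offsets must be computed after this conversion.
same = to_sortable_long(0.6886488) == to_sortable_long(0.6882413)
```

This is the property the fix relies on: comparing sortable keys instead of raw doubles makes the offsets calculation see the same duplicates as the doc values do.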
Contributor (Author):
💚 All backports created successfully
Questions? Please refer to the Backport tool documentation.
elasticsearchmachine pushed a commit that referenced this pull request on Mar 21, 2025:

Natively store synthetic source array offsets for numeric fields (#124594) | Fix ignores malformed testcase (#125337) | Fix offsets not recording duplicate values (#125354) (#125440)

* Natively store synthetic source array offsets for numeric fields (#124594)

  This patch builds on the work in #122999 and #113757 to natively store array offsets for numeric fields instead of falling back to ignored source when `source_keep_mode: arrays`.

  (cherry picked from commit 376abfe)

  # Conflicts:
  #   server/src/main/java/org/elasticsearch/index/IndexVersions.java
  #   server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java

* Fix ignores malformed testcase (#125337)

  Fix and unmute testSynthesizeArrayRandomIgnoresMalformed

  (cherry picked from commit 2ff03ac)

  # Conflicts:
  #   muted-tests.yml

* Fix offsets not recording duplicate values (#125354)

  Previously, when calculating the offsets, we just compared the values as-is without any loss of precision. However, when the values were saved into doc values and loaded in the doc values loader, they could have lost precision. This meant that values that were not duplicates when calculating the offsets could now be duplicates in the doc values loader. This interfered with the de-duplication logic, causing incorrect values to be returned. My solution is to apply the precision loss before calculating the offsets, so that both the offsets calculation and the SortedNumericDocValues de-duplication see the same values as duplicates.

  (cherry picked from commit db73175)
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request on Mar 28, 2025
While investigating a failing CI test for #125337, I discovered a bug in our current offset logic.
Basically, when calculating the offsets, we just compare the values as-is without any loss of precision. However, when the values are saved into doc values and loaded in the doc values loader, they will have lost precision. This means that values that were not duplicates when calculating the offsets will now be duplicates in the doc values loader. This interferes with the de-duplication logic, causing incorrect values to be returned.
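The precision loss itself is easy to reproduce outside Elasticsearch. As a minimal sketch (Python rather than the Java of the mapper code, and the helper name is ours), round-tripping three distinct doubles through IEEE-754 half precision collapses two of them into one:

```python
import struct

def half_float_round_trip(value: float) -> float:
    # Encode to IEEE-754 half precision ('e' format) and decode again,
    # mimicking the precision loss a half_float doc value undergoes.
    return struct.unpack('<e', struct.pack('<e', value))[0]

original = [0.78151345, 0.6886488, 0.6882413]  # three distinct doubles
stored = [half_float_round_trip(v) for v in original]

# After the round trip, the last two values are equal, so they are
# duplicates from the doc values loader's point of view.
```

These are the same three values used in the concrete example from this PR's description; in half precision the second and third both land on roughly 0.68847656.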
Here's a concrete example. This value is indexed into a `type: half_float` field: `[0.78151345, 0.6886488, 0.6882413]`. The corresponding offsets will be `[3, 2, 1]`. However, once the value is saved into the doc values and re-loaded, precision is lost and the `SortedNumericDocValues` become `[0.68847656, 0.68847656, 0.7817383]`. Note that the first two values are now duplicates. Because of the de-duplication logic, in `NumericDocValuesWithOffsetsLoader#write` the `values` array is then set to `[0.68847656, 0.7817383, 0]`. Finally, when the offsets are used to reconstruct the source using this values array, the resultant source is `[0, 0.7817383, 0.68847656]`, which does not match the original source.

My solution is to apply the precision loss before calculating the offsets, so that both the offsets calculation and the `SortedNumericDocValues` de-duplication see the same values as duplicates.
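Putting the pieces together, the bug and the fix can be simulated end to end with a hedged sketch: `rank_offsets` and `reconstruct` below are illustrative stand-ins for the real offset calculation and for `NumericDocValuesWithOffsetsLoader#write`, not the actual implementation, and the `struct` round trip stands in for half_float doc-values precision loss:

```python
import struct

def half_float(v: float) -> float:
    # Round-trip through IEEE-754 half precision, standing in for
    # the precision loss of half_float doc values.
    return struct.unpack('<e', struct.pack('<e', v))[0]

def rank_offsets(values):
    # Offset of each value = its 1-based rank among the distinct sorted
    # values (this mirrors the offsets in the example above).
    ranks = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [ranks[v] for v in values]

def reconstruct(offsets, values):
    # Doc values arrive sorted and de-duplicated; offsets index into them.
    deduped = sorted(set(values))
    deduped += [0.0] * (len(offsets) - len(deduped))  # unused trailing slots
    return [deduped[o - 1] for o in offsets]

source = [0.78151345, 0.6886488, 0.6882413]
stored = [half_float(v) for v in source]

# Buggy: offsets computed on the raw, full-precision values...
buggy_offsets = rank_offsets(source)        # [3, 2, 1]
# ...but reconstruction sees the precision-reduced, de-duplicated values.
buggy = reconstruct(buggy_offsets, stored)  # does not match the source

# Fixed: apply the precision loss before calculating the offsets.
fixed_offsets = rank_offsets(stored)        # [2, 1, 1]
fixed = reconstruct(fixed_offsets, stored)  # matches the stored source
```

With the fix, the offsets and the de-duplicated doc values agree on which entries are duplicates, so the reconstructed array matches the source up to half-float precision, which is the best any half_float field can do.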