Fix offsets not recording duplicate values #125354
Merged
jordan-powers merged 4 commits into elastic:main on Mar 21, 2025
Conversation
Collaborator
Pinging @elastic/es-storage-engine (Team:StorageEngine)
martijnvg (Member) approved these changes on Mar 21, 2025 and left a comment:
Good catch! One small comment - LGTM 👍
(Review comment on: public abstract long toSortableLong(Number value);)
Member:
Now we have a good reason to have this method :).
Maybe add documentation to explain why we need this method?
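For background on why such a method matters here: a sortable-long mapping encodes each value at the field's storage precision, so two values that are indistinguishable at that precision get the same key. Below is a purely illustrative Python sketch of the idea for half floats; it is not the actual `toSortableLong` implementation, and the helper name is ours:

```python
import struct

def to_sortable_long(value: float) -> int:
    # Illustrative only: encode the value to IEEE-754 half precision
    # ('e' format) and use the raw 16-bit pattern, adjusted so that
    # integer order matches numeric order.
    bits = struct.unpack('<H', struct.pack('<e', value))[0]
    # Negative numbers: flip all bits; positive numbers: set the sign bit.
    return (bits ^ 0xFFFF) if (bits & 0x8000) else (bits | 0x8000)

# Two doubles that differ in full precision map to the same sortable key,
# which is exactly why offsets must be computed after this conversion.
same = to_sortable_long(0.6886488) == to_sortable_long(0.6882413)
```

This is the property the fix relies on: comparing sortable keys instead of raw doubles makes the offsets calculation see the same duplicates as the doc values do.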
Contributor (Author):
💚 All backports created successfully
Questions? Please refer to the Backport tool documentation.
elasticsearchmachine pushed a commit that referenced this pull request on Mar 21, 2025:

Natively store synthetic source array offsets for numeric fields (#124594) | Fix ignores malformed testcase (#125337) | Fix offsets not recording duplicate values (#125354) (#125440)

* Natively store synthetic source array offsets for numeric fields (#124594)

  This patch builds on the work in #122999 and #113757 to natively store array offsets for numeric fields instead of falling back to ignored source when `source_keep_mode: arrays`.

  (cherry picked from commit 376abfe)

  # Conflicts:
  #   server/src/main/java/org/elasticsearch/index/IndexVersions.java
  #   server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java

* Fix ignores malformed testcase (#125337)

  Fix and unmute testSynthesizeArrayRandomIgnoresMalformed

  (cherry picked from commit 2ff03ac)

  # Conflicts:
  #   muted-tests.yml

* Fix offsets not recording duplicate values (#125354)

  Previously, when calculating the offsets, we just compared the values as-is without any loss of precision. However, when the values were saved into doc values and loaded in the doc values loader, they could have lost precision. This meant that values that were not duplicates when calculating the offsets could now be duplicates in the doc values loader. This interfered with the de-duplication logic, causing incorrect values to be returned. My solution is to apply the precision loss before calculating the offsets, so that both the offsets calculation and the SortedNumericDocValues de-duplication see the same values as duplicates.

  (cherry picked from commit db73175)
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request on Mar 28, 2025
While investigating a failing CI test for #125337, I discovered a bug in our current offset logic.
Basically, when calculating the offsets, we just compare the values as-is without any loss of precision. However, when the values are saved into doc values and loaded in the doc values loader, they will have lost precision. This means that values that were not duplicates when calculating the offsets will now be duplicates in the doc values loader. This interferes with the de-duplication logic, causing incorrect values to be returned.
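The precision loss itself is easy to reproduce outside Elasticsearch. As a minimal sketch (Python rather than the Java of the mapper code, and the helper name is ours), round-tripping three distinct doubles through IEEE-754 half precision collapses two of them into one:

```python
import struct

def half_float_round_trip(value: float) -> float:
    # Encode to IEEE-754 half precision ('e' format) and decode again,
    # mimicking the precision loss a half_float doc value undergoes.
    return struct.unpack('<e', struct.pack('<e', value))[0]

original = [0.78151345, 0.6886488, 0.6882413]  # three distinct doubles
stored = [half_float_round_trip(v) for v in original]

# After the round trip, the last two values are equal, so they are
# duplicates from the doc values loader's point of view.
```

These are the same three values used in the concrete example from this PR's description; in half precision the second and third both land on roughly 0.68847656.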
Here's a concrete example. This value is indexed into a `type: half_float` field: `[0.78151345, 0.6886488, 0.6882413]`. The corresponding offsets will be `[3, 2, 1]`. However, once the value is saved into the doc values and re-loaded, precision is lost and the `SortedNumericDocValues` become `[0.68847656, 0.68847656, 0.7817383]`. Note that the first two values are now duplicates. Because of the de-duplication logic, in `NumericDocValuesWithOffsetsLoader#write` the `values` array is then set to `[0.68847656, 0.7817383, 0]`. Finally, when the offsets are used to reconstruct the source using this values array, the resultant source is `[0, 0.7817383, 0.68847656]`, which does not match the original source.

My solution is to apply the precision loss before calculating the offsets, so that both the offsets calculation and the `SortedNumericDocValues` de-duplication see the same values as duplicates.
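Putting the pieces together, the bug and the fix can be simulated end to end with a hedged sketch: `rank_offsets` and `reconstruct` below are illustrative stand-ins for the real offset calculation and for `NumericDocValuesWithOffsetsLoader#write`, not the actual implementation, and the `struct` round trip stands in for half_float doc-values precision loss:

```python
import struct

def half_float(v: float) -> float:
    # Round-trip through IEEE-754 half precision, standing in for
    # the precision loss of half_float doc values.
    return struct.unpack('<e', struct.pack('<e', v))[0]

def rank_offsets(values):
    # Offset of each value = its 1-based rank among the distinct sorted
    # values (this mirrors the offsets in the example above).
    ranks = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [ranks[v] for v in values]

def reconstruct(offsets, values):
    # Doc values arrive sorted and de-duplicated; offsets index into them.
    deduped = sorted(set(values))
    deduped += [0.0] * (len(offsets) - len(deduped))  # unused trailing slots
    return [deduped[o - 1] for o in offsets]

source = [0.78151345, 0.6886488, 0.6882413]
stored = [half_float(v) for v in source]

# Buggy: offsets computed on the raw, full-precision values...
buggy_offsets = rank_offsets(source)        # [3, 2, 1]
# ...but reconstruction sees the precision-reduced, de-duplicated values.
buggy = reconstruct(buggy_offsets, stored)  # does not match the source

# Fixed: apply the precision loss before calculating the offsets.
fixed_offsets = rank_offsets(stored)        # [2, 1, 1]
fixed = reconstruct(fixed_offsets, stored)  # matches the stored source
```

With the fix, the offsets and the de-duplicated doc values agree on which entries are duplicates, so the reconstructed array matches the source up to half-float precision, which is the best any half_float field can do.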