-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Fix Term Vectors with artificial docs and keyword fields #53504
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Previously, Term Vectors API was returning empty results for artificial documents with keyword fields. Checking only for `string()` on `IndexableField` is not enough, since for `KeywordFieldType` `binaryValue()` must be used instead. Fixes elastic#53494
|
Pinging @elastic/es-search (:Search/Search) |
jimczi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch @matriv, I left a comment regarding the applicability of the fix.
| if (field.name().equals(name)) { | ||
| if (field.stringValue() != null) { | ||
| result.add(field.stringValue()); | ||
| } else if (field.binaryValue() != null) { // KeywordFieldType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot use the binary value without checking the type of the field. It could be a real binary field that doesn't store utf8 bytes. Since this function is only used in the terms vector service, can you move it there and adds a check to extract binary values only if the field is of type keyword ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method is called as you said only by TermVectorService and it's guarded by isValidField:
https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/index/termvectors/TermVectorsService.java#L304 but I agree, it makes sense to move the method close to this so it's visible.
|
@jimczi I moved it to |
jimczi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for moving the function. I left one comment regarding the field value extraction.
| for (IndexableField field : fields) { | ||
| if (field.stringValue() != null) { | ||
| result.add(field.stringValue()); | ||
| } else if (field.binaryValue() != null) { // KeywordFieldType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this will create duplicates due to doc values field that also use a binary value (SortedSetDocValuesField). You should change this to something like:
if (field.fieldType() instanceof KeywordFieldMapper.KeywordFieldType) {
result.add(field.binaryValue().utf8ToString());
} else if (field.fieldType() instanceof StringFieldType) {
result.add(field.stringValue());
}
jimczi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Previously, Term Vectors API was returning empty results for artificial documents with keyword fields. Checking only for `string()` on `IndexableField` is not enough, since for `KeywordFieldType` `binaryValue()` must be used instead. Fixes elastic#53494 (cherry picked from commit 1fc3fe3)
Previously, Term Vectors API was returning empty results for
artificial documents with keyword fields. Checking only for
string()on
IndexableFieldis not enough, since forKeywordFieldTypebinaryValue()must be used instead.Fixes #53494