-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Update semantic_text field to support indexing numeric and boolean data types #111284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update semantic_text field to support indexing numeric and boolean data types #111284
Conversation
0e0add3 to
168fdf7
Compare
Mikep86
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work 🙌 ! Left some minor comments
.../org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTestUtil.java
Outdated
Show resolved
Hide resolved
.../org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTestUtil.java
Outdated
Show resolved
Hide resolved
| if (model.hasResult(inputText)) { | ||
| ChunkedInferenceServiceResults results = model.getResults(inputText); | ||
| semanticTextField = semanticTextFieldFromChunkedInferenceResults( | ||
| field, | ||
| model, | ||
| List.of(inputText), | ||
| results, | ||
| requestContentType | ||
| ); | ||
| } else { | ||
| semanticTextField = randomSemanticText(field, model, List.of(inputText), requestContentType); | ||
| model.putResult(inputText, toChunkedResult(semanticTextField)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carlosdelest We had to make this change because the inference result cache in model is not field-aware. Now that our input can be of many data types (including Boolean, with only two values), we are nearly guaranteed to hit value collisions across 100+ bulk requests. This caused test failures with the previous logic because different random embeddings would be generated every time we saw the value "true" (for example).
This updated logic checks if the inference result cache already has results for the value, and uses them if it does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see - we could maybe have used ESTestCase.randomValueOtherThanMany()to a similar effect. That would ensure that the random value is not in the model, and not just trying twice - AFAIU we should loop until we find a value that is not on the results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm misunderstanding your comment, but I don't think ESTestCase.randomValueOtherThanMany() would help here. The issue is that the previous logic always generated a new embedding for the input, regardless of whether the model already had a cached value for that input. This caused test failures. Consider the following case:
- We generate a random embedding for the input
true - We write that embedding to the expected doc map and cache it in
model - In a later bulk request, the input
trueis randomly generated again - We generate a different random embedding for the input
true - We overwrite the cached embedding in
modelwith the new embedding - After all requests are generated, we assert that the embedding in the expected doc map matches that in the
modelcache. This fails because the embedding in themodelcache was overwritten.
This new logic fixes the problem by first checking if model already has a cached embedding for the input. If it does, we use it. If it doesn't, we generate a new random embedding and add it to the model cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is, wouldn't it be simpler not to generate the duplicate input values, and thus avoid managing the results as it happens?
...rence/src/yamlRestTest/resources/rest-api-spec/test/inference/30_semantic_text_inference.yml
Outdated
Show resolved
Hide resolved
...rence/src/yamlRestTest/resources/rest-api-spec/test/inference/30_semantic_text_inference.yml
Outdated
Show resolved
Hide resolved
...rence/src/yamlRestTest/resources/rest-api-spec/test/inference/30_semantic_text_inference.yml
Outdated
Show resolved
Hide resolved
|
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
|
Pinging @elastic/search-relevance (Team:Search - Relevance) |
|
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
|
Pinging @elastic/ent-search-eng (Team:SearchOrg) |
|
Hi @Samiul-TheSoccerFan, I've created a changelog YAML for you. |
kderusso
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, nice work! You may need to update the changelog to get CI to pass, but otherwise looks good to me!
Mikep86
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for iterating!
carlosdelest
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Nice work @Samiul-TheSoccerFan !
| if (model.hasResult(inputText)) { | ||
| ChunkedInferenceServiceResults results = model.getResults(inputText); | ||
| semanticTextField = semanticTextFieldFromChunkedInferenceResults( | ||
| field, | ||
| model, | ||
| List.of(inputText), | ||
| results, | ||
| requestContentType | ||
| ); | ||
| } else { | ||
| semanticTextField = randomSemanticText(field, model, List.of(inputText), requestContentType); | ||
| model.putResult(inputText, toChunkedResult(semanticTextField)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see - we could maybe have used ESTestCase.randomValueOtherThanMany()to a similar effect. That would ensure that the random value is not in the model, and not just trying twice - AFAIU we should loop until we find a value that is not on the results?
|
|
||
| - do: | ||
| headers: | ||
| # Force JSON content type so that we use a parser that interprets the floating-point score as a double |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is content type needed here, as we're using booleans?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a copy-paste from other YAML tests so that we can compare scores using the YAML assertions. Fun fact: If the test uses the SMILE format (which it will randomly do, unless you force JSON like is done here), then scores in search responses will be parsed as float, breaking the ability to check them using YAML assertions (which take double values).
We don't compare scores in this particular test, but it should be harmless to leave this so as to not create a landmine if we add score comparison in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it - thought it was referring to the actual values we were indexing instead of the score, as I saw no scores involved in the test.
I'd say if it's not needed, don't add it - it confused me and probably will confuse others 🤷
| public static Object randomSemanticTextInput() { | ||
| int randomInt = randomIntBetween(0, 5); | ||
| return switch (randomInt) { | ||
| case 0 -> randomAlphaOfLengthBetween(10, 20); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit - Maybe we should have less priority for using boolean / numbers via rarely()? Something like
if (rarely) {
return switch(randomIntBetween(0, 4)) {
case 0 -> randomInt();
case 1 -> randomLong();
case 2 -> randomFloat();
case 3 -> randomBoolean();
case 4 -> randomDouble();
}
} else {
return randomAlphaOfLengthBetween(10, 20);
}|
Hi @Samiul-TheSoccerFan, I've updated the changelog YAML for you. |
|
|
||
| - do: | ||
| headers: | ||
| # Force JSON content type so that we use a parser that interprets the floating-point score as a double |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it - thought it was referring to the actual values we were indexing instead of the score, as I saw no scores involved in the test.
I'd say if it's not needed, don't add it - it confused me and probably will confuse others 🤷
| if (model.hasResult(inputText)) { | ||
| ChunkedInferenceServiceResults results = model.getResults(inputText); | ||
| semanticTextField = semanticTextFieldFromChunkedInferenceResults( | ||
| field, | ||
| model, | ||
| List.of(inputText), | ||
| results, | ||
| requestContentType | ||
| ); | ||
| } else { | ||
| semanticTextField = randomSemanticText(field, model, List.of(inputText), requestContentType); | ||
| model.putResult(inputText, toChunkedResult(semanticTextField)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is, wouldn't it be simpler not to generate the duplicate input values, and thus avoid managing the results as it happens?
| // The model is not field aware and that is why we are skipping the embedding generation process for existing values. | ||
| // This prevents a situation where embeddings in the expected docMap do not match those in the model, which could happen if | ||
| // embeddings were overwritten. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think this comment makes sense @Mikep86?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks for the iterations!
* upstream/main: (105 commits) Removing the use of watcher stats from WatchAcTests (elastic#111435) Mute org.elasticsearch.xpack.restart.FullClusterRestartIT testSingleDoc {cluster=UPGRADED} elastic#111434 Make `EnrichPolicyRunner` more properly async (elastic#111321) Mute org.elasticsearch.xpack.restart.FullClusterRestartIT testSingleDoc {cluster=OLD} elastic#111430 Mute org.elasticsearch.xpack.esql.expression.function.aggregate.ValuesTests testGroupingAggregate {TestCase=<long unicode KEYWORDs>} elastic#111428 Mute org.elasticsearch.xpack.esql.expression.function.aggregate.ValuesTests testGroupingAggregate {TestCase=<long unicode TEXTs>} elastic#111429 Mute org.elasticsearch.xpack.repositories.metering.azure.AzureRepositoriesMeteringIT org.elasticsearch.xpack.repositories.metering.azure.AzureRepositoriesMeteringIT elastic#111307 Update semantic_text field to support indexing numeric and boolean data types (elastic#111284) Mute org.elasticsearch.repositories.blobstore.testkit.AzureSnapshotRepoTestKitIT testRepositoryAnalysis elastic#111280 Ensure vector similarity correctly limits inner_hits returned for nested kNN (elastic#111363) Fix LogsIndexModeFullClusterRestartIT (elastic#111362) Remove 4096 bool query max limit from docs (elastic#111421) Fix score count validation in reranker response (elastic#111212) Integrate data generator in LogsDB mode challenge test (elastic#111303) ESQL: Add COUNT and COUNT_DISTINCT aggregation tests (elastic#111409) [Service Account] Add AutoOps account (elastic#111316) [ML] Fix failing test DetectionRulesTests.testEqualsAndHashcode (elastic#111351) [ML] Create and inject APM Inference Metrics (elastic#111293) [DOCS] Additional reranking docs updates (elastic#111350) Mute org.elasticsearch.repositories.azure.RepositoryAzureClientYamlTestSuiteIT org.elasticsearch.repositories.azure.RepositoryAzureClientYamlTestSuiteIT elastic#111345 ... # Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java
Adding support for additional data types in the Semantic_text field. Initially, it only supported String and Collection of String values. This PR enables the options to ingest other data types (Number and Boolean) with the Semantic_text field and retrieve the appropriate documents when queried.
The response will be something like:
