Update semantic_text field to support indexing numeric and boolean data types #111284

Samiul-TheSoccerFan · 2024-07-25T12:50:18Z

This PR adds support for additional data types in the semantic_text field. Previously, the field accepted only String values and collections of Strings. With this change, numeric and Boolean values can also be ingested into a semantic_text field, and the matching documents are returned when queried.

PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}

PUT test-sparse
{
    "mappings": {
        "properties": {
            "test_field": {
                "type": "semantic_text",
                "inference_id": "my-elser-model"
            }
        }
    }
}

PUT test-sparse/_doc/doc1
{
    "test_field": "42"
}
PUT test-sparse/_doc/doc2
{
    "test_field": 501.11
}
PUT test-sparse/_doc/doc3
{
    "test_field": 100
}
PUT test-sparse/_doc/doc4
{
    "test_field": true
}
PUT test-sparse/_doc/doc5
{
    "test_field": ["any string", 77, false, "false"]
}

GET test-sparse/_search?_source_excludes=_semantic_text_inference
{
    "query": {
        "semantic": {
            "field": "test_field",
            "query": "42"
        }
    }
}
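
To make the accepted input shapes concrete, here is a minimal Java sketch of how a field value such as the ones indexed above could be flattened into the strings sent for inference. The class `SemanticTextInputSketch` and the method `toInferenceInputs` are hypothetical names for illustration, and the use of `toString()` is an assumption, not a copy of the PR's implementation.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper for illustration only; not taken from the PR.
final class SemanticTextInputSketch {

    // Accepts a String, Number, Boolean, or a collection of those,
    // and returns the string form of every value.
    static List<String> toInferenceInputs(Object value) {
        if (isSupportedScalar(value)) {
            return List.of(value.toString());
        }
        if (value instanceof Collection<?> values) {
            List<String> inputs = new ArrayList<>();
            for (Object element : values) {
                if (!isSupportedScalar(element)) {
                    throw new IllegalArgumentException("unsupported element: " + element);
                }
                inputs.add(element.toString());
            }
            return inputs;
        }
        throw new IllegalArgumentException("unsupported value: " + value);
    }

    private static boolean isSupportedScalar(Object value) {
        return value instanceof String || value instanceof Number || value instanceof Boolean;
    }

    public static void main(String[] args) {
        // Mirrors doc5 above: a mixed collection of strings, a number, and a boolean.
        System.out.println(toInferenceInputs(List.of("any string", 77, false, "false")));
    }
}
```

Under this assumption, the Boolean `false` and the string `"false"` in doc5 produce the same inference input, which is worth keeping in mind when reading the test discussion further down.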

The response will be something like:

Mikep86

Great work 🙌 ! Left some minor comments

.../org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTestUtil.java

Mikep86 · 2024-07-26T12:40:46Z

...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java

+            if (model.hasResult(inputText)) {
+                ChunkedInferenceServiceResults results = model.getResults(inputText);
+                semanticTextField = semanticTextFieldFromChunkedInferenceResults(
+                    field,
+                    model,
+                    List.of(inputText),
+                    results,
+                    requestContentType
+                );
+            } else {
+                semanticTextField = randomSemanticText(field, model, List.of(inputText), requestContentType);
+                model.putResult(inputText, toChunkedResult(semanticTextField));
+            }


@carlosdelest We had to make this change because the inference result cache in model is not field-aware. Now that our input can be of many data types (including Boolean, with only two values), we are nearly guaranteed to hit value collisions across 100+ bulk requests. This caused test failures with the previous logic because different random embeddings would be generated every time we saw the value "true" (for example).

This updated logic checks if the inference result cache already has results for the value, and uses them if it does.

I see - we could maybe have used ESTestCase.randomValueOtherThanMany() to a similar effect. That would ensure that the random value is not already in the model, rather than just trying twice. AFAIU we should loop until we find a value that is not in the results?

Maybe I'm misunderstanding your comment, but I don't think ESTestCase.randomValueOtherThanMany() would help here. The issue is that the previous logic always generated a new embedding for the input, regardless of whether the model already had a cached value for that input. This caused test failures. Consider the following case:

1. We generate a random embedding for the input true
2. We write that embedding to the expected doc map and cache it in model
3. In a later bulk request, the input true is randomly generated again
4. We generate a different random embedding for the input true
5. We overwrite the cached embedding in model with the new embedding

After all requests are generated, we assert that the embedding in the expected doc map matches that in the model cache. This fails because the embedding in the model cache was overwritten.

This new logic fixes the problem by first checking if model already has a cached embedding for the input. If it does, we use it. If it doesn't, we generate a new random embedding and add it to the model cache.
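
The difference between the two behaviours described above can be sketched in a few lines of standalone Java. Everything here is a simplified stand-in: `EmbeddingCacheSketch`, `alwaysGenerate`, and `reuseOrGenerate` are hypothetical names, a plain `Map` replaces the model's result cache, and a counter replaces random embedding generation so that the outcome is deterministic.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the test model's result cache; not the PR's code.
final class EmbeddingCacheSketch {
    private final Map<String, float[]> modelCache = new HashMap<>();
    private int counter = 0;

    // Stand-in for a random embedding: every call returns a different vector.
    private float[] freshEmbedding() {
        counter++;
        return new float[] { counter, counter * 0.5f };
    }

    // Previous behaviour: always generate, overwriting any cached embedding.
    float[] alwaysGenerate(String input) {
        float[] embedding = freshEmbedding();
        modelCache.put(input, embedding);
        return embedding;
    }

    // Updated behaviour: reuse the cached embedding if one exists.
    float[] reuseOrGenerate(String input) {
        return modelCache.computeIfAbsent(input, key -> freshEmbedding());
    }

    float[] cached(String input) {
        return modelCache.get(input);
    }

    public static void main(String[] args) {
        EmbeddingCacheSketch before = new EmbeddingCacheSketch();
        float[] expected = before.alwaysGenerate("true"); // kept in the expected doc map
        before.alwaysGenerate("true");                    // same input in a later bulk request
        System.out.println("always generate: " + Arrays.equals(expected, before.cached("true")));

        EmbeddingCacheSketch after = new EmbeddingCacheSketch();
        expected = after.reuseOrGenerate("true");
        after.reuseOrGenerate("true");
        System.out.println("reuse or generate: " + Arrays.equals(expected, after.cached("true")));
    }
}
```

In the first case the expected embedding no longer matches the cache after the second request. In the second case it still does, because the cached value is never replaced.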

My point is, wouldn't it be simpler not to generate the duplicate input values, and thus avoid managing the results as it happens?

...rence/src/yamlRestTest/resources/rest-api-spec/test/inference/30_semantic_text_inference.yml

elasticsearchmachine · 2024-07-26T15:19:22Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

elasticsearchmachine · 2024-07-26T15:19:22Z

Pinging @elastic/search-relevance (Team:Search - Relevance)

elasticsearchmachine · 2024-07-26T15:19:22Z

Pinging @elastic/es-search-foundations (Team:Search Foundations)

elasticsearchmachine · 2024-07-26T15:19:22Z

Pinging @elastic/ent-search-eng (Team:SearchOrg)

elasticsearchmachine · 2024-07-26T15:20:25Z

Hi @Samiul-TheSoccerFan, I've created a changelog YAML for you.

kderusso

This looks good, nice work! You may need to update the changelog to get CI to pass, but otherwise looks good to me!

Mikep86

Looks good, thanks for iterating!

carlosdelest

LGTM. Nice work @Samiul-TheSoccerFan !

carlosdelest · 2024-07-29T09:35:10Z

...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java

+            if (model.hasResult(inputText)) {
+                ChunkedInferenceServiceResults results = model.getResults(inputText);
+                semanticTextField = semanticTextFieldFromChunkedInferenceResults(
+                    field,
+                    model,
+                    List.of(inputText),
+                    results,
+                    requestContentType
+                );
+            } else {
+                semanticTextField = randomSemanticText(field, model, List.of(inputText), requestContentType);
+                model.putResult(inputText, toChunkedResult(semanticTextField));
+            }


I see - we could maybe have used ESTestCase.randomValueOtherThanMany() to a similar effect. That would ensure that the random value is not already in the model, rather than just trying twice. AFAIU we should loop until we find a value that is not in the results?
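
For reference, the "loop until we find an unused value" idea can be sketched as follows. This is a standalone illustration, not the ESTestCase helper itself: `pickUnused` is a hypothetical, bounded variant written for this example, and the retry limit is an assumption added so that the loop always terminates.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Random;
import java.util.Set;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical sketch of rejecting already-used random inputs.
final class UniqueInputSketch {

    // Draws from the supplier until a value passes, or gives up after maxAttempts.
    static <T> Optional<T> pickUnused(Predicate<T> alreadyUsed, Supplier<T> supplier, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            T candidate = supplier.get();
            if (!alreadyUsed.test(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Random random = new Random(7L);
        Set<Object> used = new HashSet<>();
        for (int request = 1; request <= 3; request++) {
            Optional<Object> fresh = pickUnused(used::contains, () -> (Object) random.nextBoolean(), 1_000);
            fresh.ifPresent(used::add);
            System.out.println("request " + request + ": "
                + fresh.map(Object::toString).orElse("no unused boolean left"));
        }
    }
}
```

The sketch also shows the limit of this approach for Boolean inputs: once both `true` and `false` have been used, no unused value remains, so a test that generates 100+ requests cannot rely on uniqueness alone for that type.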

carlosdelest · 2024-07-29T09:38:06Z

...inference/src/yamlRestTest/resources/rest-api-spec/test/inference/40_semantic_text_query.yml

+
+  - do:
+      headers:
+        # Force JSON content type so that we use a parser that interprets the floating-point score as a double


Is content type needed here, as we're using booleans?

This is a copy-paste from other YAML tests so that we can compare scores using the YAML assertions. Fun fact: If the test uses the SMILE format (which it will randomly do, unless you force JSON as is done here), then scores in search responses will be parsed as float, breaking the ability to check them using YAML assertions (which take double values).

We don't compare scores in this particular test, but it should be harmless to leave this so as to not create a landmine if we add score comparison in the future.
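
The float-versus-double mismatch described above is easy to reproduce in plain Java. The sketch below is only an illustration of the numeric effect; the class and method names are hypothetical, and it does not involve the actual SMILE or JSON parsers.

```java
// Hypothetical illustration of why a float-parsed score may not equal a double literal.
final class FloatVsDoubleSketch {

    // Compares a value parsed as float with an expected double, the way a
    // double-based assertion would see it after widening.
    static boolean matches(float parsedScore, double expectedScore) {
        double widened = parsedScore; // widening keeps the float's rounding error
        return widened == expectedScore;
    }

    public static void main(String[] args) {
        System.out.println(matches(0.1f, 0.1)); // prints false
        System.out.println(matches(0.5f, 0.5)); // prints true: 0.5 is exact in both types
    }
}
```

Values that are not exactly representable in binary, which covers most relevance scores, differ once the float is widened to double, so an equality assertion against a double would fail.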

Got it - thought it was referring to the actual values we were indexing instead of the score, as I saw no scores involved in the test.

I'd say if it's not needed, don't add it - it confused me and probably will confuse others 🤷

carlosdelest · 2024-07-29T09:42:26Z

...inference/src/test/java/org/elasticsearch/xpack/inference/mapper/SemanticTextFieldTests.java

+    public static Object randomSemanticTextInput() {
+        int randomInt = randomIntBetween(0, 5);
+        return switch (randomInt) {
+            case 0 -> randomAlphaOfLengthBetween(10, 20);


Nit - Maybe we should have less priority for using boolean / numbers via rarely()? Something like

if (rarely()) {
    return switch (randomIntBetween(0, 4)) {
        case 0 -> randomInt();
        case 1 -> randomLong();
        case 2 -> randomFloat();
        case 3 -> randomBoolean();
        default -> randomDouble();
    };
} else {
    return randomAlphaOfLengthBetween(10, 20);
}
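
A standalone version of this suggestion, runnable without the Elasticsearch test framework, might look like the sketch below. It is an approximation under stated assumptions: `java.util.Random` replaces the framework's randomness, the one-in-ten chance is a placeholder for whatever probability `rarely()` uses, and `RandomInputSketch` and `randomAlpha` are hypothetical names.

```java
import java.util.Random;

// Hypothetical, framework-free sketch of the weighted input generator.
final class RandomInputSketch {

    static Object randomSemanticTextInput(Random random) {
        // Assumed stand-in for rarely(): roughly one call in ten.
        if (random.nextInt(10) == 0) {
            return switch (random.nextInt(5)) {
                case 0 -> random.nextInt();
                case 1 -> random.nextLong();
                case 2 -> random.nextFloat();
                case 3 -> random.nextBoolean();
                default -> random.nextDouble();
            };
        }
        // Most of the time, return a lowercase string of 10 to 20 characters.
        return randomAlpha(random, 10 + random.nextInt(11));
    }

    private static String randomAlpha(Random random, int length) {
        StringBuilder builder = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            builder.append((char) ('a' + random.nextInt(26)));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        Random random = new Random(1L);
        for (int i = 0; i < 5; i++) {
            Object value = randomSemanticTextInput(random);
            System.out.println(value.getClass().getSimpleName() + ": " + value);
        }
    }
}
```

Weighting the generator toward strings keeps numeric and Boolean inputs covered while reducing how often the two Boolean values repeat across requests.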

elasticsearchmachine · 2024-07-29T13:12:12Z

Hi @Samiul-TheSoccerFan, I've updated the changelog YAML for you.

carlosdelest · 2024-07-29T15:32:29Z

...inference/src/yamlRestTest/resources/rest-api-spec/test/inference/40_semantic_text_query.yml

+
+  - do:
+      headers:
+        # Force JSON content type so that we use a parser that interprets the floating-point score as a double


Got it - thought it was referring to the actual values we were indexing instead of the score, as I saw no scores involved in the test.

I'd say if it's not needed, don't add it - it confused me and probably will confuse others 🤷

carlosdelest · 2024-07-29T15:40:58Z

...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java

+            if (model.hasResult(inputText)) {
+                ChunkedInferenceServiceResults results = model.getResults(inputText);
+                semanticTextField = semanticTextFieldFromChunkedInferenceResults(
+                    field,
+                    model,
+                    List.of(inputText),
+                    results,
+                    requestContentType
+                );
+            } else {
+                semanticTextField = randomSemanticText(field, model, List.of(inputText), requestContentType);
+                model.putResult(inputText, toChunkedResult(semanticTextField));
+            }


My point is, wouldn't it be simpler not to generate the duplicate input values, and thus avoid managing the results as it happens?

Samiul-TheSoccerFan · 2024-07-29T19:54:05Z

...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java

+            // The model is not field aware and that is why we are skipping the embedding generation process for existing values.
+            // This prevents a situation where embeddings in the expected docMap do not match those in the model, which could happen if
+            // embeddings were overwritten.


Do you think this comment makes sense @Mikep86?

Looks good to me, thanks for the iterations!

* upstream/main: (105 commits)
  Removing the use of watcher stats from WatchAcTests (elastic#111435)
  Mute org.elasticsearch.xpack.restart.FullClusterRestartIT testSingleDoc {cluster=UPGRADED} elastic#111434
  Make `EnrichPolicyRunner` more properly async (elastic#111321)
  Mute org.elasticsearch.xpack.restart.FullClusterRestartIT testSingleDoc {cluster=OLD} elastic#111430
  Mute org.elasticsearch.xpack.esql.expression.function.aggregate.ValuesTests testGroupingAggregate {TestCase=<long unicode KEYWORDs>} elastic#111428
  Mute org.elasticsearch.xpack.esql.expression.function.aggregate.ValuesTests testGroupingAggregate {TestCase=<long unicode TEXTs>} elastic#111429
  Mute org.elasticsearch.xpack.repositories.metering.azure.AzureRepositoriesMeteringIT org.elasticsearch.xpack.repositories.metering.azure.AzureRepositoriesMeteringIT elastic#111307
  Update semantic_text field to support indexing numeric and boolean data types (elastic#111284)
  Mute org.elasticsearch.repositories.blobstore.testkit.AzureSnapshotRepoTestKitIT testRepositoryAnalysis elastic#111280
  Ensure vector similarity correctly limits inner_hits returned for nested kNN (elastic#111363)
  Fix LogsIndexModeFullClusterRestartIT (elastic#111362)
  Remove 4096 bool query max limit from docs (elastic#111421)
  Fix score count validation in reranker response (elastic#111212)
  Integrate data generator in LogsDB mode challenge test (elastic#111303)
  ESQL: Add COUNT and COUNT_DISTINCT aggregation tests (elastic#111409)
  [Service Account] Add AutoOps account (elastic#111316)
  [ML] Fix failing test DetectionRulesTests.testEqualsAndHashcode (elastic#111351)
  [ML] Create and inject APM Inference Metrics (elastic#111293)
  [DOCS] Additional reranking docs updates (elastic#111350)
  Mute org.elasticsearch.repositories.azure.RepositoryAzureClientYamlTestSuiteIT org.elasticsearch.repositories.azure.RepositoryAzureClientYamlTestSuiteIT elastic#111345
  ...

# Conflicts:
#   server/src/main/java/org/elasticsearch/TransportVersions.java

elasticsearchmachine added the v8.16.0 label Jul 25, 2024

Samiul-TheSoccerFan added 5 commits July 25, 2024 11:19

adding support for additional data types

1c9b03f

Adding unit tests for additional data types

69fa91d

updating integration tests to feed random data types

b3b3cf9

Fix code styles by running spotlessApply

b65f4b3

Adding yml tests for additional data type support

168fdf7

Samiul-TheSoccerFan force-pushed the semantic-text-other-field-type branch from 0e0add3 to 168fdf7 Compare July 25, 2024 18:25

fix failed yml tests and added tests for dense and boolean type

08ff9dd

Mikep86 added :Search Foundations/Mapping Index mappings, including merging and defining field types :Search Relevance/Vectors Vector search labels Jul 26, 2024

Mikep86 reviewed Jul 26, 2024

Mikep86 requested a review from carlosdelest July 26, 2024 12:49

Mikep86 changed the title ~~Adding additional Data type in Semantic Search~~ Update semantic_text field to support indexing numeric and boolean data types Jul 26, 2024

Mikep86 added >feature :SearchOrg/Relevance Label for the Search (solution/org) Relevance team labels Jul 26, 2024

Samiul-TheSoccerFan added 2 commits July 26, 2024 11:00

Removed util class and moved the random function into own specific test files

3692755

rewrite the terms to match most up to date terminology

85f25b5

Samiul-TheSoccerFan marked this pull request as ready for review July 26, 2024 15:18

Update docs/changelog/111284.yaml

9b5bf98

kderusso approved these changes Jul 26, 2024

Samiul-TheSoccerFan added 4 commits July 26, 2024 11:35

update changelog yml text to fit into one line

a1651a7

limit changelog limit to only 1 area

44fa931

Updating text_expansion with sparse_embedding to keep the terminalogy up to date

362bd62

refactoring randomSemanticTextInput function

1eaded4

Mikep86 approved these changes Jul 26, 2024

carlosdelest approved these changes Jul 29, 2024

benwtrent added >enhancement and removed >feature labels Jul 29, 2024

Update docs/changelog/111284.yaml

74ee5eb

carlosdelest approved these changes Jul 29, 2024

Adding comments and addressing nitpiks

732691b

Samiul-TheSoccerFan commented Jul 29, 2024

Samiul-TheSoccerFan merged commit b601e3b into elastic:main Jul 29, 2024

Samiul-TheSoccerFan deleted the semantic-text-other-field-type branch July 29, 2024 20:20

6 participants
