Skip to content

Conversation

@Samiul-TheSoccerFan
Copy link
Contributor

@Samiul-TheSoccerFan Samiul-TheSoccerFan commented Jul 25, 2024

Adding support for additional data types in the Semantic_text field. Initially, it only supported String and Collection of String values. This PR enables the options to ingest other data types (Number and Boolean) with the Semantic_text field and retrieve the appropriate documents when queried.

PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}

PUT test-sparse
{
    "mappings": {
        "properties": {
            "test_field": {
                "type": "semantic_text",
                "inference_id": "my-elser-model"
            }
        }
    }
}

PUT test-sparse/_doc/doc1
{
    "test_field": "42"
}
PUT test-sparse/_doc/doc2
{
    "test_field": 501.11
}
PUT test-sparse/_doc/doc3
{
    "test_field": 100
}
PUT test-sparse/_doc/doc4
{
    "test_field": true
}
PUT test-sparse/_doc/doc5
{
    "test_field": ["any string", 77, false, "false"]
}

GET test-sparse/_search?_source_excludes=_semantic_text_inference
{
    "query": {
        "semantic": {
            "field": "test_field",
            "query": "42"           
        }
    }
}

The response will be something like:
Screenshot 2024-07-25 at 3 57 37 PM

@Samiul-TheSoccerFan Samiul-TheSoccerFan force-pushed the semantic-text-other-field-type branch from 0e0add3 to 168fdf7 Compare July 25, 2024 18:25
@Mikep86 Mikep86 added :Search Foundations/Mapping Index mappings, including merging and defining field types :Search Relevance/Vectors Vector search labels Jul 26, 2024
Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work 🙌 ! Left some minor comments

Comment on lines +347 to +359
if (model.hasResult(inputText)) {
ChunkedInferenceServiceResults results = model.getResults(inputText);
semanticTextField = semanticTextFieldFromChunkedInferenceResults(
field,
model,
List.of(inputText),
results,
requestContentType
);
} else {
semanticTextField = randomSemanticText(field, model, List.of(inputText), requestContentType);
model.putResult(inputText, toChunkedResult(semanticTextField));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carlosdelest We had to make this change because the inference result cache in model is not field-aware. Now that our input can be of many data types (including Boolean, with only two values), we are nearly guaranteed to hit value collisions across 100+ bulk requests. This caused test failures with the previous logic because different random embeddings would be generated every time we saw the value "true" (for example).

This updated logic checks if the inference result cache already has results for the value, and uses them if it does.

Copy link
Member

@carlosdelest carlosdelest Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see - we could maybe have used ESTestCase.randomValueOtherThanMany()to a similar effect. That would ensure that the random value is not in the model, and not just trying twice - AFAIU we should loop until we find a value that is not on the results?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm misunderstanding your comment, but I don't think ESTestCase.randomValueOtherThanMany() would help here. The issue is that the previous logic always generated a new embedding for the input, regardless of whether the model already had a cached value for that input. This caused test failures. Consider the following case:

  • We generate a random embedding for the input true
  • We write that embedding to the expected doc map and cache it in model
  • In a later bulk request, the input true is randomly generated again
  • We generate a different random embedding for the input true
  • We overwrite the cached embedding in model with the new embedding
  • After all requests are generated, we assert that the embedding in the expected doc map matches that in the model cache. This fails because the embedding in the model cache was overwritten.

This new logic fixes the problem by first checking if model already has a cached embedding for the input. If it does, we use it. If it doesn't, we generate a new random embedding and add it to the model cache.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is, wouldn't it be simpler not to generate the duplicate input values, and thus avoid managing the results as it happens?

@Mikep86 Mikep86 requested a review from carlosdelest July 26, 2024 12:49
@Mikep86 Mikep86 changed the title Adding additional Data type in Semantic Search Update semantic_text field to support indexing numeric and boolean data types Jul 26, 2024
@Mikep86 Mikep86 added >feature :SearchOrg/Relevance Label for the Search (solution/org) Relevance team labels Jul 26, 2024
@Samiul-TheSoccerFan Samiul-TheSoccerFan marked this pull request as ready for review July 26, 2024 15:18
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/search-relevance (Team:Search - Relevance)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ent-search-eng (Team:SearchOrg)

@elasticsearchmachine
Copy link
Collaborator

Hi @Samiul-TheSoccerFan, I've created a changelog YAML for you.

Copy link
Member

@kderusso kderusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, nice work! You may need to update the changelog to get CI to pass, but otherwise looks good to me!

Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for iterating!

Copy link
Member

@carlosdelest carlosdelest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice work @Samiul-TheSoccerFan !

Comment on lines +347 to +359
if (model.hasResult(inputText)) {
ChunkedInferenceServiceResults results = model.getResults(inputText);
semanticTextField = semanticTextFieldFromChunkedInferenceResults(
field,
model,
List.of(inputText),
results,
requestContentType
);
} else {
semanticTextField = randomSemanticText(field, model, List.of(inputText), requestContentType);
model.putResult(inputText, toChunkedResult(semanticTextField));
}
Copy link
Member

@carlosdelest carlosdelest Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see - we could maybe have used ESTestCase.randomValueOtherThanMany()to a similar effect. That would ensure that the random value is not in the model, and not just trying twice - AFAIU we should loop until we find a value that is not on the results?


- do:
headers:
# Force JSON content type so that we use a parser that interprets the floating-point score as a double
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is content type needed here, as we're using booleans?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a copy-paste from other YAML tests so that we can compare scores using the YAML assertions. Fun fact: If the test uses the SMILE format (which it will randomly do, unless you force JSON like is done here), then scores in search responses will be parsed as float, breaking the ability to check them using YAML assertions (which take double values).

We don't compare scores in this particular test, but it should be harmless to leave this so as to not create a landmine if we add score comparison in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it - thought it was referring to the actual values we were indexing instead of the score, as I saw no scores involved in the test.

I'd say if it's not needed, don't add it - it confused me and probably will confuse others 🤷

public static Object randomSemanticTextInput() {
int randomInt = randomIntBetween(0, 5);
return switch (randomInt) {
case 0 -> randomAlphaOfLengthBetween(10, 20);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit - Maybe we should have less priority for using boolean / numbers via rarely()? Something like

if (rarely) {
    return switch(randomIntBetween(0, 4)) {
            case 0 -> randomInt();
            case 1 -> randomLong();
            case 2 -> randomFloat();
            case 3 -> randomBoolean();
            case 4 -> randomDouble();
   }
} else {
   return randomAlphaOfLengthBetween(10, 20);
}

@elasticsearchmachine
Copy link
Collaborator

Hi @Samiul-TheSoccerFan, I've updated the changelog YAML for you.


- do:
headers:
# Force JSON content type so that we use a parser that interprets the floating-point score as a double
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it - thought it was referring to the actual values we were indexing instead of the score, as I saw no scores involved in the test.

I'd say if it's not needed, don't add it - it confused me and probably will confuse others 🤷

Comment on lines +347 to +359
if (model.hasResult(inputText)) {
ChunkedInferenceServiceResults results = model.getResults(inputText);
semanticTextField = semanticTextFieldFromChunkedInferenceResults(
field,
model,
List.of(inputText),
results,
requestContentType
);
} else {
semanticTextField = randomSemanticText(field, model, List.of(inputText), requestContentType);
model.putResult(inputText, toChunkedResult(semanticTextField));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is, wouldn't it be simpler not to generate the duplicate input values, and thus avoid managing the results as it happens?

Comment on lines +347 to +349
// The model is not field aware and that is why we are skipping the embedding generation process for existing values.
// This prevents a situation where embeddings in the expected docMap do not match those in the model, which could happen if
// embeddings were overwritten.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this comment makes sense @Mikep86?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks for the iterations!

@Samiul-TheSoccerFan Samiul-TheSoccerFan merged commit b601e3b into elastic:main Jul 29, 2024
@Samiul-TheSoccerFan Samiul-TheSoccerFan deleted the semantic-text-other-field-type branch July 29, 2024 20:20
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Jul 30, 2024
* upstream/main: (105 commits)
  Removing the use of watcher stats from WatchAcTests (elastic#111435)
  Mute org.elasticsearch.xpack.restart.FullClusterRestartIT testSingleDoc {cluster=UPGRADED} elastic#111434
  Make `EnrichPolicyRunner` more properly async (elastic#111321)
  Mute org.elasticsearch.xpack.restart.FullClusterRestartIT testSingleDoc {cluster=OLD} elastic#111430
  Mute org.elasticsearch.xpack.esql.expression.function.aggregate.ValuesTests testGroupingAggregate {TestCase=<long unicode KEYWORDs>} elastic#111428
  Mute org.elasticsearch.xpack.esql.expression.function.aggregate.ValuesTests testGroupingAggregate {TestCase=<long unicode TEXTs>} elastic#111429
  Mute org.elasticsearch.xpack.repositories.metering.azure.AzureRepositoriesMeteringIT org.elasticsearch.xpack.repositories.metering.azure.AzureRepositoriesMeteringIT elastic#111307
  Update semantic_text field to support indexing numeric and boolean data types (elastic#111284)
  Mute org.elasticsearch.repositories.blobstore.testkit.AzureSnapshotRepoTestKitIT testRepositoryAnalysis elastic#111280
  Ensure vector similarity correctly limits inner_hits returned for nested kNN (elastic#111363)
  Fix LogsIndexModeFullClusterRestartIT (elastic#111362)
  Remove 4096 bool query max limit from docs (elastic#111421)
  Fix score count validation in reranker response (elastic#111212)
  Integrate data generator in LogsDB mode challenge test (elastic#111303)
  ESQL: Add COUNT and COUNT_DISTINCT aggregation tests (elastic#111409)
  [Service Account] Add AutoOps account (elastic#111316)
  [ML] Fix failing test DetectionRulesTests.testEqualsAndHashcode (elastic#111351)
  [ML] Create and inject APM Inference Metrics (elastic#111293)
  [DOCS] Additional reranking docs updates (elastic#111350)
  Mute org.elasticsearch.repositories.azure.RepositoryAzureClientYamlTestSuiteIT org.elasticsearch.repositories.azure.RepositoryAzureClientYamlTestSuiteIT elastic#111345
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types :Search Relevance/Vectors Vector search :SearchOrg/Relevance Label for the Search (solution/org) Relevance team v8.16.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants