Support configurable chunking in semantic_text fields #121041

Merged: kderusso merged 99 commits into elastic:main on Apr 3, 2025
Conversation
Mikep86 (Contributor) approved these changes on Apr 3, 2025:

Looks good, thank you for the iterations on this!
Contributor comment:

I noticed a completely unrelated bug in SentenceBoundaryChunkingSettings while reviewing this: equals does not consider sentenceOverlap.
Author (Member) reply:

We can spin up a quick and dirty followup PR for that, not addressing it here.
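For context, the kind of fix that follow-up would need can be sketched with a simplified, hypothetical stand-in class (the real SentenceBoundaryChunkingSettings has more fields and lives in the inference plugin). The reported bug is an equals() that skips sentenceOverlap, so two settings objects differing only in overlap compare as equal; the fix is the extra comparison below.

```java
import java.util.Objects;

// Simplified, hypothetical stand-in for SentenceBoundaryChunkingSettings.
final class SentenceChunkingSettings {
    private final int maxChunkSize;
    private final int sentenceOverlap;

    SentenceChunkingSettings(int maxChunkSize, int sentenceOverlap) {
        this.maxChunkSize = maxChunkSize;
        this.sentenceOverlap = sentenceOverlap;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        SentenceChunkingSettings that = (SentenceChunkingSettings) o;
        return maxChunkSize == that.maxChunkSize
            && sentenceOverlap == that.sentenceOverlap; // the comparison the bug omitted
    }

    @Override
    public int hashCode() {
        // hashCode must cover the same fields as equals
        return Objects.hash(maxChunkSize, sentenceOverlap);
    }
}
```

Without the sentenceOverlap comparison, settings updates that change only the overlap could be silently treated as no-ops by equality checks.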
Collaborator:

💔 Backport failed. You can use sqren/backport to backport manually.
Author (Member):

💚 All backports created successfully.

Questions? Please refer to the Backport tool documentation.
andreidan pushed a commit to andreidan/elasticsearch that referenced this pull request on Apr 9, 2025
- test
- Revert "test" (this reverts commit 9f4e2ad)
- Refactor InferenceService to allow passing in chunking settings
- Add chunking config to inference field metadata and store in semantic_text field
- Fix test compilation errors
- Hacking around trying to get ingest to work
- Debugging
- [CI] Auto commit changes from spotless
- POC works and update TODO to fix this
- [CI] Auto commit changes from spotless
- Refactor chunking settings from model settings to field inference request
- A bit of cleanup
- Revert a bunch of changes to try to narrow down what broke CI
- test
- Revert "test" (this reverts commit 9f4e2ad)
- Fix InferenceFieldMetadataTest
- [CI] Auto commit changes from spotless
- Add chunking settings back in
- Update builder to use new map
- Fix compilation errors after merge
- Debugging tests
- debugging
- Cleanup
- Add yaml test
- Update tests
- Add chunking to test inference service
- Trying to get tests to work
- Shard bulk inference test never specifies chunking settings
- Fix test
- Always process batches in order
- Fix chunking in test inference service and yaml tests
- [CI] Auto commit changes from spotless
- Refactor: remove convenience method with default chunking settings
- Fix ShardBulkInferenceActionFilterTests
- Fix ElasticsearchInternalServiceTests
- Fix SemanticTextFieldMapperTests
- [CI] Auto commit changes from spotless
- Fix test data to fit within bounds
- Add additional yaml test cases
- Playing with xcontent parsing
- A little cleanup
- Update docs/changelog/121041.yaml
- Fix failures introduced by merge
- [CI] Auto commit changes from spotless
- Address PR feedback
- [CI] Auto commit changes from spotless
- Fix predicate in updated test
- Better handling of null/empty ChunkingSettings
- Update parsing settings
- Fix errors post merge
- PR feedback
- [CI] Auto commit changes from spotless
- PR feedback and fix Xcontent parsing for SemanticTextField
- Remove chunking settings check to use what's passed in from sender service
- Fix some tests
- Cleanup
- Test failure whack-a-mole
- Cleanup
- Refactor to handle memory optimized bulk shard inference actions (this is ugly but at least it compiles)
- [CI] Auto commit changes from spotless
- Minor cleanup
- A bit more cleanup
- Spotless
- Revert change
- Update chunking setting update logic
- Go back to serializing maps
- Revert change to model settings (source still errors on missing model_id)
- Fix updating chunking settings
- Look up model if null
- Fix test
- Work around elastic#125723 in semantic text field serialization
- Add BWC tests
- Add chunking_settings to docs
- Refactor/rename
- Address minor PR feedback
- Add test case for null update
- PR feedback: adjust refactor of chunked inputs
- Refactored AbstractTestInferenceService to return offsets instead of just Strings
- [CI] Auto commit changes from spotless
- Fix tests where chunk output was of size 3
- Update mappings per PR feedback
- PR Feedback
- Fix problems related to merge
- PR optimization
- Fix test
- Delete extra file

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
kderusso added a commit that referenced this pull request on Apr 10, 2025

Support configurable chunking in semantic_text fields (#121041) (#126545)

Backport of the squashed commits above (cherry picked from commit e7d4a28).

Conflicts:
- docs/reference/elasticsearch/mapping-reference/semantic-text.md
- server/src/main/java/org/elasticsearch/TransportVersions.java
- x-pack/plugin/inference/qa/test-service-plugin/src/main/java/org/elasticsearch/xpack/inference/mock/TestSparseInferenceServiceExtension.java
- x-pack/plugin/inference/src/main/java/module-info.java
- x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/InferenceFeatures.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/EmbeddingRequestChunkerTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/amazonbedrock/AmazonBedrockServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/azureaistudio/AzureAiStudioServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/azureopenai/AzureOpenAiServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/cohere/CohereServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/elastic/ElasticInferenceServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/huggingface/HuggingFaceElserServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/huggingface/HuggingFaceServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/mistral/MistralServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/openai/OpenAiServiceTests.java

Additional commits in the backport:
- Delete docs/reference/elasticsearch/mapping-reference/semantic-text.md
- [CI] Auto commit changes from spotless
- Fix error after merge conflict resolution
- Fix broken test

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
This was referenced on Apr 10, 2025

DonalEvans added a commit to DonalEvans/elasticsearch that referenced this pull request on Sep 22, 2025
This commit restores the behaviour introduced in elastic#125837 which was inadvertently undone by changes in elastic#121041, specifically, delaying copying Strings as part of calling Request.chunkText() until the request is being executed.

In addition to the above change, refactor doChunkedInfer() and its implementations to take a List<ChunkInferenceInput> rather than EmbeddingsInput, since the EmbeddingsInput passed into doChunkedInfer() was immediately discarded after extracting the ChunkInferenceInput list from it. This change allowed the EmbeddingsInput class to be refactored to not know about ChunkInferenceInput, simplifying it significantly.

This commit also simplifies EmbeddingRequestChunker.Request to take only the input String rather than the entire list of all inputs, since only one input is actually needed. This change prevents Requests from retaining a reference to the input list, potentially allowing it to be GC'd faster.
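The design choice in that last point can be sketched as follows. This is a simplified, hypothetical version (the real EmbeddingRequestChunker.Request lives in the inference plugin and carries more state): each request captures only the one string it needs, so no outstanding request pins the full input list in memory.

```java
import java.util.List;

// Simplified sketch: a per-chunk request that holds only its own input
// string rather than the whole input list, so the list is eligible for
// garbage collection once all requests have been created.
final class ChunkRequest {
    private final String input;     // the single chunk this request embeds
    private final int chunkIndex;   // position of the chunk in the original input

    ChunkRequest(String input, int chunkIndex) {
        this.input = input;
        this.chunkIndex = chunkIndex;
    }

    // Any copying/substring work is deferred until the request executes.
    String chunkText() {
        return input;
    }

    int chunkIndex() {
        return chunkIndex;
    }

    // Build one request per input; afterwards the caller may drop `inputs`.
    static ChunkRequest[] fromInputs(List<String> inputs) {
        ChunkRequest[] requests = new ChunkRequest[inputs.size()];
        for (int i = 0; i < inputs.size(); i++) {
            requests[i] = new ChunkRequest(inputs.get(i), i);
        }
        return requests;
    }
}
```

The alternative (each Request holding the whole list plus an index) keeps the entire list reachable for as long as any single request is outstanding.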
DonalEvans added a commit to DonalEvans/elasticsearch that referenced this pull request on Sep 23, 2025
DonalEvans added a commit to DonalEvans/elasticsearch that referenced this pull request on Sep 23, 2025
DonalEvans added a commit to DonalEvans/elasticsearch that referenced this pull request on Sep 23, 2025
DonalEvans added a commit that referenced this pull request on Sep 25, 2025
DonalEvans added a commit to DonalEvans/elasticsearch that referenced this pull request on Sep 25, 2025

(cherry picked from commit f3447d3)

Conflicts:
- x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/ai21/Ai21Service.java
- x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/llama/LlamaService.java
- x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/llama/action/LlamaActionCreator.java
- x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/mistral/MistralService.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/SenderServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/alibabacloudsearch/AlibabaCloudSearchCompletionRequestManagerTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/alibabacloudsearch/action/AlibabaCloudSearchActionCreatorTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/azureaistudio/action/AzureAiStudioActionAndCreatorTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/llama/action/LlamaActionCreatorTests.java
DonalEvans added a commit to DonalEvans/elasticsearch that referenced this pull request on Sep 25, 2025

(cherry picked from commit f3447d3)
DonalEvans added a commit that referenced this pull request on Sep 25, 2025

(cherry picked from commit f3447d3)
DonalEvans added a commit that referenced this pull request on Sep 25, 2025

(cherry picked from commit f3447d3)
Supports configurable chunking in semantic_text fields that will override the model defaults.

Example script:
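The original example script was elided in this capture. As a minimal sketch of what such a request might look like, here is a mapping that overrides the default chunking for a semantic_text field via the chunking_settings object this PR adds; the index name, field name, inference endpoint ID, and parameter values below are illustrative, not taken from the PR:

```console
PUT chunking-test-index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "semantic_text",
        "inference_id": "my-inference-endpoint",
        "chunking_settings": {
          "strategy": "sentence",
          "max_chunk_size": 250,
          "sentence_overlap": 1
        }
      }
    }
  }
}
```

With the sentence strategy, max_chunk_size bounds the chunk size in words and sentence_overlap controls how many sentences are shared between adjacent chunks; a word strategy with max_chunk_size and overlap parameters is the other documented option.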