Support configurable chunking in semantic_text fields #121041

Merged: kderusso merged 99 commits into elastic:main on Apr 3, 2025
Conversation
Mikep86 (Contributor) approved these changes on Apr 3, 2025:

Looks good, thank you for the iterations on this!
Contributor comment:

I noticed a completely unrelated bug in SentenceBoundaryChunkingSettings while reviewing this: equals does not consider sentenceOverlap.
Author (Member) reply:

We can spin up a quick and dirty followup PR for that, not addressing it here.
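For context, the kind of fix that follow-up would need can be sketched with a simplified, hypothetical stand-in class (the real SentenceBoundaryChunkingSettings has more fields and lives in the inference plugin). The reported bug is an equals() that skips sentenceOverlap, so two settings objects differing only in overlap compare as equal; the fix is the extra comparison below.

```java
import java.util.Objects;

// Simplified, hypothetical stand-in for SentenceBoundaryChunkingSettings.
final class SentenceChunkingSettings {
    private final int maxChunkSize;
    private final int sentenceOverlap;

    SentenceChunkingSettings(int maxChunkSize, int sentenceOverlap) {
        this.maxChunkSize = maxChunkSize;
        this.sentenceOverlap = sentenceOverlap;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        SentenceChunkingSettings that = (SentenceChunkingSettings) o;
        return maxChunkSize == that.maxChunkSize
            && sentenceOverlap == that.sentenceOverlap; // the comparison the bug omitted
    }

    @Override
    public int hashCode() {
        // hashCode must cover the same fields as equals
        return Objects.hash(maxChunkSize, sentenceOverlap);
    }
}
```

Without the sentenceOverlap comparison, settings updates that change only the overlap could be silently treated as no-ops by equality checks.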
Collaborator:

💔 Backport failed. You can use sqren/backport to backport manually.
Author (Member):

💚 All backports created successfully.

Questions? Please refer to the Backport tool documentation.
andreidan pushed a commit to andreidan/elasticsearch that referenced this pull request on Apr 9, 2025
- test
- Revert "test" (this reverts commit 9f4e2ad)
- Refactor InferenceService to allow passing in chunking settings
- Add chunking config to inference field metadata and store in semantic_text field
- Fix test compilation errors
- Hacking around trying to get ingest to work
- Debugging
- [CI] Auto commit changes from spotless
- POC works and update TODO to fix this
- [CI] Auto commit changes from spotless
- Refactor chunking settings from model settings to field inference request
- A bit of cleanup
- Revert a bunch of changes to try to narrow down what broke CI
- test
- Revert "test" (this reverts commit 9f4e2ad)
- Fix InferenceFieldMetadataTest
- [CI] Auto commit changes from spotless
- Add chunking settings back in
- Update builder to use new map
- Fix compilation errors after merge
- Debugging tests
- debugging
- Cleanup
- Add yaml test
- Update tests
- Add chunking to test inference service
- Trying to get tests to work
- Shard bulk inference test never specifies chunking settings
- Fix test
- Always process batches in order
- Fix chunking in test inference service and yaml tests
- [CI] Auto commit changes from spotless
- Refactor: remove convenience method with default chunking settings
- Fix ShardBulkInferenceActionFilterTests
- Fix ElasticsearchInternalServiceTests
- Fix SemanticTextFieldMapperTests
- [CI] Auto commit changes from spotless
- Fix test data to fit within bounds
- Add additional yaml test cases
- Playing with xcontent parsing
- A little cleanup
- Update docs/changelog/121041.yaml
- Fix failures introduced by merge
- [CI] Auto commit changes from spotless
- Address PR feedback
- [CI] Auto commit changes from spotless
- Fix predicate in updated test
- Better handling of null/empty ChunkingSettings
- Update parsing settings
- Fix errors post merge
- PR feedback
- [CI] Auto commit changes from spotless
- PR feedback and fix Xcontent parsing for SemanticTextField
- Remove chunking settings check to use what's passed in from sender service
- Fix some tests
- Cleanup
- Test failure whack-a-mole
- Cleanup
- Refactor to handle memory optimized bulk shard inference actions (this is ugly but at least it compiles)
- [CI] Auto commit changes from spotless
- Minor cleanup
- A bit more cleanup
- Spotless
- Revert change
- Update chunking setting update logic
- Go back to serializing maps
- Revert change to model settings (source still errors on missing model_id)
- Fix updating chunking settings
- Look up model if null
- Fix test
- Work around elastic#125723 in semantic text field serialization
- Add BWC tests
- Add chunking_settings to docs
- Refactor/rename
- Address minor PR feedback
- Add test case for null update
- PR feedback: adjust refactor of chunked inputs
- Refactored AbstractTestInferenceService to return offsets instead of just Strings
- [CI] Auto commit changes from spotless
- Fix tests where chunk output was of size 3
- Update mappings per PR feedback
- PR Feedback
- Fix problems related to merge
- PR optimization
- Fix test
- Delete extra file

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
kderusso added a commit that referenced this pull request on Apr 10, 2025

Support configurable chunking in semantic_text fields (#121041) (#126545)

Backport of the squashed commits above (cherry picked from commit e7d4a28).

Conflicts:
- docs/reference/elasticsearch/mapping-reference/semantic-text.md
- server/src/main/java/org/elasticsearch/TransportVersions.java
- x-pack/plugin/inference/qa/test-service-plugin/src/main/java/org/elasticsearch/xpack/inference/mock/TestSparseInferenceServiceExtension.java
- x-pack/plugin/inference/src/main/java/module-info.java
- x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/InferenceFeatures.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/EmbeddingRequestChunkerTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/amazonbedrock/AmazonBedrockServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/azureaistudio/AzureAiStudioServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/azureopenai/AzureOpenAiServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/cohere/CohereServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/elastic/ElasticInferenceServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/huggingface/HuggingFaceElserServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/huggingface/HuggingFaceServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/mistral/MistralServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/openai/OpenAiServiceTests.java

Additional commits in the backport:
- Delete docs/reference/elasticsearch/mapping-reference/semantic-text.md
- [CI] Auto commit changes from spotless
- Fix error after merge conflict resolution
- Fix broken test

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
This was referenced on Apr 10, 2025

DonalEvans added a commit to DonalEvans/elasticsearch that referenced this pull request on Sep 22, 2025
This commit restores the behaviour introduced in elastic#125837 which was inadvertently undone by changes in elastic#121041, specifically, delaying copying Strings as part of calling Request.chunkText() until the request is being executed.

In addition to the above change, refactor doChunkedInfer() and its implementations to take a List<ChunkInferenceInput> rather than EmbeddingsInput, since the EmbeddingsInput passed into doChunkedInfer() was immediately discarded after extracting the ChunkInferenceInput list from it. This change allowed the EmbeddingsInput class to be refactored to not know about ChunkInferenceInput, simplifying it significantly.

This commit also simplifies EmbeddingRequestChunker.Request to take only the input String rather than the entire list of all inputs, since only one input is actually needed. This change prevents Requests from retaining a reference to the input list, potentially allowing it to be GC'd faster.
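The design choice in that last point can be sketched as follows. This is a simplified, hypothetical version (the real EmbeddingRequestChunker.Request lives in the inference plugin and carries more state): each request captures only the one string it needs, so no outstanding request pins the full input list in memory.

```java
import java.util.List;

// Simplified sketch: a per-chunk request that holds only its own input
// string rather than the whole input list, so the list is eligible for
// garbage collection once all requests have been created.
final class ChunkRequest {
    private final String input;     // the single chunk this request embeds
    private final int chunkIndex;   // position of the chunk in the original input

    ChunkRequest(String input, int chunkIndex) {
        this.input = input;
        this.chunkIndex = chunkIndex;
    }

    // Any copying/substring work is deferred until the request executes.
    String chunkText() {
        return input;
    }

    int chunkIndex() {
        return chunkIndex;
    }

    // Build one request per input; afterwards the caller may drop `inputs`.
    static ChunkRequest[] fromInputs(List<String> inputs) {
        ChunkRequest[] requests = new ChunkRequest[inputs.size()];
        for (int i = 0; i < inputs.size(); i++) {
            requests[i] = new ChunkRequest(inputs.get(i), i);
        }
        return requests;
    }
}
```

The alternative (each Request holding the whole list plus an index) keeps the entire list reachable for as long as any single request is outstanding.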
DonalEvans added a commit to DonalEvans/elasticsearch that referenced this pull request on Sep 23, 2025
DonalEvans added a commit to DonalEvans/elasticsearch that referenced this pull request on Sep 23, 2025
DonalEvans added a commit to DonalEvans/elasticsearch that referenced this pull request on Sep 23, 2025
DonalEvans added a commit that referenced this pull request on Sep 25, 2025
DonalEvans added a commit to DonalEvans/elasticsearch that referenced this pull request on Sep 25, 2025

(cherry picked from commit f3447d3)

Conflicts:
- x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/ai21/Ai21Service.java
- x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/llama/LlamaService.java
- x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/llama/action/LlamaActionCreator.java
- x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/mistral/MistralService.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/SenderServiceTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/alibabacloudsearch/AlibabaCloudSearchCompletionRequestManagerTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/alibabacloudsearch/action/AlibabaCloudSearchActionCreatorTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/azureaistudio/action/AzureAiStudioActionAndCreatorTests.java
- x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/services/llama/action/LlamaActionCreatorTests.java
DonalEvans added a commit to DonalEvans/elasticsearch that referenced this pull request on Sep 25, 2025

(cherry picked from commit f3447d3)
DonalEvans added a commit that referenced this pull request on Sep 25, 2025

(cherry picked from commit f3447d3)
DonalEvans added a commit that referenced this pull request on Sep 25, 2025

(cherry picked from commit f3447d3)
Supports configurable chunking in semantic_text fields that will override the model defaults.

Example script:
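The original example script was elided in this capture. As a minimal sketch of what such a request might look like, here is a mapping that overrides the default chunking for a semantic_text field via the chunking_settings object this PR adds; the index name, field name, inference endpoint ID, and parameter values below are illustrative, not taken from the PR:

```console
PUT chunking-test-index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "semantic_text",
        "inference_id": "my-inference-endpoint",
        "chunking_settings": {
          "strategy": "sentence",
          "max_chunk_size": 250,
          "sentence_overlap": 1
        }
      }
    }
  }
}
```

With the sentence strategy, max_chunk_size bounds the chunk size in words and sentence_overlap controls how many sentences are shared between adjacent chunks; a word strategy with max_chunk_size and overlap parameters is the other documented option.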