[ML] Fix timeout ingesting an empty string into a semantic_text field #117840

davidkyle · 2024-12-02T16:50:50Z

The sentence chunker returns 0 chunks if the input is an empty string. After chunking, the Elasticsearch inference service would send an empty list of requests to the ML node for processing, but the node never replied because there were no requests to process, so the ingest request eventually timed out.

The first change is for the sentence chunker to return "" if the input is ""; this behaviour is in line with the word-based chunker.

The second change is to protect against an empty inference request in the inference action.
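For illustration, the two changes described above could look roughly like the sketch below. Only the `input.isEmpty()` guard returning `List.of("")` mirrors the diff in this PR; the class name, both method names, the regex-based splitting and the fake "embedding" results are hypothetical stand-ins, not the actual Elasticsearch code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and splitting logic are illustrative only.
class EmptyInputSketch {

    // Change 1: an empty input yields one empty chunk instead of zero chunks.
    static List<String> chunkSentences(String input) {
        if (input.isEmpty()) {
            return List.of("");
        }
        // Stand-in for real sentence detection: split on whitespace
        // that follows '.', '!' or '?'.
        List<String> chunks = new ArrayList<>();
        for (String sentence : input.split("(?<=[.!?])\\s+")) {
            chunks.add(sentence);
        }
        return chunks;
    }

    // Change 2: never forward an empty batch. Reply immediately with an
    // empty result rather than waiting for a response that will not come.
    static List<String> infer(List<String> requests) {
        if (requests.isEmpty()) {
            return List.of();
        }
        List<String> results = new ArrayList<>();
        for (String request : requests) {
            results.add("embedding(" + request + ")"); // placeholder result
        }
        return results;
    }
}
```

With both guards in place, either one alone is enough to avoid the hang in this sketch: the chunker no longer produces an empty batch, and the inference step no longer waits on one.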

The bug only applies to inference endpoints using the Elasticsearch service configured with sentence chunking, which was introduced in 8.16.

elasticsearchmachine · 2024-12-02T16:51:16Z

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine · 2024-12-02T16:51:16Z

Hi @davidkyle, I've created a changelog YAML for you.

Mikep86 · 2024-12-02T17:03:22Z

...erence/src/main/java/org/elasticsearch/xpack/inference/chunking/SentenceBoundaryChunker.java

+        if (input.isEmpty()) {
+            return List.of("");
+        }


Do we need to trim the input before checking if it's empty? How does the chunker handle input that is only whitespace?

Good catch @Mikep86 thanks.

There's a whole class of bugs for "things that don't chunk". Pure whitespace, a single character, or the same character repeated thousands of times don't chunk. I've added test cases for these situations, and the solution I've implemented here is to return the original input if it did not chunk. This applies to both the Word and Sentence chunkers.
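As a rough sketch of the "return the original input if it did not chunk" idea (not the code in this PR): the toy word chunker and both method names below are hypothetical. In this toy version only empty and whitespace-only inputs produce zero chunks, but the fallback wrapper has the same shape whatever the reason for an empty result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a toy chunker plus the fallback wrapper.
class FallbackChunkSketch {

    // Toy word chunker: groups whitespace-separated words, maxWords per chunk.
    // Returns an empty list for empty or whitespace-only input.
    static List<String> chunkWords(String input, int maxWords) {
        List<String> chunks = new ArrayList<>();
        String trimmed = input.trim();
        if (!trimmed.isEmpty()) {
            String[] words = trimmed.split("\\s+");
            for (int i = 0; i < words.length; i += maxWords) {
                int end = Math.min(i + maxWords, words.length);
                chunks.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
            }
        }
        return chunks;
    }

    // Fallback: if nothing was produced, hand back the original input
    // as a single chunk so downstream code always has one request.
    static List<String> chunkOrOriginal(String input, int maxWords) {
        List<String> chunks = chunkWords(input, maxWords);
        return chunks.isEmpty() ? List.of(input) : chunks;
    }
}
```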

This makes me wonder if there should be an upper limit on the chunk size in terms of number of characters. A badly formed input containing Latin characters but no whitespace would result in a single large chunk; think of a base64-encoded binary file.
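One possible shape for such a limit, purely as a sketch: this PR does not implement a character cap, and the class name, method name and `maxChars` parameter are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a character cap applied after chunking.
class MaxChunkSizeSketch {

    // Splits any chunk longer than maxChars into fixed-width pieces.
    // Note: this cuts on char indices, so it could split a surrogate pair;
    // a real implementation would need to respect code point boundaries.
    static List<String> capChunkLength(List<String> chunks, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> capped = new ArrayList<>();
        for (String chunk : chunks) {
            if (chunk.length() <= maxChars) {
                capped.add(chunk); // already within the limit
                continue;
            }
            for (int start = 0; start < chunk.length(); start += maxChars) {
                int end = Math.min(start + maxChars, chunk.length());
                capped.add(chunk.substring(start, end));
            }
        }
        return capped;
    }
}
```

A cap like this would bound the size of each request for inputs such as a long base64 string, at the cost of cutting at arbitrary positions.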

Mikep86

Looks good, thanks for handling those edge cases

Mikep86 · 2024-12-04T17:15:12Z

...e/src/test/java/org/elasticsearch/xpack/inference/chunking/EmbeddingRequestChunkerTests.java

+        assertThat(batches, empty());
+    }
+
+    public void testWhitespaceInput_SentenceChunker() {


Do we need a test for whitespace input for the word chunker?

Covered by an existing test

# Conflicts: # x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/SentenceBoundaryChunker.java # x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/SentenceBoundaryChunkerTests.java

elasticsearchmachine · 2024-12-12T10:22:32Z

💔 Backport failed

Status	Branch	Result
❌	8.16	Commit could not be cherrypicked due to conflicts
❌	8.17	Commit could not be cherrypicked due to conflicts
✅	8.x

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 117840

…elastic#117840)

…#117840) (#118540)

…elastic#117840) # Conflicts: # x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/SentenceBoundaryChunkerTests.java # x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/WordBoundaryChunkerTests.java

… field (#118746) Backport of #117840

… field (elastic#118746) Backport of elastic#117840 # Conflicts: # x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/EmbeddingRequestChunkerTests.java

…elastic#117840) (elastic#118540)

… field (#118746) (#118767) Backport of #117840 # Conflicts: # x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/EmbeddingRequestChunkerTests.java

… field (elastic#118746) Backport of elastic#117840

davidkyle added 2 commits December 2, 2024 16:41

Sentence chunker should return 1 input for an empty string

3c27cc1

Handle empty case

d217607

davidkyle added >bug :ml Machine learning v6.8.17 auto-backport Automatically create backport pull requests when merged v9.0.0 v8.16.2 v8.18.0 labels Dec 2, 2024

elasticsearchmachine added the Team:ML Meta label for the ML team label Dec 2, 2024

Update docs/changelog/117840.yaml

9513926

Mikep86 added v8.17.0 v8.17.1 and removed v6.8.17 v8.17.0 labels Dec 2, 2024

Mikep86 reviewed Dec 2, 2024

davidkyle added 2 commits December 3, 2024 14:58

handle inputs that do not chunk

0ab1ca8

Merge branch 'main' into empty-input

a4edf41

Mikep86 approved these changes Dec 4, 2024

davidkyle added 2 commits December 11, 2024 13:20

Merge branch 'main' into empty-input

78ea3e7

# Conflicts: # x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/SentenceBoundaryChunker.java # x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/SentenceBoundaryChunkerTests.java

fix the tests

71c1012

davidkyle enabled auto-merge (squash) December 11, 2024 21:13

Merge branch 'main' into empty-input

2a98904

davidkyle merged commit a8484ad into elastic:main Dec 12, 2024
16 checks passed

davidkyle mentioned this pull request Dec 12, 2024

[8.x] [ML] Fix timeout ingesting an empty string into a semantic_text field (#117840) #118540

Merged

elasticsearchmachine added the backport pending label Dec 12, 2024

davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Dec 12, 2024

[ML] Fix timeout ingesting an empty string into a semantic_text field (…

0bd44f9

…elastic#117840)

elasticsearchmachine pushed a commit that referenced this pull request Dec 12, 2024

[ML] Fix timeout ingesting an empty string into a semantic_text field (…

90305d2

…#117840) (#118540)

davidkyle mentioned this pull request Dec 16, 2024

[8.17][ML] Fix timeout ingesting an empty string into a semantic_text field #118746

Merged

elasticsearchmachine pushed a commit that referenced this pull request Dec 16, 2024

[8.17][ML] Fix timeout ingesting an empty string into a semantic_text…

9fed31c

… field (#118746) Backport of #117840

davidkyle mentioned this pull request Dec 16, 2024

[8.16][ML] Fix timeout ingesting an empty string into a semantic_tet field #118767

Merged

maxhniebergall pushed a commit to maxhniebergall/elasticsearch that referenced this pull request Dec 16, 2024

[ML] Fix timeout ingesting an empty string into a semantic_text field (…

6c7918a

…elastic#117840) (elastic#118540)

maxhniebergall pushed a commit to maxhniebergall/elasticsearch that referenced this pull request Dec 16, 2024

[ML] Fix timeout ingesting an empty string into a semantic_text field (…

1e114a7

…elastic#117840) (elastic#118540)

sarog pushed a commit to portsbuild/elasticsearch that referenced this pull request Jan 22, 2025

[8.17][ML] Fix timeout ingesting an empty string into a semantic_text…

758aa2a

… field (elastic#118746) Backport of elastic#117840
