-
Notifications
You must be signed in to change notification settings - Fork 25.7k
[ML] Fix timeout ingesting an empty string into a semantic_text field #117840
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Pinging @elastic/ml-core (Team:ML) |
|
Hi @davidkyle, I've created a changelog YAML for you. |
| if (input.isEmpty()) { | ||
| return List.of(""); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to trim the input before checking if it's empty? How does the chunker handle input that is only whitespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch @Mikep86 thanks.
There's a whole class of bugs for "things that don't chunk". Pure whitespace, a single character or the same character repeated thousands of times don't chunk. I've added test cases for these situations and the solution I've implemented here in this is to return the original input if it did not chunk. This applies to both the Word and Sentence chunker.
This makes me wonder if there should be an upper limit on the chunk size in terms of number of characters. A badly formed input contain latin characters but no whitespace would result in a single large chunk, think a binary file base64 encoded.
Mikep86
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for handling those edge cases
| assertThat(batches, empty()); | ||
| } | ||
|
|
||
| public void testWhitespaceInput_SentenceChunker() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a test for whitespace input for the word chunker?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Covered by an existing test
# Conflicts: # x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/SentenceBoundaryChunker.java # x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/SentenceBoundaryChunkerTests.java
💔 Backport failed
You can use sqren/backport to manually backport by running |
…elastic#117840) # Conflicts: # x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/SentenceBoundaryChunkerTests.java # x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/WordBoundaryChunkerTests.java
… field (elastic#118746) Backport of elastic#117840 # Conflicts: # x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/EmbeddingRequestChunkerTests.java
The sentence chunker returns 0 chunks if the input is an empty string. After chunking the Elasticsearch inference service would send an empty lists of request to the ml node for processing but the node never replied because there were no requests to process.
The first change is for the sentence chunker to return "" if the input is "", this behaviour is inline with the word based chunker.
The second change is to protect against an empty inference request in the inference action.
The bug only applies to inference endpoints using the Elasticsearch service configured with sentence chunking. Sentence chunking which was introduced in 8.16