@davidkyle
Member

The sentence chunker returns 0 chunks if the input is an empty string. After chunking, the Elasticsearch inference service would send an empty list of requests to the ML node for processing, but the node never replied because there were no requests to process.

The first change is for the sentence chunker to return "" if the input is ""; this behaviour is in line with the word-based chunker.

The second change is to protect against an empty inference request in the inference action.

The bug only applies to inference endpoints using the Elasticsearch service configured with sentence chunking. Sentence chunking was introduced in 8.16.
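
As a rough illustration of the two changes described above, here is a minimal sketch. The class and method names (`EmptyInputGuards`, `chunkSentences`, `infer`) are hypothetical stand-ins, not the actual Elasticsearch classes, and the sentence splitting is deliberately naive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from the Elasticsearch code base.
final class EmptyInputGuards {

    // Naive stand-in for a sentence chunker: splits after each ". " boundary.
    static List<String> chunkSentences(String input) {
        if (input.isEmpty()) {
            // First change: an empty input yields one empty chunk rather than zero chunks.
            return List.of("");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        int boundary;
        while ((boundary = input.indexOf(". ", start)) >= 0) {
            chunks.add(input.substring(start, boundary + 1));
            start = boundary + 2;
        }
        if (start < input.length()) {
            chunks.add(input.substring(start));
        }
        return chunks;
    }

    // Stand-in for the inference action: answers immediately when there is nothing to send.
    static List<String> infer(List<String> requests) {
        if (requests.isEmpty()) {
            // Second change: never dispatch an empty request and wait for a reply.
            return List.of();
        }
        List<String> results = new ArrayList<>();
        for (String request : requests) {
            results.add("result(" + request + ")");
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(chunkSentences("").size());
        System.out.println(infer(List.of()).size());
    }
}
```

With both guards in place, the caller always gets a response, whether or not the input produced any work.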

@davidkyle davidkyle added >bug :ml Machine learning v6.8.17 auto-backport Automatically create backport pull requests when merged v9.0.0 v8.16.2 v8.18.0 labels Dec 2, 2024
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Dec 2, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Collaborator

Hi @davidkyle, I've created a changelog YAML for you.

Comment on lines 69 to 71
if (input.isEmpty()) {
    return List.of("");
}
Contributor

Do we need to trim the input before checking if it's empty? How does the chunker handle input that is only whitespace?
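
For context on the question, `String.isEmpty()` only checks for zero length, so whitespace-only input passes an `isEmpty()` guard. A small sketch of the distinction, using a hypothetical `WhitespaceCheck` helper that is not part of the PR (`String.isBlank()` requires Java 11 or later):

```java
// Hypothetical helper for illustration only.
final class WhitespaceCheck {

    // Classifies input the way an isEmpty() guard versus an isBlank() guard would see it.
    static String classify(String input) {
        if (input.isEmpty()) {
            return "empty";   // zero-length string
        }
        if (input.isBlank()) {
            return "blank";   // non-empty, but whitespace only
        }
        return "text";
    }

    public static void main(String[] args) {
        System.out.println(classify(""));
        System.out.println(classify("   \n"));
        System.out.println(classify(" a "));
    }
}
```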

Member Author

Good catch @Mikep86 thanks.

There's a whole class of bugs for "things that don't chunk". Pure whitespace, a single character, or the same character repeated thousands of times don't chunk. I've added test cases for these situations, and the solution I've implemented here is to return the original input if it did not chunk. This applies to both the Word and Sentence chunkers.

This makes me wonder if there should be an upper limit on the chunk size in terms of number of characters. A badly formed input containing Latin characters but no whitespace would result in a single large chunk; think of a base64-encoded binary file.
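
The fallback described in this comment could look roughly like the sketch below. The `ChunkFallback` class, its toy word chunker, and the `capLength` helper are all hypothetical; in particular, the character cap is only the idea floated above and is not something the PR implements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the actual WordBoundaryChunker or SentenceBoundaryChunker.
final class ChunkFallback {

    // Toy word chunker: groups whitespace-separated words, wordsPerChunk at a time.
    static List<String> chunkWords(String input, int wordsPerChunk) {
        List<String> chunks = new ArrayList<>();
        String trimmed = input.trim();
        if (!trimmed.isEmpty()) {
            List<String> words = Arrays.asList(trimmed.split("\\s+"));
            for (int i = 0; i < words.size(); i += wordsPerChunk) {
                int end = Math.min(i + wordsPerChunk, words.size());
                chunks.add(String.join(" ", words.subList(i, end)));
            }
        }
        // Fallback: input that did not chunk (empty, whitespace only) is returned unchanged.
        if (chunks.isEmpty()) {
            return List.of(input);
        }
        return chunks;
    }

    // Assumed cap, not part of the PR: split any chunk longer than maxChars characters.
    // Note that a plain substring split can cut a surrogate pair in half.
    static List<String> capLength(List<String> chunks, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> capped = new ArrayList<>();
        for (String chunk : chunks) {
            if (chunk.length() <= maxChars) {
                capped.add(chunk);
                continue;
            }
            for (int i = 0; i < chunk.length(); i += maxChars) {
                capped.add(chunk.substring(i, Math.min(i + maxChars, chunk.length())));
            }
        }
        return capped;
    }

    public static void main(String[] args) {
        System.out.println(chunkWords("   ", 2));
        System.out.println(capLength(chunkWords("abcdefg", 2), 3));
    }
}
```

Returning the original input keeps the invariant that every input produces at least one chunk, which is what the inference request path relies on.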

Contributor

@Mikep86 Mikep86 left a comment

Looks good, thanks for handling those edge cases

assertThat(batches, empty());
}

public void testWhitespaceInput_SentenceChunker() {
Contributor

Do we need a test for whitespace input for the word chunker?

Member Author

Covered by an existing test

# Conflicts:
#	x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/SentenceBoundaryChunker.java
#	x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/SentenceBoundaryChunkerTests.java
@davidkyle davidkyle enabled auto-merge (squash) December 11, 2024 21:13
@davidkyle davidkyle merged commit a8484ad into elastic:main Dec 12, 2024
16 checks passed
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
8.16 Commit could not be cherrypicked due to conflicts
8.17 Commit could not be cherrypicked due to conflicts
8.x

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 117840

davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Dec 12, 2024
davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Dec 16, 2024
…elastic#117840)

# Conflicts:
#	x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/SentenceBoundaryChunkerTests.java
#	x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/WordBoundaryChunkerTests.java
davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Dec 16, 2024
… field (elastic#118746)

Backport of elastic#117840
# Conflicts:
#	x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/EmbeddingRequestChunkerTests.java
elasticsearchmachine pushed a commit that referenced this pull request Dec 16, 2024
… field (#118746) (#118767)

Backport of #117840
# Conflicts:
#	x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/chunking/EmbeddingRequestChunkerTests.java
sarog pushed a commit to portsbuild/elasticsearch that referenced this pull request Jan 22, 2025

Labels

auto-backport Automatically create backport pull requests when merged backport pending >bug :ml Machine learning Team:ML Meta label for the ML team v8.16.2 v8.17.1 v8.18.0 v9.0.0
