[Feature Request] Have textsplitter always be <= chunk_size #528
Comments
cc: @hwchase17
may be tricky to ensure... but let me take a stab
@krrishdholakia thoughts on #530?
Hmm - why is that better than a simple sliding window approach?
It maintains some of the context (e.g. it won't split in the middle of a word, etc.). That being said, maybe if you're working with code or long strings you sometimes need to do that?
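For comparison, a plain character-level sliding window would look roughly like the sketch below (not LangChain's code; the function name is made up for illustration). It always respects chunk_size, but it may cut words or sentences in half:

def sliding_window_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Advance by chunk_size - chunk_overlap so consecutive chunks share
    # chunk_overlap characters; every chunk is guaranteed to be <= chunk_size.
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]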
This does not involve a separator, and will naively chunk the input text at the appropriate boundaries in token space. This is helpful when we have strict token length limits and must stick to the specified chunk size, and we can't rely on aggressive separators like spaces to guarantee the absence of long strings. CharacterTextSplitter will let these strings through without splitting them, which could cause overflow errors downstream. Splitting at arbitrary token boundaries is not ideal, but that is hopefully mitigated by a decent overlap quantity. It also results in chunks that have exactly the desired number of tokens, instead of sometimes overcounting when shorter strings are concatenated. Potentially also helps with #528.
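A rough sketch of that token-space idea (this is not the code from #530; the tiktoken encoding name and helper function are assumptions for illustration only):

import tiktoken


def split_by_tokens(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Encode once, slide a window over the token ids, and decode each window,
    # so every chunk contains at most chunk_size tokens regardless of separators.
    enc = tiktoken.get_encoding("gpt2")
    tokens = enc.encode(text)
    step = chunk_size - chunk_overlap
    return [enc.decode(tokens[i : i + chunk_size]) for i in range(0, len(tokens), step)]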
I'd like to chime in and revive the discussion :) I am trying to wrap my head around the text splitters, since I observed results that make no sense to me. While doing so I looked at the tests, in particular this one (I added the function parameters separators and keep_separator with their default values for transparency). The test runs successfully; however, the expected output is not what I would expect:

from langchain.text_splitter import RecursiveCharacterTextSplitter


def test_iterative_text_splitter() -> None:
"""Test iterative text splitter."""
text = """Hi.\n\nI'm Harrison.\n\nHow? Are? You?\nOkay then f f f f.
This is a weird text to write, but gotta test the splittingggg some how.
Bye!\n\n-H."""
splitter = RecursiveCharacterTextSplitter(
chunk_size=10,
chunk_overlap=1,
separators=["\n\n", "\n", " ", ""],
keep_separator=True,
)
output = splitter.split_text(text)
expected_output = [
"Hi.",
"I'm",
"Harrison.",
"How? Are?",
"You?",
"Okay then",
"f f f f.",
"This is a",
"weird",
"text to",
"write,",
"but gotta",
"test the",
"splitting",
"gggg",
"some how.",
"Bye!",
"-H.",
]
    assert output == expected_output

Chunk size is 10, so I'd expect the 1st split to contain more than just "Hi.". Next, I don't see a chunk_overlap. Although I reckon we are not talking about the number of chunks (a chunk being a string with length <= chunk_size) that should overlap between neighboring chunks, but rather the number of characters, so imo it would be better to call it "char_overlap" or similar. Still, I don't see any character overlap between the chunks in the test output. With chunk_overlap=0 the first splits are the same, btw. Given that we usually aim at handing full sentences over to LLMs, I doubt such an overlap makes much sense anyway.

Next, here is an additional test function:

def test_iterative_text_splitter_long() -> None:
    # a multiplier, set to 1, 10, 100 …
    m = 10
    text = f"""{'.'*3*m}|{'.'*2*m}|{'.'*m}"""
    chunk_size = 4 * m
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=0,
        # separator="|",
        separators=["|"],
        keep_separator=True,
    )
    actual = splitter.split_text(text)
    expected = [
        f"{'.' * 3*m}|",
        f"{'.' * 2*m}|{'.' * m}",
    ]
    print(f"{chunk_size=}")
    print(f" {text=}")
    print(f" {expected=}")
    print(f" {actual=}")
    assert actual == expected
The string is cut "mid-sentence" without need. The expected output would have created 2 chunks of 31 characters each, split at the separators, which imo is the optimal solution. Hence I believe these unglamorous little classes are in fact quite essential to get right, since a vast number of vector stores are already being filled, and LLMs fed, with the chunks they create, and those chunks are what the knowledge is supposed to be extracted from. What are your thoughts, ideas, plans? Thanks for reading up to this point, and thanks for this wonderful package!
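For reference, the behaviour I would expect here (split at the separator, keep it attached, then greedily pack pieces into chunks until chunk_size would be exceeded) could be sketched as follows. This is illustrative only, not the library's code, and a piece that is itself longer than chunk_size would still need further splitting, which is exactly what this issue is about:

def split_then_pack(text: str, separator: str, chunk_size: int) -> list[str]:
    # Split at the separator but keep it attached to the preceding piece.
    parts = text.split(separator)
    pieces = [p + separator for p in parts[:-1]] + [parts[-1]]
    # Greedily merge pieces into chunks no longer than chunk_size.
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks

# With m = 10 this reproduces the expected output of the test above:
# split_then_pack("." * 30 + "|" + "." * 20 + "|" + "." * 10, "|", 40)
# returns two chunks of 31 characters each, split at the separators.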
Hi, @krrishdholakia. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you were requesting a feature to ensure that the textsplitter always splits the text into chunks that are less than or equal to the specified chunk size. This issue was encountered while trying to embed large documents. The maintainer, @hwchase17, suggested using a sliding window approach to maintain context while splitting. Another user, @TomTom101, also provided additional test cases to highlight the issue and suggested renaming the chunk_overlap parameter.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project. Let us know if you have any further questions or concerns.
Would it be possible to have the textsplitter ensure the text it's splitting is always <= chunk_size?
I'm seeing this issue while trying to embed large documents.
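For example, a separator-based splitter can emit chunks longer than chunk_size when a single piece contains no separator at all. A minimal sketch of that situation (the numbers and the document string are arbitrary values chosen for illustration):

from langchain.text_splitter import CharacterTextSplitter

# A document containing one long run with no spaces; a space-based splitter
# cannot break it up, so it passes through as an oversized chunk.
doc = "short words here " + "x" * 500 + " more short words"
splitter = CharacterTextSplitter(separator=" ", chunk_size=100, chunk_overlap=0)
chunks = splitter.split_text(doc)
print([len(c) for c in chunks if len(c) > 100])  # the 500-character run survives intact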