
[Feature Request] Have textsplitter always be <= chunk_size #528

Closed
krrishdholakia opened this issue Jan 4, 2023 · 7 comments

@krrishdholakia (Contributor)

Would it be possible to have the textsplitter ensure the text it's splitting is always <= chunk_size?

Seeing this issue while trying to embed large documents

@krrishdholakia (Contributor, Author)

cc: @hwchase17

@hwchase17 (Contributor)

may be tricky to ensure... but let me take a stab

@hwchase17 (Contributor)

@krrishdholakia thoughts on #530?

@krrishdholakia (Contributor, Author)

Hmm - why is that better than a simple sliding window approach?
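(For reference, the sliding-window approach mentioned here fits in a few lines; this is a hypothetical sketch for comparison, not LangChain code:)

```python
def sliding_window(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-size character chunks; consecutive chunks share `overlap` chars.

    Every chunk is guaranteed to be <= chunk_size, but splits can land
    mid-word since no separators are considered.
    """
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sliding_window("abcdefghij", chunk_size=4, overlap=1)
# → ['abcd', 'defg', 'ghij', 'j']
```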

@hwchase17 (Contributor)

it maintains some of the context (e.g. it won't split in the middle of a word)

that being said, if you're working with code or long strings you may sometimes need to do that?
i can add something for that as well

hwchase17 pushed a commit that referenced this issue Feb 3, 2023
This does not involve a separator, and will naively chunk input text at the appropriate boundaries in token space.

This is helpful if we have strict token length limits, need to strictly follow the specified chunk size, and can't rely on aggressive separators like spaces to guarantee the absence of long strings.

CharacterTextSplitter will let these strings through without splitting them, which could cause overflow errors downstream.

Splitting at arbitrary token boundaries is not ideal, but is hopefully mitigated by a decent overlap quantity. This also results in chunks that have exactly the desired number of tokens, instead of sometimes overcounting when we concatenate shorter strings.

Potentially also helps with #528.
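The token-space chunking described in the commit message can be sketched as follows. The `encode`/`decode` pair here is a stand-in for demonstration only (the real splitter uses an actual tokenizer), so treat this as an illustration, not the committed implementation:

```python
def split_on_tokens(text: str, chunk_size: int, chunk_overlap: int,
                    encode, decode) -> list[str]:
    """Naively chunk text at token boundaries; consecutive chunks share
    `chunk_overlap` tokens, and every chunk has at most `chunk_size` tokens."""
    ids = encode(text)
    step = chunk_size - chunk_overlap
    return [decode(ids[i:i + chunk_size]) for i in range(0, len(ids), step)]

# Stand-in "tokenizer": one token per whitespace-separated word (an
# assumption for the demo; a real implementation would use e.g. a BPE tokenizer).
encode = lambda s: s.split()
decode = lambda toks: " ".join(toks)

split_on_tokens("a b c d e f g", chunk_size=3, chunk_overlap=1,
                encode=encode, decode=decode)
# → ['a b c', 'c d e', 'e f g', 'g']
```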
zachschillaci27 pushed a commit to zachschillaci27/langchain that referenced this issue Mar 8, 2023
@TomTom101 (Contributor)

I'd like to chime in and revive the discussion :)

I am trying to wrap my head around the text splitters since I observed results that make no sense to me. While doing so I looked at the tests, in particular this one (I added the function parameters separators and keep_separator with their default values for transparency).

This test runs successfully; however, the expected output is not what I would expect:

def test_iterative_text_splitter() -> None:
    """Test iterative text splitter."""
    text = """Hi.\n\nI'm Harrison.\n\nHow? Are? You?\nOkay then f f f f.
This is a weird text to write, but gotta test the splittingggg some how.

Bye!\n\n-H."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=10,
        chunk_overlap=1,
        separators=["\n\n", "\n", " ", ""],
        keep_separator=True,
    )
    output = splitter.split_text(text)
    expected_output = [
        "Hi.",
        "I'm",
        "Harrison.",
        "How? Are?",
        "You?",
        "Okay then",
        "f f f f.",
        "This is a",
        "weird",
        "text to",
        "write,",
        "but gotta",
        "test the",
        "splitting",
        "gggg",
        "some how.",
        "Bye!",
        "-H.",
    ]
    assert output == expected_output

Chunk size is 10, so I'd expect the first split to be Hi.\n\nI'm. This is length==10, counting in the separators that are supposed to be preserved. I don't understand why the string is split already after 3 chars (Hi.) although the next piece would fit snugly into the 10-char allowance. Also, the separators are not included despite keep_separator being True.
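(As an aside, the behavior keep_separator is presumably meant to have, namely re-attaching each separator to the piece before it, can be sketched with a capturing group in re.split; a hypothetical illustration, not the actual LangChain code:)

```python
import re

def split_keep_separator(text: str, separator: str) -> list[str]:
    """Split on `separator`, re-attaching each separator to the preceding piece."""
    # A capturing group makes re.split return the separators as list items too.
    parts = re.split(f"({re.escape(separator)})", text)
    merged = []
    for i in range(0, len(parts), 2):  # parts alternates piece, sep, piece, ...
        sep = parts[i + 1] if i + 1 < len(parts) else ""
        if parts[i] or sep:
            merged.append(parts[i] + sep)
    return merged

split_keep_separator("Hi.\n\nI'm Harrison.", "\n\n")
# → ['Hi.\n\n', "I'm Harrison."]
```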

Next, I don't see any chunk_overlap. Although I reckon we are not talking about a number of chunks (a chunk being a string with length <= chunk_size) that should overlap between neighboring chunks, but rather a number of characters; so imo it would be better to call it char_overlap or similar. Still, I don't see a character overlap between the chunks in the test output.

With chunk_overlap=0 the first splits are the same btw.

Given the fact that we usually aim at handing over full sentences to LLMs, I doubt such an overlap makes any sense anyway.

Next, here is an additional test function:

def test_iterative_text_splitter_long() -> None:
    # a multiplier, set to 1, 10, 100 …
    m = 10
    text = f"""{'.'*3*m}|{'.'*2*m}|{'.'*m}"""
    chunk_size = 4 * m
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=0,
        # separator="|",
        separators=["|"],
        keep_separator=True,
    )
    actual = splitter.split_text(text)
    expected = [
        f"{'.' * 3*m}|",
        f"{'.' * 2*m}|{'.' * m}",
    ]
    print(f"{chunk_size=}")
    print(f"       {text=}")
    print(f"  {expected=}")
    print(f"    {actual=}")

    assert actual == expected

m is a multiplier; setting it to 1, 10, 100 … should not make any difference, since the text and chunk_size are scaled in the same way.
Run it with m=1 and all is good; run it with m=10 and the output is this:

chunk_size=40
       text='..............................|....................|..........'
  expected=['..............................|', '....................|..........']
    actual=['..............................|.........', '...........|..........']

The string is cut "mid-sentence" without need. The expected output would have created two chunks of 31 chars each, split at the separators, which imo is the optimal solution.
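The expected behavior described here, splitting at the separators and then greedily packing pieces as long as the result stays within chunk_size, can be sketched like this (a hypothetical illustration, not the current RecursiveCharacterTextSplitter logic):

```python
def greedy_merge(pieces: list[str], chunk_size: int) -> list[str]:
    """Concatenate consecutive pieces while the result stays <= chunk_size.

    A single piece longer than chunk_size is emitted as-is and would need
    further splitting with a finer separator.
    """
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks

m = 10
pieces = ["." * 3 * m + "|", "." * 2 * m + "|", "." * m]  # split at "|", separators kept
greedy_merge(pieces, chunk_size=4 * m)
# → two chunks of 31 chars each, matching the expected output above
```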

Hence I believe these unglamorous little classes are in fact quite essential to get right: a vast number of vector stores have already been filled, and LLMs fed, with the chunks they produced, and those chunks are what knowledge is supposed to be extracted from.

What are your thoughts, ideas, plans?

Thanks for reading up to this point – and for this wonderful package!

@dosubot (bot) commented Sep 20, 2023

Hi, @krrishdholakia. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you were requesting a feature to ensure that the text splitter always splits text into chunks that are less than or equal to the specified chunk_size, an issue you encountered while trying to embed large documents. The maintainer, @hwchase17, proposed #530, which preserves some context while splitting (e.g. not splitting mid-word), and a follow-up commit added naive token-boundary splitting. Another user, @TomTom101, provided additional test cases highlighting the issue and suggested renaming the chunk_overlap parameter.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project. Let us know if you have any further questions or concerns.

@dosubot added the stale label (issue has not had recent activity or appears to be solved) on Sep 20, 2023
@dosubot closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 27, 2023
@dosubot removed the stale label on Sep 27, 2023