
[Feature Request] Have textsplitter always be <= chunk_size #528

Closed
krrishdholakia opened this issue Jan 4, 2023 · 7 comments

@krrishdholakia (Contributor)

Would it be possible to have the textsplitter ensure the text it's splitting is always <= chunk_size?

Seeing this issue while trying to embed large documents

@krrishdholakia (Contributor, Author)

cc: @hwchase17

@hwchase17 (Contributor)

may be tricky to ensure... but let me take a stab

@hwchase17 (Contributor)

@krrishdholakia thoughts on #530?

@krrishdholakia (Contributor, Author)

Hmm - why is that better than a simple sliding window approach?
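(For reference, the sliding-window approach mentioned here fits in a few lines; this is a hypothetical sketch for comparison, not LangChain code:)

```python
def sliding_window(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-size character chunks; consecutive chunks share `overlap` chars.

    Every chunk is guaranteed to be <= chunk_size, but splits can land
    mid-word since no separators are considered.
    """
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sliding_window("abcdefghij", chunk_size=4, overlap=1)
# → ['abcd', 'defg', 'ghij', 'j']
```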

@hwchase17 (Contributor)

it maintains some of the context (e.g. it won't split in the middle of a word)

that being said, if you're working with code or long strings you may sometimes need to do that?
i can add something for that as well

hwchase17 pushed a commit that referenced this issue Feb 3, 2023
This does not involve a separator, and will naively chunk input text at the appropriate boundaries in token space.

This is helpful if we have strict token length limits, need to strictly follow the specified chunk size, and can't rely on aggressive separators like spaces to guarantee the absence of long strings.

CharacterTextSplitter will let these strings through without splitting them, which could cause overflow errors downstream.

Splitting at arbitrary token boundaries is not ideal, but is hopefully mitigated by a decent overlap quantity. This also results in chunks that have exactly the desired number of tokens, instead of sometimes overcounting when we concatenate shorter strings.

Potentially also helps with #528.
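The token-space chunking described in the commit message can be sketched as follows. The `encode`/`decode` pair here is a stand-in for demonstration only (the real splitter uses an actual tokenizer), so treat this as an illustration, not the committed implementation:

```python
def split_on_tokens(text: str, chunk_size: int, chunk_overlap: int,
                    encode, decode) -> list[str]:
    """Naively chunk text at token boundaries; consecutive chunks share
    `chunk_overlap` tokens, and every chunk has at most `chunk_size` tokens."""
    ids = encode(text)
    step = chunk_size - chunk_overlap
    return [decode(ids[i:i + chunk_size]) for i in range(0, len(ids), step)]

# Stand-in "tokenizer": one token per whitespace-separated word (an
# assumption for the demo; a real implementation would use e.g. a BPE tokenizer).
encode = lambda s: s.split()
decode = lambda toks: " ".join(toks)

split_on_tokens("a b c d e f g", chunk_size=3, chunk_overlap=1,
                encode=encode, decode=decode)
# → ['a b c', 'c d e', 'e f g', 'g']
```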
zachschillaci27 pushed a commit to zachschillaci27/langchain that referenced this issue Mar 8, 2023
@TomTom101 (Contributor)

I'd like to chime in and revive the discussion :)

I am trying to wrap my head around the text splitters since I observed results that make no sense to me. While doing so I looked at the tests, in particular this one (I added the function parameters separators and keep_separator with their default values for transparency).

This test runs successfully; however, the expected output is not what I would expect:

def test_iterative_text_splitter() -> None:
    """Test iterative text splitter."""
    text = """Hi.\n\nI'm Harrison.\n\nHow? Are? You?\nOkay then f f f f.
This is a weird text to write, but gotta test the splittingggg some how.

Bye!\n\n-H."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=10,
        chunk_overlap=1,
        separators=["\n\n", "\n", " ", ""],
        keep_separator=True,
    )
    output = splitter.split_text(text)
    expected_output = [
        "Hi.",
        "I'm",
        "Harrison.",
        "How? Are?",
        "You?",
        "Okay then",
        "f f f f.",
        "This is a",
        "weird",
        "text to",
        "write,",
        "but gotta",
        "test the",
        "splitting",
        "gggg",
        "some how.",
        "Bye!",
        "-H.",
    ]
    assert output == expected_output

Chunk size is 10, so I'd expect the first split to be Hi.\n\nI'm. This is length==10, counting in the separators that are supposed to be preserved. I don't understand why the string is split already after 3 chars (Hi.) although the next piece would fit snugly into the 10-char allowance. Also, the separators are not included despite keep_separator being True.
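(As an aside, the behavior keep_separator is presumably meant to have, namely re-attaching each separator to the piece before it, can be sketched with a capturing group in re.split; a hypothetical illustration, not the actual LangChain code:)

```python
import re

def split_keep_separator(text: str, separator: str) -> list[str]:
    """Split on `separator`, re-attaching each separator to the preceding piece."""
    # A capturing group makes re.split return the separators as list items too.
    parts = re.split(f"({re.escape(separator)})", text)
    merged = []
    for i in range(0, len(parts), 2):  # parts alternates piece, sep, piece, ...
        sep = parts[i + 1] if i + 1 < len(parts) else ""
        if parts[i] or sep:
            merged.append(parts[i] + sep)
    return merged

split_keep_separator("Hi.\n\nI'm Harrison.", "\n\n")
# → ['Hi.\n\n', "I'm Harrison."]
```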

Next, I don't see any chunk_overlap. Although I reckon we are not talking about a number of chunks (a chunk being a string with length <= chunk_size) that should overlap between neighboring chunks, but rather a number of characters; so imo it would be better to call it char_overlap or similar. Still, I don't see a character overlap between the chunks in the test output.

With chunk_overlap=0 the first splits are the same btw.

Given the fact that we usually aim at handing over full sentences to LLMs, I doubt such an overlap makes any sense anyway.

Next, here is an additional test function:

def test_iterative_text_splitter_long() -> None:
    # a multiplier, set to 1, 10, 100 …
    m = 10
    text = f"""{'.'*3*m}|{'.'*2*m}|{'.'*m}"""
    chunk_size = 4 * m
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=0,
        # separator="|",
        separators=["|"],
        keep_separator=True,
    )
    actual = splitter.split_text(text)
    expected = [
        f"{'.' * 3*m}|",
        f"{'.' * 2*m}|{'.' * m}",
    ]
    print(f"{chunk_size=}")
    print(f"       {text=}")
    print(f"  {expected=}")
    print(f"    {actual=}")

    assert actual == expected

m is a multiplier; setting it to 1, 10, 100 … should not make any difference, since the text and chunk_size are scaled in the same way.
Run it with m=1 and all is good; run it with m=10 and the output is this:

chunk_size=40
       text='..............................|....................|..........'
  expected=['..............................|', '....................|..........']
    actual=['..............................|.........', '...........|..........']

The string is cut "mid-sentence" without need. The expected output would have created two chunks of 31 chars each, split at the separators, which imo is the optimal solution.
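The expected behavior described here, splitting at the separators and then greedily packing pieces as long as the result stays within chunk_size, can be sketched like this (a hypothetical illustration, not the current RecursiveCharacterTextSplitter logic):

```python
def greedy_merge(pieces: list[str], chunk_size: int) -> list[str]:
    """Concatenate consecutive pieces while the result stays <= chunk_size.

    A single piece longer than chunk_size is emitted as-is and would need
    further splitting with a finer separator.
    """
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks

m = 10
pieces = ["." * 3 * m + "|", "." * 2 * m + "|", "." * m]  # split at "|", separators kept
greedy_merge(pieces, chunk_size=4 * m)
# → two chunks of 31 chars each, matching the expected output above
```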

Hence I believe these unglamorous little classes are in fact quite essential to get right: a vast number of vector stores have already been filled, and LLMs fed, with the chunks they produced, and those chunks are what knowledge is supposed to be extracted from.

What are your thoughts, ideas, plans?

Thanks for reading up to this point – and for this wonderful package!

@dosubot (bot) commented Sep 20, 2023

Hi, @krrishdholakia. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you were requesting a feature to ensure that the text splitter always splits text into chunks that are less than or equal to the specified chunk_size, an issue you encountered while trying to embed large documents. The maintainer, @hwchase17, proposed #530, which preserves some context while splitting (e.g. not splitting mid-word), and a follow-up commit added naive token-boundary splitting. Another user, @TomTom101, provided additional test cases highlighting the issue and suggested renaming the chunk_overlap parameter.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project. Let us know if you have any further questions or concerns.

@dosubot added the stale label (issue has not had recent activity or appears to be solved) on Sep 20, 2023
@dosubot closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 27, 2023
@dosubot removed the stale label on Sep 27, 2023