chunk_overlap not working when using RecursiveCharacterTextSplitter #30200

@sallahbaksh

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from typing import List

from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter


class MarkdownChunker:  # hypothetical name; the enclosing class is not shown in the report
    def __init__(self, separators: List[str], chunk_size: int = 4000, chunk_overlap: int = 200):
        headers_to_split_on = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3"), ("####", "Header 4")]
        self.separators = separators
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # First stage: split on Markdown headers (levels 1-4), keeping the header text in each section
        self.header_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)
        # Second stage: split each section by character count, with the requested overlap
        self.text_splitter = RecursiveCharacterTextSplitter(
            separators=self.separators,
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            length_function=len,
        )

Error Message and Stack Trace (if applicable)

No response

Description

I'm trying to split text from a Markdown file. I first use MarkdownHeaderTextSplitter to split on headers, and then I apply RecursiveCharacterTextSplitter. I want a chunk overlap of 200 characters, but when I set chunk_overlap and call split_text, the resulting chunks do not overlap at all.
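For anyone trying to reproduce this, one possible explanation (an assumption, not something confirmed in this report) is that overlap is carried over only as whole separator-delimited pieces. If every piece is longer than chunk_overlap, nothing fits in the overlap budget. Separately, a header section shorter than chunk_size is never split, so it has no neighbouring chunk to overlap with. The sketch below is a simplified, standard-library-only model of that merge behaviour, not LangChain's actual code; `merge_pieces` and `overlap_len` are hypothetical helper names introduced for illustration.

```python
from typing import List


def merge_pieces(pieces: List[str], chunk_size: int, chunk_overlap: int, sep: str = " ") -> List[str]:
    """Greedily pack pieces into chunks, carrying whole trailing pieces over as overlap.

    Simplified model for illustration only; not LangChain's actual implementation.
    """
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        if current and len(sep.join(current + [piece])) > chunk_size:
            chunks.append(sep.join(current))
            # Drop leading pieces until the carried-over tail fits within
            # chunk_overlap and leaves room for the incoming piece.
            while current and (
                len(sep.join(current)) > chunk_overlap
                or len(sep.join(current + [piece])) > chunk_size
            ):
                current = current[1:]
        current.append(piece)
    if current:
        chunks.append(sep.join(current))
    return chunks


def overlap_len(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


# Pieces of 5 characters fit inside a 5-character overlap budget.
short_chunks = merge_pieces(["aaaaa", "bbbbb", "ccccc", "ddddd"], chunk_size=11, chunk_overlap=5)

# Pieces of 10 characters never fit inside a 5-character overlap budget.
long_chunks = merge_pieces(["a" * 10, "b" * 10, "c" * 10], chunk_size=12, chunk_overlap=5)

short_overlaps = [overlap_len(a, b) for a, b in zip(short_chunks, short_chunks[1:])]
long_overlaps = [overlap_len(a, b) for a, b in zip(long_chunks, long_chunks[1:])]
print(short_overlaps, long_overlaps)
```

Running `overlap_len` over consecutive chunks returned by the real split_text call would show whether any overlap is present. If the model above matches the library's behaviour, then with chunk_overlap=200 pieces would need to be at most 200 characters long to be carried over, which may call for finer-grained separators.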

System Info

System Information

OS: Windows
OS Version: 10.0.26100
Python Version: 3.12.8 | packaged by conda-forge | (main, Dec 5 2024, 14:06:27) [MSC v.1942 64 bit (AMD64)]

Package Information

langchain_core: 0.3.23
langchain: 0.3.10
langsmith: 0.1.142
langchain_openai: 0.2.8
langchain_text_splitters: 0.3.2

Optional packages not installed

langserve

Other Dependencies

aiohttp: 3.10.11
async-timeout: Installed. No version info available.
httpx: 0.28.1
jsonpatch: 1.33
numpy: 1.26.4
openai: 1.59.5
orjson: 3.10.11
packaging: 24.2
pydantic: 2.10.6
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
SQLAlchemy: 2.0.36
tenacity: 9.0.0
tiktoken: 0.8.0
typing-extensions: 4.12.2

Metadata

Labels

bug: Related to a bug, vulnerability, unexpected error with an existing feature
text-splitters: Related to the package `text-splitters`
