HTMLHeaderTextSplitter won't run (maxHead) #13149

Closed

AndyMik90 opened this issue Nov 9, 2023 · 10 comments · May be fixed by #27678
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: doc loader Related to document loader module (not documentation)

Comments

@AndyMik90

System Info

Notebook with latest langchain

Who can help?

No response

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

Trying the HTMLHeaderTextSplitter in a notebook:

from langchain.text_splitter import HTMLHeaderTextSplitter, RecursiveCharacterTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_file("/content/X.html")

chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
splits[80:85]
---------------------------------------------------------------------------
XSLTApplyError                            Traceback (most recent call last)
[<ipython-input-54-bd3edea942d1>](https://localhost:8080/#) in <cell line: 12>()
     10 html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
     11 
---> 12 html_header_splits = html_splitter.split_text_from_file("/content/X.html")
     13 
     14 chunk_size = 500

[/usr/local/lib/python3.10/dist-packages/langchain/text_splitter.py](https://localhost:8080/#) in split_text_from_file(self, file)
    586         xslt_tree = etree.parse(xslt_path)
    587         transform = etree.XSLT(xslt_tree)
--> 588         result = transform(tree)
    589         result_dom = etree.fromstring(str(result))
    590 

src/lxml/xslt.pxi in lxml.etree.XSLT.__call__()

XSLTApplyError: maxHead

Is the HTML just too large to be handled by the text splitter?

Expected behavior

Load the HTML without errors.

dosubot bot commented Nov 9, 2023

🤖

Hello @AndyMik90,

I'm an assistive bot here to help you with your issue while we wait for a human maintainer. I'm able to help address bugs, answer questions, and even guide you on how to become a contributor to the project. Rest assured, I'm here to help!

I'm currently reviewing the problem you've encountered with HTMLHeaderTextSplitter in the LangChain repository. I'll be back shortly with a more comprehensive solution!

Stay tuned!

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Nov 9, 2023

dosubot bot commented Nov 9, 2023

🤖

The error you're encountering, XSLTApplyError: maxHead, is likely due to the size of the HTML file you're trying to process. The HTMLHeaderTextSplitter class uses the lxml package to parse and transform the HTML file. The lxml package, in turn, uses libxml2 and libxslt libraries, which have certain limitations.

The maxHead error is typically thrown when the size of the HTML file exceeds the maximum size that can be processed by the XSLT processor. This limit is not explicitly defined in the LangChain codebase or the lxml package, but it's a limitation of the underlying libraries.

One possible solution could be to split your HTML file into smaller chunks before processing it with the HTMLHeaderTextSplitter. This could be done manually or by using a tool or script to automate the process.
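As a rough illustration of that first suggestion, one could cut the raw HTML into pieces at header boundaries before feeding each piece to the splitter. This is a simplified stdlib sketch, not LangChain code: real HTML may need a proper parser, and the default `max_chars` threshold here is an assumption, not a documented limit.

```python
import re

def chunk_html(html: str, max_chars: int = 100_000) -> list[str]:
    """Split raw HTML into pieces no longer than max_chars,
    cutting preferentially at <h1>/<h2> boundaries."""
    # Split before every h1/h2 opening tag, keeping each tag with its section
    sections = re.split(r'(?=<h[12][\s>])', html)
    chunks, current = [], ''
    for section in sections:
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)
            current = ''
        current += section
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to `HTMLHeaderTextSplitter.split_text` individually and the resulting document lists concatenated.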

Another possible solution could be to increase the maximum size limit of the XSLT processor. However, this would involve modifying the source code of the libxslt library and recompiling it, which may not be feasible or desirable.

Please note that these are just possible solutions and may not work in all cases. If you continue to encounter issues, please provide more details about the size and structure of your HTML file, and any other relevant information that could help in troubleshooting the issue.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.


dosubot bot commented Feb 8, 2024

Hi, @AndyMik90,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, you raised an issue regarding the HTMLHeaderTextSplitter not running due to a maxHead error when attempting to split a large HTML file. Dosubot provided a detailed response explaining the error and suggested potential solutions for resolving the issue, such as splitting the HTML file into smaller chunks or modifying the underlying libraries. They also requested more details for further troubleshooting.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, kindly let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 8, 2024
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 15, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 15, 2024
@wcohnshb

I'm having the same problem, and I'm using the latest version of the LangChain repository.

@sprklinginfo

Me too. Have you found any solutions?

@KivenChen

KivenChen commented May 15, 2024

This is indeed a problem from the lxml package, which seems to use libxslt to parse HTML content. However, libxslt has a restriction of about 2^16 characters (the range of a uint16_t).

It is high time we switch to another dependency.
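If the limit is character-based as suggested, a file just above ~2^16 characters should reproduce the error. A quick generator for such a test file (the exact threshold is unverified, so the section count is a deliberate overshoot):

```python
def make_large_html(n_sections: int = 2000) -> str:
    """Generate an HTML document large enough to exceed ~2^16 characters."""
    body = ''.join(
        f'<h2>Section {i}</h2><p>{"lorem ipsum " * 5}</p>'
        for i in range(n_sections)
    )
    return f'<html><body><h1>Title</h1>{body}</body></html>'
```

Writing this string to a file and passing it to `split_text_from_file` should trigger the `XSLTApplyError` on affected versions.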

@brett-matson

I've just hit this problem too. Anyone know of any workarounds?

@davonte7

+1 Looking for a workaround

@raphaelgurtner

My workaround is to subclass HTMLHeaderTextSplitter and override split_text_from_file. In there I call xsltproc via subprocess.run just for the transformation, using the same html_chunks_with_headers.xslt.
xsltproc has a --maxdepth parameter that you'll have to experiment with and increase. Make sure xsltproc is installed on the system where you use this workaround.
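A sketch of the transform step of that workaround. The function names, the default `--maxdepth` value, and the stylesheet path are assumptions to adapt to your installation; only the `xsltproc` flags themselves (`--maxdepth`, `--html`) are real options.

```python
import subprocess

def build_xsltproc_cmd(xslt_path: str, html_path: str,
                       maxdepth: int = 40_000) -> list[str]:
    """Assemble the xsltproc invocation with a raised --maxdepth."""
    return ['xsltproc', '--maxdepth', str(maxdepth),
            '--html', xslt_path, html_path]

def transform_html(xslt_path: str, html_path: str) -> str:
    """Run the stylesheet transform outside lxml, avoiding the maxHead limit."""
    result = subprocess.run(
        build_xsltproc_cmd(xslt_path, html_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout  # transformed XML as a string
```

The returned XML can then be parsed with `etree.fromstring` and fed into the rest of the original `split_text_from_file` logic in the subclass.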

@AhmedTammaa

I have overridden the class, using bs4 instead of lxml, which is so far working for me:

    def split_text_from_file(self, file: Any) -> List[Document]:
        """Split HTML file using BeautifulSoup.

        Args:
            file: HTML file path or file-like object.

        Returns:
            List of Document objects with page_content and metadata.
        """
        from bs4 import BeautifulSoup
        from langchain.docstore.document import Document
        import bs4

        # Read the HTML content from the file or file-like object
        if isinstance(file, str):
            with open(file, 'r', encoding='utf-8') as f:
                html_content = f.read()
        else:
            # Assuming file is a file-like object
            html_content = file.read()

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')

        # Extract the header tags and their corresponding metadata keys
        headers_to_split_on = [tag[0] for tag in self.headers_to_split_on]
        header_mapping = dict(self.headers_to_split_on)

        documents = []

        # Find the body of the document
        body = soup.body if soup.body else soup

        # Find all header tags in the order they appear
        all_headers = body.find_all(headers_to_split_on)

        # If there's content before the first header, collect it
        first_header = all_headers[0] if all_headers else None
        if first_header:
            # Collect text nodes that appear before the first header.
            # (Iterating Tags here would also visit ancestors like <body>,
            # whose get_text() includes the whole document; iterating
            # strings avoids that duplication. find_all_previous returns
            # them in reverse document order, so re-reverse them.)
            pre_texts = [
                s.strip()
                for s in first_header.find_all_previous(string=True)
                if s.strip()
            ]
            pre_header_content = ' '.join(reversed(pre_texts))
            if pre_header_content.strip():
                documents.append(Document(
                    page_content=pre_header_content.strip(),
                    metadata={}  # No metadata since there's no header
                ))
        else:
            # If no headers are found, return the whole content
            full_text = body.get_text(separator=' ', strip=True)
            if full_text.strip():
                documents.append(Document(
                    page_content=full_text.strip(),
                    metadata={}
                ))
            return documents

        # Process each header and its associated content
        for header in all_headers:
            current_metadata = {}
            header_name = header.name
            header_text = header.get_text(separator=' ', strip=True)
            current_metadata[header_mapping[header_name]] = header_text

            # Collect all sibling elements until the next header from headers_to_split_on
            content_elements = []
            for sibling in header.find_next_siblings():
                if sibling.name in headers_to_split_on:
                    # Stop at the next header
                    break
                if isinstance(sibling, bs4.Tag):
                    content_elements.append(sibling)

            # Get the text content of the collected elements
            current_content = ''
            for elem in content_elements:
                text = elem.get_text(separator=' ', strip=True)
                if text:
                    current_content += text + ' '

            # Create a Document if there is content
            if current_content.strip():
                documents.append(Document(
                    page_content=current_content.strip(),
                    metadata=current_metadata.copy()
                ))
            else:
                # If there's no content, but we have metadata, still create a Document
                documents.append(Document(
                    page_content='',
                    metadata=current_metadata.copy()
                ))

        return documents
    

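For anyone who wants the same header-grouping idea without the bs4 or langchain dependencies, here is a minimal stdlib sketch of the technique (a simplified illustration: tags are hard-coded to h1–h4, and no Document objects or metadata mapping are produced, just `(header, body)` pairs):

```python
from html.parser import HTMLParser

class HeaderSplitter(HTMLParser):
    """Group text under the most recent header tag (h1-h4)."""
    HEADERS = {'h1', 'h2', 'h3', 'h4'}

    def __init__(self):
        super().__init__()
        self.sections = []          # list of (header_text, body_text)
        self._in_header = False
        self._header, self._body = '', ''

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADERS:
            # Flush the previous section before starting a new one
            if self._header or self._body.strip():
                self.sections.append((self._header, self._body.strip()))
            self._header, self._body = '', ''
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in self.HEADERS:
            self._in_header = False

    def handle_data(self, data):
        if self._in_header:
            self._header += data
        else:
            self._body += data

    def close(self):
        super().close()
        # Flush the final section
        if self._header or self._body.strip():
            self.sections.append((self._header, self._body.strip()))
```

Usage: feed the HTML string, then call `close()` and read `sections`.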