HTMLHeaderTextSplitter won't run (maxHead) #13149

Closed

AndyMik90 opened this issue Nov 9, 2023 · 10 comments · May be fixed by #27678
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: doc loader Related to document loader module (not documentation)

Comments

@AndyMik90

System Info

Notebook with latest langchain

Who can help?

No response

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

Trying the HTMLHeaderTextSplitter in a notebook:

from langchain.text_splitter import HTMLHeaderTextSplitter, RecursiveCharacterTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_file("/content/X.html")

chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
splits[80:85]
---------------------------------------------------------------------------
XSLTApplyError                            Traceback (most recent call last)
[<ipython-input-54-bd3edea942d1>](https://localhost:8080/#) in <cell line: 12>()
     10 html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
     11 
---> 12 html_header_splits = html_splitter.split_text_from_file("/content/X.html")
     13 
     14 chunk_size = 500

[/usr/local/lib/python3.10/dist-packages/langchain/text_splitter.py](https://localhost:8080/#) in split_text_from_file(self, file)
    586         xslt_tree = etree.parse(xslt_path)
    587         transform = etree.XSLT(xslt_tree)
--> 588         result = transform(tree)
    589         result_dom = etree.fromstring(str(result))
    590 

src/lxml/xslt.pxi in lxml.etree.XSLT.__call__()

XSLTApplyError: maxHead

Is the HTML just too large to be handled by the text splitter?

Expected behavior

Load the HTML without errors.

dosubot bot commented Nov 9, 2023

🤖

Hello @AndyMik90,

I'm an assistive bot here to help you with your issue while we wait for a human maintainer. I'm able to help address bugs, answer questions, and even guide you on how to become a contributor to the project. Rest assured, I'm here to help!

I'm currently reviewing the problem you've encountered with HTMLHeaderTextSplitter in the LangChain repository. I'll be back shortly with a more comprehensive solution!

Stay tuned!

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Nov 9, 2023

dosubot bot commented Nov 9, 2023

🤖

The error you're encountering, XSLTApplyError: maxHead, is likely due to the size of the HTML file you're trying to process. The HTMLHeaderTextSplitter class uses the lxml package to parse and transform the HTML file. The lxml package, in turn, uses libxml2 and libxslt libraries, which have certain limitations.

The maxHead error is typically thrown when the size of the HTML file exceeds the maximum size that can be processed by the XSLT processor. This limit is not explicitly defined in the LangChain codebase or the lxml package, but it's a limitation of the underlying libraries.

One possible solution could be to split your HTML file into smaller chunks before processing it with the HTMLHeaderTextSplitter. This could be done manually or by using a tool or script to automate the process.
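As a rough illustration of that first suggestion, one could cut the raw HTML into pieces at header boundaries before feeding each piece to the splitter. This is a simplified stdlib sketch, not LangChain code: real HTML may need a proper parser, and the default `max_chars` threshold here is an assumption, not a documented limit.

```python
import re

def chunk_html(html: str, max_chars: int = 100_000) -> list[str]:
    """Split raw HTML into pieces no longer than max_chars,
    cutting preferentially at <h1>/<h2> boundaries."""
    # Split before every h1/h2 opening tag, keeping each tag with its section
    sections = re.split(r'(?=<h[12][\s>])', html)
    chunks, current = [], ''
    for section in sections:
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)
            current = ''
        current += section
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to `HTMLHeaderTextSplitter.split_text` individually and the resulting document lists concatenated.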

Another possible solution could be to increase the maximum size limit of the XSLT processor. However, this would involve modifying the source code of the libxslt library and recompiling it, which may not be feasible or desirable.

Please note that these are just possible solutions and may not work in all cases. If you continue to encounter issues, please provide more details about the size and structure of your HTML file, and any other relevant information that could help in troubleshooting the issue.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.


dosubot bot commented Feb 8, 2024

Hi, @AndyMik90,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, you raised an issue regarding the HTMLHeaderTextSplitter not running due to a maxHead error when attempting to split a large HTML file. Dosubot provided a detailed response explaining the error and suggested potential solutions for resolving the issue, such as splitting the HTML file into smaller chunks or modifying the underlying libraries. They also requested more details for further troubleshooting.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, kindly let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 8, 2024
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 15, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 15, 2024
@wcohnshb

I'm having the same problem, and I'm using the latest version of the LangChain repository.

@sprklinginfo

Me too. Have you found any solutions?

@KivenChen

KivenChen commented May 15, 2024

This is indeed a problem from the lxml package, which seems to use libxslt to parse HTML content. However, libxslt has a restriction of about 2^16 characters (the range of a uint16_t).

It is high time we switch to another dependency.
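If the limit is character-based as suggested, a file just above ~2^16 characters should reproduce the error. A quick generator for such a test file (the exact threshold is unverified, so the section count is a deliberate overshoot):

```python
def make_large_html(n_sections: int = 2000) -> str:
    """Generate an HTML document large enough to exceed ~2^16 characters."""
    body = ''.join(
        f'<h2>Section {i}</h2><p>{"lorem ipsum " * 5}</p>'
        for i in range(n_sections)
    )
    return f'<html><body><h1>Title</h1>{body}</body></html>'
```

Writing this string to a file and passing it to `split_text_from_file` should trigger the `XSLTApplyError` on affected versions.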

@brett-matson

I've just hit this problem too. Anyone know of any workarounds?

@davonte7

+1 Looking for a workaround

@raphaelgurtner

My workaround is to subclass HTMLHeaderTextSplitter and override split_text_from_file. In there I call xsltproc via subprocess.run just for the transformation, using the same html_chunks_with_headers.xslt.
xsltproc has a --maxdepth parameter that you'll have to experiment with and increase. Make sure xsltproc is installed on the system where you use this workaround.
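A sketch of the transform step of that workaround. The function names, the default `--maxdepth` value, and the stylesheet path are assumptions to adapt to your installation; only the `xsltproc` flags themselves (`--maxdepth`, `--html`) are real options.

```python
import subprocess

def build_xsltproc_cmd(xslt_path: str, html_path: str,
                       maxdepth: int = 40_000) -> list[str]:
    """Assemble the xsltproc invocation with a raised --maxdepth."""
    return ['xsltproc', '--maxdepth', str(maxdepth),
            '--html', xslt_path, html_path]

def transform_html(xslt_path: str, html_path: str) -> str:
    """Run the stylesheet transform outside lxml, avoiding the maxHead limit."""
    result = subprocess.run(
        build_xsltproc_cmd(xslt_path, html_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout  # transformed XML as a string
```

The returned XML can then be parsed with `etree.fromstring` and fed into the rest of the original `split_text_from_file` logic in the subclass.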

@AhmedTammaa

I have overridden the class, using bs4 instead of lxml, which is so far working for me:

    def split_text_from_file(self, file: Any) -> List[Document]:
        """Split HTML file using BeautifulSoup.

        Args:
            file: HTML file path or file-like object.

        Returns:
            List of Document objects with page_content and metadata.
        """
        from bs4 import BeautifulSoup
        from langchain.docstore.document import Document
        import bs4

        # Read the HTML content from the file or file-like object
        if isinstance(file, str):
            with open(file, 'r', encoding='utf-8') as f:
                html_content = f.read()
        else:
            # Assuming file is a file-like object
            html_content = file.read()

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')

        # Extract the header tags and their corresponding metadata keys
        headers_to_split_on = [tag[0] for tag in self.headers_to_split_on]
        header_mapping = dict(self.headers_to_split_on)

        documents = []

        # Find the body of the document
        body = soup.body if soup.body else soup

        # Find all header tags in the order they appear
        all_headers = body.find_all(headers_to_split_on)

        # If there's content before the first header, collect it
        first_header = all_headers[0] if all_headers else None
        if first_header:
            # Collect text nodes that appear before the first header.
            # (Iterating Tags here would also visit ancestors like <body>,
            # whose get_text() includes the whole document; iterating
            # strings avoids that duplication. find_all_previous returns
            # them in reverse document order, so re-reverse them.)
            pre_texts = [
                s.strip()
                for s in first_header.find_all_previous(string=True)
                if s.strip()
            ]
            pre_header_content = ' '.join(reversed(pre_texts))
            if pre_header_content.strip():
                documents.append(Document(
                    page_content=pre_header_content.strip(),
                    metadata={}  # No metadata since there's no header
                ))
        else:
            # If no headers are found, return the whole content
            full_text = body.get_text(separator=' ', strip=True)
            if full_text.strip():
                documents.append(Document(
                    page_content=full_text.strip(),
                    metadata={}
                ))
            return documents

        # Process each header and its associated content
        for header in all_headers:
            current_metadata = {}
            header_name = header.name
            header_text = header.get_text(separator=' ', strip=True)
            current_metadata[header_mapping[header_name]] = header_text

            # Collect all sibling elements until the next header from headers_to_split_on
            content_elements = []
            for sibling in header.find_next_siblings():
                if sibling.name in headers_to_split_on:
                    # Stop at the next header
                    break
                if isinstance(sibling, bs4.Tag):
                    content_elements.append(sibling)

            # Get the text content of the collected elements
            current_content = ''
            for elem in content_elements:
                text = elem.get_text(separator=' ', strip=True)
                if text:
                    current_content += text + ' '

            # Create a Document if there is content
            if current_content.strip():
                documents.append(Document(
                    page_content=current_content.strip(),
                    metadata=current_metadata.copy()
                ))
            else:
                # If there's no content, but we have metadata, still create a Document
                documents.append(Document(
                    page_content='',
                    metadata=current_metadata.copy()
                ))

        return documents
    

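For anyone who wants the same header-grouping idea without the bs4 or langchain dependencies, here is a minimal stdlib sketch of the technique (a simplified illustration: tags are hard-coded to h1–h4, and no Document objects or metadata mapping are produced, just `(header, body)` pairs):

```python
from html.parser import HTMLParser

class HeaderSplitter(HTMLParser):
    """Group text under the most recent header tag (h1-h4)."""
    HEADERS = {'h1', 'h2', 'h3', 'h4'}

    def __init__(self):
        super().__init__()
        self.sections = []          # list of (header_text, body_text)
        self._in_header = False
        self._header, self._body = '', ''

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADERS:
            # Flush the previous section before starting a new one
            if self._header or self._body.strip():
                self.sections.append((self._header, self._body.strip()))
            self._header, self._body = '', ''
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in self.HEADERS:
            self._in_header = False

    def handle_data(self, data):
        if self._in_header:
            self._header += data
        else:
            self._body += data

    def close(self):
        super().close()
        # Flush the final section
        if self._header or self._body.strip():
            self.sections.append((self._header, self._body.strip()))
```

Usage: feed the HTML string, then call `close()` and read `sections`.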