langchain: Replace lxml and XSLT with BeautifulSoup in HTMLHeaderTextSplitter for Improved Large HTML File Processing #27678

Open

AhmedTammaa wants to merge 45 commits into master from patch-1

Changes from 9 commits

Commits (45)
0771f8e  Update html.py (AhmedTammaa, Oct 28, 2024)
c52667a  Merge branch 'master' into patch-1 (AhmedTammaa, Oct 29, 2024)
8dc8e46  Merge branch 'master' into patch-1 (AhmedTammaa, Nov 8, 2024)
d4efd97  Update html.py (AhmedTammaa, Nov 8, 2024)
73c001c  Update html.py (AhmedTammaa, Nov 8, 2024)
9119fe9  Update html.py (AhmedTammaa, Nov 8, 2024)
7e0ce8e  Update html.py (AhmedTammaa, Nov 8, 2024)
d604fd1  Merge branch 'master' into patch-1 (AhmedTammaa, Nov 8, 2024)
dfe4ee4  Merge branch 'master' into patch-1 (eyurtsev, Dec 13, 2024)
6bfc158  Update html.py (AhmedTammaa, Dec 16, 2024)
b84f13c  Update test_text_splitters.py (AhmedTammaa, Dec 16, 2024)
6a2f1e9  Merge branch 'master' into patch-1 (AhmedTammaa, Dec 17, 2024)
17ae8b9  added import Tuple (AhmedTammaa, Dec 17, 2024)
be9de90  Merge branch 'master' into patch-1 (AhmedTammaa, Dec 17, 2024)
851ba7e  Merge branch 'master' into patch-1 (AhmedTammaa, Dec 17, 2024)
0306951  added beautifulsoup4 to poetry depedencies (AhmedTammaa, Dec 17, 2024)
09e7852  Merge branch 'master' into patch-1 (AhmedTammaa, Dec 18, 2024)
ae50b32  discarded bs4 dependency (AhmedTammaa, Dec 18, 2024)
f9a93d0  Removed uncessary module docstring, updated docstring of HTMLHeaderTe… (AhmedTammaa, Dec 18, 2024)
438aedd  improved docstring for the class `HTMLHeaderTextSplitter` (AhmedTammaa, Dec 18, 2024)
d573723  removed typing from docstring when type is hinted. (AhmedTammaa, Dec 18, 2024)
405ea70  Merge branch 'master' into patch-1 (AhmedTammaa, Dec 19, 2024)
f6e45e2  Merge branch 'master' into patch-1 (AhmedTammaa, Dec 19, 2024)
617e04a  Merge branch 'master' into patch-1 (AhmedTammaa, Dec 19, 2024)
b82bfc9  added pytest mark require bs4 (AhmedTammaa, Dec 19, 2024)
4297787  added requirement bs4 marker for the test cases (AhmedTammaa, Dec 19, 2024)
c2107b1  all test function involving HTMLHeaderTextSplitter has bs4 requirment… (AhmedTammaa, Dec 19, 2024)
4261885  added bs4 import in the split_file_function and removed it from top l… (AhmedTammaa, Dec 19, 2024)
567318a  fixing linting errors and improved documentation for HTMLHeaderTextSp… (AhmedTammaa, Dec 19, 2024)
53685eb  fixed docstring issue and sorted imports (AhmedTammaa, Dec 19, 2024)
9ff0bfa  sorted imports and defined `nodes` in `_generate_documents` docstring (AhmedTammaa, Dec 19, 2024)
aeae28c  updated import order (AhmedTammaa, Dec 19, 2024)
e67f6bd  fixed all linting issues with Ruff (AhmedTammaa, Dec 20, 2024)
3b8a547  Merge branch 'master' into patch-1 (AhmedTammaa, Dec 20, 2024)
cdd62b7  removed extra blank space from `_finalize_chunk` (AhmedTammaa, Dec 20, 2024)
b4d4e57  added types for untyped function paramters. Typed `stack` variable as… (AhmedTammaa, Dec 20, 2024)
d7ea998  fixed "line too long" in test_text_splitters (AhmedTammaa, Dec 20, 2024)
2bf3726  fixed linter issues in test_text_splitter.py (AhmedTammaa, Dec 20, 2024)
7dd9f15  fixed mypy issues (AhmedTammaa, Dec 20, 2024)
456c36a  fixed all formatting issues and checked with pre-commit (AhmedTammaa, Dec 20, 2024)
533bc90  Merge branch 'master' into patch-1 (AhmedTammaa, Dec 20, 2024)
f31e4b7  Merge branch 'master' into patch-1 (AhmedTammaa, Dec 20, 2024)
bbe5616  simplified HTMLHeaderSplitter Logic (AhmedTammaa, Dec 21, 2024)
5637dc7  improved documentation and formatting (AhmedTammaa, Dec 21, 2024)
4aaa912  Merge branch 'master' into patch-1 (AhmedTammaa, Dec 23, 2024)
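
The public API of `HTMLHeaderTextSplitter` is unchanged by this PR; only the parsing backend moves from lxml/XSLT to BeautifulSoup. A minimal usage sketch of the intended behavior (the HTML sample and header mapping below are illustrative, not taken from the diff):

```python
from langchain_text_splitters import HTMLHeaderTextSplitter

# Illustrative input: any HTML containing the configured header tags.
html = """
<html><body>
  <h1>Intro</h1><p>Welcome.</p>
  <h2>Setup</h2><p>Install the package.</p>
</body></html>
"""

# Map each header tag to the metadata key its text should populate.
splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")]
)

for doc in splitter.split_text(html):
    # Each chunk records the text of the header it was split under,
    # e.g. metadata {"Header 2": "Setup"} for the "Install the package." chunk.
    print(doc.metadata, doc.page_content)
```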
152 changes: 87 additions & 65 deletions libs/text-splitters/langchain_text_splitters/html.py
@@ -6,7 +6,9 @@
 from typing import Any, Dict, Iterable, List, Optional, Tuple, TypedDict, cast
 
 import requests
+from bs4 import BeautifulSoup
+from bs4.element import Tag
 from langchain_core.documents import Document
 
 from langchain_text_splitters.character import RecursiveCharacterTextSplitter
 
@@ -89,80 +91,100 @@ def split_text(self, text: str) -> List[Document]:
             text: HTML text
         """
         return self.split_text_from_file(StringIO(text))
 
     def split_text_from_file(self, file: Any) -> List[Document]:
         """Split HTML file.
 
         Args:
-            file: HTML file
+            file: HTML file path or file-like object.
+
+        Returns:
+            List of Document objects with page_content and metadata.
         """
-        try:
-            from lxml import etree
-        except ImportError as e:
-            raise ImportError(
-                "Unable to import lxml, please install with `pip install lxml`."
-            ) from e
-        # use lxml library to parse html document and return xml ElementTree
-        # Explicitly encoding in utf-8 allows non-English
-        # html files to be processed without garbled characters
-        parser = etree.HTMLParser(encoding="utf-8")
-        tree = etree.parse(file, parser)
-
-        # document transformation for "structure-aware" chunking is handled with xsl.
-        # see comments in html_chunks_with_headers.xslt for more detailed information.
-        xslt_path = pathlib.Path(__file__).parent / "xsl/html_chunks_with_headers.xslt"
-        xslt_tree = etree.parse(xslt_path)
-        transform = etree.XSLT(xslt_tree)
-        result = transform(tree)
-        result_dom = etree.fromstring(str(result))
+        # Read the HTML content from the file path or file-like object
+        if isinstance(file, str):
+            with open(file, "r", encoding="utf-8") as f:
+                html_content = f.read()
+        else:
+            # Assume a file-like object
+            html_content = file.read()
 
-        # create filter and mapping for header metadata
-        header_filter = [header[0] for header in self.headers_to_split_on]
-        header_mapping = dict(self.headers_to_split_on)
+        # Parse the HTML content using BeautifulSoup
+        soup = BeautifulSoup(html_content, "html.parser")
 
-        # map xhtml namespace prefix
-        ns_map = {"h": "http://www.w3.org/1999/xhtml"}
+        # Extract the header tags and their corresponding metadata keys
+        headers_to_split_on = [tag[0] for tag in self.headers_to_split_on]
+        header_mapping = dict(self.headers_to_split_on)
 
-        # build list of elements from DOM
-        elements = []
-        for element in result_dom.findall("*//*", ns_map):
-            if element.findall("*[@class='headers']") or element.findall(
-                "*[@class='chunk']"
-            ):
-                elements.append(
-                    ElementType(
-                        url=file,
-                        xpath="".join(
-                            [
-                                node.text or ""
-                                for node in element.findall("*[@class='xpath']", ns_map)
-                            ]
-                        ),
-                        content="".join(
-                            [
-                                node.text or ""
-                                for node in element.findall("*[@class='chunk']", ns_map)
-                            ]
-                        ),
-                        metadata={
-                            # Add text of specified headers to metadata using header
-                            # mapping.
-                            header_mapping[node.tag]: node.text or ""
-                            for node in filter(
-                                lambda x: x.tag in header_filter,
-                                element.findall("*[@class='headers']/*", ns_map),
-                            )
-                        },
-                    )
-                )
+        documents = []
 
-        if not self.return_each_element:
-            return self.aggregate_elements_to_chunks(elements)
-        else:
-            return [
-                Document(page_content=chunk["content"], metadata=chunk["metadata"])
-                for chunk in elements
-            ]
+        # Find the body of the document
+        body = soup.body if soup.body else soup
+
+        # Find all header tags in the order they appear
+        all_headers = body.find_all(headers_to_split_on)
+
+        # If there's content before the first header, collect it
+        first_header = all_headers[0] if all_headers else None
+        if first_header:
+            pre_header_content = ""
+            for elem in first_header.find_all_previous():
+                if isinstance(elem, Tag):
+                    text = elem.get_text(separator=" ", strip=True)
+                    if text:
+                        pre_header_content = text + " " + pre_header_content
+            if pre_header_content.strip():
+                documents.append(
+                    Document(
+                        page_content=pre_header_content.strip(),
+                        metadata={},  # No metadata since there's no header
+                    )
+                )
+        else:
+            # If no headers are found, return the whole content
+            full_text = body.get_text(separator=" ", strip=True)
+            if full_text.strip():
+                documents.append(Document(page_content=full_text.strip(), metadata={}))
+            return documents
+
+        # Process each header and its associated content
+        for header in all_headers:
+            current_metadata = {}
+            header_name = header.name
+            header_text = header.get_text(separator=" ", strip=True)
+            current_metadata[header_mapping[header_name]] = header_text
+
+            # Collect sibling elements until the next header to split on
+            content_elements = []
+            for sibling in header.find_next_siblings():
+                if sibling.name in headers_to_split_on:
+                    # Stop at the next header
+                    break
+                if isinstance(sibling, Tag):
+                    content_elements.append(sibling)
+
+            # Get the text content of the collected elements
+            current_content = ""
+            for elem in content_elements:
+                text = elem.get_text(separator=" ", strip=True)
+                if text:
+                    current_content += text + " "
+
+            # Create a Document if there is content
+            if current_content.strip():
+                documents.append(
+                    Document(
+                        page_content=current_content.strip(),
+                        metadata=current_metadata.copy(),
+                    )
+                )
+            else:
+                # If there's no content, but we have metadata, still create a Document
+                documents.append(
+                    Document(page_content="", metadata=current_metadata.copy())
+                )
+
+        return documents
 
 
 class HTMLSectionSplitter:
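
The new `split_text_from_file` accepts either a filesystem path or a file-like object. A quick sketch exercising both input modes; the `page.html` path is illustrative and written here just for the demo:

```python
from io import StringIO

from langchain_text_splitters import HTMLHeaderTextSplitter

splitter = HTMLHeaderTextSplitter(headers_to_split_on=[("h1", "Header 1")])

html = "<html><body><h1>Title</h1><p>Body text.</p></body></html>"

# File-like object: the splitter calls .read() on it directly.
docs_from_buffer = splitter.split_text_from_file(StringIO(html))

# String path: the splitter opens and reads the file as UTF-8.
with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)
docs_from_path = splitter.split_text_from_file("page.html")

# Both input modes should yield identical chunks.
assert [d.page_content for d in docs_from_buffer] == [
    d.page_content for d in docs_from_path
]
```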