
Commit c43ffc1

Authored by David Xue
Improve HTML Splitter, URL Fetching Logic & Astro Docs Ingestion (#293)
### Description

- This is the first part of a two-part effort to improve the scraping, extraction, chunking, and tokenizing logic for Ask Astro's data ingestion process (see details in issue #258).
- This PR mainly focuses on reducing noise in the ingestion of the Astro Docs data source, along with related changes such as scraping only the latest doc versions and adding automatic exponential backoff to the HTML fetch function.

### Closes the Following Issues

- #292
- #270
- #209

### Partially Completes Issues

- #258 (two-part effort; only the first PR is completed)
- #221 (tackles the token limit in the HTML splitting logic; other parts still need work)

### Technical Details

- airflow/include/tasks/extract/astro_docs.py
  - Add function `process_astro_doc_page_content`, which strips noisy, unhelpful content such as the nav bar, footer, and header, and extracts only the main page article content.
  - Remove the previous function `scrape_page` (which scraped the HTML content AND crawled all of its sub-pages through the links it contained). This is done because: 1. there is already a centralized util function, `fetch_page_content()`, that fetches each page's HTML; 2. there is already a centralized util function, `get_internal_links`, that finds all links on a page; 3. the old scraping logic did not exclude noisy, unrelated content, which is now handled by `process_astro_doc_page_content` above.
- airflow/include/tasks/split.py
  - Modify function `split_html`: it previously split on specific HTML tags using `HTMLHeaderTextSplitter`, which is not ideal because we do not want to split that often and there is no guarantee that splitting on such tags retains semantic meaning. This is changed to `RecursiveCharacterTextSplitter` with a token limit. It ONLY splits if a chunk exceeds the specified number of tokens; if a chunk still exceeds the limit, it moves down the separator list and splits further, ultimately splitting on spaces and individual characters to fit within the token limit. This retains better semantic meaning in each chunk and enforces the token limit (see the splitter sketch below).
- airflow/include/tasks/extract/utils/html_utils.py
  - Change function `fetch_page_content` to add automatic retry with exponential backoff using tenacity (see the retry sketch below).
  - Change function `get_page_links` to traverse a given page recursively and find all links belonging to the website. This ensures no duplicate pages are traversed and no pages are missed. Previously, the logic missed some links during traversal, likely because it used a for loop rather than recursing until all links were exhausted.
    - Note: this makes a large difference in the fetched URLs. Previously, many links looked like https://abc.com/abc#XXX and https://abc.com/abc#YYY, where the fragment points to a section of the same page, but the logic could not distinguish them.
- airflow/requirements.txt: add required packages.
- api/ask_astro/settings.py: remove unused variables.
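The token-limited splitting described above can be sketched roughly as follows. This is only an illustration, not the exact code in `split.py`: it assumes LangChain's `RecursiveCharacterTextSplitter` with a tiktoken-based length function, and the separator list, token budget, and overlap values are placeholders that may differ from what the PR actually uses.

```python
# Minimal sketch of token-limited HTML splitting (assumed values, not the PR's exact code).
from __future__ import annotations

import pandas as pd
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter


def split_html(dfs: list[pd.DataFrame]) -> pd.DataFrame:
    df = pd.concat(dfs, axis=0, ignore_index=True)

    encoding = tiktoken.get_encoding("cl100k_base")  # assumed encoding
    splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", " ", ""],  # fall back to finer separators only when needed
        chunk_size=4000,  # token budget per chunk (placeholder value)
        chunk_overlap=200,  # placeholder value
        length_function=lambda text: len(encoding.encode(text)),  # measure length in tokens
    )

    # split_text returns a single chunk for short documents, so only content
    # exceeding the token budget is actually split; explode one chunk per row.
    df["doc_chunks"] = df["content"].apply(splitter.split_text)
    df = df.explode("doc_chunks", ignore_index=True)
    df["content"] = df["doc_chunks"]
    return df.drop(columns=["doc_chunks"])
```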
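Similarly, the retry-with-exponential-backoff change to `fetch_page_content` can be sketched like this. The attempt count, wait bounds, and the requests call are placeholder assumptions; the actual decorator parameters in `html_utils.py` may differ.

```python
# Minimal sketch of fetch_page_content with tenacity-based exponential backoff
# (placeholder retry settings, not the PR's exact configuration).
import requests
from tenacity import retry, stop_after_attempt, wait_exponential


@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=1, max=30))
def fetch_page_content(url: str) -> str:
    """Fetch a page's HTML, retrying transient failures with exponential backoff."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()  # raise on HTTP errors so tenacity retries the call
    return response.text
```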
### Results

#### Astro Docs: Better URLs Fetched + Crawling Improvement + HTML Splitter Improvement

1. Example of formatting and chunking
   - Previously (near unreadable): ![image](https://github.com/astronomer/ask-astro/assets/26350341/90ff59f9-1401-4395-8add-cecd8bf08ac4)
   - Now (cleaned!): ![image](https://github.com/astronomer/ask-astro/assets/26350341/b465a4fc-497c-4687-b601-aa03ba12fc15)
2. Example of URL differences
   - Previously: around 1000 links fetched, many with DUPLICATE content since they point to the same page, plus XML and other non-HTML/website content. See old links: [astro_docs_links_old.txt](https://github.com/astronomer/ask-astro/files/14146665/astro_docs_links_old.txt)
   - Now: no more duplicate or unreleased pages, and no older versions of the software docs; only the latest docs are ingested (e.g., the .../0.31... links are gone). [new_astro_docs_links.txt](https://github.com/astronomer/ask-astro/files/14146669/new_astro_docs_links.txt)

#### Evaluation

- Overall improvement in answer and retrieval quality
- No degradation noted
- CSV posted in comments
1 parent d6bc44b commit c43ffc1

18 files changed: +136 −130 lines changed

airflow/dags/ingestion/ask-astro-forum-load.py

+1 −1

```diff
@@ -44,7 +44,7 @@ def ask_astro_load_astro_forum():
         class_name=WEAVIATE_CLASS,
         existing="replace",
         document_column="docLink",
-        batch_config_params={"batch_size": 1000},
+        batch_config_params={"batch_size": 7, "dynamic": False},
         verbose=True,
         conn_id=_WEAVIATE_CONN_ID,
         task_id="WeaviateDocumentIngestOperator",
```

airflow/dags/ingestion/ask-astro-load-airflow-docs.py

+6 −19

```diff
@@ -1,7 +1,6 @@
 import os
 from datetime import datetime

-import pandas as pd
 from include.utils.slack import send_failure_notification

 from airflow.decorators import dag, task
@@ -20,21 +19,6 @@
 schedule_interval = os.environ.get("INGESTION_SCHEDULE", "0 5 * * 2") if ask_astro_env == "prod" else None


-@task
-def split_docs(urls: str, chunk_size: int = 100) -> list[list[pd.DataFrame]]:
-    """
-    Split the URLs in chunk and get dataframe for the content
-
-    param urls: List for HTTP URL
-    param chunk_size: Max number of document in split chunk
-    """
-    from include.tasks import split
-    from include.tasks.extract.utils.html_utils import urls_to_dataframe
-
-    chunked_urls = split.split_list(list(urls), chunk_size=chunk_size)
-    return [[urls_to_dataframe(chunk_url)] for chunk_url in chunked_urls]
-
-
 @dag(
     schedule_interval=schedule_interval,
     start_date=datetime(2023, 9, 27),
@@ -51,19 +35,22 @@ def ask_astro_load_airflow_docs():
     data from a point-in-time data capture. By using the upsert logic of the weaviate_import decorator
     any existing documents that have been updated will be removed and re-added.
     """
+    from include.tasks import split
    from include.tasks.extract import airflow_docs

-    extracted_airflow_docs = task(airflow_docs.extract_airflow_docs)(docs_base_url=airflow_docs_base_url)
+    extracted_airflow_docs = task(split.split_html).expand(
+        dfs=[airflow_docs.extract_airflow_docs(docs_base_url=airflow_docs_base_url)]
+    )

     _import_data = WeaviateDocumentIngestOperator.partial(
         class_name=WEAVIATE_CLASS,
         existing="replace",
         document_column="docLink",
-        batch_config_params={"batch_size": 1000},
+        batch_config_params={"batch_size": 7, "dynamic": False},
         verbose=True,
         conn_id=_WEAVIATE_CONN_ID,
         task_id="WeaviateDocumentIngestOperator",
-    ).expand(input_data=split_docs(extracted_airflow_docs, chunk_size=100))
+    ).expand(input_data=[extracted_airflow_docs])


 ask_astro_load_airflow_docs()
```

airflow/dags/ingestion/ask-astro-load-astro-cli.py

+1 −1

```diff
@@ -42,7 +42,7 @@ def ask_astro_load_astro_cli_docs():
         class_name=WEAVIATE_CLASS,
         existing="replace",
         document_column="docLink",
-        batch_config_params={"batch_size": 1000},
+        batch_config_params={"batch_size": 7, "dynamic": False},
         verbose=True,
         conn_id=_WEAVIATE_CONN_ID,
         task_id="WeaviateDocumentIngestOperator",
```

airflow/dags/ingestion/ask-astro-load-astro-sdk.py

+1 −1

```diff
@@ -41,7 +41,7 @@ def ask_astro_load_astro_sdk():
         class_name=WEAVIATE_CLASS,
         existing="replace",
         document_column="docLink",
-        batch_config_params={"batch_size": 1000},
+        batch_config_params={"batch_size": 7, "dynamic": False},
         verbose=True,
         conn_id=_WEAVIATE_CONN_ID,
         task_id="WeaviateDocumentIngestOperator",
```

airflow/dags/ingestion/ask-astro-load-astronomer-docs.py

+3 −3

```diff
@@ -36,17 +36,17 @@ def ask_astro_load_astronomer_docs():

     astro_docs = task(extract_astro_docs)()

-    split_md_docs = task(split.split_markdown).expand(dfs=[astro_docs])
+    split_html_docs = task(split.split_html).expand(dfs=[astro_docs])

     _import_data = WeaviateDocumentIngestOperator.partial(
         class_name=WEAVIATE_CLASS,
         existing="replace",
         document_column="docLink",
-        batch_config_params={"batch_size": 1000},
+        batch_config_params={"batch_size": 7, "dynamic": False},
         verbose=True,
         conn_id=_WEAVIATE_CONN_ID,
         task_id="WeaviateDocumentIngestOperator",
-    ).expand(input_data=[split_md_docs])
+    ).expand(input_data=[split_html_docs])


 ask_astro_load_astronomer_docs()
```

airflow/dags/ingestion/ask-astro-load-astronomer-provider.py

+1 −1

```diff
@@ -47,7 +47,7 @@ def ask_astro_load_astronomer_providers():
         class_name=WEAVIATE_CLASS,
         existing="replace",
         document_column="docLink",
-        batch_config_params={"batch_size": 1000},
+        batch_config_params={"batch_size": 7, "dynamic": False},
         verbose=True,
         conn_id=_WEAVIATE_CONN_ID,
         task_id="WeaviateDocumentIngestOperator",
```

airflow/dags/ingestion/ask-astro-load-blogs.py

+1 −1

```diff
@@ -45,7 +45,7 @@ def ask_astro_load_blogs():
         class_name=WEAVIATE_CLASS,
         existing="replace",
         document_column="docLink",
-        batch_config_params={"batch_size": 1000},
+        batch_config_params={"batch_size": 7, "dynamic": False},
         verbose=True,
         conn_id=_WEAVIATE_CONN_ID,
         task_id="WeaviateDocumentIngestOperator",
```

airflow/dags/ingestion/ask-astro-load-github.py

+1 −1

```diff
@@ -64,7 +64,7 @@ def ask_astro_load_github():
         class_name=WEAVIATE_CLASS,
         existing="replace",
         document_column="docLink",
-        batch_config_params={"batch_size": 1000},
+        batch_config_params={"batch_size": 7, "dynamic": False},
         verbose=True,
         conn_id=_WEAVIATE_CONN_ID,
         task_id="WeaviateDocumentIngestOperator",
```

airflow/dags/ingestion/ask-astro-load-registry.py

+1 −1

```diff
@@ -47,7 +47,7 @@ def ask_astro_load_registry():
         class_name=WEAVIATE_CLASS,
         existing="replace",
         document_column="docLink",
-        batch_config_params={"batch_size": 1000},
+        batch_config_params={"batch_size": 7, "dynamic": False},
         verbose=True,
         conn_id=_WEAVIATE_CONN_ID,
         task_id="WeaviateDocumentIngestOperator",
```

airflow/dags/ingestion/ask-astro-load-slack.py

+1 −1

```diff
@@ -53,7 +53,7 @@ def ask_astro_load_slack():
         class_name=WEAVIATE_CLASS,
         existing="replace",
         document_column="docLink",
-        batch_config_params={"batch_size": 1000},
+        batch_config_params={"batch_size": 7, "dynamic": False},
         verbose=True,
         conn_id=_WEAVIATE_CONN_ID,
         task_id="WeaviateDocumentIngestOperator",
```

airflow/dags/ingestion/ask-astro-load-stackoverflow.py

+1 −1

```diff
@@ -53,7 +53,7 @@ def ask_astro_load_stackoverflow():
         class_name=WEAVIATE_CLASS,
         existing="replace",
         document_column="docLink",
-        batch_config_params={"batch_size": 1000},
+        batch_config_params={"batch_size": 7, "dynamic": False},
         verbose=True,
         conn_id=_WEAVIATE_CONN_ID,
         task_id="WeaviateDocumentIngestOperator",
```

airflow/dags/ingestion/ask-astro-load.py

+3 −3

```diff
@@ -206,7 +206,7 @@ def extract_airflow_docs():
         else:
             raise Exception("Parquet file exists locally but is not readable.")
     else:
-        df = airflow_docs.extract_airflow_docs(docs_base_url=airflow_docs_base_url)[0]
+        df = airflow_docs.extract_airflow_docs.function(docs_base_url=airflow_docs_base_url)[0]
         df.to_parquet(parquet_file)

     return [df]
@@ -442,7 +442,7 @@ def import_baseline(
         class_name=WEAVIATE_CLASS,
         existing="replace",
         document_column="docLink",
-        batch_config_params={"batch_size": 1000},
+        batch_config_params={"batch_size": 7, "dynamic": False},
         verbose=True,
         conn_id=_WEAVIATE_CONN_ID,
         task_id="WeaviateDocumentIngestOperator",
@@ -455,7 +455,7 @@ def import_baseline(
         document_column="docLink",
         uuid_column="id",
         vector_column="vector",
-        batch_config_params={"batch_size": 1000},
+        batch_config_params={"batch_size": 7, "dynamic": False},
         verbose=True,
     )

```

airflow/include/tasks/extract/airflow_docs.py

+6 −1

```diff
@@ -8,9 +8,11 @@
 from bs4 import BeautifulSoup
 from weaviate.util import generate_uuid5

+from airflow.decorators import task
 from include.tasks.extract.utils.html_utils import get_internal_links


+@task
 def extract_airflow_docs(docs_base_url: str) -> list[pd.DataFrame]:
     """
     This task return all internal url for Airflow docs
@@ -36,7 +38,10 @@ def extract_airflow_docs(docs_base_url: str) -> list[pd.DataFrame]:
     docs_url_parts = urllib.parse.urlsplit(docs_base_url)
     docs_url_base = f"{docs_url_parts.scheme}://{docs_url_parts.netloc}"
     # make sure we didn't accidentally pickup any unrelated links in recursion
-    non_doc_links = {link if docs_url_base not in link else "" for link in all_links}
+    old_version_doc_pattern = r"/(\d+\.)*\d+/"
+    non_doc_links = {
+        link if (docs_url_base not in link) or re.search(old_version_doc_pattern, link) else "" for link in all_links
+    }
     docs_links = all_links - non_doc_links

     df = pd.DataFrame(docs_links, columns=["docLink"])
```

airflow/include/tasks/extract/astro_docs.py

+51 −63

```diff
@@ -1,83 +1,58 @@
 from __future__ import annotations

-import logging
-from urllib.parse import urldefrag, urljoin
+import re

 import pandas as pd
-import requests
 from bs4 import BeautifulSoup
 from weaviate.util import generate_uuid5

+from include.tasks.extract.utils.html_utils import fetch_page_content, get_internal_links
+
 base_url = "https://docs.astronomer.io/"


-def fetch_page_content(url: str) -> str:
-    """
-    Fetches the content of a given URL.
+def process_astro_doc_page_content(page_content: str) -> str:
+    soup = BeautifulSoup(page_content, "html.parser")

-    :param url: URL of the page to fetch.
-    :return: HTML content of the page.
-    """
-    try:
-        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
-        if response.status_code == 200:
-            return response.content
-    except requests.RequestException as e:
-        logging.error(f"Error fetching {url}: {e}")
-    return ""
+    # Find the main article container
+    main_container = soup.find("main", class_="docMainContainer_TBSr")

+    content_of_interest = main_container if main_container else soup
+    for nav_tag in content_of_interest.find_all("nav"):
+        nav_tag.decompose()

-def extract_links(soup: BeautifulSoup, base_url: str) -> list[str]:
-    """
-    Extracts all valid links from a BeautifulSoup object.
+    for script_or_style in content_of_interest.find_all(["script", "style", "button", "img", "svg"]):
+        script_or_style.decompose()

-    :param soup: BeautifulSoup object to extract links from.
-    :param base_url: Base URL for relative links.
-    :return: List of extracted URLs.
-    """
-    links = []
-    for link in soup.find_all("a", href=True):
-        href = link["href"]
-        if not href.startswith("http"):
-            href = urljoin(base_url, href)
-        if href.startswith(base_url):
-            links.append(href)
-    return links
+    feedback_widget = content_of_interest.find("div", id="feedbackWidget")
+    if feedback_widget:
+        feedback_widget.decompose()

+    newsletter_form = content_of_interest.find("form", id="newsletterForm")
+    if newsletter_form:
+        newsletter_form.decompose()

-def scrape_page(url: str, visited_urls: set, docs_data: list) -> None:
-    """
-    Recursively scrapes a webpage and its subpages.
+    sidebar = content_of_interest.find("ul", class_=lambda value: value and "table-of-contents" in value)
+    if sidebar:
+        sidebar.decompose()

-    :param url: URL of the page to scrape.
-    :param visited_urls: Set of URLs already visited.
-    :param docs_data: List to append extracted data to.
-    """
-    if url in visited_urls or not url.startswith(base_url):
-        return
-
-    # Normalize URL by stripping off the fragment
-    base_url_no_fragment, frag = urldefrag(url)
+    footers = content_of_interest.find_all("footer")
+    for footer in footers:
+        footer.decompose()

-    # If the URL is the base URL plus a fragment, ignore it
-    if base_url_no_fragment == base_url and frag:
-        return
+    # The actual article in almost all pages of Astro Docs website is in the following HTML container
+    container_div = content_of_interest.find("div", class_=lambda value: value and "container" in value)

-    visited_urls.add(url)
+    if container_div:
+        row_div = container_div.find("div", class_="row")

-    logging.info(f"Scraping : {url}")
+        if row_div:
+            col_div = row_div.find("div", class_=lambda value: value and "col" in value)

-    page_content = fetch_page_content(url)
-    if not page_content:
-        return
+            if col_div:
+                content_of_interest = str(col_div)

-    soup = BeautifulSoup(page_content, "lxml")
-    content = soup.get_text(strip=True)
-    sha = generate_uuid5(content)
-    docs_data.append({"docSource": "astro docs", "sha": sha, "content": content, "docLink": url})
-    # Recursively scrape linked pages
-    for link in extract_links(soup, base_url):
-        scrape_page(link, visited_urls, docs_data)
+    return str(content_of_interest).strip()


 def extract_astro_docs(base_url: str = base_url) -> list[pd.DataFrame]:
@@ -86,13 +61,26 @@ def extract_astro_docs(base_url: str = base_url) -> list[pd.DataFrame]:

     :return: A list of pandas dataframes with extracted data.
     """
-    visited_urls = set()
-    docs_data = []
+    all_links = get_internal_links(base_url, exclude_literal=["learn/tags"])
+
+    # for software references, we only want latest docs, ones with version number (old) is removed
+    old_version_doc_pattern = r"^https://docs\.astronomer\.io/software/\d+\.\d+/.+$"
+    # remove duplicate xml files, we only want html pages
+    non_doc_links = {
+        link if link.endswith("xml") or re.match(old_version_doc_pattern, link) else "" for link in all_links
+    }
+    docs_links = all_links - non_doc_links
+
+    df = pd.DataFrame(docs_links, columns=["docLink"])
+
+    df["html_content"] = df["docLink"].apply(lambda url: fetch_page_content(url))

-    scrape_page(base_url, visited_urls, docs_data)
+    # Only keep the main article content
+    df["content"] = df["html_content"].apply(process_astro_doc_page_content)

-    df = pd.DataFrame(docs_data)
-    df.drop_duplicates(subset="sha", inplace=True)
+    df["sha"] = df["content"].apply(generate_uuid5)
+    df["docSource"] = "astro docs"
     df.reset_index(drop=True, inplace=True)

+    df = df[["docSource", "sha", "content", "docLink"]]
     return [df]
```
