Research & Implement: Data Ingestion Related Improvements #258

Closed
davidgxue opened this issue Jan 11, 2024 · 0 comments · Fixed by #307

Context

  • This is a follow-up sub-issue arising from the research issue "Research: Improve the ranking of the sources" (#133).
  • The items below will be investigated to determine whether their implementations can be improved. Where they can, the related PRs will be opened to implement the changes.

Items of Investigation

  1. Data Cleaning and Exclusion of Irrelevant Content

    • Implement data cleaning during ingestion to remove non-essential content like navigation bars, footers, headers, and other irrelevant sections that may introduce keyword spam and reduce retrieval accuracy.
  2. Review and Refinement of Chunking Logic

    • Reassess the logic for document chunking to prevent the inclusion of headers or short, meaningless text segments.
  3. Summarization of Large Documents

    • Generate and insert summaries for excessively large documents that are split into numerous chunks, using a language model to aid in comprehension and retrieval.
  4. Topic Keyword Extraction and Metadata Storage

    • Perform topic keyword extraction on each document in the Vector DB, store the results as metadata, and enhance queries with user-prompt-derived keywords during Q&A sessions. This strategy requires significant effort and its effectiveness is yet to be determined.
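
As an illustration of item 1, here is a minimal sketch of dropping navigation, header, and footer text while keeping the main content. It uses only Python's standard-library `html.parser`; the names `NOISE_TAGS`, `MainTextExtractor`, and `extract_main_text` are hypothetical and are not taken from the Ask Astro codebase, and the choice of which tags count as noise is an assumption.

```python
from html.parser import HTMLParser

# Assumption for this sketch: these tags never hold article content.
NOISE_TAGS = {"nav", "header", "footer", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text while skipping everything nested inside noise tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a noise tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def extract_main_text(html: str) -> str:
    """Return the page text with nav/header/footer content removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<header>Site header</header>"
    "<article><h1>Deploy code</h1><p>Run the deploy command.</p></article>"
    "<footer>Copyright</footer></body></html>"
)
print(extract_main_text(sample))
```

A real ingestion step would more likely select the article container directly with an HTML library; the tag-skipping approach here is only meant to show the idea of excluding boilerplate before chunking.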
@davidgxue davidgxue self-assigned this Jan 11, 2024
@davidgxue davidgxue added this to the 0.3.0 milestone Jan 16, 2024
davidgxue added a commit that referenced this issue Feb 23, 2024
### Description
- This is the first part of a two-part effort to improve the scraping,
extraction, chunking, and tokenizing logic for Ask Astro's data ingestion
process (see details in issue #258).
- This PR mainly focuses on reducing noise in the ingestion of the
Astro Docs data source, along with related changes such as scraping
only the latest doc versions and adding automatic exponential backoff
to the HTML get function.
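
For readers unfamiliar with exponential backoff, the idea can be sketched as follows. The PR itself uses tenacity; this is a hand-rolled standard-library stand-in, and `retry_with_backoff`, `flaky_fetch`, and the parameter defaults are hypothetical names and values chosen for illustration only.

```python
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo: a fake fetch that fails twice, with a recording "sleep" so nothing blocks.
calls = {"n": 0}
delays = []


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


result = retry_with_backoff(flaky_fetch, sleep=delays.append)
print(result, delays)
```

The waits double after each failure (1 s, then 2 s in the demo), which gives a struggling server progressively more room to recover before the next request.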

### Closes the Following Issues
- #292 
- #270
- #209

### Partially Completes Issues
- #258 (two-part effort; only the first PR is complete)
- #221 (tackles the token limit in the HTML splitting logic; other
parts still need work)

### Technical Details
- airflow/include/tasks/extract/astro_docs.py
    - Add function `process_astro_doc_page_content`, which strips noisy,
      unhelpful content such as the nav bar, footer, and header, and
      extracts only the main page article content.
    - Remove the previous function `scrape_page` (which scraped the HTML
      content and also found and scraped all of its sub-pages using the
      links it contained). It was removed because:
        1. there is already a centralized util function,
           `fetch_page_content()`, that fetches each page's HTML elements;
        2. there is already a centralized util function,
           `get_internal_links`, that finds all links;
        3. the scraping process itself did not exclude noisy, unrelated
           content; that is now handled by
           `process_astro_doc_page_content` (see the previous bullet).
- airflow/include/tasks/split.py
    - Modify function `split_html`: it previously split on specific HTML
      tags using `HTMLHeaderTextSplitter`, which is not ideal because we
      do not want to split that often and there is no guarantee that
      splitting on such tags retains semantic meaning. It now uses
      `RecursiveCharacterTextSplitter` with a token limit, which splits
      ONLY when a chunk exceeds the specified number of tokens. If a
      piece still exceeds the limit, the splitter moves down the
      separator list and splits further, eventually splitting by space
      and by character, until everything fits within the token limit.
      This retains better semantic meaning in each chunk and enforces
      the token limit.
- airflow/include/tasks/extract/utils/html_utils.py
    - Change function `fetch_page_content` to add automatic retries with
      exponential backoff using tenacity.
    - Change function `get_page_links` to traverse a given page
      recursively and find all links belonging to the website. This
      ensures that no page is traversed twice and no page is missed.
      Previously, the logic missed some links, potentially because it
      used a for loop rather than traversing recursively until all links
      were exhausted.
        - Note: This makes a huge difference in the URLs fetched.
          Previously, many links looked like https://abc.com/abc#XXX and
          https://abc.com/abc#YYY, where the part after the hash points
          to a section of the same page, but the logic could not tell
          that they were the same page.
- airflow/requirements.txt: add required packages
- api/ask_astro/settings.py: remove unused variables
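
The "split only when over the token limit, then fall back to finer separators" behaviour described for `split_html` can be sketched in plain Python. This is not LangChain's `RecursiveCharacterTextSplitter` implementation and not the code in this PR; `count_tokens` and `recursive_split` are hypothetical names, and counting whitespace-separated words is an assumed stand-in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokenizer tokens.
    return len(text.split())


def recursive_split(text, max_tokens, separators=("\n\n", "\n", " ")):
    """Split only when text exceeds max_tokens, trying coarse separators first."""
    if count_tokens(text) <= max_tokens:
        return [text] if text.strip() else []
    if not separators:
        return [text]  # nothing finer to split on in this sketch
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if count_tokens(candidate) <= max_tokens:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            # Piece alone is too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, max_tokens, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Intro line one.\n\n"
    "Second paragraph has quite a few more words in it.\n\n"
    "End."
)
chunks = recursive_split(doc, max_tokens=6)
for chunk in chunks:
    print(repr(chunk))
```

In the demo, the short first and last paragraphs stay whole, and only the ten-word middle paragraph is broken at spaces, which is the property the PR relies on to keep chunks semantically coherent.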

### Results
#### Astro Docs: Better URLs Fetched + Crawling Improvement + HTML Splitter Improvement
1. Example of formatting and chunking
    - Previously (nearly unreadable)

![image](https://github.com/astronomer/ask-astro/assets/26350341/90ff59f9-1401-4395-8add-cecd8bf08ac4)
    - Now (cleaned!)

![image](https://github.com/astronomer/ask-astro/assets/26350341/b465a4fc-497c-4687-b601-aa03ba12fc15)
2. Example of URL differences
    - Previously
        - Around 1000 links fetched. Many have DUPLICATE content since
          they point to the same page.
        - XMLs and other non-HTML/non-website content fetched

See old links:
[astro_docs_links_old.txt](https://github.com/astronomer/ask-astro/files/14146665/astro_docs_links_old.txt)
    - Now
        - No more duplicate pages or unreleased pages
        - No older versions of the software docs; only the latest docs
          are ingested (e.g., the .../0.31... links are gone)

[new_astro_docs_links.txt](https://github.com/astronomer/ask-astro/files/14146669/new_astro_docs_links.txt)
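
The drop in duplicate URLs comes from treating links that differ only in their `#fragment` as one page and from visiting each page once. A minimal sketch of that idea, using the standard library's `urllib.parse`, is shown here. `FAKE_SITE`, `normalize`, and `crawl` are hypothetical; the in-memory dictionary stands in for real HTTP fetches and is not how `get_page_links` is implemented.

```python
from urllib.parse import urldefrag, urljoin, urlparse

# Hypothetical in-memory site: page URL -> hrefs found on that page.
FAKE_SITE = {
    "https://abc.com/": ["/abc#XXX", "/abc#YYY", "https://other.com/x"],
    "https://abc.com/abc": ["/", "/abc#ZZZ", "/def"],
    "https://abc.com/def": [],
}


def normalize(base_url: str, href: str) -> str:
    """Resolve href against base_url and drop the #fragment."""
    return urldefrag(urljoin(base_url, href)).url


def crawl(url: str, visited: set) -> set:
    """Recursively visit every same-host page exactly once."""
    visited.add(url)
    for href in FAKE_SITE.get(url, []):
        link = normalize(url, href)
        same_host = urlparse(link).netloc == urlparse(url).netloc
        if same_host and link not in visited:
            crawl(link, visited)
    return visited


pages = crawl("https://abc.com/", set())
print(sorted(pages))
```

Here the three `/abc#...` links collapse into a single `https://abc.com/abc` entry, and the external `other.com` link is skipped.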

#### Evaluation
- Overall improvement in answer and retrieval quality
- No degradation noted
- CSV posted in comments