Handle big text ingestion #221

Closed
pankajastro opened this issue Dec 19, 2023 · 2 comments · Fixed by #307
@pankajastro (Collaborator)

Recently, I saw the error below and feel that we need a mechanism to handle large docs:

[2023-12-19, 05:42:30 UTC] {ask_astro_weaviate_hook.py:439} WARNING - Error during upsert. Rolling back all inserts for docs with errors.
[2023-12-19, 05:42:30 UTC] {ask_astro_weaviate_hook.py:328} INFO - Removing id f9c660b6-a393-54ef-9f95-20a616c3edfe for rollback.
[2023-12-19, 05:42:30 UTC] {ask_astro_weaviate_hook.py:333} INFO - UUID 767b916a-f6ef-51e1-bf23-cea43a801baa does not exist. Skipping deletion during rollback.
[2023-12-19, 05:42:30 UTC] {ask_astro_weaviate_hook.py:328} INFO - Removing id c7ccb038-c35a-501e-8404-1ce26782f0f7 for rollback.
[2023-12-19, 05:42:30 UTC] {ask_astro_weaviate_hook.py:328} INFO - Removing id 8a084e49-5e06-5827-8165-69c05d29716e for rollback.
[2023-12-19, 05:42:30 UTC] {ask_astro_weaviate_hook.py:328} INFO - Removing id 9a413399-6b68-5660-99d8-4bda030f9496 for rollback.
[2023-12-19, 05:42:30 UTC] {ask_astro_weaviate_hook.py:467} ERROR - Errors encountered during ingest.
[2023-12-19, 05:42:30 UTC] {ask_astro_weaviate_hook.py:468} ERROR - <generator object AskAstroWeaviateHook.ingest_data.<locals>.<genexpr> at 0x7fc00df27e00>
[2023-12-19, 05:42:31 UTC] {taskinstance.py:1937} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/decorators/base.py", line 221, in execute
    return_value = super().execute(context)
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/operators/python.py", line 192, in execute
    return_value = self.execute_callable()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/operators/python.py", line 209, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/include/tasks/extract/utils/weaviate/ask_astro_weaviate_hook.py", line 469, in ingest_data
    raise AirflowException("Errors encountered during ingest.")
airflow.exceptions.AirflowException: Errors encountered during ingest.

https://clmkpsyfc000301kcjie00t1l.astronomer.run/d2uulcqu/dags/ask_astro_load_airflow_docs/grid?dag_run_id=scheduled__2023-12-18T05%3A00%3A00%2B00%3A00&task_id=ingest_data&tab=mapped_tasks&map_index=0

@pankajastro (Collaborator, Author)

I handled this for the Forum case by truncating the text. It's not a perfect solution, but sharing it here in case you want to take a look:

def truncate_tokens(text: str, encoding_name: str, max_length: int = 8192) -> str:
    """
    Truncate a text string based on the maximum number of tokens.

    param text (str): The input text string to be truncated.
    param encoding_name (str): The name of the model whose encoding is used for tokenization.
    param max_length (int): The maximum number of tokens allowed. Default is 8192.
    """
    import tiktoken

    try:
        encoding = tiktoken.encoding_for_model(encoding_name)
    except (KeyError, ValueError) as e:
        # tiktoken raises KeyError for unknown model names
        raise ValueError(f"Invalid encoding_name: {e}")

    encoded_string = encoding.encode(text)
    num_tokens = len(encoded_string)
    if num_tokens > max_length:
        # Keep only the first max_length tokens and decode back to text
        text = encoding.decode(encoded_string[:max_length])
    return text
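For reference, a minimal usage sketch; the model name and the `load_forum_post` helper below are placeholders, not part of the current code:

```python
# Hypothetical usage: clamp an oversized forum post before it is embedded.
# "text-embedding-ada-002" is assumed here; its context limit matches the 8192 default.
long_post = load_forum_post()  # placeholder for however the raw text is obtained
safe_text = truncate_tokens(long_post, encoding_name="text-embedding-ada-002", max_length=8192)
```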

@davidgxue (Contributor)

On a related note, I also noticed that some of our inserts/upserts fail when the chunk being ingested is too long; see the error log below. This needs further investigation, as it could mean some docs aren't being properly ingested.

[2024-01-16, 05:02:04 UTC] {weaviate.py:430} INFO - Error occurred in batch process for 00d97bbe-cec4-51bc-b0d2-11f1326f167e with error {'error': [{'message': "update vector: connection to: OpenAI API failed with status: 400 error: This model's maximum context length is 8192 tokens, however you requested 20585 tokens (20585 in your prompt; 0 for the completion). Please reduce your prompt; or completion length."}]}
[2024-01-16, 05:02:04 UTC] {weaviate.py:430} INFO - Error occurred in batch process for d1eace29-205f-53bf-9fc5-6645f133877c with error {'error': [{'message': "update vector: connection to: OpenAI API failed with status: 400 error: This model's maximum context length is 8192 tokens, however you requested 28319 tokens (28319 in your prompt; 0 for the completion). Please reduce your prompt; or completion length."}]}
[2024-01-16, 05:02:04 UTC] {weaviate.py:430} INFO - Error occurred in batch process for a82bb7bc-f2fc-5698-ad79-c28185a33e82 with error {'error': [{'message': "update vector: connection to: OpenAI API failed with status: 400 error: This model's maximum context length is 8192 tokens, however you requested 13851 tokens (13851 in your prompt; 0 for the completion). Please reduce your prompt; or completion length."}]}
[2024-01-16, 05:02:04 UTC] {weaviate.py:430} INFO - Error occurred in batch process for 9075559f-b0ce-5d8f-8162-28fd974a99e1 with error {'error': [{'message': "update vector: connection to: OpenAI API failed with status: 400 error: This model's maximum context length is 8192 tokens, however you requested 15193 tokens (15193 in your prompt; 0 for the completion). Please reduce your prompt; or completion length."}]}
[2024-01-16, 05:02:04 UTC] {weaviate.py:440} INFO - Total Objects 6066 / Objects 6009 successfully inserted and Objects 22 had errors.
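A minimal sketch of the kind of pre-ingest guard that could surface these chunks before the batch insert, assuming tiktoken's `cl100k_base` encoding (used by OpenAI's embedding models); the helper name and structure are hypothetical, not existing code:

```python
import tiktoken

MAX_EMBEDDING_TOKENS = 8192
encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by OpenAI embedding models

def find_oversized_chunks(chunks: list[str]) -> list[int]:
    """Return the indices of chunks whose token count exceeds the embedding model's limit."""
    return [i for i, chunk in enumerate(chunks) if len(encoding.encode(chunk)) > MAX_EMBEDDING_TOKENS]
```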

@davidgxue davidgxue added this to the 0.3.0 milestone Jan 17, 2024
@davidgxue davidgxue self-assigned this Jan 25, 2024
davidgxue added a commit that referenced this issue Feb 23, 2024
### Description
- This is the 1st part of a 2-part effort to improve the scraping, extraction, chunking, and tokenizing logic for Ask Astro's data ingestion process (see details in issue #258).
- This PR mainly focuses on reducing noise in the ingestion process for the Astro Docs data source, along with other related changes such as scraping only the latest doc versions and adding automatic exponential backoff to the HTML fetch function.

### Closes the Following Issues
- #292 
- #270
- #209

### Partially Completes Issues
- #258 (2-part effort, only 1 PR completed)
- #221 (tackles the token limit in the HTML splitting logic; other parts still need to be addressed)

### Technical Details
- airflow/include/tasks/extract/astro_docs.py
  - Add function `process_astro_doc_page_content`, which strips noisy, unhelpful content such as the nav bar, footer, and header, and extracts only the main page article content.
  - Remove the previous function `scrape_page` (which scraped the HTML content AND scraped all of its sub-pages using the links it contained). This is done because 1. there is already a centralized util function called `fetch_page_content()` that fetches each page's HTML elements, 2. there is already a centralized util function called `get_internal_links` that finds all internal links, and 3. the old scraping process did not exclude noisy, unrelated content, which is now handled by `process_astro_doc_page_content` from the previous bullet point.
- airflow/include/tasks/split.py
  - Modify function `split_html`: it previously split on specific HTML tags using `HTMLHeaderTextSplitter`, which is not ideal because we do not want to split that often and there is no guarantee that splitting on such tags retains semantic meaning. This is changed to `RecursiveCharacterTextSplitter` with a token limit. It ONLY splits once a chunk exceeds the specified number of tokens; if a piece still exceeds the limit, it moves down the separator list and splits further, down to spaces and individual characters, until the chunk fits within the token limit. This retains better semantic meaning in each chunk and enforces the token limit. (An illustrative sketch of this kind of configuration appears after this list.)
- airflow/include/tasks/extract/utils/html_utils.py
  - Change function `fetch_page_content` to add automatic retry with exponential backoff using tenacity.
  - Change function `get_page_links` so that it traverses a given page recursively and finds all links belonging to the website. This ensures no duplicate pages are traversed and no pages are missed. Previously, the logic missed some links while traversing, potentially because it used a for loop rather than recursive traversal until all links are exhausted.
    - Note: this makes a huge difference in the URLs collected. Previously, many links looked like https://abc.com/abc#XXX and https://abc.com/abc#YYY, where the hashtag fragments point to sections of the same page, but the logic wasn't able to distinguish them.
- airflow/requirements.txt: adding required packages
- api/ask_astro/settings.py: remove unused variables
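For illustration, a minimal sketch of the token-limited splitting described above, assuming LangChain's `RecursiveCharacterTextSplitter` with a tiktoken-based length function; the encoding name, chunk sizes, and the `html_page_text` variable are illustrative assumptions, not the exact values used in this PR:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split only when a chunk exceeds the token budget, walking down the separator
# list (paragraph -> line -> space -> character) until each piece fits.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",          # assumed encoding (OpenAI embedding models)
    chunk_size=4000,                      # illustrative token budget, well under the 8192 limit
    chunk_overlap=200,                    # illustrative overlap between adjacent chunks
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_text(html_page_text)  # html_page_text: extracted article text for one page
```

Measuring chunk size in tokens (rather than characters or HTML tags) is what keeps every chunk under the embedding model's context limit while splitting as rarely as possible.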

### Results
#### Astro Docs: Better URLs Fetched + Crawling Improvement + HTML Splitter Improvement
1. Example of formatting and chunking
- Previously (near unreadable)

![image](https://github.com/astronomer/ask-astro/assets/26350341/90ff59f9-1401-4395-8add-cecd8bf08ac4)
- Now (cleaned!)

![image](https://github.com/astronomer/ask-astro/assets/26350341/b465a4fc-497c-4687-b601-aa03ba12fc15)
2. Example of URL differences
- Previously
    - Around 1000 links were fetched. Many had DUPLICATE content since they pointed to the same page.
    - XML and other non-HTML/website content was fetched.
    - See old links:
[astro_docs_links_old.txt](https://github.com/astronomer/ask-astro/files/14146665/astro_docs_links_old.txt)
- Now
    - No more duplicate pages or unreleased pages.
    - No older versions of the software docs; only the latest docs are ingested (e.g. the .../0.31... links are gone).

[new_astro_docs_links.txt](https://github.com/astronomer/ask-astro/files/14146669/new_astro_docs_links.txt)

#### Evaluation
- Overall improvement in answer and retrieval quality
- No degradation noted
- CSV posted in comments
davidgxue added a commit that referenced this issue Mar 4, 2024
Code is ready for review; evaluations/testing are still in progress.

### Description
- This is **part 2** of the data ingestion improvements. The goal is to
analyze and investigate any potential issues with the data ingestion
process, remove noisy data such as badly formatted content or gibberish
text, find bugs, and enforce standards on the process to have
predictable cost estimation/reduction.

### Technical Changes
- Renamed the original `split.py` file to `chunking_utils.py` (multiple files changed due to import renaming).
- While making the above change, I found that 2 DAGs were somehow NOT updated: the Provider Docs and SDK Docs incremental ingest DAGs never had a chunking/splitting task (even though the bulk ingest does). I added chunking to these 2 DAGs.
- The Airflow Docs extraction process has been improved (see code for details).
- The Provider Docs extraction process has been improved (see code for details).
- The Astro SDK Docs extraction process has been improved (see code for details) -> Astro SDK docs are not currently being ingested, but the code is fixed since it exists.
- Provider docs used to ingest all past versions, leading to many duplicate docs in the DB. It has now been changed to ingest only the `stable` version of the docs.

### Evaluations
- Answer quality generally improved, with the added benefit of more concise answers and linked hyperlinks when the model knows its answer is not exhaustive.
- See the attached files for a quick list of results.

[data_ingest_comparison_part_2.csv](https://github.com/astronomer/ask-astro/files/14484682/data_ingest_comparison_part_2.csv)

[data_ingest_results_part_2.csv](https://github.com/astronomer/ask-astro/files/14484683/data_ingest_results_part_2.csv)
- Bulk ingest also runs with no issues
<img width="668" alt="image"
src="https://github.com/astronomer/ask-astro/assets/26350341/635c1ba0-b6b0-4fcd-81ed-8c337f7b3f78">

### Related Issues
closes #221
closes #258 
closes #295 (Reranker has been addressed in GCP environment variables,
embedding model change completed in a different PR)
closes #285 (This PR prevents empty docs from being ingested)