A streamlined pipeline to scrape, clean, annotate links, and convert Kubernetes documentation (or related content) into a structured CSV file. This workflow makes it easy to reuse extracted information for knowledge bases or other documentation tools.
This project automates the process of extracting, annotating, and transforming content from Kubernetes-related documentation into a structured CSV file. Specifically, it:
- Scrapes the content from a user-provided URL.
- Annotates all `<a>` tags by converting them to a custom inline format (e.g., `Some Text [LINK:https://example.com]`).
- Organizes the extracted data into logical sections and code blocks.
- Removes unnecessary HTML tags and ensures readability.
- Outputs a CSV file with fields such as Topic, Concept, Content, URL, Link to, and Tags.
These steps produce a clean, organized dataset suitable for further analysis or integration with other platforms.
Key features:

- Inline Link Annotations - `<a>` tags are replaced with `[LINK:href]` notations, preserving anchor text while making link parsing simpler in subsequent steps (see the sketch after this list).
- Clickable Links in CSV - The links in the CSV are clickable, with support for relative-link resolution via the root URL (hardcoded as `https://kubernetes.io/`).
- Dynamic Topic Assignment - The Topic column in the final CSV is derived from the last word in the URL (e.g., Ingress, Gateway).
- Reference URL - Each CSV entry includes the original source URL for easy traceability.
- Intermediate Outputs - Each step writes intermediate files to `data/` so you can inspect or debug the pipeline.
- Multiple URL Support - Use `Multi_facade.py` to run the entire workflow against multiple URLs at once, combining all results into a single CSV.
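
A minimal sketch of how the annotation and relative-link resolution could work with `beautifulsoup4` and `urljoin` (an illustration under assumptions; the actual logic lives in `lib/clean_html_links.py` and may differ):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

ROOT_URL = "https://kubernetes.io/"  # root used for relative-link resolution

def annotate_links(html: str) -> str:
    """Replace each <a> tag with 'inner_text [LINK:absolute_href]'."""
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a"):
        href = urljoin(ROOT_URL, a.get("href", ""))  # resolve relative hrefs against the root
        a.replace_with(f"{a.get_text(strip=True)} [LINK:{href}]")
    return str(soup)

print(annotate_links('<p>See <a href="/docs/concepts/">Concepts</a> for details.</p>'))
# -> <p>See Concepts [LINK:https://kubernetes.io/docs/concepts/] for details.</p>
```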
Below is a simplified layout of the repository:
```
Facade.py
Multi_facade.py
lib/
  ├── simple_spider.py
  ├── clean_html_links.py
  ├── extract_h2.py
  ├── extract_code_example.py
  ├── clean_all_tags_and_newline.py
  ├── final_refine.py
  └── to_csv.py
data/
  ├── multi_url.txt (optional list of URLs)
  ├── final_output_<Topic>.csv (generated by Facade.py)
  ├── multi_final.csv (generated by Multi_facade.py)
  └── multi_temp_csv/
      └── final_output_<Topic>.csv (generated by Multi_facade.py)
requirements.txt
README.md
```
- `Facade.py`: Main script orchestrating all the steps for one URL (a sketch of the orchestration follows this list).
- `Multi_facade.py`: Higher-level script that runs `Facade.py` for multiple URLs and merges the outputs.
- `lib/`: Individual Python scripts for each step in the data-processing pipeline.
- `data/`: Contains intermediate files, final CSV outputs, and an optional list of URLs (`multi_url.txt`).
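
One plausible way `Facade.py` could chain the `lib/` scripts (purely an illustrative sketch under assumptions; the actual orchestration may import the modules directly rather than spawning processes):

```python
import subprocess
import sys

# Pipeline order, taken from the lib/ layout above.
STEPS = [
    "lib/simple_spider.py",
    "lib/clean_html_links.py",
    "lib/extract_h2.py",
    "lib/extract_code_example.py",
    "lib/clean_all_tags_and_newline.py",
    "lib/final_refine.py",
    "lib/to_csv.py",
]

def run_pipeline(url: str) -> None:
    """Run each step script in order, stopping on the first failure."""
    for step in STEPS:
        args = [sys.executable, step]
        # Assumption: the spider (and the CSV step, which derives the Topic
        # from the URL) accept a --url flag; the real scripts may differ.
        if step.endswith(("simple_spider.py", "to_csv.py")):
            args += ["--url", url]
        subprocess.run(args, check=True)

run_pipeline("https://kubernetes.io/docs/concepts/services-networking/ingress/")
```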
The workflow is executed in sequential steps. When you run `Facade.py`, it will invoke:

1. `simple_spider.py`
   - Input: `--url` command-line argument.
   - Action: Scrapes HTML from the given URL, extracting `<div class="td-content">` sections (see the sketch after this list).
   - Output: `data/step1_spider_output.txt`
2. `clean_html_links.py`
   - Action: Replaces every `<a>` tag with `inner_text [LINK:href]`.
   - Output: `data/step2_clean_links_output.txt`
3. `extract_h2.py`
   - Action: Segments the HTML into logical chunks based on `<h2>` headings (also covered in the sketch after this list).
   - Output: `data/step3_extract_h2_output.txt`
4. `extract_code_example.py`
   - Action: Identifies and processes `<code>` blocks, retaining them inline where appropriate.
   - Output: `data/step4_extract_code_output.txt`
5. `clean_all_tags_and_newline.py`
   - Action: Strips out any remaining HTML tags and ensures each sentence ends with a newline for readability.
   - Output: `data/step5_clean_tags_output.txt`
6. `final_refine.py`
   - Action: Removes unnecessary separators and processes `[CODE_BLOCK_START]`/`[CODE_BLOCK_END]` tags for cleaner output.
   - Output: `data/step6_final_refine_output.txt`
7. `to_csv.py`
   - Action: Converts the processed text into a CSV, adding the Topic (derived from the URL) and clickable links.
   - Output: `data/final_output_<Topic>.csv`
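
To make steps 1 and 3 concrete, here is a minimal sketch (an assumption-based simplification, not the code in `lib/simple_spider.py` or `lib/extract_h2.py`) that fetches a page, keeps only the `<div class="td-content">` sections, and splits them into chunks at `<h2>` headings:

```python
import requests
from bs4 import BeautifulSoup

def fetch_td_content(url: str) -> list[str]:
    """Step 1 (sketch): download the page and keep only <div class="td-content"> sections."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [str(div) for div in soup.find_all("div", class_="td-content")]

def split_by_h2(section_html: str) -> list[str]:
    """Step 3 (sketch): split a section into one text chunk per <h2> heading."""
    soup = BeautifulSoup(section_html, "html.parser")
    chunks, current = [], []
    for element in soup.find_all(["h2", "p", "pre", "ul", "ol"]):
        if element.name == "h2" and current:
            chunks.append("\n".join(current))
            current = []
        current.append(element.get_text(" ", strip=True))
    if current:
        chunks.append("\n".join(current))
    return chunks

for section in fetch_td_content("https://kubernetes.io/docs/concepts/services-networking/ingress/"):
    print(split_by_h2(section)[:2])  # preview the first two chunks
```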
 
Run the workflow end-to-end for a single URL with `Facade.py`:

```
python3 Facade.py --url <your_url>
```

Arguments:
- `--url`: The URL to scrape. Defaults to the K8s Ingress documentation page if not provided.

Outputs:
- Intermediate files in `data/`.
- A final CSV: `data/final_output_<Topic>.csv` (e.g., `final_output_Ingress.csv`); a sketch of how the Topic can be derived from the URL follows.
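
As a hedged illustration (an assumption about the approach, not the actual code in `to_csv.py`), the Topic could be derived from the last path segment of the URL like this:

```python
from urllib.parse import urlparse

def topic_from_url(url: str) -> str:
    """Take the last non-empty path segment, e.g. '.../ingress/' -> 'Ingress'."""
    last = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    return last.replace("-", " ").title().replace(" ", "")

print(topic_from_url("https://kubernetes.io/docs/concepts/services-networking/ingress/"))  # Ingress
```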
For processing multiple URLs, use `Multi_facade.py`:
- Create (or edit) a text file with one URL per line at `data/multi_url.txt`. (If not provided, a default file will be created.)
- Run:
  ```
  python3 Multi_facade.py
  ```
- `--input`: Points to the file containing multiple URLs (defaults to `data/multi_url.txt`).
- `--output`: The combined CSV with data from all URLs (defaults to `data/multi_final.csv`).

Process:
- For each URL, `Multi_facade.py` calls `Facade.py` internally.
- It moves each final CSV (e.g., `final_output_Ingress.csv`) to a temporary folder, `data/multi_temp_csv/`.
- Once all URLs are processed, it combines them into a single CSV file, `data/multi_final.csv` (see the merge sketch below).
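
A minimal sketch of the merge step, assuming all per-topic CSVs share the same header (this is not the exact code in `Multi_facade.py`):

```python
import csv
import glob

def merge_csvs(input_glob: str = "data/multi_temp_csv/*.csv",
               output_path: str = "data/multi_final.csv") -> None:
    """Concatenate per-topic CSVs into one file, writing the header only once."""
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = None
        for path in sorted(glob.glob(input_glob)):
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                header = next(reader)
                if writer is None:
                    writer = csv.writer(out)
                    writer.writerow(header)
                writer.writerows(reader)

merge_csvs()
```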
Dependencies:
- `requests` – Fetches HTML from the web.
- `beautifulsoup4` – Parses HTML content.
- `re` (built-in) – Regex processing for link annotations and text cleaning.
- `csv` (built-in) – Reading and writing CSV files.
- `argparse` (built-in) – Handling command-line arguments.
- `shutil` (built-in) – Used in `Multi_facade.py` for folder operations.

Installation:

```
pip install -r requirements.txt
```
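
For reference, a minimal `requirements.txt` covering the two third-party packages above would look like the following (the repository's actual file may pin specific versions):

```
requests
beautifulsoup4
```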
Troubleshooting:
- Missing `data/` Folder
  - The scripts create `data/` automatically if it does not exist, but ensure you have write permissions.
- Empty or Corrupted Intermediate Files
  - Check the console output for errors in earlier steps.
  - Verify that the URL actually contains `<div class="td-content">` elements for scraping.
- Links Are Not Populating
  - Confirm that `<a>` tags existed in the scraped content.
  - Ensure `to_csv.py` and `Facade.py` handle relative links correctly.
- Multiple URLs Failing
  - Confirm each URL in `multi_url.txt` is valid.
  - Check whether network or SSL errors occur for certain sites.
- Duplication
  - When testing on a large URL set (700+ URLs), 53 duplicate concepts were detected. The cause remains unclear; it may be an issue with the URL set itself (1/28/2025). The sketch below shows a quick way to inspect duplicates.
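
A quick way to inspect duplicate entries in the combined CSV, assuming the `Concept` column (from the field list above) is what should be unique:

```python
import csv
from collections import Counter

def count_duplicate_concepts(path: str = "data/multi_final.csv") -> list[tuple[str, int]]:
    """Return (concept, count) pairs that appear more than once in the CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        counts = Counter(row["Concept"] for row in csv.DictReader(f))
    return [(concept, n) for concept, n in counts.items() if n > 1]

for concept, n in count_duplicate_concepts():
    print(f"{n}x  {concept}")
```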