Kubernetes Documentation Processing Workflow

A streamlined pipeline to scrape, clean, annotate links, and convert Kubernetes documentation (or related content) into a structured CSV file. This workflow makes it easy to reuse extracted information for knowledge bases or other documentation tools.


Table of Contents

  1. Overview
  2. Key Features
  3. Project Structure
  4. Workflow
  5. Usage
  6. Dependencies
  7. Troubleshooting
  8. Known Issues

1. Overview

This project automates the process of extracting, annotating, and transforming content from Kubernetes-related documentation into a structured CSV file. Specifically, it:

  • Scrapes the content from a user-provided URL.
  • Annotates all <a> tags by converting them to a custom inline format (e.g., Some Text [LINK:https://example.com]).
  • Organizes the extracted data into logical sections and code blocks.
  • Removes unnecessary HTML tags and ensures readability.
  • Outputs a CSV file with fields such as Topic, Concept, Content, URL, Link to, and Tags.

These steps produce a clean, organized dataset suitable for further analysis or integration with other platforms.


2. Key Features

  1. Inline Link Annotations

    • <a> tags are replaced with [LINK:href] notations, preserving anchor text while making link parsing simpler in subsequent steps (a minimal sketch follows this list).
  2. Clickable Links in CSV

    • The links in the CSV are clickable, with support for relative-link resolution via the root URL (hardcoded as https://kubernetes.io/).
  3. Dynamic Topic Assignment

    • The Topic column in the final CSV is derived from the last word in the URL (e.g., Ingress, Gateway).
  4. Reference URL

    • Each CSV entry includes the original source URL for easy traceability.
  5. Intermediate Outputs

    • Each step writes intermediate files to data/ so you can inspect or debug the pipeline.
  6. Multiple URL Support

    • Use Multi_facade.py to run the entire workflow against multiple URLs at once, combining all results into a single CSV.
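
To make features 1 and 2 concrete, here is a minimal sketch of how the [LINK:href] annotation and relative-link resolution could be done with BeautifulSoup and urljoin. The actual implementations live in lib/clean_html_links.py and lib/to_csv.py and may differ; the function name annotate_links is purely illustrative.

# Illustrative sketch only; not the exact code in lib/clean_html_links.py or lib/to_csv.py.
from urllib.parse import urljoin
from bs4 import BeautifulSoup

ROOT_URL = "https://kubernetes.io/"  # root URL used to resolve relative links

def annotate_links(html: str) -> str:
    # Replace every <a> tag with "inner_text [LINK:absolute_href]".
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a"):
        href = urljoin(ROOT_URL, a.get("href", ""))
        a.replace_with(f"{a.get_text(strip=True)} [LINK:{href}]")
    return str(soup)

print(annotate_links('<p>See <a href="/docs/concepts/services-networking/ingress/">Ingress</a>.</p>'))
# -> <p>See Ingress [LINK:https://kubernetes.io/docs/concepts/services-networking/ingress/].</p>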

3. Project Structure

Below is a simplified layout of the repository:

Facade.py
Multi_facade.py
lib/
  ├── simple_spider.py
  ├── clean_html_links.py
  ├── extract_h2.py
  ├── extract_code_example.py
  ├── clean_all_tags_and_newline.py
  ├── final_refine.py
  └── to_csv.py
data/
  ├── multi_url.txt (optional list of URLs)
  ├── final_output_<Topic>.csv (generated by Facade.py)
  ├── multi_final.csv (generated by Multi_facade.py)
  └── multi_temp_csv/
      └── final_output_<Topic>.csv (generated by Multi_facade.py)
requirements.txt
README.md

  • Facade.py: Main script orchestrating all the steps for one URL.
  • Multi_facade.py: Higher-level script that runs Facade for multiple URLs and merges the outputs.
  • lib/: Individual Python scripts for each step in the data processing pipeline.
  • data/: Contains intermediate files, final CSV outputs, and an optional list of URLs (multi_url.txt).

4. Workflow

The workflow is executed in sequential steps (illustrative sketches of steps 1 and 7 follow the list). When you run Facade.py, it runs the following scripts in order:

  1. simple_spider.py

    • Input: --url command-line argument.
    • Action: Scrapes HTML from the given URL, extracting <div class="td-content"> sections.
    • Output: data/step1_spider_output.txt
  2. clean_html_links.py

    • Action: Replaces every <a> tag with inner_text [LINK:href].
    • Output: data/step2_clean_links_output.txt
  3. extract_h2.py

    • Action: Segments the HTML into logical chunks based on <h2> headings.
    • Output: data/step3_extract_h2_output.txt
  4. extract_code_example.py

    • Action: Identifies and processes <code> blocks, retaining them inline where appropriate.
    • Output: data/step4_extract_code_output.txt
  5. clean_all_tags_and_newline.py

    • Action: Strips out any remaining HTML tags and ensures each sentence ends with a newline for readability.
    • Output: data/step5_clean_tags_output.txt
  6. final_refine.py

    • Action: Removes unnecessary separators and processes [CODE_BLOCK_START]/[CODE_BLOCK_END] tags for cleaner output.
    • Output: data/step6_final_refine_output.txt
  7. to_csv.py

    • Action: Converts the processed text into a CSV, adding the Topic column (derived from the URL) and clickable links.
    • Output: data/final_output_<Topic>.csv
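
For reference, step 1 can be sketched roughly as follows, assuming requests and BeautifulSoup as listed under Dependencies. lib/simple_spider.py is the authoritative implementation; the scrape_td_content name is illustrative.

# Rough sketch of step 1 (simple_spider.py); illustrative only.
import os
import requests
from bs4 import BeautifulSoup

def scrape_td_content(url: str) -> str:
    # Fetch the page and keep only the <div class="td-content"> sections.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return "\n".join(str(div) for div in soup.find_all("div", class_="td-content"))

os.makedirs("data", exist_ok=True)  # the real scripts also create data/ if missing
with open("data/step1_spider_output.txt", "w", encoding="utf-8") as f:
    f.write(scrape_td_content("https://kubernetes.io/docs/concepts/services-networking/ingress/"))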

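Likewise, the Topic derivation in step 7 amounts to taking the last path segment of the URL. The sketch below shows one plausible way to do it, not necessarily the exact logic in to_csv.py.

# Sketch of the Topic derivation in step 7 (to_csv.py); illustrative only.
def derive_topic(url: str) -> str:
    return url.rstrip("/").split("/")[-1].capitalize()

print(derive_topic("https://kubernetes.io/docs/concepts/services-networking/ingress/"))  # Ingress
print(derive_topic("https://kubernetes.io/docs/concepts/services-networking/gateway/"))  # Gateway
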
5. Usage

A. Single URL

Run the workflow end-to-end for a single URL with Facade.py:

python3 Facade.py --url <your_url>

Arguments:

  • --url: The URL to scrape. Defaults to the Kubernetes Ingress documentation page if not provided.

Outputs:

  • Intermediate files in data/...
  • A final CSV: data/final_output_<Topic>.csv (e.g., final_output_Ingress.csv).

B. Multiple URLs

For processing multiple URLs, use Multi_facade.py:

  1. Create (or edit) a text file at data/multi_url.txt with one URL per line (if it is not provided, a default file will be created). An example file follows this list.
  2. Run:

python3 Multi_facade.py

Arguments:

  • --input: Path to the file containing the URLs (defaults to data/multi_url.txt).
  • --output: Path for the combined CSV containing data from all URLs (defaults to data/multi_final.csv).
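
For example, data/multi_url.txt might look like this (any Kubernetes documentation URLs work):

https://kubernetes.io/docs/concepts/services-networking/ingress/
https://kubernetes.io/docs/concepts/services-networking/gateway/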

Process:

  1. For each URL, Multi_facade.py calls Facade.py internally.
  2. It moves each per-URL CSV (e.g., final_output_Ingress.csv) to a temporary folder, data/multi_temp_csv/.
  3. Once all URLs are processed, it combines them into a single CSV file multi_final.csv.
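
The merge in step 3 is conceptually a header-aware concatenation of the per-topic CSVs. The following is a minimal sketch using only the standard library (csv and glob); the actual merge logic in Multi_facade.py may differ.

# Sketch of the final merge: concatenate data/multi_temp_csv/*.csv into
# data/multi_final.csv, keeping a single header row. Illustrative only.
import csv
import glob

with open("data/multi_final.csv", "w", newline="", encoding="utf-8") as out_file:
    writer = csv.writer(out_file)
    header_written = False
    for path in sorted(glob.glob("data/multi_temp_csv/*.csv")):
        with open(path, newline="", encoding="utf-8") as in_file:
            reader = csv.reader(in_file)
            header = next(reader, None)
            if header and not header_written:
                writer.writerow(header)
                header_written = True
            writer.writerows(reader)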

6. Dependencies

  • requests – Fetches HTML from the web.
  • beautifulsoup4 – Parses HTML content.
  • re (built-in) – Regex processing for link annotations and text cleaning.
  • csv (built-in) – Reading and writing CSV files.
  • argparse (built-in) – Handling command-line arguments.
  • shutil (built-in) – Used in Multi_facade.py for folder operations.

Installation:

pip install -r requirements.txt

7. Troubleshooting

  1. Missing data/ Folder

    • The scripts create data/ automatically if it does not exist, but ensure you have write permissions.
  2. Empty or Corrupted Intermediate Files

    • Check the console output for errors in earlier steps.
    • Verify the URL actually contains <div class="td-content"> elements for scraping.
  3. Links Are Not Populating

    • Confirm that <a> tags existed in the scraped content.
    • Ensure to_csv.py and Facade.py handle relative links correctly.
  4. Multiple URLs Failing

    • Confirm each URL in multi_url.txt is valid.
    • Check if network or SSL errors occur for certain sites.

8. Known Issues

  1. Duplication
    • When testing on a large URL set (700+ URLs), 53 duplicate concepts were detected. The cause remains unclear; it may be an issue with the URL set itself. (1/28/2025)
