A streamlined pipeline to scrape, clean, annotate links, and convert Kubernetes documentation (or related content) into a structured CSV file. This workflow makes it easy to reuse extracted information for knowledge bases or other documentation tools.
This project automates the process of extracting, annotating, and transforming content from Kubernetes-related documentation into a structured CSV file. Specifically, it:
- Scrapes the content from a user-provided URL.
- Annotates all `<a>` tags by converting them to a custom inline format (e.g., `Some Text [LINK:https://example.com]`).
- Organizes the extracted data into logical sections and code blocks.
- Removes unnecessary HTML tags and ensures readability.
- Outputs a CSV file with fields such as Topic, Concept, Content, URL, Link to, and Tags.
These steps produce a clean, organized dataset suitable for further analysis or integration with other platforms.
Key features:

- Inline Link Annotations - `<a>` tags are replaced with `[LINK:href]` notations, preserving anchor text while making link parsing simpler in subsequent steps (see the sketch after this list).
- Clickable Links in CSV - The links in the CSV are clickable, with support for relative-link resolution via the root URL (hardcoded as `https://kubernetes.io/`).
- Dynamic Topic Assignment - The Topic column in the final CSV is derived from the last word in the URL (e.g., Ingress, Gateway).
- Reference URL - Each CSV entry includes the original source URL for easy traceability.
- Intermediate Outputs - Each step writes intermediate files to `data/` so you can inspect or debug the pipeline.
- Multiple URL Support - Use `Multi_facade.py` to run the entire workflow against multiple URLs at once, combining all results into a single CSV.
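
A minimal sketch of how the annotation and relative-link resolution could work with `beautifulsoup4` and `urljoin` (an illustration under assumptions; the actual logic lives in `lib/clean_html_links.py` and may differ):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

ROOT_URL = "https://kubernetes.io/"  # root used for relative-link resolution

def annotate_links(html: str) -> str:
    """Replace each <a> tag with 'inner_text [LINK:absolute_href]'."""
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a"):
        href = urljoin(ROOT_URL, a.get("href", ""))  # resolve relative hrefs against the root
        a.replace_with(f"{a.get_text(strip=True)} [LINK:{href}]")
    return str(soup)

print(annotate_links('<p>See <a href="/docs/concepts/">Concepts</a> for details.</p>'))
# -> <p>See Concepts [LINK:https://kubernetes.io/docs/concepts/] for details.</p>
```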
Below is a simplified layout of the repository:
```
Facade.py
Multi_facade.py
lib/
  ├── simple_spider.py
  ├── clean_html_links.py
  ├── extract_h2.py
  ├── extract_code_example.py
  ├── clean_all_tags_and_newline.py
  ├── final_refine.py
  └── to_csv.py
data/
  ├── multi_url.txt (optional list of URLs)
  ├── final_output_<Topic>.csv (generated by Facade.py)
  ├── multi_final.csv (generated by Multi_facade.py)
  └── multi_temp_csv/
      └── final_output_<Topic>.csv (generated by Multi_facade.py)
requirements.txt
README.md
```
- `Facade.py`: Main script orchestrating all the steps for one URL (a sketch of the orchestration follows this list).
- `Multi_facade.py`: Higher-level script that runs `Facade.py` for multiple URLs and merges the outputs.
- `lib/`: Individual Python scripts for each step in the data-processing pipeline.
- `data/`: Contains intermediate files, final CSV outputs, and an optional list of URLs (`multi_url.txt`).
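
One plausible way `Facade.py` could chain the `lib/` scripts (purely an illustrative sketch under assumptions; the actual orchestration may import the modules directly rather than spawning processes):

```python
import subprocess
import sys

# Pipeline order, taken from the lib/ layout above.
STEPS = [
    "lib/simple_spider.py",
    "lib/clean_html_links.py",
    "lib/extract_h2.py",
    "lib/extract_code_example.py",
    "lib/clean_all_tags_and_newline.py",
    "lib/final_refine.py",
    "lib/to_csv.py",
]

def run_pipeline(url: str) -> None:
    """Run each step script in order, stopping on the first failure."""
    for step in STEPS:
        args = [sys.executable, step]
        # Assumption: the spider (and the CSV step, which derives the Topic
        # from the URL) accept a --url flag; the real scripts may differ.
        if step.endswith(("simple_spider.py", "to_csv.py")):
            args += ["--url", url]
        subprocess.run(args, check=True)

run_pipeline("https://kubernetes.io/docs/concepts/services-networking/ingress/")
```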
The workflow is executed in sequential steps. When you run `Facade.py`, it will invoke:

1. `simple_spider.py`
   - Input: `--url` command-line argument.
   - Action: Scrapes HTML from the given URL, extracting `<div class="td-content">` sections (see the sketch after this list).
   - Output: `data/step1_spider_output.txt`
2. `clean_html_links.py`
   - Action: Replaces every `<a>` tag with `inner_text [LINK:href]`.
   - Output: `data/step2_clean_links_output.txt`
3. `extract_h2.py`
   - Action: Segments the HTML into logical chunks based on `<h2>` headings (also covered in the sketch after this list).
   - Output: `data/step3_extract_h2_output.txt`
4. `extract_code_example.py`
   - Action: Identifies and processes `<code>` blocks, retaining them inline where appropriate.
   - Output: `data/step4_extract_code_output.txt`
5. `clean_all_tags_and_newline.py`
   - Action: Strips out any remaining HTML tags and ensures each sentence ends with a newline for readability.
   - Output: `data/step5_clean_tags_output.txt`
6. `final_refine.py`
   - Action: Removes unnecessary separators and processes `[CODE_BLOCK_START]`/`[CODE_BLOCK_END]` tags for cleaner output.
   - Output: `data/step6_final_refine_output.txt`
7. `to_csv.py`
   - Action: Converts the processed text into a CSV, adding the Topic (derived from the URL) and clickable links.
   - Output: `data/final_output_<Topic>.csv`
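
To make steps 1 and 3 concrete, here is a minimal sketch (an assumption-based simplification, not the code in `lib/simple_spider.py` or `lib/extract_h2.py`) that fetches a page, keeps only the `<div class="td-content">` sections, and splits them into chunks at `<h2>` headings:

```python
import requests
from bs4 import BeautifulSoup

def fetch_td_content(url: str) -> list[str]:
    """Step 1 (sketch): download the page and keep only <div class="td-content"> sections."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [str(div) for div in soup.find_all("div", class_="td-content")]

def split_by_h2(section_html: str) -> list[str]:
    """Step 3 (sketch): split a section into one text chunk per <h2> heading."""
    soup = BeautifulSoup(section_html, "html.parser")
    chunks, current = [], []
    for element in soup.find_all(["h2", "p", "pre", "ul", "ol"]):
        if element.name == "h2" and current:
            chunks.append("\n".join(current))
            current = []
        current.append(element.get_text(" ", strip=True))
    if current:
        chunks.append("\n".join(current))
    return chunks

for section in fetch_td_content("https://kubernetes.io/docs/concepts/services-networking/ingress/"):
    print(split_by_h2(section)[:2])  # preview the first two chunks
```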
 
Run the workflow end-to-end for a single URL with `Facade.py`:

```
python3 Facade.py --url <your_url>
```

Arguments:
- `--url`: The URL to scrape. Defaults to the K8s Ingress documentation page if not provided.

Outputs:
- Intermediate files in `data/`.
- A final CSV: `data/final_output_<Topic>.csv` (e.g., `final_output_Ingress.csv`); a sketch of how the Topic can be derived from the URL follows.
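
As a hedged illustration (an assumption about the approach, not the actual code in `to_csv.py`), the Topic could be derived from the last path segment of the URL like this:

```python
from urllib.parse import urlparse

def topic_from_url(url: str) -> str:
    """Take the last non-empty path segment, e.g. '.../ingress/' -> 'Ingress'."""
    last = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    return last.replace("-", " ").title().replace(" ", "")

print(topic_from_url("https://kubernetes.io/docs/concepts/services-networking/ingress/"))  # Ingress
```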
For processing multiple URLs, use `Multi_facade.py`:
- Create (or edit) a text file with one URL per line at `data/multi_url.txt`. (If not provided, a default file will be created.)
- Run:
  ```
  python3 Multi_facade.py
  ```
- `--input`: Points to the file containing multiple URLs (defaults to `data/multi_url.txt`).
- `--output`: The combined CSV with data from all URLs (defaults to `data/multi_final.csv`).

Process:
- For each URL, `Multi_facade.py` calls `Facade.py` internally.
- It moves each final CSV (e.g., `final_output_Ingress.csv`) to a temporary folder, `data/multi_temp_csv/`.
- Once all URLs are processed, it combines them into a single CSV file, `data/multi_final.csv` (see the merge sketch below).
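
A minimal sketch of the merge step, assuming all per-topic CSVs share the same header (this is not the exact code in `Multi_facade.py`):

```python
import csv
import glob

def merge_csvs(input_glob: str = "data/multi_temp_csv/*.csv",
               output_path: str = "data/multi_final.csv") -> None:
    """Concatenate per-topic CSVs into one file, writing the header only once."""
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = None
        for path in sorted(glob.glob(input_glob)):
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                header = next(reader)
                if writer is None:
                    writer = csv.writer(out)
                    writer.writerow(header)
                writer.writerows(reader)

merge_csvs()
```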
Dependencies:
- `requests` – Fetches HTML from the web.
- `beautifulsoup4` – Parses HTML content.
- `re` (built-in) – Regex processing for link annotations and text cleaning.
- `csv` (built-in) – Reading and writing CSV files.
- `argparse` (built-in) – Handling command-line arguments.
- `shutil` (built-in) – Used in `Multi_facade.py` for folder operations.

Installation:

```
pip install -r requirements.txt
```
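
For reference, a minimal `requirements.txt` covering the two third-party packages above would look like the following (the repository's actual file may pin specific versions):

```
requests
beautifulsoup4
```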
Troubleshooting:
- Missing `data/` Folder
  - The scripts create `data/` automatically if it does not exist, but ensure you have write permissions.
- Empty or Corrupted Intermediate Files
  - Check the console output for errors in earlier steps.
  - Verify that the URL actually contains `<div class="td-content">` elements for scraping.
- Links Are Not Populating
  - Confirm that `<a>` tags existed in the scraped content.
  - Ensure `to_csv.py` and `Facade.py` handle relative links correctly.
- Multiple URLs Failing
  - Confirm each URL in `multi_url.txt` is valid.
  - Check whether network or SSL errors occur for certain sites.
- Duplication
  - When testing on a large URL set (700+ URLs), 53 duplicate concepts were detected. The cause remains unclear; it may be an issue with the URL set itself (1/28/2025). The sketch below shows a quick way to inspect duplicates.
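
A quick way to inspect duplicate entries in the combined CSV, assuming the `Concept` column (from the field list above) is what should be unique:

```python
import csv
from collections import Counter

def count_duplicate_concepts(path: str = "data/multi_final.csv") -> list[tuple[str, int]]:
    """Return (concept, count) pairs that appear more than once in the CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        counts = Counter(row["Concept"] for row in csv.DictReader(f))
    return [(concept, n) for concept, n in counts.items() if n > 1]

for concept, n in count_duplicate_concepts():
    print(f"{n}x  {concept}")
```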