Removing junk from sitemap and full site (recursive) crawls by coleam00 · Pull Request #711 · coleam00/Archon

coleam00 · 2025-09-19T20:04:40Z

Pull Request

Summary

Removed junk from sitemap and full site (recursive) crawls, mostly links and header/footer text. This uses Crawl4AI's PruningContentFilter:

https://docs.crawl4ai.com/core/markdown-generation/#52-pruningcontentfilter

Changes Made

Added a junk-removing markdown generator for the sitemap and regular site (recursive) crawlers. It uses PruningContentFilter (linked above).
Updated the sitemap and regular site crawlers to use the fit_markdown property of result.markdown from the Crawl4AI crawls, which excludes junk (headers, footers, links).

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Performance improvement
Code refactoring

Affected Services

Testing

All existing tests pass
Added new tests for new functionality
Manually tested affected user flows
Docker builds succeed for all services

Test Evidence

Tested crawls with:

https://ai.pydantic.dev/llms.txt (to make sure llms.txt crawls aren't affected, because these files are literally just links. I originally had the pruner on every strategy and it removed all llms.txt content, which is why I made the pruner specific to sitemap/regular sites)
https://mem0.ai/sitemap.xml
https://help.getzep.com/graphiti/getting-started/welcome
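
The scoping decision in the llms.txt note above can be sketched roughly as follows. This is only an illustration, not Archon's code: `pick_generator` and its string labels are hypothetical, and the actual change wires generator objects into strategy classes rather than dispatching on the URL suffix.

```python
from urllib.parse import urlparse


def pick_generator(url: str) -> str:
    """Return a label for the markdown generator a crawl of `url` might use.

    Hypothetical helper: link-only text files such as llms.txt keep the
    default generator so their content is not pruned away; sitemap and
    regular site crawls get the pruning generator.
    """
    path = urlparse(url).path.lower()
    if path.endswith(".txt"):
        return "default"
    return "pruning"


for test_url in (
    "https://ai.pydantic.dev/llms.txt",
    "https://mem0.ai/sitemap.xml",
    "https://help.getzep.com/graphiti/getting-started/welcome",
):
    print(test_url, "->", pick_generator(test_url))
```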

Checklist

My code follows the service architecture patterns
If using an AI coding assistant, I used the CLAUDE.md rules
I have added tests that prove my fix/feature works
All new and existing tests pass locally
My changes generate no new warnings
I have updated relevant documentation
I have verified no regressions in existing features

Additional Notes

There are still chunks that contain just links for some sites (such as https://mem0.ai/sitemap.xml) - I am keeping this for now since sometimes you might want to actually retrieve a bunch of links for a site with RAG. The main thing this PR accomplishes is getting rid of navigation links and footer text from chunks that actually have information outside of a list of links.
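
If link-only chunks ever need to be identified later, one possible heuristic is to measure how much of a chunk consists of markdown link syntax. The sketch below is an assumption for illustration only; `link_char_ratio`, `is_link_only`, and the 0.9 cutoff are hypothetical and are not part of this PR.

```python
import re

# Matches inline markdown links such as [text](https://example.com).
LINK_RE = re.compile(r"\[[^\]]*\]\([^)]*\)")
WHITESPACE_RE = re.compile(r"\s")


def link_char_ratio(chunk: str) -> float:
    """Fraction of non-whitespace characters that belong to markdown links."""
    total = len(WHITESPACE_RE.sub("", chunk))
    if total == 0:
        return 0.0
    linked = sum(len(WHITESPACE_RE.sub("", m.group(0))) for m in LINK_RE.finditer(chunk))
    return linked / total


def is_link_only(chunk: str, cutoff: float = 0.9) -> bool:
    """Treat a chunk as link-only when nearly all of its characters are links."""
    return link_char_ratio(chunk) >= cutoff
```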

Summary by CodeRabbit

New Features
- Enhanced crawling produces cleaner, link-pruned Markdown output for batch and recursive strategies, reducing noise from internal or low-value links.
Bug Fixes
- Stricter success validation ensures only pages with valid, formatted Markdown are included, preventing empty or partial results from appearing in outputs.
Chores
- Internal configuration added to support link-pruned Markdown generation without changing public interfaces.

coderabbitai · 2025-09-19T20:04:50Z

Walkthrough

Adds a link-pruning markdown generator in SiteConfig and wires it into crawling_service for batch and recursive strategies. Batch and recursive strategies now treat results as successful only if result.markdown.fit_markdown exists and use that nested value for output.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| SiteConfig: link-pruning markdown generator `python/src/server/services/crawling/helpers/site_config.py` | Adds `SiteConfig.get_link_pruning_markdown_generator()` using `PruningContentFilter` and `DefaultMarkdownGenerator` with specified options; existing `get_markdown_generator()` unchanged. |
| Crawling service wiring `python/src/server/services/crawling/crawling_service.py` | Passes `self.link_pruning_markdown_generator` (from SiteConfig) to `BatchCrawlStrategy` and `RecursiveCrawlStrategy`; single-page and sitemap strategies unchanged. |
| Batch strategy result handling `python/src/server/services/crawling/strategies/batch.py` | Success requires `result.success && result.markdown && result.markdown.fit_markdown`; stores `fit_markdown` under "markdown"; treats missing `fit_markdown` as failure. |
| Recursive strategy result handling `python/src/server/services/crawling/strategies/recursive.py` | Filters successes by presence of `result.markdown.fit_markdown`; appends `fit_markdown` instead of `markdown` to results. |
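
The success condition summarized for the batch and recursive strategies can be sketched as below. The `FakeMarkdown` and `FakeResult` classes are stand-ins invented for this example so it runs without Crawl4AI, and `split_results` is a hypothetical helper, not the code in `batch.py` or `recursive.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeMarkdown:
    """Stand-in for the nested markdown object on a crawl result."""
    raw_markdown: str
    fit_markdown: Optional[str] = None


@dataclass
class FakeResult:
    """Stand-in for a single crawl result."""
    url: str
    success: bool
    markdown: Optional[FakeMarkdown] = None


def split_results(results):
    """Keep only results with non-empty fit_markdown; collect the rest as failures."""
    ok, failed = [], []
    for r in results:
        if r.success and r.markdown and r.markdown.fit_markdown:
            # Store the pruned text under the "markdown" key.
            ok.append({"url": r.url, "markdown": r.markdown.fit_markdown})
        else:
            failed.append(r.url)
    return ok, failed
```

Note that an empty `fit_markdown` string counts as a failure here, which matches the "exists and is truthy" wording in the review comments.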

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant CrawlingService
  participant SiteConfig
  participant Strategy as Batch/Recursive Strategy
  participant CrawlEngine

  Client->>CrawlingService: start_crawl(...)
  CrawlingService->>SiteConfig: get_link_pruning_markdown_generator()
  SiteConfig-->>CrawlingService: pruning-enabled MarkdownGenerator
  CrawlingService->>Strategy: init(generator=link_pruning_markdown_generator)
  Strategy->>CrawlEngine: crawl(urls)
  CrawlEngine-->>Strategy: Result{success, markdown{fit_markdown?, ...}}

  alt success && markdown && fit_markdown
    Strategy-->>CrawlingService: include fit_markdown in outputs
  else missing fit_markdown
    Strategy-->>CrawlingService: treat as failure (exclude/log)
  end

  CrawlingService-->>Client: aggregated results

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I nibble links and prune the vines,
Through nested markdown, gold now shines.
Batch or burrow, I sift what’s right,
Fit markdown gleams in lunar light.
Thump-thump! My paws approve this run—
Cleaner trails, the crawl is done. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The PR title "Removing junk from sitemap and full site (recursive) crawls" is concise, specific, and accurately summarizes the primary change (adding pruning to sitemap and recursive/full-site crawls to remove headers/footers/links), so it communicates the main intent clearly to reviewers. |
| Description Check | ✅ Passed | The PR description follows the repository template and is mostly complete: it provides a clear Summary, a detailed "Changes Made" list, the Type of Change and affected services, Testing with specific test targets, checklist items, and Additional Notes describing edge cases, giving reviewers sufficient context and verification steps. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3799419 and 2b8224a.

📒 Files selected for processing (4)

python/src/server/services/crawling/crawling_service.py (1 hunks)
python/src/server/services/crawling/helpers/site_config.py (2 hunks)
python/src/server/services/crawling/strategies/batch.py (1 hunks)
python/src/server/services/crawling/strategies/recursive.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (4)

python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: Never accept corrupted data in batch/continuation paths: skip failed items entirely (e.g., do not store zero embeddings, null FKs, or malformed JSON)
Use specific exception types; avoid catching bare Exception
Preserve full stack traces in logs (use exc_info=True with Python logging)
Never return None to indicate failure; raise an exception with details
Authentication/authorization failures must halt the operation and be clearly surfaced
Service startup, missing configuration, database connection, or critical dependency failures should crash fast with clear errors
During crawling/batch/background tasks and WebSocket events, continue processing other items but log failures with context
Include context (operation intent, relevant IDs/URLs/data) in error messages
Pydantic should raise on data corruption or validation errors; do not accept invalid inputs
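
Several of the rules above (specific exception types, `exc_info=True`, continuing past failed items, context in messages) can be combined in one small sketch. All names here (`PageProcessingError`, `process_page`, `process_all`) are hypothetical and only illustrate the pattern; they are not taken from the Archon codebase.

```python
import logging

logger = logging.getLogger("crawl_sketch")


class PageProcessingError(Exception):
    """Hypothetical specific exception type for a single failed page."""


def process_page(url: str, content: str) -> dict:
    """Raise instead of returning None when a page has no usable content."""
    if not content.strip():
        raise PageProcessingError(f"Empty markdown while processing page: {url}")
    return {"url": url, "markdown": content}


def process_all(pages: dict) -> tuple:
    """Process every page, skipping failed items and logging them with a stack trace."""
    stored, failed = [], []
    for url, content in pages.items():
        try:
            stored.append(process_page(url, content))
        except PageProcessingError:
            # exc_info=True preserves the full traceback in the log record.
            logger.error("Skipping page during batch crawl: %s", url, exc_info=True)
            failed.append(url)
    return stored, failed
```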

Files:

python/src/server/services/crawling/strategies/recursive.py
python/src/server/services/crawling/strategies/batch.py
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/site_config.py

python/src/server/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/services/**/*.py: For batch operations, report both success count and a detailed failure list
On external API calls, retry with exponential backoff and fail with a clear message after retries
Place business logic in service layer modules (API Route → Service → Database pattern)
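
A minimal sketch of the retry-with-exponential-backoff rule follows. `TransientError` and `call_with_backoff` are hypothetical names, and the delays are arbitrary; the injectable `sleep` parameter exists only so the example can be exercised without waiting.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for a retryable external API failure."""


def call_with_backoff(fn, retries: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    """Call fn(), retrying on TransientError with delays of base_delay * 2**attempt."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except TransientError as exc:
            if attempt == retries:
                # Fail with a clear message once the retry budget is exhausted.
                raise RuntimeError(f"Call failed after {retries} retries") from exc
            sleep(base_delay * 2 ** attempt)
```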

Files:

python/src/server/services/crawling/strategies/recursive.py
python/src/server/services/crawling/strategies/batch.py
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/site_config.py

python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting (errors, warnings, unused imports)

Files:

python/src/server/services/crawling/strategies/recursive.py
python/src/server/services/crawling/strategies/batch.py
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/site_config.py

python/src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use MyPy for type checking to ensure type safety

python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code

Files:

python/src/server/services/crawling/strategies/recursive.py
python/src/server/services/crawling/strategies/batch.py
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/site_config.py

🧬 Code graph analysis (1)

python/src/server/services/crawling/crawling_service.py (3)

python/src/server/services/crawling/helpers/site_config.py (1)

get_link_pruning_markdown_generator (102-129)

python/src/server/services/crawling/strategies/batch.py (1)

BatchCrawlStrategy (19-273)

python/src/server/services/crawling/strategies/recursive.py (1)

RecursiveCrawlStrategy (21-331)

🔇 Additional comments (6)

python/src/server/services/crawling/crawling_service.py (2)

77-77: LGTM! Proper initialization of link-pruning markdown generator.

The addition of the link-pruning markdown generator is well-integrated with the existing configuration pattern.

80-81: LGTM! Correct strategy initialization with pruning generator.

Both batch and recursive strategies are correctly updated to use the link-pruning markdown generator instead of the standard one, which aligns with the PR objective to remove junk content from crawl results.

python/src/server/services/crawling/strategies/recursive.py (1)

279-284: LGTM! Proper validation of pruned markdown content.

The enhanced success condition now correctly validates both result.markdown and result.markdown.fit_markdown exist before processing. Using fit_markdown ensures only the filtered content (with pruned boilerplate) is stored, which aligns with the PR's junk-removal objective.

The change from result.markdown to result.markdown.fit_markdown for the stored content ensures consistent use of the pruned output throughout the crawling pipeline.

python/src/server/services/crawling/strategies/batch.py (1)

234-241: LGTM! Consistent pruned markdown validation and storage.

The batch strategy correctly mirrors the recursive strategy changes by:

Validating that result.markdown.fit_markdown exists and is truthy

Storing result.markdown.fit_markdown instead of result.markdown

This ensures consistent behavior across both crawling strategies and proper utilization of the pruned content.

python/src/server/services/crawling/helpers/site_config.py (2)

7-7: LGTM! Proper import of pruning filter.

The import statement correctly adds the PruningContentFilter dependency needed for the new generator method.

101-129: LGTM! Well-configured pruning markdown generator.

The new method properly creates a pruning-enabled markdown generator with appropriate settings:

PruningContentFilter with threshold=0.2 removes low-density content based on text density and link density

Consistent options with the existing get_markdown_generator() method to preserve code formatting

Clear documentation explaining its purpose for recursive crawling cleanup

The configuration strikes a good balance between noise reduction (via pruning) and content preservation (via code-friendly options).
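
To give a feel for what threshold-based pruning on link density means, here is a deliberately simplified toy model. It is not Crawl4AI's scoring algorithm: the `Block` type, the `score` formula, and the way link characters are counted are all assumptions made for illustration, with only the 0.2 threshold taken from the comment above.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    link_chars: int  # number of characters of `text` that sit inside links


def score(block: Block) -> float:
    """Toy score: share of the block's characters that are NOT link text."""
    total = len(block.text)
    if total == 0:
        return 0.0
    return 1.0 - block.link_chars / total


def prune(blocks, threshold: float = 0.2):
    """Keep blocks whose toy score meets the threshold; drop link-heavy ones."""
    return [b for b in blocks if score(b) >= threshold]


nav = Block("Home Docs Blog Pricing", link_chars=19)  # almost entirely link text
body = Block("Install the package and run the crawler.", link_chars=0)
print([b.text for b in prune([nav, body])])
```

In this toy model the navigation block is dropped and the body paragraph survives, which is the kind of behavior the PR relies on for header and footer removal.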


* feat(web): live step & tool progress on Mission Control dashboard (#711)
  - Emit tool_started/tool_completed events from workflow executor (sequential, loop, DAG)
  - Bridge tool activity events to SSE as workflow_tool_activity
  - Add __dashboard__ multiplexed SSE endpoint for all workflow events
  - Extend DashboardWorkflowRun with current step name/status and agent counts via correlated subqueries (SQLite + PostgreSQL dialect-aware)
  - Add useDashboardSSE hook connecting to __dashboard__ SSE stream
  - Add handleWorkflowToolActivity to Zustand workflow store
  - WorkflowRunCard subscribes to Zustand store directly for live step/tool updates
  - DashboardPage hydrates store from REST data for active runs
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(web): correct event_index SQL bug, deduplicate CASE subquery, and type/code quality fixes
  - Replace non-existent `event_index` column with `created_at` in all 8 correlated subqueries in `listDashboardRuns` (CRITICAL runtime fix: would crash dashboard for all users)
  - Remove `current_step_event_index` field from `DashboardWorkflowRun` and `DashboardRunResponse` (field was never consumed by frontend)
  - Deduplicate the triplicated `CASE` subquery into a single `CASE expr WHEN ...` form (HIGH performance/correctness fix)
  - Add `WorkflowToolActivityEvent` to `SSEEvent` discriminated union in `types.ts` (MEDIUM type safety)
  - Remove unused `sourceRef` from `useDashboardSSE` hook (MEDIUM YAGNI)
  - Add `{ streamId: '__dashboard__' }` context object to all dashboard SSE log calls (MEDIUM logging compliance)
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: totalSteps JSON key mismatch and extract IIFE to named component
  - Fix total_steps always null: change jsonIntExtract key from 'totalSteps' to 'total_steps' to match what the executor writes
  - Extract 25-line IIFE in WorkflowRunCard JSX to named StepProgress component
  - Fix stepIndex > 0 guard to stepIndex != null (was hiding Step 0)
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: move WorkflowState import to top of file (ESLint import/first)
  The import was placed after the PLATFORM_ICONS constant, violating ESLint's import/first rule which fails CI with --max-warnings 0.
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(workflows): emit tool_started/tool_completed events from loop node executor
  The loop node executor in dag-executor.ts was writing tool events to the database but not emitting them via getWorkflowEventEmitter(). This meant the WorkflowEventBridge never received tool activity events for loop nodes, so the dashboard SSE stream had no workflow_tool_activity events and the WorkflowRunCard's currentTool display stayed empty. Add tool_started/tool_completed emitter calls to executeLoopNode(), matching the pattern already used in executeNodeInternal() for regular DAG nodes and executeStepInternal() for sequential steps.
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(web): persist last tool activity on dashboard cards instead of flashing
  currentTool was a plain string set on tool_started and cleared to null on tool_completed, causing sub-second flashes that were invisible to users. Change currentTool to a rich object { name, status, durationMs } so completed tools display as "Read (5.7s)" in muted text and running tools show as "Read…" in accent color, persisting until the next tool starts or the workflow finishes.
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(web): make live tool progress prominent on dashboard cards
  Move StepProgress out of the tiny metadata row into its own dedicated section with a highlighted background. Step info renders at text-sm with font-medium, tool calls in monospace. Running tools show a CSS spinner. Much more visible than the previous inline text-xs rendering.
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>


Removing junk from sitemap and full site (recursive) crawls

2265713

Small typo fix for result.markdown

2b8224a

coleam00 merged commit b1085a5 into main Sep 20, 2025
8 checks passed

leonj1 pushed a commit to leonj1/Archon that referenced this pull request Oct 13, 2025

Removing junk from sitemap and full site (recursive) crawls (coleam00…

5fc15b3

…#711) * Removing junk from sitemap and full site (recursive) crawls * Small typo fix for result.markdown

coderabbitai Bot mentioned this pull request Nov 13, 2025

feat: Add glob pattern filtering and link review for knowledge crawling #847

Closed

6 tasks

Wirasm deleted the crawl4ai-junk-pruning branch April 6, 2026 07:37

coleam00 merged 2 commits into main from crawl4ai-junk-pruning
