Removing junk from sitemap and full site (recursive) crawls#711
Walkthrough
Adds a link-pruning markdown generator to SiteConfig and wires it into crawling_service for the batch and recursive strategies. Both strategies now treat a result as successful only if result.markdown.fit_markdown exists, and they use that nested value for output.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant CrawlingService
participant SiteConfig
participant Strategy as Batch/Recursive Strategy
participant CrawlEngine
Client->>CrawlingService: start_crawl(...)
CrawlingService->>SiteConfig: get_link_pruning_markdown_generator()
SiteConfig-->>CrawlingService: pruning-enabled MarkdownGenerator
CrawlingService->>Strategy: init(generator=link_pruning_markdown_generator)
Strategy->>CrawlEngine: crawl(urls)
CrawlEngine-->>Strategy: Result{success, markdown{fit_markdown?, ...}}
alt success && markdown && fit_markdown
Strategy-->>CrawlingService: include fit_markdown in outputs
else missing fit_markdown
Strategy-->>CrawlingService: treat as failure (exclude/log)
end
CrawlingService-->>Client: aggregated results
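The alt branch in the diagram can be sketched in a few lines of Python. This is a minimal illustration of the success rule described in the walkthrough, not the code from this PR: the `Markdown` and `Result` dataclasses and the `collect_fit_markdown` helper are hypothetical stand-ins for the crawl result objects, and treating an empty `fit_markdown` string as missing is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Markdown:
    # Hypothetical stand-in for the nested markdown object on a crawl result.
    raw_markdown: str = ""
    fit_markdown: Optional[str] = None


@dataclass
class Result:
    # Hypothetical stand-in for a single crawl result.
    url: str
    success: bool
    markdown: Optional[Markdown] = None


def collect_fit_markdown(results):
    """Split results into (successful pages, failed URLs).

    A result counts as successful only when the crawl succeeded AND
    result.markdown.fit_markdown is present and non-empty.
    """
    successful, failed = [], []
    for result in results:
        if result.success and result.markdown and result.markdown.fit_markdown:
            # Use the nested, pruned markdown for output.
            successful.append({"url": result.url, "markdown": result.markdown.fit_markdown})
        else:
            # Missing fit_markdown is treated as a failure (excluded, could be logged).
            failed.append(result.url)
    return successful, failed
```

With this rule, a page whose crawl succeeded but whose pruned markdown is missing ends up in the failed list rather than being stored with unpruned content.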
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
…#711) * Removing junk from sitemap and full site (recursive) crawls * Small typo fix for result.markdown
* feat(web): live step & tool progress on Mission Control dashboard (#711)
- Emit tool_started/tool_completed events from workflow executor (sequential, loop, DAG)
- Bridge tool activity events to SSE as workflow_tool_activity
- Add __dashboard__ multiplexed SSE endpoint for all workflow events
- Extend DashboardWorkflowRun with current step name/status and agent counts via correlated subqueries (SQLite + PostgreSQL dialect-aware)
- Add useDashboardSSE hook connecting to __dashboard__ SSE stream
- Add handleWorkflowToolActivity to Zustand workflow store
- WorkflowRunCard subscribes to Zustand store directly for live step/tool updates
- DashboardPage hydrates store from REST data for active runs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(web): correct event_index SQL bug, deduplicate CASE subquery, and type/code quality fixes
- Replace non-existent `event_index` column with `created_at` in all 8 correlated subqueries in `listDashboardRuns` (CRITICAL runtime fix, would crash dashboard for all users)
- Remove `current_step_event_index` field from `DashboardWorkflowRun` and `DashboardRunResponse` (field was never consumed by frontend)
- Deduplicate the triplicated `CASE` subquery into a single `CASE expr WHEN ...` form (HIGH performance/correctness fix)
- Add `WorkflowToolActivityEvent` to `SSEEvent` discriminated union in `types.ts` (MEDIUM type safety)
- Remove unused `sourceRef` from `useDashboardSSE` hook (MEDIUM YAGNI)
- Add `{ streamId: '__dashboard__' }` context object to all dashboard SSE log calls (MEDIUM logging compliance)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: totalSteps JSON key mismatch and extract IIFE to named component
- Fix total_steps always null: change jsonIntExtract key from 'totalSteps' to 'total_steps' to match what the executor writes
- Extract 25-line IIFE in WorkflowRunCard JSX to named StepProgress component
- Fix stepIndex > 0 guard to stepIndex != null (was hiding Step 0)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: move WorkflowState import to top of file (ESLint import/first)
The import was placed after the PLATFORM_ICONS constant, violating ESLint's import/first rule which fails CI with --max-warnings 0.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(workflows): emit tool_started/tool_completed events from loop node executor
The loop node executor in dag-executor.ts was writing tool events to the database but not emitting them via getWorkflowEventEmitter(). This meant the WorkflowEventBridge never received tool activity events for loop nodes, so the dashboard SSE stream had no workflow_tool_activity events and the WorkflowRunCard's currentTool display stayed empty. Add tool_started/tool_completed emitter calls to executeLoopNode(), matching the pattern already used in executeNodeInternal() for regular DAG nodes and executeStepInternal() for sequential steps.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(web): persist last tool activity on dashboard cards instead of flashing
currentTool was a plain string set on tool_started and cleared to null on tool_completed, causing sub-second flashes that were invisible to users. Change currentTool to a rich object { name, status, durationMs } so completed tools display as "Read (5.7s)" in muted text and running tools show as "Read…" in accent color, persisting until the next tool starts or the workflow finishes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(web): make live tool progress prominent on dashboard cards
Move StepProgress out of the tiny metadata row into its own dedicated section with a highlighted background. Step info renders at text-sm with font-medium, tool calls in monospace. Running tools show a CSS spinner. Much more visible than the previous inline text-xs rendering.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull Request
Summary
Removes junk from sitemap and full-site (recursive) crawls, mostly navigation links and header/footer text. This uses Crawl4AI's PruningContentFilter:
https://docs.crawl4ai.com/core/markdown-generation/#52-pruningcontentfilter
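For readers unfamiliar with the filter, the sketch below shows roughly how a pruning-enabled markdown generator can be built with Crawl4AI, following the pattern in the linked documentation. It is not the code from this PR: the function name `build_link_pruning_markdown_generator` is hypothetical, and the threshold value and threshold type are illustrative assumptions rather than the settings chosen here.

```python
def build_link_pruning_markdown_generator(threshold: float = 0.48):
    """Return a Crawl4AI markdown generator that prunes low-value blocks.

    The default threshold is an illustrative assumption, not this PR's value.
    """
    # Imported lazily so the module still loads where crawl4ai is not installed.
    from crawl4ai.content_filter_strategy import PruningContentFilter
    from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

    # Blocks that score below the threshold (for example link-heavy navigation
    # or footer sections) are dropped from the filtered output.
    prune_filter = PruningContentFilter(
        threshold=threshold,
        threshold_type="fixed",
    )
    # With a content filter attached, the pruned text is exposed on crawl
    # results as result.markdown.fit_markdown.
    return DefaultMarkdownGenerator(content_filter=prune_filter)
```

The returned generator would typically be passed to the crawler's run configuration as its markdown generator, after which the strategies read `result.markdown.fit_markdown` instead of the raw markdown.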
Changes Made
Type of Change
Affected Services
Testing
Test Evidence
Tested crawls with:
Checklist
Additional Notes
Some sites (such as https://mem0.ai/sitemap.xml) still produce chunks that contain only links. I am keeping these for now, since you may sometimes want to retrieve a list of a site's links with RAG. The main thing this PR accomplishes is removing navigation links and footer text from chunks that contain actual information beyond a list of links.