feat: Advanced web crawling with domain filtering (#728)
Conversation
- Implement domain filtering for web crawler with whitelist/blacklist support
- Add URL pattern matching (glob-style) for include/exclude patterns
- Create AdvancedCrawlConfig UI component with collapsible panel
- Add domain filter to Knowledge Inspector sidebar for easy filtering
- Implement crawl-v2 API endpoint with backward compatibility
- Add comprehensive unit tests for domain filtering logic

Implements priority-based filtering (illustrated in the sketch below):
1. Blacklist (excluded_domains) - highest priority
2. Whitelist (allowed_domains) - must match if provided
3. Exclude patterns - glob patterns to exclude
4. Include patterns - glob patterns to include

UI improvements:
- Advanced configuration section in Add Knowledge dialog
- Domain pills in Inspector sidebar showing document distribution
- Visual domain indicators on each document
- Responsive domain filtering with document counts

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
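For readers unfamiliar with the rule ordering, here is a minimal sketch of the precedence described above. It is illustrative only: the function name, config shape, and `fnmatch`-based matching mirror this PR's description, not its actual `DomainFilter` implementation.

```python
# Illustrative sketch of the four-level filtering precedence (Python 3.9+
# for str.removeprefix). Not the PR's actual DomainFilter code.
from fnmatch import fnmatch
from urllib.parse import urlparse


def is_url_allowed(url: str, config: dict) -> bool:
    domain = (urlparse(url).hostname or "").lower().removeprefix("www.")

    # 1. Blacklist (excluded_domains) wins over everything else
    if any(fnmatch(domain, d) for d in config.get("excluded_domains", [])):
        return False

    # 2. If a whitelist is provided, the domain must match it
    allowed = config.get("allowed_domains", [])
    if allowed and not any(fnmatch(domain, d) for d in allowed):
        return False

    # 3. Exclude patterns (glob against the full URL) beat include patterns
    if any(fnmatch(url, p) for p in config.get("exclude_patterns", [])):
        return False

    # 4. If include patterns exist, the URL must match at least one
    include = config.get("include_patterns", [])
    if include and not any(fnmatch(url, p) for p in include):
        return False

    return True
```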
Walkthrough

Adds Crawl V2 with domain/pattern filtering across UI, hooks, types, services, and server; introduces AdvancedCrawlConfig UI, domain-aware browsing/inspector filters, an update-crawl-config flow, DomainFilter logic in the crawler, and tests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
actor U as User
participant UI as AddKnowledgeDialog
participant H as useCrawlUrl / useCrawlUrlV2
participant S as knowledgeService
participant API as /api/knowledge-items/crawl[-v2]
participant BG as Background Crawl Runner
participant P as ProgressTracker
U->>UI: Click "Start Crawling"
UI->>H: submit(url, crawl_config?)
alt crawl_config present (v2)
H->>S: crawlUrlV2({url,...,crawl_config})
S->>API: POST /crawl-v2
else no crawl_config (v1)
H->>S: crawlUrl({url,...})
S->>API: POST /crawl
end
API->>P: create progressId
API-->>S: {progressId, metadata}
S-->>H: response
H->>H: optimistic updates (cache)
API->>BG: start async crawl task
BG->>P: update progress
P-->>UI: polled progress updates
```

```mermaid
sequenceDiagram
autonumber
actor U as User
participant KC as KnowledgeCard
participant D as EditCrawlConfigDialog
participant H as useUpdateCrawlConfig
participant S as knowledgeService
participant API as POST /knowledge-items/{id}/update-config
participant BG as Recrawl Runner
participant P as ProgressTracker
U->>KC: Open menu → Edit Configuration
KC->>D: open(sourceId)
U->>D: Save
D->>H: mutate({sourceId, url, max_depth, tags, crawl_config})
H->>S: updateCrawlConfig(request)
S->>API: POST update-config
API->>P: create progressId
API-->>S: {progressId}
S-->>H: ok
H->>H: optimistic processing state
API->>BG: start recrawl
BG->>P: progress updates
```

```mermaid
sequenceDiagram
autonumber
participant R as RecursiveCrawlStrategy
participant DF as DomainFilter
participant CFG as CrawlConfig
participant Q as URL Queue
loop discovered next_url
R->>DF: is_url_allowed(next_url, base_url, CFG)
alt allowed
DF-->>R: true
R->>Q: enqueue next_url
else blocked
DF-->>R: false
R->>R: skip & log
end
end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
- Change domain filter from pills to dropdown with 'All' option as default
- Add 'View Source' link for each document chunk
- Add 'View Metadata' button to view chunk metadata in expandable panel
- Improve UI consistency with smaller action buttons

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Fix button nesting error by using div with onClick instead of nested buttons
- Move metadata panel outside the clickable item area
- Add close button for metadata panel
- Ensure metadata displays within the UI (not opening in new tab)
- Fix type comparison for showMetadata state
- Improve layout with better spacing and scrollable metadata

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Move metadata display from sidebar list to content viewer where chunk content is shown
- Add collapsible metadata section in content viewer with better formatting
- Remove metadata button and panel from sidebar to reduce clutter
- Keep only 'View Source' link in sidebar for quick access
- Show domain information in content footer for better context
- Improve overall UI consistency and user experience

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Move metadata section to be always visible below header
- Add full-width collapsible button with hover effect
- Show property count in metadata header
- Add max-height and scroll to metadata content
- Separate content area from metadata for better visibility
- Improve UI with clear visual hierarchy

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Move metadata panel from top to bottom of content viewer
- Place scroll on outer container to show more metadata (max-h-64)
- Keep metadata section always accessible at the bottom
- Maintain clear visual separation with border-top
- Ensure better visibility of all metadata properties

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
## New Features
### Edit Crawler Configuration
- Added `EditCrawlConfigDialog` component for editing existing crawler settings
- Users can now modify URL, knowledge type, max depth, tags, and advanced crawl config
- Added "Edit Configuration" menu item to knowledge card actions
- Shows warning that recrawl will be triggered when saving changes
- Created `useUpdateCrawlConfig` hook with optimistic updates
- Added backend API endpoint `/api/knowledge-items/{source_id}/update-config`
### Enhanced Metadata Viewing
- Fixed metadata display to show ALL properties from backend (not just 3-5)
- Increased metadata panel height from 256px to 500px for better viewing
- Added proper scrolling with visible scrollbar for large metadata objects
- Now displays complete JSON including properties like:
- `url`, `source`, `headers`, `filename`
- `has_code`, `has_links`, `source_id`
- `char_count`, `word_count`, `line_count`
- `chunk_size`, `chunk_index`
- `source_type`, `knowledge_type`
- All other backend metadata fields
### Domain Filtering Improvements
- Converted domain filter from pills to dropdown with "All" option
- Added domain statistics showing document count per domain
- Improved InspectorSidebar with better domain filtering UX
## Technical Implementation
### Frontend Components
- `EditCrawlConfigDialog.tsx` - Edit configuration dialog with warning
- Enhanced `KnowledgeCardActions.tsx` - Added edit configuration menu item
- Updated `KnowledgeCard.tsx` - Integrated edit dialog
- Improved `ContentViewer.tsx` - Better metadata display with full JSON
- Fixed `KnowledgeInspector.tsx` - Pass complete metadata instead of filtered subset
### React Hooks & Services
- `useUpdateCrawlConfig()` - Mutation hook for updating crawler config
- `knowledgeService.updateCrawlConfig()` - API service method
- Optimistic updates with progress tracking
### Backend API
- POST `/api/knowledge-items/{source_id}/update-config` - Update crawler configuration
- Validates existing item, deletes old data, triggers new crawl with updated config
- Returns progress ID for tracking recrawl operation
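To make the flow concrete, here is a hedged sketch of calling this endpoint from a Python client. The payload fields follow the request shape described in this PR; the base URL, source ID, and the exact response key for the progress ID are placeholders that should be verified against the real API.

```python
# Hypothetical client call for the update-config endpoint described above.
# Base URL, source_id, and response field names are assumptions.
import requests

BASE_URL = "http://localhost:8181"   # placeholder server address
source_id = "example-source-id"      # placeholder knowledge item ID

payload = {
    "url": "https://docs.example.com",
    "knowledge_type": "technical",
    "max_depth": 2,
    "tags": ["docs"],
    "crawl_config": {
        "allowed_domains": ["docs.example.com"],
        "exclude_patterns": ["*/changelog/*"],
    },
}

resp = requests.post(
    f"{BASE_URL}/api/knowledge-items/{source_id}/update-config",
    json=payload,
    timeout=30,
)
resp.raise_for_status()
# The endpoint returns a progress ID for tracking the triggered recrawl
print(resp.json().get("progressId"))
```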
## User Experience
- Users can now edit any existing crawler configuration
- Clear warning about recrawl being triggered
- Complete metadata viewing for debugging and analysis
- Improved domain filtering in document browser
- Progress tracking for configuration updates
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Removed redundant View Source links from InspectorSidebar
- Main content area already contains source links
- Cleaned up unused ExternalLink import
- Fixed missing count and timestamp fields in optimistic updates
- Preserve all ActiveOperationsResponse fields when updating progress IDs
- Fixed incorrect field comparison (source_id vs id) when replacing temp IDs
- Added query invalidation for progress queries in v2 implementation
- Ensures proper data shape consistency with backend API

These fixes ensure that:
1. ActiveOperationsResponse always has required count/timestamp fields
2. Optimistic entities are correctly matched and updated with real IDs
3. Progress queries are properly refreshed after crawl starts
Backend fixes for crawling stability:
- Add comment clarifying DomainFilter doesn't need init params
- Improve base URL selection in recursive strategy:
  - Check start_urls length before indexing
  - Use appropriate base URL for domain checks
  - Fallback to original_url when start_urls is empty
- Add error handling for domain filter (see the sketch below):
  - Wrap is_url_allowed in try/except block
  - Log exceptions and conservatively skip URLs on error
  - Prevents domain filter exceptions from crashing crawler
- Better handling of relative URL resolution

These changes ensure more robust crawling especially when:
- start_urls array is empty
- Domain filter encounters unexpected URLs
- Relative links need proper base URL resolution
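A minimal sketch of the conservative error handling described above, assuming the `is_url_allowed(url, base_url, config)` signature referenced throughout this PR; the helper name and wiring are illustrative:

```python
# Hedged sketch: fail closed when the domain filter itself raises.
import logging

logger = logging.getLogger(__name__)


def safe_is_allowed(domain_filter, url: str, base_url: str, config) -> bool:
    """Conservatively skip URLs when the filter raises instead of crashing."""
    try:
        return domain_filter.is_url_allowed(url, base_url, config)
    except Exception as exc:
        # Log and treat the URL as blocked; one bad URL must not kill the crawl
        logger.warning("Domain filter error for %s: %s", url, exc)
        return False
```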
Added detailed documentation for the CrawlConfig interface.

## Documentation improvements:
- Clear precedence rules (excluded_domains > allowed_domains > exclude_patterns > include_patterns)
- Pattern syntax explanation (glob patterns with fnmatch for URLs, wildcards for domains)
- Comprehensive examples showing common use cases:
  - Single subdomain with path exclusions
  - Multiple subdomains with specific exclusions
  - File type and directory blocking
- Individual property documentation with examples

## Code improvements:
- Refactored DocumentBrowser to avoid repeated URL/domain computation
- Extract resolvedUrl and resolvedDomain once as constants
- Improved readability and performance

This documentation helps developers understand:
- How conflicting rules are resolved (blacklist always wins)
- What pattern syntax to use (glob patterns)
- How to compose allow/deny lists effectively

An illustrative config combining these rules is sketched below.
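As an illustration of the documented rules, here is a hedged example configuration composing all four lists. The field names come from the CrawlConfig described above; the values are invented:

```python
# Example crawl_config composing the documented precedence; values invented.
crawl_config = {
    # Whitelist: only this subdomain may be crawled (unless blacklisted)
    "allowed_domains": ["docs.example.com"],
    # Blacklist: always wins, even over the whitelist
    "excluded_domains": ["legacy.docs.example.com"],
    # Glob patterns matched against full URLs (fnmatch-style)
    "include_patterns": ["*/api/*", "*/guides/*"],
    # Exclude patterns take precedence over include patterns
    "exclude_patterns": ["*.pdf", "*/deprecated/*"],
}
```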
@CodeRabbit review
✅ Actions performed: Review triggered.
Fixed bug where edit configuration dialog wasn't showing the existing crawl config:

1. **Enhanced data loading in EditCrawlConfigDialog**:
   - Check for data at both top-level and metadata level
   - Properly handle max_depth, tags, and crawl_config fields
   - Ensure crawl_config has the right shape with all arrays initialized
   - Reset form state when dialog opens without data

2. **Improved type definitions**:
   - Added crawl_config and max_depth to KnowledgeItemMetadata interface
   - Added optional fields to KnowledgeItem for top-level storage
   - Added index signature to allow additional untyped fields from backend

3. **Better state management**:
   - Reset form when dialog opens to prevent stale data
   - Only load data when both item exists and dialog is open
   - Initialize empty arrays for all crawl_config fields

This ensures that when editing an existing knowledge item:
- All original crawl settings are properly loaded
- Advanced configuration shows the domain filters and patterns
- Form is pre-filled with the exact configuration used for initial crawl
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
python/src/server/services/crawling/crawling_service.py (2)
299-311: Wire crawl_config from request to service instance.

If the service wasn't constructed with `crawl_config`, it remains `None` and domain filtering is skipped. Load it from the request before crawling. Apply this diff:
```diff
     try:
         url = str(request.get("url", ""))
         safe_logfire_info(f"Starting async crawl orchestration | url={url} | task_id={task_id}")
+        # Ensure crawl_config is available for strategies
+        if self.crawl_config is None and "crawl_config" in request:
+            self.crawl_config = request.get("crawl_config")
+            safe_logfire_info(
+                f"Crawl config attached | has_config={bool(self.crawl_config)}"
+            )
+
         # Start the progress tracker if available
         if self.progress_tracker:
             await self.progress_tracker.start({
```
686-699: Domain filtering not applied to batch/link-collection crawls — critical fix required

BatchCrawlStrategy is constructed without a DomainFilter/crawl_config, so extracted links crawled via crawl_batch_with_progress can bypass domain restrictions (confirmed in python/src/server/services/crawling/strategies/batch.py and the call site in python/src/server/services/crawling/crawling_service.py:686-699).
- Change BatchCrawlStrategy to accept domain_filter and optional crawl_config and apply domain_filter.is_url_allowed(...) (or equivalent) before enqueueing URLs. (python/src/server/services/crawling/strategies/batch.py)
- Initialize and use it at the call site: BatchCrawlStrategy(crawler, self.link_pruning_markdown_generator, self.domain_filter) and pass self.crawl_config into crawl_batch_with_progress(...) so the strategy can enforce filtering. (python/src/server/services/crawling/crawling_service.py:686-699)
🧹 Nitpick comments (38)
archon-ui-main/src/features/knowledge/inspector/components/ContentViewer.tsx (3)
7-7: Avoid repeated JSON.stringify; memoize metadata rendering.

Stringifying large metadata objects on every render can jank the UI. Compute it only when the panel is open or metadata changes.
Apply this diff:
```diff
-import { useState } from "react";
+import { useMemo, useState } from "react";
@@
-  const [showMetadata, setShowMetadata] = useState(false);
+  const [showMetadata, setShowMetadata] = useState(false);
+  const metadataString = useMemo(() => {
+    if (!showMetadata) return "";
+    const data = selectedItem?.metadata;
+    try {
+      return data ? JSON.stringify(data, null, 2) : "{}";
+    } catch {
+      return "{}";
+    }
+  }, [showMetadata, selectedItem?.metadata]);
@@
-  <pre className="text-xs text-gray-400 font-mono whitespace-pre-wrap break-words">
-    {JSON.stringify(selectedItem.metadata, null, 2)}
-  </pre>
+  <pre className="text-xs text-gray-400 font-mono whitespace-pre-wrap break-words">
+    {metadataString}
+  </pre>
```

Also applies to: 19-19, 148-150
133-139: Minor a11y: reflect toggle state via ARIA.

Expose expansion state and associate button with the panel.
```diff
-      <button
+      <button
+        type="button"
         onClick={() => setShowMetadata(!showMetadata)}
-        className="w-full px-4 py-3 flex items-center gap-2 text-sm font-medium text-cyan-400 hover:bg-white/5 transition-colors text-left"
+        className="w-full px-4 py-3 flex items-center gap-2 text-sm font-medium text-cyan-400 hover:bg-white/5 transition-colors text-left"
+        aria-expanded={showMetadata}
+        aria-controls="inspector-metadata-panel"
       >
@@
-      <div className="px-4 pb-4">
+      <div id="inspector-metadata-panel" className="px-4 pb-4">
```

Also applies to: 147-147
168-183: Option: show domain/source for code items too.

If code metadata includes a URL, consider showing Domain + "View Source" here as well for parity with documents.
archon-ui-main/src/features/knowledge/inspector/components/InspectorSidebar.tsx (2)
46-61: Domain stats LGTM; consider de-duping "empty/invalid" domains.

If extractDomain ever returns an empty string, it will produce a blank entry. Guard with a truthy domain check before incrementing.
```diff
-      const domain = extractDomain(url);
-      stats.set(domain, (stats.get(domain) || 0) + 1);
+      const domain = extractDomain(url);
+      if (domain) {
+        stats.set(domain, (stats.get(domain) || 0) + 1);
+      }
```
121-149: Nice filter UX; optional "Unknown domain" bucket.

Docs without a URL disappear when a specific domain is selected. Offering an "unknown" option improves discoverability.
archon-ui-main/src/features/knowledge/inspector/components/KnowledgeInspector.tsx (3)
159-169: Header counts ignore domain filter.

When a domain is selected, the header still shows counts for all loaded items. Compute filtered count to avoid user confusion.
Apply this diff (and import extractDomain as shown in the next comment):
```diff
-  filteredDocumentCount={viewMode === "documents" ? currentItems.length : 0}
+  filteredDocumentCount={
+    viewMode === "documents"
+      ? (selectedDomain === "all"
+          ? currentItems.length
+          : (currentItems as DocumentChunk[]).filter((d) => {
+              const url = d.url || d.metadata?.url || "";
+              return extractDomain(url) === selectedDomain;
+            }).length)
+      : 0
+  }
```
6-14: Import extractDomain for filtered count computation.

Required by the previous change.
```diff
 import { ContentViewer } from "./ContentViewer";
 import { InspectorHeader } from "./InspectorHeader";
 import { InspectorSidebar } from "./InspectorSidebar";
 import { copyToClipboard } from "../../../shared/utils/clipboard";
+import { extractDomain } from "../../utils/knowledge-utils";
```
186-188: Clear stale selection when domain changes.

Switching domains can leave a selected item that's no longer visible. Reset selection on domain change.
```diff
-  onDomainChange={setSelectedDomain}
+  onDomainChange={(d) => {
+    setSelectedDomain(d);
+    setSelectedItem(null);
+  }}
```

python/src/server/models/crawl_models.py (3)
48-56: Docstring mismatch with behavior.

The validator adds https:// (not http://). Update the comment for clarity.
- """Ensure URL is properly formatted.""" + """Ensure URL is properly formatted.""" @@ - # Add http:// if no protocol specified + # Add https:// if no protocol specified
41-47: Minor typing nit: tags shouldn't be optional if defaulting to [].

Use a non-optional list to better reflect runtime.
```diff
-    tags: list[str] | None = Field(default_factory=list, description="Tags to apply to crawled content")
+    tags: list[str] = Field(default_factory=list, description="Tags to apply to crawled content")
```
12-19: Forbid unexpected fields on Pydantic models — Pydantic v2 detected (2.11.9)

Add `model_config = ConfigDict(extra="forbid")` to CrawlConfig (python/src/server/models/crawl_models.py, lines 12–19) and other Pydantic models (e.g., CrawlRequestV2) to reject unrecognized inputs.
```python
from pydantic import ConfigDict

class CrawlConfig(BaseModel):
    ...
    model_config = ConfigDict(extra="forbid")
```

archon-ui-main/src/features/knowledge/types/knowledge.ts (2)
41-47: Allow backend to pass through extra code metadata fields.

Enhanced metadata viewer displays all backend fields; add an index signature to avoid TS friction.
```diff
 export interface CodeExampleMetadata {
   language?: string;
   file_path?: string;
   summary?: string;
   relevance_score?: number;
-  // No additional flexible properties - use strict typing
+  // Allow backend-provided extras (e.g., source, headers, filename, counters)
+  [key: string]: unknown;
 }
```
64-71: Same for document metadata: allow extra fields.

Aligns with "Enhanced Metadata Viewing" that surfaces all backend fields.
```diff
 export interface DocumentChunkMetadata {
   title?: string;
   section?: string;
   relevance_score?: number;
   url?: string;
   tags?: string[];
-  // No additional flexible properties - use strict typing
+  // Allow backend-provided extras (e.g., headers, source_id, counts, flags)
+  [key: string]: unknown;
 }
```

archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx (1)
350-361: Lazy-mount the Edit dialog to avoid unnecessary work.

Only render the dialog when open to prevent fetching/initialization on every card.
```diff
-      {/* Edit Configuration Dialog */}
-      <EditCrawlConfigDialog
-        sourceId={item.source_id}
-        open={showEditConfigDialog}
-        onOpenChange={setShowEditConfigDialog}
-        onSuccess={() => {
-          // The refresh will be handled by the update itself
-          if (onRefreshStarted) {
-            // We'll get the progressId from the update response
-          }
-        }}
-      />
+      {/* Edit Configuration Dialog */}
+      {showEditConfigDialog && (
+        <EditCrawlConfigDialog
+          sourceId={item.source_id}
+          open={showEditConfigDialog}
+          onOpenChange={setShowEditConfigDialog}
+          onSuccess={(progressId?: string) => {
+            if (progressId && onRefreshStarted) onRefreshStarted(progressId);
+          }}
+        />
+      )}
```

python/src/server/services/tests/test_domain_filter.py (2)
178-193: Add coverage for non-HTTP(S) schemes and anchors in domain extraction.
`get_domains_from_urls` should ignore `mailto:`, `javascript:`, and in-page anchors. Add a test to lock this down. Apply this diff to add coverage:
```diff
 @@
     def test_get_domains_from_urls(self):
         """Test extracting domains from URL list."""
         urls = [
             "https://example.com/page1",
             "https://docs.example.com/api",
             "https://example.com/page2",
             "https://other.com/resource",
             "https://WWW.TEST.COM/page",
             "/relative/path",  # Should be skipped
             "invalid-url",  # Should be skipped
+            "mailto:someone@example.com",  # Should be skipped
+            "javascript:alert(1)",  # Should be skipped
+            "#section",  # Should be skipped
         ]
 @@
-        assert domains == {"example.com", "docs.example.com", "other.com", "test.com"}
+        assert domains == {"example.com", "docs.example.com", "other.com", "test.com"}
```
118-142: Edge precedence test: include vs. exclude patterns.

You validate blacklist over whitelist, but not exclude over include. Add a focused test to ensure `exclude_patterns` wins when both match a URL.

I can draft the test if you confirm the intended precedence is: blacklist > whitelist > exclude > include. A sketch follows below.
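A minimal sketch of such a test, assuming the precedence documented in this PR (exclude beats include) and the `DomainFilter()` / `is_url_allowed(url, base_url, config)` API referenced elsewhere in this review:

```python
# Hedged sketch: asserts exclude_patterns beat include_patterns when both match.
def test_exclude_pattern_wins_over_include_pattern():
    config = CrawlConfig(
        include_patterns=["*/docs/*"],
        exclude_patterns=["*/docs/private/*"],
    )
    domain_filter = DomainFilter()
    base = "https://example.com"

    # Matches only the include pattern: allowed
    assert domain_filter.is_url_allowed("https://example.com/docs/intro", base, config)
    # Matches both include and exclude: exclude must win
    assert not domain_filter.is_url_allowed(
        "https://example.com/docs/private/key", base, config
    )
```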
archon-ui-main/src/features/knowledge/components/EditCrawlConfigDialog.tsx (2)
42-57: Defensive state hydration from metadata.

Ensure tags are always an array and avoid sharing references to `metadata.crawl_config` (potential in-place mutation). Deep-clone the config or create a new object. Apply this diff:
```diff
   useEffect(() => {
     if (item) {
       setUrl(item.url || "");
       setKnowledgeType(item.knowledge_type || "technical");
       // Access max_depth from metadata as any since it's not typed
       const metadata = item.metadata as any;
       setMaxDepth(metadata?.max_depth?.toString() || "2");
-      setTags(metadata?.tags || []);
+      setTags(Array.isArray(metadata?.tags) ? metadata.tags : []);
       // Load existing crawl config if available
       if (metadata?.crawl_config) {
-        setCrawlConfig(metadata.crawl_config);
+        try {
+          // cheap deep clone to avoid mutating original object by reference
+          setCrawlConfig(JSON.parse(JSON.stringify(metadata.crawl_config)));
+        } catch {
+          setCrawlConfig({ ...metadata.crawl_config });
+        }
       }
     }
   }, [item]);
```
96-105: Handle knowledge item load errors (disable destructive save).

If `useKnowledgeItem` fails, the form will show defaults and could overwrite existing config. Surface an error state and disable Save.

If `useKnowledgeItem` exposes `isError`/`error`, wrap the form with a guard and render an inline error with a disabled Save button.

archon-ui-main/src/features/knowledge/services/knowledgeService.ts (2)
238-255: Type `crawl_config` precisely instead of `any`.

Use the `CrawlConfig` type to catch shape errors at compile-time. Apply this diff:
```diff
 @@
-import type {
+import type {
   ChunksResponse,
   CodeExamplesResponse,
   CrawlRequest,
   CrawlRequestV2,
   CrawlStartResponse,
   KnowledgeItem,
   KnowledgeItemsFilter,
   KnowledgeItemsResponse,
   KnowledgeSource,
   RefreshResponse,
   SearchOptions,
   SearchResultsResponse,
   UploadMetadata,
+  CrawlConfig,
 } from "../types";
 @@
-  async updateCrawlConfig(request: {
+  async updateCrawlConfig(request: {
     sourceId: string;
     url: string;
     knowledge_type: "technical" | "business";
     max_depth: number;
     tags?: string[];
-    crawl_config?: any;
+    crawl_config?: CrawlConfig;
   }): Promise<CrawlStartResponse> {
```
93-103: Optional: add request timeouts for crawl starts.

Crawl starts can hang under load. Consider passing an AbortSignal with a sane timeout via `callAPIWithETag` if supported.

Confirm whether `callAPIWithETag` supports timeouts; if not, I can help add a `timeoutMs` option and wire `AbortController`.

archon-ui-main/src/features/knowledge/components/DocumentBrowser.tsx (3)
55-68: Memoize filtering to avoid O(n) recomputation on every render.

Wrap `filteredChunks` in `useMemo` keyed by `[chunks, searchQuery, selectedDomains]` to reduce work on large datasets. Apply this diff:
```diff
-  // Filter chunks based on search and domain
-  const filteredChunks = chunks.filter((chunk) => {
+  // Filter chunks based on search and domain
+  const filteredChunks = useMemo(() => chunks.filter((chunk) => {
     // Search filter
     const matchesSearch =
       !searchQuery ||
       chunk.content.toLowerCase().includes(searchQuery.toLowerCase()) ||
       chunk.metadata?.title?.toLowerCase().includes(searchQuery.toLowerCase());
     // Domain filter
     const url = chunk.url || chunk.metadata?.url;
     const matchesDomain =
       selectedDomains.size === 0 || (url && selectedDomains.has(extractDomain(url)));
-    return matchesSearch && matchesDomain;
-  });
+    return matchesSearch && matchesDomain;
+  }), [chunks, searchQuery, selectedDomains]);
```
151-179: Improve domain chip toggles: functional state update + accessibility.

Use functional state to avoid stale closures and add `aria-pressed` for screen readers. Apply this diff:
```diff
-  onClick={() => {
-    const newSelection = new Set(selectedDomains);
-    if (isSelected) {
-      newSelection.delete(domain);
-    } else {
-      newSelection.add(domain);
-    }
-    setSelectedDomains(newSelection);
-  }}
+  onClick={() =>
+    setSelectedDomains((prev) => {
+      const next = new Set(prev);
+      if (next.has(domain)) next.delete(domain);
+      else next.add(domain);
+      return next;
+    })
+  }
+  aria-pressed={isSelected}
   className={cn(
```
27-35: Consider server-side domain filtering to reduce payload.
`knowledgeService.getKnowledgeItemChunks` supports `domainFilter`. When domains are selected, request filtered chunks instead of pulling all and filtering client-side.

If `useKnowledgeItemChunks` accepts options, pass `domainFilter` as a CSV; otherwise, I can help extend the hook and invalidate the query on selection changes.

python/src/server/services/crawling/domain_filter.py (3)
58-60: Avoid blanket replacement of 'www.'; strip only the leading prefix.

```diff
-        normalized_domain = domain.replace("www.", "")
+        normalized_domain = domain.removeprefix("www.")
```

If Python <3.9, replace with a guarded regex or slice.
62-79: Normalize patterns to lowercase before matching.

Configs may carry mixed-case hostnames; ensure case-insensitive domain checks.
```diff
-        for excluded in config.excluded_domains:
-            if self._matches_domain(normalized_domain, excluded):
+        for excluded in config.excluded_domains:
+            if self._matches_domain(normalized_domain, excluded.lower()):
                 logger.debug(f"URL blocked by excluded domain | url={url} | domain={normalized_domain} | excluded={excluded}")
                 return False
 @@
-            for allowed_domain in config.allowed_domains:
-                if self._matches_domain(normalized_domain, allowed_domain):
+            for allowed_domain in config.allowed_domains:
+                if self._matches_domain(normalized_domain, allowed_domain.lower()):
                     allowed = True
                     break
```
107-146: Strengthen _matches_domain normalization (lowercase, strip scheme/port, leading www).

Prevents false negatives with ports like example.com:8080 and ensures consistent comparison.
```diff
-        # Remove any remaining protocol or path from pattern
-        pattern = pattern.replace("http://", "").replace("https://", "").split("/")[0]
-        pattern = pattern.replace("www.", "")  # Remove www. for consistent matching
+        # Normalize inputs
+        domain = (domain or "").lower()
+        # Remove any remaining protocol or path from pattern
+        pattern = (pattern or "").lower()
+        pattern = pattern.replace("http://", "").replace("https://", "").split("/")[0]
+        # Drop port if present on pattern (e.g., example.com:8080)
+        pattern = pattern.split(":", 1)[0]
+        # Strip leading www. only
+        pattern = pattern.removeprefix("www.")
```

Optional hardening: also drop port from domain if present (defensive when callers pass netloc).

```diff
-        # Exact match
+        # Drop port from domain defensively
+        domain = domain.split(":", 1)[0]
+        # Exact match
```

archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts (4)
315-321: Remove unused previousKnowledge from mutation context.

It's never set or read; simplifies types.
```diff
-    {
-      previousKnowledge?: KnowledgeItem[];
+    {
       previousSummaries?: Array<[readonly unknown[], KnowledgeItemsResponse | undefined]>;
       previousOperations?: ActiveOperationsResponse;
       tempProgressId: string;
       tempItemId: string;
     }
```
1006-1017: Type crawl_config precisely; avoid any.

Use CrawlConfig to get compile-time safety across UI/service/backend.
```diff
-  return useMutation<
+  return useMutation<
     CrawlStartResponse,
     Error,
     {
       sourceId: string;
       url: string;
       knowledge_type: "technical" | "business";
       max_depth: number;
       tags?: string[];
-      crawl_config?: any;
+      crawl_config?: CrawlConfig;
     }
   >({
```

Also add CrawlConfig to the import from "../types".
390-424: Minor: build optimistic active operation once (no duplication).

The two branches differ only in list concatenation; collapse to one setQueryData call.
1049-1054: Feedback toast on success is missing for config update.

Consider notifying users that a recrawl has started and where to track it.
python/src/server/services/crawling/strategies/recursive.py (2)
301-313: Base URL selection for filtering is inconsistent for multi-start crawls.

Using start_urls[0] can misclassify links from other domains; prefer the page's original_url as base.
```diff
-        base_url = original_url
-        if len(start_urls) > 0:
-            # If we have start_urls, use the first one
-            base_url = start_urls[0]
-        else:
-            # Try to use the current page's URL as base
-            # This handles relative links better
-            try:
-                from urllib.parse import urljoin
-                base_url = urljoin(original_url, next_url)
-            except Exception:
-                base_url = original_url
+        # Use the current page as base; it best reflects link context
+        base_url = original_url
```

DomainFilter should resolve relative URLs internally via urljoin (see suggested changes there).
288-326: Consider filtering start URLs as well.

Currently only discovered links are filtered. If configs include strict whitelists/patterns, initial set should be validated too.
archon-ui-main/src/features/knowledge/components/AdvancedCrawlConfig.tsx (3)
22-37: Avoid contradictory membership: remove domain from the opposite list when adding.

A domain shouldn't be in both allowed and excluded; blacklist wins but UX should prevent conflict.
```diff
   const handleAddDomain = (type: "allowed" | "excluded") => {
     if (!newDomain.trim()) return;
     const domain = newDomain.trim().toLowerCase().replace(/^https?:\/\//, "").replace(/\/$/, "");
     const key = `${type}_domains` as keyof CrawlConfig;
+    const otherKey = `${type === "allowed" ? "excluded" : "allowed"}_domains` as keyof CrawlConfig;
     const current = config[key] || [];
+    const other = (config[otherKey] || []).filter((d: string) => d !== domain);
     if (!current.includes(domain)) {
       onChange({
         ...config,
-        [key]: [...current, domain],
+        [key]: [...current, domain],
+        [otherKey]: other,
       });
     }
     setNewDomain("");
   };
```
171-176: Add accessible labels to remove buttons.

Improves screen reader usability.
```diff
-  <button
+  <button
+    aria-label={`Remove domain ${domain}`}
     onClick={() => handleRemoveDomain("allowed", domain)}
     className="opacity-0 group-hover:opacity-100 transition-opacity"
   >
 ...
-  <button
+  <button
+    aria-label={`Remove domain ${domain}`}
     onClick={() => handleRemoveDomain("excluded", domain)}
     className="opacity-0 group-hover:opacity-100 transition-opacity"
   >
 ...
-  <button
+  <button
+    aria-label={`Remove include pattern ${pattern}`}
     onClick={() => handleRemovePattern("include", pattern)}
     className="opacity-0 group-hover:opacity-100 transition-opacity"
   >
 ...
-  <button
+  <button
+    aria-label={`Remove exclude pattern ${pattern}`}
     onClick={() => handleRemovePattern("exclude", pattern)}
     className="opacity-0 group-hover:opacity-100 transition-opacity"
   >
```

Also applies to: 186-191, 269-273, 284-288
25-35: Consider stripping leading 'www.' for display consistency.

Backend normalizes without the prefix; aligning here reduces user confusion.
python/src/server/api_routes/knowledge_api.py (2)
578-591: Let FastAPI validate the request with a Pydantic model.

Use CrawlRequestV2 directly in the endpoint signature for consistent validation and OpenAPI docs.
```diff
-@router.post("/knowledge-items/{source_id}/update-config")
-async def update_crawl_config(source_id: str, request: dict):
+from ..models.crawl_models import CrawlRequestV2
+
+@router.post("/knowledge-items/{source_id}/update-config")
+async def update_crawl_config(source_id: str, request: CrawlRequestV2):
 ...
-    crawl_request = CrawlRequestV2(**request)
+    crawl_request = request
```

Propagate the same change to crawl_knowledge_item_v2.
985-991: Return estimated duration consistently with v1.

Minor UX polish: consider aligning wording/format with /knowledge-items/crawl to avoid confusing copy.
archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx (1)
73-79: Consider extracting the configuration check into a utility function.

The logic for detecting domain filtering configuration is sound but could be more maintainable as a separate utility function.
```diff
+const hasAdvancedCrawlConfig = (config: CrawlConfig): boolean => {
+  return !!(
+    (config.allowed_domains && config.allowed_domains.length > 0) ||
+    (config.excluded_domains && config.excluded_domains.length > 0) ||
+    (config.include_patterns && config.include_patterns.length > 0) ||
+    (config.exclude_patterns && config.exclude_patterns.length > 0)
+  );
+};
 const handleCrawl = async () => {
   // ... existing code ...
-  // Check if we have any domain filtering configuration
-  const hasCrawlConfig =
-    (crawlConfig.allowed_domains && crawlConfig.allowed_domains.length > 0) ||
-    (crawlConfig.excluded_domains && crawlConfig.excluded_domains.length > 0) ||
-    (crawlConfig.include_patterns && crawlConfig.include_patterns.length > 0) ||
-    (crawlConfig.exclude_patterns && crawlConfig.exclude_patterns.length > 0);
+  const hasCrawlConfig = hasAdvancedCrawlConfig(crawlConfig);
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)
- archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx (11 hunks)
- archon-ui-main/src/features/knowledge/components/AdvancedCrawlConfig.tsx (1 hunks)
- archon-ui-main/src/features/knowledge/components/DocumentBrowser.tsx (7 hunks)
- archon-ui-main/src/features/knowledge/components/EditCrawlConfigDialog.tsx (1 hunks)
- archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx (4 hunks)
- archon-ui-main/src/features/knowledge/components/KnowledgeCardActions.tsx (5 hunks)
- archon-ui-main/src/features/knowledge/components/index.ts (1 hunks)
- archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts (3 hunks)
- archon-ui-main/src/features/knowledge/inspector/components/ContentViewer.tsx (4 hunks)
- archon-ui-main/src/features/knowledge/inspector/components/InspectorSidebar.tsx (8 hunks)
- archon-ui-main/src/features/knowledge/inspector/components/KnowledgeInspector.tsx (7 hunks)
- archon-ui-main/src/features/knowledge/services/knowledgeService.ts (3 hunks)
- archon-ui-main/src/features/knowledge/types/knowledge.ts (2 hunks)
- python/src/server/api_routes/knowledge_api.py (3 hunks)
- python/src/server/models/crawl_models.py (1 hunks)
- python/src/server/services/crawling/crawling_service.py (3 hunks)
- python/src/server/services/crawling/domain_filter.py (1 hunks)
- python/src/server/services/crawling/strategies/recursive.py (3 hunks)
- python/src/server/services/tests/test_domain_filter.py (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-09-19T10:32:55.557Z
Learnt from: CR
PR: coleam00/Archon#0
File: AGENTS.md:0-0
Timestamp: 2025-09-19T10:32:55.557Z
Learning: Applies to archon-ui-main/src/features/*/components/**/*.{ts,tsx} : Place new UI components under src/features/[feature]/components
Applied to files:
archon-ui-main/src/features/knowledge/components/index.ts
📚 Learning: 2025-08-28T13:07:24.810Z
Learnt from: Wirasm
PR: coleam00/Archon#514
File: archon-ui-main/src/services/crawlProgressService.ts:35-39
Timestamp: 2025-08-28T13:07:24.810Z
Learning: The crawlProgressService.ts in the Archon codebase should be deprecated in favor of the existing useCrawlProgressPolling hook from usePolling.ts, which already includes ETag support, 304 handling, tab visibility detection, and proper React lifecycle integration. This consolidation reduces code duplication and improves performance.
Applied to files:
- archon-ui-main/src/features/knowledge/components/EditCrawlConfigDialog.tsx
- archon-ui-main/src/features/knowledge/components/AdvancedCrawlConfig.tsx
- archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts
- archon-ui-main/src/features/knowledge/services/knowledgeService.ts
- archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx
📚 Learning: 2025-08-28T12:56:47.840Z
Learnt from: Wirasm
PR: coleam00/Archon#514
File: archon-ui-main/src/pages/ProjectPage.tsx:329-331
Timestamp: 2025-08-28T12:56:47.840Z
Learning: In the ProjectPage.tsx polling refactor, temporary project creation logic with progress cards was removed in favor of simpler modal loading states, as the complexity of managing in-flight temporary projects wasn't justified when HTTP polling would show new projects within seconds anyway.
Applied to files:
archon-ui-main/src/features/knowledge/components/KnowledgeCardActions.tsx
📚 Learning: 2025-09-19T10:32:55.557Z
Learnt from: CR
PR: coleam00/Archon#0
File: AGENTS.md:0-0
Timestamp: 2025-09-19T10:32:55.557Z
Learning: Applies to archon-ui-main/src/features/*/hooks/**/*.{ts,tsx} : Use feature-scoped TanStack Query hooks in src/features/[feature]/hooks
Applied to files:
archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts
🔇 Additional comments (21)
archon-ui-main/src/features/knowledge/inspector/components/InspectorSidebar.tsx (1)
166-221: List rendering/accessibility looks solid.

Keyboard and ARIA attributes are reasonable; animated hover and selection state are consistent.
archon-ui-main/src/features/knowledge/components/index.ts (1)
2-2: Re-export of AdvancedCrawlConfig looks good.

Keeps feature modules discoverable via a single barrel.
archon-ui-main/src/features/knowledge/types/knowledge.ts (2)
136-201: CrawlConfig docs and shape look solid.

Clear precedence and examples; matches backend model.

211-214: CrawlRequestV2 extension is appropriate.

Optional crawl_config keeps v1 compatibility.
archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx (1)
226-227: Nice: gate "Edit Configuration" to URL sources only.

Prevents surfacing irrelevant actions for file uploads.
archon-ui-main/src/features/knowledge/components/KnowledgeCardActions.tsx (1)
96-99: Edit Configuration action wiring is clean.

Propagation is stopped; action only shown for URLs; icon/text consistent.
Also applies to: 133-141
python/src/server/services/tests/test_domain_filter.py (1)
157-166: Clarify 'www' normalization semantics.

The test expects `allowed_domains=["www.test.com"]` to also allow `https://test.com/page`. That implies stripping `www.` during normalization. Please confirm this is intentional; it weakens exact-host whitelisting and may surprise users who only allow `www.test.com`. If not intended, update either the test or the DomainFilter to require exact host matches unless a wildcard is used.
86-90: Pass DomainFilter to recursive strategy — verified.
RecursiveCrawlStrategy stores the DomainFilter and calls domain_filter.is_url_allowed(...) during the crawl (python/src/server/services/crawling/strategies/recursive.py:24,316). Align batch/sitemap similarly for consistent coverage.archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx (13)
13-15: LGTM! Proper imports for Crawl V2 functionality.

The new imports support the advanced crawling features with domain filtering. The component structure follows good practices with proper type imports.

36-36: LGTM! Consistent mutation hook integration.

The new `useCrawlUrlV2` hook is properly imported and follows the same pattern as the existing crawl mutation.

48-48: LGTM! Clean state management for crawl configuration.

The empty object initialization for `crawlConfig` is appropriate as it represents optional domain filtering settings.

60-60: LGTM! Proper state reset.

The `crawlConfig` is properly reset alongside other form state, maintaining consistency in form reset behavior.

82-101: LGTM! Clean conditional API selection with proper TypeScript typing.

The logic correctly determines whether to use the V1 or V2 endpoint based on domain filtering configuration. Both request objects are properly typed and constructed.

108-108: LGTM! Clear user feedback with contextual messaging.

The conditional toast message appropriately informs users whether domain filtering is active, providing valuable feedback about the crawl type being initiated.

151-151: LGTM! Comprehensive loading state handling.

The `isProcessing` state correctly includes both V1 and V2 mutations, ensuring the UI properly reflects all possible loading states.

155-159: LGTM! Enhanced dialog layout for better content management.

The flexbox layout with proper height constraints and flex-shrink settings ensures the dialog content is properly contained and scrollable.

161-215: LGTM! Enhanced tab styling with visual hierarchy.

The tab interface provides clear visual feedback with gradient backgrounds, borders, and accent glows. The implementation maintains good accessibility with proper focus states.

218-227: Enhanced scrolling container with proper height management.

The scrolling wrapper ensures content overflow is handled properly while maintaining consistent styling across different viewport sizes.

252-254: LGTM! Proper integration of AdvancedCrawlConfig component.

The component is positioned logically beneath the URL input and receives proper state management through controlled props. The useId hook used in this component generates stable, unique IDs that are consistent across renders and work seamlessly with SSR.

273-273: LGTM! Comprehensive loading state for start button.

The button properly reflects the loading state of both V1 and V2 mutations, providing consistent user feedback regardless of which API endpoint is being used.

285-299: LGTM! Consistent scrollable container for upload tab.

The upload tab maintains the same scrolling behavior and height management as the crawl tab, ensuring a consistent user experience across both tabs.
- Wire crawl_config from request to service instance for domain filtering
- Add domain filtering to BatchCrawlStrategy to ensure all crawls respect filters
- Replace 'any' types with proper CrawlConfig TypeScript types
- Add memoization for filtered chunks in DocumentBrowser for performance
- Improve domain normalization (strip leading www only, handle ports)
- Use Pydantic model validation directly in API endpoints

These fixes ensure domain filtering works consistently across all crawl strategies and improve type safety and performance throughout the codebase.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
@CodeRabbit review
✅ Actions performed: Review triggered.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
python/src/server/api_routes/knowledge_api.py (2)
893-903: Stop/remove of active task happens too early; attach cleanup to the real crawl task instead of wrapper.

The wrapper returns immediately after scheduling; your finally block deletes the registry entry, breaking cancellation via /stop.
Apply:
```diff
-        crawl_task = result.get("task")
-        if crawl_task:
-            active_crawl_tasks[progress_id] = crawl_task
-            safe_logfire_info(
-                f"Stored actual crawl task in active_crawl_tasks | progress_id={progress_id} | task_name={crawl_task.get_name()}"
-            )
+        crawl_task = result.get("task")
+        if crawl_task:
+            active_crawl_tasks[progress_id] = crawl_task
+            safe_logfire_info(
+                f"Stored actual crawl task in active_crawl_tasks | progress_id={progress_id} | task_name={crawl_task.get_name()}"
+            )
+            # Ensure cleanup happens when the underlying task finishes
+            def _cleanup(_):
+                if progress_id in active_crawl_tasks:
+                    del active_crawl_tasks[progress_id]
+                    safe_logfire_info(
+                        f"Cleaned up crawl task from registry | progress_id={progress_id}"
+                    )
+            try:
+                crawl_task.add_done_callback(_cleanup)
+            except Exception:
+                pass
 @@
-    finally:
-        # Clean up task from registry when done (success or failure)
-        if progress_id in active_crawl_tasks:
-            del active_crawl_tasks[progress_id]
-            safe_logfire_info(
-                f"Cleaned up crawl task from registry | progress_id={progress_id}"
-            )
+    finally:
+        # Cleanup is handled by the task's done-callback
+        pass
```

Also applies to: 930-936
738-760: Same early cleanup bug in refresh path; delete via done-callback, not wrapper finally.

Otherwise /stop may not find the task.
Apply:
```diff
-    finally:
-        # Clean up task from registry when done (success or failure)
-        if progress_id in active_crawl_tasks:
-            del active_crawl_tasks[progress_id]
-            safe_logfire_info(
-                f"Cleaned up refresh task from registry | progress_id={progress_id}"
-            )
+    finally:
+        # Cleanup handled when underlying crawl task completes
+        if 'crawl_task' in locals() and crawl_task:
+            def _cleanup(_):
+                if progress_id in active_crawl_tasks:
+                    del active_crawl_tasks[progress_id]
+                    safe_logfire_info(
+                        f"Cleaned up refresh task from registry | progress_id={progress_id}"
+                    )
+            try:
+                crawl_task.add_done_callback(_cleanup)
+            except Exception:
+                pass
```

archon-ui-main/src/features/knowledge/components/DocumentBrowser.tsx (1)
221-223: Prevent crashes when content is missing; compute preview safely.

Substring/length on undefined will throw.
Apply:
```diff
-        const preview = chunk.content.substring(0, 200);
-        const needsExpansion = chunk.content.length > 200;
+        const content = chunk.content || "";
+        const preview = content.substring(0, 200);
+        const needsExpansion = content.length > 200;
```

python/src/server/services/crawling/strategies/batch.py (1)
7-12: Critical: crawl_config name collision breaks domain filtering (CrawlerRunConfig shadows CrawlConfig).

The method parameter `crawl_config` (intended for DomainFilter rules) is overwritten by a local `CrawlerRunConfig`, so `is_url_allowed(..., crawl_config)` receives the wrong type. This can cause incorrect filtering or runtime errors. Rename the local variable to `run_config` and the parameter to `filter_config`, and type-hint both to prevent regressions. Apply this diff:
```diff
 @@
-import asyncio
-from collections.abc import Awaitable, Callable
-from typing import Any
+import asyncio
+from collections.abc import Awaitable, Callable
+from typing import Any, TYPE_CHECKING
 @@
-from ....config.logfire_config import get_logger
+from ....config.logfire_config import get_logger
 @@
+from ..domain_filter import DomainFilter
+
+if TYPE_CHECKING:
+    from ....models.crawl_models import CrawlConfig
 @@
-class BatchCrawlStrategy:
+class BatchCrawlStrategy:
 @@
-    def __init__(self, crawler, markdown_generator, domain_filter=None):
+    def __init__(self, crawler, markdown_generator, domain_filter: "DomainFilter" | None = None):
 @@
-    async def crawl_batch_with_progress(
+    async def crawl_batch_with_progress(
         self,
         urls: list[str],
         transform_url_func: Callable[[str], str],
         is_documentation_site_func: Callable[[str], bool],
         max_concurrent: int | None = None,
         progress_callback: Callable[..., Awaitable[None]] | None = None,
         cancellation_check: Callable[[], None] | None = None,
-        crawl_config=None,
+        filter_config: "CrawlConfig" | None = None,
     ) -> list[dict[str, Any]]:
 @@
-        if has_doc_sites:
+        if has_doc_sites:
             logger.info("Detected documentation sites in batch, using enhanced configuration")
             # Use generic documentation selectors for batch crawling
-            crawl_config = CrawlerRunConfig(
+            run_config = CrawlerRunConfig(
 @@
-        else:
+        else:
             # Configuration for regular batch crawling
-            crawl_config = CrawlerRunConfig(
+            run_config = CrawlerRunConfig(
 @@
-        if self.domain_filter and crawl_config:
+        if self.domain_filter and filter_config:
 @@
-            if not self.domain_filter.is_url_allowed(url, base_url, crawl_config):
+            if not self.domain_filter.is_url_allowed(url, base_url, filter_config):
 @@
-            batch_results = await self.crawler.arun_many(
-                urls=batch_urls, config=crawl_config, dispatcher=dispatcher
+            batch_results = await self.crawler.arun_many(
+                urls=batch_urls, config=run_config, dispatcher=dispatcher
             )
```

Also applies to: 22-22, 35-44, 108-135, 174-182, 228-230
python/src/server/services/crawling/crawling_service.py (2)
209-217: Update call to match renamed parameter in BatchCrawlStrategy.

After fixing the name collision in BatchCrawlStrategy, pass `filter_config` explicitly. Apply this diff:
```diff
         return await self.batch_strategy.crawl_batch_with_progress(
             urls,
             self.url_handler.transform_github_url,
             self.site_config.is_documentation_site,
             max_concurrent,
             progress_callback,
             self._check_cancellation,  # Pass cancellation check
-            self.crawl_config,  # Pass crawl_config for domain filtering
+            filter_config=self.crawl_config,  # Pass CrawlConfig for domain filtering
         )
```
227-236: Update recursive call to match renamed parameter in RecursiveCrawlStrategy (if mirrored).

Keep parameter naming consistent across strategies.
Apply this diff:
```diff
         return await self.recursive_strategy.crawl_recursive_with_progress(
             start_urls,
             self.url_handler.transform_github_url,
             self.site_config.is_documentation_site,
             max_depth,
             max_concurrent,
             progress_callback,
             self._check_cancellation,  # Pass cancellation check
-            self.crawl_config,  # Pass crawl config for domain filtering
+            filter_config=self.crawl_config,  # Pass CrawlConfig for domain filtering
         )
```
🧹 Nitpick comments (8)
python/src/server/api_routes/knowledge_api.py (1)
974-977: Don't register wrapper tasks in active_crawl_tasks; only track the actual crawl task.

The wrapper is replaced immediately by the real task, creating churn and racey cleanup.
Apply:
```diff
-    crawl_task = asyncio.create_task(_run_crawl_v2(request_dict=crawl_request.dict(), progress_id=progress_id))
-    active_crawl_tasks[progress_id] = crawl_task
+    asyncio.create_task(_run_crawl_v2(request_dict=crawl_request.dict(), progress_id=progress_id))
@@
-    crawl_task = asyncio.create_task(
-        _run_crawl_v2(request_dict=request_dict, progress_id=progress_id)
-    )
-    active_crawl_tasks[progress_id] = crawl_task
+    asyncio.create_task(_run_crawl_v2(request_dict=request_dict, progress_id=progress_id))
```

Also applies to: 636-640
archon-ui-main/src/features/knowledge/components/DocumentBrowser.tsx (1)
41-53: Skip empty/invalid domains in domainStats.

Avoid rendering blank chips and skewed counts.
Apply:
```diff
     const stats = new Map<string, number>();
     chunks.forEach((chunk) => {
       const url = chunk.url || chunk.metadata?.url;
       if (url) {
-        const domain = extractDomain(url);
-        stats.set(domain, (stats.get(domain) || 0) + 1);
+        const domain = extractDomain(url);
+        if (!domain) return;
+        stats.set(domain, (stats.get(domain) || 0) + 1);
       }
     });
```
336-364: Minor: tempItemId is unused in v2 flow; remove to reduce noise.

Keeps context minimal and avoids confusion.
Apply:
```diff
-    const tempItemId = optimisticItem.id;
+    // No tempItemId used in v2; we key by source_id (tempProgressId)
```

python/src/server/services/crawling/strategies/batch.py (2)
155-161: Progress totals should reflect filtering outcome.

Initial message reports the pre-filter count. Consider updating once filtering completes to keep UI totals consistent.
Apply this diff:
```diff
-        await report_progress(
-            0,  # Start at 0% progress
-            f"Starting to process {initial_urls_count} URLs...",
-            total_pages=initial_urls_count,
-            processed_pages=0
-        )
+        await report_progress(
+            0,
+            f"Starting to process {initial_urls_count} URLs...",
+            total_pages=initial_urls_count,
+            processed_pages=0
+        )
 @@
-        # Update total count after filtering
+        # Update total count after filtering
         total_urls = len(transformed_urls)
+        if filtered_count:
+            await report_progress(
+                0,
+                f"Filtered {filtered_count} URLs; {total_urls} remaining to crawl",
+                total_pages=total_urls,
+                processed_pages=0
+            )
```

Also applies to: 187-193
216-221: Guard progress math and smoothness.

Use a non-zero denominator to avoid edge cases and keep progress monotonic.
Apply this diff:
```diff
-            progress_percentage = int((i / total_urls) * 100)
+            progress_percentage = int((i / max(total_urls, 1)) * 100)
 @@
-            progress_percentage = int((processed / total_urls) * 100)
+            progress_percentage = int((processed / max(total_urls, 1)) * 100)
```

Also applies to: 268-276
archon-ui-main/src/features/knowledge/types/knowledge.ts (3)
24-26: Avoid `any` in index signature; prefer `unknown` for type safety.

`[key: string]: any` masks type errors across `KnowledgeItemMetadata`. Use `unknown` and narrow at use sites. Apply this diff:
```diff
-  [key: string]: any; // Allow additional untyped fields from backend
+  [key: string]: unknown; // Allow additional untyped fields from backend with safe reads
```
42-46: Duplication of fields at top level vs metadata — define source of truth.
`max_depth`, `tags`, and `crawl_config` appear both at the top level and under `metadata`. This can drift. Prefer one canonical location (ideally `metadata`) and map on the API layer, or document precedence for consumers.

Do you want a follow-up task to consolidate?
143-207: Make arrays readonly to clarify immutability in UI models.

These config lists are typically read-only in the UI; marking as `readonly` helps. Apply this diff:
```diff
-export interface CrawlConfig {
+export interface CrawlConfig {
 @@
-  allowed_domains?: string[];
+  readonly allowed_domains?: readonly string[];
 @@
-  excluded_domains?: string[];
+  readonly excluded_domains?: readonly string[];
 @@
-  include_patterns?: string[];
+  readonly include_patterns?: readonly string[];
 @@
-  exclude_patterns?: string[];
+  readonly exclude_patterns?: readonly string[];
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- archon-ui-main/src/features/knowledge/components/DocumentBrowser.tsx (7 hunks)
- archon-ui-main/src/features/knowledge/components/EditCrawlConfigDialog.tsx (1 hunks)
- archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts (3 hunks)
- archon-ui-main/src/features/knowledge/services/knowledgeService.ts (3 hunks)
- archon-ui-main/src/features/knowledge/types/knowledge.ts (4 hunks)
- python/src/server/api_routes/knowledge_api.py (4 hunks)
- python/src/server/services/crawling/crawling_service.py (5 hunks)
- python/src/server/services/crawling/domain_filter.py (1 hunks)
- python/src/server/services/crawling/strategies/batch.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- archon-ui-main/src/features/knowledge/services/knowledgeService.ts
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-28T13:07:24.810Z
Learnt from: Wirasm
PR: coleam00/Archon#514
File: archon-ui-main/src/services/crawlProgressService.ts:35-39
Timestamp: 2025-08-28T13:07:24.810Z
Learning: The crawlProgressService.ts in the Archon codebase should be deprecated in favor of the existing useCrawlProgressPolling hook from usePolling.ts, which already includes ETag support, 304 handling, tab visibility detection, and proper React lifecycle integration. This consolidation reduces code duplication and improves performance.
Applied to files:
archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts
📚 Learning: 2025-09-19T10:32:55.557Z
Learnt from: CR
PR: coleam00/Archon#0
File: AGENTS.md:0-0
Timestamp: 2025-09-19T10:32:55.557Z
Learning: Applies to archon-ui-main/src/features/*/hooks/**/*.{ts,tsx} : Use feature-scoped TanStack Query hooks in src/features/[feature]/hooks
Applied to files:
archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts
🔇 Additional comments (11)
python/src/server/services/crawling/domain_filter.py (3)
162-179: Extract domains with hostname and normalize "www.".

Avoid ports/userinfo in domain set.
Apply:
```diff
-        for url in urls:
+        for url in urls:
             try:
-                parsed = urlparse(url)
-                if parsed.netloc:
-                    domain = parsed.netloc.lower().replace("www.", "")
-                    domains.add(domain)
+                parsed = urlparse(url)
+                host = parsed.hostname
+                if host:
+                    domains.add(host.lower().removeprefix("www."))
```
8-10: Use urljoin and hostname for robust URL resolution and domain extraction.

netloc includes ports/userinfo and relative URLs aren't resolved.
Apply:
```diff
-from urllib.parse import urlparse
+from urllib.parse import urlparse, urljoin
```
44-57: Fix relative URL handling and ensure consistent host normalization.

Preserve query strings, resolve relatives, and use hostname (not netloc).
Apply:
```diff
-        # Parse the URL
-        parsed = urlparse(url)
-
-        # Handle relative URLs by using base URL's domain
-        if not parsed.netloc:
-            base_parsed = urlparse(base_url)
-            domain = base_parsed.netloc.lower()
-            # Construct full URL for pattern matching
-            full_url = f"{base_parsed.scheme}://{base_parsed.netloc}{parsed.path or '/'}"
-        else:
-            domain = parsed.netloc.lower()
-            full_url = url
+        # Resolve to absolute and extract canonical host
+        parsed = urlparse(url)
+        if not parsed.scheme and not parsed.netloc:
+            resolved = urlparse(urljoin(base_url, url))
+        else:
+            resolved = parsed
+        host = (resolved.hostname or "").lower()
+        domain = host
+        # Use resolved URL (keeps query/fragment) for pattern matching
+        full_url = resolved.geturl()
 @@
-        normalized_domain = domain
-        if normalized_domain.startswith("www."):
-            normalized_domain = normalized_domain[4:]
+        normalized_domain = domain.removeprefix("www.")
```

python/src/server/api_routes/knowledge_api.py (2)
python/src/server/api_routes/knowledge_api.py (2)

627-634: Don’t delete existing data before recrawl; gate behind a flag and defer deletion until success. Immediate deletion risks irreversible data loss if the new crawl fails. Default to non-destructive behavior and only hard-delete when explicitly requested.
Apply:
```diff
-@router.post("/knowledge-items/{source_id}/update-config")
-async def update_crawl_config(source_id: str, request: CrawlRequestV2):
+@router.post("/knowledge-items/{source_id}/update-config")
+async def update_crawl_config(
+    source_id: str,
+    request: CrawlRequestV2,
+    force_delete_old: bool = Query(False),
+):
@@
-    # First delete the existing knowledge item and its documents
-    safe_logfire_info(f"Deleting existing knowledge item before recrawl | source_id={source_id}")
-    try:
-        await service.delete_item(source_id)
-    except Exception as e:
-        safe_logfire_error(f"Failed to delete existing item | error={str(e)}")
-        # Continue anyway - we'll overwrite
+    # Build request dict and mark replacement target (defer deletion)
+    request_dict = crawl_request.dict()
+    request_dict.setdefault("metadata", {})
+    request_dict["metadata"]["replace_source_id"] = source_id
+
+    # Optionally allow immediate deletion (explicit opt-in)
+    if force_delete_old:
+        safe_logfire_info(
+            f"force_delete_old=True; deleting existing knowledge item before recrawl | source_id={source_id}"
+        )
+        try:
+            await service.delete_item(source_id)
+        except Exception as e:
+            safe_logfire_error(f"Failed to delete existing item | error={str(e)}")
+            # Continue anyway; replacement path will create fresh data
@@
-    crawl_task = asyncio.create_task(
-        _run_crawl_v2(request_dict=crawl_request.dict(), progress_id=progress_id)
-    )
-    active_crawl_tasks[progress_id] = crawl_task
+    crawl_task = asyncio.create_task(
+        _run_crawl_v2(request_dict=request_dict, progress_id=progress_id)
+    )
```

Add missing import (outside this hunk):
```python
from fastapi import APIRouter, File, Form, HTTPException, UploadFile, Query
```
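If the suggested `force_delete_old` flag is adopted, a client call might look like the sketch below. This is illustrative only: the flag and deferred-deletion behavior come from the patch above, not from merged code, and the local URL and source id are hypothetical.

```python
import httpx

# Hypothetical call against a local server with the patched endpoint applied.
resp = httpx.post(
    "http://localhost:8181/api/knowledge-items/src_123/update-config",
    params={"force_delete_old": False},  # default: non-destructive recrawl
    json={
        "url": "https://docs.example.com",
        "max_depth": 2,
        "crawl_config": {"allowed_domains": ["docs.example.com"]},
    },
    timeout=30.0,
)
resp.raise_for_status()
print(resp.json().get("progressId"))  # progress id for polling, per the crawl flow
```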
994-1064: v2 crawl bypasses the concurrency guard; wrap with the semaphore and fix task cleanup. Crawl v2 should honor CONCURRENT_CRAWL_LIMIT and must not remove the task from the registry before the underlying task completes.
Apply:
```diff
 async def _run_crawl_v2(request_dict: dict, progress_id: str):
     """Run the crawl v2 with domain filtering in background."""
     tracker = ProgressTracker(progress_id, operation_type="crawl")
-
-    try:
-        safe_logfire_info(
-            f"Starting crawl v2 with progress tracking | progress_id={progress_id} | url={request_dict['url']}"
-        )
+    crawl_task: asyncio.Task | None = None
+    try:
+        async with crawl_semaphore:
+            safe_logfire_info(
+                f"Starting crawl v2 with progress tracking | progress_id={progress_id} | url={request_dict['url']}"
+            )
@@
-            result = await orchestration_service.orchestrate_crawl(request_dict)
+            result = await orchestration_service.orchestrate_crawl(request_dict)
@@
-            crawl_task = result.get("task")
-            if crawl_task:
-                active_crawl_tasks[progress_id] = crawl_task
-                safe_logfire_info(
-                    f"Stored actual crawl v2 task in active_crawl_tasks | progress_id={progress_id}"
-                )
-            else:
-                safe_logfire_error(f"No task returned from orchestrate_crawl v2 | progress_id={progress_id}")
+            crawl_task = result.get("task")
+            if crawl_task:
+                active_crawl_tasks[progress_id] = crawl_task
+                # Clean up registry when the actual crawl completes
+                def _cleanup(_):
+                    if progress_id in active_crawl_tasks:
+                        del active_crawl_tasks[progress_id]
+                        safe_logfire_info(
+                            f"Cleaned up crawl v2 task from registry | progress_id={progress_id}"
+                        )
+                try:
+                    crawl_task.add_done_callback(_cleanup)
+                except Exception:
+                    # Best-effort; don't fail the request on callback errors
+                    pass
+            else:
+                safe_logfire_error(f"No task returned from orchestrate_crawl v2 | progress_id={progress_id}")
@@
-    finally:
-        # Clean up task from registry when done
-        if progress_id in active_crawl_tasks:
-            del active_crawl_tasks[progress_id]
-            safe_logfire_info(f"Cleaned up crawl v2 task from registry | progress_id={progress_id}")
+    finally:
+        # Wrapper-level cleanup no longer removes the actual crawl entry; handled by done callback
+        pass
```
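The pattern at issue, boiled down: a minimal sketch assuming an `asyncio.Semaphore` named `crawl_semaphore` sized to CONCURRENT_CRAWL_LIMIT; the registry and callback names here are illustrative, not the project's actual ones.

```python
import asyncio

CONCURRENT_CRAWL_LIMIT = 3
crawl_semaphore = asyncio.Semaphore(CONCURRENT_CRAWL_LIMIT)
active_tasks: dict[str, asyncio.Task] = {}

async def run_guarded(progress_id: str, coro_factory):
    # Acquire the semaphore so at most CONCURRENT_CRAWL_LIMIT crawls run at once
    async with crawl_semaphore:
        task = asyncio.create_task(coro_factory())
        active_tasks[progress_id] = task
        # Remove the registry entry only when the task itself finishes
        task.add_done_callback(lambda _t: active_tasks.pop(progress_id, None))
        await task
```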
archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts (1)

433-444: Fix activeOps replacement: also update operation_id and source_id, not just progressId. Leaving a temp operation_id breaks tracking elsewhere.
Apply:
```diff
-      const activeOps = queryClient.getQueryData<ActiveOperationsResponse>(progressKeys.active());
-      if (activeOps) {
-        const updated: ActiveOperationsResponse = {
-          ...activeOps, // Preserve count, timestamp, and any other fields
-          operations: activeOps.operations.map((op) =>
-            op.progressId === context.tempProgressId ? { ...op, progressId: response.progressId } : op,
-          ),
-        };
-        queryClient.setQueryData(progressKeys.active(), updated);
-      }
+      const activeOps = queryClient.getQueryData<ActiveOperationsResponse>(progressKeys.active());
+      if (activeOps) {
+        const updated: ActiveOperationsResponse = {
+          ...activeOps,
+          operations: activeOps.operations.map((op) =>
+            op.operation_id === context.tempProgressId
+              ? {
+                  ...op,
+                  operation_id: response.progressId,
+                  progressId: response.progressId,
+                  source_id: response.progressId,
+                  message: response.message || op.message,
+                }
+              : op,
+          ),
+        };
+        queryClient.setQueryData(progressKeys.active(), updated);
+      }
```
archon-ui-main/src/features/knowledge/components/EditCrawlConfigDialog.tsx (1)

82-105: Validate max depth; prevent NaN/invalid values from reaching the API. `parseInt` on invalid input yields NaN and will trigger 422/500 errors on the backend.
Apply:
```diff
   const handleSave = async () => {
     if (!url) {
       showToast("URL is required", "error");
       return;
     }

     try {
-      await updateMutation.mutateAsync({
+      const depth = Number((maxDepth || "").trim());
+      if (!Number.isInteger(depth) || depth < 0) {
+        showToast("Crawl depth must be a non-negative integer", "error");
+        return;
+      }
+      await updateMutation.mutateAsync({
         sourceId,
         url,
         knowledge_type: knowledgeType,
-        max_depth: parseInt(maxDepth, 10),
+        max_depth: depth,
         tags: tags.length > 0 ? tags : undefined,
         crawl_config: crawlConfig,
       });
```
python/src/server/services/crawling/strategies/batch.py (1)

173-186: Filtering on original vs transformed URLs — confirm intent. You filter before `transform_url_func`. If transformation changes the domain (e.g., GitHub raw/blob), rules may be applied to the wrong host. Either filter both the original and transformed URLs, or transform first and filter on the final URL, depending on DomainFilter semantics.

Do you want a patch to filter on both forms for safety?
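One way to make the intent explicit is to require both forms to pass. A minimal sketch, assuming `DomainFilter.is_url_allowed(url, base_url, config)` is the check the strategies use; `transform_url` below is a hypothetical stand-in for `transform_url_func`:

```python
def transform_url(url: str) -> str:
    # Hypothetical stand-in for transform_url_func, e.g. GitHub blob -> raw
    return url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")

def allowed_both_forms(domain_filter, url: str, base_url: str, config) -> bool:
    """Accept a URL only if filtering rules pass before AND after transformation."""
    transformed = transform_url(url)
    return (
        domain_filter.is_url_allowed(url, base_url, config)
        and domain_filter.is_url_allowed(transformed, base_url, config)
    )
```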
archon-ui-main/src/features/knowledge/types/knowledge.ts (1)
218-221: Confirm casing and serialization for crawl_config across the API boundary. The UI uses snake_case keys (`allowed_domains`, etc.). Ensure your client and server don’t auto-transform to camelCase, and that `undefined` is omitted on serialize to avoid noisy payloads.

I can add a small Zod schema to validate shape before POST if helpful.
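On the server side, the snake_case shape could be pinned down with a Pydantic model like the sketch below. This is an assumption about what the `CrawlConfig` schema looks like, not the actual definition; the pattern-field names in particular are illustrative.

```python
from pydantic import BaseModel

class CrawlConfigModel(BaseModel):
    """Assumed server-side mirror of the UI's snake_case crawl_config keys."""
    allowed_domains: list[str] | None = None
    excluded_domains: list[str] | None = None
    include_patterns: list[str] | None = None
    exclude_patterns: list[str] | None = None

cfg = CrawlConfigModel(allowed_domains=["docs.example.com"])
# exclude_none mirrors "undefined is omitted on serialize" on the client side
print(cfg.model_dump(exclude_none=True))  # {'allowed_domains': ['docs.example.com']}
```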
python/src/server/services/crawling/crawling_service.py (2)
61-70: Add type hint for crawl_config (repeat). Annotate it as `Optional[CrawlConfig]` to align with imports and tooling. This also prevents an “unused import” warning for `CrawlConfig`.

Apply this diff:
```diff
-    def __init__(self, crawler=None, supabase_client=None, progress_id=None, crawl_config=None):
+    def __init__(self, crawler=None, supabase_client=None, progress_id=None, crawl_config: Optional[CrawlConfig] = None):
@@
-            crawl_config: Optional CrawlConfig for domain filtering
+            crawl_config: Optional CrawlConfig for domain filtering
```
86-88: LGTM: DomainFilter is wired into both batch and recursive strategies. Good cohesion and a single source of filtering logic.
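For readers skimming the review, the documented priority order (blacklist first, then whitelist, then exclude/include globs) condenses to something like this sketch. The real DomainFilter also resolves relative URLs and normalizes hosts; `is_url_allowed` mirrors the method named in the walkthrough, but the body here is illustrative.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def is_url_allowed_sketch(url: str, config: dict) -> bool:
    """Condensed priority-order check: blacklist > whitelist > exclude > include."""
    host = (urlparse(url).hostname or "").lower().removeprefix("www.")

    # 1. Blacklist wins over everything
    if host in {d.lower().removeprefix("www.") for d in config.get("excluded_domains", [])}:
        return False
    # 2. If a whitelist exists, the host must be on it
    allowed = {d.lower().removeprefix("www.") for d in config.get("allowed_domains", [])}
    if allowed and host not in allowed:
        return False
    # 3. Glob exclude patterns block matching URLs
    if any(fnmatch(url, p) for p in config.get("exclude_patterns", [])):
        return False
    # 4. If include patterns exist, at least one must match
    include = config.get("include_patterns", [])
    return not include or any(fnmatch(url, p) for p in include)
```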
|
@CodeRabbit review |
✅ Actions performed

Review triggered.
|
- Remove code blocks from JSDoc to prevent parsing conflicts - Simplify example format to avoid asterisk/slash interpretation issues - Maintain documentation clarity while ensuring TypeScript compilation
…ng errors - Replace * with [star] in all JSDoc examples to avoid parser conflicts - Maintain documentation clarity while ensuring proper compilation - Fix issue across all pattern examples in CrawlConfig documentation
- Increase dialog max height from 85vh to 90vh for better content visibility - Improve scrolling behavior with proper padding adjustments - Fix data loading to properly query item only when dialog is open - Add comprehensive fallback checks for max_depth, tags, and crawl_config - Add error handling for failed configuration loads - Ensure arrays are properly validated before assignment - Remove debug console.log statements
- Add GET /knowledge-items/{source_id} endpoint to fetch single item
- Fixes 'Method Not Allowed' error in Edit Crawler Configuration dialog
- Returns full item data including metadata for configuration editing
- Use metadata.original_url when available to show the actual crawled URL - Falls back to item.url if original_url is not present - Ensures the edit dialog shows the same URL that was originally crawled
- Add stopPropagation wrapper to prevent dialog clicks from bubbling to card - Include crawl configuration fields at top level of knowledge item response - Ensure max_depth, tags, and crawl_config are accessible for edit dialog - Fix issue where clicking inside edit dialog would open document browser
- Add onKeyDown stopPropagation to prevent Enter key from opening document browser - Auto-expand Advanced Configuration when existing config is present - Fix issue where pressing Enter in tag input would trigger card action - Improve UX by showing loaded configuration immediately when editing
- Fix deletion of existing documents before recrawl with new configuration - Use SourceManagementService.delete_source() instead of non-existent delete_item() - Ensure all crawl configuration fields are persisted to metadata - Store knowledge_type, max_depth, tags, and original_url in metadata - Add proper logging for successful deletions - Ensure crawl_config is persisted and reloaded correctly
…tence - Fix scrolling in Edit Crawler Configuration dialog when content is expanded - Remove negative margins and simplify padding for better scroll behavior - Add console logging to debug crawl_config persistence issues - Ensure save button is always visible with proper scrollbar
|
@leex279 is this something we are still considering? |
|
@Wirasm yes, need to finish that. |
|
🔄 This repository is being replaced by a new version of Archon. The original Python/MCP codebase is being archived. This PR is being closed as part of the migration. Thank you for your contribution! |
The test 'should call generateAndSetTitle with workflow name and user message' was failing with 'Cannot create worktree: not in a git repository' because two mocks were inconsistent with the code path being exercised. Changes: - Add `configureIsolation` to the `@archon/isolation` mock (was imported but unmocked) - Fix `findCodebaseByDefaultCwd` mock to return a codebase object instead of null so the isolation block is entered and `generateAndSetTitle` is reached Fixes #717 Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
This PR implements Advanced Web Crawling with Domain Filtering capabilities for Archon's knowledge management system, along with comprehensive Edit Crawler Configuration and Enhanced Metadata Viewing features.
2025-09-22_12h59_12.mp4
2025-09-22_12h59_52.mp4
2025-09-22_13h00_19.mp4
New Features Added
🔧 Edit Crawler Configuration
📊 Enhanced Metadata Viewing
Chunk metadata now surfaces the following field groups:

- `url`, `source`, `headers`, `filename`
- `has_code`, `has_links`, `source_id`
- `char_count`, `word_count`, `line_count`
- `chunk_size`, `chunk_index`
- `source_type`, `knowledge_type`

🌐 Advanced Domain Filtering
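As an illustration of the filtering options, a crawl_config payload might look like the sketch below. The `allowed_domains`/`excluded_domains` keys match the documented filter priority; the glob-pattern key names are assumed for illustration.

```python
# Illustrative crawl_config payload; pattern key names are assumptions.
crawl_config = {
    "excluded_domains": ["ads.example.com"],                 # highest priority: always blocked
    "allowed_domains": ["docs.example.com"],                 # if set, hosts must match
    "exclude_patterns": ["*/changelog/*"],                   # glob patterns to skip
    "include_patterns": ["https://docs.example.com/en/*"],   # glob patterns to keep
}
```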
🎯 Improved Document Browser
Technical Implementation
Frontend Architecture
- `useUpdateCrawlConfig` hook with optimistic updates and error handling

Backend Integration
- `POST /api/knowledge-items/{source_id}/update-config`

Key Files Modified/Added
Frontend Components
- `src/features/knowledge/components/EditCrawlConfigDialog.tsx` ⭐ NEW
- `src/features/knowledge/components/KnowledgeCardActions.tsx`
- `src/features/knowledge/components/KnowledgeCard.tsx`
- `src/features/knowledge/inspector/components/ContentViewer.tsx`
- `src/features/knowledge/inspector/components/KnowledgeInspector.tsx`
- `src/features/knowledge/inspector/components/InspectorSidebar.tsx`

Services & Hooks
- `src/features/knowledge/hooks/useKnowledgeQueries.ts` - Added `useUpdateCrawlConfig`
- `src/features/knowledge/services/knowledgeService.ts` - Added `updateCrawlConfig`

Backend API
- `python/src/server/api_routes/knowledge_api.py` - Added update-config endpoint

User Experience Flow
Editing Configuration:
Enhanced Metadata Viewing:
Domain Filtering:
Testing & Validation
Benefits
This implementation provides a comprehensive solution for advanced web crawling with full configurability and excellent user experience for managing and analyzing crawled content.
Summary by CodeRabbit
New Features
UI/UX Improvements
Bug Fixes
Tests