feat: Document browser with advanced domain filtering#537
Conversation
Frontend:
- Add DocumentBrowser component with domain filtering and search
- Add advanced domain configuration UI to AddKnowledgeModal
- Add "Browse Documents" button to KnowledgeItemCard
- Support comma-separated domain/pattern input with badges
- Make modal scrollable and improve UX
Backend:
- Add CrawlConfig model with domain/pattern filtering options
- Implement domain filtering logic with fnmatch pattern matching
- Add /knowledge-items/{source_id}/chunks endpoint for chunk browsing
- Add /knowledge-items/crawl-v2 endpoint with domain filtering support
- Filter URLs during crawling based on allowed/excluded domains and patterns
Features:
- Whitelist specific domains to crawl
- Blacklist domains to exclude
- Include/exclude URL patterns using glob-style matching
- Browse and search document chunks with domain filtering
- Collapsible advanced configuration section
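The glob-style include/exclude matching described above can be sketched with Python's stdlib `fnmatch`. This is a minimal illustration of the filtering semantics; the function and parameter names are illustrative, not the PR's actual implementation:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def url_allowed(url, allowed_domains=(), excluded_domains=(),
                include_patterns=(), exclude_patterns=()):
    """Return True if url passes the whitelist/blacklist and glob filters."""
    # Normalize the host: lowercase and strip a leading "www."
    domain = urlparse(url).netloc.lower().removeprefix("www.")
    if excluded_domains and domain in excluded_domains:
        return False  # blacklisted domain
    if allowed_domains and domain not in allowed_domains:
        return False  # whitelist active and domain not on it
    if exclude_patterns and any(fnmatch(url, p) for p in exclude_patterns):
        return False  # URL matches an exclusion glob
    if include_patterns and not any(fnmatch(url, p) for p in include_patterns):
        return False  # inclusion globs active and none match
    return True
```

With `include_patterns=("*/docs/*",)`, for example, only URLs whose path contains `/docs/` survive; note that `fnmatch`'s `*` matches across `/`, so patterns apply to the full URL string.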
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Walkthrough

Adds a DocumentBrowser modal to browse knowledge-base chunks with search and domain filtering, wires it into KnowledgeBasePage via KnowledgeItemCard's clickable page-count badge, introduces getKnowledgeItemChunks API on frontend and backend, adds crawl v2 with domain-filter configuration and URL filtering, and removes legacy document upload paths.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant K as KnowledgeItemCard
    participant P as KnowledgeBasePage
    participant DB as DocumentBrowser
    participant Svc as knowledgeBaseService
    participant API as GET /knowledge-items/{id}/chunks
    participant Store as Chunks Store
    U->>K: Click page-count badge
    K-->>P: onBrowseDocuments(sourceId)
    P->>DB: Open modal with {sourceId}
    DB->>Svc: getKnowledgeItemChunks(sourceId, domainFilter?)
    Svc->>API: GET chunks?domain_filter=...
    API-->>Svc: chunks[]
    Svc-->>DB: chunks[]
    DB->>Store: cache chunks, compute domains
    U->>DB: Search / select domain / pick chunk
    DB-->>U: Render filtered chunk content + metadata
```

```mermaid
sequenceDiagram
    autonumber
    participant UI as KnowledgeBasePage
    participant Svc as knowledgeBaseService.crawlUrl
    participant API as POST /crawl-v2
    participant Crawl as CrawlingService
    Note over UI,API: Crawl v2 with crawl_config
    UI->>Svc: crawlUrl({ url, crawl_config })
    alt crawl_config provided
        Svc->>API: POST /crawl-v2 (CrawlRequest)
        API->>Crawl: start crawl task (semaphore)
        Crawl->>Crawl: build filter_config<br/>apply URL filtering
        Crawl-->>API: progress updates (WS/events)
    else
        Svc->>API: POST /crawl (v1)
    end
    API-->>UI: completion signal
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Assessment against linked issues: Out-of-scope changes
- Add optional chaining for domains array mapping
- Add safety checks for filteredChunks and chunks arrays
- Remove unsafe HTML content replacement to prevent XSS
- Ensure component handles empty/undefined data gracefully
- Replace HTML option children with options prop
- Select component expects {value, label} objects array
- Fixes "Cannot read properties of undefined (reading 'map')" error
- Change from 'documents' to 'archon_crawled_pages' table
- Fixes 500 error when fetching document chunks
- Aligns with existing database schema naming convention
- Move DocumentBrowser from KnowledgeItemCard to KnowledgeBasePage level
- Add onBrowseDocuments callback prop to KnowledgeItemCard
- Fix modal rendering inside card container (z-index/stacking context issue)
- Now opens as a full-screen modal like other modals (CodeViewer, EditModal)
- Fix table name in chunks API from 'documents' to 'archon_crawled_pages'
- Add left sidebar with document list (like code examples list)
- Add right content area for selected document chunk
- Add click-to-select functionality for document chunks
- Auto-select first chunk when opening browser
- Match CodeViewer modal design pattern but in blue theme
- Show document preview in sidebar with domain badges
- Improve overall UX with familiar layout pattern
- Add visible blue-themed scrollbars to document list sidebar
- Add matching scrollbars to document content area
- Include both Firefox (scrollbar-width/color) and WebKit scrollbar support
- Inject custom CSS for cross-browser scrollbar styling
- Improve visual feedback for scrollable content areas
- Remove redundant green "Browse" button
- Make orange document count badge clickable to open DocumentBrowser
- Update tooltip to indicate clickable behavior
- Add hover effect to document count badge
- Improve scrollbar visibility with overflow-y-scroll
- More intuitive UX: click the document count to browse documents
- Add click outside modal to close (like CodeViewer)
- Add proper flex layout constraints with min-h-0 for scrolling
- Force scrollbars with overflow-y-scroll instead of auto
- Add flex-shrink-0 to header sections to prevent compression
- Ensure proper height calculations for scrollable containers
- Remove overflow-hidden and max-h constraints that blocked scrolling
- Simplify flex layout using h-full instead of min-h-0 conflicts
- Use overflow-y-auto pattern like CodeViewer (working reference)
- Remove custom scrollbar styling that interfered with functionality
- Follow code reviewer recommendations for proper height inheritance

Fixes: Scrollbars now functional for both document list and content areas
Root cause: CSS height constraint conflict between flex-1 and h-full
- flex-1 = take remaining space after other flex items
- h-full = be 100% of parent height
- Together they create competing height calculations

Solution: Remove h-full from main content container
- Let flex-1 handle height calculation naturally
- Allows scrollable areas to establish proper heights
- Enables functional scrolling in both sidebar and content
- Use createPortal for proper modal rendering outside component tree
- Replace Card component with direct divs like CodeViewer
- Copy exact scrolling pattern: h-[85vh] + overflow-hidden + overflow-y-auto
- Use nested h-full + overflow-auto structure for content area
- Match CodeViewer styling but with blue theme instead of pink
- Add click-outside-to-close functionality
- Remove complex flex constraints that blocked scrolling

This replicates the proven working scrolling pattern from CodeViewerModal.
- Remove document upload API endpoints and handlers
- Remove upload UI elements from KnowledgeBasePage modal
- Remove upload service methods from knowledgeBaseService
- Keep only URL crawling and document browser functionality
- Maintain DocumentBrowser component with domain filtering
- Preserve document chunks API for browsing existing documents
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/src/server/services/crawling/crawling_service.py (1)
294-327: Filter only applies after recursion; pass it into the recursive strategy to prune the frontier

Post-filtering works but wastes crawl bandwidth. Thread the `filter_func` into the underlying recursive strategy so links get pruned at discovery time. This materially reduces pages fetched and aligns with "server-side filtering recommended."

```diff
-        results = await self.recursive_strategy.crawl_recursive_with_progress(
-            start_urls,
-            self.url_handler.transform_github_url,
-            self.site_config.is_documentation_site,
-            max_depth,
-            max_concurrent,
-            progress_callback,
-            start_progress,
-            end_progress,
-        )
+        # Prefer passing filter_func into the strategy if supported to prune at discovery time
+        try:
+            results = await self.recursive_strategy.crawl_recursive_with_progress(
+                start_urls,
+                self.url_handler.transform_github_url,
+                self.site_config.is_documentation_site,
+                max_depth,
+                max_concurrent,
+                progress_callback,
+                start_progress,
+                end_progress,
+                filter_func=filter_func,
+            )
+        except TypeError:
+            # Backward-compat if strategy doesn't yet accept filter_func
+            results = await self.recursive_strategy.crawl_recursive_with_progress(
+                start_urls,
+                self.url_handler.transform_github_url,
+                self.site_config.is_documentation_site,
+                max_depth,
+                max_concurrent,
+                progress_callback,
+                start_progress,
+                end_progress,
+            )
```
🧹 Nitpick comments (25)
python/src/server/services/crawling/crawling_service.py (2)
133-199: Make domain checks subdomain-aware and harden extraction

Right now `allowed_domains`/`excluded_domains` require exact host matches (only `www.` is stripped). That will exclude legitimate subdomains (e.g., `docs.example.com` when `example.com` is allowed) and may mis-handle schemeless URLs. Recommend suffix-based matching and a safer extraction fallback.

Apply this focused change:

```diff
@@
     def _extract_domain(self, url: str) -> str:
         """Extract and normalize domain from URL."""
         try:
             parsed = urlparse(url)
-            return self._normalize_domain(parsed.netloc)
+            host = parsed.netloc or parsed.path.split('/')[0]
+            return self._normalize_domain(host)
         except Exception:
             return url.lower().strip()

+    def _domain_matches(self, domain: str, patterns: List[str]) -> bool:
+        """Return True if domain equals or is a subdomain of any pattern."""
+        for p in patterns:
+            p = self._normalize_domain(p)
+            if domain == p or domain.endswith("." + p):
+                return True
+        return False
@@
-        excluded_domains = filter_config.get('excluded_domains', [])
-        if excluded_domains and domain in excluded_domains:
+        excluded_domains = filter_config.get('excluded_domains', [])
+        if excluded_domains and self._domain_matches(domain, excluded_domains):
             safe_logfire_info(f"Skipping URL due to excluded domain: {url} (domain: {domain})")
             return False
@@
-        allowed_domains = filter_config.get('allowed_domains', [])
-        if allowed_domains and domain not in allowed_domains:
+        allowed_domains = filter_config.get('allowed_domains', [])
+        if allowed_domains and not self._domain_matches(domain, allowed_domains):
             safe_logfire_info(f"Skipping URL due to allowed domains filter: {url} (domain: {domain})")
             return False
```
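For reference, the suffix-matching semantics being suggested can be exercised standalone. This is a minimal sketch; `normalize_domain` here stands in for the service's `_normalize_domain` helper:

```python
def normalize_domain(host: str) -> str:
    """Lowercase and strip a leading 'www.' from a hostname."""
    host = host.lower().strip()
    return host[4:] if host.startswith("www.") else host

def domain_matches(domain: str, patterns: list[str]) -> bool:
    """True if domain equals, or is a subdomain of, any pattern."""
    domain = normalize_domain(domain)
    for p in map(normalize_domain, patterns):
        # The "." prefix is what prevents notexample.com from matching example.com
        if domain == p or domain.endswith("." + p):
            return True
    return False
```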
168-196: Reduce log volume from per-URL skip messages

These `safe_logfire_info` calls will flood logs on large sitemaps/recursions. Consider downgrading to debug or sampling (e.g., first N per pattern/domain plus periodic counts).

archon-ui-main/src/components/knowledge-base/DocumentBrowser.tsx (3)
3-3: Remove unused import
`Filter` isn't used.

```diff
-import { Search, Filter, FileText, Globe, X } from 'lucide-react';
+import { Search, FileText, Globe, X } from 'lucide-react';
```
23-42: Robust domain extraction (PSL-aware)

Collapsing to the last two labels breaks on multi-part TLDs (e.g., `*.co.uk`). Prefer a PSL-aware util (e.g., `tldts`) and fall back cleanly.

```diff
-const extractDomain = (url: string): string => {
-  try {
-    const urlObj = new URL(url);
-    const hostname = urlObj.hostname;
-
-    // Remove 'www.' prefix if present
-    const withoutWww = hostname.startsWith('www.') ? hostname.slice(4) : hostname;
-
-    // For domains with subdomains, extract the main domain (last 2 parts)
-    const parts = withoutWww.split('.');
-    if (parts.length > 2) {
-      // Return the main domain (last 2 parts: domain.tld)
-      return parts.slice(-2).join('.');
-    }
-
-    return withoutWww;
-  } catch {
-    return url; // Return original if URL parsing fails
-  }
-};
+// npm i tldts
+import { getDomain } from 'tldts';
+const extractDomain = (url: string): string => {
+  try {
+    const d = getDomain(url);
+    if (d) return d;
+    const h = new URL(url).hostname.replace(/^www\./, '');
+    return h;
+  } catch {
+    return url;
+  }
+};
```
271-319: Surface load errors in the UI

`error` is tracked but never shown. Add a lightweight banner to help users recover.

```diff
 {/* Content */}
 <div className="flex-1 overflow-auto">
+  {error && (
+    <div className="mx-4 mt-4 mb-2 rounded border border-red-800 bg-red-900/30 text-red-300 text-sm px-3 py-2">
+      {error}
+    </div>
+  )}
   {loading ? (
```

archon-ui-main/src/components/knowledge-base/KnowledgeItemCard.tsx (1)
449-461: Use a button for accessibility + aria label

Clickable divs aren't keyboard-accessible. Swap to a button and add an aria-label.

```diff
-  {/* Page count - orange neon container (clickable for document browser) */}
-  <div
-    className="relative card-3d-layer-3 cursor-pointer"
+  {/* Page count - orange neon button (opens document browser) */}
+  <button
+    type="button"
+    aria-label="Browse document chunks"
+    className="relative card-3d-layer-3 cursor-pointer"
     onClick={(e) => {
       e.stopPropagation();
       if (onBrowseDocuments) {
         onBrowseDocuments(item.source_id);
       }
     }}
     onMouseEnter={() => setShowPageTooltip(true)}
     onMouseLeave={() => setShowPageTooltip(false)}
     title="Click to browse document chunks"
-  >
+  >
     <div className="flex items-center gap-1 px-2 py-1 bg-orange-500/20 border border-orange-500/40 rounded-full backdrop-blur-sm shadow-[0_0_15px_rgba(251,146,60,0.3)] hover:shadow-[0_0_20px_rgba(251,146,60,0.5)] transition-all duration-300">
       <FileText className="w-3 h-3 text-orange-400" />
       <span className="text-xs text-orange-400 font-medium">
         {Math.ceil(
           (item.metadata.word_count || 0) / 250,
         ).toLocaleString()}
       </span>
     </div>
@@
-  </div>
+  </button>
```

archon-ui-main/src/services/knowledgeBaseService.ts (2)
112-118: Harden non-JSON error handling

If the server returns non-JSON on errors, `response.json()` will throw and mask status context.

```diff
-  if (!response.ok) {
-    console.error(`❌ [KnowledgeBase] Response not OK: ${response.status} ${response.statusText}`);
-    const error = await response.json();
-    console.error(`❌ [KnowledgeBase] API error response:`, error);
-    throw new Error(error.error || `HTTP ${response.status}`);
-  }
+  if (!response.ok) {
+    console.error(`❌ [KnowledgeBase] Response not OK: ${response.status} ${response.statusText}`);
+    let msg = `HTTP ${response.status}`;
+    try {
+      const error = await response.json();
+      console.error(`❌ [KnowledgeBase] API error response:`, error);
+      msg = error.error || msg;
+    } catch {
+      const text = await response.text().catch(() => '');
+      if (text) console.error('❌ [KnowledgeBase] Error body (text):', text);
+    }
+    throw new Error(msg);
+  }
```
82-112: Gate verbose console logs behind environment flag

The current volume will spam consoles in production. Wrap logs with `if (import.meta.env.DEV)` or a `DEBUG_KB` flag.

Also applies to: 98-112
archon-ui-main/src/pages/KnowledgeBasePage.tsx (4)
1465-1501: Remove duplicate "Crawl Depth" block

There are two separate "Crawl Depth" UIs (above and inside the advanced section). Keep one to avoid confusion.

```diff
-  {/* Advanced Configuration Panel */}
-  {showAdvancedConfig && (
-    <div className="mb-6">
-      <label className="block text-gray-600 dark:text-zinc-400 text-sm mb-4">
-        Crawl Depth
-        ...
-      </label>
-      <GlassCrawlDepthSelector
-        value={crawlDepth}
-        onChange={setCrawlDepth}
-        showTooltip={showDepthTooltip}
-        onTooltipToggle={setShowDepthTooltip}
-      />
-    </div>
-  )}
+  {/* (Depth already configured above; remove duplicate selector) */}
```
647-651: Retry guard is helpful

Good early return when original params/URL aren't present. Consider persisting `originalCrawlParams` for new crawls too to enable full-featured retries.
24-43: Deduplicate extractDomain utility

The same domain extraction exists here and in DocumentBrowser. Consider centralizing it under `src/utils/url.ts` and importing from both places.

Also applies to: 1-9
15-16: Migrate away from crawlProgressService per team learning

Per retrieved learnings, prefer `useCrawlProgressPolling` from `usePolling.ts` (ETag, 304 handling, tab-visibility) over `crawlProgressService`. Plan a follow-up to swap in the hook.

We're referencing your saved learning for this repo to keep things consistent.

Also applies to: 173-176, 844-866
python/src/server/api_routes/knowledge_api.py (13)
20-21: Avoid mutable defaults: import Field for safe list defaults

Use `Field(default_factory=list)` in the models below to avoid shared mutable defaults.

Apply:

```diff
-from pydantic import BaseModel
+from pydantic import BaseModel, Field
```
57-64: Models: replace list defaults with Field(default_factory=list)

Prevents subtle shared-state bugs and aligns with Pydantic best practices.

```diff
 class KnowledgeItemRequest(BaseModel):
     url: str
     knowledge_type: str = "technical"
-    tags: list[str] = []
+    tags: list[str] = Field(default_factory=list)
     update_frequency: int = 7
     max_depth: int = 2  # Maximum crawl depth (1-5)
     extract_code_examples: bool = True  # Whether to extract code examples
```
78-84: CrawlConfig: safe list defaults

```diff
 class CrawlConfig(BaseModel):
     """Configuration for crawling domain and URL filtering"""
-    allowed_domains: list[str] = []  # Whitelist of domains to crawl
-    excluded_domains: list[str] = []  # Blacklist of domains to exclude
-    include_patterns: list[str] = []  # URL patterns to include (glob-style)
-    exclude_patterns: list[str] = []  # URL patterns to exclude (glob-style)
+    allowed_domains: list[str] = Field(default_factory=list)  # Whitelist of domains to crawl
+    excluded_domains: list[str] = Field(default_factory=list)  # Blacklist of domains to exclude
+    include_patterns: list[str] = Field(default_factory=list)  # URL patterns to include (glob-style)
+    exclude_patterns: list[str] = Field(default_factory=list)  # URL patterns to exclude (glob-style)
```
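The hazard the `default_factory` suggestion guards against is the classic Python shared-mutable-default pitfall. Strictly speaking, Pydantic deep-copies plain mutable defaults per instance, so `= []` is safe there; `default_factory` mainly makes the intent explicit. A dependency-free sketch of both sides using stdlib dataclasses (class name hypothetical):

```python
from dataclasses import dataclass, field

def bad_append(item, bucket=[]):
    """Classic pitfall: the default list is created once and shared across calls."""
    bucket.append(item)
    return bucket

@dataclass
class CrawlConfigSketch:
    # field(default_factory=list) gives every instance its own fresh list;
    # a bare `= []` default is rejected outright by dataclasses.
    allowed_domains: list = field(default_factory=list)
    excluded_domains: list = field(default_factory=list)
```

Calling `bad_append` twice accumulates into the same list, while each `CrawlConfigSketch()` starts with independent empty lists.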
85-92: CrawlRequest: consistent defaults and safe list default

- Match existing default knowledge_type ("technical") for consistency with v1.
- Use Field(default_factory=list) for tags.

```diff
 class CrawlRequest(BaseModel):
     url: str
-    knowledge_type: str = "general"
-    tags: list[str] = []
+    knowledge_type: str = "technical"
+    tags: list[str] = Field(default_factory=list)
     update_frequency: int = 7
     max_depth: int = 2  # Maximum crawl depth (1-5)
     crawl_config: Optional[CrawlConfig] = None  # Domain filtering configuration
```
231-266: Chunks endpoint: add pagination + deterministic ordering + non-blocking execute + case-insensitive domain match

Right now this can return very large payloads, block the event loop, and produce nondeterministic order. Pagination and to_thread mitigate load; ilike improves domain matching.

```diff
-@router.get("/knowledge-items/{source_id}/chunks")
-async def get_knowledge_item_chunks(source_id: str, domain_filter: str | None = None):
+from fastapi import Query
+
+@router.get("/knowledge-items/{source_id}/chunks")
+async def get_knowledge_item_chunks(
+    source_id: str,
+    domain_filter: str | None = Query(None, min_length=1, max_length=255),
+    page: int = Query(1, ge=1),
+    per_page: int = Query(200, ge=1, le=2000),
+    order_by: str = Query("id"),
+    direction: str = Query("asc"),
+):
@@
     query = supabase.from_("archon_crawled_pages").select("id, source_id, content, metadata, url")
     query = query.eq("source_id", source_id)
-
-    # Apply domain filtering if provided
-    if domain_filter:
-        query = query.like("url", f"%{domain_filter}%")
-
-    result = query.execute()
+    # Apply domain filtering if provided (case-insensitive)
+    if domain_filter:
+        query = query.ilike("url", f"%{domain_filter}%")
+
+    # Deterministic ordering
+    ascending = direction.lower() != "desc"
+    query = query.order(order_by, ascending=ascending)
+
+    # Pagination
+    start = (page - 1) * per_page
+    end = start + per_page - 1
+
+    # Run blocking HTTP call off the event loop
+    result = await asyncio.to_thread(lambda: query.range(start, end).execute())
     chunks = result.data if result.data else []
@@
-    "chunks": chunks,
-    "count": len(chunks),
+    "chunks": chunks,
+    "count": len(chunks),
+    "page": page,
+    "per_page": per_page,
```

If the UI expects all chunks at once, confirm it still works with defaults (page=1/per_page=200). We can wire up "Load more" later.
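The pagination arithmetic in the suggestion maps a 1-based page number onto the inclusive `(start, end)` offsets that Supabase's `.range()` expects. A quick sketch of just that calculation:

```python
def page_range(page: int, per_page: int) -> tuple[int, int]:
    """Inclusive (start, end) row offsets for a 1-based page number,
    matching the semantics of Supabase's .range(start, end)."""
    start = (page - 1) * per_page
    end = start + per_page - 1
    return start, end
```

Note that `end` is inclusive, so page 1 of 200 rows is `(0, 199)`, not `(0, 200)`; an off-by-one here silently duplicates or drops a row at every page boundary.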
268-299: Non-blocking DB call for code examples

Avoid blocking the event loop on Supabase HTTP calls.

```diff
-    result = (
-        supabase.from_("archon_code_examples")
-        .select("id, source_id, content, summary, metadata")
-        .eq("source_id", source_id)
-        .execute()
-    )
+    builder = (
+        supabase.from_("archon_code_examples")
+        .select("id, source_id, content, summary, metadata")
+        .eq("source_id", source_id)
+    )
+    result = await asyncio.to_thread(lambda: builder.execute())
```
639-641: Unify crawl_config serialization across v1/v2

Prefer model_dump when available to avoid future breakage.

```diff
-    if hasattr(request, 'crawl_config') and request.crawl_config:
-        request_dict["crawl_config"] = request.crawl_config.dict() if hasattr(request.crawl_config, 'dict') else request.crawl_config
+    if getattr(request, "crawl_config", None):
+        cc = request.crawl_config
+        request_dict["crawl_config"] = cc.model_dump() if hasattr(cc, "model_dump") else cc.dict()
```
125-135: Stubbed endpoint returns []; align with /rag/sources or deprecate

This duplicates "sources" functionality and may confuse clients.

```diff
 @router.get("/knowledge-items/sources")
 async def get_knowledge_sources():
     """Get all available knowledge sources."""
     try:
-        # Return empty list for now to pass the test
-        # In production, this would query the database
-        return []
+        service = KnowledgeItemService(get_supabase_client())
+        result = await service.get_available_sources()
+        return json.loads(result) if isinstance(result, str) else result
     except Exception as e:
         safe_logfire_error(f"Failed to get knowledge sources | error={str(e)}")
         raise HTTPException(status_code=500, detail={"error": str(e)})
```

Confirm if any frontend still calls this path; if not, consider removing it to avoid duplication with GET /api/rag/sources.
182-229: Duplicate delete endpoints; consolidate to a single canonical route

Both DELETE /knowledge-items/{source_id} and DELETE /sources/{source_id} delete the same data via SourceManagementService.
Consider keeping only DELETE /knowledge-items/{source_id} and making the other a 307 redirect or explicit deprecation with a sunset header.
Also applies to: 793-826
47-51: Make concurrency limit configurable

Hard-coding 3 can be restrictive in different deployments.

```diff
-crawl_semaphore = asyncio.Semaphore(CONCURRENT_CRAWL_LIMIT)
+import os
+CONCURRENT_CRAWL_LIMIT = int(os.getenv("ARCHON_CRAWL_CONCURRENCY_LIMIT", "3"))
+crawl_semaphore = asyncio.Semaphore(CONCURRENT_CRAWL_LIMIT)
```
844-870: Return UTC, timezone-aware timestamps in health response

Use timezone-aware UTC for consistency with other endpoints and logs.

```diff
-from datetime import datetime
+from datetime import datetime, timezone
@@
     return {
         "status": "migration_required",
         "service": "knowledge-api",
-        "timestamp": datetime.now().isoformat(),
+        "timestamp": datetime.now(timezone.utc).isoformat(),
@@
     result = {
         "status": "healthy",
         "service": "knowledge-api",
-        "timestamp": datetime.now().isoformat(),
+        "timestamp": datetime.now(timezone.utc).isoformat(),
     }
```

898-907: Emit UTC ISO timestamps on stop events

```diff
-    "timestamp": datetime.utcnow().isoformat(),
+    "timestamp": datetime.now(timezone.utc).isoformat(),
@@
-    "timestamp": datetime.utcnow().isoformat(),
+    "timestamp": datetime.now(timezone.utc).isoformat(),
```

Also applies to: 931-940
231-266: Indexing advice for fast domain filtering at scale

LIKE/ILIKE on url won't use a btree index. Add a GIN trigram index to keep this endpoint fast with large tables.
PostgreSQL (via Supabase):
- enable pg_trgm: CREATE EXTENSION IF NOT EXISTS pg_trgm;
- CREATE INDEX IF NOT EXISTS idx_crawled_pages_url_trgm ON archon_crawled_pages USING GIN (url gin_trgm_ops);
- CREATE INDEX IF NOT EXISTS idx_crawled_pages_source_id ON archon_crawled_pages (source_id);
Consider storing a normalized domain column and indexing it for exact filters.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- archon-ui-main/src/components/knowledge-base/DocumentBrowser.tsx (1 hunks)
- archon-ui-main/src/components/knowledge-base/KnowledgeItemCard.tsx (5 hunks)
- archon-ui-main/src/pages/KnowledgeBasePage.tsx (15 hunks)
- archon-ui-main/src/services/knowledgeBaseService.ts (2 hunks)
- python/src/server/api_routes/knowledge_api.py (1 hunks)
- python/src/server/services/crawling/crawling_service.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (9)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}
📄 CodeRabbit inference engine (CLAUDE.md)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality
Files:
- archon-ui-main/src/components/knowledge-base/DocumentBrowser.tsx
- archon-ui-main/src/components/knowledge-base/KnowledgeItemCard.tsx
- archon-ui-main/src/pages/KnowledgeBasePage.tsx
- python/src/server/services/crawling/crawling_service.py
- python/src/server/api_routes/knowledge_api.py
- archon-ui-main/src/services/knowledgeBaseService.ts
archon-ui-main/src/components/**
📄 CodeRabbit inference engine (CLAUDE.md)
Place reusable UI components in archon-ui-main/src/components/
Files:
- archon-ui-main/src/components/knowledge-base/DocumentBrowser.tsx
- archon-ui-main/src/components/knowledge-base/KnowledgeItemCard.tsx
archon-ui-main/src/pages/**
📄 CodeRabbit inference engine (CLAUDE.md)
Place main application pages in archon-ui-main/src/pages/
Files:
archon-ui-main/src/pages/KnowledgeBasePage.tsx
python/src/{server,mcp,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead
Files:
- python/src/server/services/crawling/crawling_service.py
- python/src/server/api_routes/knowledge_api.py
python/src/{server/services,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients
Files:
- python/src/server/services/crawling/crawling_service.py
- python/src/server/api_routes/knowledge_api.py
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit
Files:
- python/src/server/services/crawling/crawling_service.py
- python/src/server/api_routes/knowledge_api.py
python/src/server/**
📄 CodeRabbit inference engine (CLAUDE.md)
Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)
Files:
- python/src/server/services/crawling/crawling_service.py
- python/src/server/api_routes/knowledge_api.py
archon-ui-main/src/services/**
📄 CodeRabbit inference engine (CLAUDE.md)
Place API communication and business logic in archon-ui-main/src/services/
Files:
archon-ui-main/src/services/knowledgeBaseService.ts
🧠 Learnings (1)
📚 Learning: 2025-08-28T13:07:24.766Z
Learnt from: Wirasm
PR: coleam00/Archon#514
File: archon-ui-main/src/services/crawlProgressService.ts:35-39
Timestamp: 2025-08-28T13:07:24.766Z
Learning: The crawlProgressService.ts in the Archon codebase should be deprecated in favor of the existing useCrawlProgressPolling hook from usePolling.ts, which already includes ETag support, 304 handling, tab visibility detection, and proper React lifecycle integration. This consolidation reduces code duplication and improves performance.
Applied to files:
archon-ui-main/src/services/knowledgeBaseService.ts
🧬 Code graph analysis (5)
archon-ui-main/src/components/knowledge-base/DocumentBrowser.tsx (1)
archon-ui-main/src/services/knowledgeBaseService.ts (1)
knowledgeBaseService(310-310)
archon-ui-main/src/pages/KnowledgeBasePage.tsx (1)
archon-ui-main/src/components/knowledge-base/DocumentBrowser.tsx (1)
DocumentBrowser(44-325)
python/src/server/services/crawling/crawling_service.py (1)
python/src/server/config/logfire_config.py (1)
safe_logfire_info(223-235)
python/src/server/api_routes/knowledge_api.py (6)
python/src/server/services/search/rag_service.py (1)
- search_code_examples_service (278-379)

python/src/server/services/knowledge/knowledge_item_service.py (4)
- list_items (26-193)
- update_item (230-304)
- get_item (195-228)
- get_available_sources (306-336)

python/src/server/config/logfire_config.py (3)
- get_logger (136-146)
- safe_logfire_error (238-250)
- safe_logfire_info (223-235)

python/src/server/services/source_management_service.py (2)
- delete_source (430-498)
- get_available_sources (402-428)

python/src/server/services/crawling/crawling_service.py (5)
- set_progress_id (113-117)
- orchestrate_crawl (330-359)
- get_active_orchestration (57-59)
- unregister_orchestration (67-70)
- cancel (119-122)

python/src/server/main.py (2)
- main (360-379)
- _check_database_schema (283-346)
archon-ui-main/src/services/knowledgeBaseService.ts (1)
python/src/server/api_routes/knowledge_api.py (2)
- CrawlConfig (78-83)
- CrawlRequest (85-91)
🔇 Additional comments (8)
python/src/server/services/crawling/crawling_service.py (3)
389-394: Good: filter_config wired into orchestration

Building and logging the filter configuration up-front looks solid and keeps the crawl deterministic.

623-631: Nice: pre-filtering sitemap URLs

Filtering before batch crawling prevents unnecessary network I/O and keeps progress metrics truthful.

662-677: Good: gating non-sitemap start URL by filter

Early-exit on a filtered start URL avoids spinning up the recursion pipeline needlessly.
archon-ui-main/src/services/knowledgeBaseService.ts (2)
215-239: Chunk retrieval shape and query look good
API surface is typed, the query param matches the backend (domain_filter), and the response mapping is clear.
248-260: Endpoint selection LGTM
Routing crawl requests to v2 only when crawl_config is present is a clean, non-breaking upgrade path.
archon-ui-main/src/pages/KnowledgeBasePage.tsx (3)
74-77: DocumentBrowser wiring is clean
State lifecycles (open/close, sourceId) and modal mounting look correct.
Also applies to: 1180-1190
310-313: Good UX entry point
Simple, predictable handler to open the Document Browser from cards.
1296-1326: Nice helpers for domain/pattern lists
Parsing and deduping inputs on Enter/blur is tidy and user-friendly.
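The Enter/blur helpers praised above boil down to split, trim, and dedupe against the badges already shown. A minimal Python equivalent (the actual helpers are TypeScript in KnowledgeBasePage.tsx; the function name here is illustrative):

```python
def parse_list_input(raw: str, existing: list[str]) -> list[str]:
    """Split comma-separated input, trim whitespace, drop empties,
    and dedupe against entries already present (order preserved)."""
    candidates = [part.strip() for part in raw.split(",")]
    merged = list(existing)
    for item in candidates:
        if item and item not in merged:
            merged.append(item)
    return merged

print(parse_list_input("docs.example.com, blog.example.com, docs.example.com",
                       ["docs.example.com"]))
# ['docs.example.com', 'blog.example.com']
```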
```ts
const handleDomainChange = (domain: string) => {
  setSelectedDomain(domain);
  // Note: We could reload with server-side filtering, but for now we'll do client-side filtering
  // loadChunksWithDomainFilter(domain);
};
```
🛠️ Refactor suggestion
Use server-side domain filtering when a domain is selected
Leverage the backend to reduce payload and memory when users flip domains.
```diff
 const handleDomainChange = (domain: string) => {
   setSelectedDomain(domain);
-  // Note: We could reload with server-side filtering, but for now we'll do client-side filtering
-  // loadChunksWithDomainFilter(domain);
+  // Prefer server-side filtering for performance on large sources
+  loadChunksWithDomainFilter(domain);
+  // Reset selection to first item in filtered set
+  setSelectedChunkId(null);
 };
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```ts
const handleDomainChange = (domain: string) => {
  setSelectedDomain(domain);
  // Prefer server-side filtering for performance on large sources
  loadChunksWithDomainFilter(domain);
  // Reset selection to first item in filtered set
  setSelectedChunkId(null);
};
```
```tsx
{/* Advanced Domain Configuration - Only for URLs */}
{method === 'url' && (
  <div className="mb-6">
    <button
      type="button"
      onClick={() => setShowAdvancedConfig(!showAdvancedConfig)}
      className="flex items-center justify-between w-full p-3 text-left bg-gray-50 dark:bg-zinc-900 rounded-lg hover:bg-gray-100 dark:hover:bg-zinc-800 transition-colors"
    >
      <span className="text-sm font-medium text-gray-700 dark:text-gray-300">
        Advanced Domain Configuration
      </span>
      <ChevronDown className={`w-4 h-4 transition-transform ${showAdvancedConfig ? 'rotate-180' : ''}`} />
    </button>
```
Critical: method is undefined (compile/runtime error)
{method === 'url' && (...)} and accentColor={method === 'url' ? 'blue' : 'pink'} reference a non-existent variable. This will break the modal.
Apply both fixes:
```diff
- {/* Advanced Domain Configuration - Only for URLs */}
- {method === 'url' && (
+ {/* Advanced Domain Configuration */}
  <div className="mb-6">
@@
- <Button onClick={handleSubmit} variant="primary" accentColor={method === 'url' ? 'blue' : 'pink'} disabled={loading}>
+ <Button onClick={handleSubmit} variant="primary" accentColor="blue" disabled={loading}>
    {loading ? 'Adding...' : 'Add Source'}
  </Button>
```

And remove the trailing `)` corresponding to the deleted guard:

```diff
- )}
+ {/* end Advanced Domain Configuration */}
```

Also applies to: 1657-1659
```python
if request.crawl_config:
    request_dict["crawl_config"] = request.crawl_config.dict()
```
💡 Verification agent
🧩 Analysis chain
Pydantic v2 compatibility: use model_dump() instead of dict().
This path unconditionally calls .dict(), which breaks on Pydantic v2.
```diff
- if request.crawl_config:
-     request_dict["crawl_config"] = request.crawl_config.dict()
+ if request.crawl_config:
+     request_dict["crawl_config"] = (
+         request.crawl_config.model_dump()
+         if hasattr(request.crawl_config, "model_dump")
+         else request.crawl_config.dict()
+     )
```

🏁 Script executed:

```bash
#!/bin/bash
# Detect pydantic major version and any .dict() usages on BaseModel instances
rg -nP '\bfrom\s+pydantic\b' -C1
rg -nP '\.dict\(\)' python/src/server | sed -n '1,120p'
```

Length of output: 2960
Pydantic v2 compatibility: replace all .dict() calls with a model_dump() fallback
Two occurrences in python/src/server/api_routes/knowledge_api.py must be updated:
- around line 509
- around line 640
Use this pattern:
```python
request_dict["crawl_config"] = (
    request.crawl_config.model_dump()
    if hasattr(request.crawl_config, "model_dump")
    else request.crawl_config.dict()
)
```

🤖 Prompt for AI Agents
In python/src/server/api_routes/knowledge_api.py around lines 508-510 and again
around line 640, replace direct calls to request.crawl_config.dict() with a
Pydantic v2-compatible fallback that calls request.crawl_config.model_dump() if
available, otherwise falls back to .dict(); implement this by assigning
request_dict["crawl_config"] to request.crawl_config.model_dump() if
hasattr(request.crawl_config, "model_dump") else request.crawl_config.dict().
…dler (#537) - packages/core/src/clients/factory.ts: IAssistantClient is an interface used only as a return type, change to import type - packages/core/src/clients/claude.ts: IAssistantClient, MessageChunk, TokenUsage are all interface/type definitions, add type keyword to each - packages/core/src/clients/codex.ts: same as claude.ts - packages/core/src/handlers/command-handler.ts: Conversation and CommandResult are interfaces used only as type annotations, add inline type keywords while keeping ConversationNotFoundError as a value import (used in instanceof checks)
📋 Summary
This PR introduces a Document Browser component that allows users to browse and filter document chunks with advanced domain filtering capabilities. This branch contains only document browsing functionality - all upload features have been separated into a different branch.
Closes #545
✨ Features
Document Browser Modal
UI Integration
API Support
🔧 Technical Details
Components Added/Modified
API Endpoints
Service Methods
🎯 Scope
This PR focuses exclusively on document browsing. Document upload functionality has been intentionally removed and separated into the
feature/document-upload-system branch for independent development and review.
✅ Included:
❌ Not Included:
🧪 Testing
📸 Screenshots
The document browser provides an intuitive interface for exploring document chunks with powerful filtering capabilities, maintaining the same visual consistency as the rest of the application.
🤖 Generated with Claude Code
Summary by CodeRabbit