feat: Document browser with advanced domain filtering#537
Conversation
Frontend:
- Add DocumentBrowser component with domain filtering and search
- Add advanced domain configuration UI to AddKnowledgeModal
- Add "Browse Documents" button to KnowledgeItemCard
- Support comma-separated domain/pattern input with badges
- Make modal scrollable and improve UX
Backend:
- Add CrawlConfig model with domain/pattern filtering options
- Implement domain filtering logic with fnmatch pattern matching
- Add /knowledge-items/{source_id}/chunks endpoint for chunk browsing
- Add /knowledge-items/crawl-v2 endpoint with domain filtering support
- Filter URLs during crawling based on allowed/excluded domains and patterns
Features:
- Whitelist specific domains to crawl
- Blacklist domains to exclude
- Include/exclude URL patterns using glob-style matching
- Browse and search document chunks with domain filtering
- Collapsible advanced configuration section
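The glob-style include/exclude matching described above can be sketched with Python's stdlib `fnmatch`. This is a minimal illustration of the filtering semantics; the function and parameter names are illustrative, not the PR's actual implementation:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def url_allowed(url, allowed_domains=(), excluded_domains=(),
                include_patterns=(), exclude_patterns=()):
    """Return True if url passes the whitelist/blacklist and glob filters."""
    # Normalize the host: lowercase and strip a leading "www."
    domain = urlparse(url).netloc.lower().removeprefix("www.")
    if excluded_domains and domain in excluded_domains:
        return False  # blacklisted domain
    if allowed_domains and domain not in allowed_domains:
        return False  # whitelist active and domain not on it
    if exclude_patterns and any(fnmatch(url, p) for p in exclude_patterns):
        return False  # URL matches an exclusion glob
    if include_patterns and not any(fnmatch(url, p) for p in include_patterns):
        return False  # inclusion globs active and none match
    return True
```

With `include_patterns=("*/docs/*",)`, for example, only URLs whose path contains `/docs/` survive; note that `fnmatch`'s `*` matches across `/`, so patterns apply to the full URL string.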
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Walkthrough

Adds a DocumentBrowser modal to browse knowledge-base chunks with search and domain filtering, wires it into KnowledgeBasePage via KnowledgeItemCard's clickable page-count badge, introduces getKnowledgeItemChunks API on frontend and backend, adds crawl v2 with domain-filter configuration and URL filtering, and removes legacy document upload paths.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant K as KnowledgeItemCard
    participant P as KnowledgeBasePage
    participant DB as DocumentBrowser
    participant Svc as knowledgeBaseService
    participant API as GET /knowledge-items/{id}/chunks
    participant Store as Chunks Store
    U->>K: Click page-count badge
    K-->>P: onBrowseDocuments(sourceId)
    P->>DB: Open modal with {sourceId}
    DB->>Svc: getKnowledgeItemChunks(sourceId, domainFilter?)
    Svc->>API: GET chunks?domain_filter=...
    API-->>Svc: chunks[]
    Svc-->>DB: chunks[]
    DB->>Store: cache chunks, compute domains
    U->>DB: Search / select domain / pick chunk
    DB-->>U: Render filtered chunk content + metadata
```

```mermaid
sequenceDiagram
    autonumber
    participant UI as KnowledgeBasePage
    participant Svc as knowledgeBaseService.crawlUrl
    participant API as POST /crawl-v2
    participant Crawl as CrawlingService
    Note over UI,API: Crawl v2 with crawl_config
    UI->>Svc: crawlUrl({ url, crawl_config })
    alt crawl_config provided
        Svc->>API: POST /crawl-v2 (CrawlRequest)
        API->>Crawl: start crawl task (semaphore)
        Crawl->>Crawl: build filter_config<br/>apply URL filtering
        Crawl-->>API: progress updates (WS/events)
    else
        Svc->>API: POST /crawl (v1)
    end
    API-->>UI: completion signal
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Assessment against linked issues: Out-of-scope changes
- Add optional chaining for domains array mapping
- Add safety checks for filteredChunks and chunks arrays
- Remove unsafe HTML content replacement to prevent XSS
- Ensure component handles empty/undefined data gracefully
- Replace HTML option children with options prop
- Select component expects {value, label} objects array
- Fixes "Cannot read properties of undefined (reading 'map')" error
- Change from 'documents' to 'archon_crawled_pages' table
- Fixes 500 error when fetching document chunks
- Aligns with existing database schema naming convention
- Move DocumentBrowser from KnowledgeItemCard to KnowledgeBasePage level
- Add onBrowseDocuments callback prop to KnowledgeItemCard
- Fix modal rendering inside card container (z-index/stacking context issue)
- Now opens as a full-screen modal like other modals (CodeViewer, EditModal)
- Fix table name in chunks API from 'documents' to 'archon_crawled_pages'
- Add left sidebar with document list (like code examples list)
- Add right content area for selected document chunk
- Add click-to-select functionality for document chunks
- Auto-select first chunk when opening browser
- Match CodeViewer modal design pattern but in blue theme
- Show document preview in sidebar with domain badges
- Improve overall UX with familiar layout pattern
- Add visible blue-themed scrollbars to document list sidebar
- Add matching scrollbars to document content area
- Include both Firefox (scrollbar-width/color) and WebKit scrollbar support
- Inject custom CSS for cross-browser scrollbar styling
- Improve visual feedback for scrollable content areas
- Remove redundant green "Browse" button
- Make orange document count badge clickable to open DocumentBrowser
- Update tooltip to indicate clickable behavior
- Add hover effect to document count badge
- Improve scrollbar visibility with overflow-y-scroll
- More intuitive UX: click the document count to browse documents
- Add click outside modal to close (like CodeViewer)
- Add proper flex layout constraints with min-h-0 for scrolling
- Force scrollbars with overflow-y-scroll instead of auto
- Add flex-shrink-0 to header sections to prevent compression
- Ensure proper height calculations for scrollable containers
- Remove overflow-hidden and max-h constraints that blocked scrolling
- Simplify flex layout using h-full instead of min-h-0 conflicts
- Use overflow-y-auto pattern like CodeViewer (working reference)
- Remove custom scrollbar styling that interfered with functionality
- Follow code reviewer recommendations for proper height inheritance

Fixes: Scrollbars now functional for both document list and content areas
Root cause: CSS height constraint conflict between flex-1 and h-full
- flex-1 = take remaining space after other flex items
- h-full = be 100% of parent height
- Together they create competing height calculations

Solution: Remove h-full from main content container
- Let flex-1 handle height calculation naturally
- Allows scrollable areas to establish proper heights
- Enables functional scrolling in both sidebar and content
- Use createPortal for proper modal rendering outside component tree
- Replace Card component with direct divs like CodeViewer
- Copy exact scrolling pattern: h-[85vh] + overflow-hidden + overflow-y-auto
- Use nested h-full + overflow-auto structure for content area
- Match CodeViewer styling but with blue theme instead of pink
- Add click-outside-to-close functionality
- Remove complex flex constraints that blocked scrolling

This replicates the proven working scrolling pattern from CodeViewerModal.
- Remove document upload API endpoints and handlers
- Remove upload UI elements from KnowledgeBasePage modal
- Remove upload service methods from knowledgeBaseService
- Keep only URL crawling and document browser functionality
- Maintain DocumentBrowser component with domain filtering
- Preserve document chunks API for browsing existing documents
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/src/server/services/crawling/crawling_service.py (1)
294-327: Filter only applies after recursion; pass it into the recursive strategy to prune the frontier

Post-filtering works but wastes crawl bandwidth. Thread the `filter_func` into the underlying recursive strategy so links get pruned at discovery time. This materially reduces pages fetched and aligns with "server-side filtering recommended."

```diff
-        results = await self.recursive_strategy.crawl_recursive_with_progress(
-            start_urls,
-            self.url_handler.transform_github_url,
-            self.site_config.is_documentation_site,
-            max_depth,
-            max_concurrent,
-            progress_callback,
-            start_progress,
-            end_progress,
-        )
+        # Prefer passing filter_func into the strategy if supported to prune at discovery time
+        try:
+            results = await self.recursive_strategy.crawl_recursive_with_progress(
+                start_urls,
+                self.url_handler.transform_github_url,
+                self.site_config.is_documentation_site,
+                max_depth,
+                max_concurrent,
+                progress_callback,
+                start_progress,
+                end_progress,
+                filter_func=filter_func,
+            )
+        except TypeError:
+            # Backward-compat if strategy doesn't yet accept filter_func
+            results = await self.recursive_strategy.crawl_recursive_with_progress(
+                start_urls,
+                self.url_handler.transform_github_url,
+                self.site_config.is_documentation_site,
+                max_depth,
+                max_concurrent,
+                progress_callback,
+                start_progress,
+                end_progress,
+            )
```
🧹 Nitpick comments (25)
python/src/server/services/crawling/crawling_service.py (2)
133-199: Make domain checks subdomain-aware and harden extraction

Right now `allowed_domains`/`excluded_domains` require exact host matches (only `www.` is stripped). That will exclude legitimate subdomains (e.g., `docs.example.com` when `example.com` is allowed) and may mis-handle schemeless URLs. Recommend suffix-based matching and a safer extraction fallback.

Apply this focused change:

```diff
@@
     def _extract_domain(self, url: str) -> str:
         """Extract and normalize domain from URL."""
         try:
             parsed = urlparse(url)
-            return self._normalize_domain(parsed.netloc)
+            host = parsed.netloc or parsed.path.split('/')[0]
+            return self._normalize_domain(host)
         except Exception:
             return url.lower().strip()

+    def _domain_matches(self, domain: str, patterns: List[str]) -> bool:
+        """Return True if domain equals or is a subdomain of any pattern."""
+        for p in patterns:
+            p = self._normalize_domain(p)
+            if domain == p or domain.endswith("." + p):
+                return True
+        return False
@@
-        excluded_domains = filter_config.get('excluded_domains', [])
-        if excluded_domains and domain in excluded_domains:
+        excluded_domains = filter_config.get('excluded_domains', [])
+        if excluded_domains and self._domain_matches(domain, excluded_domains):
             safe_logfire_info(f"Skipping URL due to excluded domain: {url} (domain: {domain})")
             return False
@@
-        allowed_domains = filter_config.get('allowed_domains', [])
-        if allowed_domains and domain not in allowed_domains:
+        allowed_domains = filter_config.get('allowed_domains', [])
+        if allowed_domains and not self._domain_matches(domain, allowed_domains):
             safe_logfire_info(f"Skipping URL due to allowed domains filter: {url} (domain: {domain})")
             return False
```
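For reference, the suffix-matching semantics being suggested can be exercised standalone. This is a minimal sketch; `normalize_domain` here stands in for the service's `_normalize_domain` helper:

```python
def normalize_domain(host: str) -> str:
    """Lowercase and strip a leading 'www.' from a hostname."""
    host = host.lower().strip()
    return host[4:] if host.startswith("www.") else host

def domain_matches(domain: str, patterns: list[str]) -> bool:
    """True if domain equals, or is a subdomain of, any pattern."""
    domain = normalize_domain(domain)
    for p in map(normalize_domain, patterns):
        # The "." prefix is what prevents notexample.com from matching example.com
        if domain == p or domain.endswith("." + p):
            return True
    return False
```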
168-196: Reduce log volume from per-URL skip messages

These `safe_logfire_info` calls will flood logs on large sitemaps/recursions. Consider downgrading to debug or sampling (e.g., first N per pattern/domain plus periodic counts).

archon-ui-main/src/components/knowledge-base/DocumentBrowser.tsx (3)
3-3: Remove unused import
`Filter` isn't used.

```diff
-import { Search, Filter, FileText, Globe, X } from 'lucide-react';
+import { Search, FileText, Globe, X } from 'lucide-react';
```
23-42: Robust domain extraction (PSL-aware)

Collapsing to the last two labels breaks on multi-part TLDs (e.g., `*.co.uk`). Prefer a PSL-aware util (e.g., `tldts`) and fall back cleanly.

```diff
-const extractDomain = (url: string): string => {
-  try {
-    const urlObj = new URL(url);
-    const hostname = urlObj.hostname;
-
-    // Remove 'www.' prefix if present
-    const withoutWww = hostname.startsWith('www.') ? hostname.slice(4) : hostname;
-
-    // For domains with subdomains, extract the main domain (last 2 parts)
-    const parts = withoutWww.split('.');
-    if (parts.length > 2) {
-      // Return the main domain (last 2 parts: domain.tld)
-      return parts.slice(-2).join('.');
-    }
-
-    return withoutWww;
-  } catch {
-    return url; // Return original if URL parsing fails
-  }
-};
+// npm i tldts
+import { getDomain } from 'tldts';
+const extractDomain = (url: string): string => {
+  try {
+    const d = getDomain(url);
+    if (d) return d;
+    const h = new URL(url).hostname.replace(/^www\./, '');
+    return h;
+  } catch {
+    return url;
+  }
+};
```
271-319: Surface load errors in the UI

`error` is tracked but never shown. Add a lightweight banner to help users recover.

```diff
 {/* Content */}
 <div className="flex-1 overflow-auto">
+  {error && (
+    <div className="mx-4 mt-4 mb-2 rounded border border-red-800 bg-red-900/30 text-red-300 text-sm px-3 py-2">
+      {error}
+    </div>
+  )}
   {loading ? (
```

archon-ui-main/src/components/knowledge-base/KnowledgeItemCard.tsx (1)
449-461: Use a button for accessibility + aria label

Clickable divs aren't keyboard-accessible. Swap to a button and add an aria-label.

```diff
-  {/* Page count - orange neon container (clickable for document browser) */}
-  <div
-    className="relative card-3d-layer-3 cursor-pointer"
+  {/* Page count - orange neon button (opens document browser) */}
+  <button
+    type="button"
+    aria-label="Browse document chunks"
+    className="relative card-3d-layer-3 cursor-pointer"
     onClick={(e) => {
       e.stopPropagation();
       if (onBrowseDocuments) {
         onBrowseDocuments(item.source_id);
       }
     }}
     onMouseEnter={() => setShowPageTooltip(true)}
     onMouseLeave={() => setShowPageTooltip(false)}
     title="Click to browse document chunks"
-  >
+  >
     <div className="flex items-center gap-1 px-2 py-1 bg-orange-500/20 border border-orange-500/40 rounded-full backdrop-blur-sm shadow-[0_0_15px_rgba(251,146,60,0.3)] hover:shadow-[0_0_20px_rgba(251,146,60,0.5)] transition-all duration-300">
       <FileText className="w-3 h-3 text-orange-400" />
       <span className="text-xs text-orange-400 font-medium">
         {Math.ceil(
           (item.metadata.word_count || 0) / 250,
         ).toLocaleString()}
       </span>
     </div>
@@
-  </div>
+  </button>
```

archon-ui-main/src/services/knowledgeBaseService.ts (2)
112-118: Harden non-JSON error handling

If the server returns non-JSON on errors, `response.json()` will throw and mask status context.

```diff
-  if (!response.ok) {
-    console.error(`❌ [KnowledgeBase] Response not OK: ${response.status} ${response.statusText}`);
-    const error = await response.json();
-    console.error(`❌ [KnowledgeBase] API error response:`, error);
-    throw new Error(error.error || `HTTP ${response.status}`);
-  }
+  if (!response.ok) {
+    console.error(`❌ [KnowledgeBase] Response not OK: ${response.status} ${response.statusText}`);
+    let msg = `HTTP ${response.status}`;
+    try {
+      const error = await response.json();
+      console.error(`❌ [KnowledgeBase] API error response:`, error);
+      msg = error.error || msg;
+    } catch {
+      const text = await response.text().catch(() => '');
+      if (text) console.error('❌ [KnowledgeBase] Error body (text):', text);
+    }
+    throw new Error(msg);
+  }
```
82-112: Gate verbose console logs behind environment flag

The current volume will spam consoles in production. Wrap logs with `if (import.meta.env.DEV)` or a `DEBUG_KB` flag.

Also applies to: 98-112
archon-ui-main/src/pages/KnowledgeBasePage.tsx (4)
1465-1501: Remove duplicate "Crawl Depth" block

There are two separate "Crawl Depth" UIs (above and inside the advanced section). Keep one to avoid confusion.

```diff
-  {/* Advanced Configuration Panel */}
-  {showAdvancedConfig && (
-    <div className="mb-6">
-      <label className="block text-gray-600 dark:text-zinc-400 text-sm mb-4">
-        Crawl Depth
-        ...
-      </label>
-      <GlassCrawlDepthSelector
-        value={crawlDepth}
-        onChange={setCrawlDepth}
-        showTooltip={showDepthTooltip}
-        onTooltipToggle={setShowDepthTooltip}
-      />
-    </div>
-  )}
+  {/* (Depth already configured above; remove duplicate selector) */}
```
647-651: Retry guard is helpful

Good early return when original params/URL aren't present. Consider persisting `originalCrawlParams` for new crawls too to enable full-featured retries.
24-43: Deduplicate extractDomain utility

The same domain extraction exists here and in DocumentBrowser. Consider centralizing it under `src/utils/url.ts` and importing from both places.

Also applies to: 1-9
15-16: Migrate away from crawlProgressService per team learning

Per retrieved learnings, prefer `useCrawlProgressPolling` from `usePolling.ts` (ETag, 304 handling, tab-visibility) over `crawlProgressService`. Plan a follow-up to swap in the hook.

We're referencing your saved learning for this repo to keep things consistent.

Also applies to: 173-176, 844-866
python/src/server/api_routes/knowledge_api.py (13)
20-21: Avoid mutable defaults: import Field for safe list defaults

Use `Field(default_factory=list)` in the models below to avoid shared mutable defaults.

Apply:

```diff
-from pydantic import BaseModel
+from pydantic import BaseModel, Field
```
57-64: Models: replace list defaults with Field(default_factory=list)

Prevents subtle shared-state bugs and aligns with Pydantic best practices.

```diff
 class KnowledgeItemRequest(BaseModel):
     url: str
     knowledge_type: str = "technical"
-    tags: list[str] = []
+    tags: list[str] = Field(default_factory=list)
     update_frequency: int = 7
     max_depth: int = 2  # Maximum crawl depth (1-5)
     extract_code_examples: bool = True  # Whether to extract code examples
```
78-84: CrawlConfig: safe list defaults

```diff
 class CrawlConfig(BaseModel):
     """Configuration for crawling domain and URL filtering"""
-    allowed_domains: list[str] = []  # Whitelist of domains to crawl
-    excluded_domains: list[str] = []  # Blacklist of domains to exclude
-    include_patterns: list[str] = []  # URL patterns to include (glob-style)
-    exclude_patterns: list[str] = []  # URL patterns to exclude (glob-style)
+    allowed_domains: list[str] = Field(default_factory=list)  # Whitelist of domains to crawl
+    excluded_domains: list[str] = Field(default_factory=list)  # Blacklist of domains to exclude
+    include_patterns: list[str] = Field(default_factory=list)  # URL patterns to include (glob-style)
+    exclude_patterns: list[str] = Field(default_factory=list)  # URL patterns to exclude (glob-style)
```
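The hazard the `default_factory` suggestion guards against is the classic Python shared-mutable-default pitfall. Strictly speaking, Pydantic deep-copies plain mutable defaults per instance, so `= []` is safe there; `default_factory` mainly makes the intent explicit. A dependency-free sketch of both sides using stdlib dataclasses (class name hypothetical):

```python
from dataclasses import dataclass, field

def bad_append(item, bucket=[]):
    """Classic pitfall: the default list is created once and shared across calls."""
    bucket.append(item)
    return bucket

@dataclass
class CrawlConfigSketch:
    # field(default_factory=list) gives every instance its own fresh list;
    # a bare `= []` default is rejected outright by dataclasses.
    allowed_domains: list = field(default_factory=list)
    excluded_domains: list = field(default_factory=list)
```

Calling `bad_append` twice accumulates into the same list, while each `CrawlConfigSketch()` starts with independent empty lists.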
85-92: CrawlRequest: consistent defaults and safe list default

- Match existing default knowledge_type ("technical") for consistency with v1.
- Use Field(default_factory=list) for tags.

```diff
 class CrawlRequest(BaseModel):
     url: str
-    knowledge_type: str = "general"
-    tags: list[str] = []
+    knowledge_type: str = "technical"
+    tags: list[str] = Field(default_factory=list)
     update_frequency: int = 7
     max_depth: int = 2  # Maximum crawl depth (1-5)
     crawl_config: Optional[CrawlConfig] = None  # Domain filtering configuration
```
231-266: Chunks endpoint: add pagination + deterministic ordering + non-blocking execute + case-insensitive domain match

Right now this can return very large payloads, block the event loop, and produce nondeterministic order. Pagination and to_thread mitigate load; ilike improves domain matching.

```diff
-@router.get("/knowledge-items/{source_id}/chunks")
-async def get_knowledge_item_chunks(source_id: str, domain_filter: str | None = None):
+from fastapi import Query
+
+@router.get("/knowledge-items/{source_id}/chunks")
+async def get_knowledge_item_chunks(
+    source_id: str,
+    domain_filter: str | None = Query(None, min_length=1, max_length=255),
+    page: int = Query(1, ge=1),
+    per_page: int = Query(200, ge=1, le=2000),
+    order_by: str = Query("id"),
+    direction: str = Query("asc"),
+):
@@
     query = supabase.from_("archon_crawled_pages").select("id, source_id, content, metadata, url")
     query = query.eq("source_id", source_id)
-
-    # Apply domain filtering if provided
-    if domain_filter:
-        query = query.like("url", f"%{domain_filter}%")
-
-    result = query.execute()
+    # Apply domain filtering if provided (case-insensitive)
+    if domain_filter:
+        query = query.ilike("url", f"%{domain_filter}%")
+
+    # Deterministic ordering
+    ascending = direction.lower() != "desc"
+    query = query.order(order_by, ascending=ascending)
+
+    # Pagination
+    start = (page - 1) * per_page
+    end = start + per_page - 1
+
+    # Run blocking HTTP call off the event loop
+    result = await asyncio.to_thread(lambda: query.range(start, end).execute())
     chunks = result.data if result.data else []
@@
-    "chunks": chunks,
-    "count": len(chunks),
+    "chunks": chunks,
+    "count": len(chunks),
+    "page": page,
+    "per_page": per_page,
```

If the UI expects all chunks at once, confirm it still works with defaults (page=1/per_page=200). We can wire up "Load more" later.
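The pagination arithmetic in the suggestion maps a 1-based page number onto the inclusive `(start, end)` offsets that Supabase's `.range()` expects. A quick sketch of just that calculation:

```python
def page_range(page: int, per_page: int) -> tuple[int, int]:
    """Inclusive (start, end) row offsets for a 1-based page number,
    matching the semantics of Supabase's .range(start, end)."""
    start = (page - 1) * per_page
    end = start + per_page - 1
    return start, end
```

Note that `end` is inclusive, so page 1 of 200 rows is `(0, 199)`, not `(0, 200)`; an off-by-one here silently duplicates or drops a row at every page boundary.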
268-299: Non-blocking DB call for code examples

Avoid blocking the event loop on Supabase HTTP calls.

```diff
-    result = (
-        supabase.from_("archon_code_examples")
-        .select("id, source_id, content, summary, metadata")
-        .eq("source_id", source_id)
-        .execute()
-    )
+    builder = (
+        supabase.from_("archon_code_examples")
+        .select("id, source_id, content, summary, metadata")
+        .eq("source_id", source_id)
+    )
+    result = await asyncio.to_thread(lambda: builder.execute())
```
639-641: Unify crawl_config serialization across v1/v2

Prefer model_dump when available to avoid future breakage.

```diff
-    if hasattr(request, 'crawl_config') and request.crawl_config:
-        request_dict["crawl_config"] = request.crawl_config.dict() if hasattr(request.crawl_config, 'dict') else request.crawl_config
+    if getattr(request, "crawl_config", None):
+        cc = request.crawl_config
+        request_dict["crawl_config"] = cc.model_dump() if hasattr(cc, "model_dump") else cc.dict()
```
125-135: Stubbed endpoint returns []; align with /rag/sources or deprecate

This duplicates "sources" functionality and may confuse clients.

```diff
 @router.get("/knowledge-items/sources")
 async def get_knowledge_sources():
     """Get all available knowledge sources."""
     try:
-        # Return empty list for now to pass the test
-        # In production, this would query the database
-        return []
+        service = KnowledgeItemService(get_supabase_client())
+        result = await service.get_available_sources()
+        return json.loads(result) if isinstance(result, str) else result
     except Exception as e:
         safe_logfire_error(f"Failed to get knowledge sources | error={str(e)}")
         raise HTTPException(status_code=500, detail={"error": str(e)})
```

Confirm if any frontend still calls this path; if not, consider removing it to avoid duplication with GET /api/rag/sources.
182-229: Duplicate delete endpoints; consolidate to a single canonical route

Both DELETE /knowledge-items/{source_id} and DELETE /sources/{source_id} delete the same data via SourceManagementService.
Consider keeping only DELETE /knowledge-items/{source_id} and making the other a 307 redirect or explicit deprecation with a sunset header.
Also applies to: 793-826
47-51: Make concurrency limit configurable

Hard-coding 3 can be restrictive in different deployments.

```diff
-crawl_semaphore = asyncio.Semaphore(CONCURRENT_CRAWL_LIMIT)
+import os
+CONCURRENT_CRAWL_LIMIT = int(os.getenv("ARCHON_CRAWL_CONCURRENCY_LIMIT", "3"))
+crawl_semaphore = asyncio.Semaphore(CONCURRENT_CRAWL_LIMIT)
```
844-870: Return UTC, timezone-aware timestamps in health response

Use timezone-aware UTC for consistency with other endpoints and logs.

```diff
-from datetime import datetime
+from datetime import datetime, timezone
@@
     return {
         "status": "migration_required",
         "service": "knowledge-api",
-        "timestamp": datetime.now().isoformat(),
+        "timestamp": datetime.now(timezone.utc).isoformat(),
@@
     result = {
         "status": "healthy",
         "service": "knowledge-api",
-        "timestamp": datetime.now().isoformat(),
+        "timestamp": datetime.now(timezone.utc).isoformat(),
     }
```

898-907: Emit UTC ISO timestamps on stop events

```diff
-    "timestamp": datetime.utcnow().isoformat(),
+    "timestamp": datetime.now(timezone.utc).isoformat(),
@@
-    "timestamp": datetime.utcnow().isoformat(),
+    "timestamp": datetime.now(timezone.utc).isoformat(),
```

Also applies to: 931-940
231-266: Indexing advice for fast domain filtering at scale

LIKE/ILIKE on url won't use a btree index. Add a GIN trigram index to keep this endpoint fast with large tables.
PostgreSQL (via Supabase):
- enable pg_trgm: CREATE EXTENSION IF NOT EXISTS pg_trgm;
- CREATE INDEX IF NOT EXISTS idx_crawled_pages_url_trgm ON archon_crawled_pages USING GIN (url gin_trgm_ops);
- CREATE INDEX IF NOT EXISTS idx_crawled_pages_source_id ON archon_crawled_pages (source_id);
Consider storing a normalized domain column and indexing it for exact filters.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- archon-ui-main/src/components/knowledge-base/DocumentBrowser.tsx (1 hunks)
- archon-ui-main/src/components/knowledge-base/KnowledgeItemCard.tsx (5 hunks)
- archon-ui-main/src/pages/KnowledgeBasePage.tsx (15 hunks)
- archon-ui-main/src/services/knowledgeBaseService.ts (2 hunks)
- python/src/server/api_routes/knowledge_api.py (1 hunks)
- python/src/server/services/crawling/crawling_service.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (9)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}
📄 CodeRabbit inference engine (CLAUDE.md)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality
Files:
- archon-ui-main/src/components/knowledge-base/DocumentBrowser.tsx
- archon-ui-main/src/components/knowledge-base/KnowledgeItemCard.tsx
- archon-ui-main/src/pages/KnowledgeBasePage.tsx
- python/src/server/services/crawling/crawling_service.py
- python/src/server/api_routes/knowledge_api.py
- archon-ui-main/src/services/knowledgeBaseService.ts
archon-ui-main/src/components/**
📄 CodeRabbit inference engine (CLAUDE.md)
Place reusable UI components in archon-ui-main/src/components/
Files:
- archon-ui-main/src/components/knowledge-base/DocumentBrowser.tsx
- archon-ui-main/src/components/knowledge-base/KnowledgeItemCard.tsx
archon-ui-main/src/pages/**
📄 CodeRabbit inference engine (CLAUDE.md)
Place main application pages in archon-ui-main/src/pages/
Files:
archon-ui-main/src/pages/KnowledgeBasePage.tsx
python/src/{server,mcp,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead
Files:
- python/src/server/services/crawling/crawling_service.py
- python/src/server/api_routes/knowledge_api.py
python/src/{server/services,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients
Files:
- python/src/server/services/crawling/crawling_service.py
- python/src/server/api_routes/knowledge_api.py
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit
Files:
- python/src/server/services/crawling/crawling_service.py
- python/src/server/api_routes/knowledge_api.py
python/src/server/**
📄 CodeRabbit inference engine (CLAUDE.md)
Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)
Files:
- python/src/server/services/crawling/crawling_service.py
- python/src/server/api_routes/knowledge_api.py
archon-ui-main/src/services/**
📄 CodeRabbit inference engine (CLAUDE.md)
Place API communication and business logic in archon-ui-main/src/services/
Files:
archon-ui-main/src/services/knowledgeBaseService.ts
🧠 Learnings (1)
📚 Learning: 2025-08-28T13:07:24.766Z
Learnt from: Wirasm
PR: coleam00/Archon#514
File: archon-ui-main/src/services/crawlProgressService.ts:35-39
Timestamp: 2025-08-28T13:07:24.766Z
Learning: The crawlProgressService.ts in the Archon codebase should be deprecated in favor of the existing useCrawlProgressPolling hook from usePolling.ts, which already includes ETag support, 304 handling, tab visibility detection, and proper React lifecycle integration. This consolidation reduces code duplication and improves performance.
Applied to files:
archon-ui-main/src/services/knowledgeBaseService.ts
🧬 Code graph analysis (5)
archon-ui-main/src/components/knowledge-base/DocumentBrowser.tsx (1)
archon-ui-main/src/services/knowledgeBaseService.ts (1)
knowledgeBaseService(310-310)
archon-ui-main/src/pages/KnowledgeBasePage.tsx (1)
archon-ui-main/src/components/knowledge-base/DocumentBrowser.tsx (1)
DocumentBrowser(44-325)
python/src/server/services/crawling/crawling_service.py (1)
python/src/server/config/logfire_config.py (1)
safe_logfire_info(223-235)
python/src/server/api_routes/knowledge_api.py (6)
python/src/server/services/search/rag_service.py (1)
- search_code_examples_service (278-379)

python/src/server/services/knowledge/knowledge_item_service.py (4)
- list_items (26-193)
- update_item (230-304)
- get_item (195-228)
- get_available_sources (306-336)

python/src/server/config/logfire_config.py (3)
- get_logger (136-146)
- safe_logfire_error (238-250)
- safe_logfire_info (223-235)

python/src/server/services/source_management_service.py (2)
- delete_source (430-498)
- get_available_sources (402-428)

python/src/server/services/crawling/crawling_service.py (5)
- set_progress_id (113-117)
- orchestrate_crawl (330-359)
- get_active_orchestration (57-59)
- unregister_orchestration (67-70)
- cancel (119-122)

python/src/server/main.py (2)
- main (360-379)
- _check_database_schema (283-346)
archon-ui-main/src/services/knowledgeBaseService.ts (1)
python/src/server/api_routes/knowledge_api.py (2)
- CrawlConfig (78-83)
- CrawlRequest (85-91)
🔇 Additional comments (8)
python/src/server/services/crawling/crawling_service.py (3)
389-394: Good: filter_config wired into orchestration

Building and logging the filter configuration up-front looks solid and keeps the crawl deterministic.

623-631: Nice: pre-filtering sitemap URLs

Filtering before batch crawling prevents unnecessary network I/O and keeps progress metrics truthful.

662-677: Good: gating non-sitemap start URL by filter

Early-exit on a filtered start URL avoids spinning up the recursion pipeline needlessly.
archon-ui-main/src/services/knowledgeBaseService.ts (2)
215-239: Chunk retrieval shape and query look good
API surface is typed, the query param matches the backend (domain_filter), and the response mapping is clear.
248-260: Endpoint selection LGTM
Routing crawl requests to v2 only when crawl_config is present is a clean, non-breaking upgrade path.
archon-ui-main/src/pages/KnowledgeBasePage.tsx (3)
74-77: DocumentBrowser wiring is clean
State lifecycles (open/close, sourceId) and modal mounting look correct.
Also applies to: 1180-1190
310-313: Good UX entry point
Simple, predictable handler to open the Document Browser from cards.
1296-1326: Nice helpers for domain/pattern lists
Parsing and deduping inputs on Enter/blur is tidy and user-friendly.
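The Enter/blur helpers praised above boil down to split, trim, and dedupe against the badges already shown. A minimal Python equivalent (the actual helpers are TypeScript in KnowledgeBasePage.tsx; the function name here is illustrative):

```python
def parse_list_input(raw: str, existing: list[str]) -> list[str]:
    """Split comma-separated input, trim whitespace, drop empties,
    and dedupe against entries already present (order preserved)."""
    candidates = [part.strip() for part in raw.split(",")]
    merged = list(existing)
    for item in candidates:
        if item and item not in merged:
            merged.append(item)
    return merged

print(parse_list_input("docs.example.com, blog.example.com, docs.example.com",
                       ["docs.example.com"]))
# ['docs.example.com', 'blog.example.com']
```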
```ts
const handleDomainChange = (domain: string) => {
  setSelectedDomain(domain);
  // Note: We could reload with server-side filtering, but for now we'll do client-side filtering
  // loadChunksWithDomainFilter(domain);
};
```
🛠️ Refactor suggestion
Use server-side domain filtering when a domain is selected
Leverage the backend to reduce payload and memory when users flip domains.
```diff
 const handleDomainChange = (domain: string) => {
   setSelectedDomain(domain);
-  // Note: We could reload with server-side filtering, but for now we'll do client-side filtering
-  // loadChunksWithDomainFilter(domain);
+  // Prefer server-side filtering for performance on large sources
+  loadChunksWithDomainFilter(domain);
+  // Reset selection to first item in filtered set
+  setSelectedChunkId(null);
 };
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```ts
const handleDomainChange = (domain: string) => {
  setSelectedDomain(domain);
  // Prefer server-side filtering for performance on large sources
  loadChunksWithDomainFilter(domain);
  // Reset selection to first item in filtered set
  setSelectedChunkId(null);
};
```
```tsx
{/* Advanced Domain Configuration - Only for URLs */}
{method === 'url' && (
  <div className="mb-6">
    <button
      type="button"
      onClick={() => setShowAdvancedConfig(!showAdvancedConfig)}
      className="flex items-center justify-between w-full p-3 text-left bg-gray-50 dark:bg-zinc-900 rounded-lg hover:bg-gray-100 dark:hover:bg-zinc-800 transition-colors"
    >
      <span className="text-sm font-medium text-gray-700 dark:text-gray-300">
        Advanced Domain Configuration
      </span>
      <ChevronDown className={`w-4 h-4 transition-transform ${showAdvancedConfig ? 'rotate-180' : ''}`} />
    </button>
```
Critical: method is undefined (compile/runtime error)
{method === 'url' && (...)} and accentColor={method === 'url' ? 'blue' : 'pink'} reference a non-existent variable. This will break the modal.
Apply both fixes:
```diff
- {/* Advanced Domain Configuration - Only for URLs */}
- {method === 'url' && (
+ {/* Advanced Domain Configuration */}
  <div className="mb-6">
@@
- <Button onClick={handleSubmit} variant="primary" accentColor={method === 'url' ? 'blue' : 'pink'} disabled={loading}>
+ <Button onClick={handleSubmit} variant="primary" accentColor="blue" disabled={loading}>
    {loading ? 'Adding...' : 'Add Source'}
  </Button>
```

And remove the trailing `)` corresponding to the deleted guard:

```diff
- )}
+ {/* end Advanced Domain Configuration */}
```

Also applies to: 1657-1659
```python
if request.crawl_config:
    request_dict["crawl_config"] = request.crawl_config.dict()
```
💡 Verification agent
🧩 Analysis chain
Pydantic v2 compatibility: use model_dump() instead of dict().
This path unconditionally calls .dict(), which breaks on Pydantic v2.
```diff
- if request.crawl_config:
-     request_dict["crawl_config"] = request.crawl_config.dict()
+ if request.crawl_config:
+     request_dict["crawl_config"] = (
+         request.crawl_config.model_dump()
+         if hasattr(request.crawl_config, "model_dump")
+         else request.crawl_config.dict()
+     )
```

🏁 Script executed:

```bash
#!/bin/bash
# Detect pydantic major version and any .dict() usages on BaseModel instances
rg -nP '\bfrom\s+pydantic\b' -C1
rg -nP '\.dict\(\)' python/src/server | sed -n '1,120p'
```

Length of output: 2960
Pydantic v2 compatibility: replace all .dict() calls with a model_dump() fallback
Two occurrences in python/src/server/api_routes/knowledge_api.py must be updated:
- around line 509
- around line 640
Use this pattern:
```python
request_dict["crawl_config"] = (
    request.crawl_config.model_dump()
    if hasattr(request.crawl_config, "model_dump")
    else request.crawl_config.dict()
)
```

🤖 Prompt for AI Agents
In python/src/server/api_routes/knowledge_api.py around lines 508-510 and again
around line 640, replace direct calls to request.crawl_config.dict() with a
Pydantic v2-compatible fallback that calls request.crawl_config.model_dump() if
available, otherwise falls back to .dict(); implement this by assigning
request_dict["crawl_config"] to request.crawl_config.model_dump() if
hasattr(request.crawl_config, "model_dump") else request.crawl_config.dict().
…dler (#537) - packages/core/src/clients/factory.ts: IAssistantClient is an interface used only as a return type, change to import type - packages/core/src/clients/claude.ts: IAssistantClient, MessageChunk, TokenUsage are all interface/type definitions, add type keyword to each - packages/core/src/clients/codex.ts: same as claude.ts - packages/core/src/handlers/command-handler.ts: Conversation and CommandResult are interfaces used only as type annotations, add inline type keywords while keeping ConversationNotFoundError as a value import (used in instanceof checks)
📋 Summary
This PR introduces a Document Browser component that allows users to browse and filter document chunks with advanced domain filtering capabilities. This branch contains only document browsing functionality - all upload features have been separated into a different branch.
Closes #545
✨ Features
Document Browser Modal
UI Integration
API Support
🔧 Technical Details
Components Added/Modified
API Endpoints
Service Methods
🎯 Scope
This PR focuses exclusively on document browsing. Document upload functionality has been intentionally removed and separated into the
feature/document-upload-system branch for independent development and review.
✅ Included:
❌ Not Included:
🧪 Testing
📸 Screenshots
The document browser provides an intuitive interface for exploring document chunks with powerful filtering capabilities, maintaining the same visual consistency as the rest of the application.
🤖 Generated with Claude Code
Summary by CodeRabbit