
feat: Advanced web crawling with domain filtering#728

Closed
leex279 wants to merge 22 commits into main from feature/advanced-crawl-domain-filtering

Conversation

leex279 (Collaborator) commented Sep 22, 2025

Summary

This PR implements Advanced Web Crawling with Domain Filtering capabilities for Archon's knowledge management system, along with comprehensive Edit Crawler Configuration and Enhanced Metadata Viewing features.

2025-09-22_12h59_12.mp4
2025-09-22_12h59_52.mp4
2025-09-22_13h00_19.mp4

New Features Added

🔧 Edit Crawler Configuration

  • Edit existing crawler settings: Users can now modify any existing crawler configuration including URL, knowledge type, max depth, tags, and advanced domain filtering settings
  • Edit Configuration menu item: Added to knowledge card actions dropdown, positioned between View options and Recrawl
  • Recrawl warning: Clear warning message that saving changes will trigger a recrawl and replace existing documents
  • Complete configuration editing: Reuses the same AdvancedCrawlConfig component from the add dialog for consistency
  • Progress tracking: Full integration with existing progress tracking system for recrawl operations

📊 Enhanced Metadata Viewing

  • Complete metadata display: Fixed to show ALL metadata properties from backend (previously only showed 3-5 fields)
  • Improved viewing area: Increased metadata panel height from 256px to 500px with proper scrolling
  • Full JSON display: Now shows complete metadata including:
    • url, source, headers, filename
    • has_code, has_links, source_id
    • char_count, word_count, line_count
    • chunk_size, chunk_index
    • source_type, knowledge_type
    • All other backend metadata fields
  • Better UX: Visible scrollbar and proper text wrapping for large metadata objects
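To make the "full JSON display" concrete, here is a sketch of what one chunk's metadata object might look like once all backend fields are surfaced. The field names follow the list above; the values are purely illustrative, not taken from a real crawl.

```python
import json

# Hypothetical metadata for one document chunk; field names follow the
# bullet list above, values are illustrative only.
chunk_metadata = {
    "url": "https://example.com/docs/intro",
    "source": "example.com",
    "headers": "# Introduction",
    "filename": "intro.md",
    "has_code": True,
    "has_links": False,
    "source_id": "example.com",
    "char_count": 1834,
    "word_count": 291,
    "line_count": 42,
    "chunk_size": 2000,
    "chunk_index": 0,
    "source_type": "url",
    "knowledge_type": "technical",
}

# The metadata panel renders the complete object as pretty-printed JSON,
# rather than a hand-picked subset of 3-5 fields.
print(json.dumps(chunk_metadata, indent=2))
```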

🌐 Advanced Domain Filtering

  • Whitelist domains: Only crawl pages from specified domains
  • Blacklist domains: Exclude specific domains from crawling
  • URL patterns: Include/exclude based on URL pattern matching
  • Priority system: Blacklist > Whitelist > Patterns for conflict resolution
  • Domain statistics: Shows document count per domain in filter dropdown
  • "All domains" option: Easy way to view all documents across domains
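The priority order above (blacklist over whitelist over URL patterns) can be sketched as a single check. This is an illustrative reimplementation, not the actual `DomainFilter` class; it assumes glob-style matching via `fnmatch` for both domains and URLs.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def is_url_allowed(url, excluded_domains=(), allowed_domains=(),
                   exclude_patterns=(), include_patterns=()):
    """Illustrative precedence check: blacklist > whitelist > patterns."""
    domain = urlparse(url).netloc.lower()

    # 1. Blacklist wins over everything else.
    if any(fnmatch(domain, d) for d in excluded_domains):
        return False
    # 2. If a whitelist is given, the domain must match one entry.
    if allowed_domains and not any(fnmatch(domain, d) for d in allowed_domains):
        return False
    # 3. URL-level exclude patterns.
    if any(fnmatch(url, p) for p in exclude_patterns):
        return False
    # 4. If include patterns are given, at least one must match.
    if include_patterns:
        return any(fnmatch(url, p) for p in include_patterns)
    return True
```

For example, with `excluded_domains=["ads.*"]` and `allowed_domains=["*.example.com"]`, a URL on `ads.example.com` is rejected even though it matches the whitelist, because the blacklist is evaluated first.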

🎯 Improved Document Browser

  • Domain filter dropdown: Replaced pills with cleaner dropdown interface showing domain statistics
  • Enhanced sidebar: Better search and filtering experience in Knowledge Inspector
  • Metadata panel: Collapsible metadata viewer in content area

Technical Implementation

Frontend Architecture

  • Vertical slice pattern: All edit configuration features organized in knowledge feature folder
  • React Query integration: useUpdateCrawlConfig hook with optimistic updates and error handling
  • Component reuse: EditCrawlConfigDialog reuses existing AdvancedCrawlConfig, KnowledgeTypeSelector, etc.
  • TypeScript: Full type safety with proper CrawlConfig interfaces

Backend Integration

  • New API endpoint: POST /api/knowledge-items/{source_id}/update-config
  • Configuration validation: Uses existing CrawlRequestV2 and CrawlConfig Pydantic models
  • Safe recrawling: Deletes old data before starting new crawl with updated configuration
  • Progress tracking: Returns progress ID for monitoring recrawl status
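As a rough sketch of the request the frontend sends to the new endpoint, the following assembles the path and JSON body. Field names mirror the `CrawlConfig`/`CrawlRequestV2` models named above, but the exact payload shape here is an assumption for illustration, not the definitive API contract.

```python
def build_update_config_request(source_id, url, max_depth=2, tags=None,
                                crawl_config=None):
    """Assemble path and JSON body for the update-config endpoint.

    The payload shape is assumed from the model names described above;
    consult the actual Pydantic models for the authoritative schema.
    """
    path = f"/api/knowledge-items/{source_id}/update-config"
    body = {
        "url": url,
        "max_depth": max_depth,
        "tags": tags or [],
        "crawl_config": crawl_config or {
            "allowed_domains": [],
            "excluded_domains": [],
            "include_patterns": [],
            "exclude_patterns": [],
        },
    }
    return path, body

path, body = build_update_config_request(
    "example.com",
    "https://example.com/docs",
    crawl_config={"allowed_domains": ["*.example.com"],
                  "excluded_domains": [],
                  "include_patterns": [],
                  "exclude_patterns": ["*/changelog/*"]},
)
```

POSTing this body to `path` would return a progress ID that the existing tracking system polls while the recrawl runs.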

Key Files Modified/Added

Frontend Components

  • src/features/knowledge/components/EditCrawlConfigDialog.tsxNEW
  • src/features/knowledge/components/KnowledgeCardActions.tsx
  • src/features/knowledge/components/KnowledgeCard.tsx
  • src/features/knowledge/inspector/components/ContentViewer.tsx
  • src/features/knowledge/inspector/components/KnowledgeInspector.tsx
  • src/features/knowledge/inspector/components/InspectorSidebar.tsx

Services & Hooks

  • src/features/knowledge/hooks/useKnowledgeQueries.ts - Added useUpdateCrawlConfig
  • src/features/knowledge/services/knowledgeService.ts - Added updateCrawlConfig

Backend API

  • python/src/server/api_routes/knowledge_api.py - Added update-config endpoint

User Experience Flow

  1. Editing Configuration:

    • User clicks three-dot menu on knowledge card → "Edit Configuration"
    • Dialog opens with current settings pre-loaded
    • User can modify any crawler settings including advanced domain filtering
    • Warning shows that recrawl will be triggered
    • Save triggers new crawl with updated configuration
  2. Enhanced Metadata Viewing:

    • User opens Knowledge Inspector and selects any document
    • Metadata panel at bottom shows complete JSON with all properties
    • 500px height with scrolling for large metadata objects
    • All backend fields visible for debugging and analysis
  3. Domain Filtering:

    • User adds advanced crawl configuration with domain filters
    • During crawl, domains are filtered according to whitelist/blacklist/patterns
    • In document browser, user can filter by domain to validate results
    • Domain dropdown shows statistics (e.g., "example.com (15 documents)")
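The per-domain statistics shown in step 3 amount to grouping document URLs by host. A minimal sketch (the real computation lives in the TypeScript sidebar, here transposed to Python for illustration):

```python
from collections import Counter
from urllib.parse import urlparse

def domain_stats(urls):
    """Count documents per domain, skipping entries without a host."""
    counts = Counter()
    for url in urls:
        domain = urlparse(url).netloc.lower()
        if domain:  # guard against relative or malformed URLs
            counts[domain] += 1
    return counts

stats = domain_stats([
    "https://example.com/a",
    "https://example.com/b",
    "https://docs.example.com/c",
    "not-a-url",  # no host, ignored
])
for domain, n in stats.most_common():
    print(f"{domain} ({n} documents)")
```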

Testing & Validation

  • ✅ TypeScript compilation passes
  • ✅ Frontend build completes successfully
  • ✅ All new components follow Archon's design patterns
  • ✅ Backend API endpoint follows existing patterns
  • ✅ Optimistic updates work correctly
  • ✅ Progress tracking integrates with existing system
  • ✅ Complete metadata display verified with real data

Benefits

  • Better content control: Users can precisely control what gets crawled using domain filtering
  • Configuration flexibility: Easy editing of existing crawler settings without re-creating from scratch
  • Enhanced debugging: Complete metadata visibility helps users understand and debug crawl results
  • Improved UX: Cleaner domain filtering interface and better metadata viewing experience
  • Validation capabilities: Users can easily verify that domain filtering worked as expected

This implementation provides a comprehensive solution for advanced web crawling with full configurability and excellent user experience for managing and analyzing crawled content.

Summary by CodeRabbit

  • New Features

    • Crawl V2: start crawls with domain & URL-pattern filtering and update/save per-item crawl configurations; new "Edit Configuration" action to open the editor.
    • Domain-aware browsing: filter and view documents by domain; metadata/footer shows source domain and quick links.
  • UI/UX Improvements

    • Advanced crawl options panel, clearer dialog/tab layouts, loading/toast messages reflect domain-filtered crawls, and improved scroll/height behavior.
  • Bug Fixes

    • More accurate combined text + domain filtering and reset behavior.
  • Tests

    • Added comprehensive unit tests for domain filtering logic.

- Implement domain filtering for web crawler with whitelist/blacklist support
- Add URL pattern matching (glob-style) for include/exclude patterns
- Create AdvancedCrawlConfig UI component with collapsible panel
- Add domain filter to Knowledge Inspector sidebar for easy filtering
- Implement crawl-v2 API endpoint with backward compatibility
- Add comprehensive unit tests for domain filtering logic

Implements priority-based filtering:
1. Blacklist (excluded_domains) - highest priority
2. Whitelist (allowed_domains) - must match if provided
3. Exclude patterns - glob patterns to exclude
4. Include patterns - glob patterns to include

UI improvements:
- Advanced configuration section in Add Knowledge dialog
- Domain pills in Inspector sidebar showing document distribution
- Visual domain indicators on each document
- Responsive domain filtering with document counts

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
coderabbitai (Bot) commented Sep 22, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds Crawl V2 with domain/pattern filtering across UI, hooks, types, services, and server; introduces AdvancedCrawlConfig UI, domain-aware browsing/inspector filters, update-crawl-config flow, DomainFilter logic in crawler, and tests.

Changes

Cohort / File(s) Summary
Add Knowledge dialog & Advanced config UI
archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx, archon-ui-main/src/features/knowledge/components/AdvancedCrawlConfig.tsx
Add AdvancedCrawlConfig component and crawlConfig state; choose between v1 and v2 crawl requests; adjust layout, toasts, reset, and processing state to account for v2.
Knowledge card actions & edit dialog
archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx, archon-ui-main/src/features/knowledge/components/KnowledgeCardActions.tsx, archon-ui-main/src/features/knowledge/components/EditCrawlConfigDialog.tsx, archon-ui-main/src/features/knowledge/components/index.ts
Add EditCrawlConfigDialog, wire into KnowledgeCard/KnowledgeCardActions via new onEditConfig prop, expose AdvancedCrawlConfig via index export; enable editing and saving crawl_config (triggers recrawl).
Document browser & inspector UI
archon-ui-main/src/features/knowledge/components/DocumentBrowser.tsx, archon-ui-main/src/features/knowledge/inspector/components/ContentViewer.tsx, archon-ui-main/src/features/knowledge/inspector/components/InspectorSidebar.tsx, archon-ui-main/src/features/knowledge/inspector/components/KnowledgeInspector.tsx
Compute domainStats, add domain filter UI and domain links, toggle metadata panel, restructure layouts, wire selectedDomain state and filtering through sidebar/inspector/content viewer.
Hooks, services & types (client)
archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts, archon-ui-main/src/features/knowledge/services/knowledgeService.ts, archon-ui-main/src/features/knowledge/types/knowledge.ts
Add types CrawlConfig/CrawlRequestV2; add useCrawlUrlV2 and useUpdateCrawlConfig with optimistic updates; add service methods crawlUrlV2 and updateCrawlConfig.
Server API & Pydantic models
python/src/server/api_routes/knowledge_api.py, python/src/server/models/crawl_models.py
Add endpoints for /crawl-v2 and update-crawl-config; introduce CrawlConfig and CrawlRequestV2 models with validation; spawn background v2 crawl runner and progress tracking.
Crawler services & domain filtering
python/src/server/services/crawling/crawling_service.py, python/src/server/services/crawling/domain_filter.py, python/src/server/services/crawling/strategies/recursive.py, python/src/server/services/crawling/strategies/batch.py
Introduce DomainFilter, thread crawl_config through CrawlingService and strategies, apply domain/pattern checks when expanding/processing URLs; update constructors/signatures and logging/progress behavior.
Tests
python/src/server/services/tests/test_domain_filter.py
Add comprehensive unit tests covering domain normalization, whitelist/blacklist precedence, wildcard/subdomain matching, include/exclude patterns, relative URLs, and edge cases.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant UI as AddKnowledgeDialog
  participant H as useCrawlUrl / useCrawlUrlV2
  participant S as knowledgeService
  participant API as /api/knowledge-items/crawl[-v2]
  participant BG as Background Crawl Runner
  participant P as ProgressTracker

  U->>UI: Click "Start Crawling"
  UI->>H: submit(url, crawl_config?)
  alt crawl_config present (v2)
    H->>S: crawlUrlV2({url,...,crawl_config})
    S->>API: POST /crawl-v2
  else no crawl_config (v1)
    H->>S: crawlUrl({url,...})
    S->>API: POST /crawl
  end
  API->>P: create progressId
  API-->>S: {progressId, metadata}
  S-->>H: response
  H->>H: optimistic updates (cache)
  API->>BG: start async crawl task
  BG->>P: update progress
  P-->>UI: polled progress updates
sequenceDiagram
  autonumber
  actor U as User
  participant KC as KnowledgeCard
  participant D as EditCrawlConfigDialog
  participant H as useUpdateCrawlConfig
  participant S as knowledgeService
  participant API as POST /knowledge-items/{id}/update-config
  participant BG as Recrawl Runner
  participant P as ProgressTracker

  U->>KC: Open menu → Edit Configuration
  KC->>D: open(sourceId)
  U->>D: Save
  D->>H: mutate({sourceId, url, max_depth, tags, crawl_config})
  H->>S: updateCrawlConfig(request)
  S->>API: POST update-config
  API->>P: create progressId
  API-->>S: {progressId}
  S-->>H: ok
  H->>H: optimistic processing state
  API->>BG: start recrawl
  BG->>P: progress updates
sequenceDiagram
  autonumber
  participant R as RecursiveCrawlStrategy
  participant DF as DomainFilter
  participant CFG as CrawlConfig
  participant Q as URL Queue

  loop discovered next_url
    R->>DF: is_url_allowed(next_url, base_url, CFG)
    alt allowed
      DF-->>R: true
      R->>Q: enqueue next_url
    else blocked
      DF-->>R: false
      R->>R: skip & log
    end
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • coleam00
  • Wirasm
  • tazmon95

Poem

A nibble of links, a hop through domains,
I sift with whiskers, follow careful lanes.
Wildcards twinkle, patterns neatly spun,
V2 drums rolling—recrawls have begun.
Filters in paw—clean knowledge, crystal clear! 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

  • Title Check ✅ Passed: The title "feat: Advanced web crawling with domain filtering" is concise, a single sentence, and accurately captures the primary change (advanced crawling with domain and URL-pattern filtering across frontend and backend). It is specific enough for a teammate scanning history, avoids noisy file lists or vague terms, and the "feat:" prefix follows conventional commit style.
  • Description Check ✅ Passed: The description is thorough and maps well to the diffs, with a clear summary, detailed feature descriptions, implementation notes, key files changed, and a testing/validation section. It does not strictly follow the repository's PR template (the "Changes Made" list, "Type of Change" checkboxes, "Affected Services" matrix, formal Testing checklist with Test Evidence block, and Checklist/Breaking Changes sections are missing), but the substantive content is sufficient; aligning with the template is recommended for consistency and automation.
  • Docstring Coverage ✅ Passed: Coverage is 100.00%, above the required 80.00% threshold.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

leex279 and others added 10 commits September 22, 2025 09:39
- Change domain filter from pills to dropdown with 'All' option as default
- Add 'View Source' link for each document chunk
- Add 'View Metadata' button to view chunk metadata in expandable panel
- Improve UI consistency with smaller action buttons

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Fix button nesting error by using div with onClick instead of nested buttons
- Move metadata panel outside the clickable item area
- Add close button for metadata panel
- Ensure metadata displays within the UI (not opening in new tab)
- Fix type comparison for showMetadata state
- Improve layout with better spacing and scrollable metadata

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move metadata display from sidebar list to content viewer where chunk content is shown
- Add collapsible metadata section in content viewer with better formatting
- Remove metadata button and panel from sidebar to reduce clutter
- Keep only 'View Source' link in sidebar for quick access
- Show domain information in content footer for better context
- Improve overall UI consistency and user experience

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move metadata section to be always visible below header
- Add full-width collapsible button with hover effect
- Show property count in metadata header
- Add max-height and scroll to metadata content
- Separate content area from metadata for better visibility
- Improve UI with clear visual hierarchy

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move metadata panel from top to bottom of content viewer
- Place scroll on outer container to show more metadata (max-h-64)
- Keep metadata section always accessible at the bottom
- Maintain clear visual separation with border-top
- Ensure better visibility of all metadata properties

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
## New Features

### Edit Crawler Configuration
- Added `EditCrawlConfigDialog` component for editing existing crawler settings
- Users can now modify URL, knowledge type, max depth, tags, and advanced crawl config
- Added "Edit Configuration" menu item to knowledge card actions
- Shows warning that recrawl will be triggered when saving changes
- Created `useUpdateCrawlConfig` hook with optimistic updates
- Added backend API endpoint `/api/knowledge-items/{source_id}/update-config`

### Enhanced Metadata Viewing
- Fixed metadata display to show ALL properties from backend (not just 3-5)
- Increased metadata panel height from 256px to 500px for better viewing
- Added proper scrolling with visible scrollbar for large metadata objects
- Now displays complete JSON including properties like:
  - `url`, `source`, `headers`, `filename`
  - `has_code`, `has_links`, `source_id`
  - `char_count`, `word_count`, `line_count`
  - `chunk_size`, `chunk_index`
  - `source_type`, `knowledge_type`
  - All other backend metadata fields

### Domain Filtering Improvements
- Converted domain filter from pills to dropdown with "All" option
- Added domain statistics showing document count per domain
- Improved InspectorSidebar with better domain filtering UX

## Technical Implementation

### Frontend Components
- `EditCrawlConfigDialog.tsx` - Edit configuration dialog with warning
- Enhanced `KnowledgeCardActions.tsx` - Added edit configuration menu item
- Updated `KnowledgeCard.tsx` - Integrated edit dialog
- Improved `ContentViewer.tsx` - Better metadata display with full JSON
- Fixed `KnowledgeInspector.tsx` - Pass complete metadata instead of filtered subset

### React Hooks & Services
- `useUpdateCrawlConfig()` - Mutation hook for updating crawler config
- `knowledgeService.updateCrawlConfig()` - API service method
- Optimistic updates with progress tracking

### Backend API
- POST `/api/knowledge-items/{source_id}/update-config` - Update crawler configuration
- Validates existing item, deletes old data, triggers new crawl with updated config
- Returns progress ID for tracking recrawl operation

## User Experience
- Users can now edit any existing crawler configuration
- Clear warning about recrawl being triggered
- Complete metadata viewing for debugging and analysis
- Improved domain filtering in document browser
- Progress tracking for configuration updates

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Removed redundant View Source links from InspectorSidebar
- Main content area already contains source links
- Cleaned up unused ExternalLink import
- Fixed missing count and timestamp fields in optimistic updates
- Preserve all ActiveOperationsResponse fields when updating progress IDs
- Fixed incorrect field comparison (source_id vs id) when replacing temp IDs
- Added query invalidation for progress queries in v2 implementation
- Ensures proper data shape consistency with backend API

These fixes ensure that:
1. ActiveOperationsResponse always has required count/timestamp fields
2. Optimistic entities are correctly matched and updated with real IDs
3. Progress queries are properly refreshed after crawl starts
Backend fixes for crawling stability:

- Add comment clarifying DomainFilter doesn't need init params
- Improve base URL selection in recursive strategy:
  - Check start_urls length before indexing
  - Use appropriate base URL for domain checks
  - Fallback to original_url when start_urls is empty
- Add error handling for domain filter:
  - Wrap is_url_allowed in try/except block
  - Log exceptions and conservatively skip URLs on error
  - Prevents domain filter exceptions from crashing crawler
- Better handling of relative URL resolution

These changes ensure more robust crawling especially when:
- start_urls array is empty
- Domain filter encounters unexpected URLs
- Relative links need proper base URL resolution
Added detailed documentation for CrawlConfig interface including:

## Documentation improvements:
- Clear precedence rules (excluded_domains > allowed_domains > exclude_patterns > include_patterns)
- Pattern syntax explanation (glob patterns with fnmatch for URLs, wildcards for domains)
- Comprehensive examples showing common use cases:
  - Single subdomain with path exclusions
  - Multiple subdomains with specific exclusions
  - File type and directory blocking
- Individual property documentation with examples

## Code improvements:
- Refactored DocumentBrowser to avoid repeated URL/domain computation
- Extract resolvedUrl and resolvedDomain once as constants
- Improved readability and performance

This documentation helps developers understand:
- How conflicting rules are resolved (blacklist always wins)
- What pattern syntax to use (glob patterns)
- How to compose allow/deny lists effectively
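The glob semantics documented here (fnmatch-style patterns for URLs, wildcards for domains) can be illustrated with the "single subdomain with path exclusions" use case. The config values below are hypothetical examples, not defaults shipped with the crawler.

```python
from fnmatch import fnmatch

# Hypothetical config: crawl only the docs subdomain, but skip the
# versioned API reference and PDF downloads.
allowed_domains = ["docs.example.com"]
exclude_patterns = ["*/api/v1/*", "*.pdf"]

def domain_ok(domain):
    """Whitelist check: domain must match an allowed_domains entry."""
    return any(fnmatch(domain, pat) for pat in allowed_domains)

def path_ok(url):
    """Pattern check: URL must not match any exclude pattern."""
    return not any(fnmatch(url, pat) for pat in exclude_patterns)

print(domain_ok("docs.example.com"))                    # True
print(domain_ok("blog.example.com"))                    # False
print(path_ok("https://docs.example.com/guide"))        # True
print(path_ok("https://docs.example.com/api/v1/spec"))  # False
print(path_ok("https://docs.example.com/manual.pdf"))   # False
```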
leex279 (Collaborator, Author) commented Sep 22, 2025

@CodeRabbit review

coderabbitai (Bot) commented Sep 22, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Fixed bug where edit configuration dialog wasn't showing the existing crawl config:

1. **Enhanced data loading in EditCrawlConfigDialog**:
   - Check for data at both top-level and metadata level
   - Properly handle max_depth, tags, and crawl_config fields
   - Ensure crawl_config has the right shape with all arrays initialized
   - Reset form state when dialog opens without data

2. **Improved type definitions**:
   - Added crawl_config and max_depth to KnowledgeItemMetadata interface
   - Added optional fields to KnowledgeItem for top-level storage
   - Added index signature to allow additional untyped fields from backend

3. **Better state management**:
   - Reset form when dialog opens to prevent stale data
   - Only load data when both item exists and dialog is open
   - Initialize empty arrays for all crawl_config fields

This ensures that when editing an existing knowledge item:
- All original crawl settings are properly loaded
- Advanced configuration shows the domain filters and patterns
- Form is pre-filled with the exact configuration used for initial crawl
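The loading logic described in this fix (check the top level, then the metadata level, then fall back to empty arrays) can be sketched as follows. The dict shape is a stand-in for the typed TypeScript objects the real component works on, and the key names are assumptions drawn from the description above.

```python
def load_crawl_config(item):
    """Resolve saved crawl settings from either the top level or the
    nested metadata object, filling missing filter arrays with defaults.
    `item` is a plain dict standing in for a KnowledgeItem."""
    meta = item.get("metadata") or {}
    saved = item.get("crawl_config") or meta.get("crawl_config") or {}
    defaults = {
        "allowed_domains": [],
        "excluded_domains": [],
        "include_patterns": [],
        "exclude_patterns": [],
    }
    return {
        "url": item.get("url") or meta.get("url", ""),
        "max_depth": item.get("max_depth") or meta.get("max_depth", 2),
        "tags": item.get("tags") or meta.get("tags") or [],
        # Merge saved values over defaults so every array is initialized.
        "crawl_config": {**defaults, **saved},
    }

form = load_crawl_config({
    "url": "https://example.com",
    "metadata": {"max_depth": 3,
                 "crawl_config": {"excluded_domains": ["ads.example.com"]}},
})
```

This mirrors why the dialog previously showed a blank form: a config stored only at the metadata level was never consulted, and unset arrays were left undefined instead of empty.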

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
python/src/server/services/crawling/crawling_service.py (2)

299-311: Wire crawl_config from request to service instance.

If the service wasn’t constructed with crawl_config, it remains None and domain filtering is skipped. Load it from the request before crawling.

Apply this diff:

         try:
             url = str(request.get("url", ""))
             safe_logfire_info(f"Starting async crawl orchestration | url={url} | task_id={task_id}")
 
+            # Ensure crawl_config is available for strategies
+            if self.crawl_config is None and "crawl_config" in request:
+                self.crawl_config = request.get("crawl_config")
+                safe_logfire_info(
+                    f"Crawl config attached | has_config={bool(self.crawl_config)}"
+                )
+
             # Start the progress tracker if available
             if self.progress_tracker:
                 await self.progress_tracker.start({

686-699: Domain filtering not applied to batch/link-collection crawls — critical fix required

BatchCrawlStrategy is constructed without a DomainFilter/crawl_config, so extracted links crawled via crawl_batch_with_progress can bypass domain restrictions (confirmed in python/src/server/services/crawling/strategies/batch.py and the call site in python/src/server/services/crawling/crawling_service.py:686-699).

  • Change BatchCrawlStrategy to accept domain_filter and optional crawl_config and apply domain_filter.is_url_allowed(...) (or equivalent) before enqueueing URLs. (python/src/server/services/crawling/strategies/batch.py)
  • Initialize and use it at the call site: BatchCrawlStrategy(crawler, self.link_pruning_markdown_generator, self.domain_filter) and pass self.crawl_config into crawl_batch_with_progress(...) so the strategy can enforce filtering. (python/src/server/services/crawling/crawling_service.py:686-699)
🧹 Nitpick comments (38)
archon-ui-main/src/features/knowledge/inspector/components/ContentViewer.tsx (3)

7-7: Avoid repeated JSON.stringify; memoize metadata rendering.

Stringifying large metadata objects on every render can jank the UI. Compute it only when the panel is open or metadata changes.

Apply this diff:

-import { useState } from "react";
+import { useMemo, useState } from "react";
-  const [showMetadata, setShowMetadata] = useState(false);
+  const [showMetadata, setShowMetadata] = useState(false);
+  const metadataString = useMemo(() => {
+    if (!showMetadata) return "";
+    const data = selectedItem?.metadata;
+    try {
+      return data ? JSON.stringify(data, null, 2) : "{}";
+    } catch {
+      return "{}";
+    }
+  }, [showMetadata, selectedItem?.metadata]);
-                  <pre className="text-xs text-gray-400 font-mono whitespace-pre-wrap break-words">
-                    {JSON.stringify(selectedItem.metadata, null, 2)}
-                  </pre>
+                  <pre className="text-xs text-gray-400 font-mono whitespace-pre-wrap break-words">
+                    {metadataString}
+                  </pre>

Also applies to: 19-19, 148-150


133-139: Minor a11y: reflect toggle state via ARIA.

Expose expansion state and associate button with the panel.

-            <button
+            <button
               type="button"
               onClick={() => setShowMetadata(!showMetadata)}
-              className="w-full px-4 py-3 flex items-center gap-2 text-sm font-medium text-cyan-400 hover:bg-white/5 transition-colors text-left"
+              className="w-full px-4 py-3 flex items-center gap-2 text-sm font-medium text-cyan-400 hover:bg-white/5 transition-colors text-left"
+              aria-expanded={showMetadata}
+              aria-controls="inspector-metadata-panel"
             >
-              <div className="px-4 pb-4">
+              <div id="inspector-metadata-panel" className="px-4 pb-4">

Also applies to: 147-147


168-183: Option: show domain/source for code items too.

If code metadata includes a URL, consider showing Domain + “View Source” here as well for parity with documents.

archon-ui-main/src/features/knowledge/inspector/components/InspectorSidebar.tsx (2)

46-61: Domain stats LGTM; consider de-duping “empty/invalid” domains.

If extractDomain ever returns an empty string, it will produce a blank entry. Guard with a truthy domain check before incrementing.

-        const domain = extractDomain(url);
-        stats.set(domain, (stats.get(domain) || 0) + 1);
+        const domain = extractDomain(url);
+        if (domain) {
+          stats.set(domain, (stats.get(domain) || 0) + 1);
+        }

121-149: Nice filter UX; optional “Unknown domain” bucket.

Docs without a URL disappear when a specific domain is selected. Offering an “unknown” option improves discoverability.

archon-ui-main/src/features/knowledge/inspector/components/KnowledgeInspector.tsx (3)

159-169: Header counts ignore domain filter.

When a domain is selected, the header still shows counts for all loaded items. Compute filtered count to avoid user confusion.

Apply this diff (and import extractDomain as shown in the next comment):

-            filteredDocumentCount={viewMode === "documents" ? currentItems.length : 0}
+            filteredDocumentCount={
+              viewMode === "documents"
+                ? (selectedDomain === "all"
+                    ? currentItems.length
+                    : (currentItems as DocumentChunk[]).filter((d) => {
+                        const url = d.url || d.metadata?.url || "";
+                        return extractDomain(url) === selectedDomain;
+                      }).length)
+                : 0
+            }

6-14: Import extractDomain for filtered count computation.

Required by the previous change.

 import { ContentViewer } from "./ContentViewer";
 import { InspectorHeader } from "./InspectorHeader";
 import { InspectorSidebar } from "./InspectorSidebar";
 import { copyToClipboard } from "../../../shared/utils/clipboard";
+import { extractDomain } from "../../utils/knowledge-utils";

186-188: Clear stale selection when domain changes.

Switching domains can leave a selected item that’s no longer visible. Reset selection on domain change.

-            onDomainChange={setSelectedDomain}
+            onDomainChange={(d) => {
+              setSelectedDomain(d);
+              setSelectedItem(null);
+            }}
python/src/server/models/crawl_models.py (3)

48-56: Docstring mismatch with behavior.

The validator adds https:// (not http://). Update the comment for clarity.

-        """Ensure URL is properly formatted."""
+        """Ensure URL is properly formatted."""
@@
-        # Add http:// if no protocol specified
+        # Add https:// if no protocol specified

41-47: Minor typing nit: tags shouldn’t be optional if defaulting to [].

Use a non-optional list to better reflect runtime.

-    tags: list[str] | None = Field(default_factory=list, description="Tags to apply to crawled content")
+    tags: list[str] = Field(default_factory=list, description="Tags to apply to crawled content")

12-19: Forbid unexpected fields on Pydantic models — Pydantic v2 detected (2.11.9)

Add model_config = ConfigDict(extra="forbid") to CrawlConfig (python/src/server/models/crawl_models.py, lines 12–19) and other Pydantic models (e.g., CrawlRequestV2) to reject unrecognized inputs.

from pydantic import ConfigDict

class CrawlConfig(BaseModel):
    ...
    model_config = ConfigDict(extra="forbid")
archon-ui-main/src/features/knowledge/types/knowledge.ts (2)

41-47: Allow backend to pass through extra code metadata fields.

Enhanced metadata viewer displays all backend fields; add an index signature to avoid TS friction.

 export interface CodeExampleMetadata {
   language?: string;
   file_path?: string;
   summary?: string;
   relevance_score?: number;
-  // No additional flexible properties - use strict typing
+  // Allow backend-provided extras (e.g., source, headers, filename, counters)
+  [key: string]: unknown;
 }

64-71: Same for document metadata: allow extra fields.

Aligns with “Enhanced Metadata Viewing” that surfaces all backend fields.

 export interface DocumentChunkMetadata {
   title?: string;
   section?: string;
   relevance_score?: number;
   url?: string;
   tags?: string[];
-  // No additional flexible properties - use strict typing
+  // Allow backend-provided extras (e.g., headers, source_id, counts, flags)
+  [key: string]: unknown;
 }
archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx (1)

350-361: Lazy-mount the Edit dialog to avoid unnecessary work.

Only render the dialog when open to prevent fetching/initialization on every card.

-      {/* Edit Configuration Dialog */}
-      <EditCrawlConfigDialog
-        sourceId={item.source_id}
-        open={showEditConfigDialog}
-        onOpenChange={setShowEditConfigDialog}
-        onSuccess={() => {
-          // The refresh will be handled by the update itself
-          if (onRefreshStarted) {
-            // We'll get the progressId from the update response
-          }
-        }}
-      />
+      {/* Edit Configuration Dialog */}
+      {showEditConfigDialog && (
+        <EditCrawlConfigDialog
+          sourceId={item.source_id}
+          open={showEditConfigDialog}
+          onOpenChange={setShowEditConfigDialog}
+          onSuccess={(progressId?: string) => {
+            if (progressId && onRefreshStarted) onRefreshStarted(progressId);
+          }}
+        />
+      )}
python/src/server/services/tests/test_domain_filter.py (2)

178-193: Add coverage for non-HTTP(S) schemes and anchors in domain extraction.

get_domains_from_urls should ignore mailto:, javascript:, and in-page anchors. Add a test to lock this down.

Apply this diff to add coverage:

@@
     def test_get_domains_from_urls(self):
         """Test extracting domains from URL list."""
         urls = [
             "https://example.com/page1",
             "https://docs.example.com/api",
             "https://example.com/page2",
             "https://other.com/resource",
             "https://WWW.TEST.COM/page",
             "/relative/path",  # Should be skipped
             "invalid-url",  # Should be skipped
+            "mailto:someone@example.com",  # Should be skipped
+            "javascript:alert(1)",  # Should be skipped
+            "#section",  # Should be skipped
         ]
@@
         assert domains == {"example.com", "docs.example.com", "other.com", "test.com"}
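A stdlib-only sketch of the skipping behavior the expanded test pins down; `get_domains_from_urls` here is a hypothetical stand-in for the service helper, assuming the normalization the tests imply (lowercase hosts, leading `www.` stripped, non-HTTP(S) inputs ignored):

```python
from urllib.parse import urlparse

def get_domains_from_urls(urls: list[str]) -> set[str]:
    """Collect normalized domains, skipping anything that isn't an absolute http(s) URL."""
    domains: set[str] = set()
    for url in urls:
        parsed = urlparse(url)
        # Rejects relative paths, "#section" anchors, mailto: and javascript: links,
        # since none of them parse to an http(s) scheme with a hostname.
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            continue
        # urlparse already lowercases .hostname; strip only a leading "www."
        domains.add(parsed.hostname.removeprefix("www."))
    return domains

assert get_domains_from_urls([
    "https://example.com/page1",
    "https://WWW.TEST.COM/page",
    "/relative/path",
    "mailto:someone@example.com",
    "javascript:alert(1)",
    "#section",
]) == {"example.com", "test.com"}
```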

118-142: Edge precedence test: include vs. exclude patterns.

You validate blacklist over whitelist, but not exclude over include. Add a focused test to ensure exclude_patterns wins when both match a URL.

I can draft the test if you confirm the intended precedence is: blacklist > whitelist > exclude > include.

archon-ui-main/src/features/knowledge/components/EditCrawlConfigDialog.tsx (2)

42-57: Defensive state hydration from metadata.

Ensure tags are always an array and avoid sharing references to metadata.crawl_config (potential in-place mutation). Deep-clone the config or create a new object.

Apply this diff:

   useEffect(() => {
     if (item) {
       setUrl(item.url || "");
       setKnowledgeType(item.knowledge_type || "technical");
       // Access max_depth from metadata as any since it's not typed
       const metadata = item.metadata as any;
       setMaxDepth(metadata?.max_depth?.toString() || "2");
-      setTags(metadata?.tags || []);
+      setTags(Array.isArray(metadata?.tags) ? metadata.tags : []);
 
       // Load existing crawl config if available
       if (metadata?.crawl_config) {
-        setCrawlConfig(metadata.crawl_config);
+        try {
+          // cheap deep clone to avoid mutating original object by reference
+          setCrawlConfig(JSON.parse(JSON.stringify(metadata.crawl_config)));
+        } catch {
+          setCrawlConfig({ ...metadata.crawl_config });
+        }
       }
     }
   }, [item]);

96-105: Handle knowledge item load errors (disable destructive save).

If useKnowledgeItem fails, the form will show defaults and could overwrite existing config. Surface an error state and disable Save.

If useKnowledgeItem exposes isError/error, wrap the form with a guard and render an inline error with a disabled Save button.

archon-ui-main/src/features/knowledge/services/knowledgeService.ts (2)

238-255: Type crawl_config precisely instead of any.

Use the CrawlConfig type to catch shape errors at compile-time.

Apply this diff:

@@
-import type {
+import type {
   ChunksResponse,
   CodeExamplesResponse,
   CrawlRequest,
   CrawlRequestV2,
   CrawlStartResponse,
   KnowledgeItem,
   KnowledgeItemsFilter,
   KnowledgeItemsResponse,
   KnowledgeSource,
   RefreshResponse,
   SearchOptions,
   SearchResultsResponse,
   UploadMetadata,
+  CrawlConfig,
 } from "../types";
@@
-  async updateCrawlConfig(request: {
+  async updateCrawlConfig(request: {
     sourceId: string;
     url: string;
     knowledge_type: "technical" | "business";
     max_depth: number;
     tags?: string[];
-    crawl_config?: any;
+    crawl_config?: CrawlConfig;
   }): Promise<CrawlStartResponse> {

93-103: Optional: add request timeouts for crawl starts.

Crawl starts can hang under load. Consider passing an AbortSignal with a sane timeout via callAPIWithETag if supported.

Confirm whether callAPIWithETag supports timeouts; if not, I can help add a timeoutMs option and wire AbortController.

archon-ui-main/src/features/knowledge/components/DocumentBrowser.tsx (3)

55-68: Memoize filtering to avoid O(n) recomputation on every render.

Wrap filteredChunks in useMemo keyed by [chunks, searchQuery, selectedDomains] to reduce work on large datasets.

Apply this diff:

-  // Filter chunks based on search and domain
-  const filteredChunks = chunks.filter((chunk) => {
+  // Filter chunks based on search and domain
+  const filteredChunks = useMemo(() => chunks.filter((chunk) => {
     // Search filter
     const matchesSearch =
       !searchQuery ||
       chunk.content.toLowerCase().includes(searchQuery.toLowerCase()) ||
       chunk.metadata?.title?.toLowerCase().includes(searchQuery.toLowerCase());
 
     // Domain filter
     const url = chunk.url || chunk.metadata?.url;
     const matchesDomain = selectedDomains.size === 0 || (url && selectedDomains.has(extractDomain(url)));
 
-    return matchesSearch && matchesDomain;
-  });
+    return matchesSearch && matchesDomain;
+  }), [chunks, searchQuery, selectedDomains]);

151-179: Improve domain chip toggles: functional state update + accessibility.

Use functional state to avoid stale closures and add aria-pressed for screen readers.

Apply this diff:

-                        onClick={() => {
-                          const newSelection = new Set(selectedDomains);
-                          if (isSelected) {
-                            newSelection.delete(domain);
-                          } else {
-                            newSelection.add(domain);
-                          }
-                          setSelectedDomains(newSelection);
-                        }}
+                        onClick={() =>
+                          setSelectedDomains((prev) => {
+                            const next = new Set(prev);
+                            if (next.has(domain)) next.delete(domain);
+                            else next.add(domain);
+                            return next;
+                          })
+                        }
+                        aria-pressed={isSelected}
                         className={cn(

27-35: Consider server-side domain filtering to reduce payload.

knowledgeService.getKnowledgeItemChunks supports domainFilter. When domains are selected, request filtered chunks instead of pulling all and filtering client-side.

If useKnowledgeItemChunks accepts options, pass domainFilter as a CSV; otherwise, I can help extend the hook and invalidate the query on selection changes.

python/src/server/services/crawling/domain_filter.py (3)

58-60: Avoid blanket replacement of 'www.'; strip only the leading prefix.

-            normalized_domain = domain.replace("www.", "")
+            normalized_domain = domain.removeprefix("www.")

Note: `str.removeprefix` requires Python 3.9+; on older versions fall back to a guarded slice or regex.
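The difference is easy to demonstrate: `str.replace` removes every occurrence, not just a leading prefix, which can mangle legitimate hostnames:

```python
host = "www.awww.com"

# replace() strips every "www." it finds, corrupting the host:
assert host.replace("www.", "") == "acom"

# removeprefix() (Python 3.9+) strips only a leading "www.":
assert host.removeprefix("www.") == "awww.com"
assert "docs.example.com".removeprefix("www.") == "docs.example.com"
```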


62-79: Normalize patterns to lowercase before matching.

Configs may carry mixed-case hostnames; ensure case-insensitive domain checks.

-                for excluded in config.excluded_domains:
-                    if self._matches_domain(normalized_domain, excluded):
+                for excluded in config.excluded_domains:
+                    if self._matches_domain(normalized_domain, excluded.lower()):
                         logger.debug(f"URL blocked by excluded domain | url={url} | domain={normalized_domain} | excluded={excluded}")
                         return False
...
-                for allowed_domain in config.allowed_domains:
-                    if self._matches_domain(normalized_domain, allowed_domain):
+                for allowed_domain in config.allowed_domains:
+                    if self._matches_domain(normalized_domain, allowed_domain.lower()):
                         allowed = True
                         break

107-146: Strengthen _matches_domain normalization (lowercase, strip scheme/port, leading www).

Prevents false negatives with ports like example.com:8080 and ensures consistent comparison.

-        # Remove any remaining protocol or path from pattern
-        pattern = pattern.replace("http://", "").replace("https://", "").split("/")[0]
-        pattern = pattern.replace("www.", "")  # Remove www. for consistent matching
+        # Normalize inputs
+        domain = (domain or "").lower()
+        # Remove any remaining protocol or path from pattern
+        pattern = (pattern or "").lower()
+        pattern = pattern.replace("http://", "").replace("https://", "").split("/")[0]
+        # Drop port if present on pattern (e.g., example.com:8080)
+        pattern = pattern.split(":", 1)[0]
+        # Strip leading www. only
+        pattern = pattern.removeprefix("www.")

Optional hardening: also drop port from domain if present (defensive when callers pass netloc).

-        # Exact match
+        # Drop port from domain defensively
+        domain = domain.split(":", 1)[0]
+        # Exact match
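Combined, the normalization steps above can be sketched as a standalone matcher. Names are illustrative, not the module's, and the `*.` wildcard handling is an assumption based on the tests:

```python
def _normalize_host(value: str) -> str:
    """Lowercase, drop scheme/path/port, and strip a leading 'www.' only."""
    value = (value or "").lower()
    value = value.replace("http://", "").replace("https://", "").split("/")[0]
    value = value.split(":", 1)[0]      # drop port, e.g. example.com:8080
    return value.removeprefix("www.")   # leading prefix only (Python 3.9+)

def matches_domain(domain: str, pattern: str) -> bool:
    domain = _normalize_host(domain)
    pattern = _normalize_host(pattern)
    if pattern.startswith("*."):
        base = pattern[2:]
        # Subdomains match; a lookalike such as evil-example.com does not.
        return domain == base or domain.endswith("." + base)
    return domain == pattern

assert matches_domain("EXAMPLE.com:8080", "https://www.example.com/")
assert matches_domain("docs.example.com", "*.example.com")
assert not matches_domain("evil-example.com", "*.example.com")
```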
archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts (4)

315-321: Remove unused previousKnowledge from mutation context.

It’s never set or read; simplifies types.

-    {
-      previousKnowledge?: KnowledgeItem[];
+    {
       previousSummaries?: Array<[readonly unknown[], KnowledgeItemsResponse | undefined]>;
       previousOperations?: ActiveOperationsResponse;
       tempProgressId: string;
       tempItemId: string;
     }

1006-1017: Type crawl_config precisely; avoid any.

Use CrawlConfig to get compile-time safety across UI/service/backend.

-  return useMutation<
+  return useMutation<
     CrawlStartResponse,
     Error,
     {
       sourceId: string;
       url: string;
       knowledge_type: "technical" | "business";
       max_depth: number;
       tags?: string[];
-      crawl_config?: any;
+      crawl_config?: CrawlConfig;
     }
   >({

Also add CrawlConfig to the import from "../types".


390-424: Minor: build optimistic active operation once (no duplication).

The two branches differ only in list concatenation; collapse to one setQueryData call.


1049-1054: Feedback toast on success is missing for config update.

Consider notifying users that a recrawl has started and where to track it.

python/src/server/services/crawling/strategies/recursive.py (2)

301-313: Base URL selection for filtering is inconsistent for multi-start crawls.

Using start_urls[0] can misclassify links from other domains; prefer the page’s original_url as base.

-                                    base_url = original_url
-                                    if len(start_urls) > 0:
-                                        # If we have start_urls, use the first one
-                                        base_url = start_urls[0]
-                                    else:
-                                        # Try to use the current page's URL as base
-                                        # This handles relative links better
-                                        try:
-                                            from urllib.parse import urljoin
-                                            base_url = urljoin(original_url, next_url)
-                                        except Exception:
-                                            base_url = original_url
+                                    # Use the current page as base; it best reflects link context
+                                    base_url = original_url

DomainFilter should resolve relative URLs internally via urljoin (see suggested changes there).


288-326: Consider filtering start URLs as well.

Currently only discovered links are filtered. If configs include strict whitelists/patterns, initial set should be validated too.

archon-ui-main/src/features/knowledge/components/AdvancedCrawlConfig.tsx (3)

22-37: Avoid contradictory membership: remove domain from the opposite list when adding.

A domain shouldn’t be in both allowed and excluded; blacklist wins but UX should prevent conflict.

   const handleAddDomain = (type: "allowed" | "excluded") => {
     if (!newDomain.trim()) return;

     const domain = newDomain.trim().toLowerCase().replace(/^https?:\/\//, "").replace(/\/$/, "");
     const key = `${type}_domains` as keyof CrawlConfig;
+    const otherKey = `${type === "allowed" ? "excluded" : "allowed"}_domains` as keyof CrawlConfig;
     const current = config[key] || [];
+    const other = (config[otherKey] || []).filter((d: string) => d !== domain);

     if (!current.includes(domain)) {
       onChange({
         ...config,
         [key]: [...current, domain],
+        [otherKey]: other,
       });
     }

     setNewDomain("");
   };

171-176: Add accessible labels to remove buttons.

Improves screen reader usability.

-                  <button
+                  <button
+                    aria-label={`Remove domain ${domain}`}
                     onClick={() => handleRemoveDomain("allowed", domain)}
                     className="opacity-0 group-hover:opacity-100 transition-opacity"
                   >
...
-                  <button
+                  <button
+                    aria-label={`Remove domain ${domain}`}
                     onClick={() => handleRemoveDomain("excluded", domain)}
                     className="opacity-0 group-hover:opacity-100 transition-opacity"
                   >
...
-                  <button
+                  <button
+                    aria-label={`Remove include pattern ${pattern}`}
                     onClick={() => handleRemovePattern("include", pattern)}
                     className="opacity-0 group-hover:opacity-100 transition-opacity"
                   >
...
-                  <button
+                  <button
+                    aria-label={`Remove exclude pattern ${pattern}`}
                     onClick={() => handleRemovePattern("exclude", pattern)}
                     className="opacity-0 group-hover:opacity-100 transition-opacity"
                   >

Also applies to: 186-191, 269-273, 284-288


25-35: Consider stripping leading 'www.' for display consistency.

Backend normalizes without the prefix; aligning here reduces user confusion.

python/src/server/api_routes/knowledge_api.py (2)

578-591: Let FastAPI validate the request with a Pydantic model.

Use CrawlRequestV2 directly in the endpoint signature for consistent validation and OpenAPI docs.

-@router.post("/knowledge-items/{source_id}/update-config")
-async def update_crawl_config(source_id: str, request: dict):
+from ..models.crawl_models import CrawlRequestV2
+
+@router.post("/knowledge-items/{source_id}/update-config")
+async def update_crawl_config(source_id: str, request: CrawlRequestV2):
...
-        crawl_request = CrawlRequestV2(**request)
+        crawl_request = request

Propagate the same change to crawl_knowledge_item_v2.


985-991: Return estimated duration consistently with v1.

Minor UX polish: consider aligning wording/format with /knowledge-items/crawl to avoid confusing copy.

archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx (1)

73-79: Consider extracting the configuration check into a utility function.

The logic for detecting domain filtering configuration is sound but could be more maintainable as a separate utility function.

+const hasAdvancedCrawlConfig = (config: CrawlConfig): boolean => {
+  return !!(
+    (config.allowed_domains && config.allowed_domains.length > 0) ||
+    (config.excluded_domains && config.excluded_domains.length > 0) ||
+    (config.include_patterns && config.include_patterns.length > 0) ||
+    (config.exclude_patterns && config.exclude_patterns.length > 0)
+  );
+};

 const handleCrawl = async () => {
   // ... existing code ...
   
-  // Check if we have any domain filtering configuration
-  const hasCrawlConfig =
-    (crawlConfig.allowed_domains && crawlConfig.allowed_domains.length > 0) ||
-    (crawlConfig.excluded_domains && crawlConfig.excluded_domains.length > 0) ||
-    (crawlConfig.include_patterns && crawlConfig.include_patterns.length > 0) ||
-    (crawlConfig.exclude_patterns && crawlConfig.exclude_patterns.length > 0);
+  const hasCrawlConfig = hasAdvancedCrawlConfig(crawlConfig);
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c910c1 and dbe5855.

📒 Files selected for processing (19)
  • archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx (11 hunks)
  • archon-ui-main/src/features/knowledge/components/AdvancedCrawlConfig.tsx (1 hunks)
  • archon-ui-main/src/features/knowledge/components/DocumentBrowser.tsx (7 hunks)
  • archon-ui-main/src/features/knowledge/components/EditCrawlConfigDialog.tsx (1 hunks)
  • archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx (4 hunks)
  • archon-ui-main/src/features/knowledge/components/KnowledgeCardActions.tsx (5 hunks)
  • archon-ui-main/src/features/knowledge/components/index.ts (1 hunks)
  • archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts (3 hunks)
  • archon-ui-main/src/features/knowledge/inspector/components/ContentViewer.tsx (4 hunks)
  • archon-ui-main/src/features/knowledge/inspector/components/InspectorSidebar.tsx (8 hunks)
  • archon-ui-main/src/features/knowledge/inspector/components/KnowledgeInspector.tsx (7 hunks)
  • archon-ui-main/src/features/knowledge/services/knowledgeService.ts (3 hunks)
  • archon-ui-main/src/features/knowledge/types/knowledge.ts (2 hunks)
  • python/src/server/api_routes/knowledge_api.py (3 hunks)
  • python/src/server/models/crawl_models.py (1 hunks)
  • python/src/server/services/crawling/crawling_service.py (3 hunks)
  • python/src/server/services/crawling/domain_filter.py (1 hunks)
  • python/src/server/services/crawling/strategies/recursive.py (3 hunks)
  • python/src/server/services/tests/test_domain_filter.py (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-09-19T10:32:55.557Z
Learnt from: CR
PR: coleam00/Archon#0
File: AGENTS.md:0-0
Timestamp: 2025-09-19T10:32:55.557Z
Learning: Applies to archon-ui-main/src/features/*/components/**/*.{ts,tsx} : Place new UI components under src/features/[feature]/components

Applied to files:

  • archon-ui-main/src/features/knowledge/components/index.ts
📚 Learning: 2025-08-28T13:07:24.810Z
Learnt from: Wirasm
PR: coleam00/Archon#514
File: archon-ui-main/src/services/crawlProgressService.ts:35-39
Timestamp: 2025-08-28T13:07:24.810Z
Learning: The crawlProgressService.ts in the Archon codebase should be deprecated in favor of the existing useCrawlProgressPolling hook from usePolling.ts, which already includes ETag support, 304 handling, tab visibility detection, and proper React lifecycle integration. This consolidation reduces code duplication and improves performance.

Applied to files:

  • archon-ui-main/src/features/knowledge/components/EditCrawlConfigDialog.tsx
  • archon-ui-main/src/features/knowledge/components/AdvancedCrawlConfig.tsx
  • archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts
  • archon-ui-main/src/features/knowledge/services/knowledgeService.ts
  • archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx
📚 Learning: 2025-08-28T12:56:47.840Z
Learnt from: Wirasm
PR: coleam00/Archon#514
File: archon-ui-main/src/pages/ProjectPage.tsx:329-331
Timestamp: 2025-08-28T12:56:47.840Z
Learning: In the ProjectPage.tsx polling refactor, temporary project creation logic with progress cards was removed in favor of simpler modal loading states, as the complexity of managing in-flight temporary projects wasn't justified when HTTP polling would show new projects within seconds anyway.

Applied to files:

  • archon-ui-main/src/features/knowledge/components/KnowledgeCardActions.tsx
📚 Learning: 2025-09-19T10:32:55.557Z
Learnt from: CR
PR: coleam00/Archon#0
File: AGENTS.md:0-0
Timestamp: 2025-09-19T10:32:55.557Z
Learning: Applies to archon-ui-main/src/features/*/hooks/**/*.{ts,tsx} : Use feature-scoped TanStack Query hooks in src/features/[feature]/hooks

Applied to files:

  • archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts
🔇 Additional comments (21)
archon-ui-main/src/features/knowledge/inspector/components/InspectorSidebar.tsx (1)

166-221: List rendering/accessibility looks solid.

Keyboard and ARIA attributes are reasonable; animated hover and selection state are consistent.

archon-ui-main/src/features/knowledge/components/index.ts (1)

2-2: Re-export of AdvancedCrawlConfig looks good.

Keeps feature modules discoverable via a single barrel.

archon-ui-main/src/features/knowledge/types/knowledge.ts (2)

136-201: CrawlConfig docs and shape look solid.

Clear precedence and examples; matches backend model.


211-214: CrawlRequestV2 extension is appropriate.

Optional crawl_config keeps v1 compatibility.

archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx (1)

226-227: Nice: gate “Edit Configuration” to URL sources only.

Prevents surfacing irrelevant actions for file uploads.

archon-ui-main/src/features/knowledge/components/KnowledgeCardActions.tsx (1)

96-99: Edit Configuration action wiring is clean.

Propagation is stopped; action only shown for URLs; icon/text consistent.

Also applies to: 133-141

python/src/server/services/tests/test_domain_filter.py (1)

157-166: Clarify 'www' normalization semantics.

The test expects allowed_domains=["www.test.com"] to also allow https://test.com/page. That implies stripping www. during normalization. Please confirm this is intentional; it weakens exact‑host whitelisting and may surprise users who only allow www.test.com. If not intended, update either the test or the DomainFilter to require exact host matches unless a wildcard is used.

python/src/server/services/crawling/crawling_service.py (1)

86-90: Pass DomainFilter to recursive strategy — verified.
RecursiveCrawlStrategy stores the DomainFilter and calls domain_filter.is_url_allowed(...) during the crawl (python/src/server/services/crawling/strategies/recursive.py:24,316). Align batch/sitemap similarly for consistent coverage.

archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx (13)

13-15: LGTM! Proper imports for Crawl V2 functionality.

The new imports support the advanced crawling features with domain filtering. The component structure follows good practices with proper type imports.


36-36: LGTM! Consistent mutation hook integration.

The new useCrawlUrlV2 hook is properly imported and follows the same pattern as the existing crawl mutation.


48-48: LGTM! Clean state management for crawl configuration.

The empty object initialization for crawlConfig is appropriate as it represents optional domain filtering settings.


60-60: LGTM! Proper state reset.

The crawlConfig is properly reset alongside other form state, maintaining consistency in form reset behavior.


82-101: LGTM! Clean conditional API selection with proper TypeScript typing.

The logic correctly determines whether to use V1 or V2 endpoint based on domain filtering configuration. Both request objects are properly typed and constructed.


108-108: LGTM! Clear user feedback with contextual messaging.

The conditional toast message appropriately informs users whether domain filtering is active, providing valuable feedback about the crawl type being initiated.


151-151: LGTM! Comprehensive loading state handling.

The isProcessing state correctly includes both V1 and V2 mutations, ensuring the UI properly reflects all possible loading states.


155-159: LGTM! Enhanced dialog layout for better content management.

The flexbox layout with proper height constraints and flex-shrink settings ensures the dialog content is properly contained and scrollable.


161-215: LGTM! Enhanced tab styling with visual hierarchy.

The tab interface provides clear visual feedback with gradient backgrounds, borders, and accent glows. The implementation maintains good accessibility with proper focus states.


218-227: Enhanced scrolling container with proper height management.

The scrolling wrapper ensures content overflow is handled properly while maintaining consistent styling across different viewport sizes.


252-254: LGTM! Proper integration of AdvancedCrawlConfig component.

The component is positioned logically beneath the URL input and receives proper state management through controlled props. The useId hook used in this component generates stable, unique IDs that are consistent across renders and work seamlessly with SSR.


273-273: LGTM! Comprehensive loading state for start button.

The button properly reflects the loading state of both V1 and V2 mutations, providing consistent user feedback regardless of which API endpoint is being used.


285-299: LGTM! Consistent scrollable container for upload tab.

The upload tab maintains the same scrolling behavior and height management as the crawl tab, ensuring a consistent user experience across both tabs.

@leex279 leex279 linked an issue Sep 22, 2025 that may be closed by this pull request
10 tasks
- Wire crawl_config from request to service instance for domain filtering
- Add domain filtering to BatchCrawlStrategy to ensure all crawls respect filters
- Replace 'any' types with proper CrawlConfig TypeScript types
- Add memoization for filtered chunks in DocumentBrowser for performance
- Improve domain normalization (strip leading www only, handle ports)
- Use Pydantic model validation directly in API endpoints

These fixes ensure domain filtering works consistently across all crawl strategies
and improves type safety and performance throughout the codebase.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@leex279
Collaborator Author

leex279 commented Sep 22, 2025

@CodeRabbit review

@coderabbitai

coderabbitai Bot commented Sep 22, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
python/src/server/api_routes/knowledge_api.py (2)

893-903: Stop/remove of active task happens too early; attach cleanup to the real crawl task instead of wrapper.

The wrapper returns immediately after scheduling; your finally block deletes the registry entry, breaking cancellation via /stop.

Apply:

-            crawl_task = result.get("task")
-            if crawl_task:
-                active_crawl_tasks[progress_id] = crawl_task
-                safe_logfire_info(
-                    f"Stored actual crawl task in active_crawl_tasks | progress_id={progress_id} | task_name={crawl_task.get_name()}"
-                )
+            crawl_task = result.get("task")
+            if crawl_task:
+                active_crawl_tasks[progress_id] = crawl_task
+                safe_logfire_info(
+                    f"Stored actual crawl task in active_crawl_tasks | progress_id={progress_id} | task_name={crawl_task.get_name()}"
+                )
+                # Ensure cleanup happens when the underlying task finishes
+                def _cleanup(_):
+                    if progress_id in active_crawl_tasks:
+                        del active_crawl_tasks[progress_id]
+                        safe_logfire_info(
+                            f"Cleaned up crawl task from registry | progress_id={progress_id}"
+                        )
+                try:
+                    crawl_task.add_done_callback(_cleanup)
+                except Exception:
+                    pass
@@
-        finally:
-            # Clean up task from registry when done (success or failure)
-            if progress_id in active_crawl_tasks:
-                del active_crawl_tasks[progress_id]
-                safe_logfire_info(
-                    f"Cleaned up crawl task from registry | progress_id={progress_id}"
-                )
+        finally:
+            # Cleanup is handled by the task's done-callback
+            pass

Also applies to: 930-936


738-760: Same early cleanup bug in refresh path; delete via done-callback, not wrapper finally.

Otherwise /stop may not find the task.

Apply:

-            finally:
-                # Clean up task from registry when done (success or failure)
-                if progress_id in active_crawl_tasks:
-                    del active_crawl_tasks[progress_id]
-                    safe_logfire_info(
-                        f"Cleaned up refresh task from registry | progress_id={progress_id}"
-                    )
+            finally:
+                # Cleanup handled when underlying crawl task completes
+                if 'crawl_task' in locals() and crawl_task:
+                    def _cleanup(_):
+                        if progress_id in active_crawl_tasks:
+                            del active_crawl_tasks[progress_id]
+                            safe_logfire_info(
+                                f"Cleaned up refresh task from registry | progress_id={progress_id}"
+                            )
+                    try:
+                        crawl_task.add_done_callback(_cleanup)
+                    except Exception:
+                        pass
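The shared fix in both hunks is the same pattern: tie registry cleanup to the crawl task's own completion rather than the wrapper's return. A minimal sketch; `fake_crawl` and the registry names are illustrative, not Archon's:

```python
import asyncio

active_tasks: dict[str, asyncio.Task] = {}

async def fake_crawl(progress_id: str) -> str:
    await asyncio.sleep(0)  # stand-in for real crawl work
    return f"done:{progress_id}"

def register(progress_id: str, task: asyncio.Task) -> None:
    active_tasks[progress_id] = task
    # Cleanup fires when the task itself finishes, so a /stop endpoint can
    # still find the entry for as long as the crawl is actually running.
    task.add_done_callback(lambda _t: active_tasks.pop(progress_id, None))

async def main() -> None:
    task = asyncio.create_task(fake_crawl("p1"))
    register("p1", task)
    assert "p1" in active_tasks      # cancellable while running
    await task
    await asyncio.sleep(0)           # give the done-callback a chance to run
    assert "p1" not in active_tasks  # registry cleaned up after completion

asyncio.run(main())
```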
archon-ui-main/src/features/knowledge/components/DocumentBrowser.tsx (1)

221-223: Prevent crashes when content is missing; compute preview safely.

Substring/length on undefined will throw.

Apply:

-                    const preview = chunk.content.substring(0, 200);
-                    const needsExpansion = chunk.content.length > 200;
+                    const content = chunk.content || "";
+                    const preview = content.substring(0, 200);
+                    const needsExpansion = content.length > 200;
python/src/server/services/crawling/strategies/batch.py (1)

7-12: Critical: crawl_config name collision breaks domain filtering (CrawlerRunConfig shadows CrawlConfig).

The method parameter crawl_config (intended for DomainFilter rules) is overwritten by a local CrawlerRunConfig, so is_url_allowed(..., crawl_config) receives the wrong type. This can cause incorrect filtering or runtime errors. Rename the local variable to run_config and the parameter to filter_config, and type-hint both to prevent regressions.

Apply this diff:

@@
-import asyncio
-from collections.abc import Awaitable, Callable
-from typing import Any
+import asyncio
+from collections.abc import Awaitable, Callable
+from typing import Any, TYPE_CHECKING
@@
-from ....config.logfire_config import get_logger
+from ....config.logfire_config import get_logger
@@
+from ..domain_filter import DomainFilter
+
+if TYPE_CHECKING:
+    from ....models.crawl_models import CrawlConfig
@@
-class BatchCrawlStrategy:
+class BatchCrawlStrategy:
@@
-    def __init__(self, crawler, markdown_generator, domain_filter=None):
+    def __init__(self, crawler, markdown_generator, domain_filter: "DomainFilter" | None = None):
@@
-    async def crawl_batch_with_progress(
+    async def crawl_batch_with_progress(
         self,
         urls: list[str],
         transform_url_func: Callable[[str], str],
         is_documentation_site_func: Callable[[str], bool],
         max_concurrent: int | None = None,
         progress_callback: Callable[..., Awaitable[None]] | None = None,
         cancellation_check: Callable[[], None] | None = None,
-        crawl_config=None,
+        filter_config: "CrawlConfig" | None = None,
     ) -> list[dict[str, Any]]:
@@
-        if has_doc_sites:
+        if has_doc_sites:
             logger.info("Detected documentation sites in batch, using enhanced configuration")
             # Use generic documentation selectors for batch crawling
-            crawl_config = CrawlerRunConfig(
+            run_config = CrawlerRunConfig(
@@
-        else:
+        else:
             # Configuration for regular batch crawling
-            crawl_config = CrawlerRunConfig(
+            run_config = CrawlerRunConfig(
@@
-            if self.domain_filter and crawl_config:
+            if self.domain_filter and filter_config:
@@
-                if not self.domain_filter.is_url_allowed(url, base_url, crawl_config):
+                if not self.domain_filter.is_url_allowed(url, base_url, filter_config):
@@
-            batch_results = await self.crawler.arun_many(
-                urls=batch_urls, config=crawl_config, dispatcher=dispatcher
+            batch_results = await self.crawler.arun_many(
+                urls=batch_urls, config=run_config, dispatcher=dispatcher
             )

Also applies to: 22-22, 35-44, 108-135, 174-182, 228-230
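The shadowing failure mode above can be reduced to a few lines. This is a hypothetical reproduction, not the PR's actual code: the class names stand in for the real `CrawlConfig` model and crawl4ai's `CrawlerRunConfig`, and `is_url_allowed` is simplified to a substring check.

```python
# Hypothetical reduction of the shadowing bug: the parameter meant for
# filtering rules is overwritten by a run configuration of a different type.
class CrawlConfig:            # stand-in for the domain-filter rules model
    allowed_domains = ["example.com"]

class CrawlerRunConfig:       # stand-in for crawl4ai's per-run settings
    def __init__(self, cache_mode: str):
        self.cache_mode = cache_mode

def is_url_allowed(url: str, config: CrawlConfig) -> bool:
    # Expects filter rules; a CrawlerRunConfig here has no allowed_domains.
    return any(d in url for d in config.allowed_domains)

def crawl_batch_buggy(urls, crawl_config=None):
    crawl_config = CrawlerRunConfig(cache_mode="bypass")  # shadows the parameter!
    try:
        return [u for u in urls if is_url_allowed(u, crawl_config)]
    except AttributeError as exc:
        return f"filtering broken: {exc}"

def crawl_batch_fixed(urls, filter_config=None):
    run_config = CrawlerRunConfig(cache_mode="bypass")    # distinct name, no shadowing
    _ = run_config                                        # would be passed to arun_many
    if filter_config is None:
        return list(urls)
    return [u for u in urls if is_url_allowed(u, filter_config)]

rules = CrawlConfig()
print(crawl_batch_buggy(["https://example.com/a"], rules))   # AttributeError path
print(crawl_batch_fixed(["https://example.com/a", "https://other.net/b"], rules))
```

Distinct names (`run_config` vs `filter_config`) plus type hints make this class of bug a static-analysis error rather than a silent runtime failure.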

python/src/server/services/crawling/crawling_service.py (2)

209-217: Update call to match renamed parameter in BatchCrawlStrategy.

After fixing the name collision in BatchCrawlStrategy, pass filter_config explicitly.

Apply this diff:

         return await self.batch_strategy.crawl_batch_with_progress(
             urls,
             self.url_handler.transform_github_url,
             self.site_config.is_documentation_site,
             max_concurrent,
             progress_callback,
             self._check_cancellation,  # Pass cancellation check
-            self.crawl_config,  # Pass crawl_config for domain filtering
+            filter_config=self.crawl_config,  # Pass CrawlConfig for domain filtering
         )

227-236: Update recursive call to match renamed parameter in RecursiveCrawlStrategy (if mirrored).

Keep parameter naming consistent across strategies.

Apply this diff:

         return await self.recursive_strategy.crawl_recursive_with_progress(
             start_urls,
             self.url_handler.transform_github_url,
             self.site_config.is_documentation_site,
             max_depth,
             max_concurrent,
             progress_callback,
             self._check_cancellation,  # Pass cancellation check
-            self.crawl_config,  # Pass crawl config for domain filtering
+            filter_config=self.crawl_config,  # Pass CrawlConfig for domain filtering
         )
🧹 Nitpick comments (8)
python/src/server/api_routes/knowledge_api.py (1)

974-977: Don’t register wrapper tasks in active_crawl_tasks; only track the actual crawl task.

The wrapper is replaced immediately by the real task, creating churn and racy cleanup.

Apply:

-        crawl_task = asyncio.create_task(_run_crawl_v2(request_dict=crawl_request.dict(), progress_id=progress_id))
-        active_crawl_tasks[progress_id] = crawl_task
+        asyncio.create_task(_run_crawl_v2(request_dict=crawl_request.dict(), progress_id=progress_id))
-        crawl_task = asyncio.create_task(
-            _run_crawl_v2(request_dict=request_dict, progress_id=progress_id)
-        )
-        active_crawl_tasks[progress_id] = crawl_task
+        asyncio.create_task(_run_crawl_v2(request_dict=request_dict, progress_id=progress_id))

Also applies to: 636-640
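The done-callback cleanup pattern recommended throughout these comments can be sketched in isolation. The names (`active_crawl_tasks`, `progress_id`) mirror the PR; the crawl body is a stand-in:

```python
import asyncio

# Registry of in-flight crawls, keyed by progress_id (mirrors the PR).
active_crawl_tasks: dict[str, asyncio.Task] = {}

async def fake_crawl(progress_id: str) -> str:
    await asyncio.sleep(0)  # stand-in for real crawl work
    return f"done:{progress_id}"

def register(progress_id: str, task: asyncio.Task) -> None:
    active_crawl_tasks[progress_id] = task

    def _cleanup(_finished: asyncio.Task) -> None:
        # Runs only after the task truly completes, so /stop can still
        # find the entry while the crawl is in flight.
        active_crawl_tasks.pop(progress_id, None)

    task.add_done_callback(_cleanup)

async def main() -> None:
    task = asyncio.create_task(fake_crawl("p1"))
    register("p1", task)
    assert "p1" in active_crawl_tasks      # visible while running
    await task
    await asyncio.sleep(0)                 # let done-callbacks fire
    assert "p1" not in active_crawl_tasks  # removed only after completion

asyncio.run(main())
```

A wrapper-level `finally` fires when the wrapper coroutine exits, which can precede the real task's completion; `add_done_callback` ties cleanup to the task itself.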

archon-ui-main/src/features/knowledge/components/DocumentBrowser.tsx (1)

41-53: Skip empty/invalid domains in domainStats.

Avoid rendering blank chips and skewed counts.

Apply:

     const stats = new Map<string, number>();
     chunks.forEach((chunk) => {
       const url = chunk.url || chunk.metadata?.url;
       if (url) {
-        const domain = extractDomain(url);
-        stats.set(domain, (stats.get(domain) || 0) + 1);
+        const domain = extractDomain(url);
+        if (!domain) return;
+        stats.set(domain, (stats.get(domain) || 0) + 1);
       }
     });
archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts (1)

336-364: Minor: tempItemId is unused in v2 flow; remove to reduce noise.

Keeps context minimal and avoids confusion.

Apply:

-      const tempItemId = optimisticItem.id;
+      // No tempItemId used in v2; we key by source_id (tempProgressId)
python/src/server/services/crawling/strategies/batch.py (2)

155-161: Progress totals should reflect filtering outcome.

Initial message reports the pre-filter count. Consider updating once filtering completes to keep UI totals consistent.

Apply this diff:

-        await report_progress(
-            0,  # Start at 0% progress
-            f"Starting to process {initial_urls_count} URLs...",
-            total_pages=initial_urls_count,
-            processed_pages=0
-        )
+        await report_progress(
+            0,
+            f"Starting to process {initial_urls_count} URLs...",
+            total_pages=initial_urls_count,
+            processed_pages=0
+        )
@@
-        # Update total count after filtering
+        # Update total count after filtering
         total_urls = len(transformed_urls)
+        if filtered_count:
+            await report_progress(
+                0,
+                f"Filtered {filtered_count} URLs; {total_urls} remaining to crawl",
+                total_pages=total_urls,
+                processed_pages=0
+            )

Also applies to: 187-193


216-221: Guard progress math and smoothness.

Use a non‑zero denominator to avoid edge cases and keep progress monotonic.

Apply this diff:

-            progress_percentage = int((i / total_urls) * 100)
+            progress_percentage = int((i / max(total_urls, 1)) * 100)
@@
-                progress_percentage = int((processed / total_urls) * 100)
+                progress_percentage = int((processed / max(total_urls, 1)) * 100)

Also applies to: 268-276

archon-ui-main/src/features/knowledge/types/knowledge.ts (3)

24-26: Avoid any in index signature; prefer unknown for type safety.

[key: string]: any masks type errors across KnowledgeItemMetadata. Use unknown and narrow at use sites.

Apply this diff:

-  [key: string]: any; // Allow additional untyped fields from backend
+  [key: string]: unknown; // Allow additional untyped fields from backend with safe reads

42-46: Duplication of fields at top level vs metadata — define source of truth.

max_depth, tags, and crawl_config appear both at the top level and under metadata. This can drift. Prefer one canonical location (ideally metadata) and map on the API layer, or document precedence for consumers.

Do you want a follow-up task to consolidate?


143-207: Make arrays readonly to clarify immutability in UI models.

These config lists are typically read-only in the UI; marking as readonly helps.

Apply this diff:

-export interface CrawlConfig {
+export interface CrawlConfig {
@@
-  allowed_domains?: string[];
+  readonly allowed_domains?: readonly string[];
@@
-  excluded_domains?: string[];
+  readonly excluded_domains?: readonly string[];
@@
-  include_patterns?: string[];
+  readonly include_patterns?: readonly string[];
@@
-  exclude_patterns?: string[];
+  readonly exclude_patterns?: readonly string[];
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dbe5855 and ec01e91.

📒 Files selected for processing (9)
  • archon-ui-main/src/features/knowledge/components/DocumentBrowser.tsx (7 hunks)
  • archon-ui-main/src/features/knowledge/components/EditCrawlConfigDialog.tsx (1 hunks)
  • archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts (3 hunks)
  • archon-ui-main/src/features/knowledge/services/knowledgeService.ts (3 hunks)
  • archon-ui-main/src/features/knowledge/types/knowledge.ts (4 hunks)
  • python/src/server/api_routes/knowledge_api.py (4 hunks)
  • python/src/server/services/crawling/crawling_service.py (5 hunks)
  • python/src/server/services/crawling/domain_filter.py (1 hunks)
  • python/src/server/services/crawling/strategies/batch.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • archon-ui-main/src/features/knowledge/services/knowledgeService.ts
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-28T13:07:24.810Z
Learnt from: Wirasm
PR: coleam00/Archon#514
File: archon-ui-main/src/services/crawlProgressService.ts:35-39
Timestamp: 2025-08-28T13:07:24.810Z
Learning: The crawlProgressService.ts in the Archon codebase should be deprecated in favor of the existing useCrawlProgressPolling hook from usePolling.ts, which already includes ETag support, 304 handling, tab visibility detection, and proper React lifecycle integration. This consolidation reduces code duplication and improves performance.

Applied to files:

  • archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts
📚 Learning: 2025-09-19T10:32:55.557Z
Learnt from: CR
PR: coleam00/Archon#0
File: AGENTS.md:0-0
Timestamp: 2025-09-19T10:32:55.557Z
Learning: Applies to archon-ui-main/src/features/*/hooks/**/*.{ts,tsx} : Use feature-scoped TanStack Query hooks in src/features/[feature]/hooks

Applied to files:

  • archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts
🔇 Additional comments (11)
python/src/server/services/crawling/domain_filter.py (3)

162-179: Extract domains with hostname and normalize “www.”.

Avoid ports/userinfo in domain set.

Apply:

-        for url in urls:
+        for url in urls:
             try:
-                parsed = urlparse(url)
-                if parsed.netloc:
-                    domain = parsed.netloc.lower().replace("www.", "")
-                    domains.add(domain)
+                parsed = urlparse(url)
+                host = parsed.hostname
+                if host:
+                    domains.add(host.lower().removeprefix("www."))

8-10: Use urljoin and hostname for robust URL resolution and domain extraction.

netloc includes ports/userinfo and relative URLs aren’t resolved.

Apply:

-from urllib.parse import urlparse
+from urllib.parse import urlparse, urljoin

44-57: Fix relative URL handling and ensure consistent host normalization.

Preserve query strings, resolve relatives, and use hostname (not netloc).

Apply:

-            # Parse the URL
-            parsed = urlparse(url)
-
-            # Handle relative URLs by using base URL's domain
-            if not parsed.netloc:
-                base_parsed = urlparse(base_url)
-                domain = base_parsed.netloc.lower()
-                # Construct full URL for pattern matching
-                full_url = f"{base_parsed.scheme}://{base_parsed.netloc}{parsed.path or '/'}"
-            else:
-                domain = parsed.netloc.lower()
-                full_url = url
+            # Resolve to absolute and extract canonical host
+            parsed = urlparse(url)
+            if not parsed.scheme and not parsed.netloc:
+                resolved = urlparse(urljoin(base_url, url))
+            else:
+                resolved = parsed
+            host = (resolved.hostname or "").lower()
+            domain = host
+            # Use resolved URL (keeps query/fragment) for pattern matching
+            full_url = resolved.geturl()
@@
-            normalized_domain = domain
-            if normalized_domain.startswith("www."):
-                normalized_domain = normalized_domain[4:]
+            normalized_domain = domain.removeprefix("www.")
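The difference the two suggestions above rely on is observable with nothing but the standard library: `netloc` keeps port and userinfo while `hostname` is already lowercased, and `urljoin` resolves relative links against the page they came from without dropping the query string.

```python
from urllib.parse import urljoin, urlparse

# Why hostname beats netloc for domain checks: netloc keeps port and userinfo.
u = urlparse("https://user:pw@www.Example.com:8443/docs?page=2")
assert u.netloc == "user:pw@www.Example.com:8443"
assert u.hostname == "www.example.com"          # lowercased, no port/userinfo
assert u.hostname.removeprefix("www.") == "example.com"

# Why urljoin matters: relative links resolve against the base page,
# and the query string survives for pattern matching.
base = "https://example.com/docs/index.html"
resolved = urljoin(base, "../api/ref.html?v=2")
assert resolved == "https://example.com/api/ref.html?v=2"
```

Note `str.removeprefix` requires Python 3.9+.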
python/src/server/api_routes/knowledge_api.py (2)

627-634: Don’t delete existing data before recrawl; gate behind a flag and defer deletion until success.

Immediate deletion risks irreversible data loss if the new crawl fails. Default to non-destructive behavior and only hard-delete when explicitly requested.

Apply:

-@router.post("/knowledge-items/{source_id}/update-config")
-async def update_crawl_config(source_id: str, request: CrawlRequestV2):
+@router.post("/knowledge-items/{source_id}/update-config")
+async def update_crawl_config(
+    source_id: str,
+    request: CrawlRequestV2,
+    force_delete_old: bool = Query(False),
+):
@@
-        # First delete the existing knowledge item and its documents
-        safe_logfire_info(f"Deleting existing knowledge item before recrawl | source_id={source_id}")
-        try:
-            await service.delete_item(source_id)
-        except Exception as e:
-            safe_logfire_error(f"Failed to delete existing item | error={str(e)}")
-            # Continue anyway - we'll overwrite
+        # Build request dict and mark replacement target (defer deletion)
+        request_dict = crawl_request.dict()
+        request_dict.setdefault("metadata", {})
+        request_dict["metadata"]["replace_source_id"] = source_id
+
+        # Optionally allow immediate deletion (explicit opt-in)
+        if force_delete_old:
+            safe_logfire_info(
+                f"force_delete_old=True; deleting existing knowledge item before recrawl | source_id={source_id}"
+            )
+            try:
+                await service.delete_item(source_id)
+            except Exception as e:
+                safe_logfire_error(f"Failed to delete existing item | error={str(e)}")
+                # Continue anyway; replacement path will create fresh data
@@
-        crawl_task = asyncio.create_task(
-            _run_crawl_v2(request_dict=crawl_request.dict(), progress_id=progress_id)
-        )
-        active_crawl_tasks[progress_id] = crawl_task
+        crawl_task = asyncio.create_task(
+            _run_crawl_v2(request_dict=request_dict, progress_id=progress_id)
+        )

Add missing import (outside this hunk):

from fastapi import APIRouter, File, Form, HTTPException, UploadFile, Query
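The replace-then-delete flow proposed above can be sketched as follows. This is a hypothetical reduction under stated assumptions: `old_store`, `run_crawl`, and `recrawl` are stand-ins, not the PR's service layer, and the real deletion would go through `SourceManagementService`.

```python
# Hypothetical sketch of the non-destructive recrawl flow: tag the new crawl
# with the source it replaces, and swap out the old documents only after the
# new crawl succeeds.
old_store = {"src_1": ["old doc"]}

def run_crawl(request: dict) -> dict:
    # stand-in for orchestrate_crawl; pretend the crawl succeeded
    return {"success": True, "docs": ["new doc"], "metadata": request["metadata"]}

def recrawl(source_id: str, request: dict) -> None:
    request.setdefault("metadata", {})["replace_source_id"] = source_id
    result = run_crawl(request)
    if result["success"]:
        # delete/replace the old data only now, after the replacement exists
        replaced = result["metadata"]["replace_source_id"]
        old_store[replaced] = result["docs"]
    # on failure the old documents stay untouched

recrawl("src_1", {"url": "https://example.com"})
print(old_store)  # {'src_1': ['new doc']}
```

If `run_crawl` fails, `old_store` still holds the original documents, which is the data-loss guarantee the review is after.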

994-1064: v2 crawl bypasses concurrency guard; wrap with the semaphore and fix task cleanup.

Crawl v2 should honor CONCURRENT_CRAWL_LIMIT and must not remove the task from the registry before the underlying task completes.

Apply:

 async def _run_crawl_v2(request_dict: dict, progress_id: str):
     """Run the crawl v2 with domain filtering in background."""
     tracker = ProgressTracker(progress_id, operation_type="crawl")
-
-    try:
-        safe_logfire_info(
-            f"Starting crawl v2 with progress tracking | progress_id={progress_id} | url={request_dict['url']}"
-        )
+    crawl_task: asyncio.Task | None = None
+    try:
+        async with crawl_semaphore:
+            safe_logfire_info(
+                f"Starting crawl v2 with progress tracking | progress_id={progress_id} | url={request_dict['url']}"
+            )
@@
-        result = await orchestration_service.orchestrate_crawl(request_dict)
+            result = await orchestration_service.orchestrate_crawl(request_dict)
@@
-        crawl_task = result.get("task")
-        if crawl_task:
-            active_crawl_tasks[progress_id] = crawl_task
-            safe_logfire_info(
-                f"Stored actual crawl v2 task in active_crawl_tasks | progress_id={progress_id}"
-            )
-        else:
-            safe_logfire_error(f"No task returned from orchestrate_crawl v2 | progress_id={progress_id}")
+            crawl_task = result.get("task")
+            if crawl_task:
+                active_crawl_tasks[progress_id] = crawl_task
+                # Clean up registry when the actual crawl completes
+                def _cleanup(_):
+                    if progress_id in active_crawl_tasks:
+                        del active_crawl_tasks[progress_id]
+                        safe_logfire_info(
+                            f"Cleaned up crawl v2 task from registry | progress_id={progress_id}"
+                        )
+                try:
+                    crawl_task.add_done_callback(_cleanup)
+                except Exception:
+                    # Best-effort; don't fail the request on callback errors
+                    pass
+            else:
+                safe_logfire_error(f"No task returned from orchestrate_crawl v2 | progress_id={progress_id}")
@@
-    finally:
-        # Clean up task from registry when done
-        if progress_id in active_crawl_tasks:
-            del active_crawl_tasks[progress_id]
-            safe_logfire_info(f"Cleaned up crawl v2 task from registry | progress_id={progress_id}")
+    finally:
+        # Wrapper-level cleanup no longer removes the actual crawl entry; handled by done callback
+        pass
archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts (1)

433-444: Fix activeOps replacement: also update operation_id and source_id, not just progressId.

Leaving a temp operation_id breaks tracking elsewhere.

Apply:

-        const activeOps = queryClient.getQueryData<ActiveOperationsResponse>(progressKeys.active());
-        if (activeOps) {
-          const updated: ActiveOperationsResponse = {
-            ...activeOps, // Preserve count, timestamp, and any other fields
-            operations: activeOps.operations.map((op) =>
-              op.progressId === context.tempProgressId ? { ...op, progressId: response.progressId } : op,
-            ),
-          };
-          queryClient.setQueryData(progressKeys.active(), updated);
-        }
+        const activeOps = queryClient.getQueryData<ActiveOperationsResponse>(progressKeys.active());
+        if (activeOps) {
+          const updated: ActiveOperationsResponse = {
+            ...activeOps,
+            operations: activeOps.operations.map((op) =>
+              op.operation_id === context.tempProgressId
+                ? {
+                    ...op,
+                    operation_id: response.progressId,
+                    progressId: response.progressId,
+                    source_id: response.progressId,
+                    message: response.message || op.message,
+                  }
+                : op,
+            ),
+          };
+          queryClient.setQueryData(progressKeys.active(), updated);
+        }
archon-ui-main/src/features/knowledge/components/EditCrawlConfigDialog.tsx (1)

82-105: Validate max depth; prevent NaN/invalid values from reaching the API.

parseInt on invalid input yields NaN and will 422/500 backend.

Apply:

   const handleSave = async () => {
     if (!url) {
       showToast("URL is required", "error");
       return;
     }
 
     try {
-      await updateMutation.mutateAsync({
+      const depth = Number((maxDepth || "").trim());
+      if (!Number.isInteger(depth) || depth < 0) {
+        showToast("Crawl depth must be a non-negative integer", "error");
+        return;
+      }
+      await updateMutation.mutateAsync({
         sourceId,
         url,
         knowledge_type: knowledgeType,
-        max_depth: parseInt(maxDepth, 10),
+        max_depth: depth,
         tags: tags.length > 0 ? tags : undefined,
         crawl_config: crawlConfig,
       });
python/src/server/services/crawling/strategies/batch.py (1)

173-186: Filtering on original vs transformed URLs — confirm intent.

You filter before transform_url_func. If the transformation changes the domain (e.g., GitHub raw/blob), rules may be applied to the wrong host. Either filter both original and transformed URLs, or transform first and filter on the final URL, depending on DomainFilter semantics.

Do you want a patch to filter on both forms for safety?

archon-ui-main/src/features/knowledge/types/knowledge.ts (1)

218-221: Confirm casing and serialization for crawl_config across API boundary.

The UI uses snake_case keys (allowed_domains, etc.). Ensure your client and server don’t auto-transform to camelCase, and that undefined is omitted on serialize to avoid noisy payloads.

I can add a small Zod schema to validate shape before POST if helpful.

python/src/server/services/crawling/crawling_service.py (2)

61-70: Add type hint for crawl_config (repeat).

Annotate to Optional[CrawlConfig] to align with imports and tooling. This also prevents “unused import” for CrawlConfig.

Apply this diff:

-    def __init__(self, crawler=None, supabase_client=None, progress_id=None, crawl_config=None):
+    def __init__(self, crawler=None, supabase_client=None, progress_id=None, crawl_config: Optional[CrawlConfig] = None):
@@
-            crawl_config: Optional CrawlConfig for domain filtering
+            crawl_config: Optional CrawlConfig for domain filtering

86-88: LGTM: DomainFilter is wired into both batch and recursive strategies.

Good cohesion and single source of filtering logic.

@leex279 (Collaborator, Author) commented Sep 22, 2025

@CodeRabbit review

@coderabbitai (bot) commented Sep 22, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

- Remove code blocks from JSDoc to prevent parsing conflicts
- Simplify example format to avoid asterisk/slash interpretation issues
- Maintain documentation clarity while ensuring TypeScript compilation
…ng errors

- Replace * with [star] in all JSDoc examples to avoid parser conflicts
- Maintain documentation clarity while ensuring proper compilation
- Fix issue across all pattern examples in CrawlConfig documentation
- Increase dialog max height from 85vh to 90vh for better content visibility
- Improve scrolling behavior with proper padding adjustments
- Fix data loading to properly query item only when dialog is open
- Add comprehensive fallback checks for max_depth, tags, and crawl_config
- Add error handling for failed configuration loads
- Ensure arrays are properly validated before assignment
- Remove debug console.log statements
- Add GET /knowledge-items/{source_id} endpoint to fetch single item
- Fixes 'Method Not Allowed' error in Edit Crawler Configuration dialog
- Returns full item data including metadata for configuration editing
- Use metadata.original_url when available to show the actual crawled URL
- Falls back to item.url if original_url is not present
- Ensures the edit dialog shows the same URL that was originally crawled
- Add stopPropagation wrapper to prevent dialog clicks from bubbling to card
- Include crawl configuration fields at top level of knowledge item response
- Ensure max_depth, tags, and crawl_config are accessible for edit dialog
- Fix issue where clicking inside edit dialog would open document browser
- Add onKeyDown stopPropagation to prevent Enter key from opening document browser
- Auto-expand Advanced Configuration when existing config is present
- Fix issue where pressing Enter in tag input would trigger card action
- Improve UX by showing loaded configuration immediately when editing
- Fix deletion of existing documents before recrawl with new configuration
- Use SourceManagementService.delete_source() instead of non-existent delete_item()
- Ensure all crawl configuration fields are persisted to metadata
- Store knowledge_type, max_depth, tags, and original_url in metadata
- Add proper logging for successful deletions
- Ensure crawl_config is persisted and reloaded correctly
…tence

- Fix scrolling in Edit Crawler Configuration dialog when content is expanded
- Remove negative margins and simplify padding for better scroll behavior
- Add console logging to debug crawl_config persistence issues
- Ensure save button is always visible with proper scrollbar
@Wirasm (Collaborator) commented Nov 24, 2025

@leex279 is this something we are still considering?

@leex279 (Collaborator, Author) commented Nov 25, 2025

@Wirasm yes, need to finish that.

@Wirasm (Collaborator) commented Apr 6, 2026

🔄 This repository is being replaced by a new version of Archon.

The original Python/MCP codebase is being archived to the archive/v1-python-mcp branch. The new Archon (TypeScript workflow engine for AI coding agents) is replacing it.

This PR is being closed as part of the migration. Thank you for your contribution!

@Wirasm Wirasm closed this Apr 6, 2026
@Wirasm Wirasm deleted the feature/advanced-crawl-domain-filtering branch April 6, 2026 07:37
coleam00 added a commit that referenced this pull request Apr 7, 2026
The test 'should call generateAndSetTitle with workflow name and user message'
was failing with 'Cannot create worktree: not in a git repository' because
two mocks were inconsistent with the code path being exercised.

Changes:
- Add `configureIsolation` to the `@archon/isolation` mock (was imported but unmocked)
- Fix `findCodebaseByDefaultCwd` mock to return a codebase object instead of null
  so the isolation block is entered and `generateAndSetTitle` is reached

Fixes #717

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request Apr 16, 2026
…m00#728)

joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…m00#728)



Development

Successfully merging this pull request may close these issues.

feat: Advanced Web Crawling with Domain Configuration
