Skip to content

fix: Improve test mocks and URL validation API#862

Closed
davidrudduck wants to merge 8 commits intocoleam00:mainfrom
davidrudduck:fix/test-mocks-and-url-validation
Closed

fix: Improve test mocks and URL validation API#862
davidrudduck wants to merge 8 commits intocoleam00:mainfrom
davidrudduck:fix/test-mocks-and-url-validation

Conversation

@davidrudduck
Copy link
Copy Markdown

@davidrudduck davidrudduck commented Nov 13, 2025

Fix Test Mock Configuration and Improve URL Validation API

Summary

This PR addresses code review feedback from PR #847 by fixing async mock configurations in integration tests and improving the URL validation module's maintainability.

Changes

1. Fixed Async Mock Configuration (Test Suite)

Files Changed:

  • python/tests/server/api_routes/test_preview_links_integration.py

Issues Fixed:

  • Lines 317-323, 442-449, 475-482: Mock objects were using AsyncMock incorrectly, missing proper async context manager protocol support
  • Lines 165, 170, 206, 211, 249, 254: __aexit__ configurations were inconsistent

Solution:

# Before (incorrect)
mock_response = AsyncMock()
mock_response.status = 200
mock_response.text = AsyncMock(return_value=SAMPLE_LLMS_TXT)
mock_session_instance = AsyncMock()
mock_session_instance.get = AsyncMock(return_value=mock_response)

# After (correct)
mock_response = MagicMock()
mock_response.status = 200
mock_response.text = AsyncMock(return_value=SAMPLE_LLMS_TXT)
mock_response.__aenter__ = AsyncMock(return_value=mock_response)
mock_response.__aexit__ = AsyncMock(return_value=None)

mock_session_instance = MagicMock()
mock_session_instance.get.return_value = mock_response
mock_session_instance.__aenter__ = AsyncMock(return_value=mock_session_instance)
mock_session_instance.__aexit__ = AsyncMock(return_value=None)

Impact:

  • Ensures mocks properly support async with session.get(...) as response: pattern
  • Standardizes all mock configurations across test suite
  • Prevents potential test failures when endpoint uses async context managers

2. Extracted Magic Numbers to Constants

File Changed:

  • python/src/server/utils/url_validation.py

Improvement:

# Added module-level constants for better maintainability
MAX_GLOB_PATTERNS = 50
MAX_PATTERN_LENGTH = 200

# Updated function to use constants
if len(patterns) > MAX_GLOB_PATTERNS:
    raise HTTPException(
        status_code=400,
        detail=f"Too many patterns. Maximum {MAX_GLOB_PATTERNS} allowed."
    )

if len(pattern) > MAX_PATTERN_LENGTH:
    raise HTTPException(
        status_code=400,
        detail=f"Pattern too long (max {MAX_PATTERN_LENGTH} characters): {pattern[:50]}..."
    )

Benefits:

  • Easier to adjust limits in one place
  • Self-documenting code
  • Consistent error messages
  • Follows DRY principle

3. Improved API Naming Consistency

Files Changed:

  • python/src/server/utils/url_validation.py
  • python/src/server/api_routes/knowledge_api.py

Changes:

  • Renamed validate_url_against_ssrf() to validate_url() for cleaner API
  • Added backward compatibility alias: validate_url_against_ssrf = validate_url
  • Updated imports to use new function name

Rationale:

  • Shorter, cleaner function name
  • SSRF protection is implied in URL validation context
  • Maintains backward compatibility for existing code

4. Fixed Test Expectations

File Changed:

  • python/tests/server/api_routes/test_preview_links_integration.py

Fix:

  • Updated test_llms_full_txt_with_patterns to correctly expect that llms-full.txt is NOT treated as a link collection
  • Added clarifying comment about intended behavior

Context:
Per the URLHandler implementation, files with "full" in the name (like llms-full.txt) are intentionally NOT treated as link collections to preserve single-page crawl behavior. The test now correctly validates this design decision.

# Verify: llms-full.txt is explicitly NOT treated as a link collection
# It should be crawled as a single page to preserve full-content behavior
assert result["is_link_collection"] is False
assert result["collection_type"] is None
assert "not a link collection" in result["message"].lower()

Testing

All tests passing:

# Integration tests
uv run pytest tests/server/api_routes/test_preview_links_integration.py -v
# ✅ 11/11 tests passing

# URL validation tests
uv run pytest tests/server/utils/test_url_validation.py -v
# ✅ 34/34 tests passing

Test Coverage:

  • ✅ Async context manager protocol support
  • ✅ SSRF protection (localhost, private IPs, invalid protocols)
  • ✅ Glob pattern sanitization (command injection, path traversal, length limits)
  • ✅ Link collection detection (llms.txt, sitemap.xml)
  • ✅ Edge cases (empty files, non-link collections)

Validation Results

Security validation now works correctly:

# SSRF protection - returns 400 for localhost
request = LinkPreviewRequest(url="http://localhost:8080/llms.txt", ...)
# HTTPException: status_code=400, detail="Access to localhost is not allowed"

# Pattern sanitization - returns 400 for command injection
request = LinkPreviewRequest(
    url="https://docs.example.com/llms.txt",
    url_include_patterns=["david"],
    ...
)
# HTTPException: status_code=400, detail="Invalid characters in pattern: david"

Breaking Changes

None. The validate_url_against_ssrf function is still available as an alias.

Checklist

  • All tests passing (45/45)
  • Code follows project style guidelines
  • No breaking changes introduced
  • Security validations working correctly
  • Mock configurations standardized
  • Constants extracted for maintainability
  • Test expectations match implementation

Related Issues

Addresses code review feedback from PR #847:

  • Mock configuration for async context manager support
  • Magic number extraction for better maintainability
  • Test expectation corrections

Summary by CodeRabbit

  • New Features

    • Added GitHub repository auto-configuration with pre-populated URL patterns and depth settings
    • Introduced link review workflow to preview and select specific links before crawling (supports llms.txt, sitemap.xml)
    • Added glob pattern filtering support for precise URL inclusion/exclusion during crawl operations
  • Documentation

    • Added comprehensive testing strategy guide and glob pattern filtering reference
  • Bug Fixes

    • Improved dialog content area layout for better height responsiveness

Implements interactive link review and URL filtering for llms.txt and sitemap.xml crawling:

Backend changes:
- Add glob pattern matching utility (url_handler.py)
- Create preview endpoint POST /api/crawl/preview-links for link collection analysis
- Update crawl request models to support url_include_patterns, url_exclude_patterns, selected_urls, skip_link_review
- Integrate pattern filtering into crawling logic with selected_urls support
- Use aiohttp for fast link collection fetching (replaces slow browser crawling for .txt files)

Frontend changes:
- Add LinkReviewModal component for interactive link selection before crawling
- Update AddKnowledgeDialog with pattern filter inputs and "Review links" checkbox
- Add preview flow: detects link collections → shows modal → user selects links → crawls only selected
- Fix dialog.tsx wrapper to support full-height flex layouts (h-full class)
- Replace invalid <p> nesting with <div> elements for HTML standards compliance

Features:
- Glob pattern filtering (e.g., **/en/** to include only English pages)
- Interactive link preview modal with bulk select/deselect, search, and individual selection
- Auto-selection based on filter patterns with "Matches Filter" badges
- Scrollable link list supporting 2000+ links
- Apply Filters button to refine selection in real-time

Fixes scroll issues by ensuring proper flex layout height propagation in dialog components.
…Hub auto-config

This commit extends the glob pattern filtering feature (from commit 74023e1) to
support recursive crawls and improves GitHub repository handling.

## Changes

### Backend - Recursive Crawl Filtering
- Add include_patterns and exclude_patterns parameters to RecursiveCrawlStrategy
- Filter internal links during discovery (before adding to crawl queue)
- Pass patterns through entire call chain (orchestration → service → strategy)
- Add comprehensive logging for pattern configuration and filtered URLs
- Performance: Prevents unnecessary HTTP requests and memory usage

Files:
- python/src/server/services/crawling/strategies/recursive.py:
  * Lines 45-46: Add pattern parameters to function signature
  * Lines 59-60: Update docstring
  * Lines 173-178: Log pattern configuration at crawl start
  * Lines 316-339: Implement filtering logic during link discovery

- python/src/server/services/crawling/crawling_service.py:
  * Lines 271-272: Add parameters to wrapper method
  * Lines 283-284: Pass patterns to recursive strategy
  * Lines 349-356: Add early logging for crawl parameters
  * Lines 1145-1153: Extract and pass patterns from request

### Frontend - Improved GitHub Auto-Configuration
- Change GitHub auto-config from path-based to code-only patterns
- Use **/tree/**, **/blob/** instead of /username/repo*
- Automatically excludes issues, PRs, actions, wiki, etc.
- More efficient and future-proof than exclusion lists

Files:
- archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx:
  * Lines 73-75: Updated pattern generation logic

### Documentation
- Add comprehensive glob pattern guide with examples
- Document GitHub auto-configuration rationale
- Include pattern syntax, use cases, and testing tips

Files:
- docs/GLOB_PATTERNS.md: New file (253 lines)

## Benefits

1. **Memory Efficiency**: Prevents memory errors on large GitHub repositories
2. **Performance**: Filters URLs before crawling (saves HTTP requests)
3. **Storage**: Reduces database writes (fewer pages to store)
4. **User Experience**: GitHub repos now auto-configured optimally

## Testing

- Unit tests: All passing (19/19 glob pattern tests)
- Frontend tests: All passing (29/29 LinkReviewModal tests)
- Integration tests: Pre-existing failures unrelated to this feature
- Manual testing: GitHub crawl with code-only patterns verified

## Pattern Examples

Documentation sites (language filtering):
  **/en/**, !**/api/**, !**/changelog/**

GitHub repositories (code only):
  **/tree/**, **/blob/**

Blog sites:
  **/blog/**, !**/draft/**

## Related

- Builds on commit 74023e1 (glob pattern filtering for link collections)
- Resolves memory issues with GitHub repository crawling
- Implements recursive crawl filtering requested in design discussions
Changes LinkReviewModal props from a single unified pattern string to separate
initialIncludePatterns and initialExcludePatterns arrays for better type safety
and clearer intent.

- LinkReviewModal: Accept array props, reconstruct unified string internally
- AddKnowledgeDialog: Parse comma-separated patterns into arrays before passing
- Tests: Updated to use new array-based interface

All 29 tests passing.
Extract duplicated parseUrlPatterns function from AddKnowledgeDialog and
LinkReviewModal into a shared utility module to reduce code duplication
and improve maintainability.

Changes:
- Created archon-ui-main/src/features/knowledge/utils/patternParser.ts
- Exported parseUrlPatterns and ParsedPatterns interface
- Updated AddKnowledgeDialog to import from shared utility
- Updated LinkReviewModal to import from shared utility
- All 29 LinkReviewModal tests passing

Addresses CodeRabbit feedback on PR coleam00#847
…e test coverage

Code Review Fixes:
- Change previewData type from any to LinkPreviewResponse | null for type safety
- Extract buildCrawlRequest() helper to eliminate duplication
- Document useEffect dependency array behavior (intentionally omits patterns/tags)

Test Coverage Added:
- AddKnowledgeDialog: 30 comprehensive tests (24 passing, 80% success rate)
  - Basic rendering and tab navigation
  - GitHub auto-configuration (patterns, depth, tags)
  - Crawl submission with/without link review
  - Upload functionality and file handling
  - buildCrawlRequest helper validation
  - Form validation and reset behavior

- URL Validation: 34 tests (100% passing, 94% coverage)
  - SSRF protection (15 tests)
  - Glob pattern sanitization (15 tests)
  - DNS failure handling
  - Malformed URL handling

- Glob Pattern Filtering: 19 unit tests (100% passing)
  - Include/exclude pattern logic
  - Pattern precedence
  - Real-world documentation patterns

Documentation:
- Updated TESTING.md with new test reference implementations
- Documented AddKnowledgeDialog test patterns (mock setup, TooltipProvider)
- Marked PR coleam00#847 test tasks as completed

Files Changed:
- archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx
- archon-ui-main/src/features/knowledge/components/__tests__/AddKnowledgeDialog.test.tsx (new)
- python/src/server/utils/url_validation.py (new)
- python/tests/server/utils/test_url_validation.py (new)
- python/tests/server/services/crawling/helpers/test_url_handler_glob_patterns.py (new)
- python/tests/server/api_routes/test_preview_links_integration.py (new)
- TESTING.md

Test Results:
- Backend: 53/53 tests passing for PR-related code
- Frontend: 53/59 tests passing (LinkReviewModal 29/29, AddKnowledgeDialog 24/30)
- 6 AddKnowledgeDialog tests have minor UI selector timing issues, not functional bugs

Related: coleam00#847
Address three issues identified in code review:

1. LinkReviewModal: Fix search to preserve pattern filters
   - Add baseLinks state to persist pattern-filtered results
   - Update search effect to filter from baseLinks (fallback to previewData.links)
   - Modify handleApplyFilters to update baseLinks before setFilteredLinks
   - Searches now correctly apply to currently filtered set

2. url_validation: Fix IPv6 SSRF bypass vulnerability
   - Replace IPv4-only socket.gethostbyname with comprehensive check
   - Parse numeric IPs first with ipaddress.ip_address (handles IPv4/IPv6)
   - Use socket.getaddrinfo for hostname resolution (both IPv4/IPv6)
   - Validate all resolved addresses for private/loopback/link-local
   - Proper exception handling for socket.gaierror and ValueError

3. Tests: Fix aiohttp mock async context manager
   - Change from AsyncMock() to MagicMock() for objects
   - Configure __aenter__ and __aexit__ as AsyncMock
   - Use .return_value property instead of AsyncMock(return_value=...)
   - Ensures "async with session.get(...)" works correctly

All 11 integration tests passing.
Add explicit documentation about IPv4/IPv6 support in SSRF protection:
- Document both IPv4 and IPv6 address blocking
- Specify IP ranges (127.0.0.0/8, ::1, 169.254.0.0/16, fe80::/10)
- Clarify DNS rebinding and IPv6 literal bypass prevention
- Fix async mock configuration in integration tests
  * Use MagicMock instead of AsyncMock for response objects
  * Add explicit __aenter__ and __aexit__ configuration
  * Standardize all __aexit__ to return None explicitly

- Extract magic numbers to module-level constants
  * Add MAX_GLOB_PATTERNS = 50
  * Add MAX_PATTERN_LENGTH = 200
  * Improve maintainability and consistency

- Rename validate_url_against_ssrf() to validate_url()
  * Cleaner API naming
  * Add backward compatibility alias
  * Update imports in knowledge_api.py

- Fix test expectations for llms-full.txt behavior
  * Correctly expect NOT a link collection
  * Aligns with URLHandler design (preserve single-page crawl)

Addresses code review feedback from PR coleam00#847.

Tests: ✅ 45/45 passing (11 integration + 34 validation)
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Nov 13, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

This pull request introduces glob pattern-based URL filtering and a link preview/review workflow for web crawling. Frontend components enable users to review discovered links (from llms.txt, sitemap.xml) before crawling with optional pattern filtering; backend adds a /crawl/preview-links endpoint, pattern matching logic, and SSRF protection. GitHub repository URLs trigger auto-configuration of patterns and crawl depth.

Changes

Cohort / File(s) Summary
Documentation & Reference
TESTING.md, docs/GLOB_PATTERNS.md
New comprehensive testing guide with frameworks, coverage targets, and CI integration; new glob pattern filtering guide with syntax, use cases, GitHub auto-configuration details, and API integration examples.
Frontend Components & Exports
archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx, archon-ui-main/src/features/knowledge/components/LinkReviewModal.tsx, archon-ui-main/src/features/knowledge/components/index.ts
New LinkReviewModal component for previewing and selecting links; AddKnowledgeDialog enhanced with GitHub-aware auto-configuration, unified URL pattern input, Review Links checkbox, and preview-links API integration; component index updated to export LinkReviewModal.
Frontend Component Tests
archon-ui-main/src/features/knowledge/components/__tests__/AddKnowledgeDialog.test.tsx, archon-ui-main/src/features/knowledge/components/__tests__/LinkReviewModal.test.tsx
Comprehensive test suites covering crawl/upload workflows, GitHub auto-config, link review flows, pattern filtering, bulk selection, search, error handling, and accessibility.
Frontend Types & Utils
archon-ui-main/src/features/knowledge/types/knowledge.ts, archon-ui-main/src/features/knowledge/utils/patternParser.ts, archon-ui-main/src/features/knowledge/utils/index.ts
CrawlRequest extended with url_include_patterns, url_exclude_patterns, selected_urls, skip_link_review; new LinkPreviewRequest, PreviewLink, LinkPreviewResponse types; new parseUrlPatterns utility for comma-separated pattern parsing; utils index re-exports patternParser.
Frontend UI Primitive
archon-ui-main/src/features/ui/primitives/dialog.tsx
DialogContent inner container updated with h-full utility for full-height content area.
Backend API Routes
python/src/server/api_routes/knowledge_api.py
New POST /crawl/preview-links endpoint for previewing link collections without crawling; KnowledgeItemRequest and CrawlRequest augmented with pattern/URL filtering fields; SSRF validation and sanitization applied.
Backend Crawling Services
python/src/server/services/crawling/crawling_service.py, python/src/server/services/crawling/strategies/recursive.py, python/src/server/services/crawling/helpers/url_handler.py
crawl_recursive_with_progress extended with include_patterns and exclude_patterns parameters; recursive strategy enhanced to apply glob filtering before enqueueing URLs; URLHandler gains matches_glob_patterns static method for deterministic pattern-based URL filtering.
Backend URL Validation Utils
python/src/server/utils/url_validation.py
New module providing validate_url (SSRF defense against localhost, private IPs, unsafe schemes) and sanitize_glob_patterns (pattern count/length limits, path traversal/injection prevention).
Backend Integration Tests
python/tests/server/api_routes/test_preview_links_integration.py, python/tests/server/services/crawling/helpers/test_url_handler_glob_patterns.py, python/tests/server/utils/test_url_validation.py
Integration tests for /crawl/preview-links with llms.txt and sitemap.xml filtering scenarios; unit tests for URLHandler.matches_glob_patterns; comprehensive URL validation and glob pattern sanitization test suites covering SSRF protection and security constraints.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant AddKnowledgeDialog as AddKnowledgeDialog
    participant API as /crawl/preview-links
    participant LinkReviewModal as LinkReviewModal
    participant Crawl as /crawl Endpoint

    User->>AddKnowledgeDialog: Enter GitHub URL + enable Review Links
    AddKnowledgeDialog->>AddKnowledgeDialog: Auto-configure patterns (GitHub repo)
    User->>AddKnowledgeDialog: Click Start Crawl
    AddKnowledgeDialog->>API: POST preview-links request
    API->>API: Detect link collection (llms.txt/sitemap)
    API-->>AddKnowledgeDialog: LinkPreviewResponse with links & counts
    AddKnowledgeDialog->>LinkReviewModal: Open modal with preview data
    rect rgb(200, 220, 255)
        Note over LinkReviewModal: Link Selection & Filtering Phase
        User->>LinkReviewModal: Search/filter/select links
        LinkReviewModal->>LinkReviewModal: Apply glob patterns (optional)
    end
    User->>LinkReviewModal: Click Proceed
    LinkReviewModal->>AddKnowledgeDialog: Return selected_urls
    AddKnowledgeDialog->>Crawl: POST crawl (selected_urls or skip_link_review)
    Crawl-->>User: Crawl begins with filtered links
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring close attention:

  • Backend URL pattern matching logic (url_handler.py, recursive.py): Glob pattern semantics, fnmatch correctness, and filtering precedence (exclude-first, then include) should be validated against real-world URLs and edge cases.
  • SSRF protection implementation (url_validation.py): DNS resolution handling, private IP detection, and protocol validation require careful security review; check for bypasses via IPv6, numeric representations, or resolution edge cases.
  • Link preview endpoint (knowledge_api.py): URL validation, sanitization, and collection type detection (llms.txt vs sitemap.xml vs non-collection) logic; ensure safe handling of external content.
  • Frontend link review workflow (AddKnowledgeDialog.tsx, LinkReviewModal.tsx): State management during preview/filtering/selection phases; modal cancel/confirm flows and edge cases (empty selections, API errors, filter re-application).
  • Test coverage depth: Comprehensive test suites are provided; verify integration test mocking accuracy and realistic crawl scenarios.

Possibly related PRs

Suggested reviewers

  • Wirasm
  • coleam00

Poem

🐰 A rabbit's delight in patterns so grand,
Glob filters and links now hand-in-hand,
Review before crawl, no surprises in sight,
GitHub repos configured just right,
Safety and speed in a hoppy new light! ✨

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e66f34 and 3b051f7.

📒 Files selected for processing (19)
  • TESTING.md (1 hunks)
  • archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx (7 hunks)
  • archon-ui-main/src/features/knowledge/components/LinkReviewModal.tsx (1 hunks)
  • archon-ui-main/src/features/knowledge/components/__tests__/AddKnowledgeDialog.test.tsx (1 hunks)
  • archon-ui-main/src/features/knowledge/components/__tests__/LinkReviewModal.test.tsx (1 hunks)
  • archon-ui-main/src/features/knowledge/components/index.ts (1 hunks)
  • archon-ui-main/src/features/knowledge/types/knowledge.ts (1 hunks)
  • archon-ui-main/src/features/knowledge/utils/index.ts (1 hunks)
  • archon-ui-main/src/features/knowledge/utils/patternParser.ts (1 hunks)
  • archon-ui-main/src/features/ui/primitives/dialog.tsx (1 hunks)
  • docs/GLOB_PATTERNS.md (1 hunks)
  • python/src/server/api_routes/knowledge_api.py (11 hunks)
  • python/src/server/services/crawling/crawling_service.py (6 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (1 hunks)
  • python/src/server/services/crawling/strategies/recursive.py (4 hunks)
  • python/src/server/utils/url_validation.py (1 hunks)
  • python/tests/server/api_routes/test_preview_links_integration.py (1 hunks)
  • python/tests/server/services/crawling/helpers/test_url_handler_glob_patterns.py (1 hunks)
  • python/tests/server/utils/test_url_validation.py (1 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@davidrudduck
Copy link
Copy Markdown
Author

Closing - these changes should be added to PR #847 instead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant