feat: Add glob pattern filtering for recursive crawls and improve GitHub auto-config #859
davidrudduck wants to merge 2 commits into coleam00:main
Implements interactive link review and URL filtering for llms.txt and sitemap.xml crawling:

Backend changes:
- Add glob pattern matching utility (url_handler.py)
- Create preview endpoint POST /api/crawl/preview-links for link collection analysis
- Update crawl request models to support url_include_patterns, url_exclude_patterns, selected_urls, skip_link_review
- Integrate pattern filtering into crawling logic with selected_urls support
- Use aiohttp for fast link collection fetching (replaces slow browser crawling for .txt files)

Frontend changes:
- Add LinkReviewModal component for interactive link selection before crawling
- Update AddKnowledgeDialog with pattern filter inputs and "Review links" checkbox
- Add preview flow: detects link collections → shows modal → user selects links → crawls only selected
- Fix dialog.tsx wrapper to support full-height flex layouts (h-full class)
- Replace invalid <p> nesting with <div> elements for HTML standards compliance

Features:
- Glob pattern filtering (e.g., **/en/** to include only English pages)
- Interactive link preview modal with bulk select/deselect, search, and individual selection
- Auto-selection based on filter patterns with "Matches Filter" badges
- Scrollable link list supporting 2000+ links
- Apply Filters button to refine selection in real-time

Fixes scroll issues by ensuring proper flex layout height propagation in dialog components.
…Hub auto-config

This commit extends the glob pattern filtering feature (from commit 74023e1) to support recursive crawls and improves GitHub repository handling.

## Changes

### Backend - Recursive Crawl Filtering
- Add include_patterns and exclude_patterns parameters to RecursiveCrawlStrategy
- Filter internal links during discovery (before adding to crawl queue)
- Pass patterns through entire call chain (orchestration → service → strategy)
- Add comprehensive logging for pattern configuration and filtered URLs
- Performance: Prevents unnecessary HTTP requests and memory usage

Files:
- python/src/server/services/crawling/strategies/recursive.py:
  * Lines 45-46: Add pattern parameters to function signature
  * Lines 59-60: Update docstring
  * Lines 173-178: Log pattern configuration at crawl start
  * Lines 316-339: Implement filtering logic during link discovery
- python/src/server/services/crawling/crawling_service.py:
  * Lines 271-272: Add parameters to wrapper method
  * Lines 283-284: Pass patterns to recursive strategy
  * Lines 349-356: Add early logging for crawl parameters
  * Lines 1145-1153: Extract and pass patterns from request

### Frontend - Improved GitHub Auto-Configuration
- Change GitHub auto-config from path-based to code-only patterns
- Use **/tree/**, **/blob/** instead of /username/repo*
- Automatically excludes issues, PRs, actions, wiki, etc.
- More efficient and future-proof than exclusion lists

Files:
- archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx:
  * Lines 73-75: Updated pattern generation logic

### Documentation
- Add comprehensive glob pattern guide with examples
- Document GitHub auto-configuration rationale
- Include pattern syntax, use cases, and testing tips

Files:
- docs/GLOB_PATTERNS.md: New file (253 lines)

## Benefits
1. **Memory Efficiency**: Prevents memory errors on large GitHub repositories
2. **Performance**: Filters URLs before crawling (saves HTTP requests)
3. **Storage**: Reduces database writes (fewer pages to store)
4. **User Experience**: GitHub repos now auto-configured optimally

## Testing
- Unit tests: All passing (19/19 glob pattern tests)
- Frontend tests: All passing (29/29 LinkReviewModal tests)
- Integration tests: Pre-existing failures unrelated to this feature
- Manual testing: GitHub crawl with code-only patterns verified

## Pattern Examples
Documentation sites (language filtering): **/en/**, !**/api/**, !**/changelog/**
GitHub repositories (code only): **/tree/**, **/blob/**
Blog sites: **/blog/**, !**/draft/**

## Related
- Builds on commit 74023e1 (glob pattern filtering for link collections)
- Resolves memory issues with GitHub repository crawling
- Implements recursive crawl filtering requested in design discussions
Walkthrough
This PR introduces glob pattern filtering for URL crawling and a link-review workflow across the system. It adds GitHub URL auto-configuration in the frontend, a new LinkReviewModal component for previewing collected links, a …
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant AddKnowledgeDialog
participant LinkReviewModal
participant Frontend API
participant Backend
User->>AddKnowledgeDialog: Enter URL & enable Review Links
AddKnowledgeDialog->>Frontend API: POST /crawl/preview-links
Frontend API->>Backend: LinkPreviewRequest (url, patterns)
Backend->>Backend: Detect link collection<br/>(sitemap.xml, llms.txt)
Backend->>Frontend API: LinkPreviewResponse<br/>(is_link_collection, links, matches)
Frontend API->>AddKnowledgeDialog: Preview data
AddKnowledgeDialog->>LinkReviewModal: Show modal with links
User->>LinkReviewModal: Filter/search/select links
LinkReviewModal->>Frontend API: Apply filters (updated patterns)
Frontend API->>Backend: Re-fetch filtered links
Backend->>Frontend API: Updated link list
LinkReviewModal->>AddKnowledgeDialog: Return selected_urls
User->>AddKnowledgeDialog: Submit crawl
AddKnowledgeDialog->>Frontend API: POST /knowledge (CrawlRequest)
Frontend API->>Backend: CrawlRequest (patterns, selected_urls)
Backend->>Backend: Crawl with glob filtering
Backend->>Frontend API: Crawl results
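The preview step in the diagram can be exercised with a request along the following lines. This is only a sketch: the field names `url`, `url_include_patterns`, and `url_exclude_patterns` and the path `/api/crawl/preview-links` come from the review snippets in this PR, while the helper name `build_preview_request` and the `api_base` parameter are hypothetical and exist only for illustration.

```python
import json
import urllib.request


def build_preview_request(
    api_base: str,
    url: str,
    include_patterns: list[str],
    exclude_patterns: list[str],
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the link preview endpoint.

    api_base is an assumption: pass whatever host serves the backend,
    rather than hard-coding one in the caller.
    """
    body = {
        "url": url,
        "url_include_patterns": include_patterns,
        "url_exclude_patterns": exclude_patterns,
    }
    return urllib.request.Request(
        f"{api_base.rstrip('/')}/api/crawl/preview-links",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_preview_request(
    "http://localhost:8181",
    "https://docs.example.com/llms.txt",
    ["**/en/**"],
    ["**/api/**"],
)
```

Passing the base URL in as a parameter keeps the caller free of a fixed host, which is the same concern raised in the review about the hard-coded localhost endpoint.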
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes
Closing this PR - changes will be added to the original PR #847 instead
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/src/server/api_routes/knowledge_api.py (1)
1016-1023: Propagate pattern and selection settings into the crawl request.
KnowledgeItemRequest now carries url_include_patterns, url_exclude_patterns, selected_urls, and skip_link_review, but _perform_crawl_with_progress drops all of them when constructing request_dict. As a result, the orchestrator always sees empty filters and the new glob/selection logic never runs.
 request_dict = {
     "url": str(request.url),
     "knowledge_type": request.knowledge_type,
     "tags": request.tags or [],
     "max_depth": request.max_depth,
     "extract_code_examples": request.extract_code_examples,
     "generate_summary": True,
+    "url_include_patterns": request.url_include_patterns,
+    "url_exclude_patterns": request.url_exclude_patterns,
+    "selected_urls": request.selected_urls,
+    "skip_link_review": request.skip_link_review,
 }
🧹 Nitpick comments (2)
archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx (2)
63-88: Add the missing hook dependencies
This effect reads urlPatterns, tags, and maxDepth, but the dependency array only tracks crawlUrl. That will trip the react-hooks/exhaustive-deps lint rule, and it also risks stale reads (e.g., if the user adjusts tags or max depth before tweaking the URL, the effect keeps operating on the old values). Including the referenced state in the dependency list keeps the auto-config logic predictable while the existing guards prevent infinite loops.
 useEffect(() => {
   // Only auto-populate if the URL has changed and patterns are empty
   if (!crawlUrl) return;
   // Detect GitHub URL (supports https://, http://, or just github.com)
   const githubUrlPattern = /^(?:https?:\/\/)?(?:www\.)?github\.com\/([^\/]+)\/([^\/\?#]+)/i;
   const match = crawlUrl.match(githubUrlPattern);
   if (match) {
     // Only auto-populate if patterns are currently empty (don't override user edits)
     if (!urlPatterns) {
       // Use code-only patterns: only crawl tree (directories) and blob (files) pages
       setUrlPatterns("**/tree/**, **/blob/**");
     }
     // Auto-add "GitHub Repo" tag if not already present
     if (!tags.includes("GitHub Repo")) {
       setTags((prevTags) => [...prevTags, "GitHub Repo"]);
     }
     // Set max depth to 3 for GitHub repos (to traverse nested directories)
     if (maxDepth === "2") {
       setMaxDepth("3");
     }
   }
- }, [crawlUrl]); // Only depend on crawlUrl to avoid infinite loops
+ }, [crawlUrl, urlPatterns, tags, maxDepth]);
54-56: Preserve the LinkPreviewResponse typing
We already import LinkPreviewResponse; keeping the state as any throws away compile-time guarantees and forces downstream null checks by hand. Switching the state to LinkPreviewResponse | null retains type safety while still modelling the "no preview yet" case.
- const [previewData, setPreviewData] = useState<any>(null);
+ const [previewData, setPreviewData] = useState<LinkPreviewResponse | null>(null);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx (6 hunks)
archon-ui-main/src/features/knowledge/components/LinkReviewModal.tsx (1 hunks)
archon-ui-main/src/features/knowledge/components/index.ts (1 hunks)
archon-ui-main/src/features/knowledge/types/knowledge.ts (1 hunks)
archon-ui-main/src/features/ui/primitives/dialog.tsx (1 hunks)
docs/GLOB_PATTERNS.md (1 hunks)
python/src/server/api_routes/knowledge_api.py (4 hunks)
python/src/server/services/crawling/crawling_service.py (6 hunks)
python/src/server/services/crawling/helpers/url_handler.py (1 hunks)
python/src/server/services/crawling/strategies/recursive.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
archon-ui-main/src/**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
archon-ui-main/src/**/*.{ts,tsx}: Frontend TypeScript must use strict mode with no implicit any
Use TanStack Query for all data fetching; avoid prop drilling
Use database values directly in the frontend; avoid mapping layers between BE and FE types
Files:
archon-ui-main/src/features/ui/primitives/dialog.tsx
archon-ui-main/src/features/knowledge/components/index.ts
archon-ui-main/src/features/knowledge/types/knowledge.ts
archon-ui-main/src/features/knowledge/components/LinkReviewModal.tsx
archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx
archon-ui-main/src/features/**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Use Biome in features: 120 character line length, double quotes, and trailing commas
Files:
archon-ui-main/src/features/ui/primitives/dialog.tsx
archon-ui-main/src/features/knowledge/components/index.ts
archon-ui-main/src/features/knowledge/types/knowledge.ts
archon-ui-main/src/features/knowledge/components/LinkReviewModal.tsx
archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx
archon-ui-main/src/features/ui/primitives/**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Use Radix UI primitives from src/features/ui/primitives when creating UI components
Files:
archon-ui-main/src/features/ui/primitives/dialog.tsx
python/src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code
Files:
python/src/server/services/crawling/helpers/url_handler.py
python/src/server/api_routes/knowledge_api.py
python/src/server/services/crawling/strategies/recursive.py
python/src/server/services/crawling/crawling_service.py
archon-ui-main/src/features/*/components/**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Place new UI components under src/features/[feature]/components
Files:
archon-ui-main/src/features/knowledge/components/index.ts
archon-ui-main/src/features/knowledge/components/LinkReviewModal.tsx
archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx
archon-ui-main/src/features/*/types/**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Define shared types under src/features/[feature]/types
Files:
archon-ui-main/src/features/knowledge/types/knowledge.ts
🧠 Learnings (7)
📚 Learning: 2025-09-19T10:32:55.580Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-09-19T10:32:55.580Z
Learning: Applies to archon-ui-main/src/features/**/*.{tsx} : Apply Tron-inspired glassmorphism styling with Tailwind in feature UI components
Applied to files:
archon-ui-main/src/features/ui/primitives/dialog.tsx
archon-ui-main/src/features/knowledge/components/index.ts
archon-ui-main/src/features/knowledge/components/LinkReviewModal.tsx
📚 Learning: 2025-09-19T10:32:55.580Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-09-19T10:32:55.580Z
Learning: Applies to archon-ui-main/src/features/*/types/**/*.{ts,tsx} : Define shared types under src/features/[feature]/types
Applied to files:
archon-ui-main/src/features/knowledge/components/index.ts
📚 Learning: 2025-09-19T10:32:55.580Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-09-19T10:32:55.580Z
Learning: Applies to archon-ui-main/src/features/*/components/**/*.{ts,tsx} : Place new UI components under src/features/[feature]/components
Applied to files:
archon-ui-main/src/features/knowledge/components/index.ts
archon-ui-main/src/features/knowledge/components/LinkReviewModal.tsx
📚 Learning: 2025-09-19T10:32:55.580Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-09-19T10:32:55.580Z
Learning: Applies to archon-ui-main/src/features/ui/primitives/**/*.{ts,tsx} : Use Radix UI primitives from src/features/ui/primitives when creating UI components
Applied to files:
archon-ui-main/src/features/knowledge/components/index.ts
archon-ui-main/src/features/knowledge/components/LinkReviewModal.tsx
📚 Learning: 2025-09-04T16:30:05.227Z
Learnt from: stevepresley
Repo: coleam00/Archon PR: 573
File: archon-ui-main/src/config/api.ts:15-25
Timestamp: 2025-09-04T16:30:05.227Z
Learning: Archon UI API config: Prefer lazy getters getApiFullUrl() and getWsUrl() over module-load constants to avoid SSR/test crashes. Avoid CommonJS exports patterns (Object.defineProperty(exports,…)) in ESM. Add typeof window guards with VITE_API_URL fallback inside getApiUrl()/getWebSocketUrl() when SSR safety is required.
Applied to files:
archon-ui-main/src/features/knowledge/components/index.ts
📚 Learning: 2025-09-19T10:32:55.580Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-09-19T10:32:55.580Z
Learning: Applies to archon-ui-main/src/features/*/hooks/**/*.{ts,tsx} : Use feature-scoped TanStack Query hooks in src/features/[feature]/hooks
Applied to files:
archon-ui-main/src/features/knowledge/components/index.ts
archon-ui-main/src/features/knowledge/components/LinkReviewModal.tsx
📚 Learning: 2025-08-28T13:07:24.810Z
Learnt from: Wirasm
Repo: coleam00/Archon PR: 514
File: archon-ui-main/src/services/crawlProgressService.ts:35-39
Timestamp: 2025-08-28T13:07:24.810Z
Learning: The crawlProgressService.ts in the Archon codebase should be deprecated in favor of the existing useCrawlProgressPolling hook from usePolling.ts, which already includes ETag support, 304 handling, tab visibility detection, and proper React lifecycle integration. This consolidation reduces code duplication and improves performance.
Applied to files:
archon-ui-main/src/features/knowledge/types/knowledge.ts
archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx
🧬 Code graph analysis (6)
archon-ui-main/src/features/knowledge/types/knowledge.ts (1)
python/src/server/api_routes/knowledge_api.py (1)
LinkPreviewRequest (191-203)
python/src/server/api_routes/knowledge_api.py (4)
archon-ui-main/src/features/knowledge/types/knowledge.ts (1)
LinkPreviewRequest (152-156)
python/src/server/config/logfire_config.py (2)
safe_logfire_info (224-236)
safe_logfire_error (239-251)
python/src/server/services/crawling/crawling_service.py (1)
parse_sitemap (243-245)
python/src/server/services/crawling/helpers/url_handler.py (7)
is_sitemap (21-38)
is_txt (61-77)
is_markdown (41-58)
is_link_collection_file (390-456)
extract_markdown_links_with_text (298-387)
is_binary_file (80-177)
matches_glob_patterns (710-779)
python/src/server/services/crawling/strategies/recursive.py (1)
python/src/server/services/crawling/helpers/url_handler.py (2)
is_binary_file (80-177)
matches_glob_patterns (710-779)
archon-ui-main/src/features/knowledge/components/LinkReviewModal.tsx (5)
archon-ui-main/src/features/knowledge/types/knowledge.ts (2)
LinkPreviewResponse (165-173)
PreviewLink (158-163)
archon-ui-main/src/features/ui/primitives/dialog.tsx (4)
Dialog (7-7)
DialogContent (32-81)
DialogHeader (85-89)
DialogTitle (105-119)
archon-ui-main/src/features/ui/primitives/styles.ts (2)
cn (605-607)
glassCard (122-566)
archon-ui-main/src/features/ui/primitives/input.tsx (1)
Input (8-29)
archon-ui-main/src/features/ui/primitives/button.tsx (1)
Button (11-154)
archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx (3)
archon-ui-main/src/features/shared/api/apiClient.ts (1)
callAPIWithETag (43-134)
archon-ui-main/src/features/knowledge/types/knowledge.ts (2)
LinkPreviewResponse (165-173)
CrawlRequest (136-149)
archon-ui-main/src/features/knowledge/components/LinkReviewModal.tsx (1)
LinkReviewModal (22-299)
python/src/server/services/crawling/crawling_service.py (2)
python/src/server/services/crawling/helpers/url_handler.py (1)
matches_glob_patterns (710-779)
python/src/server/services/crawling/strategies/recursive.py (1)
crawl_recursive_with_progress (36-372)
// Apply search filter
useEffect(() => {
  if (!previewData) return;

  const filtered = previewData.links.filter((link) => {
    if (!searchTerm) return true;
    const searchLower = searchTerm.toLowerCase();
    return (
      link.url.toLowerCase().includes(searchLower) ||
      link.text.toLowerCase().includes(searchLower) ||
      link.path.toLowerCase().includes(searchLower)
    );
  });

  setFilteredLinks(filtered);
}, [searchTerm, previewData]);
Search should operate on the active preview results.
After you re-fetch with new patterns, filteredLinks is replaced with updatedData.links, but this effect still filters previewData.links. The next search keystroke therefore reverts to the original unfiltered payload. Keep the canonical list in component state and base the search effect on that list instead of the stale prop.
- const [filteredLinks, setFilteredLinks] = useState<PreviewLink[]>([]);
+ const [allLinks, setAllLinks] = useState<PreviewLink[]>([]);
+ const [filteredLinks, setFilteredLinks] = useState<PreviewLink[]>([]);
@@
- setFilteredLinks(previewData.links);
+ setAllLinks(previewData.links);
+ setFilteredLinks(previewData.links);
@@
- if (!previewData) return;
-
- const filtered = previewData.links.filter((link) => {
+ const filtered = allLinks.filter((link) => {
@@
- setFilteredLinks(filtered);
- }, [searchTerm, previewData]);
+ setFilteredLinks(filtered);
+ }, [searchTerm, allLinks]);
@@
- setFilteredLinks(updatedData.links);
+ setAllLinks(updatedData.links);
+ setFilteredLinks(updatedData.links);
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In archon-ui-main/src/features/knowledge/components/LinkReviewModal.tsx around
lines 48-63, the search effect currently filters previewData.links (a prop)
which becomes stale after re-fetches; this causes subsequent searches to revert
to the original payload. Fix by maintaining a canonical links state (e.g.,
canonicalLinks) that you populate/replace when previewData changes, then change
this useEffect to filter canonicalLinks rather than previewData.links and
include canonicalLinks in the dependency array; ensure any places that
previously set filteredLinks from updatedData.links instead update
canonicalLinks so the search always operates against the active preview results.
const response = await fetch("http://localhost:8181/api/crawl/preview-links", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: previewData.source_url,
    url_include_patterns: includePatternArray,
    url_exclude_patterns: excludePatternArray,
  }),
});
Remove hard-coded localhost preview endpoint.
This locks the modal to http://localhost:8181, so any deployed build (or even a dev container on another host) will fail to apply filters. Use a relative or environment-driven URL so the request always targets the active backend instance.
- const response = await fetch("http://localhost:8181/api/crawl/preview-links", {
+ const response = await fetch("/api/crawl/preview-links", {
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
const response = await fetch("/api/crawl/preview-links", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: previewData.source_url,
    url_include_patterns: includePatternArray,
    url_exclude_patterns: excludePatternArray,
  }),
});
🤖 Prompt for AI Agents
In archon-ui-main/src/features/knowledge/components/LinkReviewModal.tsx around
lines 112 to 120, the fetch call uses a hard-coded "http://localhost:8181" host;
replace that with a deployable backend URL by using a relative path (e.g.
"/api/crawl/preview-links") or build-time/env-driven base URL (e.g. prepend
process.env.REACT_APP_API_BASE or a similar config with a safe fallback to the
relative path). Update the fetch target to compose from the chosen base (env
variable fallback to "") so the request points to the active backend in dev,
container, or production builds; ensure the app reads the env var at
runtime/build and retains the existing headers/body.
<div
  key={link.url}
  className={cn(
    "flex items-start space-x-3 p-3 hover:bg-gray-50 dark:hover:bg-gray-800/50 cursor-pointer transition-colors",
    selectedUrls.has(link.url) && "bg-cyan-50 dark:bg-cyan-900/20"
  )}
  onClick={() => handleToggleLink(link.url)}
>
  <input
    type="checkbox"
    checked={selectedUrls.has(link.url)}
    onChange={() => handleToggleLink(link.url)}
    className="mt-1 h-4 w-4 text-cyan-600 focus:ring-cyan-500 border-gray-300 rounded"
Fix double toggle when clicking the checkbox.
Clicking the checkbox fires both the row handler and the checkbox handler, so the selection flips twice and ends up unchanged. Stop the event from bubbling before you toggle the set.
- onClick={() => handleToggleLink(link.url)}
+ onClick={() => handleToggleLink(link.url)}
>
<input
type="checkbox"
checked={selectedUrls.has(link.url)}
- onChange={() => handleToggleLink(link.url)}
+ onClick={(event) => event.stopPropagation()}
+ onChange={(event) => {
+ event.stopPropagation();
+ handleToggleLink(link.url);
+ }}
🤖 Prompt for AI Agents
In archon-ui-main/src/features/knowledge/components/LinkReviewModal.tsx around
lines 235 to 247, clicking the row and the checkbox both trigger handlers so the
selection toggles twice; to fix, prevent the checkbox event from bubbling before
toggling: change the checkbox handler to accept the event (e) and call
e.stopPropagation() immediately, then perform the toggle on the URL; ensure you
do this on the checkbox's input handler (onChange or onClick) so the row's
onClick won't also fire.
Summary
Extends the glob pattern filtering feature (from PR #847 / commit 74023e1) to support recursive crawls and improves GitHub repository auto-configuration.
This PR builds on the foundation of link collection filtering by adding the same powerful glob pattern capabilities to recursive website crawling.
What's New
🚀 Recursive Crawl Filtering
Glob patterns now work during recursive crawls, not just link collections:
Example: Crawl only English documentation
Only /en/ paths will be crawled, automatically skipping /fr/, /de/, /api/, etc.
🎯 Improved GitHub Auto-Configuration
When you enter a GitHub repository URL, Archon now auto-configures with code-only patterns:
Before: /username/repo* (crawls everything including issues, PRs, wiki)
After: **/tree/**, **/blob/** (only crawls code files and directories)
Benefits:
Technical Implementation
Backend Changes
File: python/src/server/services/crawling/strategies/recursive.py
include_patterns and exclude_patterns parameters (lines 45-46)
File: python/src/server/services/crawling/crawling_service.py
Frontend Changes
File: archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx
**/tree/**, **/blob/** instead of /username/repo*
Documentation
File: docs/GLOB_PATTERNS.md (new, 253 lines)
How It Works
Filtering Flow (Recursive Crawls)
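The flow described in the commit message (filter internal links during discovery, before they are added to the crawl queue) can be sketched roughly as below. This is a simplified, synchronous stand-in, not the code in recursive.py: the names `allowed`, `crawl`, and `get_links` are hypothetical, matching is done against the URL path with Python's `fnmatch` as an assumption, and the start URL is always crawled regardless of patterns.

```python
from collections import deque
from fnmatch import fnmatchcase
from typing import Callable
from urllib.parse import urlparse


def allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Exclude patterns win; an empty include list means 'include everything'."""
    path = urlparse(url).path or "/"
    if any(fnmatchcase(path, p) for p in exclude):
        return False
    return not include or any(fnmatchcase(path, p) for p in include)


def crawl(
    start_url: str,
    get_links: Callable[[str], list[str]],
    include: list[str],
    exclude: list[str],
    max_depth: int,
) -> list[str]:
    """Breadth-first crawl that drops non-matching links before queueing them."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    crawled: list[str] = []
    while queue:
        url, depth = queue.popleft()
        crawled.append(url)  # stand-in for fetching and storing the page
        if depth >= max_depth:
            continue
        for link in get_links(url):
            if link in seen:
                continue
            seen.add(link)  # remember rejected links too, so they are checked once
            if allowed(link, include, exclude):
                queue.append((link, depth + 1))
    return crawled


# Toy link graph standing in for real HTTP fetches.
SITE = {
    "https://docs.example.com/": [
        "https://docs.example.com/en/intro",
        "https://docs.example.com/fr/intro",
        "https://docs.example.com/en/api/ref",
    ],
    "https://docs.example.com/en/intro": ["https://docs.example.com/en/guide"],
}

pages = crawl(
    "https://docs.example.com/",
    lambda u: SITE.get(u, []),
    include=["**/en/**"],
    exclude=["**/api/**"],
    max_depth=2,
)
```

Because rejected links never enter the queue, no request is issued for them, which is where the HTTP and memory savings claimed in this PR would come from.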
GitHub Auto-Config Flow
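The auto-configuration logic lives in AddKnowledgeDialog.tsx as a React effect; the sketch below restates the same rules in Python purely for illustration. The regular expression, the `**/tree/**, **/blob/**` patterns, the "GitHub Repo" tag, and the depth bump from "2" to "3" are taken from the effect quoted in the review, while `CrawlForm` and `apply_github_defaults` are hypothetical names.

```python
import re
from dataclasses import dataclass, field

# Same shape as the frontend regex: optional scheme, optional www, then owner/repo.
GITHUB_REPO_RE = re.compile(
    r"^(?:https?://)?(?:www\.)?github\.com/([^/]+)/([^/?#]+)",
    re.IGNORECASE,
)


@dataclass
class CrawlForm:
    """Hypothetical stand-in for the dialog's form state."""
    url_patterns: str = ""
    tags: list[str] = field(default_factory=list)
    max_depth: str = "2"


def apply_github_defaults(crawl_url: str, form: CrawlForm) -> CrawlForm:
    """Fill in code-only defaults for GitHub URLs without overriding user edits."""
    if not GITHUB_REPO_RE.match(crawl_url):
        return form
    if not form.url_patterns:
        # Only tree (directory) and blob (file) pages; issues, PRs, wiki are skipped.
        form.url_patterns = "**/tree/**, **/blob/**"
    if "GitHub Repo" not in form.tags:
        form.tags.append("GitHub Repo")
    if form.max_depth == "2":
        # Only the default depth is raised; a user-chosen depth is left alone.
        form.max_depth = "3"
    return form


auto = apply_github_defaults("github.com/coleam00/Archon", CrawlForm())
custom = apply_github_defaults(
    "https://github.com/coleam00/Archon",
    CrawlForm(url_patterns="**/docs/**", max_depth="5"),
)
other = apply_github_defaults("https://docs.example.com/", CrawlForm())
```

An allow-list of `tree` and `blob` paths needs no maintenance when GitHub adds new non-code sections, which is the rationale the commit message gives for preferring it over exclusion lists.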
Pattern Examples
Documentation Sites
Crawls English docs, excludes API reference and changelog.
GitHub Repositories (Auto-Applied)
Only crawls directory views and file views (actual code).
Blog Sites
Crawls published blog posts, excludes drafts and archives.
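To make the pattern syntax concrete, here is a minimal sketch of how a comma-separated pattern string with `!` exclusions could be parsed and applied. It uses the pattern strings listed in the commit message; `parse_patterns` and `url_allowed` are hypothetical helpers, and matching against the URL path with `fnmatch` is an assumption, so the actual `matches_glob_patterns` in url_handler.py may behave differently.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def parse_patterns(spec: str) -> tuple[list[str], list[str]]:
    """Split a comma-separated spec into (include, exclude); '!' marks an exclusion."""
    include: list[str] = []
    exclude: list[str] = []
    for raw in spec.split(","):
        pattern = raw.strip()
        if not pattern:
            continue
        if pattern.startswith("!"):
            exclude.append(pattern[1:].strip())
        else:
            include.append(pattern)
    return include, exclude


def url_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Exclusions are checked first; with no include patterns, everything else passes."""
    path = urlparse(url).path or "/"
    if any(fnmatchcase(path, p) for p in exclude):
        return False
    return not include or any(fnmatchcase(path, p) for p in include)


docs_inc, docs_exc = parse_patterns("**/en/**, !**/api/**, !**/changelog/**")
gh_inc, gh_exc = parse_patterns("**/tree/**, **/blob/**")
blog_inc, blog_exc = parse_patterns("**/blog/**, !**/draft/**")
```

Note that `fnmatch` lets `*` match across `/`, so `**` behaves like a single `*` here; a stricter glob dialect would need its own matcher.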
Testing
Unit Tests ✅
Frontend Tests ✅
Integration Tests ⚠️
Manual Testing ✅
Performance Benefits
Breaking Changes
None - This is backward compatible:
Migration Notes
Related PRs/Issues
Checklist
Files Changed
Total: +434 lines, -75 lines
Screenshots
Will add in comments:
Summary by CodeRabbit
New Features
Documentation