Fix/(llms.txt) not crawling links inside of file #437
Conversation
Removing langgraph-api from requirements.txt so it is autoresolved, h…
… intelligently determines if there are links in the llms.txt and crawls them as it should. Tested fully; everything works!
Walkthrough

Introduces Markdown and link-collection handling in crawling, including self-link detection, markdown link extraction, binary filtering, and batched crawling with progress mapping. Enhances URL utilities for markdown detection, link parsing, GitHub URL normalization, and display names. Adds stage-specific progress fields for code/document storage. No functional changes in embeddings.
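For orientation, here is a minimal sketch of the extraction flow the walkthrough describes. This is illustrative only: the helper name extract_markdown_links matches the function discussed in this review, but the regex and cleanup are simplified assumptions, not the PR's exact implementation.

```python
import re
from urllib.parse import urljoin

# Markdown links, autolinks, and bare URLs (simplified; the PR's pattern is richer)
LINK_PATTERN = re.compile(
    r'\[[^\]]*\]\(([^)]+)\)'            # [text](url)
    r'|<\s*(https?://[^>\s]+)\s*>'      # <https://...>
    r'|(https?://[^\s<>()\[\]"]+)'      # bare https://...
)

def extract_markdown_links(content: str, base_url: str | None = None) -> list[str]:
    """Collect URLs from a markdown/llms.txt document, resolving relative links."""
    urls: list[str] = []
    for match in LINK_PATTERN.finditer(content):
        url = next(g for g in match.groups() if g)  # first non-empty alternative
        url = url.strip().rstrip('.,;:)')           # trim common trailing punctuation
        if base_url and not url.startswith(('http://', 'https://')):
            url = urljoin(base_url, url)            # resolve relative paths
        urls.append(url)
    return list(dict.fromkeys(urls))                # de-duplicate, keep order
```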
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Client
participant CrawlingService
participant URLHandler
participant Progress
participant BatchCrawler as Batch Crawl
Client->>CrawlingService: crawl(url)
CrawlingService->>URLHandler: is_txt(url) / is_markdown(url)
alt Text/Markdown path
CrawlingService->>Progress: update(start=5,end=10)
CrawlingService->>CrawlingService: crawl_markdown_file(url)
CrawlingService->>URLHandler: is_link_collection_file(url, content)
alt Link-collection
CrawlingService->>URLHandler: extract_markdown_links(content, base_url)
CrawlingService->>CrawlingService: filter _is_self_link(link, base_url)
CrawlingService->>URLHandler: is_binary_file(link) (filter)
alt Has valid links
CrawlingService->>Progress: update(start=10,end=20)
CrawlingService->>BatchCrawler: crawl_batch_with_progress(links, max_concurrent, cb)
BatchCrawler-->>CrawlingService: batch results
CrawlingService-->>Client: results (type=link_collection_with_crawled_links)
else No valid links
CrawlingService-->>Client: initial results
end
else Not a link-collection
CrawlingService-->>Client: initial results
end
else Other types
CrawlingService-->>Client: existing crawl handling
end
```

```mermaid
sequenceDiagram
autonumber
participant Stage as Storage Stage
participant Progress
Note over Stage,Progress: Stage-specific progress fields
Stage->>Progress: Code: code_current_batch, code_total_batches
Stage->>Progress: Document: document_current_batch,<br/>document_completed_batches, document_total_batches
Note over Progress: Generic fields preserved alongside stage-specific ones
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/src/server/services/crawling/crawling_service.py (1)
502-517: .md files are not treated as link collections; extraction only runs for .txt

The new link-collection flow is gated by is_txt(url), so llms.md and similar files won't trigger extraction/crawling of embedded links. This contradicts the PR objective. Apply this diff to include markdown files and clarify the log:

```diff
-            if self.url_handler.is_txt(url):
+            if self.url_handler.is_txt(url) or getattr(self.url_handler, "is_markdown", lambda _u: False)(url):
@@
-                "log": "Detected text file, fetching content...",
+                "log": "Detected text/markdown file, fetching content...",
```

And add this helper to URLHandler (in url_handler.py):

```python
@staticmethod
def is_markdown(url: str) -> bool:
    """
    Check if a URL points to a markdown file (.md, .mdx, .markdown).
    """
    try:
        path = urlparse(url).path.lower()
        return path.endswith(('.md', '.mdx', '.markdown'))
    except Exception as e:
        logger.warning(f"Error checking if URL is markdown file: {e}", exc_info=True)
        return False
```
🧹 Nitpick comments (4)
python/src/server/services/crawling/helpers/url_handler.py (2)
182-185: Preserve stack traces in logs for easier debugging

Per coding guidelines, include exc_info=True in error/warning logs to retain stack traces. Apply this diff:

```diff
@@
-        except Exception as e:
-            logger.error(f"Error extracting markdown links: {e}")
+        except Exception as e:
+            logger.error(f"Error extracting markdown links: {e}", exc_info=True)
             return []
@@
-        except Exception as e:
-            logger.warning(f"Error checking if file is link collection: {e}")
+        except Exception as e:
+            logger.warning(f"Error checking if file is link collection: {e}", exc_info=True)
             return False
```

Optionally, consider adding exc_info=True in other exception handlers in this file for consistency.

Also applies to: 241-243
130-185: Consider filtering non-HTML/binary assets early

If link collections include assets like PDFs or images, you can pre-filter using is_binary_file to avoid wasteful batch crawls. If desired, add:

```python
# After URL collection, before dedup:
urls = [u for u in urls if not URLHandler.is_binary_file(u)]
```

python/src/server/services/crawling/crawling_service.py (2)
519-561: Add cancellation checks around extraction/batch steps and filter out binaries

Small robustness improvements:
- Check for cancellation before/after potentially long steps.
- Filter out binary/non-HTML links before batch to save time and storage.
Apply this diff:
```diff
@@
-                # Check if this is a link collection file and extract links
+                # Check if this is a link collection file and extract links
                 if crawl_results and len(crawl_results) > 0:
                     content = crawl_results[0].get('markdown', '')
                     if self.url_handler.is_link_collection_file(url, content):
                         if self.progress_id:
@@
-                        # Extract links from the content
+                        # Check for cancellation before extraction
+                        self._check_cancellation()
+                        # Extract links from the content
                         extracted_links = self.url_handler.extract_markdown_links(content, url)
-                        if extracted_links:
+                        # Optional: filter out binaries (pdf, images, archives, etc.)
+                        extracted_links = [u for u in extracted_links if not self.url_handler.is_binary_file(u)]
+
+                        if extracted_links:
                             if self.progress_id:
@@
-                            batch_results = await self.crawl_batch_with_progress(
+                            # Check for cancellation before starting batch crawl
+                            self._check_cancellation()
+                            batch_results = await self.crawl_batch_with_progress(
                                 extracted_links,
-                                max_concurrent=request.get('max_concurrent', 3),
+                                # Let strategy apply DB-configured concurrency unless explicitly provided
+                                max_concurrent=request.get('max_concurrent', None),
                                 progress_callback=await self._create_crawl_progress_callback("crawling"),
                                 start_progress=30,
                                 end_progress=70,
                             )
+                            # Check for cancellation after batch crawl
+                            self._check_cancellation()
```
545-551: Use strategy-configured concurrency by default instead of hardcoded 3

Batch strategy already reads concurrency from settings; defaulting to 3 is inconsistent with other paths (e.g., sitemap). Prefer None unless the caller explicitly overrides. Apply this minimal diff:

```diff
-    max_concurrent=request.get('max_concurrent', 3),
+    max_concurrent=request.get('max_concurrent', None),
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- python/src/server/services/crawling/crawling_service.py (1 hunks)
- python/src/server/services/crawling/helpers/url_handler.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
python/src/{server/services,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}
📄 CodeRabbit inference engine (CLAUDE.md)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**
📄 CodeRabbit inference engine (CLAUDE.md)
Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
🧬 Code graph analysis (1)
python/src/server/services/crawling/crawling_service.py (2)
python/src/server/services/crawling/helpers/url_handler.py (2)
- is_link_collection_file (187-243)
- extract_markdown_links (131-184)

python/src/server/services/crawling/strategies/batch.py (1)
- crawl_batch_with_progress (31-199)
🔇 Additional comments (1)
python/src/server/services/crawling/crawling_service.py (1)
519-561: Verified crawl_markdown_file return shape

I've confirmed that crawl_markdown_file (in single_page.py) always returns either:
- a single‐element list containing the fetched document on success, or
- an empty list on failure.
Because it never emits multiple entries (e.g., no multipart or redirect segments), indexing crawl_results[0] to get the markdown is safe under the current implementation. If the return shape ever changes to include multiple items, this use of [0] should be revisited.
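Under that contract, a small defensive accessor (a hypothetical helper, not part of the PR) keeps the indexing safe even if the shape changes later:

```python
def first_markdown(crawl_results: list[dict]) -> str:
    """Return the fetched markdown, or an empty string when the crawl produced nothing."""
    return crawl_results[0].get("markdown", "") if crawl_results else ""
```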
Actionable comments posted: 0
🧹 Nitpick comments (6)
python/src/server/services/crawling/helpers/url_handler.py (4)
170-173: Broaden link extraction to scheme-less and protocol-relative bare URLs; tighten cleanup

Many real-world link lists use www.example.com or //example.com. Add named groups to the regex and normalize both forms; also trim a couple more trailing punctuation chars to avoid spurious ]/>. Apply this diff:

```diff
-        combined_pattern = re.compile(
-            r'\[([^\]]*)\]\(([^)]+)\)'            # group 2: markdown URL
-            r'|<\s*(https?://[^>\s]+)\s*>'        # group 3: autolink URL
-            r'|(https?://[^\s<>()\[\]"]+)'        # group 4: bare URL
-        )
+        combined_pattern = re.compile(
+            r'\[(?P<text>[^\]]*)\]\((?P<md>[^)]+)\)'   # named: md
+            r'|<\s*(?P<auto>https?://[^>\s]+)\s*>'     # named: auto
+            r'|(?P<bare>https?://[^\s<>()\[\]"]+)'     # named: bare
+            r'|(?P<proto>//[^\s<>()\[\]"]+)'           # named: protocol-relative
+            r'|(?P<www>www\.[^\s<>()\[\]"]+)'          # named: www.* without scheme
+        )
@@
-        def _clean_url(u: str) -> str:
+        def _clean_url(u: str) -> str:
             # Trim whitespace and common trailing punctuation
-            return u.strip().rstrip('.,;:)')
+            return u.strip().rstrip('.,;:)]>')
@@
-        for match in re.finditer(combined_pattern, content):
-            url = match.group(2) or match.group(3) or match.group(4)
+        for match in re.finditer(combined_pattern, content):
+            url = (
+                match.group('md')
+                or match.group('auto')
+                or match.group('bare')
+                or match.group('proto')
+                or match.group('www')
+            )
             if not url:
                 continue
             url = _clean_url(url)
@@
-            if url.startswith('//'):
+            if url.startswith('//'):
                 url = f'https:{url}'
+            elif url.startswith('www.'):
+                url = f'https://{url}'
@@
-            if base_url and not url.startswith(('http://', 'https://')):
+            if base_url and not url.startswith(('http://', 'https://')):
                 try:
                     url = urljoin(base_url, url)
```

Also applies to: 181-181, 190-193, 194-201, 175-178
254-256: Include "references" in pattern-based detection

Covers cases like awesome-references.md not captured by the exact filename list.

```diff
-        if any(pattern in filename for pattern in ['llms', 'links', 'resources']):
+        if any(pattern in filename for pattern in ['llms', 'links', 'resources', 'references']):
             if filename.endswith(('.txt', '.md', '.mdx', '.markdown')):
                 logger.info(f"Detected potential link collection file: {filename}")
                 return True
```
261-269: Avoid regex divergence: reuse your own extractor for density checks

Using extract_markdown_links here keeps detection logic consistent and reduces maintenance risk when patterns evolve.

```diff
-        # Count markdown links + autolinks + bare URLs
-        markdown_link_pattern = r'\[([^\]]*)\]\(([^)]+)\)'
-        autolink_pattern = r'<\s*(https?://[^>\s]+)\s*>'
-        bare_url_pattern = r'(https?://[^\s<>()\[\]"]+)'
-        md_links = re.findall(markdown_link_pattern, content)
-        auto_links = re.findall(autolink_pattern, content)
-        bare_links = re.findall(bare_url_pattern, content)
-        total_links = len(md_links) + len(auto_links) + len(bare_links)
+        # Reuse extractor to avoid regex skew
+        links = URLHandler.extract_markdown_links(content)
+        total_links = len(links)
```
131-147: Strip GitHub blob query/fragment before building raw URL

raw.githubusercontent.com doesn't need ?plain=1 or anchors; keeping them can lead to 404s or cache splits.

```diff
         match = re.match(github_file_pattern, url)
         if match:
-            owner, repo, branch, path = match.groups()
-            raw_url = f'https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}'
+            owner, repo, branch, path = match.groups()
+            # Drop querystring/fragment from path when targeting raw
+            path = path.split('?', 1)[0].split('#', 1)[0]
+            raw_url = f'https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}'
             logger.info(f"Transformed GitHub file URL to raw: {url} -> {raw_url}")
             return raw_url
```

python/src/server/services/crawling/crawling_service.py (2)
553-556: Don't override DB-tuned concurrency when not specified

Defaulting to 3 masks CRAWL_MAX_CONCURRENT from settings. Pass None unless the request explicitly sets a value; the strategy will fetch DB config.

```diff
-    batch_results = await self.crawl_batch_with_progress(
-        extracted_links,
-        max_concurrent=request.get('max_concurrent', 3),
+    batch_results = await self.crawl_batch_with_progress(
+        extracted_links,
+        max_concurrent=request.get('max_concurrent'),  # None -> use DB settings
         progress_callback=await self._create_crawl_progress_callback("crawling"),
         start_progress=30,
         end_progress=70,
     )
```
535-541: Consider guarding against pathological link lists

Very large llms.* files can enumerate thousands of URLs. Add an upper bound (config-driven) before dispatching to batch crawl, and surface a log with the truncated count.

I can propose a small helper to read MAX_EXTRACTED_LINKS from rag_strategy (default e.g., 500) and slice extracted_links[:max_links]. Want me to draft it?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- python/src/server/services/crawling/crawling_service.py (2 hunks)
- python/src/server/services/crawling/helpers/url_handler.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
python/src/{server/services,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}
📄 CodeRabbit inference engine (CLAUDE.md)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**
📄 CodeRabbit inference engine (CLAUDE.md)
Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
🧬 Code graph analysis (1)
python/src/server/services/crawling/crawling_service.py (2)
python/src/server/services/crawling/helpers/url_handler.py (5)
- is_txt (55-69)
- is_markdown (36-52)
- is_link_collection_file (222-284)
- extract_markdown_links (150-219)
- is_binary_file (72-117)

python/src/server/services/crawling/strategies/batch.py (1)
- crawl_batch_with_progress (31-199)
🔇 Additional comments (3)
python/src/server/services/crawling/helpers/url_handler.py (1)
35-53: Solid addition: markdown file detection

is_markdown is concise, robust (parsing path, case-insensitive), and logs with exc_info=True. LGTM.

python/src/server/services/crawling/crawling_service.py (2)
519-566: Nice orchestration: detects link-collection files, extracts, filters binaries, then batch-crawls

The staged progress updates and result stitching are clear and keep backward compatibility for plain text files. Good addition.
524-530: Align per-stage percentages with ProgressMapper to avoid UI regressions

Direct update_crawl_progress writes bypass ProgressMapper. Verify these percentages won't regress relative to mapped stages reported elsewhere.

If you'd like, I can scan usages of ProgressMapper.map_progress to spot potential backward jumps and suggest exact numbers.

Also applies to: 544-550
…nt add the max link parameter suggestion though.
Actionable comments posted: 1
🧹 Nitpick comments (6)
python/src/server/services/crawling/crawling_service.py (6)
502-511: Sync ProgressMapper when emitting direct "crawling" updates to keep heartbeat accurate

You're emitting raw progress via update_crawl_progress here, but ProgressMapper isn't updated. Heartbeats (Line 286 ff.) will continue reporting the last mapped stage (e.g., "analyzing"), which is misleading during long file fetches. Sync the mapper when you emit direct updates.
```diff
     self.progress_state.update({
         "status": "crawling",
         "percentage": 10,
         "log": "Detected text/markdown file, fetching content...",
     })
+    # Keep heartbeat stage/progress in sync with direct emissions
+    self.progress_mapper.map_progress("crawling", 10)
     await update_crawl_progress(self.progress_id, self.progress_state)
```
520-533: Emit mapped progress for 'extracting_links' stage to avoid stale heartbeat stage

Same issue as above: heartbeats will show the previous mapped stage while extracting links. Update the mapper alongside the direct emission.
```diff
     self.progress_state.update({
         "status": "extracting_links",
         "percentage": 25,
         "log": "Link collection file detected, extracting embedded links...",
     })
+    # Sync mapper for accurate heartbeats during extraction
+    self.progress_mapper.map_progress("extracting_links", 25)
     await update_crawl_progress(self.progress_id, self.progress_state)
```
533-541: Avoid recrawling the source file if it appears among extracted links

It's common for llms.* files to include a self-link or canonical URL. Drop self-referential links to prevent redundant crawling and duplicate storage later.
```diff
-    extracted_links = self.url_handler.extract_markdown_links(content, url)
+    extracted_links = self.url_handler.extract_markdown_links(content, url)
+    # Drop self-links to avoid redundant crawling
+    extracted_links = [
+        link for link in extracted_links
+        if link.rstrip('/') != url.rstrip('/')
+    ]
```
543-559: Use 'crawling_links' as the callback base status and keep mapper in sync
- Progress updates are tagged “crawling_links” here, but the batch progress callback still uses base status "crawling". Aligning both improves UX consistency.
- Also sync ProgressMapper so heartbeats reflect this stage correctly.
```diff
     if self.progress_id:
         self.progress_state.update({
-            "status": "crawling_links",
+            "status": "crawling_links",
             "percentage": 30,
             "log": f"Found {len(extracted_links)} links to crawl from {url}",
         })
+        # Sync mapper for accurate heartbeats while batch-crawling links
+        self.progress_mapper.map_progress("crawling_links", 30)
         await update_crawl_progress(self.progress_id, self.progress_state)
@@
-    batch_results = await self.crawl_batch_with_progress(
+    batch_results = await self.crawl_batch_with_progress(
         extracted_links,
         max_concurrent=request.get('max_concurrent'),  # None -> use DB settings
-        progress_callback=await self._create_crawl_progress_callback("crawling"),
+        progress_callback=await self._create_crawl_progress_callback("crawling_links"),
         start_progress=30,
         end_progress=70,
     )
```
132-159: Type hint for progress callback is too strict vs actual usage

The returned callback accepts **kwargs, but the type is Callable[[str, int, str], Awaitable[None]]. This will trip mypy when callers pass extra fields (common in progress flows). Relax the type to Callable[..., Awaitable[None]].
```diff
-    ) -> Callable[[str, int, str], Awaitable[None]]:
+    ) -> Callable[..., Awaitable[None]]:
@@
-        async def callback(status: str, percentage: int, message: str, **kwargs):
+        async def callback(status: str, percentage: int, message: str, **kwargs):
             if self.progress_id:
```
562-566: Optional: Add in-memory URL de-duplication before merging batch results
- DocumentStorageOperations currently does not filter duplicate URLs or source IDs when ingesting crawl_results (a search for dedup|duplicate|unique|url_to_full_document|source_id in document_storage_operations.py returned no matching logic).
- The new crawl_type = "link_collection_with_crawled_links" value is only used for metadata and logging; no downstream code branches on specific crawl_type strings.

If you'd like to prevent duplicate upserts and noisy metrics, you can apply this optional diff:
```diff
-    crawl_results.extend(batch_results)
+    # Avoid duplicate URLs when combining results
+    existing_urls = {
+        r.get('url')
+        for r in crawl_results
+        if isinstance(r, dict) and r.get('url')
+    }
+    batch_unique = [
+        r
+        for r in batch_results
+        if isinstance(r, dict) and r.get('url') not in existing_urls
+    ]
+    crawl_results.extend(batch_unique)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- python/src/server/services/crawling/crawling_service.py (2 hunks)
- python/src/server/services/crawling/helpers/url_handler.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- python/src/server/services/crawling/helpers/url_handler.py
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead
Files:
python/src/server/services/crawling/crawling_service.py
python/src/{server/services,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients
Files:
python/src/server/services/crawling/crawling_service.py
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit
Files:
python/src/server/services/crawling/crawling_service.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}
📄 CodeRabbit inference engine (CLAUDE.md)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/**
📄 CodeRabbit inference engine (CLAUDE.md)
Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)
Files:
python/src/server/services/crawling/crawling_service.py
🧬 Code graph analysis (1)
python/src/server/services/crawling/crawling_service.py (2)
python/src/server/services/crawling/helpers/url_handler.py (4)
- is_txt (55-69)
- is_markdown (36-52)
- extract_markdown_links (152-233)
- is_binary_file (72-117)

python/src/server/services/crawling/strategies/batch.py (1)
- crawl_batch_with_progress (31-199)
🔇 Additional comments (1)
python/src/server/services/crawling/crawling_service.py (1)
519-569: Nice extension: end-to-end handling for llms.* link collections

The staged flow (detect → extract → filter → batch crawl → merge) aligns with the PR objective and keeps backward compatibility for plain text files.
…bit makes me crave nicotine
I do recommend merging PR #378 with this as well; it helps when you're recrawling.
Actionable comments posted: 2
♻️ Duplicate comments (1)
python/src/server/services/crawling/helpers/url_handler.py (1)
67-71: TXT detection hardened (case, query/fragment) — matches prior ask

This mirrors the earlier recommendation to normalize via urlparse and lowercase path. Looks good.
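For reference, the hardened check presumably resembles this sketch (normalizing via urlparse and lowercasing the path; an assumption based on the comment, not the verbatim code):

```python
from urllib.parse import urlparse

def is_txt(url: str) -> bool:
    """True when the URL path ends with .txt, ignoring case, query string, and fragment."""
    try:
        return urlparse(url).path.lower().endswith(".txt")
    except ValueError:
        # Malformed URLs (e.g., an invalid port) should simply not match
        return False
```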
🧹 Nitpick comments (5)
python/src/server/services/crawling/helpers/url_handler.py (4)
35-54: Markdown detection is robust and consistent with is_txt

Lowercasing path and ignoring query/fragment is correct. Optional: consider supporting a few common aliases ('.mkdn', '.mdown', '.rmd') if they appear in your repos.

```diff
-    return path.endswith(('.md', '.mdx', '.markdown'))
+    return path.endswith(('.md', '.mdx', '.markdown', '.mkdn', '.mdown', '.rmd'))
```
85-121: Make binary extension check cheaper and central; add a few common types; include exc_info in warnings
- Reallocating a set on every call is unnecessary; a module-level tuple + endswith is faster and simpler.
- Add a few frequently encountered non-HTMLs (fonts, design files, packages).
- Include exc_info=True to preserve stack traces per guidelines.
Apply within-function changes:
```diff
-            # Comprehensive list of binary and non-HTML file extensions
-            binary_extensions = {
-                # Archives
-                '.zip', '.tar', '.gz', '.rar', '.7z', '.bz2', '.xz', '.tgz',
-                # Executables and installers
-                '.exe', '.dmg', '.pkg', '.deb', '.rpm', '.msi', '.app', '.appimage',
-                # Documents (non-HTML)
-                '.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx', '.odt', '.ods',
-                # Images
-                '.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp', '.ico', '.bmp', '.tiff',
-                # Audio/Video
-                '.mp3', '.mp4', '.avi', '.mov', '.wmv', '.flv', '.webm', '.mkv', '.wav', '.flac',
-                # Data files
-                '.csv', '.sql', '.db', '.sqlite',
-                # Binary data
-                '.iso', '.img', '.bin', '.dat',
-                # Development files (usually not meant to be crawled as pages)
-                '.wasm', '.pyc', '.jar', '.war', '.class', '.dll', '.so', '.dylib'
-            }
-
-            # Check if the path ends with any binary extension
-            for ext in binary_extensions:
-                if path.endswith(ext):
-                    logger.debug(f"Skipping binary file: {url} (matched extension: {ext})")
-                    return True
-
-            return False
+            if path.endswith(BINARY_EXTENSIONS):
+                # Find matched suffix for logging without scanning all suffixes again
+                matched = next((ext for ext in BINARY_EXTENSIONS if path.endswith(ext)), '')
+                logger.debug(f"Skipping binary file: {url} (matched extension: {matched})")
+                return True
+            return False
         except Exception as e:
-            logger.warning(f"Error checking if URL is binary file: {e}")
+            logger.warning(f"Error checking if URL is binary file: {e}", exc_info=True)
             # In case of error, don't skip the URL (safer to attempt crawl than miss content)
             return False
```
# Module-level binary suffixes to avoid per-call allocations BINARY_EXTENSIONS = ( # Archives '.zip', '.tar', '.gz', '.rar', '.7z', '.bz2', '.xz', '.tgz', # Executables and installers '.exe', '.dmg', '.pkg', '.deb', '.rpm', '.msi', '.app', '.appimage', # Documents (non-HTML) '.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx', '.odt', '.ods', # Images '.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp', '.ico', '.bmp', '.tiff', '.heic', '.heif', # Audio/Video '.mp3', '.mp4', '.avi', '.mov', '.wmv', '.flv', '.webm', '.mkv', '.wav', '.flac', # Data files '.csv', '.sql', '.db', '.sqlite', # Binary data '.iso', '.img', '.bin', '.dat', # Development and packages '.wasm', '.pyc', '.jar', '.war', '.class', '.dll', '.so', '.dylib', '.whl', # Fonts '.ttf', '.otf', '.eot', '.woff', '.woff2', # Design / assets '.psd', '.ai', # Mobile packages / ebooks '.apk', '.ipa', '.epub', )
133-153: Broaden GitHub raw handling and accept both http/https

Current logic only transforms /blob/ URLs over https. Add support for:
- http scheme (normalize),
- /raw/ URLs (common in copyable links),
- existing raw.githubusercontent.com links (passthrough).
```diff
-        # Pattern for GitHub file URLs
-        github_file_pattern = r'https://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)'
+        # Pattern for GitHub file URLs
+        github_file_pattern = r'https?://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)'
         match = re.match(github_file_pattern, url)
         if match:
             owner, repo, branch, path = match.groups()
             # Strip query parameters and fragments that break raw URLs
             path = path.split('?', 1)[0].split('#', 1)[0]
             raw_url = f'https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}'
             logger.info(f"Transformed GitHub file URL to raw: {url} -> {raw_url}")
             return raw_url
-
+        # Pattern for GitHub raw URLs using /raw/ path form
+        github_raw_path_pattern = r'https?://github\.com/([^/]+)/([^/]+)/raw/([^/]+)/(.+)'
+        match = re.match(github_raw_path_pattern, url)
+        if match:
+            owner, repo, branch, path = match.groups()
+            path = path.split('?', 1)[0].split('#', 1)[0]
+            raw_url = f'https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}'
+            logger.info(f"Normalized GitHub raw path URL: {url} -> {raw_url}")
+            return raw_url
+
+        # Already a raw.githubusercontent.com link — passthrough
+        if re.match(r'https?://raw\.githubusercontent\.com/', url):
+            return url
+
         # Pattern for GitHub directory URLs
         github_dir_pattern = r'https://github\.com/([^/]+)/([^/]+)/tree/([^/]+)/(.+)'
         match = re.match(github_dir_pattern, url)
```
154-242: Precompile the regex; optionally support reference-style markdown links
- Re-compiling the large combined pattern on every call is avoidable; precompile at module scope and reuse for a noticeable hot-path win.
- Optional: support reference-style definitions ([text][id] with [id]: https://...) if your link lists use them.

Use a module-level pattern (outside selected range):
```python
# Precompiled combined URL pattern
COMBINED_URL_PATTERN = re.compile(
    r'\[(?P<text>[^\]]*)\]\((?P<md>[^)]+)\)'
    r'|<\s*(?P<auto>https?://[^>\s]+)\s*>'
    r'|(?P<bare>https?://[^\s<>()\[\]"]+)'
    r'|(?P<proto>//[^\s<>()\[\]"]+)'
    r'|(?P<www>www\.[^\s<>()\[\]"]+)'
)
```
Then update usage here:

```diff
-        combined_pattern = re.compile(
-            r'\[(?P<text>[^\]]*)\]\((?P<md>[^)]+)\)'   # named: md
-            r'|<\s*(?P<auto>https?://[^>\s]+)\s*>'     # named: auto
-            r'|(?P<bare>https?://[^\s<>()\[\]"]+)'     # named: bare
-            r'|(?P<proto>//[^\s<>()\[\]"]+)'           # named: protocol-relative
-            r'|(?P<www>www\.[^\s<>()\[\]"]+)'          # named: www.* without scheme
-        )
+        # Use precompiled pattern for performance
+        combined_pattern = COMBINED_URL_PATTERN
```

If reference-style support becomes necessary, I can follow up with a minimal, order-preserving parser that respects definitions and ignores code fences.
Additionally, consider logging at debug level for the extraction count to reduce info-log noise on large inputs.

python/src/server/services/crawling/crawling_service.py (2)
551-612: Two improvements: keep stage names consistent during batch, and sanitize max_concurrent from request
- The per-link progress callback currently uses base_status="crawling" and will overwrite your explicit "crawling_links" stage, causing status flicker in the UI.
- request.get('max_concurrent') may be a string; pass an int or None to avoid type issues downstream.

```diff
@@
-            if extracted_links:
+            if extracted_links:
                 if self.progress_id:
                     self.progress_state.update({
-                        "status": "crawling_links",
+                        "status": "crawling_links",
                         "percentage": 30,
                         "log": f"Found {len(extracted_links)} links to crawl from {url}",
                     })
                     await update_crawl_progress(self.progress_id, self.progress_state)

                 # Crawl the extracted links using batch crawling
                 logger.info(f"Crawling {len(extracted_links)} extracted links from {url}")
+                # Sanitize max_concurrent from request
+                sanitized_max_concurrent = None
+                try:
+                    mc = request.get('max_concurrent')
+                    if mc is not None:
+                        sanitized_max_concurrent = int(mc)
+                except Exception:
+                    logger.warning(f"Invalid max_concurrent in request: {request.get('max_concurrent')}, using defaults")
                 batch_results = await self.crawl_batch_with_progress(
                     extracted_links,
-                    max_concurrent=request.get('max_concurrent'),  # None -> use DB settings
-                    progress_callback=await self._create_crawl_progress_callback("crawling"),
+                    max_concurrent=sanitized_max_concurrent,  # None -> use DB settings
+                    progress_callback=await self._create_crawl_progress_callback("crawling_links"),
                     start_progress=30,
                     end_progress=70,
                 )
```

If you expect very large link lists, we can also cap and paginate them via a setting (e.g., MAX_EXTRACTED_LINKS) to prevent extremely long runs.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- python/src/server/services/crawling/crawling_service.py (3 hunks)
- python/src/server/services/crawling/helpers/url_handler.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead
Files:
python/src/server/services/crawling/helpers/url_handler.py
python/src/server/services/crawling/crawling_service.py
python/src/{server/services,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data
Files:
python/src/server/services/crawling/helpers/url_handler.py
python/src/server/services/crawling/crawling_service.py
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients
Files:
python/src/server/services/crawling/helpers/url_handler.py
python/src/server/services/crawling/crawling_service.py
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit
Files:
python/src/server/services/crawling/helpers/url_handler.py
python/src/server/services/crawling/crawling_service.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}
📄 CodeRabbit inference engine (CLAUDE.md)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality
Files:
python/src/server/services/crawling/helpers/url_handler.py
python/src/server/services/crawling/crawling_service.py
python/src/server/**
📄 CodeRabbit inference engine (CLAUDE.md)
Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)
Files:
python/src/server/services/crawling/helpers/url_handler.py
python/src/server/services/crawling/crawling_service.py
🧬 Code graph analysis (1)
python/src/server/services/crawling/crawling_service.py (2)
python/src/server/services/crawling/helpers/url_handler.py (5)
- is_txt (56-72)
- is_markdown (36-53)
- is_link_collection_file (244-301)
- extract_markdown_links (155-241)
- is_binary_file (75-120)

python/src/server/services/crawling/strategies/batch.py (1)
- crawl_batch_with_progress (31-199)
🔇 Additional comments (2)
python/src/server/services/crawling/helpers/url_handler.py (1)
7-8: Good: normalized URL parsing and typing imports

Brings in urljoin and typing. Sets the stage for safer, normalized URL checks elsewhere.

python/src/server/services/crawling/crawling_service.py (1)
python/src/server/services/crawling/crawling_service.py (1)
532-543: LGTM: unified text/markdown detection and progress sync

Using both is_txt and is_markdown covers the PR's target files. Good call to sync the ProgressMapper with direct emissions to prevent UI resets.
Changes Made:
1. Progress Bar Fix: Fixed llms.txt crawling progress jumping to 90% then regressing to 45% by adjusting batch crawling progress ranges (20-30% instead of 40-90%) and using consistent ProgressMapper ranges
2. OpenAI API Compatibility: Added robust fallback logic in the contextual embedding service to handle newer models (GPT-5) that require max_completion_tokens instead of max_tokens and don't support custom temperature values

Files Modified:
- src/server/services/crawling/crawling_service.py - Fixed progress ranges
- src/server/services/crawling/progress_mapper.py - Restored original stage ranges
- src/server/services/embeddings/contextual_embedding_service.py - Added fallback API logic

Result:
- Progress bar now smoothly progresses 0-30% (crawling) → 35-80% (storage) → 100%
- Automatic compatibility with both old (GPT-4.1-nano) and new (GPT-5-nano) OpenAI models
- Eliminates "max_tokens not supported" and "temperature not supported" errors
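A sketch of the fallback pattern described in item 2, assuming the openai Python client (v1.x). The max_tokens and max_completion_tokens parameters are real API parameters, but the function shape and error matching here are illustrative, not the exact code in contextual_embedding_service.py:

```python
import openai

async def chat_with_fallback(
    client: openai.AsyncOpenAI, model: str, messages: list[dict], limit: int
):
    """Try legacy parameters first; on a parameter rejection, retry with the newer ones."""
    try:
        return await client.chat.completions.create(
            model=model, messages=messages, max_tokens=limit, temperature=0.3
        )
    except openai.BadRequestError as e:
        # Reasoning models (e.g., the GPT-5 family) reject max_tokens and custom temperature
        if "max_tokens" in str(e) or "temperature" in str(e):
            return await client.chat.completions.create(
                model=model, messages=messages, max_completion_tokens=limit
            )
        raise
```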
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
python/src/server/services/storage/document_storage_service.py (3)
93-101: Bug: fallback delete batch slices a fixed 10 URLs, skipping others when fallback_batch_size ≠ 10

The loop steps by fallback_batch_size but slices with i + 10. This will skip deletions for some URLs whenever fallback_batch_size > 10 (and can double-delete if < 10). Fix the slice to use fallback_batch_size. Apply this diff:

```diff
-    fallback_batch_size = max(10, delete_batch_size // 5)
+    fallback_batch_size = max(10, delete_batch_size // 5)
     ...
-    batch_urls = unique_urls[i : i + 10]
+    batch_urls = unique_urls[i : i + fallback_batch_size]
```
259-293: Wrong/misleading mapping of embeddings back to originals; O(n²) scan and duplicate-text misassignment

The current "find by text" scan will mis-map when two chunks have identical text (e.g., boilerplate/footer), causing duplicate use of the first index and dropping the later one. It's also O(n²).
Use a stable one-to-many map from text → indices and consume indices as they are matched.
Apply this diff:
```diff
-            # Prepare batch data - only for successful embeddings
-            batch_data = []
-            # Map successful texts back to their original indices
-            for j, (embedding, text) in enumerate(
-                zip(batch_embeddings, successful_texts, strict=False)
-            ):
-                # Find the original index of this text
-                orig_idx = None
-                for idx, orig_text in enumerate(contextual_contents):
-                    if orig_text == text:
-                        orig_idx = idx
-                        break
-
-                if orig_idx is None:
-                    search_logger.warning("Could not map embedding back to original text")
-                    continue
-
-                j = orig_idx  # Use original index for metadata lookup
+            # Prepare batch data - only for successful embeddings
+            batch_data = []
+            # Build a stable mapping to handle duplicate texts deterministically
+            text_to_indices: dict[str, list[int]] = {}
+            for idx, orig_text in enumerate(contextual_contents):
+                text_to_indices.setdefault(orig_text, []).append(idx)
+
+            for _, (embedding, text) in enumerate(zip(batch_embeddings, successful_texts, strict=False)):
+                # Consume the next available index for this text (handles duplicates correctly)
+                idx_list = text_to_indices.get(text)
+                if not idx_list:
+                    search_logger.warning("Could not map embedding back to original text")
+                    continue
+                orig_idx = idx_list.pop(0)

                 # Use source_id from metadata if available, otherwise extract from URL
-                if batch_metadatas[j].get("source_id"):
-                    source_id = batch_metadatas[j]["source_id"]
+                if batch_metadatas[orig_idx].get("source_id"):
+                    source_id = batch_metadatas[orig_idx]["source_id"]
                 else:
                     # Fallback: Extract source_id from URL
-                    parsed_url = urlparse(batch_urls[j])
+                    parsed_url = urlparse(batch_urls[orig_idx])
                     source_id = parsed_url.netloc or parsed_url.path

                 data = {
-                    "url": batch_urls[j],
-                    "chunk_number": batch_chunk_numbers[j],
-                    "content": text,  # Use the successful text
-                    "metadata": {"chunk_size": len(text), **batch_metadatas[j]},
+                    "url": batch_urls[orig_idx],
+                    "chunk_number": batch_chunk_numbers[orig_idx],
+                    "content": text,  # Use the successful text
+                    # Ensure chunk_size reflects actual stored content; override any existing key
+                    "metadata": {**batch_metadatas[orig_idx], "chunk_size": len(text)},
                     "source_id": source_id,
                     "embedding": embedding,  # Use the successful embedding
                 }
                 batch_data.append(data)
```
69-89: Prevent potential data loss by avoiding upfront deletion

The current implementation in python/src/server/services/storage/document_storage_service.py (lines 79-83) deletes all existing chunks for each URL before performing any upserts. Since only successfully embedded chunks are re-inserted, any failure in the embedding step will permanently drop previously valid chunks for that URL.

To address this:
- Instead of wholesale deletion at the start, defer removals until after you've confirmed which (url, chunk_number) pairs embed successfully.
- Alternatively, delete only the specific (url, chunk_number) keys you're about to overwrite (relying on upsert for updates), then, once a full URL batch succeeds, run a separate cleanup pass to prune truly stale chunks; a sketch of this follows the list.
- You can gate the safer deletion strategy behind a feature flag (e.g., RAG_STORAGE_SAFE_DELETE) for incremental rollout.

Please refactor the batch-delete logic to follow one of these safer approaches, ensuring that transient failures never cause irreversible data loss.
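A minimal sketch of the second option, assuming the supabase-py client and the (url, chunk_number) constraint noted later in this review; the stale-tail pruning rule is an illustrative assumption:

```python
def replace_url_chunks(client, url: str, records: list[dict]) -> None:
    """Upsert the new chunks first, then prune only chunks the new version no longer has."""
    client.table("archon_crawled_pages").upsert(
        records, on_conflict="url,chunk_number"
    ).execute()
    # Delete only the stale tail instead of wiping all chunks upfront,
    # so an embedding failure cannot destroy previously valid data.
    client.table("archon_crawled_pages").delete().eq("url", url).gte(
        "chunk_number", len(records)
    ).execute()
```

This assumes chunk_number is a 0-based contiguous sequence; with that invariant, chunks at indices >= len(records) are exactly the ones the new version dropped.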
🧹 Nitpick comments (8)
python/src/server/services/storage/document_storage_service.py (4)
299-340: Improve error logs with stack traces and clearer context; keep backoff but log with exc_info

Catching broad Exception is acceptable here due to the retry/last-resort flow, but logs should include the full stack trace and context (batch id) for debugging. Apply this diff:

```diff
-                except Exception as e:
-                    if retry < max_retries - 1:
-                        search_logger.warning(
-                            f"Error inserting batch (attempt {retry + 1}/{max_retries}): {e}"
-                        )
+                except Exception as e:
+                    if retry < max_retries - 1:
+                        search_logger.warning(
+                            f"Error upserting batch {batch_num} (attempt {retry + 1}/{max_retries})",
+                            exc_info=True,
+                        )
                         await asyncio.sleep(retry_delay)
                         retry_delay *= 2  # Exponential backoff
                     else:
-                        search_logger.error(
-                            f"Failed to insert batch after {max_retries} attempts: {e}"
-                        )
+                        search_logger.error(
+                            f"Failed to upsert batch {batch_num} after {max_retries} attempts",
+                            exc_info=True,
+                        )
```
342-359: Per-record fallback: include stack traces and record context in logs

Add exc_info=True and include url and chunk_number to aid debugging and triage. Apply this diff:

```diff
-                    except Exception as individual_error:
-                        search_logger.error(
-                            f"Failed individual insert for {record['url']}: {individual_error}"
-                        )
+                    except Exception as individual_error:
+                        search_logger.error(
+                            f"Failed individual upsert for url={record.get('url')} chunk={record.get('chunk_number')}",
+                            exc_info=True,
+                        )
```
61-68: Unused variable: enable_parallel is read but not used
enable_parallel is computed but never used in this function. Either wire it into behavior or remove to reduce confusion. Apply this diff to remove the dead assignment if not needed:

```diff
-    enable_parallel = rag_settings.get("ENABLE_PARALLEL_BATCHES", "true").lower() == "true"
+    _ = rag_settings.get("ENABLE_PARALLEL_BATCHES", "true")  # reserved for future use
     ...
-    enable_parallel = True
+    _ = True
```

Or fully remove if not planned.
114-124: Duplicate import of credential_service inside function
credential_service is already imported at the module top (Line 13). The inner re-import is redundant. Apply this diff:

```diff
-        from ..credential_service import credential_service
+        # credential_service already imported at module scope
```

python/src/server/services/embeddings/contextual_embedding_service.py (4)
112-125: Redundant model retrieval (unused variable); remove to avoid confusion
model_choice is fetched and logged but not used in this function (you later call _get_model_choice). Drop the unused retrieval or reuse it. Apply this diff:

```diff
-    try:
-        from ...services.credential_service import credential_service
-
-        model_choice = await credential_service.get_credential("MODEL_CHOICE", "gpt-4.1-nano")
-    except Exception as e:
-        # Fallback to environment variable or default
-        search_logger.warning(
-            f"Failed to get MODEL_CHOICE from credential service: {e}, using fallback"
-        )
-        model_choice = os.getenv("MODEL_CHOICE", "gpt-4.1-nano")
-
-    search_logger.debug(f"Using MODEL_CHOICE: {model_choice}")
+    # Model is resolved by _get_model_choice(); no need to prefetch another setting here.
```
250-256: Cap the per-batch token budget to a sane maximum
token_limit = 250 * len(chunks) can exceed model limits and cause avoidable failures for large batches. Cap it or make it configurable (e.g., from credentials), and let the fallback split when needed. Apply this diff:

```diff
-    token_limit = 250 * len(chunks)
+    token_limit = min(250 * len(chunks), 4000)  # TODO: make model-aware or configurable
```
258-283: Fragile parsing of multi-line contexts; use regex with chunk sections

Splitting by newline and matching CHUNK N: on a single line will drop multi-line contexts or lines that wrap. Prefer a regex that captures sections between markers. Here's a more robust sketch:

```diff
-        lines = response_text.strip().split("\n")
-        chunk_contexts = {}
-        for line in lines:
-            if line.strip().startswith("CHUNK"):
-                parts = line.split(":", 1)
-                if len(parts) == 2:
-                    chunk_num = int(parts[0].strip().split()[1]) - 1
-                    context = parts[1].strip()
-                    chunk_contexts[chunk_num] = context
+        import re
+        pattern = re.compile(r'^\s*CHUNK\s+(\d+):\s*(.*)$', re.MULTILINE)
+        chunk_contexts = {}
+        matches = list(pattern.finditer(response_text))
+        for idx, m in enumerate(matches):
+            chunk_idx = int(m.group(1)) - 1
+            start = m.end()
+            end = matches[idx + 1].start() if idx + 1 < len(matches) else len(response_text)
+            section = (m.group(2) + "\n" + response_text[start:end]).strip()
+            chunk_contexts[chunk_idx] = section
```
285-303: Include stack traces on rate limit and generic errors; align with observability guideline

Add exc_info=True so we can triage production failures quickly. Apply this diff:

```diff
-        except openai.RateLimitError as e:
+        except openai.RateLimitError as e:
             if "insufficient_quota" in str(e):
-                search_logger.warning(f"⚠️ QUOTA EXHAUSTED in contextual embeddings: {e}")
+                search_logger.warning("⚠️ QUOTA EXHAUSTED in contextual embeddings", exc_info=True)
                 search_logger.warning(
                     "OpenAI quota exhausted - proceeding without contextual embeddings"
                 )
             else:
-                search_logger.warning(f"Rate limit hit in contextual embeddings batch: {e}")
+                search_logger.warning("Rate limit hit in contextual embeddings batch", exc_info=True)
                 search_logger.warning(
                     "Rate limit hit - proceeding without contextual embeddings for this batch"
                 )
         ...
-        except Exception as e:
-            search_logger.error(f"Error in contextual embedding batch: {e}")
+        except Exception as e:
+            search_logger.error("Error in contextual embedding batch", exc_info=True)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- python/src/server/services/crawling/crawling_service.py (4 hunks)
- python/src/server/services/embeddings/contextual_embedding_service.py (3 hunks)
- python/src/server/services/storage/document_storage_service.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- python/src/server/services/crawling/crawling_service.py
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead
Files:
python/src/server/services/storage/document_storage_service.py
python/src/server/services/embeddings/contextual_embedding_service.py
python/src/{server/services,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data
Files:
python/src/server/services/storage/document_storage_service.py
python/src/server/services/embeddings/contextual_embedding_service.py
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients
Files:
python/src/server/services/storage/document_storage_service.py
python/src/server/services/embeddings/contextual_embedding_service.py
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit
Files:
python/src/server/services/storage/document_storage_service.py
python/src/server/services/embeddings/contextual_embedding_service.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}
📄 CodeRabbit inference engine (CLAUDE.md)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality
Files:
python/src/server/services/storage/document_storage_service.py
python/src/server/services/embeddings/contextual_embedding_service.py
python/src/server/**
📄 CodeRabbit inference engine (CLAUDE.md)
Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)
Files:
python/src/server/services/storage/document_storage_service.py
python/src/server/services/embeddings/contextual_embedding_service.py
🧠 Learnings (2)
📚 Learning: 2025-08-20T19:38:04.050Z
Learnt from: Chillbruhhh
PR: coleam00/Archon#378
File: python/src/server/services/storage/document_storage_service.py:304-306
Timestamp: 2025-08-20T19:38:04.050Z
Learning: The archon_crawled_pages table in the Archon project has a table-level unique constraint on (url, chunk_number) defined inline in the CREATE TABLE statement in migration/complete_setup.sql at line 202, which allows upsert operations with on_conflict="url,chunk_number" to work properly without requiring additional migrations.
Applied to files:
python/src/server/services/storage/document_storage_service.py
📚 Learning: 2025-08-20T19:38:04.050Z
Learnt from: Chillbruhhh
PR: coleam00/Archon#378
File: python/src/server/services/storage/document_storage_service.py:304-306
Timestamp: 2025-08-20T19:38:04.050Z
Learning: The archon_crawled_pages table in the Archon project has a table-level unique constraint on (url, chunk_number) defined inline in the CREATE TABLE statement in migration/complete_setup.sql at line 202, which allows upsert operations with on_conflict="url,chunk_number" to work properly.
Applied to files:
python/src/server/services/storage/document_storage_service.py
🔇 Additional comments (2)
python/src/server/services/storage/document_storage_service.py (1)
304-306: Upsert switch is correct and aligns with DB constraint on (url, chunk_number)
Changing to `upsert(..., on_conflict="url,chunk_number")` prevents duplicate-key failures and updates existing rows in-place. This matches the table-level unique constraint documented in migrations.
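For illustration, a minimal sketch of what that call looks like with supabase-py; the table and conflict columns come from the migration cited above, while the connection values and row contents are placeholders:

```python
from supabase import create_client  # supabase-py client

client = create_client("https://<project>.supabase.co", "<service-role-key>")  # placeholders

rows = [
    {"url": "https://example.com/docs", "chunk_number": 0, "content": "chunk 0 text"},
    {"url": "https://example.com/docs", "chunk_number": 1, "content": "chunk 1 text"},
]

# on_conflict targets the unique constraint on (url, chunk_number), so re-crawling
# the same page updates existing rows in place instead of raising duplicate-key errors.
client.table("archon_crawled_pages").upsert(rows, on_conflict="url,chunk_number").execute()
```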
python/src/server/services/embeddings/contextual_embedding_service.py (1)
147-157: LGTM: messages payload and centralized fallback invocation
The move to explicit messages + centralized fallback is clean and keeps the per-call code small. Good separation of concerns.
…n to here, definitely recommend looking at that though, since gpt-5-nano is considered a reasoning model, doesn't use max_tokens, and requires a different output parameter. Also removed my upsert fix from document storage since that's not part of this exact issue and I have another PR open for it. Checked with CodeRabbit in my IDE: no issues, no nitpicks. Should be good? It might flag me for the UPSERT logic not being in here; oh well, that has nothing to do with this PR and was submitted in the last revision by mistake. Everything's tested and good to go!
|
ready for merge
|
Thank you for this @Chillbruhhh! I tested just now and this works beautifully for llms.txt files that have a lot of links within. However, I tested with an llms-full.txt (https://ai.pydantic.dev/llms-full.txt), and this used to be really fast since it was just a single page, but now with this PR it found 492 URLs and is going through all of those when it shouldn't have to. It's just a single page. Also, the progress for crawling pages is always at 100%: If you address these two things I think the PR will be ready! Probably good to rebase from main too.
|
@coleam00 sorry about that, fixing it now
…n crawling llms-full.txt. Fixed the crawl URL showing 100% when multiple URLs are present and crawling hasn't finished. Also fixed a styling issue in CrawlingProgressCard.tsx: when batching code examples, the batching progress bar would sometimes glitch out of the UI; fixed it so it won't do that now.
…awling-links-inside-of-file
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/src/server/services/crawling/crawling_service.py (1)
329-331: Root cause of “progress shows 100% immediately”: don’t start ‘starting’ at 100
Use 0–1% for “starting” so the UI doesn’t jump to 100% at kickoff.
```diff
- await update_mapped_progress(
-     "starting", 100, f"Starting crawl of {url}", currentUrl=url
- )
+ await update_mapped_progress(
+     "starting", 1, f"Starting crawl of {url}", currentUrl=url
+ )
```
♻️ Duplicate comments (2)
python/src/server/services/crawling/helpers/url_handler.py (1)
288-296: Count relative links in density by passing base_url
Without base_url, relative links aren’t resolved and undercount density.
```diff
- extracted_links = URLHandler.extract_markdown_links(content)
+ extracted_links = URLHandler.extract_markdown_links(content, url)
```
python/src/server/services/crawling/crawling_service.py (1)
490-519: Canonicalize self-link comparison (host case, default ports); add exc_info
Avoid false negatives between http://example.com and http://EXAMPLE.com:80, and improve logs.
```diff
- try:
-     from urllib.parse import urlparse
-
-     # Parse both URLs to compare their core components
-     link_parsed = urlparse(link)
-     base_parsed = urlparse(base_url)
-
-     # Compare scheme, netloc, and path (ignoring query and fragment)
-     link_core = f"{link_parsed.scheme}://{link_parsed.netloc}{link_parsed.path.rstrip('/')}"
-     base_core = f"{base_parsed.scheme}://{base_parsed.netloc}{base_parsed.path.rstrip('/')}"
-
-     return link_core == base_core
+ try:
+     from urllib.parse import urlparse
+     def _core(u: str) -> str:
+         p = urlparse(u)
+         scheme = (p.scheme or 'http').lower()
+         host = (p.hostname or '').lower()
+         port = p.port
+         if (scheme, port) in (('http', 80), ('https', 443)) or port is None:
+             port_part = ''
+         else:
+             port_part = f":{port}"
+         path = p.path.rstrip('/')
+         return f"{scheme}://{host}{port_part}{path}"
+     return _core(link) == _core(base_url)
  except Exception as e:
-     logger.warning(f"Error checking if link is self-referential: {e}")
+     logger.warning(f"Error checking if link is self-referential: {e}", exc_info=True)
      # Fallback to simple string comparison
      return link.rstrip('/') == base_url.rstrip('/')
```
🧹 Nitpick comments (4)
archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx (3)
300-303: Unify snake_case vs camelCase batch fields
UI mixes completedBatches/totalBatches and completed_batches/total_batches, causing mismatched texts/bars. Support both or standardize to one.
Apply:
```diff
- step.message = `Batch ${progressData.completedBatches}/${progressData.totalBatches} - Saving to database...`;
+ const done = progressData.completedBatches ?? progressData.completed_batches ?? 0;
+ const total = progressData.totalBatches ?? progressData.total_batches ?? 0;
+ step.message = total ? `Batch ${done}/${total} - Saving to database...` : 'Saving to database...';
```
```diff
- {progressData.completed_batches || 0}/{progressData.total_batches || 0}
+ {(progressData.completedBatches ?? progressData.completed_batches ?? 0)}/
+ {(progressData.totalBatches ?? progressData.total_batches ?? 0)}
```
Also applies to: 721-722
76-88: Don’t mutate props (progressData.status) in-place
Directly setting progressData.status breaks React data flow; rely on onStop to update state in the parent.
```diff
- // Optimistic UI update - immediately show stopping status
- progressData.status = 'stopping';
+ // Ask parent to reflect stopping state; avoid mutating props
+ // Parent can optimistically set status to 'stopping'
```
23-23: Consider switching to useCrawlProgressPolling
Per team learning, prefer the existing polling hook with ETag/visibility handling over crawlProgressService for fewer renders and better perf.
python/src/server/services/crawling/helpers/url_handler.py (1)
138-153: GitHub blob→raw: good, but accept http and subdomains
Minor: the regex only matches https://github.com; consider https? and www. to be resilient to odd redirects.
```diff
- github_file_pattern = r'https://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)'
+ github_file_pattern = r'https?://(?:www\.)?github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)'
- github_dir_pattern = r'https://github\.com/([^/]+)/([^/]+)/tree/([^/]+)/(.+)'
+ github_dir_pattern = r'https?://(?:www\.)?github\.com/([^/]+)/([^/]+)/tree/([^/]+)/(.+)'
```
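To make the intent concrete, here is a standalone sketch of the blob→raw rewrite with the broadened pattern; the actual helper in url_handler.py may be structured differently:

```python
import re

_GITHUB_BLOB = re.compile(r"https?://(?:www\.)?github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)")

def to_raw_url(url: str) -> str:
    """Rewrite a GitHub blob URL to its raw.githubusercontent.com equivalent."""
    match = _GITHUB_BLOB.match(url)
    if not match:
        return url  # not a GitHub file URL; leave untouched
    owner, repo, branch, path = match.groups()
    # Strip query strings and fragments that would break the raw URL.
    path = path.split("?", 1)[0].split("#", 1)[0]
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"

print(to_raw_url("https://github.com/microsoft/vscode/blob/main/README.md?plain=1"))
# https://raw.githubusercontent.com/microsoft/vscode/main/README.md
```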
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx (1 hunks)
python/src/server/services/crawling/crawling_service.py (4 hunks)
python/src/server/services/crawling/helpers/url_handler.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}
📄 CodeRabbit inference engine (CLAUDE.md)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality
Files:
archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
archon-ui-main/src/components/**
📄 CodeRabbit inference engine (CLAUDE.md)
Place reusable UI components in archon-ui-main/src/components/
Files:
archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx
python/src/{server,mcp,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
python/src/{server/services,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**
📄 CodeRabbit inference engine (CLAUDE.md)
Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/crawling/helpers/url_handler.py
🧠 Learnings (1)
📚 Learning: 2025-08-28T13:07:24.766Z
Learnt from: Wirasm
PR: coleam00/Archon#514
File: archon-ui-main/src/services/crawlProgressService.ts:35-39
Timestamp: 2025-08-28T13:07:24.766Z
Learning: The crawlProgressService.ts in the Archon codebase should be deprecated in favor of the existing useCrawlProgressPolling hook from usePolling.ts, which already includes ETag support, 304 handling, tab visibility detection, and proper React lifecycle integration. This consolidation reduces code duplication and improves performance.
Applied to files:
archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx
🧬 Code graph analysis (1)
python/src/server/services/crawling/crawling_service.py (3)
python/src/server/config/logfire_config.py (1)
safe_logfire_info (223-235)
python/src/server/services/crawling/helpers/url_handler.py (4)
is_txt (56-72)
is_markdown (36-53)
extract_markdown_links (155-241)
is_binary_file (75-120)
python/src/server/services/crawling/strategies/batch.py (1)
crawl_batch_with_progress(30-198)
🔇 Additional comments (6)
archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx (2)
727-727: Good fix: prevent flex overflow on batch bar
Adding min-w-0 avoids layout glitches in tight flex parents.
732-733: Correct: clamp batch progress to [0,100]
This fixes the “always 100%” rendering when totals are 0/undefined during early updates.
python/src/server/services/crawling/helpers/url_handler.py (3)
35-53: Markdown detection is robust
Lowercasing and parsing the path correctly handle query/fragment cases.
66-72: TXT detection hardened correctly
Parsing and lowercasing the path resolves .TXT and ?raw=1 cases.
154-241: Extractor covers real-world patterns well
Comprehensive combined regex, cleaning, www// normalization, relative resolution, and stable dedupe look good.
python/src/server/services/crawling/crawling_service.py (1)
364-373: ProgressMapper usage in document_storage is correct
Mapping stage progress to overall percentage prevents UI regressions here.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
python/src/server/services/crawling/helpers/url_handler.py (2)
195-206: GitHub raw URL: keep query/fragment stripping; remove markers
Preserve the path sanitization to avoid broken raw links, and drop conflict markers.
```diff
-<<<<<<< HEAD
-    # Strip query parameters and fragments that break raw URLs
-    path = path.split('?', 1)[0].split('#', 1)[0]
-    raw_url = f'https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}'
-=======
-    raw_url = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"
->>>>>>> origin/main
+    # Strip query parameters and fragments that break raw URLs
+    path = path.split("?", 1)[0].split("#", 1)[0]
+    raw_url = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"
```
221-232: Broken/duplicated `generate_unique_source_id`; large conflict block needs a unified resolution
One branch returns `url` (incorrect); the other has the canonical, hashed implementation plus `extract_display_name`. Unify: keep the hashed implementation, keep `extract_display_name`, and keep your new `extract_markdown_links`/`is_link_collection_file` as separate methods (no duplication). Remove all conflict markers.
Here’s a consolidated replacement for the entire conflict block to end-of-file (keep ordering as shown and ensure only one definition of each method exists):
```diff
-<<<<<<< HEAD
-    return url
-
-    @staticmethod
-    def extract_markdown_links(content: str, base_url: Optional[str] = None) -> List[str]:
-        ...
-    @staticmethod
-    def is_link_collection_file(url: str, content: Optional[str] = None) -> bool:
-        ...
-=======
-    Uses 16-char SHA256 prefix (64 bits) which provides
-    ~18 quintillion unique values. Collision probability
-    is negligible for realistic usage (<1M sources).
-    ...
-    def extract_display_name(url: str) -> str:
-        ...
->>>>>>> origin/main
+    Uses 16-char SHA256 prefix (64 bits) which provides
+    ~18 quintillion unique values. Collision probability
+    is negligible for realistic usage (<1M sources).
+    ...
+    try:
+        from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode
+        ...
+        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
+    except Exception as e:
+        ...
+        return hashlib.sha256(fallback.encode("utf-8")).hexdigest()[:16]
+
+    @staticmethod
+    def extract_markdown_links(content: str, base_url: Optional[str] = None) -> List[str]:
+        ...
+
+    @staticmethod
+    def is_link_collection_file(url: str, content: Optional[str] = None) -> bool:
+        ...
+
+    @staticmethod
+    def extract_display_name(url: str) -> str:
+        ...
```
Note: Replace the `...` with your current bodies (with the fixes from the next two comments).
Also applies to: 388-597
♻️ Duplicate comments (2)
python/src/server/services/crawling/helpers/url_handler.py (2)
366-381: Content-based detection: exclude “full” variants to prevent 492-URL crawl regression
Short-circuit content analysis when filename contains “full” (case-insensitive) to retain single-page behavior for llms-full.txt.
```diff
- if content:
-     # Reuse extractor to avoid regex divergence and maintain consistency
-     extracted_links = URLHandler.extract_markdown_links(content)
+ if content:
+     # Preserve single-page behavior for *full* variants
+     if "full" in filename:
+         logger.info(f"Skipping content-based link-collection detection for full-content file: {filename}")
+         return False
+     # Reuse extractor to avoid regex divergence and maintain consistency
+     extracted_links = URLHandler.extract_markdown_links(content, url)
```
368-371: Pass base_url so relatives count toward density
Relative links won’t be resolved or counted without `base_url`; this under-detects MD link lists.
```diff
- extracted_links = URLHandler.extract_markdown_links(content)
+ extracted_links = URLHandler.extract_markdown_links(content, url)
```
🧹 Nitpick comments (4)
python/src/server/services/crawling/helpers/url_handler.py (4)
255-261: Precompile the combined regex once to reduce per-call overhead
Move the pattern to a module-level constant; reuse it inside `extract_markdown_links`.
```diff
- combined_pattern = re.compile(
+ # at module scope:
+ # COMBINED_URL_PATTERN = re.compile(
+ #     r'\[(?P<text>[^\]]*)\]\((?P<md>[^)]+)\)'
+ #     r'|<\s*(?P<auto>https?://[^>\s]+)\s*>'
+ #     r'|(?P<bare>https?://[^\s<>()\[\]"]+)'
+ #     r'|(?P<proto>//[^\s<>()\[\]"]+)'
+ #     r'|(?P<www>www\.[^\s<>()\[\]"]+)'
+ # )
+ combined_pattern = COMBINED_URL_PATTERN
```
315-316: Reduce log noise for large files
Downgrade to debug or include a cap/truncation to avoid high-cardinality info logs.
```diff
- logger.info(f"Extracted {len(unique_urls)} unique links from content")
+ logger.debug(f"Extracted {len(unique_urls)} unique links from content")
```
171-177: Micro-optimization: use `endswith(tuple)` instead of looping over the set
Faster and simpler; keeps constant factors down in hot paths.
```diff
- for ext in binary_extensions:
-     if path.endswith(ext):
-         logger.debug(f"Skipping binary file: {url} (matched extension: {ext})")
-         return True
+ if path.endswith(tuple(binary_extensions)):
+     logger.debug(f"Skipping binary file by extension match: {url}")
+     return True
```
195-205: Minor: accept both http and https for GitHub match (defensive)
GitHub serves https, but being lenient costs nothing.
```diff
- github_file_pattern = r"https://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)"
+ github_file_pattern = r"https?://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
python/src/server/services/crawling/crawling_service.py (4 hunks)
python/src/server/services/crawling/helpers/url_handler.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- python/src/server/services/crawling/crawling_service.py
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead
Files:
python/src/server/services/crawling/helpers/url_handler.py
python/src/{server/services,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data
Files:
python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients
Files:
python/src/server/services/crawling/helpers/url_handler.py
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit
Files:
python/src/server/services/crawling/helpers/url_handler.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}
📄 CodeRabbit inference engine (CLAUDE.md)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality
Files:
python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**
📄 CodeRabbit inference engine (CLAUDE.md)
Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)
Files:
python/src/server/services/crawling/helpers/url_handler.py
🔇 Additional comments (1)
python/src/server/services/crawling/helpers/url_handler.py (1)
9-10: Type hints/imports look good
New imports are accurate and used correctly.
…. fixes from code rabbit
Actionable comments posted: 1
♻️ Duplicate comments (1)
python/src/server/services/crawling/crawling_service.py (1)
608-625: Use mapped percentages in sitemap/recursive branches (consistency with ProgressMapper)
These branches still emit raw 10/15% values. Map them via ProgressMapper to keep the UI consistent.
Apply:
```diff
  if self.progress_id:
-     self.progress_state.update({
+     overall = self.progress_mapper.map_progress("crawling", 10)
+     self.progress_state.update({
          "status": "crawling",
-         "percentage": 10,
+         "percentage": overall,
          "log": "Detected sitemap, parsing URLs...",
      })
      await update_crawl_progress(self.progress_id, self.progress_state)
  ...
  if sitemap_urls:
      # Emit progress before starting batch crawl
      if self.progress_id:
-         self.progress_state.update({
+         overall = self.progress_mapper.map_progress("crawling", 15)
+         self.progress_state.update({
              "status": "crawling",
-             "percentage": 15,
+             "percentage": overall,
              "log": f"Starting batch crawl of {len(sitemap_urls)} URLs...",
          })
  ...
  if self.progress_id:
-     self.progress_state.update({
+     overall = self.progress_mapper.map_progress("crawling", 10)
+     self.progress_state.update({
          "status": "crawling",
-         "percentage": 10,
+         "percentage": overall,
          "log": f"Starting recursive crawl with max depth {request.get('max_depth', 1)}...",
      })
```
🧹 Nitpick comments (5)
python/src/server/services/storage/code_storage_service.py (1)
967-976: Optional: include generic batch fields in the final event for BCFinal completion omits the generic batch_number/total_batches that you send per-batch. Consider including them here too to avoid any downstream regressions relying on those fields.
Apply:
await progress_callback({ "status": "code_storage", "percentage": 100, "log": f"Code storage completed. Stored {total_items} code examples.", "total_items": total_items, # Keep final batch info for code storage completion "code_total_batches": (total_items + batch_size - 1) // batch_size, "code_current_batch": (total_items + batch_size - 1) // batch_size, + # Backward-compat generic fields + "batch_number": (total_items + batch_size - 1) // batch_size, + "total_batches": (total_items + batch_size - 1) // batch_size, })archon-ui-main/src/services/crawlProgressService.ts (1)
77-83: Type surface aligns with backendAdding document_* and code_* optional fields matches the server payloads and CrawlingProgressCard usage.
Given the prior learning to deprecate this service in favor of useCrawlProgressPolling, consider moving these type updates to the hook and routing consumers there to reduce duplication.
archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx (1)
300-317: Guard for undefined vs falsy when showing code batch messagecode_current_batch can be 0/1; using a truthy check can hide valid values. Match the doc logic by checking for undefined.
Apply:
- if (progressData.code_current_batch && progressData.code_total_batches) { + if (progressData.code_current_batch !== undefined && progressData.code_total_batches) { step.message = `Batch ${progressData.code_current_batch}/${progressData.code_total_batches} - Extracting code blocks...`; } else { step.message = 'Extracting code blocks...'; }python/src/server/services/crawling/crawling_service.py (2)
495-528: Self-link normalization looks solid; consider “index.html/README” canonicalizationThe scheme/host/default-port normalization is correct. Optionally treat trailing “/index.html” or “/README.md” as equivalent to “/” to catch common homepage aliases.
Example:
def _core(u: str) -> str: p = urlparse(u) ... - path = p.path.rstrip("/") + path = p.path.rstrip("/") + if path.endswith("/index.html"): + path = path[:-11] # drop '/index.html' + if path.lower().endswith("/readme.md"): + path = path[:-10] return f"{scheme}://{host}{port_part}{path}"
559-604: Link-collection path looks correct; “full” files now excluded via helperThis should prevent the llms-full.txt over-crawl. Optional: add a configurable hard cap (e.g., RAG setting) on extracted links to avoid runaway batches on massive lists.
- if extracted_links: + if extracted_links: + # Optional safety cap (configurable) + try: + from ..credential_service import credential_service + cap = int((await credential_service.get_credential("MAX_EXTRACTED_LINKS", "200", decrypt=True)) or "200") + except Exception: + cap = 200 + if len(extracted_links) > cap: + logger.info(f"Capping extracted links from {len(extracted_links)} to {cap}") + extracted_links = extracted_links[:cap]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx (4 hunks)
archon-ui-main/src/services/crawlProgressService.ts (1 hunks)
python/src/server/services/crawling/crawling_service.py (3 hunks)
python/src/server/services/crawling/helpers/url_handler.py (5 hunks)
python/src/server/services/storage/code_storage_service.py (2 hunks)
python/src/server/services/storage/document_storage_service.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- python/src/server/services/crawling/helpers/url_handler.py
🧰 Additional context used
📓 Path-based instructions (8)
python/src/{server,mcp,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead
Files:
python/src/server/services/storage/document_storage_service.py
python/src/server/services/storage/code_storage_service.py
python/src/server/services/crawling/crawling_service.py
python/src/{server/services,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data
Files:
python/src/server/services/storage/document_storage_service.py
python/src/server/services/storage/code_storage_service.py
python/src/server/services/crawling/crawling_service.py
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients
Files:
python/src/server/services/storage/document_storage_service.py
python/src/server/services/storage/code_storage_service.py
python/src/server/services/crawling/crawling_service.py
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit
Files:
python/src/server/services/storage/document_storage_service.py
python/src/server/services/storage/code_storage_service.py
python/src/server/services/crawling/crawling_service.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}
📄 CodeRabbit inference engine (CLAUDE.md)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality
Files:
python/src/server/services/storage/document_storage_service.py
archon-ui-main/src/services/crawlProgressService.ts
python/src/server/services/storage/code_storage_service.py
python/src/server/services/crawling/crawling_service.py
archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx
python/src/server/**
📄 CodeRabbit inference engine (CLAUDE.md)
Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)
Files:
python/src/server/services/storage/document_storage_service.py
python/src/server/services/storage/code_storage_service.py
python/src/server/services/crawling/crawling_service.py
archon-ui-main/src/services/**
📄 CodeRabbit inference engine (CLAUDE.md)
Place API communication and business logic in archon-ui-main/src/services/
Files:
archon-ui-main/src/services/crawlProgressService.ts
archon-ui-main/src/components/**
📄 CodeRabbit inference engine (CLAUDE.md)
Place reusable UI components in archon-ui-main/src/components/
Files:
archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx
🧠 Learnings (1)
📚 Learning: 2025-08-28T13:07:24.766Z
Learnt from: Wirasm
PR: coleam00/Archon#514
File: archon-ui-main/src/services/crawlProgressService.ts:35-39
Timestamp: 2025-08-28T13:07:24.766Z
Learning: The crawlProgressService.ts in the Archon codebase should be deprecated in favor of the existing useCrawlProgressPolling hook from usePolling.ts, which already includes ETag support, 304 handling, tab visibility detection, and proper React lifecycle integration. This consolidation reduces code duplication and improves performance.
Applied to files:
archon-ui-main/src/services/crawlProgressService.ts
archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx
🧬 Code graph analysis (1)
python/src/server/services/crawling/crawling_service.py (3)
python/src/server/config/logfire_config.py (1)
safe_logfire_info (223-235)
python/src/server/services/crawling/helpers/url_handler.py (5)
is_txt (58-74)
is_markdown (38-55)
is_link_collection_file (368-435)
extract_markdown_links (279-365)
is_binary_file (77-174)
python/src/server/services/crawling/strategies/batch.py (1)
crawl_batch_with_progress(30-198)
🪛 Biome (2.1.2)
archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx
[error] 424-425: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Safe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 425-426: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Safe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 436-437: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Safe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
[error] 437-438: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.
The declaration is defined in this switch clause:
Safe fix: Wrap the declaration in a block.
(lint/correctness/noSwitchDeclarations)
🔇 Additional comments (4)
python/src/server/services/storage/code_storage_service.py (1)
958-964: Good: stage-scoped progress keys for code storage
Adding code_current_batch/code_total_batches cleanly isolates code-stage progress and keeps generic fields for BC. Looks consistent with the UI consumption.
python/src/server/services/storage/document_storage_service.py (1)
335-345: Good: stage-specific doc counters + BC fields
document_completed_batches/document_total_batches/document_current_batch are emitted alongside the legacy fields. This should fix “always 100%” UI symptoms while staying compatible.
Also applies to: 390-401
archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx (1)
739-775: Nice: stage-aware doc batch bar with fallbacks
Good use of stage-specific counters and clamped width; avoids cross-stage contamination and div/0.
Also applies to: 746-751, 757-766, 770-775
python/src/server/services/crawling/crawling_service.py (1)
541-556: Good: mapped progress used for text/markdown detection + crawl window
Using ProgressMapper here avoids percentage jumps and aligns with the new stage windows (5–10%). 👍
|
I can also confirm the UI batch progress bar issue is actually a pre-existing issue. I confirmed this by running the main branch of Archon (https://modelcontextprotocol.io/llms-full.txt) and had it occur there (see photo), so this PR would fix that issue as well! I know this is a separate issue from the original PR, and I don't mind splitting it up and addressing it in another PR if need be, but I can confirm this PR adds the ability to crawl llms.txt links recursively, allows llms-full.txt to be crawled fully, and fixes the batching progress bar, so it's a double whammy.
|
Actionable comments posted: 1
🧹 Nitpick comments (2)
python/src/server/services/crawling/crawling_service.py (2)
529-562: Harden self-link canonicalization (index pages + safer fallback).
Treat “/index.html|/index.htm|/README.md” as the base path and lowercase in the fallback to avoid false negatives.
```diff
  def _is_self_link(self, link: str, base_url: str) -> bool:
@@
-     def _core(u: str) -> str:
+     def _core(u: str) -> str:
          p = urlparse(u)
          scheme = (p.scheme or "http").lower()
          host = (p.hostname or "").lower()
          port = p.port
          if (scheme == "http" and port in (None, 80)) or (scheme == "https" and port in (None, 443)):
              port_part = ""
          else:
              port_part = f":{port}" if port else ""
-         path = p.path.rstrip("/")
+         path = p.path.rstrip("/")
+         # Canonicalize default index/README pages to the directory root
+         lp = path.lower()
+         if lp.endswith(("/index.html", "/index.htm", "/readme.md")):
+             path = path[: path.rfind("/")] if "/" in path else ""
          return f"{scheme}://{host}{port_part}{path}"
@@
      except Exception as e:
          logger.warning(f"Error checking if link is self-referential: {e}", exc_info=True)
          # Fallback to simple string comparison
-         return link.rstrip('/') == base_url.rstrip('/')
+         return link.rstrip('/').lower() == base_url.rstrip('/').lower()
```
573-589: Use ProgressMapper for emitted percent; keep crawl window inside mapper range.
Avoid raw “10%” emissions and align the text/markdown window with the crawling range (comment below says 3–8).
```diff
- if self.progress_tracker:
-     await self.progress_tracker.update(
-         status="crawling",
-         progress=10,
-         log="Detected text file, fetching content...",
-         crawl_type=crawl_type,
-         current_url=url
-     )
+ if self.progress_tracker:
+     overall = self.progress_mapper.map_progress("crawling", 5)
+     await self.progress_tracker.update(
+         status="crawling",
+         progress=overall,
+         log="Detected text/markdown file, fetching content...",
+         crawl_type=crawl_type,
+         current_url=url
+     )
@@
- start_progress=5,
- end_progress=10,
+ # Keep within ProgressMapper's crawling window
+ start_progress=3,
+ end_progress=4,
```
# inside callback(...) in _create_crawl_progress_callback mapped = self.progress_mapper.map_progress(base_status, progress) await self.progress_tracker.update(status=base_status, progress=mapped, log=message, **kwargs)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
python/src/server/services/crawling/crawling_service.py (3 hunks)
python/src/server/services/crawling/helpers/url_handler.py (5 hunks)
python/src/server/services/storage/document_storage_service.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- python/src/server/services/storage/document_storage_service.py
- python/src/server/services/crawling/helpers/url_handler.py
🧰 Additional context used
📓 Path-based instructions (3)
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Fail fast on service startup failures (crash with clear error if credentials, database, or any service cannot initialize)
Fail fast on missing configuration or invalid environment settings
Fail fast on database connection failures; do not hide connection issues
Fail fast on authentication/authorization failures; halt the operation and surface the error
Fail fast on data corruption or validation errors; let Pydantic raise
Fail fast when critical dependencies are unavailable (required service down)
Never store invalid data that would corrupt state (e.g., zero embeddings, null foreign keys, malformed JSON); fail instead
For batch processing, complete what you can and log detailed failures per item
Background tasks should finish queues but log failures clearly
Do not crash on a single WebSocket/event failure; log and continue serving other clients
If optional features are disabled, log and skip rather than crashing
External API calls should retry with exponential backoff; then fail with a clear, specific error
When continuing after a failure, skip the failed item entirely; never persist partial or corrupted results
Include context about the attempted operation in error messages
Preserve full stack traces with exc_info=True in Python logging
Use specific exception types; avoid catching generic Exception
Never return None to indicate failure; raise an exception with details
For batch operations, report both success counts and detailed failure lists
Target Python 3.12 and keep line length at 120 characters
Use Ruff for linting (errors, warnings, unused imports, style) and keep code Ruff-clean
Use Mypy for static type checking and keep code type-safe
Enable auto-formatting on save in IDEs to maintain consistent Python style
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/**
📄 CodeRabbit inference engine (CLAUDE.md)
Keep the main FastAPI application under python/src/server/
Files:
python/src/server/services/crawling/crawling_service.py
python/src/server/services/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Put backend business logic services under python/src/server/services/
Files:
python/src/server/services/crawling/crawling_service.py
🧬 Code graph analysis (1)
python/src/server/services/crawling/crawling_service.py (2)
python/src/server/services/crawling/helpers/url_handler.py (4)
is_txt (58-74)
is_markdown (38-55)
extract_markdown_links (279-365)
is_binary_file (77-174)
python/src/server/services/crawling/strategies/batch.py (1)
crawl_batch_with_progress(32-236)
🔇 Additional comments (1)
python/src/server/services/crawling/crawling_service.py (1)
683-690: LGTM: recursive crawl window matches ProgressMapper comments (3–8).
Consistent with the mapper; keeps UI progress stable.
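For anyone following the 3–8 window discussion: the stage mapping amounts to linear interpolation inside per-stage ranges. A minimal sketch; the ranges here are assumptions based on the comments above, not the real ProgressMapper table:

```python
STAGE_RANGES = {
    "starting": (0, 1),
    "crawling": (3, 8),           # the 3-8 window cited above
    "document_storage": (35, 80),
}

def map_progress(stage: str, stage_percent: float) -> int:
    """Map a stage-local 0-100 percentage onto the overall progress bar."""
    start, end = STAGE_RANGES[stage]
    stage_percent = max(0.0, min(100.0, stage_percent))  # clamp to avoid >100% blips
    return round(start + (end - start) * stage_percent / 100)

assert map_progress("crawling", 0) == 3
assert map_progress("crawling", 100) == 8
```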
|
This looks really neat; will test it tomorrow. I think we can merge this in pretty quickly @coleam00.
|
We need this feature!!! Please proceed.
|
@Chillbruhhh This PR is working really nicely for llms.txt - awesome work! However, for llms-full.txt, it still isn't ideal. I tried with https://ai.pydantic.dev/llms-full.txt, and it says it's crawling 492 pages when it's really just 1. On main right now it does as I'd expect - it quickly crawls the llms-full.txt as a single page and within seconds moves on to the embedding and storage step.
|
Wait hold on.... I believe GitHub glitched when switching to your PR and didn't pull some changes from when I last tested. Checking now...
|
@coleam00 I'm also confirming now by downloading this exact branch to test; this should be the correct version that fixed that issue. I tested it multiple times.
I cloned this exact branch, removed all my other container versions, built these, and this is what I see:
|
@Chillbruhhh Yeah that was my bad (or GitHub, idk) - it's looking good now! Just doing some last testing here.
|
Merged this now - nice work @Chillbruhhh!! I appreciate it a lot.
* Fixed llms.txt / llms-full.txt / llms.md etc. to finally be crawled. Intelligently determines if there are links in the llms.txt and crawls them as it should. Tested fully; everything works!
* Updated per CodeRabbit's suggestion - resolved
* Refined per CodeRabbit's suggestions, take 2; should be the final take. Didn't add the max-link parameter suggestion, though.
* Third time's the charm: added the nitpicky thing from CodeRabbit. CodeRabbit makes me crave nicotine.
* Fixed progress bar accuracy and OpenAI API compatibility issues

Changes Made:
1. Progress Bar Fix: Fixed llms.txt crawling progress jumping to 90% then regressing to 45% by adjusting batch crawling progress ranges (20-30% instead of 40-90%) and using consistent ProgressMapper ranges
2. OpenAI API Compatibility: Added robust fallback logic in the contextual embedding service to handle newer models (GPT-5) that require max_completion_tokens instead of max_tokens and don't support custom temperature values

Files Modified:
- src/server/services/crawling/crawling_service.py - Fixed progress ranges
- src/server/services/crawling/progress_mapper.py - Restored original stage ranges
- src/server/services/embeddings/contextual_embedding_service.py - Added fallback API logic

Result:
- Progress bar now smoothly progresses 0-30% (crawling) -> 35-80% (storage) -> 100%
- Automatic compatibility with both old (GPT-4.1-nano) and new (GPT-5-nano) OpenAI models
- Eliminates "max_tokens not supported" and "temperature not supported" errors

* Removed the GPT-5 handling since that's a separate issue and doesn't pertain to here; definitely recommend looking at that though, since gpt-5-nano is considered a reasoning model, doesn't use max_tokens, and requires a different output parameter. Also removed my upsert fix from document storage since that's not part of this exact issue and I have another PR open for it. Checked with CodeRabbit in my IDE: no issues, no nitpicks. Should be good? It might flag me for the UPSERT logic not being in here; oh well, that has nothing to do with this PR and was submitted in the last revision by mistake. Everything's tested and good to go!
* Fixed the llms-full.txt crawling issue: now crawls just that page when crawling llms-full.txt. Fixed the crawl URL showing 100% when multiple URLs are present and crawling hasn't finished. Also fixed a styling issue in CrawlingProgressCard.tsx: when batching code examples, the batching progress bar would sometimes glitch out of the UI; it won't do that now.
* Fixed a few things so it will work with the current branch!
* Added some enhancements to UI rendering as well, plus other little misc. fixes from CodeRabbit

---------

Co-authored-by: Chillbruhhh <joshchesser97@gmail.com>
Co-authored-by: Claude Code <claude@anthropic.com>
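The API-compatibility fallback described above (and later split out of this PR) roughly amounts to retrying with the newer parameter name. A sketch assuming the openai v1 Python client, not the actual service code:

```python
from openai import AsyncOpenAI, BadRequestError

client = AsyncOpenAI()

async def chat_with_token_fallback(model: str, messages: list[dict], limit: int = 200):
    """Try the legacy max_tokens parameter first, then retry with
    max_completion_tokens for reasoning models (e.g. gpt-5-nano) that reject it."""
    try:
        return await client.chat.completions.create(
            model=model, messages=messages, max_tokens=limit
        )
    except BadRequestError as exc:
        if "max_tokens" not in str(exc):
            raise  # unrelated error; surface it with full context
        return await client.chat.completions.create(
            model=model, messages=messages, max_completion_tokens=limit
        )
```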
…lts (#437)
Workflow executor now infers the provider from the model when no explicit provider is set. Previously, setting `model: sonnet` on a workflow while `defaultAssistant: codex` was in config would throw a compatibility error because the provider was blindly inherited from the config default.
Resolution priority is now (sketched below):
1. Explicit workflow `provider` field
2. Inferred from workflow `model` (claude aliases → claude, else → codex)
3. Config `defaultAssistant`
Also removes `model: sonnet` from all 8 default workflows and dead `model: haiku` step-level fields from 2 workflows. Model selection is now fully driven by config, not hardcoded in workflow YAMLs.
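A sketch of that resolution order; the dict shapes and alias set are illustrative, not the executor's real types:

```python
CLAUDE_ALIASES = {"sonnet", "opus", "haiku"}

def resolve_provider(workflow: dict, config: dict) -> str:
    """Resolve the assistant provider: explicit field, then model inference, then config default."""
    if provider := workflow.get("provider"):        # 1. explicit workflow provider field
        return provider
    if model := workflow.get("model"):              # 2. infer from the model name
        return "claude" if model in CLAUDE_ALIASES else "codex"
    return config.get("defaultAssistant", "codex")  # 3. fall back to the config default
```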
Pull Request
Summary
I discovered Archon wasn't properly crawling llms.txt, llms.md, etc., so I added the feature: it will now parse and crawl the links in the llms.txt file. It's backwards compatible and will still crawl llms.txt files even if they don't have links; this just fixes that bug.
This is what it was crawling before when crawling an llms.txt:
This is what it looks like crawling llms.txt now:
Changes Made
Enhanced llms.txt support - The system now automatically detects and crawls all links found inside llms.txt files, instead of just treating them as static text files.
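At a high level, the detect-and-extract step works like the following minimal sketch (heavily simplified from url_handler.py; the regex and names here are illustrative):

```python
import re
from urllib.parse import urljoin

MD_LINK = re.compile(r"\[[^\]]*\]\(([^)]+)\)|(https?://[^\s<>()\[\]\"]+)")

def extract_links(content: str, base_url: str) -> list[str]:
    """Pull absolute URLs out of an llms.txt / llms.md file, resolving relative paths."""
    seen: set[str] = set()
    links: list[str] = []
    for md, bare in MD_LINK.findall(content):
        raw = (md or bare).strip().rstrip(".,;:")
        url = urljoin(base_url, raw)  # resolve relative links against the file's own URL
        if url.startswith("http") and url not in seen:
            seen.add(url)
            links.append(url)
    return links

text = "# Docs\n[Guide](/guide.md)\nhttps://example.com/api\n"
print(extract_links(text, "https://example.com/llms.txt"))
# ['https://example.com/guide.md', 'https://example.com/api']
```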
Key Changes Made
Type of Change
Affected Services
Testing
Test Evidence
```
1. Testing Ultimate URL Format Support:
Extracted 13 unique links from content
Found 13 URLs (expected: 10+):
- https://platform.openai.com/docs
- https://docs.anthropic.com/
- https://python.langchain.com/docs
- https://www.langchain.com/langsmith
- https://wandb.ai/
- https://example.com/
- https://github.com/microsoft/vscode
- https://raw.githubusercontent.com/microsoft/vscode/main/README.md
- https://www.google.com
- https://www.stackoverflow.com/questions
- https://example.com/test
- https://www.example.com
- https://test.com/path?query=1#fragment

2. Testing Enhanced Markdown File Detection:
✓ Detected link collection file by filename: llms.txt
  https://example.com/llms.txt Markdown: ✗ Collection: ✓
✓ Detected link collection file by filename: llms.md
  https://github.com/user/repo/llms.md Markdown: ✓ Collection: ✓
✓ Detected link collection file by filename: links.mdx
  https://example.com/links.mdx Markdown: ✓ Collection: ✓
✓ Detected link collection file by filename: resources.markdown
  https://example.com/resources.markdown Markdown: ✓ Collection: ✓
✓ Detected link collection file by filename: references.txt
  https://example.com/references.txt Markdown: ✗ Collection: ✓
✓ Detected potential link collection file: awesome-references.md
  https://example.com/awesome-references.md Markdown: ✓ Collection: ✓
  https://example.com/regular-file.py Markdown: ✗ Collection: ✗

3. Testing DRY Principle Implementation:
Extracted 7 unique links from content
✓ Detected link collection by content analysis: 7 links, density 4.67%
DRY content analysis: ✓ (should reuse main extractor)

4. Testing GitHub URL Enhancement:
Transformed GitHub file URL to raw: https://github.com/microsoft/vscode/blob/main/README.md -> https://raw.githubusercontent.com/microsoft/vscode/main/README.md
Original: https://github.com/microsoft/vscode/blob/main/README.md
Transformed: https://raw.githubusercontent.com/microsoft/vscode/main/README.md
Transformed GitHub file URL to raw: https://github.com/microsoft/vscode/blob/main/README.md?plain=1 -> https://raw.githubusercontent.com/microsoft/vscode/main/README.md
Original: https://github.com/microsoft/vscode/blob/main/README.md?plain=1
Transformed: https://raw.githubusercontent.com/microsoft/vscode/main/README.md
Transformed GitHub file URL to raw: https://github.com/microsoft/vscode/blob/main/README.md#installation -> https://raw.githubusercontent.com/microsoft/vscode/main/README.md
Original: https://github.com/microsoft/vscode/blob/main/README.md#installation
Transformed: https://raw.githubusercontent.com/microsoft/vscode/main/README.md
Transformed GitHub file URL to raw: https://github.com/microsoft/vscode/blob/main/README.md?plain=1#installation -> https://raw.githubusercontent.com/microsoft/vscode/main/README.md
Original: https://github.com/microsoft/vscode/blob/main/README.md?plain=1#installation
Transformed: https://raw.githubusercontent.com/microsoft/vscode/main/README.md

================================================================================
🛡️ CodeRabbit-Proof Features Implemented:
✅ Named regex groups with ultimate URL format support
✅ www.example.com and //example.com detection
✅ Enhanced punctuation cleanup (.,;:)]>)
✅ DRY principle - single source of truth for patterns
✅ Bulletproof GitHub URL handling with query/fragment stripping
✅ Complete pattern coverage including 'references'
✅ Database configuration respect for concurrency
✅ Markdown file support (.md, .mdx, .markdown)
# trust me it works
```
Checklist
Breaking Changes
Additional Notes
Summary by CodeRabbit
New Features
Bug Fixes