
Fix/(llms.txt) not crawling links inside of file #437

Merged
coleam00 merged 21 commits into coleam00:main from Chillbruhhh:fix/(llms.txt)-not-crawling-links-inside-of-file
Sep 6, 2025

Conversation


@Chillbruhhh Chillbruhhh commented Aug 22, 2025

Pull Request

Summary

I discovered Archon wasn't properly crawling llms.txt, llms.md, etc., so I added support for parsing and crawling the links inside an llms.txt file. It is backwards compatible: an llms.txt without links is still crawled as a single document, so this just fixes the bug.

This is what it was crawling before when crawling a llms.txt:

Screenshot 2025-08-22 035117

This is what it looks like crawling llms.txt now:

Screenshot 2025-08-22 041311 Screenshot 2025-08-22 041337

Changes Made

Enhanced llms.txt support - The system now automatically detects and crawls all links found inside llms.txt files, instead of just treating them as static text files.

Key Changes Made

  1. Added Link Detection & Extraction (url_handler.py)
  • extract_markdown_links() - Extracts markdown links, autolinks, and bare URLs from file content
  • is_link_collection_file() - Detects llms.txt-style files by filename and content analysis
  • Handles relative URLs, filters invalid links, removes duplicates (a rough sketch follows this list)
  2. Enhanced Crawling Logic (crawling_service.py)
  • Before: Text files → Single document
  • After: Text files → Check if link collection → Extract links → Batch crawl all links → Combine results
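
A rough, self-contained sketch of what the two url_handler.py helpers do, based on the descriptions and test output in this PR (names match the PR, but the exact signatures and thresholds here are assumptions, not the shipped implementation):

```python
# Illustrative sketch only; mirrors the behavior described above, not the exact code.
import re
from urllib.parse import urljoin, urlparse

LINK_PATTERN = re.compile(
    r'\[(?P<text>[^\]]*)\]\((?P<md>[^)]+)\)'   # markdown links: [text](url)
    r'|<\s*(?P<auto>https?://[^>\s]+)\s*>'      # autolinks: <https://...>
    r'|(?P<bare>https?://[^\s<>()\[\]"]+)'      # bare URLs
)

def extract_markdown_links(content: str, base_url: str | None = None) -> list[str]:
    """Collect unique absolute URLs from markdown or plain-text content."""
    urls: list[str] = []
    for match in LINK_PATTERN.finditer(content):
        url = match.group('md') or match.group('auto') or match.group('bare')
        if not url:
            continue
        url = url.strip().rstrip('.,;:)')                    # trim trailing punctuation
        if base_url and not url.startswith(('http://', 'https://')):
            url = urljoin(base_url, url)                     # resolve relative links
        if url.startswith(('http://', 'https://')) and url not in urls:
            urls.append(url)                                 # de-duplicate, keep order
    return urls

def is_link_collection_file(url: str, content: str) -> bool:
    """Heuristic: known filenames (llms/links/resources/references) or link-dense content."""
    filename = urlparse(url).path.lower().rsplit('/', 1)[-1]
    if any(p in filename for p in ('llms', 'links', 'resources', 'references')):
        if filename.endswith(('.txt', '.md', '.mdx', '.markdown')):
            return True
    links = extract_markdown_links(content, url)
    words = max(len(content.split()), 1)
    return len(links) >= 5 and len(links) / words > 0.02     # density threshold is assumed

# Example: an llms.txt body with a relative markdown link and a bare URL.
sample = "[Docs](https://docs.anthropic.com/)\n[Guide](./guides/start.md)\nhttps://example.com/"
print(extract_markdown_links(sample, "https://example.com/llms.txt"))
```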

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Performance improvement
  • Code refactoring

Affected Services

  • Frontend (React UI)
  • Server (FastAPI backend)

Testing

  • All existing tests pass
  • Manually tested affected user flows
  • Docker builds succeed for all services

Test Evidence

 1. Testing Ultimate URL Format Support:
     Extracted 13 unique links from content
        Found 13 URLs (expected: 10+):
        - https://platform.openai.com/docs
        - https://docs.anthropic.com/
        - https://python.langchain.com/docs
        - https://www.langchain.com/langsmith
        - https://wandb.ai/
        - https://example.com/
        - https://github.com/microsoft/vscode
        - https://raw.githubusercontent.com/microsoft/vscode/main/README.md
        - https://www.google.com
        - https://www.stackoverflow.com/questions
        - https://example.com/test
        - https://www.example.com
        - https://test.com/path?query=1#fragment

     2. Testing Enhanced Markdown File Detection:
     ✓ Detected link collection file by filename: llms.txt
        https://example.com/llms.txt
          Markdown: ✗
          Collection: ✓
     ✓ Detected link collection file by filename: llms.md
        https://github.com/user/repo/llms.md
          Markdown: ✓
          Collection: ✓
     ✓ Detected link collection file by filename: links.mdx
        https://example.com/links.mdx
          Markdown: ✓
          Collection: ✓
     ✓ Detected link collection file by filename: resources.markdown
        https://example.com/resources.markdown
          Markdown: ✓
          Collection: ✓
     ✓ Detected link collection file by filename: references.txt
        https://example.com/references.txt
          Markdown: ✗
          Collection: ✓
     ✓ Detected potential link collection file: awesome-references.md
        https://example.com/awesome-references.md
          Markdown: ✓
          Collection: ✓
        https://example.com/regular-file.py
          Markdown: ✗
          Collection: ✗

     3. Testing DRY Principle Implementation:
     Extracted 7 unique links from content
     ✓ Detected link collection by content analysis: 7 links, density 4.67%
        DRY content analysis: ✓ (should reuse main extractor)

     4. Testing GitHub URL Enhancement:
     Transformed GitHub file URL to raw: https://github.com/microsoft/vscode/blob/main/README.md -> https://raw.githubusercontent.com/microsoft/vscode/main/README.md
        Original:    https://github.com/microsoft/vscode/blob/main/README.md
        Transformed: https://raw.githubusercontent.com/microsoft/vscode/main/README.md

     Transformed GitHub file URL to raw: https://github.com/microsoft/vscode/blob/main/README.md?plain=1 -> https://raw.githubusercontent.com/microsoft/vscode/main/README.md
        Original:    https://github.com/microsoft/vscode/blob/main/README.md?plain=1
        Transformed: https://raw.githubusercontent.com/microsoft/vscode/main/README.md

     Transformed GitHub file URL to raw: https://github.com/microsoft/vscode/blob/main/README.md#installation -> https://raw.githubusercontent.com/microsoft/vscode/main/README.md
        Original:    https://github.com/microsoft/vscode/blob/main/README.md#installation
        Transformed: https://raw.githubusercontent.com/microsoft/vscode/main/README.md

     Transformed GitHub file URL to raw: https://github.com/microsoft/vscode/blob/main/README.md?plain=1#installation -> https://raw.githubusercontent.com/microsoft/vscode/main/README.md
        Original:    https://github.com/microsoft/vscode/blob/main/README.md?plain=1#installation
        Transformed: https://raw.githubusercontent.com/microsoft/vscode/main/README.md
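
For reference, a minimal sketch of the blob-to-raw rewrite exercised above, assuming the regex-and-strip approach that the review diffs below suggest (the actual transform_github_url may differ):

```python
# Sketch of the GitHub blob -> raw rewrite shown in the test output above.
import re

def transform_github_url(url: str) -> str:
    """Rewrite a GitHub blob URL to raw.githubusercontent.com, dropping query/fragment."""
    match = re.match(r'https?://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)', url)
    if not match:
        return url                                   # leave non-blob URLs untouched
    owner, repo, branch, path = match.groups()
    path = path.split('?', 1)[0].split('#', 1)[0]    # raw URLs need neither ?plain=1 nor #anchors
    return f'https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}'

assert transform_github_url(
    "https://github.com/microsoft/vscode/blob/main/README.md?plain=1#installation"
) == "https://raw.githubusercontent.com/microsoft/vscode/main/README.md"
```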

     ================================================================================

     🛡️  CodeRabbit-Proof Features Implemented:
       ✅ Named regex groups with ultimate URL format support
       ✅ www.example.com and //example.com detection
       ✅ Enhanced punctuation cleanup (.,;:)]>)
       ✅ DRY principle - single source of truth for patterns
       ✅ Bulletproof GitHub URL handling with query/fragment stripping
       ✅ Complete pattern coverage including 'references'
       ✅ Database configuration respect for concurrency
       ✅ Markdown file support (.md, .mdx, .markdown)

#trust me it works

Checklist

  • My code follows the service architecture patterns
  • If using an AI coding assistant, I used the CLAUDE.md rules
  • All new and existing tests pass locally
  • My changes generate no new warnings
  • I have verified no regressions in existing features

Breaking Changes

Additional Notes

Screenshot 2025-08-22 031955

Summary by CodeRabbit

  • New Features

    • Markdown files are now crawled like text, with automatic detection of “link collection” docs that extract, filter, and batch-crawl embedded links.
    • Multi-stage crawling for Markdown/text with clearer, threshold-based progress updates.
    • Stage-specific progress indicators for documents and code (current/total batches) for more accurate status tracking.
    • Cleaner, user-friendly display names derived from URLs.
  • Bug Fixes

    • More reliable GitHub file URL handling that correctly resolves raw file links.
    • Improved filtering to avoid self-referential and binary links when processing link collections.
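
A minimal sketch of how that self-link and binary filtering could look (helper names follow the walkthrough below; the suffix list and normalization details are assumptions, not the shipped code):

```python
# Illustrative filtering of extracted links before batch-crawling them.
from urllib.parse import urldefrag, urlparse

BINARY_SUFFIXES = ('.pdf', '.zip', '.png', '.jpg', '.jpeg', '.gif', '.svg', '.mp4', '.exe')  # abridged

def is_self_link(link: str, base_url: str) -> bool:
    """True when an extracted link points back at the collection file itself."""
    left, _ = urldefrag(link)
    right, _ = urldefrag(base_url)
    return left.rstrip('/') == right.rstrip('/')

def is_binary_file(url: str) -> bool:
    """True when the URL path ends in a suffix we do not want to crawl as a page."""
    return urlparse(url).path.lower().endswith(BINARY_SUFFIXES)

def filter_links(links: list[str], base_url: str) -> list[str]:
    return [u for u in links if not is_self_link(u, base_url) and not is_binary_file(u)]
```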

Chillbruhhh and others added 5 commits March 8, 2025 00:13
Removing langgraph-api from requirements.txt so it is autoresolved, h…
… intelligently determines if there are links in the llms.txt and crawls them as it should. Tested fully, everything works!

coderabbitai Bot commented Aug 22, 2025

Walkthrough

Introduces Markdown and link-collection handling in crawling, including self-link detection, markdown link extraction, binary filtering, and batched crawling with progress mapping. Enhances URL utilities for markdown detection, link parsing, GitHub URL normalization, and display names. Adds stage-specific progress fields for code/document storage. No functional changes in embeddings.

Changes

Cohort / File(s) | Summary of Changes

  • Crawling: Markdown & Link-collection flow (python/src/server/services/crawling/crawling_service.py)
    Added _is_self_link; expanded text handling to include Markdown; introduced link-collection detection using extracted links from the first result; filtered self-referential and binary links; batch-crawled valid links with progress windows; added detailed logging and ProgressMapper integration.
  • URL utilities: Markdown, links, binary, display (python/src/server/services/crawling/helpers/url_handler.py)
    Added is_markdown, extract_markdown_links, is_link_collection_file, is_binary_file, extract_display_name; updated is_txt to parse via urlparse; improved transform_github_url to strip query/fragment; added normalization/deduping, relative URL resolution, and robust logging.
  • Storage progress: Code (python/src/server/services/storage/code_storage_service.py)
    Added stage-specific progress fields code_current_batch and code_total_batches to per-batch and final progress payloads; no control-flow changes.
  • Storage progress: Documents (python/src/server/services/storage/document_storage_service.py)
    Added document_completed_batches, document_total_batches, and document_current_batch to per-batch and final progress payloads; preserved existing fields; no control-flow changes.
  • Embeddings (python/src/server/services/embeddings/contextual_embedding_service.py)
    No functional changes; formatting-only (EOF newline).
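
To make the two storage rows concrete, here is a hedged example of what the enriched progress payloads might look like (only the *_batch/*_batches field names come from this PR; the status strings and numbers are placeholders):

```python
# Example progress payloads; the new fields sit alongside the existing generic ones.
code_progress = {
    "status": "code_extraction",        # placeholder status value
    "percentage": 62,
    "code_current_batch": 5,            # new stage-specific field
    "code_total_batches": 8,            # new stage-specific field
}

document_progress = {
    "status": "document_storage",       # placeholder status value
    "percentage": 48,
    "document_current_batch": 3,        # new
    "document_completed_batches": 2,    # new
    "document_total_batches": 10,       # new
}
```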

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant CrawlingService
  participant URLHandler
  participant Progress
  participant BatchCrawler as Batch Crawl

  Client->>CrawlingService: crawl(url)
  CrawlingService->>URLHandler: is_txt(url) / is_markdown(url)
  alt Text/Markdown path
    CrawlingService->>Progress: update(start=5,end=10)
    CrawlingService->>CrawlingService: crawl_markdown_file(url)
    CrawlingService->>URLHandler: is_link_collection_file(url, content)
    alt Link-collection
      CrawlingService->>URLHandler: extract_markdown_links(content, base_url)
      CrawlingService->>CrawlingService: filter _is_self_link(link, base_url)
      CrawlingService->>URLHandler: is_binary_file(link) (filter)
      alt Has valid links
        CrawlingService->>Progress: update(start=10,end=20)
        CrawlingService->>BatchCrawler: crawl_batch_with_progress(links, max_concurrent, cb)
        BatchCrawler-->>CrawlingService: batch results
        CrawlingService-->>Client: results (type=link_collection_with_crawled_links)
      else No valid links
        CrawlingService-->>Client: initial results
      end
    else Not a link-collection
      CrawlingService-->>Client: initial results
    end
  else Other types
    CrawlingService-->>Client: existing crawl handling
  end
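
A condensed, code-level reading of the diagram above (method names come from the walkthrough; progress updates, cancellation checks, and error handling are omitted, so treat this as a sketch rather than the shipped implementation):

```python
# Sketch of the text/markdown crawl path with link-collection handling.
async def crawl_text_or_markdown(self, url: str, request: dict) -> list[dict]:
    crawl_results = await self.crawl_markdown_file(url)          # fetch the file itself
    if not crawl_results:
        return []

    content = crawl_results[0].get("markdown", "")
    if not self.url_handler.is_link_collection_file(url, content):
        return crawl_results                                      # plain file: unchanged behavior

    links = self.url_handler.extract_markdown_links(content, url)
    links = [
        u for u in links
        if not self._is_self_link(u, url) and not self.url_handler.is_binary_file(u)
    ]
    if links:
        batch_results = await self.crawl_batch_with_progress(
            links,
            max_concurrent=request.get("max_concurrent"),          # None -> DB-configured default
        )
        crawl_results.extend(batch_results)                        # combine file + crawled links
    return crawl_results
```
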
sequenceDiagram
  autonumber
  participant Stage as Storage Stage
  participant Progress
  Note over Stage,Progress: Stage-specific progress fields
  Stage->>Progress: Code: code_current_batch, code_total_batches
  Stage->>Progress: Document: document_current_batch,<br/>document_completed_batches, document_total_batches
  Note over Progress: Generic fields preserved alongside stage-specific ones

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • coleam00

Poem

A whisk of links in markdown’s warren wide,
I sniff the trails where tiny anchors hide.
I skip self-loops, let binaries pass by—
then batch-hop threads beneath a progress sky.
With tidy names and counts that clearly show,
this bunny logs the journey—on we go! 🐇✨



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/src/server/services/crawling/crawling_service.py (1)

502-517: .md files are not treated as link collections; extraction only runs for .txt

The new link-collection flow is gated by is_txt(url), so llms.md and similar files won’t trigger extraction/crawling of embedded links. This contradicts the PR objective.

Apply this diff to include markdown files and clarify the log:

-        if self.url_handler.is_txt(url):
+        if self.url_handler.is_txt(url) or getattr(self.url_handler, "is_markdown", lambda _u: False)(url):
@@
-                    "log": "Detected text file, fetching content...",
+                    "log": "Detected text/markdown file, fetching content...",

And add this helper to URLHandler (in url_handler.py):

+    @staticmethod
+    def is_markdown(url: str) -> bool:
+        """
+        Check if a URL points to a markdown file (.md, .mdx, .markdown).
+        """
+        try:
+            path = urlparse(url).path.lower()
+            return path.endswith(('.md', '.mdx', '.markdown'))
+        except Exception as e:
+            logger.warning(f"Error checking if URL is markdown file: {e}", exc_info=True)
+            return False
🧹 Nitpick comments (4)
python/src/server/services/crawling/helpers/url_handler.py (2)

182-185: Preserve stack traces in logs for easier debugging

Per coding guidelines, include exc_info=True in error/warning logs to retain stack traces.

Apply this diff:

@@
-        except Exception as e:
-            logger.error(f"Error extracting markdown links: {e}")
+        except Exception as e:
+            logger.error(f"Error extracting markdown links: {e}", exc_info=True)
             return []
@@
-        except Exception as e:
-            logger.warning(f"Error checking if file is link collection: {e}")
+        except Exception as e:
+            logger.warning(f"Error checking if file is link collection: {e}", exc_info=True)
             return False

Optionally, consider adding exc_info=True in other exception handlers in this file for consistency.

Also applies to: 241-243


130-185: Consider filtering non-HTML/binary assets early

If link collections include assets like PDFs or images, you can pre-filter using is_binary_file to avoid wasteful batch crawls.

If desired, add:

# After URL collection, before dedup:
urls = [u for u in urls if not URLHandler.is_binary_file(u)]
python/src/server/services/crawling/crawling_service.py (2)

519-561: Add cancellation checks around extraction/batch steps and filter out binaries

Small robustness improvements:

  • Check for cancellation before/after potentially long steps.
  • Filter out binary/non-HTML links before batch to save time and storage.

Apply this diff:

@@
-            # Check if this is a link collection file and extract links
+            # Check if this is a link collection file and extract links
             if crawl_results and len(crawl_results) > 0:
                 content = crawl_results[0].get('markdown', '')
                 if self.url_handler.is_link_collection_file(url, content):
                     if self.progress_id:
@@
-                    # Extract links from the content
+                    # Check for cancellation before extraction
+                    self._check_cancellation()
+                    # Extract links from the content
                     extracted_links = self.url_handler.extract_markdown_links(content, url)
                     
-                    if extracted_links:
+                    # Optional: filter out binaries (pdf, images, archives, etc.)
+                    extracted_links = [u for u in extracted_links if not self.url_handler.is_binary_file(u)]
+
+                    if extracted_links:
                         if self.progress_id:
@@
-                        batch_results = await self.crawl_batch_with_progress(
+                        # Check for cancellation before starting batch crawl
+                        self._check_cancellation()
+                        batch_results = await self.crawl_batch_with_progress(
                             extracted_links,
-                            max_concurrent=request.get('max_concurrent', 3),
+                            # Let strategy apply DB-configured concurrency unless explicitly provided
+                            max_concurrent=request.get('max_concurrent', None),
                             progress_callback=await self._create_crawl_progress_callback("crawling"),
                             start_progress=30,
                             end_progress=70,
                         )
+                        # Check for cancellation after batch crawl
+                        self._check_cancellation()

545-551: Use strategy-configured concurrency by default instead of hardcoded 3

Batch strategy already reads concurrency from settings; defaulting to 3 is inconsistent with other paths (e.g., sitemap). Prefer None unless the caller explicitly overrides.

Apply this minimal diff:

-                            max_concurrent=request.get('max_concurrent', 3),
+                            max_concurrent=request.get('max_concurrent', None),
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between cb4dba1 and ddbe3c1.

📒 Files selected for processing (2)
  • python/src/server/services/crawling/crawling_service.py (1 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
🧬 Code graph analysis (1)
python/src/server/services/crawling/crawling_service.py (2)
python/src/server/services/crawling/helpers/url_handler.py (2)
  • is_link_collection_file (187-243)
  • extract_markdown_links (131-184)
python/src/server/services/crawling/strategies/batch.py (1)
  • crawl_batch_with_progress (31-199)
🔇 Additional comments (1)
python/src/server/services/crawling/crawling_service.py (1)

519-561: Verified crawl_markdown_file return shape

I’ve confirmed that crawl_markdown_file (in single_page.py) always returns either:

  • a single‐element list containing the fetched document on success, or
  • an empty list on failure.

Because it never emits multiple entries (e.g., no multipart or redirect segments), indexing crawl_results[0] to get the markdown is safe under the current implementation. If in the future the return shape changes to include multiple items, this use of [0] should be revisited.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (6)
python/src/server/services/crawling/helpers/url_handler.py (4)

170-173: Broaden link extraction to scheme-less and protocol-relative bare URLs; tighten cleanup

Many real-world link lists use www.example.com or //example.com. Add named groups to the regex and normalize both forms; also trim a couple more trailing punctuation chars to avoid spurious ]/>.

Apply this diff:

-            combined_pattern = re.compile(
-                r'\[([^\]]*)\]\(([^)]+)\)'          # group 2: markdown URL
-                r'|<\s*(https?://[^>\s]+)\s*>'      # group 3: autolink URL
-                r'|(https?://[^\s<>()\[\]"]+)'      # group 4: bare URL
-            )
+            combined_pattern = re.compile(
+                r'\[(?P<text>[^\]]*)\]\((?P<md>[^)]+)\)'      # named: md
+                r'|<\s*(?P<auto>https?://[^>\s]+)\s*>'        # named: auto
+                r'|(?P<bare>https?://[^\s<>()\[\]"]+)'        # named: bare
+                r'|(?P<proto>//[^\s<>()\[\]"]+)'              # named: protocol-relative
+                r'|(?P<www>www\.[^\s<>()\[\]"]+)'             # named: www.* without scheme
+            )
@@
-            def _clean_url(u: str) -> str:
+            def _clean_url(u: str) -> str:
                 # Trim whitespace and common trailing punctuation
-                return u.strip().rstrip('.,;:)')
+                return u.strip().rstrip('.,;:)]>')
@@
-            for match in re.finditer(combined_pattern, content):
-                url = match.group(2) or match.group(3) or match.group(4)
+            for match in re.finditer(combined_pattern, content):
+                url = (
+                    match.group('md')
+                    or match.group('auto')
+                    or match.group('bare')
+                    or match.group('proto')
+                    or match.group('www')
+                )
                 if not url:
                     continue
                 url = _clean_url(url)
@@
-                if url.startswith('//'):
+                if url.startswith('//'):
                     url = f'https:{url}'
+                elif url.startswith('www.'):
+                    url = f'https://{url}'
@@
-                if base_url and not url.startswith(('http://', 'https://')):
+                if base_url and not url.startswith(('http://', 'https://')):
                     try:
                         url = urljoin(base_url, url)

Also applies to: 181-181, 190-193, 194-201, 175-178


254-256: Include “references” in pattern-based detection

Covers cases like awesome-references.md not captured by exact filename list.

-            if any(pattern in filename for pattern in ['llms', 'links', 'resources']):
+            if any(pattern in filename for pattern in ['llms', 'links', 'resources', 'references']):
                 if filename.endswith(('.txt', '.md', '.mdx', '.markdown')):
                     logger.info(f"Detected potential link collection file: {filename}")
                     return True

261-269: Avoid regex divergence: reuse your own extractor for density checks

Using extract_markdown_links here keeps detection logic consistent and reduces maintenance risk when patterns evolve.

-                # Count markdown links + autolinks + bare URLs
-                markdown_link_pattern = r'\[([^\]]*)\]\(([^)]+)\)'
-                autolink_pattern = r'<\s*(https?://[^>\s]+)\s*>'
-                bare_url_pattern = r'(https?://[^\s<>()\[\]"]+)'
-                md_links = re.findall(markdown_link_pattern, content)
-                auto_links = re.findall(autolink_pattern, content)
-                bare_links = re.findall(bare_url_pattern, content)
-                total_links = len(md_links) + len(auto_links) + len(bare_links)
+                # Reuse extractor to avoid regex skew
+                links = URLHandler.extract_markdown_links(content)
+                total_links = len(links)

131-147: Strip GitHub blob query/fragment before building raw URL

raw.githubusercontent.com doesn’t need ?plain=1 or anchors; keeping them can lead to 404s or cache splits.

         match = re.match(github_file_pattern, url)
         if match:
-            owner, repo, branch, path = match.groups()
-            raw_url = f'https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}'
+            owner, repo, branch, path = match.groups()
+            # Drop querystring/fragment from path when targeting raw
+            path = path.split('?', 1)[0].split('#', 1)[0]
+            raw_url = f'https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}'
             logger.info(f"Transformed GitHub file URL to raw: {url} -> {raw_url}")
             return raw_url
python/src/server/services/crawling/crawling_service.py (2)

553-556: Don’t override DB-tuned concurrency when not specified

Defaulting to 3 masks CRAWL_MAX_CONCURRENT from settings. Pass None unless the request explicitly sets a value; the strategy will fetch DB config.

-                        batch_results = await self.crawl_batch_with_progress(
-                            extracted_links,
-                            max_concurrent=request.get('max_concurrent', 3),
+                        batch_results = await self.crawl_batch_with_progress(
+                            extracted_links,
+                            max_concurrent=request.get('max_concurrent'),  # None -> use DB settings
                             progress_callback=await self._create_crawl_progress_callback("crawling"),
                             start_progress=30,
                             end_progress=70,
                         )

535-541: Consider guarding against pathological link lists

Very large llms.* files can enumerate thousands of URLs. Add an upper bound (config-driven) before dispatching to batch crawl, and surface a log with the truncated count.

I can propose a small helper to read MAX_EXTRACTED_LINKS from rag_strategy (default e.g., 500) and slice extracted_links[:max_links]. Want me to draft it?
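
For illustration, such a cap might look like the sketch below. Note that MAX_EXTRACTED_LINKS, the 500 default, and the environment-variable lookup are all placeholders; the suggestion above is to read the limit from the rag_strategy settings instead:

```python
# Hypothetical cap on extracted links before dispatching the batch crawl.
import logging
import os

logger = logging.getLogger(__name__)

def cap_extracted_links(links: list[str], default_max: int = 500) -> list[str]:
    """Truncate pathological link lists and log how many were dropped."""
    max_links = int(os.environ.get("MAX_EXTRACTED_LINKS", default_max))   # placeholder setting
    if len(links) > max_links:
        logger.warning("Truncating extracted links from %d to %d", len(links), max_links)
        return links[:max_links]
    return links
```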

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between ddbe3c1 and a5b18ad.

📒 Files selected for processing (2)
  • python/src/server/services/crawling/crawling_service.py (2 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
python/src/server/services/crawling/crawling_service.py (2)
python/src/server/services/crawling/helpers/url_handler.py (5)
  • is_txt (55-69)
  • is_markdown (36-52)
  • is_link_collection_file (222-284)
  • extract_markdown_links (150-219)
  • is_binary_file (72-117)
python/src/server/services/crawling/strategies/batch.py (1)
  • crawl_batch_with_progress (31-199)
🔇 Additional comments (3)
python/src/server/services/crawling/helpers/url_handler.py (1)

35-53: Solid addition: markdown file detection

is_markdown is concise, robust (parsing path, case-insensitive), and logs with exc_info=True. LGTM.

python/src/server/services/crawling/crawling_service.py (2)

519-566: Nice orchestration: detects link-collection files, extracts, filters binaries, then batch-crawls

The staged progress updates and result stitching are clear and keep backward compatibility for plain text files. Good addition.


524-530: Align per-stage percentages with ProgressMapper to avoid UI regressions

Direct update_crawl_progress writes bypass ProgressMapper. Verify these percentages won’t regress relative to mapped stages reported elsewhere.

If you’d like, I can scan usages of ProgressMapper.map_progress to spot potential backward jumps and suggest exact numbers.

Also applies to: 544-550

…nt add the max link parameter suggestion though.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (6)
python/src/server/services/crawling/crawling_service.py (6)

502-511: Sync ProgressMapper when emitting direct "crawling" updates to keep heartbeat accurate

You’re emitting raw progress via update_crawl_progress here, but ProgressMapper isn’t updated. Heartbeats (Line 286 ff.) will continue reporting the last mapped stage (e.g., “analyzing”), which is misleading during long file fetches. Sync the mapper when you emit direct updates.

                 self.progress_state.update({
                     "status": "crawling",
                     "percentage": 10,
                     "log": "Detected text/markdown file, fetching content...",
                 })
+                # Keep heartbeat stage/progress in sync with direct emissions
+                self.progress_mapper.map_progress("crawling", 10)
                 await update_crawl_progress(self.progress_id, self.progress_state)

520-533: Emit mapped progress for 'extracting_links' stage to avoid stale heartbeat stage

Same issue as above: heartbeats will show the previous mapped stage while extracting links. Update the mapper alongside the direct emission.

                         self.progress_state.update({
                             "status": "extracting_links",
                             "percentage": 25,
                             "log": "Link collection file detected, extracting embedded links...",
                         })
+                        # Sync mapper for accurate heartbeats during extraction
+                        self.progress_mapper.map_progress("extracting_links", 25)
                         await update_crawl_progress(self.progress_id, self.progress_state)

533-541: Avoid recrawling the source file if it appears among extracted links

It’s common for llms.* files to include a self-link or canonical URL. Drop self-referential links to prevent redundant crawling and duplicate storage later.

-                    extracted_links = self.url_handler.extract_markdown_links(content, url)
+                    extracted_links = self.url_handler.extract_markdown_links(content, url)
+                    # Drop self-links to avoid redundant crawling
+                    extracted_links = [
+                        link for link in extracted_links
+                        if link.rstrip('/') != url.rstrip('/')
+                    ]

543-559: Use 'crawling_links' as the callback base status and keep mapper in sync

  • Progress updates are tagged “crawling_links” here, but the batch progress callback still uses base status "crawling". Aligning both improves UX consistency.
  • Also sync ProgressMapper so heartbeats reflect this stage correctly.
                         if self.progress_id:
                             self.progress_state.update({
                                 "status": "crawling_links",
                                 "percentage": 30,
                                 "log": f"Found {len(extracted_links)} links to crawl from {url}",
                             })
+                            # Sync mapper for accurate heartbeats while batch-crawling links
+                            self.progress_mapper.map_progress("crawling_links", 30)
                             await update_crawl_progress(self.progress_id, self.progress_state)
@@
-                        batch_results = await self.crawl_batch_with_progress(
+                        batch_results = await self.crawl_batch_with_progress(
                             extracted_links,
                             max_concurrent=request.get('max_concurrent'),  # None -> use DB settings
-                            progress_callback=await self._create_crawl_progress_callback("crawling"),
+                            progress_callback=await self._create_crawl_progress_callback("crawling_links"),
                             start_progress=30,
                             end_progress=70,
                         )

132-159: Type hint for progress callback is too strict vs actual usage

The returned callback accepts **kwargs, but the type is Callable[[str, int, str], Awaitable[None]]. This will trip mypy when callers pass extra fields (common in progress flows). Relax the type to Callable[..., Awaitable[None]].

-    ) -> Callable[[str, int, str], Awaitable[None]]:
+    ) -> Callable[..., Awaitable[None]]:
@@
-        async def callback(status: str, percentage: int, message: str, **kwargs):
+        async def callback(status: str, percentage: int, message: str, **kwargs):
             if self.progress_id:

562-566: Optional: Add in-memory URL de-duplication before merging batch results

  • DocumentStorageOperations currently does not filter duplicate URLs or source IDs when ingesting crawl_results (a search for dedup|duplicate|unique|url_to_full_document|source_id in document_storage_operations.py returned no matching logic).
  • The new crawl_type = "link_collection_with_crawled_links" value is only used for metadata and logging—no downstream code branches on specific crawl_type strings.

If you’d like to prevent duplicate upserts and noisy metrics, you can apply this optional diff:

-                        crawl_results.extend(batch_results)
+                        # Avoid duplicate URLs when combining results
+                        existing_urls = {
+                            r.get('url')
+                            for r in crawl_results
+                            if isinstance(r, dict) and r.get('url')
+                        }
+                        batch_unique = [
+                            r
+                            for r in batch_results
+                            if isinstance(r, dict) and r.get('url') not in existing_urls
+                        ]
+                        crawl_results.extend(batch_unique)
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between a5b18ad and e5024fe.

📒 Files selected for processing (2)
  • python/src/server/services/crawling/crawling_service.py (2 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/src/server/services/crawling/helpers/url_handler.py
🧰 Additional context used
🧬 Code graph analysis (1)
python/src/server/services/crawling/crawling_service.py (2)
python/src/server/services/crawling/helpers/url_handler.py (4)
  • is_txt (55-69)
  • is_markdown (36-52)
  • extract_markdown_links (152-233)
  • is_binary_file (72-117)
python/src/server/services/crawling/strategies/batch.py (1)
  • crawl_batch_with_progress (31-199)
🔇 Additional comments (1)
python/src/server/services/crawling/crawling_service.py (1)

519-569: Nice extension: end-to-end handling for llms.* link collections

The staged flow (detect → extract → filter → batch crawl → merge) aligns with the PR objective and keeps backward compatibility for plain text files.

@Chillbruhhh
Author

I do recommend merging PR #378 with this as well; it helps when you're recrawling.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
python/src/server/services/crawling/helpers/url_handler.py (1)

67-71: TXT detection hardened (case, query/fragment) — matches prior ask

This mirrors the earlier recommendation to normalize via urlparse and lowercase path. Looks good.

🧹 Nitpick comments (5)
python/src/server/services/crawling/helpers/url_handler.py (4)

35-54: Markdown detection is robust and consistent with is_txt

Lowercasing path and ignoring query/fragment is correct.
Optional: consider supporting a few common aliases ('.mkdn', '.mdown', '.rmd') if they appear in your repos.

-            return path.endswith(('.md', '.mdx', '.markdown'))
+            return path.endswith(('.md', '.mdx', '.markdown', '.mkdn', '.mdown', '.rmd'))

85-121: Make binary extension check cheaper and central; add a few common types; include exc_info in warnings

  • Reallocating a set on every call is unnecessary; a module-level tuple + endswith is faster and simpler.
  • Add a few frequently encountered non-HTMLs (fonts, design files, packages).
  • Include exc_info=True to preserve stack traces per guidelines.

Apply within-function changes:

-            # Comprehensive list of binary and non-HTML file extensions
-            binary_extensions = {
-                # Archives
-                '.zip', '.tar', '.gz', '.rar', '.7z', '.bz2', '.xz', '.tgz',
-                # Executables and installers
-                '.exe', '.dmg', '.pkg', '.deb', '.rpm', '.msi', '.app', '.appimage',
-                # Documents (non-HTML)
-                '.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx', '.odt', '.ods',
-                # Images
-                '.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp', '.ico', '.bmp', '.tiff',
-                # Audio/Video
-                '.mp3', '.mp4', '.avi', '.mov', '.wmv', '.flv', '.webm', '.mkv', '.wav', '.flac',
-                # Data files
-                '.csv', '.sql', '.db', '.sqlite',
-                # Binary data
-                '.iso', '.img', '.bin', '.dat',
-                # Development files (usually not meant to be crawled as pages)
-                '.wasm', '.pyc', '.jar', '.war', '.class', '.dll', '.so', '.dylib'
-            }
-            
-            # Check if the path ends with any binary extension
-            for ext in binary_extensions:
-                if path.endswith(ext):
-                    logger.debug(f"Skipping binary file: {url} (matched extension: {ext})")
-                    return True
-                    
-            return False
+            if path.endswith(BINARY_EXTENSIONS):
+                # Find matched suffix for logging without scanning all suffixes again
+                matched = next((ext for ext in BINARY_EXTENSIONS if path.endswith(ext)), '')
+                logger.debug(f"Skipping binary file: {url} (matched extension: {matched})")
+                return True
+            return False
         except Exception as e:
-            logger.warning(f"Error checking if URL is binary file: {e}")
+            logger.warning(f"Error checking if URL is binary file: {e}", exc_info=True)
             # In case of error, don't skip the URL (safer to attempt crawl than miss content)
             return False

Add a module-level constant (outside the selected range):

# Module-level binary suffixes to avoid per-call allocations
BINARY_EXTENSIONS = (
    # Archives
    '.zip', '.tar', '.gz', '.rar', '.7z', '.bz2', '.xz', '.tgz',
    # Executables and installers
    '.exe', '.dmg', '.pkg', '.deb', '.rpm', '.msi', '.app', '.appimage',
    # Documents (non-HTML)
    '.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx', '.odt', '.ods',
    # Images
    '.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp', '.ico', '.bmp', '.tiff', '.heic', '.heif',
    # Audio/Video
    '.mp3', '.mp4', '.avi', '.mov', '.wmv', '.flv', '.webm', '.mkv', '.wav', '.flac',
    # Data files
    '.csv', '.sql', '.db', '.sqlite',
    # Binary data
    '.iso', '.img', '.bin', '.dat',
    # Development and packages
    '.wasm', '.pyc', '.jar', '.war', '.class', '.dll', '.so', '.dylib', '.whl',
    # Fonts
    '.ttf', '.otf', '.eot', '.woff', '.woff2',
    # Design / assets
    '.psd', '.ai',
    # Mobile packages / ebooks
    '.apk', '.ipa', '.epub',
)

133-153: Broaden GitHub raw handling and accept both http/https

Current logic only transforms /blob/ URLs over https. Add support for:

  • http scheme (normalize),
  • /raw/ URLs (common in copyable links),
  • existing raw.githubusercontent.com links (passthrough).
-        # Pattern for GitHub file URLs
-        github_file_pattern = r'https://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)'
+        # Pattern for GitHub file URLs
+        github_file_pattern = r'https?://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)'
         match = re.match(github_file_pattern, url)
         if match:
             owner, repo, branch, path = match.groups()
             # Strip query parameters and fragments that break raw URLs
             path = path.split('?', 1)[0].split('#', 1)[0]
             raw_url = f'https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}'
             logger.info(f"Transformed GitHub file URL to raw: {url} -> {raw_url}")
             return raw_url
-        
+        # Pattern for GitHub raw URLs using /raw/ path form
+        github_raw_path_pattern = r'https?://github\.com/([^/]+)/([^/]+)/raw/([^/]+)/(.+)'
+        match = re.match(github_raw_path_pattern, url)
+        if match:
+            owner, repo, branch, path = match.groups()
+            path = path.split('?', 1)[0].split('#', 1)[0]
+            raw_url = f'https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}'
+            logger.info(f"Normalized GitHub raw path URL: {url} -> {raw_url}")
+            return raw_url
+
+        # Already a raw.githubusercontent.com link — passthrough
+        if re.match(r'https?://raw\.githubusercontent\.com/', url):
+            return url
+
         # Pattern for GitHub directory URLs
         github_dir_pattern = r'https://github\.com/([^/]+)/([^/]+)/tree/([^/]+)/(.+)'
         match = re.match(github_dir_pattern, url)

154-242: Precompile the regex; optionally support reference-style markdown links

  • Re-compiling the large combined pattern on every call is avoidable; precompile at module scope and reuse for a noticeable hot-path win.
  • Optional: support reference-style definitions ([text][id] with [id]: https://...) if your link lists use them.

Use a module-level pattern (outside selected range):

# Precompiled combined URL pattern
COMBINED_URL_PATTERN = re.compile(
    r'\[(?P<text>[^\]]*)\]\((?P<md>[^)]+)\)'
    r'|<\s*(?P<auto>https?://[^>\s]+)\s*>'
    r'|(?P<bare>https?://[^\s<>()\[\]"]+)'
    r'|(?P<proto>//[^\s<>()\[\]"]+)'
    r'|(?P<www>www\.[^\s<>()\[\]"]+)'
)

Then update usage here:

-            combined_pattern = re.compile(
-                r'\[(?P<text>[^\]]*)\]\((?P<md>[^)]+)\)'      # named: md
-                r'|<\s*(?P<auto>https?://[^>\s]+)\s*>'        # named: auto
-                r'|(?P<bare>https?://[^\s<>()\[\]"]+)'        # named: bare
-                r'|(?P<proto>//[^\s<>()\[\]"]+)'              # named: protocol-relative
-                r'|(?P<www>www\.[^\s<>()\[\]"]+)'             # named: www.* without scheme
-            )
+            # Use precompiled pattern for performance
+            combined_pattern = COMBINED_URL_PATTERN

If reference-style support becomes necessary, I can follow-up with a minimal, order-preserving parser that respects definitions and ignores code fences.
Additionally, consider logging at debug-level for the extraction count to reduce info-log noise on large inputs.

python/src/server/services/crawling/crawling_service.py (1)

551-612: Two improvements: keep stage names consistent during batch, and sanitize max_concurrent from request

  • The per-link progress callback currently uses base_status="crawling" and will overwrite your explicit "crawling_links" stage, causing status flicker in the UI.
  • request.get('max_concurrent') may be a string; pass an int or None to avoid type issues downstream.
@@
-                    if extracted_links:
+                    if extracted_links:
                         if self.progress_id:
                             self.progress_state.update({
-                                "status": "crawling_links",
+                                "status": "crawling_links",
                                 "percentage": 30,
                                 "log": f"Found {len(extracted_links)} links to crawl from {url}",
                             })
                             await update_crawl_progress(self.progress_id, self.progress_state)
                         
                         # Crawl the extracted links using batch crawling
                         logger.info(f"Crawling {len(extracted_links)} extracted links from {url}")
+                        # Sanitize max_concurrent from request
+                        sanitized_max_concurrent = None
+                        try:
+                            mc = request.get('max_concurrent')
+                            if mc is not None:
+                                sanitized_max_concurrent = int(mc)
+                        except Exception:
+                            logger.warning(f"Invalid max_concurrent in request: {request.get('max_concurrent')}, using defaults")
                         batch_results = await self.crawl_batch_with_progress(
                             extracted_links,
-                            max_concurrent=request.get('max_concurrent'),  # None -> use DB settings
-                            progress_callback=await self._create_crawl_progress_callback("crawling"),
+                            max_concurrent=sanitized_max_concurrent,  # None -> use DB settings
+                            progress_callback=await self._create_crawl_progress_callback("crawling_links"),
                             start_progress=30,
                             end_progress=70,
                         )

If you expect very large link lists, we can also cap and paginate them via a setting (e.g., MAX_EXTRACTED_LINKS) to prevent extremely long runs.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between e5024fe and 50c4b09.

📒 Files selected for processing (2)
  • python/src/server/services/crawling/crawling_service.py (3 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
  • python/src/server/services/crawling/crawling_service.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
  • python/src/server/services/crawling/crawling_service.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
  • python/src/server/services/crawling/crawling_service.py
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
  • python/src/server/services/crawling/crawling_service.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
  • python/src/server/services/crawling/crawling_service.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
  • python/src/server/services/crawling/crawling_service.py
🧬 Code graph analysis (1)
python/src/server/services/crawling/crawling_service.py (2)
python/src/server/services/crawling/helpers/url_handler.py (5)
  • is_txt (56-72)
  • is_markdown (36-53)
  • is_link_collection_file (244-301)
  • extract_markdown_links (155-241)
  • is_binary_file (75-120)
python/src/server/services/crawling/strategies/batch.py (1)
  • crawl_batch_with_progress (31-199)
🔇 Additional comments (2)
python/src/server/services/crawling/helpers/url_handler.py (1)

7-8: Good: normalized URL parsing and typing imports

Brings in urljoin and typing. Sets the stage for safer, normalized URL checks elsewhere.

python/src/server/services/crawling/crawling_service.py (1)

532-543: LGTM: unified text/markdown detection and progress sync

Using both is_txt and is_markdown covers the PR’s target files. Good call to sync the ProgressMapper with direct emissions to prevent UI resets.

Comment thread python/src/server/services/crawling/crawling_service.py
Comment thread python/src/server/services/crawling/helpers/url_handler.py
Chillbruhhh and others added 3 commits August 22, 2025 17:12
  Changes Made:

  1. Progress Bar Fix: Fixed llms.txt crawling progress jumping to 90% then regressing to 45% by adjusting batch crawling progress ranges (20-30% instead of 40-90%) and using consistent ProgressMapper ranges
  2. OpenAI API Compatibility: Added robust fallback logic in contextual embedding service to handle newer models (GPT-5) that require max_completion_tokens instead of max_tokens and don't support custom temperature values (see the sketch after this summary)

  Files Modified:

  - src/server/services/crawling/crawling_service.py - Fixed progress ranges
  - src/server/services/crawling/progress_mapper.py - Restored original stage ranges
  - src/server/services/embeddings/contextual_embedding_service.py - Added fallback API logic

  Result:

  - Progress bar now smoothly progresses 0-30% (crawling) → 35-80% (storage) → 100%
  - Automatic compatibility with both old (GPT-4.1-nano) and new (GPT-5-nano) OpenAI models
  - Eliminates the "max_tokens not supported" and "temperature not supported" errors
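
A minimal sketch of that fallback idea, assuming the v1 OpenAI Python client; the helper name, error handling, and token budget are illustrative and not the exact code in this PR:

import openai

async def chat_completion_with_fallback(
    client: openai.AsyncOpenAI,
    model: str,
    messages: list[dict],
    token_budget: int = 200,
) -> str:
    """Try legacy parameters first; retry with newer ones if the model rejects them."""
    try:
        response = await client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=token_budget,
            temperature=0.3,
        )
    except openai.BadRequestError:
        # Newer models (e.g. the GPT-5 family mentioned above) expect
        # max_completion_tokens and reject custom temperature values.
        response = await client.chat.completions.create(
            model=model,
            messages=messages,
            max_completion_tokens=token_budget,
        )
    return response.choices[0].message.content or ""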

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
python/src/server/services/storage/document_storage_service.py (3)

93-101: Bug: fallback delete batch slices a fixed 10 URLs, skipping others when fallback_batch_size ≠ 10

The loop steps by fallback_batch_size but slices with i + 10. This will skip deletions for some URLs whenever fallback_batch_size > 10 (and can double-delete if < 10). Fix the slice to use fallback_batch_size.

Apply this diff:

-            fallback_batch_size = max(10, delete_batch_size // 5)
+            fallback_batch_size = max(10, delete_batch_size // 5)
...
-                batch_urls = unique_urls[i : i + 10]
+                batch_urls = unique_urls[i : i + fallback_batch_size]

259-293: Wrong/misleading mapping of embeddings back to originals; O(n²) scan and duplicate-text misassignment

The current “find by text” scan will mis-map when two chunks have identical text (e.g., boilerplate/footer), causing duplicate use of the first index and dropping the later one. It’s also O(n²).

Use a stable one-to-many map from text → indices and consume indices as they are matched.

Apply this diff:

-            # Prepare batch data - only for successful embeddings
-            batch_data = []
-            # Map successful texts back to their original indices
-            for j, (embedding, text) in enumerate(
-                zip(batch_embeddings, successful_texts, strict=False)
-            ):
-                # Find the original index of this text
-                orig_idx = None
-                for idx, orig_text in enumerate(contextual_contents):
-                    if orig_text == text:
-                        orig_idx = idx
-                        break
-
-                if orig_idx is None:
-                    search_logger.warning("Could not map embedding back to original text")
-                    continue
-
-                j = orig_idx  # Use original index for metadata lookup
+            # Prepare batch data - only for successful embeddings
+            batch_data = []
+            # Build a stable mapping to handle duplicate texts deterministically
+            text_to_indices: dict[str, list[int]] = {}
+            for idx, orig_text in enumerate(contextual_contents):
+                text_to_indices.setdefault(orig_text, []).append(idx)
+
+            for _, (embedding, text) in enumerate(zip(batch_embeddings, successful_texts, strict=False)):
+                # Consume the next available index for this text (handles duplicates correctly)
+                idx_list = text_to_indices.get(text)
+                if not idx_list:
+                    search_logger.warning("Could not map embedding back to original text")
+                    continue
+                orig_idx = idx_list.pop(0)
                 # Use source_id from metadata if available, otherwise extract from URL
-                if batch_metadatas[j].get("source_id"):
-                    source_id = batch_metadatas[j]["source_id"]
+                if batch_metadatas[orig_idx].get("source_id"):
+                    source_id = batch_metadatas[orig_idx]["source_id"]
                 else:
                     # Fallback: Extract source_id from URL
-                    parsed_url = urlparse(batch_urls[j])
+                    parsed_url = urlparse(batch_urls[orig_idx])
                     source_id = parsed_url.netloc or parsed_url.path
-
                 data = {
-                    "url": batch_urls[j],
-                    "chunk_number": batch_chunk_numbers[j],
-                    "content": text,  # Use the successful text
-                    "metadata": {"chunk_size": len(text), **batch_metadatas[j]},
+                    "url": batch_urls[orig_idx],
+                    "chunk_number": batch_chunk_numbers[orig_idx],
+                    "content": text,  # Use the successful text
+                    # Ensure chunk_size reflects actual stored content; override any existing key
+                    "metadata": {**batch_metadatas[orig_idx], "chunk_size": len(text)},
                     "source_id": source_id,
                     "embedding": embedding,  # Use the successful embedding
                 }
                 batch_data.append(data)

69-89: Prevent potential data loss by avoiding upfront deletion

The current implementation in python/src/server/services/storage/document_storage_service.py (lines 79–83) deletes all existing chunks for each URL before performing any upserts. Since only successfully embedded chunks are re-inserted, any failure in the embedding step will permanently drop previously valid chunks for that URL.

To address this:

  • Instead of wholesale deletion at the start, defer removals until after you’ve confirmed which (url, chunk_number) pairs embed successfully.
  • Alternatively, delete only the specific (url, chunk_number) keys you’re about to overwrite—relying on upsert for updates—then, once a full URL batch succeeds, run a separate cleanup pass to prune truly stale chunks.
  • You can gate the safer deletion strategy behind a feature flag (e.g., RAG_STORAGE_SAFE_DELETE) for incremental rollout.

Please refactor the batch‐delete logic to follow one of these safer approaches, ensuring that transient failures never cause irreversible data loss.
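
A minimal sketch of the second option, assuming the Supabase-style client already used in document_storage_service.py, the archon_crawled_pages table with its (url, chunk_number) constraint, and the hypothetical RAG_STORAGE_SAFE_DELETE flag; the exact filter-builder syntax depends on the client version:

def prune_stale_chunks(client, url: str, kept_chunk_numbers: list[int], rag_settings: dict) -> None:
    """After a URL's chunks were upserted successfully, remove only the chunk_numbers
    the latest crawl no longer produced. Upsert already handles overwrites, so no
    upfront wholesale delete is needed. Sketch only."""
    if rag_settings.get("RAG_STORAGE_SAFE_DELETE", "false").lower() != "true":
        return  # feature-flagged rollout; keep existing behavior by default
    query = client.table("archon_crawled_pages").delete().eq("url", url)
    if kept_chunk_numbers:
        # "not in" filter so rows we just upserted are preserved
        query = query.not_.in_("chunk_number", kept_chunk_numbers)
    query.execute()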

🧹 Nitpick comments (8)
python/src/server/services/storage/document_storage_service.py (4)

299-340: Improve error logs with stack traces and clearer context; keep backoff but log with exc_info

Catching broad Exception is acceptable here due to the retry/last-resort flow, but logs should include the full stack trace and context (batch id) for debugging.

Apply this diff:

-                except Exception as e:
-                    if retry < max_retries - 1:
-                        search_logger.warning(
-                            f"Error inserting batch (attempt {retry + 1}/{max_retries}): {e}"
-                        )
+                except Exception as e:
+                    if retry < max_retries - 1:
+                        search_logger.warning(
+                            f"Error upserting batch {batch_num} (attempt {retry + 1}/{max_retries})",
+                            exc_info=True,
+                        )
                         await asyncio.sleep(retry_delay)
                         retry_delay *= 2  # Exponential backoff
                     else:
-                        search_logger.error(
-                            f"Failed to insert batch after {max_retries} attempts: {e}"
-                        )
+                        search_logger.error(
+                            f"Failed to upsert batch {batch_num} after {max_retries} attempts",
+                            exc_info=True,
+                        )

342-359: Per-record fallback: include stack traces and record context in logs

Add exc_info=True and include url and chunk_number to aid debugging and triage.

Apply this diff:

-                            except Exception as individual_error:
-                                search_logger.error(
-                                    f"Failed individual insert for {record['url']}: {individual_error}"
-                                )
+                            except Exception as individual_error:
+                                search_logger.error(
+                                    f"Failed individual upsert for url={record.get('url')} chunk={record.get('chunk_number')}",
+                                    exc_info=True,
+                                )

61-68: Unused variable: enable_parallel is read but not used

enable_parallel is computed but never used in this function. Either wire it into behavior or remove to reduce confusion.

Apply this diff to remove the dead assignment if not needed:

-            enable_parallel = rag_settings.get("ENABLE_PARALLEL_BATCHES", "true").lower() == "true"
+            _ = rag_settings.get("ENABLE_PARALLEL_BATCHES", "true")  # reserved for future use
...
-            enable_parallel = True
+            _ = True

Or fully remove if not planned.


114-124: Duplicate import of credential_service inside function

credential_service is already imported at the module top (Line 13). The inner re-import is redundant.

Apply this diff:

-        from ..credential_service import credential_service
+        # credential_service already imported at module scope
python/src/server/services/embeddings/contextual_embedding_service.py (4)

112-125: Redundant model retrieval (unused variable); remove to avoid confusion

model_choice is fetched and logged but not used in this function (you later call _get_model_choice). Drop the unused retrieval or reuse it.

Apply this diff:

-    try:
-        from ...services.credential_service import credential_service
-
-        model_choice = await credential_service.get_credential("MODEL_CHOICE", "gpt-4.1-nano")
-    except Exception as e:
-        # Fallback to environment variable or default
-        search_logger.warning(
-            f"Failed to get MODEL_CHOICE from credential service: {e}, using fallback"
-        )
-        model_choice = os.getenv("MODEL_CHOICE", "gpt-4.1-nano")
-
-    search_logger.debug(f"Using MODEL_CHOICE: {model_choice}")
+    # Model is resolved by _get_model_choice(); no need to prefetch another setting here.

250-256: Cap the per-batch token budget to a sane maximum

token_limit = 250 * len(chunks) can exceed model limits and cause avoidable failures for large batches. Cap it or make it configurable (e.g., from credentials), and let the fallback split when needed.

Apply this diff:

-            token_limit = 250 * len(chunks)
+            token_limit = min(250 * len(chunks), 4000)  # TODO: make model-aware or configurable

258-283: Fragile parsing of multi-line contexts; use regex with chunk sections

Splitting by newline and matching CHUNK N: on a single line will drop multi-line contexts or lines that wrap. Prefer a regex that captures sections between markers.

Here’s a more robust sketch:

-            lines = response_text.strip().split("\n")
-            chunk_contexts = {}
-            for line in lines:
-                if line.strip().startswith("CHUNK"):
-                    parts = line.split(":", 1)
-                    if len(parts) == 2:
-                        chunk_num = int(parts[0].strip().split()[1]) - 1
-                        context = parts[1].strip()
-                        chunk_contexts[chunk_num] = context
+            import re
+            pattern = re.compile(r'^\s*CHUNK\s+(\d+):\s*(.*)$', re.MULTILINE)
+            chunk_contexts = {}
+            matches = list(pattern.finditer(response_text))
+            for idx, m in enumerate(matches):
+                chunk_idx = int(m.group(1)) - 1
+                start = m.end()
+                end = matches[idx + 1].start() if idx + 1 < len(matches) else len(response_text)
+                section = (m.group(2) + "\n" + response_text[start:end]).strip()
+                chunk_contexts[chunk_idx] = section

285-303: Include stack traces on rate limit and generic errors; align with observability guideline

Add exc_info=True so we can triage production failures quickly.

Apply this diff:

-    except openai.RateLimitError as e:
+    except openai.RateLimitError as e:
         if "insufficient_quota" in str(e):
-            search_logger.warning(f"⚠️ QUOTA EXHAUSTED in contextual embeddings: {e}")
+            search_logger.warning("⚠️ QUOTA EXHAUSTED in contextual embeddings", exc_info=True)
             search_logger.warning(
                 "OpenAI quota exhausted - proceeding without contextual embeddings"
             )
         else:
-            search_logger.warning(f"Rate limit hit in contextual embeddings batch: {e}")
+            search_logger.warning("Rate limit hit in contextual embeddings batch", exc_info=True)
             search_logger.warning(
                 "Rate limit hit - proceeding without contextual embeddings for this batch"
             )
...
-    except Exception as e:
-        search_logger.error(f"Error in contextual embedding batch: {e}")
+    except Exception as e:
+        search_logger.error("Error in contextual embedding batch", exc_info=True)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 50c4b09 and ad7f02f.

📒 Files selected for processing (3)
  • python/src/server/services/crawling/crawling_service.py (4 hunks)
  • python/src/server/services/embeddings/contextual_embedding_service.py (3 hunks)
  • python/src/server/services/storage/document_storage_service.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/src/server/services/crawling/crawling_service.py
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/services/embeddings/contextual_embedding_service.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/services/embeddings/contextual_embedding_service.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/services/embeddings/contextual_embedding_service.py
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/services/embeddings/contextual_embedding_service.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/services/embeddings/contextual_embedding_service.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/services/embeddings/contextual_embedding_service.py
🧠 Learnings (2)
📚 Learning: 2025-08-20T19:38:04.050Z
Learnt from: Chillbruhhh
PR: coleam00/Archon#378
File: python/src/server/services/storage/document_storage_service.py:304-306
Timestamp: 2025-08-20T19:38:04.050Z
Learning: The archon_crawled_pages table in the Archon project has a table-level unique constraint on (url, chunk_number) defined inline in the CREATE TABLE statement in migration/complete_setup.sql at line 202, which allows upsert operations with on_conflict="url,chunk_number" to work properly without requiring additional migrations.

Applied to files:

  • python/src/server/services/storage/document_storage_service.py
📚 Learning: 2025-08-20T19:38:04.050Z
Learnt from: Chillbruhhh
PR: coleam00/Archon#378
File: python/src/server/services/storage/document_storage_service.py:304-306
Timestamp: 2025-08-20T19:38:04.050Z
Learning: The archon_crawled_pages table in the Archon project has a table-level unique constraint on (url, chunk_number) defined inline in the CREATE TABLE statement in migration/complete_setup.sql at line 202, which allows upsert operations with on_conflict="url,chunk_number" to work properly.

Applied to files:

  • python/src/server/services/storage/document_storage_service.py
🔇 Additional comments (2)
python/src/server/services/storage/document_storage_service.py (1)

304-306: Upsert switch is correct and aligns with DB constraint on (url, chunk_number)

Changing to upsert(..., on_conflict="url,chunk_number") prevents duplicate-key failures and updates existing rows in-place. This matches the table-level unique constraint documented in migrations.

python/src/server/services/embeddings/contextual_embedding_service.py (1)

147-157: LGTM: messages payload and centralized fallback invocation

The move to explicit messages + centralized fallback is clean and keeps the per-call code small. Good separation of concerns.

Comment thread python/src/server/services/embeddings/contextual_embedding_service.py Outdated
…n to here, definitely recommend looking at that though since gpt-5-nano is considered a reasoning model and doesn't use max_tokens, requires a different output. also removed my upsert fix from documentstorage since that's not part of this exact issue and i have another PR open for it. checked with code rabbit in my ide, no issues, no nitpicks. should be good? might flag me for the UPSERT logic not being in here. oh well, has nothing to do with this PR, it was submitted in the last revision by mistake. everything's tested and good to go!
@Chillbruhhh
Author

ready for merge

@coleam00
Owner

Thank you for this @Chillbruhhh! I tested just now and this works beautifully for llms.txt files that have a lot of links within.

However, I tested with an llms-full.txt (https://ai.pydantic.dev/llms-full.txt), and this used to be really fast since it was just a single page, but now with this PR it found 492 URLs and it's going through all of those when it shouldn't have to. It's just a single page.

Also, the progress for crawling pages is always at 100%:
PydanticAIDocs: 100%

If you address these two things I think the PR would be ready! Probably good to rebase from main too.

@Chillbruhhh
Author

@coleam00 sorry about that, fixing it now

Chillbruhhh added 2 commits August 30, 2025 23:28
…n crawling llms-full.txt. fixed the 100% crawl URL display when multiple URLs are present and the crawl hasn't finished. also fixed a styling issue in CrawlingProgressCard.tsx: when batching code examples, the batching progress bar would sometimes glitch out of the UI; fixed it so it won't do that now.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/src/server/services/crawling/crawling_service.py (1)

329-331: Root cause of “progress shows 100% immediately”: don’t start ‘starting’ at 100

Use 0–1% for “starting” so UI doesn’t jump to 100% at kickoff.

-            await update_mapped_progress(
-                "starting", 100, f"Starting crawl of {url}", currentUrl=url
-            )
+            await update_mapped_progress(
+                "starting", 1, f"Starting crawl of {url}", currentUrl=url
+            )
♻️ Duplicate comments (2)
python/src/server/services/crawling/helpers/url_handler.py (1)

288-296: Count relative links in density by passing base_url

Without base_url, relative links aren’t resolved and undercount density.

-                extracted_links = URLHandler.extract_markdown_links(content)
+                extracted_links = URLHandler.extract_markdown_links(content, url)
python/src/server/services/crawling/crawling_service.py (1)

490-519: Canonicalize self-link comparison (host case, default ports); add exc_info

Avoid false negatives between http://example.com and http://EXAMPLE.com:80, and improve logs.

-        try:
-            from urllib.parse import urlparse
-            
-            # Parse both URLs to compare their core components
-            link_parsed = urlparse(link)
-            base_parsed = urlparse(base_url)
-            
-            # Compare scheme, netloc, and path (ignoring query and fragment)
-            link_core = f"{link_parsed.scheme}://{link_parsed.netloc}{link_parsed.path.rstrip('/')}"
-            base_core = f"{base_parsed.scheme}://{base_parsed.netloc}{base_parsed.path.rstrip('/')}"
-            
-            return link_core == base_core
+        try:
+            from urllib.parse import urlparse
+            def _core(u: str) -> str:
+                p = urlparse(u)
+                scheme = (p.scheme or 'http').lower()
+                host = (p.hostname or '').lower()
+                port = p.port
+                if (scheme, port) in (('http', 80), ('https', 443)) or port is None:
+                    port_part = ''
+                else:
+                    port_part = f":{port}"
+                path = p.path.rstrip('/')
+                return f"{scheme}://{host}{port_part}{path}"
+            return _core(link) == _core(base_url)
         except Exception as e:
-            logger.warning(f"Error checking if link is self-referential: {e}")
+            logger.warning(f"Error checking if link is self-referential: {e}", exc_info=True)
             # Fallback to simple string comparison
             return link.rstrip('/') == base_url.rstrip('/')
🧹 Nitpick comments (4)
archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx (3)

300-303: Unify snake_case vs camelCase batch fields

UI mixes completedBatches/totalBatches and completed_batches/total_batches, causing mismatched texts/bars. Support both or standardize to one.

Apply:

-            step.message = `Batch ${progressData.completedBatches}/${progressData.totalBatches} - Saving to database...`;
+            const done = progressData.completedBatches ?? progressData.completed_batches ?? 0;
+            const total = progressData.totalBatches ?? progressData.total_batches ?? 0;
+            step.message = total ? `Batch ${done}/${total} - Saving to database...` : 'Saving to database...';
-                              {progressData.completed_batches || 0}/{progressData.total_batches || 0}
+                              {(progressData.completedBatches ?? progressData.completed_batches ?? 0)}/
+                              {(progressData.totalBatches ?? progressData.total_batches ?? 0)}

Also applies to: 721-722


76-88: Don’t mutate props (progressData.status) in-place

Directly setting progressData.status breaks React data flow; rely on onStop to update state in parent.

-      // Optimistic UI update - immediately show stopping status
-      progressData.status = 'stopping';
+      // Ask parent to reflect stopping state; avoid mutating props
+      // Parent can optimistically set status to 'stopping'

23-23: Consider switching to useCrawlProgressPolling

Per team learning, prefer the existing polling hook with ETag/visibility handling over crawlProgressService for fewer renders and better perf.

python/src/server/services/crawling/helpers/url_handler.py (1)

138-153: GitHub blob→raw: good, but accept http and subdomains

Minor: the regex only matches https://github.com; consider https? and www. to be resilient to odd redirects.

- github_file_pattern = r'https://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)'
+ github_file_pattern = r'https?://(?:www\.)?github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)'
- github_dir_pattern = r'https://github\.com/([^/]+)/([^/]+)/tree/([^/]+)/(.+)'
+ github_dir_pattern = r'https?://(?:www\.)?github\.com/([^/]+)/([^/]+)/tree/([^/]+)/(.+)'
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8265c8b and 1c40d28.

📒 Files selected for processing (3)
  • archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx (1 hunks)
  • python/src/server/services/crawling/crawling_service.py (4 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx
  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
archon-ui-main/src/components/**

📄 CodeRabbit inference engine (CLAUDE.md)

Place reusable UI components in archon-ui-main/src/components/

Files:

  • archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
🧠 Learnings (1)
📚 Learning: 2025-08-28T13:07:24.766Z
Learnt from: Wirasm
PR: coleam00/Archon#514
File: archon-ui-main/src/services/crawlProgressService.ts:35-39
Timestamp: 2025-08-28T13:07:24.766Z
Learning: The crawlProgressService.ts in the Archon codebase should be deprecated in favor of the existing useCrawlProgressPolling hook from usePolling.ts, which already includes ETag support, 304 handling, tab visibility detection, and proper React lifecycle integration. This consolidation reduces code duplication and improves performance.

Applied to files:

  • archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx
🧬 Code graph analysis (1)
python/src/server/services/crawling/crawling_service.py (3)
python/src/server/config/logfire_config.py (1)
  • safe_logfire_info (223-235)
python/src/server/services/crawling/helpers/url_handler.py (4)
  • is_txt (56-72)
  • is_markdown (36-53)
  • extract_markdown_links (155-241)
  • is_binary_file (75-120)
python/src/server/services/crawling/strategies/batch.py (1)
  • crawl_batch_with_progress (30-198)
🔇 Additional comments (6)
archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx (2)

727-727: Good fix: prevent flex overflow on batch bar

Adding min-w-0 avoids layout glitches in tight flex parents.


732-733: Correct: clamp batch progress to [0,100]

This fixes the “always 100%” rendering when totals are 0/undefined during early updates.

python/src/server/services/crawling/helpers/url_handler.py (3)

35-53: Markdown detection is robust

Lowercasing and parsing path correctly handle query/fragment cases.


66-72: TXT detection hardened correctly

Parsing and lowercasing path resolves .TXT and ?raw=1 cases.


154-241: Extractor covers real-world patterns well

Comprehensive combined regex, cleaning, www// normalization, relative resolution, and stable dedupe look good.

python/src/server/services/crawling/crawling_service.py (1)

364-373: ProgressMapper usage in document_storage is correct

Mapping stage progress to overall percentage prevents UI regressions here.

Comment thread python/src/server/services/crawling/crawling_service.py Outdated
Comment thread python/src/server/services/crawling/helpers/url_handler.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
python/src/server/services/crawling/helpers/url_handler.py (2)

195-206: GitHub raw URL: keep query/fragment stripping; remove markers

Preserve the path sanitization to avoid broken raw links, and drop conflict markers.

-<<<<<<< HEAD
-            # Strip query parameters and fragments that break raw URLs
-            path = path.split('?', 1)[0].split('#', 1)[0]
-            raw_url = f'https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}'
-=======
-            raw_url = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"
->>>>>>> origin/main
+            # Strip query parameters and fragments that break raw URLs
+            path = path.split("?", 1)[0].split("#", 1)[0]
+            raw_url = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"

221-232: Broken/duplicated generate_unique_source_id; large conflict block needs a unified resolution

One branch returns url (incorrect), the other has the canonical, hashed implementation plus extract_display_name. Unify: keep the hashed implementation, keep extract_display_name, and keep your new extract_markdown_links/is_link_collection_file as separate methods (no duplication). Remove all conflict markers.

Here’s a consolidated replacement for the entire conflict block to end-of-file (keep ordering as shown and ensure only one definition of each method exists):

-<<<<<<< HEAD
-        return url
-    
-    @staticmethod
-    def extract_markdown_links(content: str, base_url: Optional[str] = None) -> List[str]:
-        ...
-    @staticmethod
-    def is_link_collection_file(url: str, content: Optional[str] = None) -> bool:
-        ...
-=======
-        Uses 16-char SHA256 prefix (64 bits) which provides
-        ~18 quintillion unique values. Collision probability
-        is negligible for realistic usage (<1M sources).
-        ...
-    def extract_display_name(url: str) -> str:
-        ...
->>>>>>> origin/main
+        Uses 16-char SHA256 prefix (64 bits) which provides
+        ~18 quintillion unique values. Collision probability
+        is negligible for realistic usage (<1M sources).
+        ...
+        try:
+            from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode
+            ...
+            return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
+        except Exception as e:
+            ...
+            return hashlib.sha256(fallback.encode("utf-8")).hexdigest()[:16]
+
+    @staticmethod
+    def extract_markdown_links(content: str, base_url: Optional[str] = None) -> List[str]:
+        ...
+
+    @staticmethod
+    def is_link_collection_file(url: str, content: Optional[str] = None) -> bool:
+        ...
+
+    @staticmethod
+    def extract_display_name(url: str) -> str:
+        ...

Note: Replace the ... with your current bodies (with the fixes from the next two comments).

Also applies to: 388-597

♻️ Duplicate comments (2)
python/src/server/services/crawling/helpers/url_handler.py (2)

366-381: Content-based detection: exclude “full” variants to prevent 492-URL crawl regression

Short-circuit content analysis when filename contains “full” (case-insensitive) to retain single-page behavior for llms-full.txt.

-            if content:
-                # Reuse extractor to avoid regex divergence and maintain consistency
-                extracted_links = URLHandler.extract_markdown_links(content)
+            if content:
+                # Preserve single-page behavior for *full* variants
+                if "full" in filename:
+                    logger.info(f"Skipping content-based link-collection detection for full-content file: {filename}")
+                    return False
+                # Reuse extractor to avoid regex divergence and maintain consistency
+                extracted_links = URLHandler.extract_markdown_links(content, url)

368-371: Pass base_url so relatives count toward density

Relative links won’t be resolved or counted without base_url; this under-detects MD link lists.

-                extracted_links = URLHandler.extract_markdown_links(content)
+                extracted_links = URLHandler.extract_markdown_links(content, url)
🧹 Nitpick comments (4)
python/src/server/services/crawling/helpers/url_handler.py (4)

255-261: Precompile the combined regex once to reduce per-call overhead

Move the pattern to a module-level constant; reuse inside extract_markdown_links.

-            combined_pattern = re.compile(
+            # at module scope:
+# COMBINED_URL_PATTERN = re.compile(
+#     r'\[(?P<text>[^\]]*)\]\((?P<md>[^)]+)\)'
+#     r'|<\s*(?P<auto>https?://[^>\s]+)\s*>'
+#     r'|(?P<bare>https?://[^\s<>()\[\]"]+)'
+#     r'|(?P<proto>//[^\s<>()\[\]"]+)'
+#     r'|(?P<www>www\.[^\s<>()\[\]"]+)'
+# )
+            combined_pattern = COMBINED_URL_PATTERN

315-316: Reduce log noise for large files

Downgrade to debug or include a cap/truncation to avoid high-cardinality info logs.

-            logger.info(f"Extracted {len(unique_urls)} unique links from content")
+            logger.debug(f"Extracted {len(unique_urls)} unique links from content")

171-177: Micro-optimization: use endswith(tuple) instead of looping set

Faster and simpler; keeps constant factors down in hot paths.

-            for ext in binary_extensions:
-                if path.endswith(ext):
-                    logger.debug(f"Skipping binary file: {url} (matched extension: {ext})")
-                    return True
+            if path.endswith(tuple(binary_extensions)):
+                logger.debug(f"Skipping binary file by extension match: {url}")
+                return True

195-205: Minor: accept both http and https for GitHub match (defensive)

GitHub serves https, but being lenient costs nothing.

-        github_file_pattern = r"https://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)"
+        github_file_pattern = r"https?://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1c40d28 and f8303bd.

📒 Files selected for processing (2)
  • python/src/server/services/crawling/crawling_service.py (4 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/src/server/services/crawling/crawling_service.py
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
🔇 Additional comments (1)
python/src/server/services/crawling/helpers/url_handler.py (1)

9-10: Type hints/imports look good

New imports are accurate and used correctly.

Comment thread python/src/server/services/crawling/helpers/url_handler.py Outdated
Comment thread python/src/server/services/crawling/helpers/url_handler.py
@Chillbruhhh
Author

Chillbruhhh commented Aug 31, 2025

@coleam00

ready for merge! fixed the llms-full.txt issue. fixed the UI web crawling progress display! rebased it!

Screenshot 2025-08-30 213750

also made a change to the UI progress when storing batches; noticed that sometimes (not all the time) the batch progress bar would break
Screenshot 2025-08-30 215858

Prevented this and fixed the other issues! checked with code rabbit in my ide as well and it said good to go, no nitpicks, nothing. curious to see what it will say here in the PR.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
python/src/server/services/crawling/crawling_service.py (1)

608-625: Use mapped percentages in sitemap/recursive branches (consistency with ProgressMapper)

These branches still emit raw 10/15% values. Map them via ProgressMapper to keep UI consistent.

Apply:

             if self.progress_id:
-                self.progress_state.update({
+                overall = self.progress_mapper.map_progress("crawling", 10)
+                self.progress_state.update({
                     "status": "crawling",
-                    "percentage": 10,
+                    "percentage": overall,
                     "log": "Detected sitemap, parsing URLs...",
                 })
                 await update_crawl_progress(self.progress_id, self.progress_state)
 ...
             if sitemap_urls:
                 # Emit progress before starting batch crawl
                 if self.progress_id:
-                    self.progress_state.update({
+                    overall = self.progress_mapper.map_progress("crawling", 15)
+                    self.progress_state.update({
                         "status": "crawling",
-                        "percentage": 15,
+                        "percentage": overall,
                         "log": f"Starting batch crawl of {len(sitemap_urls)} URLs...",
                     })
 ...
             if self.progress_id:
-                self.progress_state.update({
+                overall = self.progress_mapper.map_progress("crawling", 10)
+                self.progress_state.update({
                     "status": "crawling",
-                    "percentage": 10,
+                    "percentage": overall,
                     "log": f"Starting recursive crawl with max depth {request.get('max_depth', 1)}...",
                 })

Also applies to: 637-643

🧹 Nitpick comments (5)
python/src/server/services/storage/code_storage_service.py (1)

967-976: Optional: include generic batch fields in the final event for backward compatibility

Final completion omits the generic batch_number/total_batches that you send per-batch. Consider including them here too to avoid any downstream regressions relying on those fields.

Apply:

         await progress_callback({
             "status": "code_storage",
             "percentage": 100,
             "log": f"Code storage completed. Stored {total_items} code examples.",
             "total_items": total_items,
             # Keep final batch info for code storage completion
             "code_total_batches": (total_items + batch_size - 1) // batch_size,
             "code_current_batch": (total_items + batch_size - 1) // batch_size,
+            # Backward-compat generic fields
+            "batch_number": (total_items + batch_size - 1) // batch_size,
+            "total_batches": (total_items + batch_size - 1) // batch_size,
         })
archon-ui-main/src/services/crawlProgressService.ts (1)

77-83: Type surface aligns with backend

Adding document_* and code_* optional fields matches the server payloads and CrawlingProgressCard usage.

Given the prior learning to deprecate this service in favor of useCrawlProgressPolling, consider moving these type updates to the hook and routing consumers there to reduce duplication.

archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx (1)

300-317: Guard for undefined vs falsy when showing code batch message

code_current_batch can be 0/1; using a truthy check can hide valid values. Match the doc logic by checking for undefined.

Apply:

-                if (progressData.code_current_batch && progressData.code_total_batches) {
+                if (progressData.code_current_batch !== undefined && progressData.code_total_batches) {
                   step.message = `Batch ${progressData.code_current_batch}/${progressData.code_total_batches} - Extracting code blocks...`;
                 } else {
                   step.message = 'Extracting code blocks...';
                 }
python/src/server/services/crawling/crawling_service.py (2)

495-528: Self-link normalization looks solid; consider “index.html/README” canonicalization

The scheme/host/default-port normalization is correct. Optionally treat trailing “/index.html” or “/README.md” as equivalent to “/” to catch common homepage aliases.

Example:

             def _core(u: str) -> str:
                 p = urlparse(u)
                 ...
-                path = p.path.rstrip("/")
+                path = p.path.rstrip("/")
+                if path.endswith("/index.html"):
+                    path = path[:-11]  # drop '/index.html'
+                if path.lower().endswith("/readme.md"):
+                    path = path[:-10]
                 return f"{scheme}://{host}{port_part}{path}"

559-604: Link-collection path looks correct; “full” files now excluded via helper

This should prevent the llms-full.txt over-crawl. Optional: add a configurable hard cap (e.g., RAG setting) on extracted links to avoid runaway batches on massive lists.

-                    if extracted_links:
+                    if extracted_links:
+                        # Optional safety cap (configurable)
+                        try:
+                            from ..credential_service import credential_service
+                            cap = int((await credential_service.get_credential("MAX_EXTRACTED_LINKS", "200", decrypt=True)) or "200")
+                        except Exception:
+                            cap = 200
+                        if len(extracted_links) > cap:
+                            logger.info(f"Capping extracted links from {len(extracted_links)} to {cap}")
+                            extracted_links = extracted_links[:cap]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a43b1df and 5df242e.

📒 Files selected for processing (6)
  • archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx (4 hunks)
  • archon-ui-main/src/services/crawlProgressService.ts (1 hunks)
  • python/src/server/services/crawling/crawling_service.py (3 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (5 hunks)
  • python/src/server/services/storage/code_storage_service.py (2 hunks)
  • python/src/server/services/storage/document_storage_service.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/src/server/services/crawling/helpers/url_handler.py
🧰 Additional context used
📓 Path-based instructions (8)
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/services/storage/code_storage_service.py
  • python/src/server/services/crawling/crawling_service.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/services/storage/code_storage_service.py
  • python/src/server/services/crawling/crawling_service.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/services/storage/code_storage_service.py
  • python/src/server/services/crawling/crawling_service.py
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/services/storage/code_storage_service.py
  • python/src/server/services/crawling/crawling_service.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • python/src/server/services/storage/document_storage_service.py
  • archon-ui-main/src/services/crawlProgressService.ts
  • python/src/server/services/storage/code_storage_service.py
  • python/src/server/services/crawling/crawling_service.py
  • archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/services/storage/code_storage_service.py
  • python/src/server/services/crawling/crawling_service.py
archon-ui-main/src/services/**

📄 CodeRabbit inference engine (CLAUDE.md)

Place API communication and business logic in archon-ui-main/src/services/

Files:

  • archon-ui-main/src/services/crawlProgressService.ts
archon-ui-main/src/components/**

📄 CodeRabbit inference engine (CLAUDE.md)

Place reusable UI components in archon-ui-main/src/components/

Files:

  • archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx
🧠 Learnings (1)
📚 Learning: 2025-08-28T13:07:24.766Z
Learnt from: Wirasm
PR: coleam00/Archon#514
File: archon-ui-main/src/services/crawlProgressService.ts:35-39
Timestamp: 2025-08-28T13:07:24.766Z
Learning: The crawlProgressService.ts in the Archon codebase should be deprecated in favor of the existing useCrawlProgressPolling hook from usePolling.ts, which already includes ETag support, 304 handling, tab visibility detection, and proper React lifecycle integration. This consolidation reduces code duplication and improves performance.

Applied to files:

  • archon-ui-main/src/services/crawlProgressService.ts
  • archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx
🧬 Code graph analysis (1)
python/src/server/services/crawling/crawling_service.py (3)
python/src/server/config/logfire_config.py (1)
  • safe_logfire_info (223-235)
python/src/server/services/crawling/helpers/url_handler.py (5)
  • is_txt (58-74)
  • is_markdown (38-55)
  • is_link_collection_file (368-435)
  • extract_markdown_links (279-365)
  • is_binary_file (77-174)
python/src/server/services/crawling/strategies/batch.py (1)
  • crawl_batch_with_progress (30-198)
🪛 Biome (2.1.2)
archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx

[error] 424-425: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.

The declaration is defined in this switch clause:

Safe fix: Wrap the declaration in a block.

(lint/correctness/noSwitchDeclarations)


[error] 425-426: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.

The declaration is defined in this switch clause:

Safe fix: Wrap the declaration in a block.

(lint/correctness/noSwitchDeclarations)


[error] 436-437: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.

The declaration is defined in this switch clause:

Safe fix: Wrap the declaration in a block.

(lint/correctness/noSwitchDeclarations)


[error] 437-438: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.

The declaration is defined in this switch clause:

Safe fix: Wrap the declaration in a block.

(lint/correctness/noSwitchDeclarations)

🔇 Additional comments (4)
python/src/server/services/storage/code_storage_service.py (1)

958-964: Good: stage-scoped progress keys for code storage

Adding code_current_batch/code_total_batches cleanly isolates code-stage progress and keeps generic fields for BC. Looks consistent with the UI consumption.
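
For reference, a rough sketch of what stage-scoped fields emitted alongside the generic ones can look like. The tracker interface mirrors the progress_tracker.update(...) calls quoted elsewhere in this review, but the field names here are only assumptions based on this comment:

from typing import Any

async def report_code_batch_progress(tracker: Any, batch_index: int, total_batches: int) -> None:
    """Emit code-stage progress without clobbering fields used by other stages."""
    await tracker.update(
        status="code_extraction",
        # Stage-scoped keys: only the code-storage stage writes these
        code_current_batch=batch_index,
        code_total_batches=total_batches,
        # Generic keys kept for backwards compatibility with existing UI consumers
        current_batch=batch_index,
        total_batches=total_batches,
        log=f"Storing code examples: batch {batch_index}/{total_batches}",
    )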

python/src/server/services/storage/document_storage_service.py (1)

335-345: Good: stage-specific doc counters + BC fields

document_completed_batches/document_total_batches/document_current_batch are emitted alongside the legacy fields. This should fix “always 100%” UI symptoms while staying compatible.

Also applies to: 390-401

archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx (1)

739-775: Nice: stage-aware doc batch bar with fallbacks

Good use of stage-specific counters and clamped width; avoids cross-stage contamination and div/0.

Also applies to: 746-751, 757-766, 770-775
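
The clamping idea in isolation, as a small Python sketch (the real code lives in the TSX component; this only illustrates the guard against overflow and division by zero):

def batch_progress_percent(completed: int, total: int) -> float:
    """Return batch progress clamped to [0, 100], treating an empty batch set as 0%."""
    if total <= 0:
        return 0.0
    return max(0.0, min(100.0, (completed / total) * 100.0))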

python/src/server/services/crawling/crawling_service.py (1)

541-556: Good: mapped progress used for text/markdown detection + crawl window

Using ProgressMapper here avoids percentage jumps and aligns with the new stage windows (5–10%). 👍
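
Conceptually, a ProgressMapper scales a stage-local 0-100% into that stage's slice of the overall bar, so stages never overlap or jump backwards. A minimal sketch, with illustrative ranges taken from the numbers discussed in this review rather than the project's actual table:

class ProgressMapper:
    """Map stage-local progress (0-100) into a fixed overall window per stage."""

    STAGE_RANGES = {
        "crawling": (3, 8),            # illustrative window from the review discussion
        "document_storage": (15, 80),  # illustrative
    }

    def map_progress(self, stage: str, local_percent: float) -> float:
        start, end = self.STAGE_RANGES[stage]
        local = max(0.0, min(100.0, local_percent))
        return start + (end - start) * (local / 100.0)

For example, map_progress("crawling", 50) lands at 5.5% overall, so the UI never reports a raw 50% while the crawl stage is only half done.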

Comment thread archon-ui-main/src/components/knowledge-base/CrawlingProgressCard.tsx Outdated
@Chillbruhhh
Author

Chillbruhhh commented Aug 31, 2025

I can also confirm the UI batch progress bar issue is actually a pre-existing issue. I confirmed this by running the main branch of Archon against https://modelcontextprotocol.io/llms-full.txt and it occurred there too (see photo), so this PR would fix that issue as well! I know this is a separate issue from the original PR; I don't mind splitting it up and addressing it in another PR if need be, but I can confirm this PR adds the ability to crawl llms.txt links recursively, allows the llms-full.txt to be crawled fully, and also fixes the batching progress bar, so it's a double whammy.

Screenshot 2025-08-31 122835

Here's what it looks like now using this PR:
Screenshot 2025-08-31 135700

@Chillbruhhh
Author

@coleam00 I've updated it to the current main as of 9/3. I noticed @Wirasm changed some things around, but it fully works with main!


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
python/src/server/services/crawling/crawling_service.py (2)

529-562: Harden self-link canonicalization (index pages + safer fallback).

Treat “/index.html|/index.htm|/README.md” as the base path and lowercase in the fallback to avoid false negatives.

 def _is_self_link(self, link: str, base_url: str) -> bool:
@@
-            def _core(u: str) -> str:
+            def _core(u: str) -> str:
                 p = urlparse(u)
                 scheme = (p.scheme or "http").lower()
                 host = (p.hostname or "").lower()
                 port = p.port
                 if (scheme == "http" and port in (None, 80)) or (scheme == "https" and port in (None, 443)):
                     port_part = ""
                 else:
                     port_part = f":{port}" if port else ""
-                path = p.path.rstrip("/")
+                path = p.path.rstrip("/")
+                # Canonicalize default index/README pages to the directory root
+                lp = path.lower()
+                if lp.endswith(("/index.html", "/index.htm", "/readme.md")):
+                    path = path[: path.rfind("/")] if "/" in path else ""
                 return f"{scheme}://{host}{port_part}{path}"
@@
-        except Exception as e:
+        except Exception as e:
             logger.warning(f"Error checking if link is self-referential: {e}", exc_info=True)
             # Fallback to simple string comparison
-            return link.rstrip('/') == base_url.rstrip('/')
+            return link.rstrip('/').lower() == base_url.rstrip('/').lower()

573-589: Use ProgressMapper for emitted percent; keep crawl window inside mapper range.

Avoid raw “10%” emissions and align the text/markdown window with the crawling range (comment below says 3–8).

-            if self.progress_tracker:
-                await self.progress_tracker.update(
-                    status="crawling",
-                    progress=10,
-                    log="Detected text file, fetching content...",
-                    crawl_type=crawl_type,
-                    current_url=url
-                )
+            if self.progress_tracker:
+                overall = self.progress_mapper.map_progress("crawling", 5)
+                await self.progress_tracker.update(
+                    status="crawling",
+                    progress=overall,
+                    log="Detected text/markdown file, fetching content...",
+                    crawl_type=crawl_type,
+                    current_url=url
+                )
@@
-                start_progress=5,
-                end_progress=10,
+                # Keep within ProgressMapper's crawling window
+                start_progress=3,
+                end_progress=4,

Optional (outside this hunk): map inside _create_crawl_progress_callback so every crawling callback is auto-mapped.

# inside callback(...) in _create_crawl_progress_callback
mapped = self.progress_mapper.map_progress(base_status, progress)
await self.progress_tracker.update(status=base_status, progress=mapped, log=message, **kwargs)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5df242e and 1b4d88f.

📒 Files selected for processing (3)
  • python/src/server/services/crawling/crawling_service.py (3 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (5 hunks)
  • python/src/server/services/storage/document_storage_service.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
🧰 Additional context used
📓 Path-based instructions (3)
python/src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/**/*.py: Fail fast on service startup failures (crash with clear error if credentials, database, or any service cannot initialize)
Fail fast on missing configuration or invalid environment settings
Fail fast on database connection failures; do not hide connection issues
Fail fast on authentication/authorization failures; halt the operation and surface the error
Fail fast on data corruption or validation errors; let Pydantic raise
Fail fast when critical dependencies are unavailable (required service down)
Never store invalid data that would corrupt state (e.g., zero embeddings, null foreign keys, malformed JSON); fail instead
For batch processing, complete what you can and log detailed failures per item
Background tasks should finish queues but log failures clearly
Do not crash on a single WebSocket/event failure; log and continue serving other clients
If optional features are disabled, log and skip rather than crashing
External API calls should retry with exponential backoff; then fail with a clear, specific error
When continuing after a failure, skip the failed item entirely; never persist partial or corrupted results
Include context about the attempted operation in error messages
Preserve full stack traces with exc_info=True in Python logging
Use specific exception types; avoid catching generic Exception
Never return None to indicate failure; raise an exception with details
For batch operations, report both success counts and detailed failure lists
Target Python 3.12 and keep line length at 120 characters
Use Ruff for linting (errors, warnings, unused imports, style) and keep code Ruff-clean
Use Mypy for static type checking and keep code type-safe
Enable auto-formatting on save in IDEs to maintain consistent Python style

Files:

  • python/src/server/services/crawling/crawling_service.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep the main FastAPI application under python/src/server/

Files:

  • python/src/server/services/crawling/crawling_service.py
python/src/server/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Put backend business logic services under python/src/server/services/

Files:

  • python/src/server/services/crawling/crawling_service.py
🧬 Code graph analysis (1)
python/src/server/services/crawling/crawling_service.py (2)
python/src/server/services/crawling/helpers/url_handler.py (4)
  • is_txt (58-74)
  • is_markdown (38-55)
  • extract_markdown_links (279-365)
  • is_binary_file (77-174)
python/src/server/services/crawling/strategies/batch.py (1)
  • crawl_batch_with_progress (32-236)
🔇 Additional comments (1)
python/src/server/services/crawling/crawling_service.py (1)

683-690: LGTM: recursive crawl window matches ProgressMapper comments (3–8).

Consistent with mapper; keeps UI progress stable.

Comment thread python/src/server/services/crawling/crawling_service.py
@Wirasm
Collaborator

Wirasm commented Sep 4, 2025

This looks really neat, will test it tomorrow. I think we can merge this in pretty quickly @coleam00

@GioPetro

GioPetro commented Sep 6, 2025

We need this feature!!! Please proceed

@coleam00
Owner

coleam00 commented Sep 6, 2025

@Chillbruhhh This PR is working really nicely for llms.txt - awesome work!

However, for llms-full.txt, it still isn't ideal. I tried with:

https://ai.pydantic.dev/llms-full.txt

And it says it's crawling 492 pages when it's really just 1, and on main right now it does as I'd expect - it quickly crawls the llms-full.txt as a single page and within seconds moves on to the embedding and storage step.

@coleam00
Owner

coleam00 commented Sep 6, 2025

Wait hold on.... I believe GitHub glitched when switching to your PR and didn't pull some changes from when I last tested. Checking now...

@Chillbruhhh
Author

Chillbruhhh commented Sep 6, 2025

@coleam00 I'm also confirming now by downloading this exact branch to test. This should be the correct version that fixed that issue; I tested it multiple times.

@Chillbruhhh
Author

Chillbruhhh commented Sep 6, 2025

Wait hold on.... I believe GitHub glitched when switching to your PR and didn't pull some changes from when I last tested. Checking now...

I cloned this exact branch, removed all my other container versions, built these, and this is what I see:

2025-09-06 19:19:41 | src.server.services.crawling.strategies.single_page | INFO | Crawling markdown file: https://ai.pydantic.dev/llms-full.txt

19:19:41.349 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:19:41.352 Progress retrieved | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca | status=crawling | progress=10.0

19:19:41.386 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

[COMPLETE] ● Database backup created at: /root/.crawl4ai/crawl4ai.db.backup_20250906_191941

[INIT].... → Starting database migration...

[COMPLETE] ● Migration completed. 0 records processed.

19:19:42.490 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:19:43.076 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

[FETCH]... ↓ https://ai.pydantic.dev/llms-full.txt | ✓ | ⏱: 1.64s

[SCRAPE].. ◆ https://ai.pydantic.dev/llms-full.txt | ✓ | ⏱: 0.10s

[COMPLETE] ● https://ai.pydantic.dev/llms-full.txt | ✓ | ⏱: 2.77s

2025-09-06 19:19:44 | src.server.services.crawling.strategies.single_page | INFO | Successfully crawled markdown file: https://ai.pydantic.dev/llms-full.txt

2025-09-06 19:19:44 | src.server.services.crawling.helpers.url_handler | INFO | Skipping content-based link-collection detection for full-content file: llms-full.txt

2025-09-06 19:19:44 | src.server.services.storage.base_storage_service | INFO | Successfully chunked text: original_length=1948866, chunks_created=458

2025-09-06 19:19:44 | src.server.services.llm_provider_service | INFO | Creating LLM client for provider: openai

2025-09-06 19:19:44 | src.server.services.llm_provider_service | INFO | OpenAI client created successfully

2025-09-06 19:19:44 | search | INFO | Generating summary for c0e629a894699314 using model: gpt-4.1-nano

19:19:45.640 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:19:45.640 Progress retrieved | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca | status=processing | progress=15.0

19:19:45.654 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

2025-09-06 19:19:46 | search | INFO | Updating source c0e629a894699314 with knowledge_type=technical

2025-09-06 19:19:46 | search | INFO | Creating new source c0e629a894699314 with knowledge_type=technical

2025-09-06 19:19:46 | search | INFO | Created/updated source c0e629a894699314 with title: Pydantic Documentation - Llms-Full.Txt

2025-09-06 19:19:47 | search | WARNING | Failed to load storage settings: cannot access local variable 'credential_service' where it is not associated with a value, using defaults

2025-09-06 19:19:47 | search | INFO | Deleted existing records for 1 URLs in batches

2025-09-06 19:19:47 | src.server.services.llm_provider_service | INFO | Creating LLM client for provider: openai

2025-09-06 19:19:47 | src.server.services.llm_provider_service | INFO | OpenAI client created successfully

19:19:47.514 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:19:47.515 Progress retrieved | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca | status=document_storage | progress=15.0

19:19:47.528 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:19:48.879 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:19:48.886 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:19:48.893 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:19:49.988 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:19:51.081 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:19:52.186 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:19:53.283 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:19:54.383 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:19:55.486 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:19:56.586 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

2025-09-06 19:19:56 | search | INFO | Batch 1: Generated 1/25 contextual embeddings using batch API (sub-batch size: 50)

2025-09-06 19:19:56 | src.server.services.llm_provider_service | INFO | Creating LLM client for provider: openai

2025-09-06 19:19:56 | src.server.services.llm_provider_service | INFO | OpenAI client created successfully

19:19:57.686 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:20:01.888 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:20:01.888 Progress retrieved | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca | status=document_storage | progress=15.0

19:20:01.900 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:20:01.900 Progress retrieved | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca | status=document_storage | progress=15.0

19:20:01.903 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:20:01.904 Progress retrieved | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca | status=document_storage | progress=15.0

19:20:01.908 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:20:01.916 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

19:20:01.923 Getting progress for operation | operation_id=25f4684b-cc70-42a2-acbc-936db9399bca

2025-09-06 19:20:01 | src.server.services.llm_provider_service | INFO | Creating LLM client for provider: openai

@coleam00
Owner

coleam00 commented Sep 6, 2025

@Chillbruhhh Yeah that was my bad (or GitHub, idk) - it's looking good now! Just doing some last testing here.

@coleam00 coleam00 merged commit 8172067 into coleam00:main Sep 6, 2025
1 check passed
@coleam00
Owner

coleam00 commented Sep 6, 2025

Merged this now - nice work @Chillbruhhh!! I appreciate it a lot.

@coderabbitai coderabbitai Bot mentioned this pull request Sep 25, 2025
13 tasks
@coderabbitai coderabbitai Bot mentioned this pull request Oct 7, 2025
20 tasks
leonj1 pushed a commit to leonj1/Archon that referenced this pull request Oct 13, 2025
* Fixed llms.txt/llms-full.txt/llms.md etc. so they are finally crawled. Intelligently determines whether there are links inside the llms.txt and crawls them as it should. Tested fully, everything works!

* Updated per CodeRabbit's suggestion - resolved

* Refined per CodeRabbit's suggestions, take 2; should be the final take. Didn't add the max-link parameter suggestion, though.

* Third time's the charm: added the nitpicky thing from CodeRabbit. CodeRabbit makes me crave nicotine.

* Fixed progress bar accuracy and OpenAI API compatibility issues

  Changes Made:

  1. Progress Bar Fix: Fixed llms.txt crawling progress jumping to 90% then regressing to 45% by adjusting batch crawling progress ranges (20-30% instead of 40-90%) and using consistent ProgressMapper ranges
  2. OpenAI API Compatibility: Added robust fallback logic in contextual embedding service to handle newer models (GPT-5) that require max_completion_tokens instead of max_tokens and don't support custom temperature values
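
  As a hedged illustration of the second point, a fallback of this kind might look roughly like the following with the v1+ OpenAI Python SDK; the function and parameter handling here are assumptions based on the commit note, not the project's actual implementation:

  from openai import BadRequestError, OpenAI  # assumes the v1+ OpenAI Python SDK

  client = OpenAI()

  def chat_completion_with_fallback(model: str, messages: list[dict], max_output_tokens: int):
      """Try the legacy max_tokens parameter first, then retry with max_completion_tokens."""
      try:
          return client.chat.completions.create(model=model, messages=messages, max_tokens=max_output_tokens)
      except BadRequestError:
          # Newer models reject max_tokens (and custom temperature); retry with the newer parameter
          return client.chat.completions.create(model=model, messages=messages, max_completion_tokens=max_output_tokens)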

  Files Modified:

  - src/server/services/crawling/crawling_service.py - Fixed progress ranges
  - src/server/services/crawling/progress_mapper.py - Restored original stage ranges
  - src/server/services/embeddings/contextual_embedding_service.py - Added fallback API logic

  Result:

  - Progress bar now smoothly progresses 0→30% (crawling) → 35-80% (storage) → 100%
  - Automatic compatibility with both old (GPT-4.1-nano) and new (GPT-5-nano) OpenAI models
  - Eliminates "max_tokens not supported" and "temperature not supported" errors

* Removed the GPT-5 handling since that's a separate issue and doesn't pertain to here; I definitely recommend looking at it, though, since gpt-5-nano is considered a reasoning model, doesn't use max_tokens, and requires a different output. Also removed my upsert fix from document storage since that's not part of this exact issue and I have another PR open for it. Checked with CodeRabbit in my IDE: no issues, no nitpicks. Should be good? It might flag me for the UPSERT logic not being in here; oh well, that has nothing to do with this PR and was submitted in the last revision by mistake. Everything's tested and good to go!

* Fixed the llms-full.txt crawling issue: it now crawls just that page when crawling llms-full.txt. Fixed the crawl URL showing 100% while multiple URLs are present and crawling hasn't finished. Also fixed a styling issue in CrawlingProgressCard.tsx: when batching code examples, the batching progress bar would sometimes glitch out of the UI; it no longer does.

* Fixed a few things so it will work with the current branch!

* Added some enhancements to UI rendering as well, plus other little misc. fixes from CodeRabbit.

---------

Co-authored-by: Chillbruhhh <joshchesser97@gmail.com>
Co-authored-by: Claude Code <claude@anthropic.com>
coleam00 pushed a commit that referenced this pull request Apr 7, 2026
…lts (#437)

Workflow executor now infers the provider from the model when no explicit
provider is set. Previously, setting `model: sonnet` on a workflow while
`defaultAssistant: codex` in config would throw a compatibility error
because the provider was blindly inherited from the config default.

Resolution priority is now:
1. Explicit workflow `provider` field
2. Inferred from workflow `model` (claude aliases → claude, else → codex)
3. Config `defaultAssistant`

Also removes `model: sonnet` from all 8 default workflows and dead
`model: haiku` step-level fields from 2 workflows. Model selection is
now fully driven by config, not hardcoded in workflow YAMLs.
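
A rough sketch of that resolution order in Python (the alias set and helper are hypothetical; only the priority itself comes from the commit message):

CLAUDE_ALIASES = {"sonnet", "opus", "haiku"}  # assumed alias list for illustration

def resolve_provider(workflow_provider: str | None, workflow_model: str | None, default_assistant: str) -> str:
    """Pick a provider: explicit field, then model inference, then the config default."""
    # 1. Explicit workflow `provider` field wins
    if workflow_provider:
        return workflow_provider
    # 2. Otherwise infer from the workflow `model`
    if workflow_model:
        return "claude" if workflow_model in CLAUDE_ALIASES else "codex"
    # 3. Fall back to the config defaultAssistant
    return default_assistant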
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request Apr 16, 2026
…lts (coleam00#437)

Workflow executor now infers the provider from the model when no explicit
provider is set. Previously, setting `model: sonnet` on a workflow while
`defaultAssistant: codex` in config would throw a compatibility error
because the provider was blindly inherited from the config default.

Resolution priority is now:
1. Explicit workflow `provider` field
2. Inferred from workflow `model` (claude aliases → claude, else → codex)
3. Config `defaultAssistant`

Also removes `model: sonnet` from all 8 default workflows and dead
`model: haiku` step-level fields from 2 workflows. Model selection is
now fully driven by config, not hardcoded in workflow YAMLs.
joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…lts (coleam00#437)

Workflow executor now infers the provider from the model when no explicit
provider is set. Previously, setting `model: sonnet` on a workflow while
`defaultAssistant: codex` in config would throw a compatibility error
because the provider was blindly inherited from the config default.

Resolution priority is now:
1. Explicit workflow `provider` field
2. Inferred from workflow `model` (claude aliases → claude, else → codex)
3. Config `defaultAssistant`

Also removes `model: sonnet` from all 8 default workflows and dead
`model: haiku` step-level fields from 2 workflows. Model selection is
now fully driven by config, not hardcoded in workflow YAMLs.