Add crawl checkpoint/resume infrastructure by Zebastjan · Pull Request #936 · coleam00/Archon

Zebastjan · 2026-02-22T08:19:45Z

Summary

Adds database migration for archon_crawl_url_state table to track per-URL crawl status
Implements CrawlUrlStateService with methods to track URL states (pending/fetched/embedded/failed)
Integrates URL state tracking into crawling service and document storage operations
Adds comprehensive unit tests for CrawlUrlStateService (24 tests)

Changes

New Files

migration/0.1.0/012_add_crawl_url_state.sql - Database table for URL state
python/src/server/services/crawling/crawl_url_state_service.py - Service class
python/tests/test_crawl_url_state_service.py - Unit tests

Modified Files

python/src/server/services/crawling/crawling_service.py - Added state check at crawl start
python/src/server/services/crawling/document_storage_operations.py - Added URL state initialization and marking

Testing

24 unit tests pass covering all CrawlUrlStateService methods
Tests follow existing patterns in the codebase

Notes

This is Phase 1 of the crawl checkpoint/resume feature. Next steps:

Implement actual resume logic (filter already-embedded URLs)
Add API endpoint to expose crawl state to users

Summary by CodeRabbit

New Features
- Crawl checkpoint/resume with per-URL state, restartable ingestion pipeline, revectorize/resummarize operations, ingestion health checks, and a “Processing Details” provenance panel in the UI; code-summarization settings and model/instance controls.
Bug Fixes
- Fixed crawl progress status validation (adds discovery status for polling).
Performance
- Optimized code-summary generation for small models (faster, lower cost).
Documentation
- Added end-to-end guides for code extraction, pipeline, and prompt optimization.
Tests
- Added prompt regression and integration tests for code-summary and crawl workflows.
Chores
- Database migrations to support pipeline, URL state, and provenance tracking.

Refactored code summarization prompt from verbose 24-line format to concise 8-line structured format optimized for 1.2B+ parameter models. Changes: - Reduced prompt from ~350 tokens to ~100 tokens (70% reduction) - Replaced extensive examples with direct structured guidance - Maintains same JSON output schema for backward compatibility - Added PURPOSE/PARAMETERS/USE WHEN structure to summary template Testing & Documentation: - Added permanent regression test (tests/prompts/test_code_summary_prompt.py) - Created test infrastructure documentation (tests/prompts/README.md) - Added comprehensive prompt documentation (PRPs/ai_docs/CODE_SUMMARY_PROMPT.md) - Documented data flow (CODE_EXTRACTION_FLOW.md) - Updated CLAUDE.md with testing guidelines Performance Impact: - 3-5x faster inference with small models (tested: Liquid 1.2B Instruct) - 70% cost reduction for code summarization - Only affects code blocks (~5% of content), prose chunks unchanged - Existing markdown fence handling in parser confirmed working The new prompt produces consistent JSON output with structured format while being dramatically more efficient for small language models. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Added two integration test suites to validate the optimized code summary prompt: 1. test_code_summary_prompt_quick.py (Quick validation) - Directly tests the code summarization function - Validates prompt generates valid JSON with required fields - Tests Python, TypeScript, and Rust code samples - ✅ All 3 tests passed with valid summaries 2. test_crawl_validation.py (Full crawl validation) - Tests complete crawl-to-summary pipeline via API - Validates contribution guideline URLs - Currently blocked by backend validation bug: - Progress status 'discovery' not in allowed enum - Prevents polling from completing - Ready for use once backend bug is fixed Test Results: - Quick validation: ✅ 3/3 passed - Summaries generated correctly for all languages - JSON structure validated - Demonstrates prompt works with configured LLM Files added: - tests/integration/__init__.py - tests/integration/test_code_summary_prompt_quick.py - tests/integration/test_crawl_validation.py - tests/integration/.gitignore (excludes *_results.json) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Added: - PROMPT_TEST_DETAILS.md: Complete test validation report - Updated PR_DESCRIPTION.md with test results section Testing Status: - ✅ Quick validation: 3/3 passed - ⏸️ Full crawl validation: Ready but blocked by backend bug Quick test validates: - Valid JSON generation across Python, TypeScript, Rust - Meaningful summaries produced - Prompt works with configured LLM Full crawl test infrastructure ready for use once backend progress status enum bug is fixed ('discovery' status). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…sponse Resolves Pydantic validation error that prevented programmatic crawl progress polling. Changes: - Added 'discovery' to allowed status values in CrawlProgressResponse - Updated documentation to reflect bug fix - Full crawl validation test now functional The 'discovery' status is emitted by the backend during the page discovery phase of crawling. This fix enables automated testing and progress monitoring. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Add migration to create archon_crawl_url_state table - Track per-URL status: pending, fetched, embedded, failed - Support retry logic with configurable max_retries - Add CrawlUrlStateService with methods for state management This is the infrastructure for Phase 1 of ADR-001, enabling resumable crawls after interruption.

- Import and initialize CrawlUrlStateService in crawling orchestration - Check for existing crawl state at start of crawl - Log resume info if pending/failed URLs exist - Initialize URL state in document storage before processing - Mark URLs as embedded after successful storage Note: Full resume functionality (skipping already-embedded URLs) requires additional work to filter URLs before crawling.

coderabbitai · 2026-02-22T08:20:01Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a restartable, checkpointed ingestion pipeline and per-URL crawl state, provenance tracking, ingestion workers/orchestrator/health checks, resume-aware crawling, code-summary prompt optimization and tests, new API endpoints for revectorize/resummarize, migrations for pipeline/provenance, frontend provenance UI/types, and related docs/tests.

Changes

Cohort / File(s)	Summary
Architecture & ADRs `docs/ADRs/001-restartable-rag-pipeline.md`, `docs/ADR-002: Crawl Reliability, Ingestion Quality Control & DB Validation.md`, `docs/ADR-002-IMPLEMENTATION-STATUS.md`, `docs/ADR-001: Crawl & Ingestion Pipeline Improvements.md`	Adds ADR-002 (multi-phase plan), implementation status, and marks ADR-001 merged into ADR-002; documents pipeline design, roadmap, and migration plan.
Database migrations `migration/0.1.0/012_add_crawl_url_state.sql`, `migration/0.1.0/013_add_provenance_tracking.sql`, `migration/0.1.0/014_add_pipeline_tables.sql`, `migration/0.1.0/*`	Adds `archon_crawl_url_state`, provenance columns on `archon_sources`, full pipeline tables (document blobs, chunks, embedding_sets, embeddings, summaries), and appends migration-record inserts across many migration files.
Crawl URL state service `python/src/server/services/crawling/crawl_url_state_service.py`	New CrawlUrlStateService: initialize/upsert URLs, mark_fetched/embedded/failed with retry logic, getters, get_crawl_state, clear_state, and singleton factory.
Crawler orchestration & strategies `python/src/server/services/crawling/crawling_service.py`, `python/src/server/services/crawling/strategies/recursive.py`	Integrates URL-state awareness and resume filtering across discovery, recursive and batch crawls; adds `_filter_already_processed_urls` and threads `source_id` / `url_state_service` through crawl flows.
Document storage & source management `python/src/server/services/crawling/document_storage_operations.py`, `python/src/server/services/source_management_service.py`	Adds new pipeline branch (`use_new_pipeline`), initializes URL state, marks embedded URLs post-storage, and persists provenance fields (embedding provider/model/dimensions, vectorizer_settings, summarization_model) and last_crawled/last_vectorized timestamps.
Ingestion pipeline & state service `python/src/server/services/ingestion/__init__.py`, `python/src/server/services/ingestion/ingestion_state_service.py`, `python/src/server/services/ingestion/pipeline_orchestrator.py`, `python/src/server/services/ingestion/embedding_worker.py`, `python/src/server/services/ingestion/summary_worker.py`, `python/src/server/services/ingestion/health_check.py`	New ingestion state service, pipeline orchestrator, embedding and summary workers, health-check, domain models and enums — implements restartable staged pipeline: blobs → chunks → embedding_sets → embeddings → summaries.
Code extraction & summarization `python/src/server/services/storage/code_storage_service.py`, `python/src/server/services/crawling/code_extraction_service.py`, `PRPs/ai_docs/CODE_SUMMARY_PROMPT.md`, `CODE_EXTRACTION_FLOW.md`	Adds dedicated code-summarization provider/model selection, shared-client batching, concise prompts optimized for small LMs, extraction/validation rules, de-duplication, and documentation of extraction/summarization flow.
Progress models & API `python/src/server/models/progress_models.py`, `python/src/server/api_routes/knowledge_api.py`	Adds `"discovery"` to CrawlProgressResponse.status literal; new endpoints `POST /knowledge-items/{source_id}/revectorize` and `POST /knowledge-items/{source_id}/resummarize` with background workers, semaphores, and progress tracking.
Code storage & pipeline integration `python/src/server/services/storage/code_storage_service.py`, `python/src/server/services/crawling/document_storage_operations.py`	Wires code extraction and embedding metadata into storage flows, supports provenance/vectorizer settings in source upserts, and marks URLs as embedded to prevent reprocessing.
Tests & test harness `python/tests/`, `python/tests/prompts/`, `python/tests/integration/*`	Adds unit tests for CrawlUrlStateService, integration crawl validation, code-summary prompt regression tests and README; several new integration/unit test scripts and some duplicated blocks noted.
Frontend provenance UI & types `archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx`, `archon-ui-main/src/features/knowledge/types/knowledge.ts`, `archon-ui-main/src/components/settings/RAGSettings.tsx`	Adds provenance fields to TS types, Processing Details UI, needs-revectorization indicator, revectorize/resummarize actions, and settings for code summarization provider/model.
Frontend hooks & services `archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts`, `archon-ui-main/src/features/knowledge/services/knowledgeService.ts`, `archon-ui-main/src/features/knowledge/components/KnowledgeCardActions.tsx`	Adds `useRevectorizeKnowledgeItem` and `useResummarizeKnowledgeItem` hooks (duplicate insertion present), knowledgeService endpoints for revectorize/resummarize, and UI actions wired into KnowledgeCardActions.
API & app wiring `python/src/server/api_routes/ingestion_api.py`, `python/src/server/main.py`	Adds ingestion API routes (process embeddings/summaries, health, retry failed), includes ingestion router in main app; minor duplicate imports noted.
Docs & misc `CLAUDE.md`, `PROMPT_TEST_DETAILS.md`, `PR_DESCRIPTION.md`, `docs/README.md`, `test_new_pipeline.md`	Adds/updates documentation for prompt tests, code-extraction flow, prompt optimization rationale, PR description, manual pipeline testing guide; some duplicated doc blocks noted.
Minor refactors & formatting `python/src/server/services/knowledge/knowledge_item_service.py`, various SQL migrations	Code reflows, inline formatting changes, logging normalization, small signature tweaks, and migration bookkeeping insert additions across SQL files.

Sequence Diagram(s)

sequenceDiagram
    participant Orchestrator as Pipeline Orchestrator
    participant URLState as Crawl URL State Service
    participant DB as Supabase DB
    participant Fetcher as Fetcher/Extractor
    participant DocStore as Document Storage
    participant CodeStore as Code Storage / Summarizer

    Orchestrator->>URLState: initialize_urls(source_id, discovered_urls)
    URLState->>DB: upsert archon_crawl_url_state (status='pending')
    DB-->>URLState: upsert result
    URLState-->>Orchestrator: {inserted, skipped}

    loop process pending URLs
        Orchestrator->>URLState: get_pending_urls(source_id)
        URLState->>DB: select pending
        DB-->>URLState: url list
        Orchestrator->>Fetcher: fetch(url)
        Fetcher-->>Orchestrator: crawl_result (content, code blocks)
        Orchestrator->>DocStore: store_documents(crawl_result, provenance)
        DocStore->>CodeStore: extract_code_examples (if enabled)
        CodeStore-->>DocStore: code_summaries
        DocStore->>URLState: mark_embedded(source_id, url)
        URLState->>DB: update status='embedded'
    end

    Orchestrator->>URLState: get_crawl_state(source_id)
    URLState->>DB: aggregate counts
    DB-->>URLState: state
    Orchestrator->>URLState: clear_state(source_id) (if finished)
    URLState->>DB: delete records

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Fixes: crawl code storage issue with <think> tags for ollama models. #775 — Modifies python/src/server/services/storage/code_storage_service.py; strongly related to code summarization/provider changes in this PR.
feat: Ollama Integration with Separate LLM/Embedding Model Support #643 — Overlaps crawl/ingestion pipeline, provenance fields, and migrations; closely related to pipeline/state additions.
feat: Implement priority-based automatic discovery of llms.txt and sitemap.xml files #622 — Overlaps discovery-stage and crawl orchestration/progress handling including the added "discovery" status and resume logic.

Suggested labels

enhancement

Suggested reviewers

coleam00
Wirasm

Poem

🐰 I hop through URLs with care and cheer,

I stash checkpoints where the pathway's clear,
I nibble code into tidy lines,
then whisper summaries in tiny signs,
and leave a trail so others may reappear.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 71.68% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Add crawl checkpoint/resume infrastructure' accurately and concisely describes the main change: introducing database and service infrastructure for tracking crawl state.
Description check	✅ Passed	The PR description includes a clear summary, lists new and modified files with their purposes, specifies testing evidence (24 unit tests), and provides context for next steps as Phase 1.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 15

🧹 Nitpick comments (18)

python/tests/integration/test_code_summary_prompt_quick.py (2)
132-133: Move import traceback to module level.

Importing inside an except block works but is non-idiomatic. traceback is a stdlib module with no cost to importing at module level.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/tests/integration/test_code_summary_prompt_quick.py` around lines 132
- 133, Move the local import of the standard library module traceback out of the
except block and place it at module level near the other imports in
python/tests/integration/test_code_summary_prompt_quick.py; locate the except
block that currently does "import traceback" and remove that line, then add
"import traceback" with the other top-level imports so functions/tests (e.g. the
failing test function containing the except) can use traceback without importing
inside the exception handler.
130-142: Exception handler swallows failures — pytest can never detect a failing test run.

All LLM/API exceptions are caught and stored as "success": False in results, but nothing is raised. Pytest sees test_prompt_directly as always passing. Per coding guidelines, batch-style tests should report detailed failures; here at minimum a final assertion should gate the test outcome.
🔧 Proposed fix — add a final assertion gate
     if success_count == len(results):
         print("\n🎉 All tests passed!")
     else:
         print(f"\n⚠️  {len(results) - success_count} test(s) failed")

+    # Let pytest see the result
+    assert success_count == len(results), (
+        f"{len(results) - success_count}/{len(results)} code summary test(s) failed"
+    )
     return results
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/tests/integration/test_code_summary_prompt_quick.py` around lines 130
- 142, The current except block appends failures to the results list but never
fails the test; update the end of test_prompt_directly to assert that there are
no failed runs by inspecting results (e.g., build failures = [r for r in results
if not r["success"]]) and raise an AssertionError or use pytest.fail with a
concise summary of failures (including name, language, and error) if any exist;
keep the existing try/except aggregation but add this final assertion gate to
ensure pytest detects failures.
python/src/server/services/storage/code_storage_service.py (2)
636-636: Redundant conditional — simplify to code[:1500].

Python slicing never raises when len(code) <= 1500; the conditional is unnecessary.
♻️ Proposed simplification
-{code[:1500] if len(code) > 1500 else code}
+{code[:1500]}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/storage/code_storage_service.py` at line 636,
Replace the redundant conditional that uses "{code[:1500] if len(code) > 1500
else code}" with the simpler slice "code[:1500]"; update the occurrence that
references the variable "code" (in code_storage_service.py) so it always uses
the safe slicing expression "code[:1500]" since Python slicing handles shorter
strings without error.
631-643: context_before / context_after are silently dropped from the prompt.

Both parameters are still declared in the function signature and documented, but are no longer included in the prompt. If callers rely on context to generate accurate summaries (especially for ambiguous code blocks), this is a silent quality regression for larger models. Consider documenting the intentional omission in the function docstring, or adding a parameter to re-enable context for non-small-model providers.
python/src/server/services/crawling/document_storage_operations.py (1)
69-78: Misleading comment — no "enabled" flag is checked.

The comment # Initialize URL state tracking if enabled implies a feature flag, but tracking is always attempted. Per coding guidelines, comments should document functionality, not aspirational state.
♻️ Proposed fix
-        # Initialize URL state tracking if enabled
+        # Initialize per-URL crawl state to "pending"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/crawling/document_storage_operations.py` around
lines 69 - 78, The comment "Initialize URL state tracking if enabled" is
misleading because no flag is checked; update the code to either actually check
a feature flag or change the comment to accurately reflect behavior.
Specifically, either add a conditional around get_crawl_url_state_service/
url_state_service.initialize_urls that checks the existing feature flag (or
config) used elsewhere, or replace the comment with a factual one such as
"Initialize URL state tracking" so it matches the unconditional calls to
get_crawl_url_state_service and url_state_service.initialize_urls; ensure
references to url_state_service, get_crawl_url_state_service, and
initialize_urls remain correct.
python/tests/prompts/README.md (1)
25-25: Hard-coded line reference will drift.

(lines 631-643) in the code_storage_service.py reference will become stale as the file evolves. Linking to the function name is more stable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/tests/prompts/README.md` at line 25, Replace the hard-coded "(lines
631-643)" reference with a stable symbol reference to the function or method in
src/server/services/storage/code_storage_service.py (e.g., use the class/method
name such as CodeStorageService.<method_name> or the exact def name) so readers
can find the code even as line numbers change; update the README.md text to use
the function/method name in backticks and remove the numeric line-range.
migration/0.1.0/012_add_crawl_url_state.sql (1)
27-29: Redundant single-column index on source_id.

idx_crawl_url_state_source on (source_id) is redundant because idx_crawl_url_state_source_status on (source_id, status) can serve all queries that filter only by source_id (leftmost prefix). Keeping it adds write overhead without query benefit.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@migration/0.1.0/012_add_crawl_url_state.sql` around lines 27 - 29, The
migration creates a redundant single-column index idx_crawl_url_state_source on
archon_crawl_url_state(source_id); remove that CREATE INDEX statement and rely
on the composite idx_crawl_url_state_source_status (source_id, status) instead;
if the single-column index may already exist in deployed DBs, add a DROP INDEX
IF EXISTS idx_crawl_url_state_source or an equivalent rollback step to the
migration so only the composite index remains and write overhead is avoided.
python/tests/integration/test_crawl_validation.py (3)
232-238: Move traceback import to the top of the file.

import traceback is a stdlib module — importing it inside the except block (and again at line 263) is unconventional and duplicated.
Suggested fix

Add at the top of the file with other imports:
import traceback
Then remove the inline imports at lines 236 and 263.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/tests/integration/test_crawl_validation.py` around lines 232 - 238,
Move the inline "import traceback" out of the except blocks and add a single
module-level "import traceback" with the other top-of-file imports; then remove
the two inline imports (the one inside the except that sets result["status"] =
"error" and the duplicate later) so traceback is imported only once at the top
and used from there.
80-82: Consider checking the status field instead of (or in addition to) error.

The progress response may have status: "error" without a separate error field, or the error key might be present but empty. Checking both provides more robust detection. Also, raising a bare Exception loses the full progress context.
Suggested fix
         # Check if errored
-        if progress.get("error"):
-            raise Exception(f"Crawl failed: {progress.get('error')}")
+        status = progress.get("status", "")
+        if progress.get("error") or status in ("error", "failed"):
+            raise RuntimeError(f"Crawl failed: {progress.get('error', status)}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/tests/integration/test_crawl_validation.py` around lines 80 - 82,
Update the failure check to consider the progress "status" field as well as the
"error" field and raise with the full progress context; specifically, in the
block using progress (the variable named progress in test_crawl_validation.py),
detect failure when progress.get("error") is truthy OR progress.get("status") ==
"error", and then raise an exception that includes the entire progress object
(e.g., include progress in the message) instead of raising a bare Exception with
only progress.get('error').
22-23: Hardcoded API base URL — consider using an environment variable.

API_BASE = "http://localhost:8181" will break if the server runs on a different host/port (e.g., in CI or Docker-to-Docker networking). A simple os.environ.get("API_BASE", "http://localhost:8181") keeps the default while allowing overrides.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/tests/integration/test_crawl_validation.py` around lines 22 - 23, The
test hardcodes API_BASE = "http://localhost:8181" which breaks in other
environments; change API_BASE to read from an environment variable with a
default (use os.environ.get("API_BASE", "http://localhost:8181")) and add an
import for os at the top of python/tests/integration/test_crawl_validation.py so
tests can override API_BASE in CI or Docker without changing code.
python/tests/test_crawl_url_state_service.py (2)
328-373: State transition tests are redundant with the individual method tests.

test_pending_to_fetched_transition and test_fetched_to_embedded_transition (Lines 331–347) duplicate exactly what test_marks_url_as_fetched and test_marks_url_as_embedded already cover. Consider either removing these or making them genuine multi-step transitions that initialize → fetch → embed the same URL with assertions at each stage.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/tests/test_crawl_url_state_service.py` around lines 328 - 373, The
tests test_pending_to_fetched_transition and test_fetched_to_embedded_transition
in TestStateTransitionLogic are redundant with existing tests
(test_marks_url_as_fetched and test_marks_url_as_embedded); either remove these
two methods or convert them into a true multi-step integration test that uses
the same source_id/url and calls url_state_service.mark_fetched(...) then
asserts the intermediate state, followed by url_state_service.mark_embedded(...)
and asserts the final state (ensure you reference the same URL and use the same
service instance: url_state_service.mark_fetched and
url_state_service.mark_embedded); update or delete the redundant test methods
accordingly to avoid duplicate coverage.
7-7: Unused import: patch.

patch is imported but never used in any test.
-from unittest.mock import MagicMock, patch
+from unittest.mock import MagicMock
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/tests/test_crawl_url_state_service.py` at line 7, Remove the unused
import "patch" from the test module import line (keep MagicMock if still used)
or replace it by using patch where intended; specifically update the import
statement that currently reads "from unittest.mock import MagicMock, patch" to
drop "patch" (or use "patch" in tests if that was intended) so there are no
unused imports.
CODE_EXTRACTION_FLOW.md (2)
17-25: Add language specifiers to pseudocode fenced blocks.

Lines 17, 33, 73, and 137 use fenced code blocks without a language identifier (flagged by markdownlint MD040). Using text or plaintext would satisfy the linter while keeping the content readable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CODE_EXTRACTION_FLOW.md` around lines 17 - 25, Several fenced code blocks in
CODE_EXTRACTION_FLOW.md (e.g., the block showing the numbered list starting "1.
Crawl returns markdown documents" and the other pseudocode blocks) are missing
language specifiers and trigger markdownlint MD040; update each triple-backtick
fence to include a language token such as ```text or ```plaintext so the blocks
are recognized as plain text (apply this to the blocks near the "Crawl returns
markdown documents" snippet and the other pseudocode fences referenced in the
review).
1-8: Documentation uses change-specific language that won't age well.

Phrases like "YOUR PROMPT HERE", "Your prompt change", and "Your 1.2B prompt optimization" are conversational and tied to this specific PR context. For long-term project documentation, consider replacing these with neutral descriptions (e.g., "the code summary prompt", "the prompt optimization"). Based on learnings, documentation should focus on functionality and reasoning rather than referencing specific changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CODE_EXTRACTION_FLOW.md` around lines 1 - 8, The documentation in
CODE_EXTRACTION_FLOW.md uses PR-specific, conversational phrases ("YOUR PROMPT
HERE", "Your prompt change", "Your 1.2B prompt optimization"); update the text
to neutral, long-lived terminology by replacing those phrases with stable
descriptors such as "the code summary prompt", "the prompt optimization", or
"prompt adjustments", and rephrase any remaining change-specific wording so the
document describes behavior (e.g., that the code summary runs on extracted code
blocks only) rather than referencing this PR or an author-facing instruction;
ensure the three explicit strings mentioned are replaced consistently throughout
the file and that the executive summary remains factual and version-agnostic.
PROMPT_TEST_DETAILS.md (1)
99-105: Fenced code block missing language specifier.

Per markdownlint MD040, adding a language identifier (e.g., text) would satisfy the linter.
-```
+```text
 pydantic_core._pydantic_core.ValidationError: 1 validation error for CrawlProgressResponse
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@PROMPT_TEST_DETAILS.md` around lines 99 - 105, The fenced code block showing
the pydantic ValidationError in PROMPT_TEST_DETAILS.md is missing a language
specifier; update the opening fence to include a language identifier (for
example change ``` to ```text) so the block containing
"pydantic_core._pydantic_core.ValidationError: 1 validation error for
CrawlProgressResponse" is annotated, ensuring the fenced code block starts with
```text and ends with the matching ``` fence.
python/src/server/services/crawling/crawl_url_state_service.py (3)
323-340: Singleton accessor is not thread-safe and silently ignores a different supabase_client after first initialization.

If get_crawl_url_state_service(client_A) is called during startup and later get_crawl_url_state_service(client_B) is called (e.g., in tests or a different context), client_B is silently discarded. Additionally, concurrent calls can race on the None check and create two instances.

Consider either:

Adding a reset parameter for testing, and documenting that the client arg is only honored on first call.

Using a lock for thread safety.
Minimal fix — document behavior and add reset capability for tests
-def get_crawl_url_state_service(supabase_client=None) -> CrawlUrlStateService:
+def get_crawl_url_state_service(supabase_client=None, *, _reset: bool = False) -> CrawlUrlStateService:
     """
     Get the singleton crawl URL state service instance.
 
     Args:
-        supabase_client: Optional Supabase client
+        supabase_client: Optional Supabase client (only used on first call or after reset)
+        _reset: If True, discard existing instance and create a new one (testing only)
 
     Returns:
         CrawlUrlStateService instance
     """
     global crawl_url_state_service
-    if crawl_url_state_service is None:
+    if crawl_url_state_service is None or _reset:
         crawl_url_state_service = CrawlUrlStateService(supabase_client)
     return crawl_url_state_service
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/crawling/crawl_url_state_service.py` around lines
323 - 340, The singleton accessor get_crawl_url_state_service currently drops
subsequent supabase_client args and is not thread-safe; update it to accept an
optional reset: bool = False (for tests) and document that supabase_client is
honored only on first init unless reset=True, and add a module-level
threading.Lock (e.g., crawl_url_state_service_lock) and wrap the None-check and
instantiation of crawl_url_state_service (and honoring/reset logic) inside
lock.acquire()/release or a context manager to prevent races; ensure you still
use the global crawl_url_state_service and instantiate
CrawlUrlStateService(supabase_client) only when appropriate.
20-28: Missing type annotation on supabase_client parameter.

The constructor accepts an optional Supabase client but has no type hint, making it harder for MyPy and IDE tooling to catch misuse.
Proposed fix
-    def __init__(self, supabase_client=None):
+    def __init__(self, supabase_client: "Client | None" = None):
(Import Client from the supabase library or use Any if the exact type varies.)
As per coding guidelines: "Use Mypy for type checking in Python to ensure type safety."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/crawling/crawl_url_state_service.py` around lines
20 - 28, The __init__ of CrawlUrlStateService currently lacks a type annotation
for the supabase_client parameter; update the signature of __init__ to annotate
supabase_client as either the supabase Client type (import Client from the
supabase library) or as typing.Optional[Client] (or typing.Any if the exact
client class is uncertain), and add a corresponding attribute annotation for
self.supabase_client (e.g., self.supabase_client: Optional[Client]) so MyPy/IDE
tooling can validate uses of get_supabase_client() and methods on
supabase_client.
74-76: Error logging lacks context (source_id, url) and stack traces, per coding guidelines.

Most except blocks log a generic message (e.g., "Failed to update URL status: {e}") without including the source_id or url that failed, and none preserve the stack trace with exc_info=True. This makes production debugging significantly harder.

Example — _update_status (line 178):
Proposed fix pattern (apply similarly to all catch blocks)
         except Exception as e:
-            safe_logfire_error(f"Failed to update URL status: {e}")
+            logger.error(
+                f"Failed to update URL status to '{status}' | source_id={source_id} | url={url}",
+                exc_info=True,
+            )
             return False
Where safe_logfire_error alone is insufficient for debugging, pair it with logger.error(..., exc_info=True) so that structured tracing and full stack traces are both available.
As per coding guidelines: "Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs."

Also applies to: 150-152, 177-179, 249-251, 276-278, 299-301, 318-320
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/crawling/crawl_url_state_service.py` around lines
74 - 76, The except blocks (e.g., the one around safe_logfire_error in
crawl_url_state_service.py and specifically in the _update_status method) must
log contextual identifiers and full stack traces: replace the bare
safe_logfire_error(f"...{e}") pattern with two actions — call safe_logfire_error
with a concise message and then call the module logger (e.g., logger.error) with
a descriptive message that includes source_id and url (or other relevant IDs)
and pass exc_info=True so the full exception stack is preserved; also narrow the
except to specific exception types where feasible and re-raise with "raise" (or
"raise from e" if chaining is desired). Ensure these changes are applied to the
other catch sites mentioned (around lines 150-152, 177-179, 249-251, 276-278,
299-301, 318-320) and reference the same symbols (safe_logfire_error and
_update_status) when editing.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CLAUDE.md`:
- Around line 153-173: Remove the stray closing code-fence that follows the
"Prompt Regression Tests" / Quick Workflows bash block so there's only one
closing ``` for that block; specifically, delete the orphaned ``` that currently
precedes the "## Architecture Overview" heading so the bash block closes
correctly and the rest of the document renders as normal.

In `@migration/0.1.0/012_add_crawl_url_state.sql`:
- Around line 43-45: The current RLS policy "Service role full access to
crawl_url_state" on archon_crawl_url_state uses USING (true) WITH CHECK (true)
and thus grants access to any role; update the policy to restrict it to the
service role by either replacing the predicate with USING (auth.role() =
'service_role') WITH CHECK (auth.role() = 'service_role') or by adding TO
service_role so only the service role can read/write; modify the policy
definition for "Service role full access to crawl_url_state" accordingly.
- Around line 21-22: Add the missing UPDATE trigger for updated_at on the
archon_crawl_url_state table: after the table definition that contains the
updated_at TIMESTAMPTZ DEFAULT now() column, create a trigger named
set_updated_at_crawl_url_state that fires BEFORE UPDATE ON
archon_crawl_url_state FOR EACH ROW and EXECUTE FUNCTION
update_updated_at_column(); this mirrors the pattern used for archon_settings,
archon_projects, etc., and ensures updated_at is auto-updated even if the
application layer omits it.

In `@PRPs/ai_docs/CODE_SUMMARY_PROMPT.md`:
- Around line 132-135: The run command in the markdown uses the wrong test path
("uv run python test_code_summary_prompt.py ollama"); update this to point at
the tests/prompts test file used elsewhere by replacing that string with "uv run
python tests/prompts/test_code_summary_prompt.py ollama" so the command runs
from the python/ directory and matches the other references to
tests/prompts/test_code_summary_prompt.py.

In `@python/src/server/services/crawling/crawl_url_state_service.py`:
- Around line 104-152: mark_failed currently does a read-then-update on
retry_count (via
self.supabase_client.table(self.table_name).select(...).match(...)) which causes
a TOCTOU race; replace that two-step logic with an atomic DB-side update (e.g.,
a single SQL UPDATE that sets retry_count = retry_count + 1 and uses RETURNING
to get the new value, or call a Supabase RPC/stored procedure) so concurrent
failures cannot lose increments and so you can decide in the same statement
whether to set status to "failed" or keep "pending"; update mark_failed to call
that atomic updater and then call _update_status only if the returned
retry_count >= max_retries (or incorporate status change into the same RPC), or
if you cannot change DB now, add a short comment in mark_failed documenting this
known limitation.
- Around line 253-278: The exception handler in get_crawl_state can reference
counts before it's defined if the Supabase query raises, causing an
UnboundLocalError; to fix, move or create the counts = {"pending": 0, "fetched":
0, "embedded": 0, "failed": 0, "total": 0} initialization before the try block
(or ensure a default counts is set in the except) so that the except path and
return always have a valid counts; update the get_crawl_state function (the
supabase_client.table(...).select("status").match({"source_id":
source_id}).execute() call and the except block) to use this pre-initialized
counts.

In `@python/src/server/services/crawling/crawling_service.py`:
- Around line 361-362: In _async_orchestrate_crawl, synchronous Supabase I/O
methods on url_state_service (has_existing_state, get_crawl_state, clear_state)
must be run off the event loop; replace direct calls like
url_state_service.has_existing_state(original_source_id) with awaiting
asyncio.to_thread(url_state_service.has_existing_state, original_source_id) (and
do the same for get_crawl_state and clear_state) so these blocking DB calls do
not block the async orchestration (mirror the existing pattern used for
discover_files).
- Around line 360-381: The clear-state branch currently wipes all crawl state
when pending==0 and failed==0, which drops URLs in the "fetched" state; update
the logic in the block that calls
url_state_service.clear_state(original_source_id) (within the
get_crawl_url_state_service / crawl_state handling) to also inspect
crawl_state.get("fetched", 0) (or fetched_count) and only clear state when
fetched_count == 0 (otherwise log resume info and let the crawl continue so
fetched-but-not-yet-embedded URLs are preserved); ensure you reference
url_state_service.clear_state, url_state_service.get_crawl_state, and the
crawl_state keys pending/failed/fetched/total in your change.

In `@python/src/server/services/crawling/document_storage_operations.py`:
- Around line 74-78: The except blocks around url_state_service.initialize_urls
(where safe_logfire_error is called) and the other handler at the location
referencing safe_logfire_error must also call the module logger with
exc_info=True to preserve stack traces; update the except handlers that
currently call only safe_logfire_error(...) to additionally invoke
logger.error("Failed to initialize URL state" or the existing error message,
exc_info=True) (i.e., mirror the established pattern used elsewhere in this file
where safe_logfire_error(...) is paired with logger.error(..., exc_info=True)).
- Around line 291-298: The call to url_state_service.mark_embedded currently
uses unique_doc_urls (which includes URLs for documents that were skipped and
never embedded), corrupting resume state; update the post-storage marking to
iterate over the deduplicated list of actually-embedded URLs by replacing
unique_doc_urls with list(set(all_urls)) so only URLs collected in the
processing loop (all_urls) are marked embedded; look for the block in
document_storage_operations.py that calls url_state_service.mark_embedded and
the variables unique_doc_urls and all_urls within the same function to apply
this change.

In `@python/tests/integration/.gitignore`:
- Around line 1-2: The tests are writing code_summary_test_results.json into
python/tests/prompts/ but only python/tests/integration/ is ignored; add a
.gitignore to prevent committing that output by either creating
python/tests/prompts/.gitignore containing the same pattern (*_results.json or
specifically code_summary_test_results.json) or consolidate by adding
*_results.json to python/tests/.gitignore; update the repository by committing
the new/updated .gitignore so python/tests/prompts/test_code_summary_prompt.py
output is ignored.

In `@python/tests/integration/test_code_summary_prompt_quick.py`:
- Around line 70-77: The test currently swallows ImportError and does a plain
return, masking failures; update the import error handling in
python/tests/integration/test_code_summary_prompt_quick.py where
_generate_code_example_summary_async is imported: replace the bare return with
either pytest.skip("dependency missing: ...") (after importing pytest) when the
import/environment is optional, or re-raise the caught ImportError to fail the
test and surface the real error; ensure the ImportError handler uses the chosen
approach and includes the original exception message for context.

In `@python/tests/prompts/test_code_summary_prompt.py`:
- Around line 118-160: The helper function test_single_summary is being
collected by pytest because it starts with test_ and its parameter sample has no
fixture, and its try/except swallows assertion failures; rename the function
(e.g., test_single_summary -> run_single_summary) so pytest won't collect it,
remove or lift the broad try/except in run_single_summary so assertion errors
propagate to pytest, and update all call sites that referenced
test_single_summary to the new name; keep the original signature
(code/context/language/provider) but ensure no pytest-only fixtures are
expected.

In `@python/tests/test_crawl_url_state_service.py`:
- Around line 145-169: Add assertions in the two tests to verify the payload
passed to the DB update call: for test_marks_url_as_failed_after_max_retries
assert that the update payload includes "status": "failed"; for
test_increments_retry_count_below_max assert the payload includes "status":
"pending" and "retry_count": 2. Locate the mocked update by referencing
mock_client.table.return_value.update.return_value.match.return_value (or the
update(...) call on mock_client.table()) and assert the call arguments (e.g.,
via assert_called_with or inspecting call_args) to ensure the correct dict was
written.

---

Nitpick comments:
In `@CODE_EXTRACTION_FLOW.md`:
- Around line 17-25: Several fenced code blocks in CODE_EXTRACTION_FLOW.md
(e.g., the block showing the numbered list starting "1. Crawl returns markdown
documents" and the other pseudocode blocks) are missing language specifiers and
trigger markdownlint MD040; update each triple-backtick fence to include a
language token such as ```text or ```plaintext so the blocks are recognized as
plain text (apply this to the blocks near the "Crawl returns markdown documents"
snippet and the other pseudocode fences referenced in the review).
- Around line 1-8: The documentation in CODE_EXTRACTION_FLOW.md uses
PR-specific, conversational phrases ("YOUR PROMPT HERE", "Your prompt change",
"Your 1.2B prompt optimization"); update the text to neutral, long-lived
terminology by replacing those phrases with stable descriptors such as "the code
summary prompt", "the prompt optimization", or "prompt adjustments", and
rephrase any remaining change-specific wording so the document describes
behavior (e.g., that the code summary runs on extracted code blocks only) rather
than referencing this PR or an author-facing instruction; ensure the three
explicit strings mentioned are replaced consistently throughout the file and
that the executive summary remains factual and version-agnostic.

In `@migration/0.1.0/012_add_crawl_url_state.sql`:
- Around line 27-29: The migration creates a redundant single-column index
idx_crawl_url_state_source on archon_crawl_url_state(source_id); remove that
CREATE INDEX statement and rely on the composite
idx_crawl_url_state_source_status (source_id, status) instead; if the
single-column index may already exist in deployed DBs, add a DROP INDEX IF
EXISTS idx_crawl_url_state_source or an equivalent rollback step to the
migration so only the composite index remains and write overhead is avoided.

In `@PROMPT_TEST_DETAILS.md`:
- Around line 99-105: The fenced code block showing the pydantic ValidationError
in PROMPT_TEST_DETAILS.md is missing a language specifier; update the opening
fence to include a language identifier (for example change ``` to ```text) so
the block containing "pydantic_core._pydantic_core.ValidationError: 1 validation
error for CrawlProgressResponse" is annotated, ensuring the fenced code block
starts with ```text and ends with the matching ``` fence.

In `@python/src/server/services/crawling/crawl_url_state_service.py`:
- Around line 323-340: The singleton accessor get_crawl_url_state_service
currently drops subsequent supabase_client args and is not thread-safe; update
it to accept an optional reset: bool = False (for tests) and document that
supabase_client is honored only on first init unless reset=True, and add a
module-level threading.Lock (e.g., crawl_url_state_service_lock) and wrap the
None-check and instantiation of crawl_url_state_service (and honoring/reset
logic) inside lock.acquire()/release or a context manager to prevent races;
ensure you still use the global crawl_url_state_service and instantiate
CrawlUrlStateService(supabase_client) only when appropriate.
- Around line 20-28: The __init__ of CrawlUrlStateService currently lacks a type
annotation for the supabase_client parameter; update the signature of __init__
to annotate supabase_client as either the supabase Client type (import Client
from the supabase library) or as typing.Optional[Client] (or typing.Any if the
exact client class is uncertain), and add a corresponding attribute annotation
for self.supabase_client (e.g., self.supabase_client: Optional[Client]) so
MyPy/IDE tooling can validate uses of get_supabase_client() and methods on
supabase_client.
- Around line 74-76: The except blocks (e.g., the one around safe_logfire_error
in crawl_url_state_service.py and specifically in the _update_status method)
must log contextual identifiers and full stack traces: replace the bare
safe_logfire_error(f"...{e}") pattern with two actions — call safe_logfire_error
with a concise message and then call the module logger (e.g., logger.error) with
a descriptive message that includes source_id and url (or other relevant IDs)
and pass exc_info=True so the full exception stack is preserved; also narrow the
except to specific exception types where feasible and re-raise with "raise" (or
"raise from e" if chaining is desired). Ensure these changes are applied to the
other catch sites mentioned (around lines 150-152, 177-179, 249-251, 276-278,
299-301, 318-320) and reference the same symbols (safe_logfire_error and
_update_status) when editing.

In `@python/src/server/services/crawling/document_storage_operations.py`:
- Around line 69-78: The comment "Initialize URL state tracking if enabled" is
misleading because no flag is checked; update the code to either actually check
a feature flag or change the comment to accurately reflect behavior.
Specifically, either add a conditional around get_crawl_url_state_service/
url_state_service.initialize_urls that checks the existing feature flag (or
config) used elsewhere, or replace the comment with a factual one such as
"Initialize URL state tracking" so it matches the unconditional calls to
get_crawl_url_state_service and url_state_service.initialize_urls; ensure
references to url_state_service, get_crawl_url_state_service, and
initialize_urls remain correct.

In `@python/src/server/services/storage/code_storage_service.py`:
- Line 636: Replace the redundant conditional that uses "{code[:1500] if
len(code) > 1500 else code}" with the simpler slice "code[:1500]"; update the
occurrence that references the variable "code" (in code_storage_service.py) so
it always uses the safe slicing expression "code[:1500]" since Python slicing
handles shorter strings without error.

In `@python/tests/integration/test_code_summary_prompt_quick.py`:
- Around line 132-133: Move the local import of the standard library module
traceback out of the except block and place it at module level near the other
imports in python/tests/integration/test_code_summary_prompt_quick.py; locate
the except block that currently does "import traceback" and remove that line,
then add "import traceback" with the other top-level imports so functions/tests
(e.g. the failing test function containing the except) can use traceback without
importing inside the exception handler.
- Around line 130-142: The current except block appends failures to the results
list but never fails the test; update the end of test_prompt_directly to assert
that there are no failed runs by inspecting results (e.g., build failures = [r
for r in results if not r["success"]]) and raise an AssertionError or use
pytest.fail with a concise summary of failures (including name, language, and
error) if any exist; keep the existing try/except aggregation but add this final
assertion gate to ensure pytest detects failures.

In `@python/tests/integration/test_crawl_validation.py`:
- Around line 232-238: Move the inline "import traceback" out of the except
blocks and add a single module-level "import traceback" with the other
top-of-file imports; then remove the two inline imports (the one inside the
except that sets result["status"] = "error" and the duplicate later) so
traceback is imported only once at the top and used from there.
- Around line 80-82: Update the failure check to consider the progress "status"
field as well as the "error" field and raise with the full progress context;
specifically, in the block using progress (the variable named progress in
test_crawl_validation.py), detect failure when progress.get("error") is truthy
OR progress.get("status") == "error", and then raise an exception that includes
the entire progress object (e.g., include progress in the message) instead of
raising a bare Exception with only progress.get('error').
- Around line 22-23: The test hardcodes API_BASE = "http://localhost:8181" which
breaks in other environments; change API_BASE to read from an environment
variable with a default (use os.environ.get("API_BASE",
"http://localhost:8181")) and add an import for os at the top of
python/tests/integration/test_crawl_validation.py so tests can override API_BASE
in CI or Docker without changing code.

In `@python/tests/prompts/README.md`:
- Line 25: Replace the hard-coded "(lines 631-643)" reference with a stable
symbol reference to the function or method in
src/server/services/storage/code_storage_service.py (e.g., use the class/method
name such as CodeStorageService.<method_name> or the exact def name) so readers
can find the code even as line numbers change; update the README.md text to use
the function/method name in backticks and remove the numeric line-range.

In `@python/tests/test_crawl_url_state_service.py`:
- Around line 328-373: The tests test_pending_to_fetched_transition and
test_fetched_to_embedded_transition in TestStateTransitionLogic are redundant
with existing tests (test_marks_url_as_fetched and test_marks_url_as_embedded);
either remove these two methods or convert them into a true multi-step
integration test that uses the same source_id/url and calls
url_state_service.mark_fetched(...) then asserts the intermediate state,
followed by url_state_service.mark_embedded(...) and asserts the final state
(ensure you reference the same URL and use the same service instance:
url_state_service.mark_fetched and url_state_service.mark_embedded); update or
delete the redundant test methods accordingly to avoid duplicate coverage.
- Line 7: Remove the unused import "patch" from the test module import line
(keep MagicMock if still used) or replace it by using patch where intended;
specifically update the import statement that currently reads "from
unittest.mock import MagicMock, patch" to drop "patch" (or use "patch" in tests
if that was intended) so there are no unused imports.

coderabbitai · 2026-02-22T08:30:25Z

+
+# Prompt regression tests
+uv run python tests/prompts/test_code_summary_prompt.py  # Test code summary prompt
+uv run pytest tests/prompts/ -v                           # Run all prompt tests with pytest
+```
+
+### Prompt Regression Tests
+
+**Location**: `python/tests/prompts/`
+**Documentation**: `@PRPs/ai_docs/CODE_SUMMARY_PROMPT.md`
+
+Regression tests for AI prompts used in production. These ensure prompt changes don't break output structure or quality.
+
+**When to run**:
+- Before merging prompt changes
+- When updating LLM providers or models
+- As part of CI/CD pipeline
+- When debugging summary/output quality issues
+
+See `python/tests/prompts/README.md` for details on adding new prompt tests.
 ```


⚠️ Potential issue | 🟡 Minor

Orphaned closing fence at Line 173 breaks all downstream markdown rendering.

The Quick Workflows ```bash block originally closed at Line 173 (pre-existing ```). The new changes insert a closing fence at Line 157, correctly ending the bash block. However, the pre-existing ``` at Line 173 remains and is now orphaned — it opens a new unlanguaged code block, causing the entire ## Architecture Overview section and everything below it to render as raw code.

Fix: remove the pre-existing ``` at Line 173, since Line 157 already properly closes the bash block.

🔧 Proposed fix

See `python/tests/prompts/README.md` for details on adding new prompt tests. -``` ## Architecture Overview

🧰 Tools

🪛 markdownlint-cli2 (0.21.0)

[warning] 173-173: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@CLAUDE.md` around lines 153 - 173, Remove the stray closing code-fence that follows the "Prompt Regression Tests" / Quick Workflows bash block so there's only one closing ``` for that block; specifically, delete the orphaned ``` that currently precedes the "## Architecture Overview" heading so the bash block closes correctly and the rest of the document renders as normal.

coderabbitai · 2026-02-22T08:30:25Z

+    created_at TIMESTAMPTZ DEFAULT now(),
+    updated_at TIMESTAMPTZ DEFAULT now(),


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Search for updated_at related triggers and patterns find . -name "*.sql" -type f | head -20

Repository: coleam00/Archon

Length of output: 847

🏁 Script executed:

# Check for update triggers in migration files rg "update_updated_at|BEFORE UPDATE" --type sql

Repository: coleam00/Archon

Length of output: 1694

🏁 Script executed:

# Search for how Python handles updated_at in service code find . -path "*/python/src/server/*" -name "*.py" -type f | xargs grep -l "updated_at" 2>/dev/null | head -10

Repository: coleam00/Archon

Length of output: 637

🏁 Script executed:

# Check the migration file structure around other tables find . -name "*.sql" -path "*/migration/*" -type f | sort

Repository: coleam00/Archon

Length of output: 847

🏁 Script executed:

# Read the migration file to check if it includes the trigger cat ./migration/0.1.0/012_add_crawl_url_state.sql

Repository: coleam00/Archon

Length of output: 2249

🏁 Script executed:

# Check the crawl_url_state_service.py to see how it handles updates cat ./python/src/server/services/crawling/crawl_url_state_service.py

Repository: coleam00/Archon

Length of output: 10346

Add updated_at trigger to match project pattern and prevent accidental omission.

The column defaults to now() at insert time but lacks the auto-update trigger used consistently on other tables (archon_settings, archon_projects, archon_tasks, archon_prompts, archon_configured_repositories, archon_agent_work_orders). While the Python service currently sets updated_at explicitly on every update operation, the trigger provides a defensive layer against future code paths that might forget to do so.

Add the following after the table creation:

Suggested trigger

CREATE TRIGGER set_updated_at_crawl_url_state BEFORE UPDATE ON archon_crawl_url_state FOR EACH ROW EXECUTE FUNCTION update_updated_at_column();

(The update_updated_at_column() function is already defined in migration/complete_setup.sql)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@migration/0.1.0/012_add_crawl_url_state.sql` around lines 21 - 22, Add the missing UPDATE trigger for updated_at on the archon_crawl_url_state table: after the table definition that contains the updated_at TIMESTAMPTZ DEFAULT now() column, create a trigger named set_updated_at_crawl_url_state that fires BEFORE UPDATE ON archon_crawl_url_state FOR EACH ROW and EXECUTE FUNCTION update_updated_at_column(); this mirrors the pattern used for archon_settings, archon_projects, etc., and ensures updated_at is auto-updated even if the application layer omits it.

coderabbitai · 2026-02-22T08:30:25Z

+-- RLS Policy: Service role has full access
+CREATE POLICY "Service role full access to crawl_url_state" ON archon_crawl_url_state
+    FOR ALL USING (true) WITH CHECK (true);


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # First, view the specific migration file lines in question cat -n migration/0.1.0/012_add_crawl_url_state.sql | sed -n '40,50p'

Repository: coleam00/Archon

Length of output: 374

🏁 Script executed:

#!/bin/bash # Check all RLS policies in SQL files for consistency and pattern rg -n "CREATE POLICY" --type sql -A 1 | head -100

Repository: coleam00/Archon

Length of output: 5445

🏁 Script executed:

#!/bin/bash # Look for Supabase/PostgreSQL RLS policies in the codebase find . -name "*.sql" -type f -exec grep -l "CREATE POLICY" {} \; | head -20

Repository: coleam00/Archon

Length of output: 271

RLS policy grants access to all roles, not just service_role.

USING (true) WITH CHECK (true) without role restriction allows any authenticated role (including anon) to read/write this table. Other policies in the codebase restrict service role access using either USING (auth.role() = 'service_role') or TO service_role. Restrict this policy consistently:

-CREATE POLICY "Service role full access to crawl_url_state" ON archon_crawl_url_state - FOR ALL USING (true) WITH CHECK (true); +CREATE POLICY "Service role full access to crawl_url_state" ON archon_crawl_url_state + FOR ALL TO service_role USING (true) WITH CHECK (true);

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-- RLS Policy: Service role has full access

CREATE POLICY "Service role full access to crawl_url_state" ON archon_crawl_url_state

FOR ALL USING (true) WITH CHECK (true);

-- RLS Policy: Service role has full access

CREATE POLICY "Service role full access to crawl_url_state" ON archon_crawl_url_state

FOR ALL TO service_role USING (true) WITH CHECK (true);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@migration/0.1.0/012_add_crawl_url_state.sql` around lines 43 - 45, The current RLS policy "Service role full access to crawl_url_state" on archon_crawl_url_state uses USING (true) WITH CHECK (true) and thus grants access to any role; update the policy to restrict it to the service role by either replacing the predicate with USING (auth.role() = 'service_role') WITH CHECK (auth.role() = 'service_role') or by adding TO service_role so only the service role can read/write; modify the policy definition for "Service role full access to crawl_url_state" accordingly.

coderabbitai · 2026-02-22T08:30:25Z

+3. **Run the test**:
+   ```bash
+   uv run python test_code_summary_prompt.py ollama
+   ```


⚠️ Potential issue | 🟡 Minor

Inconsistent test file path.

Line 134 uses test_code_summary_prompt.py without a directory path, while lines 51–58 correctly reference tests/prompts/test_code_summary_prompt.py. This will fail if run from the python/ directory as documented.

📝 Suggested fix

3. **Run the test**: ```bash - uv run python test_code_summary_prompt.py ollama + uv run python tests/prompts/test_code_summary_prompt.py ollama ```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@PRPs/ai_docs/CODE_SUMMARY_PROMPT.md` around lines 132 - 135, The run command in the markdown uses the wrong test path ("uv run python test_code_summary_prompt.py ollama"); update this to point at the tests/prompts test file used elsewhere by replacing that string with "uv run python tests/prompts/test_code_summary_prompt.py ollama" so the command runs from the python/ directory and matches the other references to tests/prompts/test_code_summary_prompt.py.

coderabbitai · 2026-02-22T08:30:25Z

+Tracks per-URL crawl progress to enable checkpoint/resume functionality.
+"""
+
+from datetime import UTC


⚠️ Potential issue | 🔴 Critical

UTC is a timezone, not a timestamp — every created_at/updated_at field will store a datetime.timezone.utc object instead of the current time.

from datetime import UTC imports datetime.timezone.utc, which is a timezone constant, not a datetime value. Assigning now = UTC (line 45) and using "updated_at": UTC (lines 143, 168) writes the timezone sentinel into Supabase rather than the current timestamp. This will either corrupt rows or raise a serialization error at runtime.

Proposed fix

-from datetime import UTC +from datetime import UTC, datetime

Then everywhere UTC is used as a value, replace with datetime.now(UTC):

- now = UTC + now = datetime.now(UTC).isoformat()

- "updated_at": UTC, + "updated_at": datetime.now(UTC).isoformat(),

- update_data = {"status": status, "updated_at": UTC} + update_data = {"status": status, "updated_at": datetime.now(UTC).isoformat()}

As per coding guidelines: "Never accept corrupted data - skip failed items entirely rather than storing corrupted or incomplete data."

Also applies to: 45-45

coderabbitai · 2026-02-22T08:30:25Z

+        # Mark URLs as embedded after successful storage
+        if unique_doc_urls:
+            try:
+                for doc_url in unique_doc_urls:
+                    url_state_service.mark_embedded(original_source_id, doc_url)
+                safe_logfire_info(f"Marked {len(unique_doc_urls)} URLs as embedded")
+            except Exception as e:
+                safe_logfire_error(f"Failed to mark URLs as embedded: {e}")


⚠️ Potential issue | 🟠 Major

mark_embedded incorrectly marks URLs that were never actually embedded — corrupts Phase 2 resume state.

unique_doc_urls (Line 71–72) is built from ALL crawl_results entries with a non-empty URL, including documents whose markdown was empty. Those documents are skipped during the processing loop (Lines 107–110: continue) and are never chunked, never added to all_urls, and never passed to add_documents_to_supabase. Calling mark_embedded for them nonetheless marks them as embedded in archon_crawl_url_state.

When Phase 2 resume logic checks this state to skip already-embedded URLs, it will incorrectly skip these never-embedded URLs, causing silent data loss.

Fix: replace unique_doc_urls at Line 292 with list(set(all_urls)) — all_urls is only populated for documents that passed the content check (Line 140) and were actually chunked.

🐛 Proposed fix

- # Mark URLs as embedded after successful storage - if unique_doc_urls: + # Mark only successfully chunked URLs as embedded + embedded_urls = list(set(all_urls)) + if embedded_urls: try: - for doc_url in unique_doc_urls: + for doc_url in embedded_urls: url_state_service.mark_embedded(original_source_id, doc_url) - safe_logfire_info(f"Marked {len(unique_doc_urls)} URLs as embedded") + safe_logfire_info(f"Marked {len(embedded_urls)} URLs as embedded") except Exception as e: safe_logfire_error(f"Failed to mark URLs as embedded: {e}")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Mark URLs as embedded after successful storage

if unique_doc_urls:

try:

for doc_url in unique_doc_urls:

url_state_service.mark_embedded(original_source_id, doc_url)

safe_logfire_info(f"Marked {len(unique_doc_urls)} URLs as embedded")

except Exception as e:

safe_logfire_error(f"Failed to mark URLs as embedded: {e}")

# Mark only successfully chunked URLs as embedded

embedded_urls = list(set(all_urls))

if embedded_urls:

try:

for doc_url in embedded_urls:

url_state_service.mark_embedded(original_source_id, doc_url)

safe_logfire_info(f"Marked {len(embedded_urls)} URLs as embedded")

except Exception as e:

safe_logfire_error(f"Failed to mark URLs as embedded: {e}")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/crawling/document_storage_operations.py` around lines 291 - 298, The call to url_state_service.mark_embedded currently uses unique_doc_urls (which includes URLs for documents that were skipped and never embedded), corrupting resume state; update the post-storage marking to iterate over the deduplicated list of actually-embedded URLs by replacing unique_doc_urls with list(set(all_urls)) so only URLs collected in the processing loop (all_urls) are marked embedded; look for the block in document_storage_operations.py that calls url_state_service.mark_embedded and the variables unique_doc_urls and all_urls within the same function to apply this change.

coderabbitai · 2026-02-22T08:30:25Z

+# Test results (generated)
+*_results.json


⚠️ Potential issue | 🟡 Minor

Missing .gitignore in python/tests/prompts/

python/tests/prompts/test_code_summary_prompt.py (Line 213) writes code_summary_test_results.json to python/tests/prompts/. This .gitignore only covers python/tests/integration/, so that output file is unprotected and would be committed.

Add python/tests/prompts/.gitignore with the same pattern, or consolidate at python/tests/.gitignore.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/tests/integration/.gitignore` around lines 1 - 2, The tests are writing code_summary_test_results.json into python/tests/prompts/ but only python/tests/integration/ is ignored; add a .gitignore to prevent committing that output by either creating python/tests/prompts/.gitignore containing the same pattern (*_results.json or specifically code_summary_test_results.json) or consolidate by adding *_results.json to python/tests/.gitignore; update the repository by committing the new/updated .gitignore so python/tests/prompts/test_code_summary_prompt.py output is ignored.

coderabbitai · 2026-02-22T08:30:25Z

+    try:
+        from src.server.services.storage.code_storage_service import (
+            _generate_code_example_summary_async,
+        )
+    except ImportError as e:
+        print(f"\n❌ Failed to import code summary function: {e}")
+        print("\nPlease ensure you're running from the python/ directory")
+        return


⚠️ Potential issue | 🟡 Minor

Silent return on ImportError means pytest always reports PASSED even when the import fails.

return (implicitly returns None) after catching ImportError is invisible to pytest; the test appears green. Use pytest.skip for a missing environment, or re-raise to surface the real error.

🔧 Proposed fix

except ImportError as e: print(f"\n❌ Failed to import code summary function: {e}") - print("\nPlease ensure you're running from the python/ directory") - return + pytest.skip(f"Cannot import code summary function: {e}")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

try:

from src.server.services.storage.code_storage_service import (

_generate_code_example_summary_async,

)

except ImportError as e:

print(f"\n❌ Failed to import code summary function: {e}")

print("\nPlease ensure you're running from the python/ directory")

return

try:

from src.server.services.storage.code_storage_service import (

_generate_code_example_summary_async,

)

except ImportError as e:

print(f"\n❌ Failed to import code summary function: {e}")

pytest.skip(f"Cannot import code summary function: {e}")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/tests/integration/test_code_summary_prompt_quick.py` around lines 70 - 77, The test currently swallows ImportError and does a plain return, masking failures; update the import error handling in python/tests/integration/test_code_summary_prompt_quick.py where _generate_code_example_summary_async is imported: replace the bare return with either pytest.skip("dependency missing: ...") (after importing pytest) when the import/environment is optional, or re-raise the caught ImportError to fail the test and surface the real error; ensure the ImportError handler uses the chosen approach and includes the original exception message for context.

coderabbitai · 2026-02-22T08:30:25Z

+async def test_single_summary(sample: dict, provider: str = None):
+    """Test summary generation for a single code sample."""
+    print(f"\n{'='*80}")
+    print(f"Testing: {sample['name']}")
+    print(f"Language: {sample['language']}")
+    print(f"{'='*80}")
+
+    print(f"\nCode snippet (first 200 chars):")
+    print(f"{sample['code'][:200]}...")
+
+    try:
+        result = await _generate_code_example_summary_async(
+            code=sample['code'],
+            context_before=sample['context_before'],
+            context_after=sample['context_after'],
+            language=sample['language'],
+            provider=provider
+        )
+
+        print(f"\n✅ SUCCESS - Generated summary:")
+        print(f"   Example Name: {result['example_name']}")
+        print(f"   Summary: {result['summary']}")
+
+        # Verify JSON structure
+        assert 'example_name' in result, "Missing 'example_name' field"
+        assert 'summary' in result, "Missing 'summary' field"
+        assert len(result['example_name']) > 0, "Empty 'example_name'"
+        assert len(result['summary']) > 0, "Empty 'summary'"
+
+        # Check if summary follows the structured format
+        has_purpose = 'PURPOSE:' in result['summary'].upper() or 'purpose' in result['summary'].lower()
+        has_params = 'PARAMETERS:' in result['summary'].upper() or 'parameter' in result['summary'].lower()
+        has_use = 'USE WHEN:' in result['summary'].upper() or 'use' in result['summary'].lower()
+
+        structure_score = sum([has_purpose, has_params, has_use])
+        print(f"   Structure indicators: {structure_score}/3 (PURPOSE/PARAMETERS/USE WHEN)")
+
+        return True, result
+
+    except Exception as e:
+        print(f"\n❌ FAILED with error:")
+        print(f"   {type(e).__name__}: {str(e)}")
+        return False, None


⚠️ Potential issue | 🟡 Minor

test_single_summary causes a pytest collection ERROR and silently swallows assertion failures.

Two issues:

Pytest collection error: test_single_summary(sample: dict, ...) starts with test_ so pytest tries to collect it. sample has no default and no matching pytest fixture, causing ERROR: fixture 'sample' not found — the uv run pytest tests/prompts/ -v command documented in both CLAUDE.md and README.md is broken.

Silent failures: Assertions at lines 142–145 are inside try/except Exception, so any assertion failure is caught and returned as (False, None), invisible to pytest.

Rename the helper to avoid collection, and let assertion errors propagate:

🔧 Proposed fix

-async def test_single_summary(sample: dict, provider: str = None): +async def run_single_summary(sample: dict, provider: str = None): """Test summary generation for a single code sample.""" ... try: result = await _generate_code_example_summary_async(...) - # Verify JSON structure - assert 'example_name' in result, "Missing 'example_name' field" - assert 'summary' in result, "Missing 'summary' field" - assert len(result['example_name']) > 0, "Empty 'example_name'" - assert len(result['summary']) > 0, "Empty 'summary'" + assert 'example_name' in result, "Missing 'example_name' field" + assert 'summary' in result, "Missing 'summary' field" + assert len(result['example_name']) > 0, "Empty 'example_name'" + assert len(result['summary']) > 0, "Empty 'summary'" ... return True, result except AssertionError: raise # Let pytest see assertion failures except Exception as e: ... return False, None

And update all call sites from test_single_summary → run_single_summary.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/tests/prompts/test_code_summary_prompt.py` around lines 118 - 160, The helper function test_single_summary is being collected by pytest because it starts with test_ and its parameter sample has no fixture, and its try/except swallows assertion failures; rename the function (e.g., test_single_summary -> run_single_summary) so pytest won't collect it, remove or lift the broad try/except in run_single_summary so assertion errors propagate to pytest, and update all call sites that referenced test_single_summary to the new name; keep the original signature (code/context/language/provider) but ensure no pytest-only fixtures are expected.

coderabbitai · 2026-02-22T08:30:25Z

+    def test_marks_url_as_failed_after_max_retries(self, url_state_service, mock_client):
+        """URL marked as failed after exceeding max retries."""
+        mock_select_result = MagicMock()
+        mock_select_result.data = [{"retry_count": 3, "max_retries": 3}]
+        mock_client.table.return_value.select.return_value.match.return_value.execute.return_value = mock_select_result
+
+        result = url_state_service.mark_failed("source-1", "https://example.com/page1", "Connection timeout")
+
+        assert result is True
+
+        update_call = mock_client.table.return_value.update.return_value.match.return_value
+        update_call.execute.assert_called()
+
+    def test_increments_retry_count_below_max(self, url_state_service, mock_client):
+        """Retry count incremented when under max retries."""
+        mock_select_result = MagicMock()
+        mock_select_result.data = [{"retry_count": 1, "max_retries": 3}]
+        mock_client.table.return_value.select.return_value.match.return_value.execute.return_value = mock_select_result
+
+        result = url_state_service.mark_failed("source-1", "https://example.com/page1", "Connection timeout")
+
+        assert result is True
+
+        update_call = mock_client.table.return_value.update.return_value.match.return_value
+        update_call.execute.assert_called()


🛠️ Refactor suggestion | 🟠 Major

mark_failed tests don't assert the status value written to the DB.

Both tests verify that execute was called but don't check what was passed to update(...). For test_marks_url_as_failed_after_max_retries (retry_count == max_retries), you should assert the update payload contains "status": "failed". For test_increments_retry_count_below_max (retry_count < max_retries), assert "status": "pending" and "retry_count": 2.

Without these assertions, the tests pass even if the service writes the wrong status.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/tests/test_crawl_url_state_service.py` around lines 145 - 169, Add assertions in the two tests to verify the payload passed to the DB update call: for test_marks_url_as_failed_after_max_retries assert that the update payload includes "status": "failed"; for test_increments_retry_count_below_max assert the payload includes "status": "pending" and "retry_count": 2. Locate the mocked update by referencing mock_client.table.return_value.update.return_value.match.return_value (or the update(...) call on mock_client.table()) and assert the call arguments (e.g., via assert_called_with or inspecting call_args) to ensure the correct dict was written.

- Merge ADR-001 into ADR-002, mark ADR-001 as superseded - Add provenance fields to backend API responses (embedding_model, embedding_provider, vectorizer_settings, summarization_model, etc.) - Add VectorizerSettings type and provenance fields to frontend types - Add collapsible 'Processing Details' section to KnowledgeCard showing embedding provider/model, dimensions, summarization model, and timestamps - Add implementation status tracking in ADR-002-IMPLEMENTATION-STATUS.md

- Add checkpoint/resume filtering to crawling service (skip already-embedded URLs) - Add provenance tracking to source management (embedding model, provider, dimensions, vectorizer settings, summarization model, timestamps) - Add migration 013_add_provenance_tracking.sql to add provenance columns - Update document storage operations to capture and store provenance - Update recursive crawling strategy to support resume All changes pass existing tests.

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

python/src/server/services/crawling/strategies/recursive.py (1)

164-174: ⚠️ Potential issue | 🔴 Critical

Embedded URLs are not normalized before being added to visited, causing resume to miss matches.

URLs from get_embedded_urls are added to visited as-is (line 168), but all subsequent membership checks use normalize_url (fragment-stripping via urldefrag) — see line 200. If any stored URL contains a fragment (e.g., #section), the normalized form won't match the raw form in visited, and the URL will be re-crawled despite being already embedded.

Proposed fix — normalize embedded URLs

         # If resume filtering is enabled, pre-populate visited with already-embedded URLs
         if url_state_service and source_id:
             embedded_urls = url_state_service.get_embedded_urls(source_id)
             if embedded_urls:
-                visited.update(embedded_urls)
+                visited.update(normalize_url(u) for u in embedded_urls)
                 logger.info(f"Resume filtering: pre-loaded {len(embedded_urls)} already-embedded URLs")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/crawling/strategies/recursive.py` around lines 164
- 174, Normalize embedded URLs before adding them to the visited set: when
url_state_service.get_embedded_urls(source_id) returns embedded_urls, map each
through the existing normalize_url (the urldefrag wrapper) before calling
visited.update(...). Update the block that handles resume filtering (the code
that calls get_embedded_urls and logger.info) so it uses the normalized forms to
ensure membership checks later in the crawl (where normalize_url is used) will
correctly detect already-embedded URLs.

python/src/server/services/knowledge/knowledge_item_service.py (1)

166-182: ⚠️ Potential issue | 🟠 Major

**source_metadata spread at the end silently overrides computed values.

In the list_items method, **source_metadata on line 181 is placed after the explicit keys (knowledge_type, tags, source_type, status, description, etc.). If source_metadata contains any of those keys, the raw DB values will silently overwrite the computed/defaulted values above.

Compare with _transform_source_to_item (line 376) where the spread is correctly placed first with a comment explaining the intent.

Proposed fix — move the spread to the beginning

                     "metadata": {
+                        **source_metadata,
                         "knowledge_type": source_metadata.get("knowledge_type", "technical"),
                         "tags": source_metadata.get("tags", []),
                         "source_type": source_type,
                         "status": "active",
                         "description": source_metadata.get("description", source.get("summary", "")),
                         "chunks_count": chunks_count,
                         "word_count": source.get("total_word_count", 0),
                         "estimated_pages": round(source.get("total_word_count", 0) / 250, 1),
                         "pages_tooltip": f"{round(source.get('total_word_count', 0) / 250, 1)} pages (≈ {source.get('total_word_count', 0):,} words)",
                         "last_scraped": source.get("updated_at"),
                         "file_name": source_metadata.get("file_name"),
                         "file_type": source_metadata.get("file_type"),
                         "update_frequency": source_metadata.get("update_frequency", 7),
                         "code_examples_count": code_examples_count,
-                        **source_metadata,
                     },

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/knowledge/knowledge_item_service.py` around lines
166 - 182, In list_items in knowledge_item_service.py the final
"**source_metadata" spread overrides the computed metadata keys; move the spread
so source_metadata is merged first (i.e., put **source_metadata at the start of
the metadata dict) so the explicit computed keys like "knowledge_type", "tags",
"source_type", "status", "description", "chunks_count", "word_count",
"estimated_pages", "pages_tooltip", "last_scraped", "file_name", "file_type",
"update_frequency", and "code_examples_count" take precedence; mirror the
ordering used in _transform_source_to_item and keep a short comment explaining
that computed values should override raw DB metadata.

🧹 Nitpick comments (5)

python/src/server/services/knowledge/knowledge_item_service.py (1)

450-462: _get_chunks_count fetches full rows when only the count is needed.

Using .select("*", count="exact") without head=True retrieves all matching row data just to read the count. Add head=True to issue a HEAD-style request, consistent with how the count query is done in list_items (line 62).

Proposed fix

             result = (
                 self.supabase.table("archon_crawled_pages")
-                .select("*", count="exact")
+                .select("*", count="exact", head=True)
                 .eq("source_id", source_id)
                 .execute()
             )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/knowledge/knowledge_item_service.py` around lines
450 - 462, The _get_chunks_count function issues a full SELECT
(self.supabase.table("archon_crawled_pages").select("*", count="exact")) which
retrieves rows unnecessarily; change the query to use a HEAD-style request by
adding head=True (consistent with list_items) so only the count is returned, and
return result.count or 0 as before. Ensure you update the call inside
_get_chunks_count and keep the same result handling logic.

python/src/server/services/source_management_service.py (1)

297-313: Duplicate provenance block — extract a helper and move datetime import to module level.

The provenance-field block (lines 297–313) is duplicated nearly verbatim at lines 378–394. Also, from datetime import datetime is imported inline in both branches despite UTC already being imported at the module top (line 8).

Move from datetime import datetime to the top-level imports alongside UTC.
Extract the provenance fields + timestamp logic into a small helper to avoid the duplication.

Proposed refactor

Add to top-level imports:

-from datetime import UTC
+from datetime import UTC, datetime

Extract a helper (e.g., near line 215):

def _add_provenance_fields(
    upsert_data: dict,
    *,
    embedding_model: str | None,
    embedding_dimensions: int | None,
    embedding_provider: str | None,
    vectorizer_settings: dict | None,
    summarization_model: str | None,
) -> None:
    """Add provenance tracking fields to an upsert payload."""
    if embedding_model:
        upsert_data["embedding_model"] = embedding_model
    if embedding_dimensions is not None:
        upsert_data["embedding_dimensions"] = embedding_dimensions
    if embedding_provider:
        upsert_data["embedding_provider"] = embedding_provider
    if vectorizer_settings is not None:
        upsert_data["vectorizer_settings"] = vectorizer_settings
    if summarization_model:
        upsert_data["summarization_model"] = summarization_model
    now = datetime.now(UTC).isoformat()
    upsert_data["last_crawled_at"] = now
    upsert_data["last_vectorized_at"] = now

Then replace both inline blocks with a single call to _add_provenance_fields(upsert_data, ...).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/source_management_service.py` around lines 297 -
313, Extract the duplicated provenance/timestamp logic into a helper named
_add_provenance_fields that accepts upsert_data and the provenance params
(embedding_model, embedding_dimensions, embedding_provider, vectorizer_settings,
summarization_model) and sets the fields plus both last_crawled_at and
last_vectorized_at to the same datetime string; replace both inline blocks with
calls to _add_provenance_fields(upsert_data, ...). Also move from datetime
import datetime to the module-level imports (next to UTC) so the helper and
callers use the top-level datetime import rather than importing inline.

python/src/server/services/crawling/strategies/recursive.py (1)

45-46: url_state_service typed as Any — consider using a narrower type.

Using Any loses type safety. A Protocol or the concrete CrawlUrlStateService type (with a lazy/string-forward-ref import to avoid circular deps) would catch misuse at type-check time.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/crawling/strategies/recursive.py` around lines 45
- 46, The parameter url_state_service is currently annotated as Any which loses
type safety; replace it with a narrower type such as CrawlUrlStateService | None
(or a Protocol describing the methods you use) and use a forward reference or
TYPE_CHECKING import to avoid circular imports. Concretely, update the signature
for the function/class in recursive.py to use CrawlUrlStateService | None (or a
custom Protocol name) instead of Any, add a conditional import under
typing.TYPE_CHECKING (or use a string annotation) so the type checker sees
CrawlUrlStateService without creating runtime import cycles, and ensure the code
uses only the methods declared on that type/Protocol.

python/src/server/services/crawling/crawling_service.py (2)

897-899: Unparameterized tuple return type annotation.

-> tuple doesn't convey the shape of the return value. Consider -> tuple[list[dict[str, Any]], str | None] so callers and type checkers know what to expect.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/crawling/crawling_service.py` around lines 897 -
899, The return type annotation for _crawl_by_url_type is uninformative; change
its signature from -> tuple to a precise type such as -> tuple[list[dict[str,
Any]], str | None] (or the appropriate concrete element types) so callers and
type checkers know the return shape; update any imports (typing.Any already
used) if needed and ensure callers that unpack the returned tuple still match
the new annotated shape.

269-270: Use a concrete type instead of Any for url_state_service.

url_state_service: Any | None = None loses all type safety. Consider importing the actual type (e.g., CrawlUrlStateService) or using a Protocol for the expected interface.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/crawling/crawling_service.py` around lines 269 -
270, The parameter url_state_service on the function/signature in
crawling_service.py is typed as Any which removes type safety; replace it with
the concrete service type (e.g., CrawlUrlStateService) or a Protocol that
defines the methods you use, import that type at top, and update the parameter
annotation from url_state_service: Any | None = None to url_state_service:
CrawlUrlStateService | None = None (or the Protocol name) so callers and static
checkers can validate usage; ensure any references to url_state_service methods
inside the class match the chosen interface.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ADR-002`: Crawl Reliability, Ingestion Quality Control & DB Validation.md:
- Around line 53-66: The ADR example's crawl_url_state schema references
knowledge_sources(id) and uses UUID for source_id which doesn't match the
codebase (archon_sources with source_id TEXT); update the example SQL for
crawl_url_state to match the real migration by changing the reference to
archon_sources and the column type to TEXT (e.g., source_id TEXT NOT NULL
REFERENCES archon_sources(source_id)) or remove the FK and keep the same TEXT
column as in the codebase so the ADR mirrors the actual schema used by functions
that rely on crawl_url_state and archon_sources.

In `@archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx`:
- Around line 330-338: The Vectorizer row currently uses truthy checks which
render "0" for chunk_size===0 and can produce an empty <span>; update the
KnowledgeCard rendering logic for item.vectorizer_settings to build a list of
display parts: include chunk when chunk_size is a number (e.g., typeof
item.vectorizer_settings.chunk_size === "number"), include "contextual" when
item.vectorizer_settings.use_contextual is true, and include "hybrid" when
item.vectorizer_settings.use_hybrid is true, then only render the entire row
(the surrounding <div> and <span>) if the parts array has length > 0; refer to
the item.vectorizer_settings, chunk_size, use_contextual, and use_hybrid
identifiers to locate the code.
- Around line 340-351: The date rendering for item.last_crawled_at and
item.last_vectorized_at in KnowledgeCard.tsx uses format(new Date(...))
unguarded and can throw on invalid inputs; wrap each rendering in a safe check
(e.g., parse the string and verify with date-fns isValid or wrap format(new
Date(...)) in a try/catch) so you only render a formatted string when valid,
otherwise render a fallback (e.g., "Unknown" or omit); alternatively, extract a
shared safeDateFormat utility (safeDateFormat(dateStr): string | null) and call
it from the KnowledgeCard footer and these provenance lines to DRY the logic and
ensure consistent fallback behavior for invalid dates.

In `@python/src/server/services/crawling/crawling_service.py`:
- Around line 364-384: Wrap the resume-state logic that calls
get_crawl_url_state_service(...), has_existing_state(original_source_id),
get_crawl_state(original_source_id), and clear_state(original_source_id) in a
try/except so any DB error does not abort the crawl; on exception log a warning
via safe_logfire_info and continue. Before calling clear_state, verify the
crawl_state includes a "fetched" count (or at minimum ensure pending==0 and
failed==0 and fetched>=total) to avoid clearing prematurely. If these URL state
methods are synchronous and this module is async, call them off the event loop
(e.g., asyncio.to_thread) or use their async equivalents to avoid blocking.
Ensure all caught exceptions are swallowed after logging so resume-state
failures degrade gracefully.
- Around line 863-895: _filter_already_processed_urls is declared async but
calls the synchronous get_embedded_urls (via
get_crawl_url_state_service(self.supabase_client)) which can block the event
loop; either make the method synchronous or run the blocking call in a thread.
Fix by either changing _filter_already_processed_urls to a regular def (and
adjust callers) or wrap the blocking call with asyncio.to_thread, e.g. await
asyncio.to_thread(url_state_service.get_embedded_urls, source_id), then proceed
with building embedded_set, filtering urls, and calling
safe_logfire_info(progress_id=self.progress_id) as before.

In `@python/src/server/services/crawling/document_storage_operations.py`:
- Around line 376-394: The provenance retrieval currently can raise and abort
the primary update_source_info path; wrap the calls to
credential_service.get_credentials_by_category("embedding") and ("rag_strategy")
and the int(...) conversion for embedding_dimensions in a small try/except block
so failures are caught, logged, and the code continues with safe fallback values
(e.g., embedding_provider=None, embedding_model=None, embedding_dimensions=None
or a validated integer default, summarization_model=None) instead of raising;
ensure this block lives just before building
vectorizer_settings/summarization_model (around the
embedding_config/embedding_dimensions and rag_settings retrieval) and that
update_source_info uses those fallback None/validated values so provenance
fields are omitted but the primary path is not aborted.

In `@python/src/server/services/source_management_service.py`:
- Around line 298-301: The guard `if embedding_dimensions:` drops valid zero
values; change it to `if embedding_dimensions is not None:` when setting
upsert_data["embedding_dimensions"] so 0 is preserved—update the occurrence near
the block that sets upsert_data (the same area that checks embedding_model) and
also apply the identical change at the later occurrence around line ~381 where
embedding_dimensions is checked again (both places in
source_management_service.py that populate upsert_data/vectorizer settings).

---

Outside diff comments:
In `@python/src/server/services/crawling/strategies/recursive.py`:
- Around line 164-174: Normalize embedded URLs before adding them to the visited
set: when url_state_service.get_embedded_urls(source_id) returns embedded_urls,
map each through the existing normalize_url (the urldefrag wrapper) before
calling visited.update(...). Update the block that handles resume filtering (the
code that calls get_embedded_urls and logger.info) so it uses the normalized
forms to ensure membership checks later in the crawl (where normalize_url is
used) will correctly detect already-embedded URLs.

In `@python/src/server/services/knowledge/knowledge_item_service.py`:
- Around line 166-182: In list_items in knowledge_item_service.py the final
"**source_metadata" spread overrides the computed metadata keys; move the spread
so source_metadata is merged first (i.e., put **source_metadata at the start of
the metadata dict) so the explicit computed keys like "knowledge_type", "tags",
"source_type", "status", "description", "chunks_count", "word_count",
"estimated_pages", "pages_tooltip", "last_scraped", "file_name", "file_type",
"update_frequency", and "code_examples_count" take precedence; mirror the
ordering used in _transform_source_to_item and keep a short comment explaining
that computed values should override raw DB metadata.

---

Duplicate comments:
In `@python/src/server/services/crawling/document_storage_operations.py`:
- Around line 74-78: The except block around url_state_service.initialize_urls
should log the full exception stack trace: in the except Exception as e handler
for initialize_urls (the block that currently calls safe_logfire_error), add a
call to the module logger.error with exc_info=True (in addition to
safe_logfire_error) so the failure of url_state_service.initialize_urls is
recorded with the full traceback; keep the existing safe_logfire_error message
and call logger.error(f"Failed to initialize URL state: {e}", exc_info=True).
- Around line 291-298: The code marks entries via
url_state_service.mark_embedded using unique_doc_urls which includes any
non-empty URL from crawl_results (including docs that were skipped because their
markdown was empty), causing corrupt resume state; change the marking loop to
iterate over all_urls (the list populated only for actually-processed documents)
instead of unique_doc_urls, update the safe_logfire_info count to len(all_urls),
and keep the same exception handling (safe_logfire_error) and
original_source_id/mark_embedded calls so only truly-processed URLs are marked
embedded.

---

Nitpick comments:
In `@python/src/server/services/crawling/crawling_service.py`:
- Around line 897-899: The return type annotation for _crawl_by_url_type is
uninformative; change its signature from -> tuple to a precise type such as ->
tuple[list[dict[str, Any]], str | None] (or the appropriate concrete element
types) so callers and type checkers know the return shape; update any imports
(typing.Any already used) if needed and ensure callers that unpack the returned
tuple still match the new annotated shape.
- Around line 269-270: The parameter url_state_service on the function/signature
in crawling_service.py is typed as Any which removes type safety; replace it
with the concrete service type (e.g., CrawlUrlStateService) or a Protocol that
defines the methods you use, import that type at top, and update the parameter
annotation from url_state_service: Any | None = None to url_state_service:
CrawlUrlStateService | None = None (or the Protocol name) so callers and static
checkers can validate usage; ensure any references to url_state_service methods
inside the class match the chosen interface.

In `@python/src/server/services/crawling/strategies/recursive.py`:
- Around line 45-46: The parameter url_state_service is currently annotated as
Any which loses type safety; replace it with a narrower type such as
CrawlUrlStateService | None (or a Protocol describing the methods you use) and
use a forward reference or TYPE_CHECKING import to avoid circular imports.
Concretely, update the signature for the function/class in recursive.py to use
CrawlUrlStateService | None (or a custom Protocol name) instead of Any, add a
conditional import under typing.TYPE_CHECKING (or use a string annotation) so
the type checker sees CrawlUrlStateService without creating runtime import
cycles, and ensure the code uses only the methods declared on that
type/Protocol.

In `@python/src/server/services/knowledge/knowledge_item_service.py`:
- Around line 450-462: The _get_chunks_count function issues a full SELECT
(self.supabase.table("archon_crawled_pages").select("*", count="exact")) which
retrieves rows unnecessarily; change the query to use a HEAD-style request by
adding head=True (consistent with list_items) so only the count is returned, and
return result.count or 0 as before. Ensure you update the call inside
_get_chunks_count and keep the same result handling logic.

In `@python/src/server/services/source_management_service.py`:
- Around line 297-313: Extract the duplicated provenance/timestamp logic into a
helper named _add_provenance_fields that accepts upsert_data and the provenance
params (embedding_model, embedding_dimensions, embedding_provider,
vectorizer_settings, summarization_model) and sets the fields plus both
last_crawled_at and last_vectorized_at to the same datetime string; replace both
inline blocks with calls to _add_provenance_fields(upsert_data, ...). Also move
from datetime import datetime to the module-level imports (next to UTC) so the
helper and callers use the top-level datetime import rather than importing
inline.

coderabbitai · 2026-02-22T09:03:12Z

+```sql
+CREATE TABLE crawl_url_state (
+    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    source_id       UUID NOT NULL REFERENCES knowledge_sources(id),
+    url             TEXT NOT NULL,
+    status          TEXT NOT NULL,  -- pending | fetched | embedded | failed
+    chunk_count     INTEGER,
+    content_hash    TEXT,           -- for duplicate detection
+    error_message   TEXT,
+    created_at      TIMESTAMPTZ DEFAULT now(),
+    updated_at      TIMESTAMPTZ DEFAULT now(),
+    UNIQUE(source_id, url)
+);
+```


⚠️ Potential issue | 🟡 Minor

ADR SQL example references knowledge_sources(id) — actual table is archon_sources.

The example schema shows source_id UUID NOT NULL REFERENCES knowledge_sources(id), but the codebase uses archon_sources with source_id TEXT. This could mislead future implementers reading the ADR. Consider updating the example to match the actual migration.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ADR-002`: Crawl Reliability, Ingestion Quality Control & DB Validation.md around lines 53 - 66, The ADR example's crawl_url_state schema references knowledge_sources(id) and uses UUID for source_id which doesn't match the codebase (archon_sources with source_id TEXT); update the example SQL for crawl_url_state to match the real migration by changing the reference to archon_sources and the column type to TEXT (e.g., source_id TEXT NOT NULL REFERENCES archon_sources(source_id)) or remove the FK and keep the same TEXT column as in the codebase so the ADR mirrors the actual schema used by functions that rely on crawl_url_state and archon_sources.

coderabbitai · 2026-02-22T09:03:12Z

+                  {item.vectorizer_settings && (
+                    <div className="flex items-center gap-2">
+                      <span className="text-gray-400">Vectorizer:</span>
+                      <span>
+                        {item.vectorizer_settings.chunk_size && `chunk=${item.vectorizer_settings.chunk_size}`}
+                        {item.vectorizer_settings.use_contextual && " contextual"}
+                        {item.vectorizer_settings.use_hybrid && " hybrid"}
+                      </span>
+                    </div>


⚠️ Potential issue | 🟡 Minor

chunk_size && ... will render literal 0 when chunk_size is 0.

Using && with a number in JSX is a well-known React gotcha — 0 && <element> renders "0". Additionally, if none of chunk_size, use_contextual, or use_hybrid are truthy, the "Vectorizer:" row renders with an empty <span>.

Proposed fix

{item.vectorizer_settings && ( <div className="flex items-center gap-2"> <span className="text-gray-400">Vectorizer:</span> <span> - {item.vectorizer_settings.chunk_size && `chunk=${item.vectorizer_settings.chunk_size}`} + {item.vectorizer_settings.chunk_size != null && `chunk=${item.vectorizer_settings.chunk_size}`} {item.vectorizer_settings.use_contextual && " contextual"} {item.vectorizer_settings.use_hybrid && " hybrid"} </span> </div> )}

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

{item.vectorizer_settings && (

<div className="flex items-center gap-2">

<span className="text-gray-400">Vectorizer:</span>

<span>

{item.vectorizer_settings.chunk_size && `chunk=${item.vectorizer_settings.chunk_size}`}

{item.vectorizer_settings.use_contextual && " contextual"}

{item.vectorizer_settings.use_hybrid && " hybrid"}

</span>

</div>

{item.vectorizer_settings && (

<div className="flex items-center gap-2">

<span className="text-gray-400">Vectorizer:</span>

<span>

{item.vectorizer_settings.chunk_size != null && `chunk=${item.vectorizer_settings.chunk_size}`}

{item.vectorizer_settings.use_contextual && " contextual"}

{item.vectorizer_settings.use_hybrid && " hybrid"}

</span>

</div>

)}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx` around lines 330 - 338, The Vectorizer row currently uses truthy checks which render "0" for chunk_size===0 and can produce an empty <span>; update the KnowledgeCard rendering logic for item.vectorizer_settings to build a list of display parts: include chunk when chunk_size is a number (e.g., typeof item.vectorizer_settings.chunk_size === "number"), include "contextual" when item.vectorizer_settings.use_contextual is true, and include "hybrid" when item.vectorizer_settings.use_hybrid is true, then only render the entire row (the surrounding <div> and <span>) if the parts array has length > 0; refer to the item.vectorizer_settings, chunk_size, use_contextual, and use_hybrid identifiers to locate the code.

coderabbitai · 2026-02-22T09:03:12Z

+                  {item.last_crawled_at && (
+                    <div className="flex items-center gap-2">
+                      <span className="text-gray-400">Last crawled:</span>
+                      <span>{format(new Date(item.last_crawled_at), "M/d/yyyy h:mm a")}</span>
+                    </div>
+                  )}
+                  {item.last_vectorized_at && (
+                    <div className="flex items-center gap-2">
+                      <span className="text-gray-400">Last vectorized:</span>
+                      <span>{format(new Date(item.last_vectorized_at), "M/d/yyyy h:mm a")}</span>
+                    </div>
+                  )}


⚠️ Potential issue | 🟠 Major

Unguarded format(new Date(...)) will throw on invalid date strings.

The existing card footer (lines 242–246) wraps its format() call in a try/catch, but the provenance dates here do not. If last_crawled_at or last_vectorized_at contains an unexpected value, this will crash the component.

Proposed fix — add safe formatting

{item.last_crawled_at && ( <div className="flex items-center gap-2"> <span className="text-gray-400">Last crawled:</span> - <span>{format(new Date(item.last_crawled_at), "M/d/yyyy h:mm a")}</span> + <span> + {(() => { + try { + return format(new Date(item.last_crawled_at), "M/d/yyyy h:mm a"); + } catch { + return new Date(item.last_crawled_at).toLocaleString(); + } + })()} + </span> </div> )} {item.last_vectorized_at && ( <div className="flex items-center gap-2"> <span className="text-gray-400">Last vectorized:</span> - <span>{format(new Date(item.last_vectorized_at), "M/d/yyyy h:mm a")}</span> + <span> + {(() => { + try { + return format(new Date(item.last_vectorized_at), "M/d/yyyy h:mm a"); + } catch { + return new Date(item.last_vectorized_at).toLocaleString(); + } + })()} + </span> </div> )}

Alternatively, extract a shared safeDateFormat utility to DRY up both this block and the footer.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

{item.last_crawled_at && (

<div className="flex items-center gap-2">

<span className="text-gray-400">Last crawled:</span>

<span>{format(new Date(item.last_crawled_at), "M/d/yyyy h:mm a")}</span>

</div>

)}

{item.last_vectorized_at && (

<div className="flex items-center gap-2">

<span className="text-gray-400">Last vectorized:</span>

<span>{format(new Date(item.last_vectorized_at), "M/d/yyyy h:mm a")}</span>

</div>

)}

{item.last_crawled_at && (

<div className="flex items-center gap-2">

<span className="text-gray-400">Last crawled:</span>

<span>

{(() => {

try {

return format(new Date(item.last_crawled_at), "M/d/yyyy h:mm a");

} catch {

return new Date(item.last_crawled_at).toLocaleString();

}

})()}

</span>

</div>

)}

{item.last_vectorized_at && (

<div className="flex items-center gap-2">

<span className="text-gray-400">Last vectorized:</span>

<span>

{(() => {

try {

return format(new Date(item.last_vectorized_at), "M/d/yyyy h:mm a");

} catch {

return new Date(item.last_vectorized_at).toLocaleString();

}

})()}

</span>

</div>

)}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx` around lines 340 - 351, The date rendering for item.last_crawled_at and item.last_vectorized_at in KnowledgeCard.tsx uses format(new Date(...)) unguarded and can throw on invalid inputs; wrap each rendering in a safe check (e.g., parse the string and verify with date-fns isValid or wrap format(new Date(...)) in a try/catch) so you only render a formatted string when valid, otherwise render a fallback (e.g., "Unknown" or omit); alternatively, extract a shared safeDateFormat utility (safeDateFormat(dateStr): string | null) and call it from the KnowledgeCard footer and these provenance lines to DRY the logic and ensure consistent fallback behavior for invalid dates.

coderabbitai · 2026-02-22T09:03:12Z

+            # Check for existing crawl state and determine if we're resuming
+            url_state_service = get_crawl_url_state_service(self.supabase_client)
+            has_existing_state = url_state_service.has_existing_state(original_source_id)
+
+            if has_existing_state:
+                crawl_state = url_state_service.get_crawl_state(original_source_id)
+                pending_count = crawl_state.get("pending", 0)
+                embedded_count = crawl_state.get("embedded", 0)
+                failed_count = crawl_state.get("failed", 0)
+                total_count = crawl_state.get("total", 0)
+
+                # If there are pending or failed URLs, log resume info
+                if pending_count > 0 or failed_count > 0:
+                    safe_logfire_info(
+                        f"Resuming crawl | source_id={original_source_id} | "
+                        f"embedded={embedded_count} | pending={pending_count} | failed={failed_count} | total={total_count}"
+                    )
+                else:
+                    # All URLs processed - clear old state for fresh crawl
+                    url_state_service.clear_state(original_source_id)
+                    safe_logfire_info(f"Cleared completed crawl state for fresh crawl | source_id={original_source_id}")


⚠️ Potential issue | 🟠 Major

Resume state check is unguarded — a DB failure here kills the entire crawl.

If has_existing_state(), get_crawl_state(), or clear_state() throws (e.g., table doesn't exist yet, transient DB error), the exception propagates uncaught and the crawl fails entirely at the top-level except Exception handler (Line 749). Resume state checking is an optimization, not a prerequisite — it should degrade gracefully.

Additionally, the missing fetched-state check before clear_state and the synchronous DB calls in async context were noted in prior reviews and still apply.

🛡️ Proposed fix — wrap in try/except so crawl proceeds even if state check fails

# Check for existing crawl state and determine if we're resuming url_state_service = get_crawl_url_state_service(self.supabase_client) - has_existing_state = url_state_service.has_existing_state(original_source_id) - - if has_existing_state: - crawl_state = url_state_service.get_crawl_state(original_source_id) - pending_count = crawl_state.get("pending", 0) - embedded_count = crawl_state.get("embedded", 0) - failed_count = crawl_state.get("failed", 0) - total_count = crawl_state.get("total", 0) - - # If there are pending or failed URLs, log resume info - if pending_count > 0 or failed_count > 0: - safe_logfire_info( - f"Resuming crawl | source_id={original_source_id} | " - f"embedded={embedded_count} | pending={pending_count} | failed={failed_count} | total={total_count}" - ) - else: - # All URLs processed - clear old state for fresh crawl - url_state_service.clear_state(original_source_id) - safe_logfire_info(f"Cleared completed crawl state for fresh crawl | source_id={original_source_id}") + has_existing_state = False + try: + has_existing_state = await asyncio.to_thread( + url_state_service.has_existing_state, original_source_id + ) + + if has_existing_state: + crawl_state = await asyncio.to_thread( + url_state_service.get_crawl_state, original_source_id + ) + pending_count = crawl_state.get("pending", 0) + embedded_count = crawl_state.get("embedded", 0) + failed_count = crawl_state.get("failed", 0) + total_count = crawl_state.get("total", 0) + + if pending_count > 0 or failed_count > 0: + safe_logfire_info( + f"Resuming crawl | source_id={original_source_id} | " + f"embedded={embedded_count} | pending={pending_count} | " + f"failed={failed_count} | total={total_count}" + ) + else: + await asyncio.to_thread( + url_state_service.clear_state, original_source_id + ) + safe_logfire_info( + f"Cleared completed crawl state for fresh crawl | source_id={original_source_id}" + ) + except Exception as e: + logger.warning( + f"Failed to check crawl resume state for '{original_source_id}': {e}. " + "Proceeding with fresh crawl.", + exc_info=True, + ) + has_existing_state = False

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/crawling/crawling_service.py` around lines 364 - 384, Wrap the resume-state logic that calls get_crawl_url_state_service(...), has_existing_state(original_source_id), get_crawl_state(original_source_id), and clear_state(original_source_id) in a try/except so any DB error does not abort the crawl; on exception log a warning via safe_logfire_info and continue. Before calling clear_state, verify the crawl_state includes a "fetched" count (or at minimum ensure pending==0 and failed==0 and fetched>=total) to avoid clearing prematurely. If these URL state methods are synchronous and this module is async, call them off the event loop (e.g., asyncio.to_thread) or use their async equivalents to avoid blocking. Ensure all caught exceptions are swallowed after logging so resume-state failures degrade gracefully.

coderabbitai · 2026-02-22T09:03:12Z

+    async def _filter_already_processed_urls(self, source_id: str, urls: list[str]) -> list[str]:
+        """
+        Filter out URLs that are already embedded.

-    async def _crawl_by_url_type(self, url: str, request: dict[str, Any]) -> tuple:
+        Args:
+            source_id: The source ID
+            urls: List of URLs to filter
+
+        Returns:
+            List of URLs that have not been embedded yet
+        """
+        if not urls:
+            return []
+
+        url_state_service = get_crawl_url_state_service(self.supabase_client)
+
+        # Get embedded URLs
+        embedded_urls = url_state_service.get_embedded_urls(source_id)
+        embedded_set = set(embedded_urls)
+
+        # Filter
+        filtered = [url for url in urls if url not in embedded_set]
+
+        # Log resume info
+        if len(filtered) < len(urls):
+            skipped = len(urls) - len(filtered)
+            safe_logfire_info(
+                f"Resume filtering | skipped={skipped} already-embedded URLs | "
+                f"remaining={len(filtered)} | source_id={source_id}",
+                progress_id=self.progress_id,
+            )
+
+        return filtered


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find safe_logfire_info function definition rg "def safe_logfire_info" -A 5 python/

Repository: coleam00/Archon

Length of output: 462

🏁 Script executed:

# Find get_embedded_urls function definition rg "def get_embedded_urls" -A 5 python/

Repository: coleam00/Archon

Length of output: 616

🏁 Script executed:

# Check imports in the crawling_service file to understand the context rg "from.*safe_logfire_info|from.*get_crawl_url_state_service" python/src/server/services/crawling/crawling_service.py

Repository: coleam00/Archon

Length of output: 210

🏁 Script executed:

# Check how _filter_already_processed_urls is called rg "_filter_already_processed_urls" python/src/server/services/crawling/ -B 2 -A 2

Repository: coleam00/Archon

Length of output: 1772

🏁 Script executed:

# Check if there are other patterns in the codebase where sync DB calls are made in async methods rg "get_embedded_urls|url_state_service\." python/src/server/services/crawling/crawling_service.py | head -20

Repository: coleam00/Archon

Length of output: 373

🏁 Script executed:

# Look at the full implementation of safe_logfire_info to understand kwargs usage cat python/src/server/config/logfire_config.py | grep -A 20 "def safe_logfire_info"

Repository: coleam00/Archon

Length of output: 537

🏁 Script executed:

# Check how many other async methods make sync DB calls via url_state_service rg "async def " python/src/server/services/crawling/crawling_service.py -A 30 | grep -E "async def|url_state_service\.|get_crawl_url_state_service"

Repository: coleam00/Archon

Length of output: 1622

🏁 Script executed:

# Check if any await calls are used with get_crawl_url_state_service calls rg "await.*get_crawl_url_state_service|await.*url_state_service" python/src/server/services/crawling/crawling_service.py

Repository: coleam00/Archon

Length of output: 41

_filter_already_processed_urls is async but contains no await calls — consider whether the method should be synchronous.

The method calls get_embedded_urls(), a synchronous operation, without awaiting it. While safe_logfire_info correctly accepts the progress_id keyword argument via **kwargs, the synchronous database call in an async context could block the event loop. If this pattern is intentional across the service, consider making the method synchronous instead, or use asyncio.to_thread() to run the blocking call on a thread pool.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/crawling/crawling_service.py` around lines 863 - 895, _filter_already_processed_urls is declared async but calls the synchronous get_embedded_urls (via get_crawl_url_state_service(self.supabase_client)) which can block the event loop; either make the method synchronous or run the blocking call in a thread. Fix by either changing _filter_already_processed_urls to a regular def (and adjust callers) or wrap the blocking call with asyncio.to_thread, e.g. await asyncio.to_thread(url_state_service.get_embedded_urls, source_id), then proceed with building embedded_set, filtering urls, and calling safe_logfire_info(progress_id=self.progress_id) as before.

coderabbitai · 2026-02-22T09:03:12Z

            try:
+                # Get current embedding configuration for provenance tracking
+                from ..credential_service import credential_service
+
+                embedding_config = await credential_service.get_credentials_by_category("embedding")
+                embedding_provider = embedding_config.get("EMBEDDING_PROVIDER", "openai")
+                embedding_model = embedding_config.get("EMBEDDING_MODEL", "text-embedding-3-small")
+                embedding_dimensions = int(embedding_config.get("EMBEDDING_DIMENSIONS", "1536"))
+
+                # Get vectorizer settings (placeholder for now, can be enhanced later)
+                vectorizer_settings = {
+                    "use_contextual": False,
+                    "use_hybrid": False,
+                    "chunk_size": 5000,
+                }
+
+                # Get summarization model from RAG strategy
+                rag_settings = await credential_service.get_credentials_by_category("rag_strategy")
+                summarization_model = rag_settings.get("MODEL_CHOICE", "gpt-4o-mini")


⚠️ Potential issue | 🟠 Major

Provenance retrieval failures silently degrade source records to fallback without provenance.

If credential_service.get_credentials_by_category("embedding") throws (e.g., transient network error), the entire primary update_source_info call is skipped and the fallback path (Lines 420–449) creates a bare source record without any provenance fields. This means a single credential-service hiccup loses all provenance for the crawl.

Also, int(embedding_config.get("EMBEDDING_DIMENSIONS", "1536")) on Line 383 will raise ValueError if the config contains a non-numeric string, triggering the same fallback.

Consider isolating provenance retrieval so failures default to None rather than aborting the primary path:

🛡️ Proposed fix — isolate provenance retrieval

try: - # Get current embedding configuration for provenance tracking - from ..credential_service import credential_service - - embedding_config = await credential_service.get_credentials_by_category("embedding") - embedding_provider = embedding_config.get("EMBEDDING_PROVIDER", "openai") - embedding_model = embedding_config.get("EMBEDDING_MODEL", "text-embedding-3-small") - embedding_dimensions = int(embedding_config.get("EMBEDDING_DIMENSIONS", "1536")) - - # Get vectorizer settings (placeholder for now, can be enhanced later) - vectorizer_settings = { - "use_contextual": False, - "use_hybrid": False, - "chunk_size": 5000, - } - - # Get summarization model from RAG strategy - rag_settings = await credential_service.get_credentials_by_category("rag_strategy") - summarization_model = rag_settings.get("MODEL_CHOICE", "gpt-4o-mini") + # Get current embedding configuration for provenance tracking + from ..credential_service import credential_service + + embedding_provider = None + embedding_model = None + embedding_dimensions = None + vectorizer_settings = None + summarization_model = None + + try: + embedding_config = await credential_service.get_credentials_by_category("embedding") + embedding_provider = embedding_config.get("EMBEDDING_PROVIDER", "openai") + embedding_model = embedding_config.get("EMBEDDING_MODEL", "text-embedding-3-small") + embedding_dimensions = int(embedding_config.get("EMBEDDING_DIMENSIONS", "1536")) + vectorizer_settings = { + "use_contextual": False, + "use_hybrid": False, + "chunk_size": 5000, + } + rag_settings = await credential_service.get_credentials_by_category("rag_strategy") + summarization_model = rag_settings.get("MODEL_CHOICE", "gpt-4o-mini") + except Exception as prov_err: + logger.warning( + f"Failed to retrieve provenance config for '{source_id}': {prov_err}, " + "source record will be created without provenance", + exc_info=True, + ) # Call async update_source_info directly await update_source_info(

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/crawling/document_storage_operations.py` around lines 376 - 394, The provenance retrieval currently can raise and abort the primary update_source_info path; wrap the calls to credential_service.get_credentials_by_category("embedding") and ("rag_strategy") and the int(...) conversion for embedding_dimensions in a small try/except block so failures are caught, logged, and the code continues with safe fallback values (e.g., embedding_provider=None, embedding_model=None, embedding_dimensions=None or a validated integer default, summarization_model=None) instead of raising; ensure this block lives just before building vectorizer_settings/summarization_model (around the embedding_config/embedding_dimensions and rag_settings retrieval) and that update_source_info uses those fallback None/validated values so provenance fields are omitted but the primary path is not aborted.

coderabbitai · 2026-02-22T09:03:12Z

+            if embedding_model:
+                upsert_data["embedding_model"] = embedding_model
+            if embedding_dimensions:
+                upsert_data["embedding_dimensions"] = embedding_dimensions


⚠️ Potential issue | 🟡 Minor

embedding_dimensions=0 would be silently dropped.

if embedding_dimensions: is falsy for 0. While unlikely in practice, the guard is inconsistent with line 304's vectorizer_settings is not None pattern. Use is not None for the integer check.

Proposed fix

- if embedding_dimensions: + if embedding_dimensions is not None:

Apply the same change at line 381.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/source_management_service.py` around lines 298 - 301, The guard `if embedding_dimensions:` drops valid zero values; change it to `if embedding_dimensions is not None:` when setting upsert_data["embedding_dimensions"] so 0 is preserved—update the occurrence near the block that sets upsert_data (the same area that checks embedding_model) and also apply the identical change at the later occurrence around line ~381 where embedding_dimensions is checked again (both places in source_management_service.py that populate upsert_data/vectorizer settings).

… settings

…mmarize - Add semaphore-based concurrency limiting (2 concurrent ops max) - Add ProgressTracker integration for real-time progress updates - Return progressId from endpoints for frontend polling - Update frontend to capture progressId and start polling - Add Code Summarization Agent settings section to RAGSettings UI - Provenance display already exists on KnowledgeCard

- Add Code Summary as third tab alongside Chat and Embeddings - Add LuCode icon with orange color for Code Summary tab - Add inline Ollama URL input for Code Summary when Ollama selected - Remove redundant collapsible section that was added incorrectly - Update all context-aware logic to handle code_summarization selection

- Move Code Summary to second row to prevent overflow - Shorten label to 'Summary' - Add code_summarization to Ollama Config panel with model selection - Update Configuration Summary to handle code summarization

- Add Summary as third tab (row 2) with orange color and LuCode icon - Works alongside Chat (row 1) and Embeddings (row 1)

coderabbitai

Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

python/src/server/services/storage/code_storage_service.py (2)

1084-1107: ⚠️ Potential issue | 🟠 Major

Resolve provider before creating the shared LLM client.

generate_code_summaries_batch builds a shared client with provider=None, but _generate_code_example_summary_async can resolve a different provider from credentials. That mismatch can break requests.

🛠️ Suggested fix

+    if provider is None:
+        provider = await _get_code_summarization_provider()
+
     # Create a shared LLM client for all summaries (performance optimization)
     async with get_llm_client(provider=provider) as shared_client:

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/storage/code_storage_service.py` around lines 1084
- 1107, The batch routine creates a shared LLM client with provider possibly
None which can mismatch what _generate_code_example_summary_async resolves
per-item; before calling get_llm_client in generate_code_summaries_batch,
resolve the concrete provider to use (e.g., derive from credentials/config) and
set the local provider variable, then create the shared_client with that
resolved provider and pass that same provider into each
_generate_code_example_summary_async call so the client and provider are
consistent.

108-137: ⚠️ Potential issue | 🟠 Major

Default model selection should follow the code-summarization provider.

If CODE_SUMMARIZATION_PROVIDER differs from the active LLM provider and no dedicated model is set, _get_model_choice still chooses defaults for the chat provider, which can produce a provider/model mismatch.

🛠️ Suggested adjustment

-        provider_config = await credential_service.get_active_provider("llm")
-        active_provider = provider_config.get("provider", "openai")
-        model = provider_config.get("chat_model")
+        provider_config = await credential_service.get_active_provider("llm")
+        active_provider = provider_config.get("provider", "openai")
+        model = provider_config.get("chat_model")
+        code_provider = await _get_code_summarization_provider()
+        provider_for_defaults = code_provider or active_provider
...
-        if not model or model.strip() == "":
+        if not model or model.strip() == "" or provider_for_defaults != active_provider:
             provider_defaults = {
                 ...
             }
-            model = provider_defaults.get(active_provider, "gpt-4o-mini")
+            model = provider_defaults.get(provider_for_defaults, "gpt-4o-mini")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/storage/code_storage_service.py` around lines 108
- 137, The _get_model_choice function currently uses the active LLM provider to
pick provider-specific defaults even when a separate CODE_SUMMARIZATION_PROVIDER
is configured; update _get_model_choice to first read
CODE_SUMMARIZATION_PROVIDER (via
credential_service.get_credential("CODE_SUMMARIZATION_PROVIDER")) and, when no
CODE_SUMMARIZATION_MODEL is set, use that provider value (if present and
non-empty) as the provider to look up defaults (falling back to
provider_config.get("provider", "openai") only if CODE_SUMMARIZATION_PROVIDER is
absent/empty); keep the existing provider_defaults map and return/assign model
based on the chosen summarization provider so the selected default model matches
the code-summarization provider.

🧹 Nitpick comments (2)

archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx (2)
328-387: Consider using Radix Collapsible primitive for the provenance section.

The expand/collapse logic is implemented manually with useState + conditional rendering (Lines 330–343, 345–385). The coding guidelines recommend using Radix UI primitives from src/features/ui/primitives/, and the ADR-002 implementation status doc itself suggests "Radix Collapsible" for this section. Using Collapsible.Root / Collapsible.Trigger / Collapsible.Content would provide built-in accessibility (aria attributes, keyboard handling) and animation support.

As per coding guidelines: "Prefer using Radix UI primitives from src/features/ui/primitives/ when creating UI components."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx` around
lines 328 - 387, The provenance expand/collapse uses manual state
(showProvenance) and a button click handler in KnowledgeCard (the
ChevronRight/Settings2 button and conditional rendering block) which should be
replaced with the Radix Collapsible primitive from our UI primitives; import
Collapsible.Root, Collapsible.Trigger and Collapsible.Content from
src/features/ui/primitives/, wrap the current button markup as
Collapsible.Trigger (preserving the ChevronRight rotation via the open state
class), move the details panel into Collapsible.Content (remove
useState/showProvenance and the onClick handler), and keep all existing item
fields (embedding_provider, summarization_model, vectorizer_settings,
last_crawled_at, last_vectorized_at) and formatting logic unchanged so
accessibility, keyboard handling, and animations are provided by Radix.
190-195: onRevectorize and onResummarize are passed unconditionally.

Both handlers are always passed to KnowledgeCardActions, but the child component gates rendering on hasDocuments && onRevectorize and hasCodeExamples && onResummarize. This means the "Re-vectorize" menu item will appear whenever documentCount > 0 regardless of whether re-vectorization makes sense (e.g., the item is currently processing). Not a bug since it works, but consider gating these at the source for clarity — e.g., only pass onRevectorize when the item isn't already processing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx` around
lines 190 - 195, Only pass the handlers to KnowledgeCardActions when they make
sense: conditionally pass onRevectorize and onResummarize instead of
unconditionally; e.g., replace the current props so onRevectorize is passed only
when documentCount > 0 AND the item is not currently processing (e.g.,
!isProcessing or !isVectorizing) and similarly pass onResummarize only when
codeExamplesCount > 0 AND the item is not processing; use the existing
handleRevectorize and handleResummarize functions as the values when those
conditions are true.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ADR-002-IMPLEMENTATION-STATUS.md`:
- Around line 96-141: Update the ADR status and checklists to reflect the
frontend and reprocessing work already included in this PR: change "Part 2:
Frontend Implementation" from "⏳ PENDING" to completed and check the items
referencing archon-ui-main/src/features/knowledge/types/knowledge.ts and
archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx (the
KnowledgeSource type and the Processing Details UI are present), and update the
reprocessing section (Part 4) from "❌ Not Started" to completed and check the
items implemented by useRevectorizeKnowledgeItem, useResummarizeKnowledgeItem
hooks, revectorizeKnowledgeItem and resummarizeKnowledgeItem service methods,
the "Re-vectorize"/"Re-summarize" actions in KnowledgeCardActions, and the
"Needs re-vectorization" badge in KnowledgeCard so the document accurately
reflects shipped code.

In `@ADR-002`: Crawl Reliability, Ingestion Quality Control & DB Validation.md:
- Around line 96-103: The ADR phase status is incorrect: Phase 4 is marked "Not
Started" while this PR already adds reprocessing hooks and UI; update ADR-002
and ADR-002-IMPLEMENTATION-STATUS.md to reflect partial implementation by
changing Phase 4 status from "❌ Not Started" to something like "⚠️ Partially
Implemented" and add a short note listing implemented artifacts
(useRevectorizeKnowledgeItem, useResummarizeKnowledgeItem, the corresponding
service methods and UI actions) and any remaining work; ensure the ADR text and
the IMPLEMENTATION-STATUS doc mention which sub-tasks are completed and which
are pending so the docs match the code.

In `@archon-ui-main/src/components/settings/RAGSettings.tsx`:
- Around line 1343-1351: The Summary tab button currently builds Tailwind
classes via a template string using activeSelection === 'code_summarization';
replace this dynamic construction with static class constants and pick the
appropriate constant based on state (e.g. create two predefined className
strings like SUMMARY_ACTIVE_CLASS and SUMMARY_INACTIVE_CLASS and choose one
where the component renders the button), or use a utility like clsx/twMerge to
select between those static constants; update the button rendering in
RAGSettings (the element using activeSelection and the 'code_summarization' key)
to use the selected static className instead of composing classes inline.
- Around line 221-229: The component reads
ragSettings.CODE_SUMMARIZATION_PROVIDER and
ragSettings.CODE_SUMMARIZATION_BASE_URL but RAGSettingsProps (ragSettings)
doesn't declare these fields; update the RAGSettingsProps interface to include
CODE_SUMMARIZATION_PROVIDER: ProviderKey (or optional ProviderKey) and
CODE_SUMMARIZATION_BASE_URL: string (or optional string) so the usages in
showCodeSummarySettings, codeSummaryProvider, and codeSummaryInstanceConfig are
type-safe under strict TS; add them as optional properties if they may be
missing and update any related defaulting logic accordingly.
- Around line 475-486: The component lacks a sync from ragSettings → local code
summary provider state, so add a useEffect that watches
ragSettings.CODE_SUMMARIZATION_PROVIDER and updates the local
codeSummaryProvider (or calls setRagSettings appropriately) to keep the UI from
showing stale defaults; implement it similarly to the existing effect but
inverted: use updateCodeSummaryRagSettingsRef to avoid loops, check if
ragSettings.CODE_SUMMARIZATION_PROVIDER is defined and different from the
current codeSummaryProvider, then call the setter to apply
ragSettings.CODE_SUMMARIZATION_PROVIDER so UI and persisted settings stay in
sync.

In `@archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts`:
- Around line 716-761: The hooks useRevectorizeKnowledgeItem and
useResummarizeKnowledgeItem access properties (documents_updated,
examples_updated, model_used) on the mutation result that are not declared in
the service return types; to fix, either update the service methods
knowledgeService.revectorizeKnowledgeItem and
knowledgeService.resummarizeKnowledgeItem to include those optional fields in
their Promise return types (e.g. add documents_updated?: number,
examples_updated?: number, model_used?: string) or change the onSuccess handlers
in useRevectorizeKnowledgeItem and useResummarizeKnowledgeItem to use
data.message (or guard with data.documents_updated ?? undefined) so TypeScript
strict mode is satisfied and no undeclared property is accessed.

In `@archon-ui-main/src/services/credentialsService.ts`:
- Around line 27-30: RagSettings is missing the CODE_SUMMARIZATION_INSTANCE_NAME
key so values set in RAGSettings.tsx aren't persisted or read back; add
CODE_SUMMARIZATION_INSTANCE_NAME?: string to the RagSettings interface, add a
default entry for CODE_SUMMARIZATION_INSTANCE_NAME in the RagSettings default
values object, and include "CODE_SUMMARIZATION_INSTANCE_NAME" in the
string-fields mapping returned by getRagSettings() so updateRagSettings() and
the loader will persist/return the value (refer to the RagSettings interface,
the defaults object, and getRagSettings() function to locate where to add
these).

In `@migration/0.1.0/003_ollama_add_columns.sql`:
- Around line 33-40: The tracking INSERT into the archon_migrations table
(VALUES ('0.1.0','003_ollama_add_columns') ON CONFLICT DO NOTHING) is executed
after the COMMIT and must be moved inside the same BEGIN…COMMIT transaction that
contains the ALTER TABLE statements so the migration and its tracking record are
atomic; relocate the INSERT so it runs before the COMMIT (within the transaction
block) and keep the existing ON CONFLICT clause and status SELECT unchanged.

In `@migration/0.1.0/004_ollama_migrate_data.sql`:
- Around line 68-75: The tracking INSERT into archon_migrations currently runs
after the transaction COMMIT, leaving DROP COLUMN changes outside the same
atomic transaction; move the INSERT INTO archon_migrations (the statement that
inserts ('0.1.0', '004_ollama_migrate_data') ON CONFLICT DO NOTHING) so it
executes before the COMMIT and inside the same BEGIN…COMMIT block (i.e., place
the INSERT into archon_migrations just prior to the COMMIT in this migration) to
ensure the migration recording is atomic with the schema changes.

In `@migration/0.1.0/005_ollama_create_functions.sql`:
- Around line 170-177: The tracking INSERT into archon_migrations is executed
after the COMMIT, breaking transactional atomicity; move the INSERT ('0.1.0',
'005_ollama_create_functions' ON CONFLICT DO NOTHING) so it runs inside the same
transaction that creates the Ollama functions—i.e., ensure the BEGIN…COMMIT
block that wraps the function creation also contains the INSERT (or wrap both
function creation and the INSERT in a single explicit transaction) so that
either both the functions and the migration record are committed or neither are.

In `@python/src/server/api_routes/knowledge_api.py`:
- Around line 684-723: The route handler revectorize_knowledge_item currently
performs orchestration and Supabase reads/updates; refactor by moving all
orchestration (provider API-key validation via
credential_service.get_active_provider and _validate_provider_api_key, fetching
the item via KnowledgeItemService.get_item, ProgressTracker
initialization/start, and launching the background job
_perform_revectorize_with_progress) into a new service method (e.g.,
KnowledgeItemService.revectorize_with_progress or
KnowledgeOrchestrationService.start_revectorize) that returns the progress_id
and status; then simplify revectorize_knowledge_item to call that service method
and return {"success": True, "progressId": progress_id, "message": ...}. Ensure
the service method uses the same symbols (_perform_revectorize_with_progress,
ProgressTracker, KnowledgeItemService.get_item) and keeps error handling (404
HTTPException) so tests and database access are encapsulated in the service
layer.
- Around line 878-882: The code is validating the active "llm" provider but
resummarize should validate the configured code-summarization provider; replace
the hardcoded "llm" usage with the CODE_SUMMARIZATION_PROVIDER constant so you
fetch the correct provider_config (change the call
credential_service.get_active_provider("llm") to
credential_service.get_active_provider(CODE_SUMMARIZATION_PROVIDER)), then
derive provider from provider_config (provider = provider_config.get("provider",
"<appropriate default>")) and call _validate_provider_api_key(provider) as
before to ensure the code summarization provider's API key is validated.
- Around line 788-848: The code uses embedding_dim (set inside the per-batch
loop when result.embeddings exist) later in the
supabase.table("archon_sources").update call which can raise UnboundLocalError
if no embeddings were returned; to fix, introduce a variable (e.g.,
final_embedding_dim or any name) initialized before the batching loop and update
it when you successfully determine embedding_dim inside the loop (in the block
handling result.embeddings), then use that final variable (or a safe default
like 0 or None) in the archon_sources.update call and only set
embedding_model/provider/vectorizer_settings/last_vectorized_at if embeddings
were produced (or explicitly handle the “no embeddings” case) so archon_sources
update won’t reference an undefined embedding_dim; refer to
create_embeddings_batch, result.embeddings, embedding_dim, and the
supabase.table("archon_sources").update call when making the change.

In `@python/src/server/services/knowledge/knowledge_item_service.py`:
- Around line 384-386: _transform_source_to_item currently reads total words
from source.get("total_words", 0) which misses the DB column "total_word_count";
change the lookup to defensively prefer "total_word_count" with a fallback to
"total_words" (e.g. total_words = source.get("total_word_count",
source.get("total_words", 0))) and then use that total_words variable when
setting "word_count", "estimated_pages", and "pages_tooltip" so single-item
responses reflect the real count.
- Around line 427-462: In _check_nevectorization (async def
_check_needs_revectorization) change the default types passed into
credential_service.get_credential to match stored string formats (e.g., use
"false" for USE_CONTEXTUAL_EMBEDDINGS and "512" for CHUNK_SIZE) and compare
against stored_vectorizer_settings values also read as strings to avoid
type-coercion false positives; also replace the bare except block that silently
returns False with logging the exception (use safe_logfire_error or the project
logger, include source.get("source_id") and exc_info=True) before returning
False so import/credential errors are visible.

---

Outside diff comments:
In `@python/src/server/services/storage/code_storage_service.py`:
- Around line 1084-1107: The batch routine creates a shared LLM client with
provider possibly None which can mismatch what
_generate_code_example_summary_async resolves per-item; before calling
get_llm_client in generate_code_summaries_batch, resolve the concrete provider
to use (e.g., derive from credentials/config) and set the local provider
variable, then create the shared_client with that resolved provider and pass
that same provider into each _generate_code_example_summary_async call so the
client and provider are consistent.
- Around line 108-137: The _get_model_choice function currently uses the active
LLM provider to pick provider-specific defaults even when a separate
CODE_SUMMARIZATION_PROVIDER is configured; update _get_model_choice to first
read CODE_SUMMARIZATION_PROVIDER (via
credential_service.get_credential("CODE_SUMMARIZATION_PROVIDER")) and, when no
CODE_SUMMARIZATION_MODEL is set, use that provider value (if present and
non-empty) as the provider to look up defaults (falling back to
provider_config.get("provider", "openai") only if CODE_SUMMARIZATION_PROVIDER is
absent/empty); keep the existing provider_defaults map and return/assign model
based on the chosen summarization provider so the selected default model matches
the code-summarization provider.

---

Duplicate comments:
In `@ADR-002`: Crawl Reliability, Ingestion Quality Control & DB Validation.md:
- Around line 53-66: The ADR's crawl_url_state table definition is inconsistent
with the codebase: it references knowledge_sources(id) and uses UUID for
source_id, while the actual schema uses archon_sources and source_id is TEXT;
update the CREATE TABLE statement for crawl_url_state to reference
archon_sources (replace REFERENCES knowledge_sources(id)) and change the
source_id column type to TEXT (ensure UNIQUE(source_id, url) still applies) so
the example matches the real schema.

In `@archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx`:
- Around line 362-370: In KnowledgeCard, the expression
item.vectorizer_settings.chunk_size &&
`chunk=${item.vectorizer_settings.chunk_size}` uses && which treats 0 as false
and hides chunk_size=0; replace that conditional with an explicit null/undefined
check (e.g., check chunk_size !== undefined && chunk_size !== null or
Number.isFinite) and render `chunk=${item.vectorizer_settings.chunk_size}` when
the check passes so a value of 0 is shown correctly.
- Around line 372-383: The code renders item.last_crawled_at and
item.last_vectorized_at using format(new Date(...)) which can throw for invalid
date strings; update KnowledgeCard.tsx to guard these calls (same approach used
in the footer) by validating/parsing the date before formatting (e.g., use
date-fns isValid(new Date(...)) or wrap format(new Date(...)) in a try/catch)
and render a safe fallback (empty string or "Invalid date") when the value is
invalid; apply this change to both item.last_crawled_at and
item.last_vectorized_at so format is only called on guaranteed-valid Date
objects.

In `@migration/0.1.0/012_add_crawl_url_state.sql`:
- Around line 43-45: The RLS policy "Service role full access to
crawl_url_state" on archon_crawl_url_state is too permissive; replace USING
(true) WITH CHECK (true) with a service-role-only condition such as USING
(auth.role() = 'service_role') WITH CHECK (auth.role() = 'service_role') so only
the service_role can bypass row restrictions; update the policy definition that
mentions "Service role full access to crawl_url_state" to use this condition.
- Around line 21-22: The migration sets updated_at with a DEFAULT but lacks the
trigger to refresh it on updates; add a BEFORE UPDATE trigger and helper
function that sets NEW.updated_at = now() so updated_at is maintained
automatically. In the same migration (012_add_crawl_url_state.sql) create a
trigger function (e.g., trigger_set_timestamp or set_updated_at) and attach a
BEFORE UPDATE trigger on the table that assigns now() to NEW.updated_at and
returns NEW.

---

Nitpick comments:
In `@archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx`:
- Around line 328-387: The provenance expand/collapse uses manual state
(showProvenance) and a button click handler in KnowledgeCard (the
ChevronRight/Settings2 button and conditional rendering block) which should be
replaced with the Radix Collapsible primitive from our UI primitives; import
Collapsible.Root, Collapsible.Trigger and Collapsible.Content from
src/features/ui/primitives/, wrap the current button markup as
Collapsible.Trigger (preserving the ChevronRight rotation via the open state
class), move the details panel into Collapsible.Content (remove
useState/showProvenance and the onClick handler), and keep all existing item
fields (embedding_provider, summarization_model, vectorizer_settings,
last_crawled_at, last_vectorized_at) and formatting logic unchanged so
accessibility, keyboard handling, and animations are provided by Radix.
- Around line 190-195: Only pass the handlers to KnowledgeCardActions when they
make sense: conditionally pass onRevectorize and onResummarize instead of
unconditionally; e.g., replace the current props so onRevectorize is passed only
when documentCount > 0 AND the item is not currently processing (e.g.,
!isProcessing or !isVectorizing) and similarly pass onResummarize only when
codeExamplesCount > 0 AND the item is not processing; use the existing
handleRevectorize and handleResummarize functions as the values when those
conditions are true.

coderabbitai · 2026-02-22T18:56:25Z

+### Frontend Implementation
+
+**Status:** ⏳ PENDING
+
+**Files to Modify:**
+1. ⏳ `archon-ui-main/src/features/knowledge/types/knowledge.ts`
+   ```typescript
+   export interface KnowledgeSource {
+     source_id: string;
+     // ... existing fields ...
+     embedding_model?: string;
+     embedding_dimensions?: number;
+     embedding_provider?: string;
+     vectorizer_settings?: {
+       use_contextual?: boolean;
+       use_hybrid?: boolean;
+       chunk_size?: number;
+     };
+     summarization_model?: string;
+     last_crawled_at?: string;
+     last_vectorized_at?: string;
+   }
+   ```
+
+2. ⏳ `archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx`
+   - Add expandable "Processing Details" section using Radix Collapsible
+   - Display embedding_provider/embedding_model (embedding_dimensions D)
+   - Display summarization_model
+   - Display formatted last_crawled_at timestamp
+   - Use Tron-inspired glassmorphism styling
+
+**UI Design:**
+```tsx
+<Collapsible.Root>
+  <Collapsible.Trigger className="flex items-center gap-2 text-sm text-gray-400 hover:text-cyan-400">
+    <ChevronRight className="transition-transform" />
+    Processing Details
+  </Collapsible.Trigger>
+  <Collapsible.Content className="mt-2 text-xs text-gray-400 space-y-1 pl-6">
+    <div>Embeddings: {embedding_provider}/{embedding_model} ({embedding_dimensions}D)</div>
+    <div>Summarization: {summarization_model}</div>
+    <div>Last crawled: {formatDate(last_crawled_at)}</div>
+  </Collapsible.Content>
+</Collapsible.Root>
+```
+


⚠️ Potential issue | 🟡 Minor

Status markers are out of date — frontend implementation is already present in this PR.

The "Part 2: Frontend Implementation" is marked "⏳ PENDING" and the items listed (types in knowledge.ts, provenance display in KnowledgeCard.tsx) are already implemented in this PR. Similarly, the checklist at Lines 274–275 should be checked.

Additionally, Part 4 (Lines 211–256) marks all reprocessing items as "❌ Not Started," but this PR includes:

useRevectorizeKnowledgeItem / useResummarizeKnowledgeItem hooks

revectorizeKnowledgeItem / resummarizeKnowledgeItem service methods

"Re-vectorize" / "Re-summarize" dropdown actions in KnowledgeCardActions

"Needs re-vectorization" badge in KnowledgeCard

Update the status markers and checklists to reflect the code that ships with this PR.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ADR-002-IMPLEMENTATION-STATUS.md` around lines 96 - 141, Update the ADR status and checklists to reflect the frontend and reprocessing work already included in this PR: change "Part 2: Frontend Implementation" from "⏳ PENDING" to completed and check the items referencing archon-ui-main/src/features/knowledge/types/knowledge.ts and archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx (the KnowledgeSource type and the Processing Details UI are present), and update the reprocessing section (Part 4) from "❌ Not Started" to completed and check the items implemented by useRevectorizeKnowledgeItem, useResummarizeKnowledgeItem hooks, revectorizeKnowledgeItem and resummarizeKnowledgeItem service methods, the "Re-vectorize"/"Re-summarize" actions in KnowledgeCardActions, and the "Needs re-vectorization" badge in KnowledgeCard so the document accurately reflects shipped code.

coderabbitai · 2026-02-22T18:56:25Z

+**Implementation:**
+- Add validation API endpoint
+- Add database validation functions
+- Add MCP validation tool
+- Add frontend validation UI
+
+**Status:** ❌ Not Started (see ADR-002-IMPLEMENTATION-STATUS.md)
+


⚠️ Potential issue | 🟡 Minor

Phase 3/4 status markers conflict with implemented code in this PR.

The ADR marks Phase 4 (Reprocessing Tools) as "❌ Not Started," but this PR already includes frontend hooks (useRevectorizeKnowledgeItem, useResummarizeKnowledgeItem), service methods, and UI actions for re-vectorize/re-summarize. The implementation status doc (ADR-002-IMPLEMENTATION-STATUS.md) also marks Phase 4 as not started. Consider updating the status to reflect the partial implementation already present.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ADR-002`: Crawl Reliability, Ingestion Quality Control & DB Validation.md around lines 96 - 103, The ADR phase status is incorrect: Phase 4 is marked "Not Started" while this PR already adds reprocessing hooks and UI; update ADR-002 and ADR-002-IMPLEMENTATION-STATUS.md to reflect partial implementation by changing Phase 4 status from "❌ Not Started" to something like "⚠️ Partially Implemented" and add a short note listing implemented artifacts (useRevectorizeKnowledgeItem, useResummarizeKnowledgeItem, the corresponding service methods and UI actions) and any remaining work; ensure the ADR text and the IMPLEMENTATION-STATUS doc mention which sub-tasks are completed and which are pending so the docs match the code.

coderabbitai · 2026-02-22T18:56:25Z

+  // Code Summarization state
+  const [showCodeSummarySettings, setShowCodeSummarySettings] = useState(false);
+  const [codeSummaryProvider, setCodeSummaryProvider] = useState<ProviderKey>(() =>
+    (ragSettings.CODE_SUMMARIZATION_PROVIDER as ProviderKey) || 'openai'
+  );
+  const [codeSummaryInstanceConfig, setCodeSummaryInstanceConfig] = useState({
+    name: '',
+    url: ragSettings.CODE_SUMMARIZATION_BASE_URL || 'http://host.docker.internal:11434/v1'
+  });


⚠️ Potential issue | 🟠 Major

Extend ragSettings type to include CODE_SUMMARIZATION_* fields.

These new property accesses aren’t declared in RAGSettingsProps, which will fail under strict TypeScript and makes the shape inconsistent.

💡 Suggested interface update

ragSettings: { ... CODE_EXTRACTION_BATCH_SIZE?: number; CODE_SUMMARY_MAX_WORKERS?: number; + CODE_SUMMARIZATION_BASE_URL?: string; + CODE_SUMMARIZATION_INSTANCE_NAME?: string; + CODE_SUMMARIZATION_PROVIDER?: ProviderKey; + CODE_SUMMARIZATION_MODEL?: string; };

As per coding guidelines, "archon-ui-main/src/**/*.{ts,tsx}: Use TypeScript in strict mode with no implicit any in frontend code".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@archon-ui-main/src/components/settings/RAGSettings.tsx` around lines 221 - 229, The component reads ragSettings.CODE_SUMMARIZATION_PROVIDER and ragSettings.CODE_SUMMARIZATION_BASE_URL but RAGSettingsProps (ragSettings) doesn't declare these fields; update the RAGSettingsProps interface to include CODE_SUMMARIZATION_PROVIDER: ProviderKey (or optional ProviderKey) and CODE_SUMMARIZATION_BASE_URL: string (or optional string) so the usages in showCodeSummarySettings, codeSummaryProvider, and codeSummaryInstanceConfig are type-safe under strict TS; add them as optional properties if they may be missing and update any related defaulting logic accordingly.

coderabbitai · 2026-02-22T18:56:25Z

+  // Update ragSettings when codeSummaryProvider changes
+  const updateCodeSummaryRagSettingsRef = useRef(true);
+
+  useEffect(() => {
+    if (updateCodeSummaryRagSettingsRef.current && codeSummaryProvider && codeSummaryProvider !== ragSettings.CODE_SUMMARIZATION_PROVIDER) {
+      setRagSettings(prev => ({
+        ...prev,
+        CODE_SUMMARIZATION_PROVIDER: codeSummaryProvider
+      }));
+    }
+    updateCodeSummaryRagSettingsRef.current = true;
+  }, [codeSummaryProvider]);


⚠️ Potential issue | 🟠 Major

Sync codeSummaryProvider from ragSettings to avoid stale state.

Unlike chat/embedding, there’s no ragSettings → local state sync for code summary. This can leave the UI showing the default provider and overwrite persisted settings on save.

🛠️ Suggested sync effect

+ useEffect(() => { + if ( + ragSettings.CODE_SUMMARIZATION_PROVIDER && + ragSettings.CODE_SUMMARIZATION_PROVIDER !== codeSummaryProvider + ) { + updateCodeSummaryRagSettingsRef.current = false; + setCodeSummaryProvider(ragSettings.CODE_SUMMARIZATION_PROVIDER as ProviderKey); + } + }, [ragSettings.CODE_SUMMARIZATION_PROVIDER]);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@archon-ui-main/src/components/settings/RAGSettings.tsx` around lines 475 - 486, The component lacks a sync from ragSettings → local code summary provider state, so add a useEffect that watches ragSettings.CODE_SUMMARIZATION_PROVIDER and updates the local codeSummaryProvider (or calls setRagSettings appropriately) to keep the UI from showing stale defaults; implement it similarly to the existing effect but inverted: use updateCodeSummaryRagSettingsRef to avoid loops, check if ragSettings.CODE_SUMMARIZATION_PROVIDER is defined and different from the current codeSummaryProvider, then call the setter to apply ragSettings.CODE_SUMMARIZATION_PROVIDER so UI and persisted settings stay in sync.

coderabbitai · 2026-02-22T18:56:25Z

+            className={`min-w-[180px] px-5 py-3 font-semibold text-white dark:text-white
+              border border-orange-400/70 dark:border-orange-400/40
+              bg-black/40 backdrop-blur-md
+              shadow-[inset_0_0_16px_rgba(234,88,12,0.38)]
+              hover:bg-orange-500/12 dark:hover:bg-orange-500/20
+              hover:border-orange-300/80 hover:shadow-[0_0_24px_rgba(251,146,60,0.52)]
+              ${(activeSelection === 'code_summarization')
+                ? 'shadow-[0_0_26px_rgba(251,146,60,0.55)] ring-2 ring-orange-400/60'
+                : 'shadow-[0_0_15px_rgba(251,146,60,0.25)]'}


⚠️ Potential issue | 🟡 Minor

Avoid dynamic Tailwind class construction for the Summary tab button.

The inline template string builds Tailwind classes dynamically. Prefer static class constants selected by state.

As per coding guidelines, "Do not use dynamic Tailwind class construction in frontend components".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@archon-ui-main/src/components/settings/RAGSettings.tsx` around lines 1343 - 1351, The Summary tab button currently builds Tailwind classes via a template string using activeSelection === 'code_summarization'; replace this dynamic construction with static class constants and pick the appropriate constant based on state (e.g. create two predefined className strings like SUMMARY_ACTIVE_CLASS and SUMMARY_INACTIVE_CLASS and choose one where the component renders the button), or use a utility like clsx/twMerge to select between those static constants; update the button rendering in RAGSettings (the element using activeSelection and the 'code_summarization' key) to use the selected static className instead of composing classes inline.

coderabbitai · 2026-02-22T18:56:26Z

+@router.post("/knowledge-items/{source_id}/revectorize")
+async def revectorize_knowledge_item(source_id: str):
+    """Re-generate embeddings for all documents in a knowledge item without re-crawling."""
+    from ..utils.progress.progress_tracker import ProgressTracker
+
+    logger.info(f"🔍 Starting re-vectorize for source_id={source_id}")
+
+    # Generate unique progress ID
+    progress_id = str(uuid.uuid4())
+
+    # Initialize progress tracker
+    tracker = ProgressTracker(progress_id, operation_type="revectorize")
+
+    try:
+        # Validate API key
+        provider_config = await credential_service.get_active_provider("embedding")
+        provider = provider_config.get("provider", "openai")
+        await _validate_provider_api_key(provider)
+
+        # Get the existing knowledge item
+        service = KnowledgeItemService(get_supabase_client())
+        existing_item = await service.get_item(source_id)
+
+        if not existing_item:
+            raise HTTPException(status_code=404, detail={"error": f"Knowledge item {source_id} not found"})
+
+        await tracker.start(
+            {
+                "status": "starting",
+                "progress": 0,
+                "log": f"Starting re-vectorization for {existing_item.get('title', source_id)}",
+                "documents_total": 0,
+                "documents_processed": 0,
+            }
        )
+
+        # Start background task with semaphore
+        asyncio.create_task(_perform_revectorize_with_progress(progress_id, source_id, provider, tracker))
+
+        return {"success": True, "progressId": progress_id, "message": "Re-vectorization started"}


🛠️ Refactor suggestion | 🟠 Major

Route should delegate revectorize/resummarize work to a service layer.

These new endpoints execute Supabase reads/updates directly in api_routes. Please move orchestration into a service class and call it from the route for consistency and testability.

As per coding guidelines, "Follow the API route → Service → Database pattern for new API endpoints".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/api_routes/knowledge_api.py` around lines 684 - 723, The route handler revectorize_knowledge_item currently performs orchestration and Supabase reads/updates; refactor by moving all orchestration (provider API-key validation via credential_service.get_active_provider and _validate_provider_api_key, fetching the item via KnowledgeItemService.get_item, ProgressTracker initialization/start, and launching the background job _perform_revectorize_with_progress) into a new service method (e.g., KnowledgeItemService.revectorize_with_progress or KnowledgeOrchestrationService.start_revectorize) that returns the progress_id and status; then simplify revectorize_knowledge_item to call that service method and return {"success": True, "progressId": progress_id, "message": ...}. Ensure the service method uses the same symbols (_perform_revectorize_with_progress, ProgressTracker, KnowledgeItemService.get_item) and keeps error handling (404 HTTPException) so tests and database access are encapsulated in the service layer.

coderabbitai · 2026-02-22T18:56:26Z

+                # Create embeddings
+                result = await create_embeddings_batch(contents, provider=provider)
+
+                if result.embeddings:
+                    # Update documents with new embeddings
+                    for j, (doc, embedding) in enumerate(zip(batch, result.embeddings, strict=False)):
+                        doc_id = doc.get("id")
+                        if not doc_id:
+                            continue
+
+                        # Determine embedding column based on dimension
+                        embedding_dim = len(embedding) if isinstance(embedding, list) else 0
+                        embedding_column = None
+                        if embedding_dim == 768:
+                            embedding_column = "embedding_768"
+                        elif embedding_dim == 1024:
+                            embedding_column = "embedding_1024"
+                        elif embedding_dim == 1536:
+                            embedding_column = "embedding_1536"
+                        elif embedding_dim == 3072:
+                            embedding_column = "embedding_3072"
+                        else:
+                            errors.append(f"Unsupported dimension {embedding_dim} for doc {doc_id}")
+                            continue
+
+                        try:
+                            supabase.table("archon_crawled_pages").update(
+                                {
+                                    embedding_column: embedding,
+                                    "embedding_model": embedding_model,
+                                    "embedding_dimension": embedding_dim,
+                                }
+                            ).eq("id", doc_id).execute()
+                            total_updated += 1
+                        except Exception as e:
+                            errors.append(f"Failed to update doc {doc_id}: {str(e)}")
+
+                # Update progress
+                progress = 10 + int((i + len(batch)) / total_docs * 85)
+                await tracker.update(
+                    {
+                        "status": "processing",
+                        "progress": progress,
+                        "log": f"Processed {min(i + len(batch), total_docs)}/{total_docs} documents",
+                        "documents_total": total_docs,
+                        "documents_processed": min(i + len(batch), total_docs),
+                    }
+                )
+
+            # Update source provenance
+            supabase.table("archon_sources").update(
+                {
+                    "embedding_model": embedding_model,
+                    "embedding_dimensions": embedding_dim,
+                    "embedding_provider": provider,
+                    "vectorizer_settings": vectorizer_settings,
+                    "last_vectorized_at": datetime.utcnow().isoformat(),
+                    "needs_revectorization": False,
+                }
+            ).eq("id", source_id).execute()
+


⚠️ Potential issue | 🟠 Major

Guard against empty embedding batches before updating provenance.

embedding_dim is set inside the loop and later used to update archon_sources. If no embeddings are returned, this raises UnboundLocalError and leaves progress in a failed state.

🛠️ Suggested guard

- # Process documents in batches + embedding_dim: int | None = None + # Process documents in batches for i in range(0, len(documents), batch_size): ... if result.embeddings: for j, (doc, embedding) in enumerate(zip(batch, result.embeddings, strict=False)): ... - embedding_dim = len(embedding) if isinstance(embedding, list) else 0 + embedding_dim = len(embedding) if isinstance(embedding, list) else len(embedding) ... + + if embedding_dim is None: + await tracker.error("No embeddings were generated; aborting re-vectorization") + return

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Create embeddings

result = await create_embeddings_batch(contents, provider=provider)

if result.embeddings:

# Update documents with new embeddings

for j, (doc, embedding) in enumerate(zip(batch, result.embeddings, strict=False)):

doc_id = doc.get("id")

if not doc_id:

continue

# Determine embedding column based on dimension

embedding_dim = len(embedding) if isinstance(embedding, list) else 0

embedding_column = None

if embedding_dim == 768:

embedding_column = "embedding_768"

elif embedding_dim == 1024:

embedding_column = "embedding_1024"

elif embedding_dim == 1536:

embedding_column = "embedding_1536"

elif embedding_dim == 3072:

embedding_column = "embedding_3072"

else:

errors.append(f"Unsupported dimension {embedding_dim} for doc {doc_id}")

continue

try:

supabase.table("archon_crawled_pages").update(

{

embedding_column: embedding,

"embedding_model": embedding_model,

"embedding_dimension": embedding_dim,

}

).eq("id", doc_id).execute()

total_updated += 1

except Exception as e:

errors.append(f"Failed to update doc {doc_id}: {str(e)}")

# Update progress

progress = 10 + int((i + len(batch)) / total_docs * 85)

await tracker.update(

{

"status": "processing",

"progress": progress,

"log": f"Processed {min(i + len(batch), total_docs)}/{total_docs} documents",

"documents_total": total_docs,

"documents_processed": min(i + len(batch), total_docs),

}

)

# Update source provenance

supabase.table("archon_sources").update(

{

"embedding_model": embedding_model,

"embedding_dimensions": embedding_dim,

"embedding_provider": provider,

"vectorizer_settings": vectorizer_settings,

"last_vectorized_at": datetime.utcnow().isoformat(),

"needs_revectorization": False,

}

).eq("id", source_id).execute()

# Initialize before loop to prevent UnboundLocalError

embedding_dim: int | None = None

# Process documents in batches

for i in range(0, len(documents), batch_size):

...

# Create embeddings

result = await create_embeddings_batch(contents, provider=provider)

if result.embeddings:

# Update documents with new embeddings

for j, (doc, embedding) in enumerate(zip(batch, result.embeddings, strict=False)):

doc_id = doc.get("id")

if not doc_id:

continue

# Determine embedding column based on dimension

embedding_dim = len(embedding) if isinstance(embedding, list) else 0

embedding_column = None

if embedding_dim == 768:

embedding_column = "embedding_768"

elif embedding_dim == 1024:

embedding_column = "embedding_1024"

elif embedding_dim == 1536:

embedding_column = "embedding_1536"

elif embedding_dim == 3072:

embedding_column = "embedding_3072"

else:

errors.append(f"Unsupported dimension {embedding_dim} for doc {doc_id}")

continue

try:

supabase.table("archon_crawled_pages").update(

{

embedding_column: embedding,

"embedding_model": embedding_model,

"embedding_dimension": embedding_dim,

}

).eq("id", doc_id).execute()

total_updated += 1

except Exception as e:

errors.append(f"Failed to update doc {doc_id}: {str(e)}")

# Update progress

progress = 10 + int((i + len(batch)) / total_docs * 85)

await tracker.update(

{

"status": "processing",

"progress": progress,

"log": f"Processed {min(i + len(batch), total_docs)}/{total_docs} documents",

"documents_total": total_docs,

"documents_processed": min(i + len(batch), total_docs),

}

)

# Guard against empty embedding batches before updating provenance

if embedding_dim is None:

await tracker.error("No embeddings were generated; aborting re-vectorization")

return

# Update source provenance

supabase.table("archon_sources").update(

{

"embedding_model": embedding_model,

"embedding_dimensions": embedding_dim,

"embedding_provider": provider,

"vectorizer_settings": vectorizer_settings,

"last_vectorized_at": datetime.utcnow().isoformat(),

"needs_revectorization": False,

}

).eq("id", source_id).execute()

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/api_routes/knowledge_api.py` around lines 788 - 848, The code uses embedding_dim (set inside the per-batch loop when result.embeddings exist) later in the supabase.table("archon_sources").update call which can raise UnboundLocalError if no embeddings were returned; to fix, introduce a variable (e.g., final_embedding_dim or any name) initialized before the batching loop and update it when you successfully determine embedding_dim inside the loop (in the block handling result.embeddings), then use that final variable (or a safe default like 0 or None) in the archon_sources.update call and only set embedding_model/provider/vectorizer_settings/last_vectorized_at if embeddings were produced (or explicitly handle the “no embeddings” case) so archon_sources update won’t reference an undefined embedding_dim; refer to create_embeddings_batch, result.embeddings, embedding_dim, and the supabase.table("archon_sources").update call when making the change.

coderabbitai · 2026-02-22T18:56:26Z

+        # Validate API key (uses LLM provider for summarization)
+        provider_config = await credential_service.get_active_provider("llm")
+        provider = provider_config.get("provider", "openai")
+        await _validate_provider_api_key(provider)
+


⚠️ Potential issue | 🟠 Major

Validate the code-summarization provider instead of the chat provider.

resummarize validates the active LLM provider, but summarization may use CODE_SUMMARIZATION_PROVIDER. If they differ, validation passes and the background task fails later.

🛠️ Suggested fix

- provider_config = await credential_service.get_active_provider("llm") - provider = provider_config.get("provider", "openai") + from ..services.storage.code_storage_service import _get_code_summarization_provider + provider = await _get_code_summarization_provider() await _validate_provider_api_key(provider)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/api_routes/knowledge_api.py` around lines 878 - 882, The code is validating the active "llm" provider but resummarize should validate the configured code-summarization provider; replace the hardcoded "llm" usage with the CODE_SUMMARIZATION_PROVIDER constant so you fetch the correct provider_config (change the call credential_service.get_active_provider("llm") to credential_service.get_active_provider(CODE_SUMMARIZATION_PROVIDER)), then derive provider from provider_config (provider = provider_config.get("provider", "<appropriate default>")) and call _validate_provider_api_key(provider) as before to ensure the code summarization provider's API key is validated.

coderabbitai · 2026-02-22T18:56:26Z

                "word_count": source.get("total_words", 0),
-                "estimated_pages": round(
-                    source.get("total_words", 0) / 250, 1
-                ),  # Average book page = 250 words
+                "estimated_pages": round(source.get("total_words", 0) / 250, 1),  # Average book page = 250 words
                "pages_tooltip": f"{round(source.get('total_words', 0) / 250, 1)} pages (≈ {source.get('total_words', 0):,} words)",


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Search for total_word occurrences in Python files echo "=== Python files with total_word references ===" rg -n "total_word" --type py -C 2 echo -e "\n=== SQL files with total_word references ===" rg -n "total_word" --type sql -C 2

Repository: coleam00/Archon

Length of output: 13365

🏁 Script executed:

#!/bin/bash # Find the knowledge_item_service.py file and examine the specific lines fd "knowledge_item_service.py" --type f -x wc -l {} echo "=== Context around lines 170-180 (list_items area) ===" fd "knowledge_item_service.py" --type f -x sed -n '170,180p' {} echo -e "\n=== Context around lines 310-320 (get_available_sources area) ===" fd "knowledge_item_service.py" --type f -x sed -n '310,320p' {} echo -e "\n=== Context around lines 382-390 (_transform_source_to_item area) ===" fd "knowledge_item_service.py" --type f -x sed -n '382,390p' {}

Repository: coleam00/Archon

Length of output: 2862

total_words field lookup will always return 0 if the source contains the actual database column total_word_count

_transform_source_to_item uses source.get("total_words", 0) (lines 384–386), but all data producers create the "total_word_count" key per the database schema. The unchanged list_items path (line 173) correctly uses source.get("total_word_count", 0), and get_available_sources (line 312) defensively falls back to total_word_count. This method does not—meaning word_count and estimated_pages will silently be 0 in single-item responses.

🐛 Proposed fix to align with the defensive pattern in `get_available_sources`

- "word_count": source.get("total_words", 0), - "estimated_pages": round(source.get("total_words", 0) / 250, 1), # Average book page = 250 words - "pages_tooltip": f"{round(source.get('total_words', 0) / 250, 1)} pages (≈ {source.get('total_words', 0):,} words)", + "word_count": source.get("total_words", source.get("total_word_count", 0)), + "estimated_pages": round(source.get("total_words", source.get("total_word_count", 0)) / 250, 1), + "pages_tooltip": f"{round(source.get('total_words', source.get('total_word_count', 0)) / 250, 1)} pages (≈ {source.get('total_words', source.get('total_word_count', 0)):,} words)",

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

"word_count": source.get("total_words", 0),

"estimated_pages": round(

source.get("total_words", 0) / 250, 1

), # Average book page = 250 words

"estimated_pages": round(source.get("total_words", 0) / 250, 1), # Average book page = 250 words

"pages_tooltip": f"{round(source.get('total_words', 0) / 250, 1)} pages (≈ {source.get('total_words', 0):,} words)",

"word_count": source.get("total_words", source.get("total_word_count", 0)),

"estimated_pages": round(source.get("total_words", source.get("total_word_count", 0)) / 250, 1),

"pages_tooltip": f"{round(source.get('total_words', source.get('total_word_count', 0)) / 250, 1)} pages (≈ {source.get('total_words', source.get('total_word_count', 0)):,} words)",

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/knowledge/knowledge_item_service.py` around lines 384 - 386, _transform_source_to_item currently reads total words from source.get("total_words", 0) which misses the DB column "total_word_count"; change the lookup to defensively prefer "total_word_count" with a fallback to "total_words" (e.g. total_words = source.get("total_word_count", source.get("total_words", 0))) and then use that total_words variable when setting "word_count", "estimated_pages", and "pages_tooltip" so single-item responses reflect the real count.

coderabbitai · 2026-02-22T18:56:26Z

+    async def _check_needs_revectorization(self, source: dict[str, Any]) -> bool:
+        """Check if re-vectorization is needed by comparing current settings with stored provenance."""
+        try:
+            from ..credential_service import credential_service
+
+            stored_embedding_model = source.get("embedding_model")
+            stored_embedding_provider = source.get("embedding_provider")
+            stored_vectorizer_settings = source.get("vectorizer_settings") or {}
+
+            if not stored_embedding_model:
+                return False
+
+            current_embedding_model = await credential_service.get_credential("EMBEDDING_MODEL")
+            current_embedding_provider_config = await credential_service.get_active_provider("embedding")
+            current_embedding_provider = current_embedding_provider_config.get("provider", "openai")
+
+            if current_embedding_model and stored_embedding_model != current_embedding_model:
+                return True
+
+            if stored_embedding_provider and stored_embedding_provider != current_embedding_provider:
+                return True
+
+            current_use_contextual = await credential_service.get_credential("USE_CONTEXTUAL_EMBEDDINGS", False)
+            stored_use_contextual = stored_vectorizer_settings.get("use_contextual", False)
+            if current_use_contextual != stored_use_contextual:
+                return True
+
+            current_chunk_size = await credential_service.get_credential("CHUNK_SIZE", 512)
+            stored_chunk_size = stored_vectorizer_settings.get("chunk_size", 512)
+            if current_chunk_size != stored_chunk_size:
+                return True
+
+            return False
+
+        except Exception:
+            return False


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd credential_service -t f

Repository: coleam00/Archon

Length of output: 153

🏁 Script executed:

find . -name "*credential*service*" -type f

Repository: coleam00/Archon

Length of output: 157

🏁 Script executed:

rg "def get_credential" -A 10 -B 2

Repository: coleam00/Archon

Length of output: 6762

🏁 Script executed:

rg "async def get_credential" -A 15 -B 2

Repository: coleam00/Archon

Length of output: 9540

🏁 Script executed:

cat -n python/src/server/services/credential_service.py | head -100

Repository: coleam00/Archon

Length of output: 4162

🏁 Script executed:

rg "load_all_credentials" -A 30 python/src/server/services/credential_service.py

Repository: coleam00/Archon

Length of output: 5482

🏁 Script executed:

rg "USE_CONTEXTUAL_EMBEDDINGS|CHUNK_SIZE" -B 2 -A 2

Repository: coleam00/Archon

Length of output: 12416

🏁 Script executed:

head -50 python/src/server/services/knowledge/knowledge_item_service.py

Repository: coleam00/Archon

Length of output: 1517

🏁 Script executed:

rg "logfire|logger" python/src/server/services/knowledge/knowledge_item_service.py | head -20

Repository: coleam00/Archon

Length of output: 1251

_check_needs_revectorization: Type coercion false positives and silent exception handler

Two correctness issues:

1. Type mismatch causes false positives (lines 449–457)

Credentials are stored in the database as strings ('false', '512'), but the comparison defaults are native Python types. When a credential is missing from the database, get_credential() returns the default (False bool or 512 int), which will never equal the string value stored in the database or vectorizer_settings:

# If CHUNK_SIZE is in DB: '512' (str) != 512 (int default) → True (incorrectly flagged) # If CHUNK_SIZE is missing: 512 (int default) != '512' (str from vectorizer_settings) → True (incorrectly flagged)

Pass string defaults to match the storage format:

current_use_contextual = await credential_service.get_credential("USE_CONTEXTUAL_EMBEDDINGS", "false") stored_use_contextual = stored_vectorizer_settings.get("use_contextual", "false") if current_use_contextual != stored_use_contextual: return True current_chunk_size = await credential_service.get_credential("CHUNK_SIZE", "512") stored_chunk_size = stored_vectorizer_settings.get("chunk_size", "512") if current_chunk_size != stored_chunk_size: return True

2. Silent except Exception: return False violates logging guidelines (lines 461–462)

Swallowing all exceptions (including ImportError from the deferred import and credential service failures) means errors are invisible. Log the error before returning:

except Exception as e: safe_logfire_error( f"Failed to check revectorization needs | source_id={source.get('source_id')} | error={str(e)}", exc_info=True, ) return False

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/knowledge/knowledge_item_service.py` around lines 427 - 462, In _check_nevectorization (async def _check_needs_revectorization) change the default types passed into credential_service.get_credential to match stored string formats (e.g., use "false" for USE_CONTEXTUAL_EMBEDDINGS and "512" for CHUNK_SIZE) and compare against stored_vectorizer_settings values also read as strings to avoid type-coercion false positives; also replace the bare except block that silently returns False with logging the exception (use safe_logfire_error or the project logger, include source.get("source_id") and exc_info=True) before returning False so import/credential errors are visible.

BREAKING CHANGE / HEAVILY IN DEVELOPMENT - NOT FOR PRODUCTION USE This commit adds a new restartable RAG ingestion pipeline with the following features: - New database tables for state tracking: archon_document_blobs, archon_chunks, archon_embedding_sets, archon_embeddings, archon_summaries - Each pipeline stage (download, chunk, embed, summarize) is now separable - Full metadata tracking: embedder_id, embedder_version, embedder_config for embeddings; summarizer_model_id, prompt_hash, style for summaries - Checkpointing: each stage has explicit states (pending/in_progress/done/failed) - Support for multiple embedders and summarization styles per source - Health check utilities for validating pipeline state New services: - services/ingestion/ingestion_state_service.py - State management - services/ingestion/pipeline_orchestrator.py - Main orchestration - services/ingestion/embedding_worker.py - Async embedding processor - services/ingestion/summary_worker.py - Async summarization processor - services/ingestion/health_check.py - Health validation Migration: migration/0.1.0/014_add_pipeline_tables.sql Documentation: docs/ADRs/001-restartable-rag-pipeline.md The old pipeline remains unchanged for backwards compatibility.

coderabbitai

Actionable comments posted: 14

🧹 Nitpick comments (10)

python/src/server/services/ingestion/summary_worker.py (2)
112-119: Only first 3 content parts used — most source content silently dropped for multi-blob sources.

content_parts[:3] limits summarization input to chunks from (at most) 3 blobs. For sources with many documents, the summary will be based on a small, possibly unrepresentative subset. This is a significant design decision that should at least be documented or logged.

Consider logging a warning when content is truncated so operators can tune the limit.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/ingestion/summary_worker.py` around lines 112 -
119, The current loop collects chunk contents into content_parts but then
restricts input to content_parts[:3], silently dropping most content; update the
logic in summary_worker.py to use all gathered parts (remove the hard slice) or
implement a deterministic selection (e.g., sample/first N across blobs) instead
of content_parts[:3], and add a warning log when you truncate combined_content
to 25000 characters so operators know data was dropped; refer to content_parts,
content_parts[:3], combined_content, and the async get_chunks_by_blob call on
state_service to locate and change the code and add the logging.
100-100: Missing type annotation for summary parameter.

Similar to the EmbeddingWorker issue, summary should be typed as Summary.
-    async def _process_summary(self, summary) -> bool:
+    async def _process_summary(self, summary: Summary) -> bool:
As per coding guidelines, python/src/**/*.py: "Use Mypy for type checking in Python to ensure type safety".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/ingestion/summary_worker.py` at line 100, The
_process_summary method signature lacks a type annotation for the summary
parameter; update the signature of async def _process_summary(self, summary) ->
bool to annotate summary as Summary (e.g., async def _process_summary(self,
summary: Summary) -> bool), and add or ensure the appropriate import for Summary
is present at the top of the file (matching the Summary class/type used
elsewhere) so Mypy can validate it.
python/src/server/services/ingestion/embedding_worker.py (3)
117-128: retry_failed_embeddings bypasses state_service for the query, inconsistent with the rest of the class.

All other methods delegate DB operations to self.state_service, but this method directly queries self.supabase. Consider adding a method to IngestionStateService (e.g., get_failed_embedding_sets) and using it here for consistency. Also, if update_embedding_set_status fails mid-loop, there's no error handling — partial resets could go unnoticed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/ingestion/embedding_worker.py` around lines 117 -
128, The method retry_failed_embeddings directly queries self.supabase instead
of using the IngestionStateService like the rest of the class; add a method on
IngestionStateService (e.g., get_failed_embedding_sets or similar) that returns
failed embedding set rows and use it inside retry_failed_embeddings, then
iterate those rows and call
self.state_service.update_embedding_set_status(uuid.UUID(row["id"]),
EmbeddingStatus.PENDING) for each; also wrap the per-row update in try/except so
failures are logged/collected and do not silently leave partial resets (return
counts of successfully reset items and any failures).
67-67: Missing type annotation for embedding_set parameter.

Should be typed as EmbeddingSet for clarity and MyPy compliance.
-    async def _process_embedding_set(self, embedding_set, provider: str | None = None) -> bool:
+    async def _process_embedding_set(self, embedding_set: EmbeddingSet, provider: str | None = None) -> bool:
This requires importing EmbeddingSet from ingestion_state_service.

As per coding guidelines, python/src/**/*.py: "Use Mypy for type checking in Python to ensure type safety".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/ingestion/embedding_worker.py` at line 67, Update
the method signature of _process_embedding_set to type the first parameter as
EmbeddingSet (e.g., async def _process_embedding_set(self, embedding_set:
EmbeddingSet, provider: str | None = None) -> bool) and add an import for
EmbeddingSet from ingestion_state_service at the top of the file; this aligns
with MyPy and the project's typing conventions and keeps the existing provider
and return types unchanged.
56-63: Error logging missing stack traces and context per coding guidelines.

The except block logs the error string but doesn't preserve the full stack trace. Per coding guidelines, errors should include exc_info=True and context about what was being attempted.
Proposed fix
             except Exception as e:
-                safe_logfire_error(f"Error processing embedding set {embedding_set.id}: {e}")
+                logger.error(
+                    f"Error processing embedding set {embedding_set.id}: {e}",
+                    exc_info=True,
+                )
                 await self.state_service.update_embedding_set_status(
The same applies to line 109.

As per coding guidelines, python/src/**/*.py: "Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/ingestion/embedding_worker.py` around lines 56 -
63, The except block in embedding_worker.py is only logging str(e) and missing
stack traces and operation context; update the safe_logfire_error call(s) (the
ones around processing an embedding set and the other exception later in this
file) to include exc_info=True and a clearer message describing the attempted
operation (e.g., "Error processing embedding set {embedding_set.id} while
[operation]"). Also ensure the error_info passed to
state_service.update_embedding_set_status (and related calls) includes the same
contextual fields (operation/stage and full error string) so full traceability
is preserved; reference safe_logfire_error, update_embedding_set_status, and
EmbeddingStatus when making the changes.
python/src/server/services/ingestion/ingestion_state_service.py (2)
381-394: Large unbounded batch insert for embeddings may hit payload size limits.

store_embeddings inserts all chunk embeddings in a single request. For large documents, this could mean thousands of rows each containing high-dimensional float vectors (e.g., 1536 floats). PostgREST and Supabase have request payload size limits that this could exceed.

Consider batching the inserts (e.g., in groups of 100–500 records).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/ingestion/ingestion_state_service.py` around lines
381 - 394, The store_embeddings method currently inserts all records at once
which can exceed Supabase/PostgREST payload limits; change store_embeddings to
split chunk_embeddings into batches (e.g., batch_size = 100–500), build records
per batch (same shape as current records), and call
self.supabase.table("archon_embeddings").insert(batch_records).execute() for
each batch (awaiting each call), summing the number of inserted rows from each
response.data to return the total; ensure you handle empty/None response.data
and surface/raise errors if any batch insert fails.
217-217: Token count estimation is a rough heuristic.

len(content.split()) * 4 // 3 approximates token count from word count. This can be significantly off for code, URLs, or non-English text. Consider documenting this is an estimate, or using a proper tokenizer if accuracy matters for downstream decisions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/ingestion/ingestion_state_service.py` at line 217,
The current token estimation in ingestion_state_service.py uses a rough
heuristic ("token_count": len(content.split()) * 4 // 3) which is unreliable for
code, URLs, or non‑English text; update the implementation in the
function/method where this appears (search for token_count assignment in
IngestionStateService or related functions) to either (A) replace the heuristic
with a proper tokenizer (e.g., use tiktoken or the project's canonical tokenizer
to compute token count from content) or (B) if you must keep a fast heuristic,
add a clear inline comment and change the field name or metadata (e.g.,
token_count_estimate) and document in the service/class docstring that this is
an estimate and may be inaccurate for certain content types so downstream logic
can treat it accordingly.
python/src/server/services/ingestion/health_check.py (1)
69-69: Magic number heuristic for expected chunk count is fragile and confusing.

sum(1 for _ in blobs) * 10 is equivalent to len(blobs) * 10. The hardcoded 10 assumes each blob produces exactly 10 chunks, which is unlikely to hold. This makes the "low chunk count" warning unreliable — it could be a false positive or miss real issues depending on document sizes.

Consider either removing this heuristic, making the multiplier configurable, or computing expected counts from actual blob content lengths and chunk sizes.
-        total_expected_chunks = sum(1 for _ in blobs) * 10
+        total_expected_chunks = len(blobs) * 10  # TODO: derive from actual content length / chunk_size
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/ingestion/health_check.py` at line 69, The line
computing total_expected_chunks using a magic multiplier is fragile; replace
sum(1 for _ in blobs) * 10 with a robust approach in health_check.py: either (A)
compute expected chunks per blob from actual blob content lengths (e.g., use
each blob's byte/character length and the chunk size constant to calculate
chunks as ceil(length / CHUNK_SIZE) and sum those) or (B) make the multiplier
configurable by introducing an EXPECTED_CHUNKS_PER_BLOB constant/setting and use
len(blobs) * EXPECTED_CHUNKS_PER_BLOB; update references to
total_expected_chunks accordingly and ensure you use the blob attribute that
holds length (e.g., blob.size or blob.length) and the existing CHUNK_SIZE
constant if present.
python/src/server/services/ingestion/pipeline_orchestrator.py (1)
105-110: Progress callback value min(50, total_chunks) doesn't behave like a percentage.

min(50, total_chunks) clamps the progress value at 50, which appears intended as a percentage (chunking = first 50%). But once total_chunks > 50, progress stays stuck at 50 for every subsequent document. Consider computing actual progress as a proportion of total documents processed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/ingestion/pipeline_orchestrator.py` around lines
105 - 110, The progress value is clamped with min(50, total_chunks) which isn't
a percentage; compute a proportional percentage for the "chunking" phase
instead. Replace the hard clamp with something like chunk_percent = int(50 *
processed_chunks / max(1, total_chunks)) (or if you track processed_documents
use processed_documents in place of processed_chunks), then call
progress_callback("chunking", chunk_percent, f"Processed {total_blobs}
documents, {total_chunks} chunks") so the progress moves from 0→50 as chunking
progresses; reference progress_callback, total_chunks and total_blobs (and
processed_chunks/processed_documents variable you already maintain) to locate
and apply the change.
migration/0.1.0/014_add_pipeline_tables.sql (1)
23-24: updated_at columns have no auto-update trigger

archon_document_blobs.updated_at and archon_embedding_sets.updated_at are initialized to NOW() at INSERT but will silently stale on UPDATE unless the application always passes the column explicitly. Consider adding a trigger:
CREATE OR REPLACE FUNCTION set_updated_at()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN
  NEW.updated_at = NOW();
  RETURN NEW;
END;
$$;

CREATE TRIGGER trg_archon_document_blobs_updated_at
BEFORE UPDATE ON archon_document_blobs
FOR EACH ROW EXECUTE FUNCTION set_updated_at();

-- repeat for archon_embedding_sets, archon_summaries
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@migration/0.1.0/014_add_pipeline_tables.sql` around lines 23 - 24, The
migration creates updated_at columns that default to NOW() but never auto-update
on UPDATE; add a plpgsql trigger function (e.g. set_updated_at) that sets
NEW.updated_at = NOW() and attach BEFORE UPDATE triggers (e.g.
trg_archon_document_blobs_updated_at, trg_archon_embedding_sets_updated_at,
trg_archon_summaries_updated_at) to the tables archon_document_blobs,
archon_embedding_sets, and archon_summaries so updated_at is refreshed
automatically on row updates.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/ADRs/001-restartable-rag-pipeline.md`:
- Line 3: Update the ADR status from "Proposed" to "Accepted" (or "Superseded")
and add a short note in the ADR body stating that the migration
014_add_pipeline_tables.sql and accompanying services are being shipped in this
PR and therefore the decision is implemented; also add a line that ADR-002
supersedes this ADR (or mark this ADR as superseded by ADR-002) so the document
accurately reflects current implementation and provenance.

In `@migration/0.1.0/014_add_pipeline_tables.sql`:
- Around line 77-88: The archon_embeddings table currently hardcodes vector
dimensionality with "vector VECTOR(1536)" which breaks multi-embedder support
and contradicts archon_embedding_sets.embedding_dimension; update the schema so
the embedding vector column is dimension-agnostic (use a dimensionless VECTOR
column) or implement one of the suggested alternatives (per-dimension tables or
restrict system to 1536 and remove embedding_dimension) — specifically modify
the archon_embeddings table definition (the vector column) and ensure any
related index/constraints referencing archon_embedding_sets.embedding_dimension
are adjusted to match the chosen approach so different embedding dimensions can
be stored without insert failures.
- Line 67: The UNIQUE constraints (UNIQUE(source_id, embedder_id,
embedder_version) and UNIQUE(source_id, summarizer_model_id, prompt_hash,
style)) are ineffective when embedder_version or prompt_hash can be NULL; change
them to enforce uniqueness by either making those columns NOT NULL with a
sentinel default (e.g., set embedder_version and prompt_hash to NOT NULL DEFAULT
'' and adjust any inserts) or replace the UNIQUE constraints with explicit
unique indexes using COALESCE (for example create unique index on (source_id,
embedder_id, COALESCE(embedder_version, '')) and similarly on (source_id,
summarizer_model_id, COALESCE(prompt_hash, ''), style)) so NULLs are treated as
the sentinel and true uniqueness is enforced.
- Around line 100-107: The archon_summaries table schema is incorrect for
pipeline usage: change summary_content to allow inserts before content is
generated (make summary_content nullable or add DEFAULT '') and add DEFAULT
NOW() to updated_at to match other tables; update the CREATE TABLE definition
for archon_summaries so the column summary_content is not NOT NULL (or has
DEFAULT '') and updated_at TIMESTAMPTZ has DEFAULT NOW() to avoid NOT NULL
violations when inserting queued/pending summaries.
- Around line 47-48: The index creation for idx_archon_chunks_source_id is
invalid because archon_chunks has no source_id column and PostgreSQL forbids
listing a column both as key and in INCLUDE; remove or replace this statement:
drop the CREATE INDEX IF NOT EXISTS idx_archon_chunks_source_id ON
archon_chunks(blob_id, source_id) INCLUDE (source_id) and either delete it
entirely or create/ensure only idx_archon_chunks_blob_id exists to support
queries that must follow blob_id → archon_document_blobs.source_id (join through
archon_document_blobs to access source_id) so migrations run without error.

In `@python/src/server/services/ingestion/embedding_worker.py`:
- Around line 89-95: The loop is using zip(..., strict=False) which can silently
misalign chunk_ids and result.embeddings and is coupled with an unused
enumerate; change the logic in embedding_worker.py to first verify lengths match
and skip the entire batch on mismatch (log via safe_logfire_error with details)
to avoid storing corrupted data, then iterate with a plain zip(chunk_ids,
result.embeddings) (no strict flag) and append only non-empty embeddings to
successful_embeddings; remove the unused enumerate/_i and ensure you return or
continue when lengths differ so no partial/misaligned pairs are persisted.

In `@python/src/server/services/ingestion/health_check.py`:
- Around line 177-196: The check_all_sources method currently calls
check_source_health for each source without per-source error handling, so a
raised exception aborts the whole batch; update check_all_sources to wrap the
await self.check_source_health(source["source_id"]) call in a try/except, catch
Exception as e, and for failures append a result dict for that source (include
source_id, healthy: False, and an error field with the exception message or
traceback) so processing continues for remaining sources and the final returned
structure (total_sources, healthy_sources, unhealthy_sources, results) reflects
both successes and detailed failures; keep existing variables (sources_response,
results, healthy_count) and ensure unhealthy_sources calculation uses the
appended failure entries.

In `@python/src/server/services/ingestion/ingestion_state_service.py`:
- Line 164: Replace the generic raises in create_document_blob,
create_embedding_set, and create_summary with a more specific exception that
includes context and relevant IDs (e.g., raise IngestionError(f"Failed to create
document blob for source_id={source_id} blob_id={blob_id}: {e}") from e); if
IngestionError doesn't exist add a simple custom subclass of Exception
(IngestionError) near other service errors and import it, and ensure you chain
the original exception (using "from e") or include str(e) so the full stack
trace and underlying error are preserved in the logs and error flow.
- Around line 494-508: In update_source_pipeline_status, avoid the
redundant/unwanted overwrite of pipeline_error and use the same None-check
pattern as elsewhere: change the initial check to if error_info is not None:
update_data["pipeline_error"] = error_info, and remove the unconditional
assignment inside the status == "error" branch so pipeline_error is only written
when a non-None error_info is provided; keep the pipeline_completed_at
assignment for status == "complete" and then call the supabase update with
update_data as before.

In `@python/src/server/services/ingestion/pipeline_orchestrator.py`:
- Around line 96-100: The current start_offset logic in create_chunks (used
after smart_chunk_text_async) incorrectly assumes chunks are contiguous;
instead, compute accurate offsets by finding each chunk's substring position in
the original content: iterate chunks and call original_content.find(chunk,
search_start), set start_offset to the found index, set search_start =
found_index + len(chunk) to handle repeated content, and if find returns -1
(trimmed/modified chunk or overlap) fall back to a sentinel (e.g., None or -1)
or skip offset assignment; update create_chunks (and any callers that rely on
start_offsets) to use these real positions or remove/mark offsets as approximate
if you prefer not to compute them.
- Around line 144-178: The _queue_embedding function currently logs failures and
returns None (violating the "never return None to indicate failure" rule);
change it to raise a descriptive exception instead of returning None so failures
propagate to the pipeline error handler; capture the original exception from the
try block, call safe_logfire_error with context, then re-raise (or raise a new
exception that wraps the original) so callers get a raised error; apply the same
change to _queue_summary; reference functions: _queue_embedding, _queue_summary,
and state_service.create_embedding_set (and the equivalent create_summary_set)
when making the change.

In `@python/src/server/services/ingestion/summary_worker.py`:
- Around line 149-178: The _generate_summary function currently calls
get_llm_client() with no provider which can route requests to the wrong backend
for the given model_id; update _generate_summary to resolve the provider (e.g.,
via a helper like resolve_provider_from_model(model_id) or by reading the
source's/summary record config) and pass that provider into
get_llm_client(provider=resolved_provider) before creating the chat completion;
keep the rest of the logic (prompt selection via SUMMARY_PROMPTS/SummaryStyle
and response handling with extract_message_text) the same.
- Line 156: The current conversion SummaryStyle(style) can raise ValueError for
unknown strings causing the fallback in SUMMARY_PROMPTS.get to never run; change
the logic around prompt_template to safely map style to a SummaryStyle by
attempting conversion in a try/except (catch ValueError) or checking membership
(e.g., SummaryStyle.__members__) and default to SummaryStyle.OVERVIEW on
failure, then use that safe enum value to index SUMMARY_PROMPTS (i.e., ensure
you compute a safe_style variable from style before doing
SUMMARY_PROMPTS[safe_style] and assign it to prompt_template).
- Around line 81-96: The code calls
self.state_service.update_summary(summary.id, ...) without guarding for
summary.id possibly being None (Summary.id can be None); before calling
update_summary in the exception handler (and other places around lines handling
failures), add a guard like in EmbeddingWorker: if summary.id is None then
safe_logfire_error a clear message including summary (or summary.external_id),
increment results["failed"], and skip calling update_summary/update DB (do not
pass None to update_summary); otherwise call
self.state_service.update_summary(str(summary.id), SummaryStatus.FAILED,
error_info=...) as before. Ensure the same guard is applied wherever
update_summary is invoked for summary.id (including after _process_summary
failures).

---

Nitpick comments:
In `@migration/0.1.0/014_add_pipeline_tables.sql`:
- Around line 23-24: The migration creates updated_at columns that default to
NOW() but never auto-update on UPDATE; add a plpgsql trigger function (e.g.
set_updated_at) that sets NEW.updated_at = NOW() and attach BEFORE UPDATE
triggers (e.g. trg_archon_document_blobs_updated_at,
trg_archon_embedding_sets_updated_at, trg_archon_summaries_updated_at) to the
tables archon_document_blobs, archon_embedding_sets, and archon_summaries so
updated_at is refreshed automatically on row updates.

In `@python/src/server/services/ingestion/embedding_worker.py`:
- Around line 117-128: The method retry_failed_embeddings directly queries
self.supabase instead of using the IngestionStateService like the rest of the
class; add a method on IngestionStateService (e.g., get_failed_embedding_sets or
similar) that returns failed embedding set rows and use it inside
retry_failed_embeddings, then iterate those rows and call
self.state_service.update_embedding_set_status(uuid.UUID(row["id"]),
EmbeddingStatus.PENDING) for each; also wrap the per-row update in try/except so
failures are logged/collected and do not silently leave partial resets (return
counts of successfully reset items and any failures).
- Line 67: Update the method signature of _process_embedding_set to type the
first parameter as EmbeddingSet (e.g., async def _process_embedding_set(self,
embedding_set: EmbeddingSet, provider: str | None = None) -> bool) and add an
import for EmbeddingSet from ingestion_state_service at the top of the file;
this aligns with MyPy and the project's typing conventions and keeps the
existing provider and return types unchanged.
- Around line 56-63: The except block in embedding_worker.py is only logging
str(e) and missing stack traces and operation context; update the
safe_logfire_error call(s) (the ones around processing an embedding set and the
other exception later in this file) to include exc_info=True and a clearer
message describing the attempted operation (e.g., "Error processing embedding
set {embedding_set.id} while [operation]"). Also ensure the error_info passed to
state_service.update_embedding_set_status (and related calls) includes the same
contextual fields (operation/stage and full error string) so full traceability
is preserved; reference safe_logfire_error, update_embedding_set_status, and
EmbeddingStatus when making the changes.

In `@python/src/server/services/ingestion/health_check.py`:
- Line 69: The line computing total_expected_chunks using a magic multiplier is
fragile; replace sum(1 for _ in blobs) * 10 with a robust approach in
health_check.py: either (A) compute expected chunks per blob from actual blob
content lengths (e.g., use each blob's byte/character length and the chunk size
constant to calculate chunks as ceil(length / CHUNK_SIZE) and sum those) or (B)
make the multiplier configurable by introducing an EXPECTED_CHUNKS_PER_BLOB
constant/setting and use len(blobs) * EXPECTED_CHUNKS_PER_BLOB; update
references to total_expected_chunks accordingly and ensure you use the blob
attribute that holds length (e.g., blob.size or blob.length) and the existing
CHUNK_SIZE constant if present.

In `@python/src/server/services/ingestion/ingestion_state_service.py`:
- Around line 381-394: The store_embeddings method currently inserts all records
at once which can exceed Supabase/PostgREST payload limits; change
store_embeddings to split chunk_embeddings into batches (e.g., batch_size =
100–500), build records per batch (same shape as current records), and call
self.supabase.table("archon_embeddings").insert(batch_records).execute() for
each batch (awaiting each call), summing the number of inserted rows from each
response.data to return the total; ensure you handle empty/None response.data
and surface/raise errors if any batch insert fails.
- Line 217: The current token estimation in ingestion_state_service.py uses a
rough heuristic ("token_count": len(content.split()) * 4 // 3) which is
unreliable for code, URLs, or non‑English text; update the implementation in the
function/method where this appears (search for token_count assignment in
IngestionStateService or related functions) to either (A) replace the heuristic
with a proper tokenizer (e.g., use tiktoken or the project's canonical tokenizer
to compute token count from content) or (B) if you must keep a fast heuristic,
add a clear inline comment and change the field name or metadata (e.g.,
token_count_estimate) and document in the service/class docstring that this is
an estimate and may be inaccurate for certain content types so downstream logic
can treat it accordingly.

In `@python/src/server/services/ingestion/pipeline_orchestrator.py`:
- Around line 105-110: The progress value is clamped with min(50, total_chunks)
which isn't a percentage; compute a proportional percentage for the "chunking"
phase instead. Replace the hard clamp with something like chunk_percent = int(50
* processed_chunks / max(1, total_chunks)) (or if you track processed_documents
use processed_documents in place of processed_chunks), then call
progress_callback("chunking", chunk_percent, f"Processed {total_blobs}
documents, {total_chunks} chunks") so the progress moves from 0→50 as chunking
progresses; reference progress_callback, total_chunks and total_blobs (and
processed_chunks/processed_documents variable you already maintain) to locate
and apply the change.

In `@python/src/server/services/ingestion/summary_worker.py`:
- Around line 112-119: The current loop collects chunk contents into
content_parts but then restricts input to content_parts[:3], silently dropping
most content; update the logic in summary_worker.py to use all gathered parts
(remove the hard slice) or implement a deterministic selection (e.g.,
sample/first N across blobs) instead of content_parts[:3], and add a warning log
when you truncate combined_content to 25000 characters so operators know data
was dropped; refer to content_parts, content_parts[:3], combined_content, and
the async get_chunks_by_blob call on state_service to locate and change the code
and add the logging.
- Line 100: The _process_summary method signature lacks a type annotation for
the summary parameter; update the signature of async def _process_summary(self,
summary) -> bool to annotate summary as Summary (e.g., async def
_process_summary(self, summary: Summary) -> bool), and add or ensure the
appropriate import for Summary is present at the top of the file (matching the
Summary class/type used elsewhere) so Mypy can validate it.

coderabbitai · 2026-02-22T21:49:36Z

@@ -0,0 +1,75 @@
+# ADR-001: Restartable RAG Ingestion Pipeline
+
+## Status: Proposed


⚠️ Potential issue | 🟡 Minor

ADR status should reflect that implementation is in progress

The status Proposed is inconsistent — the corresponding migration (014_add_pipeline_tables.sql) and services are being shipped in this very PR. Standard ADR conventions expect the status to be updated to Accepted (or Superseded) before the decision is implemented. Additionally, per the PR context, ADR-002 supersedes this document, which should be noted here.

📝 Suggested update

-## Status: Proposed +## Status: Accepted (Superseded by ADR-002)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

## Status: Proposed

## Status: Accepted (Superseded by ADR-002)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/ADRs/001-restartable-rag-pipeline.md` at line 3, Update the ADR status from "Proposed" to "Accepted" (or "Superseded") and add a short note in the ADR body stating that the migration 014_add_pipeline_tables.sql and accompanying services are being shipped in this PR and therefore the decision is implemented; also add a line that ADR-002 supersedes this ADR (or mark this ADR as superseded by ADR-002) so the document accurately reflects current implementation and provenance.

coderabbitai · 2026-02-22T21:49:36Z

+CREATE INDEX IF NOT EXISTS idx_archon_chunks_source_id ON archon_chunks(blob_id, source_id) 
+    INCLUDE (source_id);


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

fd 'migration' --type d

Repository: coleam00/Archon

Length of output: 118

🏁 Script executed:

fd '014_add_pipeline_tables.sql' -type f

Repository: coleam00/Archon

Length of output: 229

🏁 Script executed:

find . -path "*/migration/*" -name "*.sql" | head -20

Repository: coleam00/Archon

Length of output: 943

🏁 Script executed:

cat -n migration/0.1.0/014_add_pipeline_tables.sql | head -60

Repository: coleam00/Archon

Length of output: 3203

🏁 Script executed:

# Check if there are any other references to source_id in the migration file cat -n migration/0.1.0/014_add_pipeline_tables.sql | grep -i source_id

Repository: coleam00/Archon

Length of output: 904

🏁 Script executed:

# Verify the complete archon_chunks table definition to ensure no source_id exists sed -n '34,44p' migration/0.1.0/014_add_pipeline_tables.sql

Repository: coleam00/Archon

Length of output: 457

🏁 Script executed:

# Verify PostgreSQL behavior: check if referencing a non-existent column in an index causes an error # Also check if having a column in both key and INCLUDE is valid grep -A5 "INCLUDE" migration/0.1.0/014_add_pipeline_tables.sql

Repository: coleam00/Archon

Length of output: 296

🏁 Script executed:

# Check if this migration has been run or if there are any other migrations that might add source_id to archon_chunks grep -r "source_id" migration/0.1.0/ | grep -i "archon_chunks"

Repository: coleam00/Archon

Length of output: 195

🏁 Script executed:

# Verify PostgreSQL index rules by checking documentation or testing the syntax # Let's also check complete_setup.sql to see how the base schema might define things head -250 migration/complete_setup.sql | tail -100 | grep -A10 -B10 "archon_chunks"

Repository: coleam00/Archon

Length of output: 41

🏁 Script executed:

# Check if there are any ALTER TABLE statements that might add source_id to archon_chunks grep -r "ALTER TABLE archon_chunks" migration/

Repository: coleam00/Archon

Length of output: 41

Migration will fail: source_id does not exist in archon_chunks

The archon_chunks table (lines 34–44) contains only id, blob_id, chunk_index, start_offset, end_offset, content, token_count, created_at, and a UNIQUE(blob_id, chunk_index) constraint. The index definition on lines 47–48 references source_id, which does not exist in this table. PostgreSQL will raise an error during migration execution: ERROR: column "source_id" does not exist.

Additionally, if source_id existed, PostgreSQL would reject having it appear in both the key list and the INCLUDE clause (columns in INCLUDE cannot also be key columns).

To reach source_id, queries must traverse through blob_id → archon_document_blobs.source_id. Drop this index or replace it with the already-defined idx_archon_chunks_blob_id if the access pattern can be satisfied by that index alone.

Proposed fix

-CREATE INDEX IF NOT EXISTS idx_archon_chunks_source_id ON archon_chunks(blob_id, source_id) - INCLUDE (source_id); +-- blob_id-only index (idx_archon_chunks_blob_id above) already covers this access pattern. +-- No additional index needed here.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@migration/0.1.0/014_add_pipeline_tables.sql` around lines 47 - 48, The index creation for idx_archon_chunks_source_id is invalid because archon_chunks has no source_id column and PostgreSQL forbids listing a column both as key and in INCLUDE; remove or replace this statement: drop the CREATE INDEX IF NOT EXISTS idx_archon_chunks_source_id ON archon_chunks(blob_id, source_id) INCLUDE (source_id) and either delete it entirely or create/ensure only idx_archon_chunks_blob_id exists to support queries that must follow blob_id → archon_document_blobs.source_id (join through archon_document_blobs to access source_id) so migrations run without error.

coderabbitai · 2026-02-22T21:49:37Z

+    total_chunk_count INTEGER DEFAULT 0,
+    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+    UNIQUE(source_id, embedder_id, embedder_version)


⚠️ Potential issue | 🟡 Minor

UNIQUE constraints with nullable columns won't prevent duplicates when the nullable part is NULL

In PostgreSQL, NULL values are considered distinct from each other in B-tree unique indexes. This means:

UNIQUE(source_id, embedder_id, embedder_version) (line 67): two rows with the same source_id/embedder_id but embedder_version = NULL will not violate the constraint.

UNIQUE(source_id, summarizer_model_id, prompt_hash, style) (line 106): same issue with nullable prompt_hash.

If these columns being NULL represents a meaningful "no version" or "no prompt hash" state that should still be unique, use COALESCE in a partial unique index or replace nullability with a sentinel value (e.g., '').

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@migration/0.1.0/014_add_pipeline_tables.sql` at line 67, The UNIQUE constraints (UNIQUE(source_id, embedder_id, embedder_version) and UNIQUE(source_id, summarizer_model_id, prompt_hash, style)) are ineffective when embedder_version or prompt_hash can be NULL; change them to enforce uniqueness by either making those columns NOT NULL with a sentinel default (e.g., set embedder_version and prompt_hash to NOT NULL DEFAULT '' and adjust any inserts) or replace the UNIQUE constraints with explicit unique indexes using COALESCE (for example create unique index on (source_id, embedder_id, COALESCE(embedder_version, '')) and similarly on (source_id, summarizer_model_id, COALESCE(prompt_hash, ''), style)) so NULLs are treated as the sentinel and true uniqueness is enforced.

coderabbitai · 2026-02-22T21:49:37Z

+CREATE TABLE IF NOT EXISTS archon_embeddings (
+    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    chunk_id UUID NOT NULL REFERENCES archon_chunks(id) ON DELETE CASCADE,
+    embedding_set_id UUID NOT NULL REFERENCES archon_embedding_sets(id) ON DELETE CASCADE,
+    vector VECTOR(1536),
+    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+    UNIQUE(chunk_id, embedding_set_id)
+);
+
+CREATE INDEX IF NOT EXISTS idx_archon_embeddings_chunk_id ON archon_embeddings(chunk_id);
+CREATE INDEX IF NOT EXISTS idx_archon_embeddings_set_id ON archon_embeddings(embedding_set_id);
+


⚠️ Potential issue | 🟠 Major

Hardcoded VECTOR(1536) prevents multi-embedder support

The stated design goal is "multiple embedders can coexist for the same source", and archon_embedding_sets.embedding_dimension exists to track the actual dimensionality per embedder. However, archon_embeddings.vector VECTOR(1536) enforces a fixed size — any embedder that produces a different dimension (e.g., nomic-embed-text at 768, text-embedding-3-large at 3072) will fail on INSERT.

Options to consider:

Drop the dimension constraint entirely: vector VECTOR (pgvector supports dimensionless columns, though this sacrifices type safety and requires different IVFFlat/HNSW index strategies).

Use a separate archon_embeddings_{dim} table per dimension family, referenced from archon_embedding_sets.

Constrain the system to 1536-dim only for now and remove embedding_dimension from archon_embedding_sets to avoid the false impression of multi-dimension support.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@migration/0.1.0/014_add_pipeline_tables.sql` around lines 77 - 88, The archon_embeddings table currently hardcodes vector dimensionality with "vector VECTOR(1536)" which breaks multi-embedder support and contradicts archon_embedding_sets.embedding_dimension; update the schema so the embedding vector column is dimension-agnostic (use a dimensionless VECTOR column) or implement one of the suggested alternatives (per-dimension tables or restrict system to 1536 and remove embedding_dimension) — specifically modify the archon_embeddings table definition (the vector column) and ensure any related index/constraints referencing archon_embedding_sets.embedding_dimension are adjusted to match the chosen approach so different embedding dimensions can be stored without insert failures.

coderabbitai · 2026-02-22T21:49:37Z

+    status TEXT NOT NULL DEFAULT 'pending'
+        CHECK (status IN ('pending', 'in_progress', 'done', 'failed')),
+    error_info JSONB,
+    summary_content TEXT NOT NULL,
+    updated_at TIMESTAMPTZ NOT NULL,
+    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+    UNIQUE(source_id, summarizer_model_id, prompt_hash, style)
+);


⚠️ Potential issue | 🟠 Major

Two schema bugs in archon_summaries prevent correct pipeline usage

1. summary_content TEXT NOT NULL — The pipeline creates summary records at pending status before content is available. Without a default or nullable summary_content, an INSERT for a queued-but-not-yet-generated summary will fail. Either make the column nullable or add DEFAULT ''.

2. updated_at TIMESTAMPTZ NOT NULL missing DEFAULT NOW() — This is inconsistent with archon_document_blobs (line 24) and archon_embedding_sets (line 66), which both have DEFAULT NOW(). Any INSERT that omits updated_at will raise a NOT NULL violation.

🐛 Proposed fix

- summary_content TEXT NOT NULL, - updated_at TIMESTAMPTZ NOT NULL, + summary_content TEXT, + updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

status TEXT NOT NULL DEFAULT 'pending'

CHECK (status IN ('pending', 'in_progress', 'done', 'failed')),

error_info JSONB,

summary_content TEXT NOT NULL,

updated_at TIMESTAMPTZ NOT NULL,

created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

UNIQUE(source_id, summarizer_model_id, prompt_hash, style)

);

status TEXT NOT NULL DEFAULT 'pending'

CHECK (status IN ('pending', 'in_progress', 'done', 'failed')),

error_info JSONB,

summary_content TEXT,

updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

UNIQUE(source_id, summarizer_model_id, prompt_hash, style)

);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@migration/0.1.0/014_add_pipeline_tables.sql` around lines 100 - 107, The archon_summaries table schema is incorrect for pipeline usage: change summary_content to allow inserts before content is generated (make summary_content nullable or add DEFAULT '') and add DEFAULT NOW() to updated_at to match other tables; update the CREATE TABLE definition for archon_summaries so the column summary_content is not NOT NULL (or has DEFAULT '') and updated_at TIMESTAMPTZ has DEFAULT NOW() to avoid NOT NULL violations when inserting queued/pending summaries.

coderabbitai · 2026-02-22T21:49:37Z

+                start_offsets = []
+                current_offset = 0
+                for chunk in chunks:
+                    start_offsets.append(current_offset)
+                    current_offset += len(chunk)


⚠️ Potential issue | 🟡 Minor

Start offsets assume chunks are contiguous substrings of the original content, which may not hold.

The offset calculation sums chunk lengths sequentially, but smart_chunk_text_async may produce overlapping chunks, chunks with trimmed whitespace, or chunks that don't partition the original text exactly. This would make start_offset and end_offset (computed in create_chunks) incorrect and misleading for consumers.

If offsets need to be accurate, compute them by searching for each chunk's position in the original content. Otherwise, consider documenting that offsets are approximate, or not computing them at all.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/pipeline_orchestrator.py` around lines 96 - 100, The current start_offset logic in create_chunks (used after smart_chunk_text_async) incorrectly assumes chunks are contiguous; instead, compute accurate offsets by finding each chunk's substring position in the original content: iterate chunks and call original_content.find(chunk, search_start), set start_offset to the found index, set search_start = found_index + len(chunk) to handle repeated content, and if find returns -1 (trimmed/modified chunk or overlap) fall back to a sentinel (e.g., None or -1) or skip offset assignment; update create_chunks (and any callers that rely on start_offsets) to use these real positions or remove/mark offsets as approximate if you prefer not to compute them.

coderabbitai · 2026-02-22T21:49:37Z

+    async def _queue_embedding(
+        self,
+        source_id: str,
+        total_chunks: int,
+        embedder_id: str | None,
+    ):
+        try:
+            rag_settings = await credential_service.get_credentials_by_category("rag_strategy")
+            embedding_provider = rag_settings.get("EMBEDDING_PROVIDER", "openai")
+
+            if not embedder_id:
+                embedder_id = await get_embedding_model(provider=embedding_provider)
+
+            embedding_dimensions = int(rag_settings.get("EMBEDDING_DIMENSIONS", "1536"))
+
+            embedding_config = {
+                "provider": embedding_provider,
+                "dimensions": embedding_dimensions,
+            }
+
+            embedding_set = await self.state_service.create_embedding_set(
+                source_id=source_id,
+                embedder_id=embedder_id,
+                embedder_version=None,
+                embedder_config=embedding_config,
+                total_chunk_count=total_chunks,
+                embedding_dimension=embedding_dimensions,
+            )
+
+            safe_logfire_info(f"Created embedding set {embedding_set.id} for source {source_id}")
+            return embedding_set
+
+        except Exception as e:
+            safe_logfire_error(f"Failed to queue embedding: {e}")
+            return None


⚠️ Potential issue | 🟠 Major

_queue_embedding returns None on failure, violating the "never return None to indicate failure" guideline.

When _queue_embedding fails, it logs and returns None. The caller at line 131 silently accepts this (str(embedding_set.id) if embedding_set else None), meaning the pipeline reports success with embedding_set_id: None — the user has no indication that embedding queueing failed entirely.

Consider either raising the exception (to trigger the pipeline error handler at line 136) or returning an explicit error result that is surfaced in the response.

The same applies to _queue_summary (lines 180–206).

As per coding guidelines, python/src/**/*.py: "never return None to indicate failure".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/pipeline_orchestrator.py` around lines 144 - 178, The _queue_embedding function currently logs failures and returns None (violating the "never return None to indicate failure" rule); change it to raise a descriptive exception instead of returning None so failures propagate to the pipeline error handler; capture the original exception from the try block, call safe_logfire_error with context, then re-raise (or raise a new exception that wraps the original) so callers get a raised error; apply the same change to _queue_summary; reference functions: _queue_embedding, _queue_summary, and state_service.create_embedding_set (and the equivalent create_summary_set) when making the change.

coderabbitai · 2026-02-22T21:49:37Z

+        for summary in pending[:max_batch_size]:
+            try:
+                success = await self._process_summary(summary)
+                if success:
+                    results["processed"] += 1
+                    results["summaries_processed"].append(str(summary.id))
+                else:
+                    results["failed"] += 1
+            except Exception as e:
+                safe_logfire_error(f"Error processing summary {summary.id}: {e}")
+                await self.state_service.update_summary(
+                    summary.id,
+                    SummaryStatus.FAILED,
+                    error_info={"error": str(e), "stage": "summary_processing"},
+                )
+                results["failed"] += 1


⚠️ Potential issue | 🟠 Major

summary.id may be None — no guard before passing to update_summary.

The Summary dataclass has id: uuid.UUID | None = None. If summary.id is None, it's passed directly to self.state_service.update_summary(summary.id, ...) at line 91, which calls .eq("id", str(summary_id)) — str(None) becomes "None", a silent no-op update. The same issue exists at line 101 and 129.

Add a guard similar to what EmbeddingWorker does at line 46:

for summary in pending[:max_batch_size]: + if not summary.id: + results["failed"] += 1 + continue try:

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/summary_worker.py` around lines 81 - 96, The code calls self.state_service.update_summary(summary.id, ...) without guarding for summary.id possibly being None (Summary.id can be None); before calling update_summary in the exception handler (and other places around lines handling failures), add a guard like in EmbeddingWorker: if summary.id is None then safe_logfire_error a clear message including summary (or summary.external_id), increment results["failed"], and skip calling update_summary/update DB (do not pass None to update_summary); otherwise call self.state_service.update_summary(str(summary.id), SummaryStatus.FAILED, error_info=...) as before. Ensure the same guard is applied wherever update_summary is invoked for summary.id (including after _process_summary failures).

coderabbitai · 2026-02-22T21:49:37Z

+    async def _generate_summary(
+        self,
+        source_id: str,
+        content: str,
+        model_id: str,
+        style: str,
+    ) -> str:
+        prompt_template = SUMMARY_PROMPTS.get(SummaryStyle(style), SUMMARY_PROMPTS[SummaryStyle.OVERVIEW])
+        prompt = prompt_template.format(content=content, source_id=source_id)
+
+        async with get_llm_client() as client:
+            response = await client.chat.completions.create(
+                model=model_id,
+                messages=[
+                    {
+                        "role": "system",
+                        "content": "You are a helpful assistant that provides concise library/tool/framework summaries.",
+                    },
+                    {"role": "user", "content": prompt},
+                ],
+            )
+
+            if not response or not response.choices:
+                raise Exception("Empty response from LLM")
+
+            summary_text, _, _ = extract_message_text(response.choices[0])
+            if not summary_text:
+                raise Exception("LLM returned empty content")
+
+            return summary_text.strip()


⚠️ Potential issue | 🟠 Major

get_llm_client() called without specifying a provider — may use wrong backend for the given model_id.

_generate_summary receives a model_id that may belong to a specific provider (e.g., an Anthropic model), but get_llm_client() is called with no provider argument and defaults to whatever is globally configured. This could send the request to the wrong provider, causing failures or unexpected behavior.

Consider resolving the provider from model_id or from the summary record's associated configuration.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/summary_worker.py` around lines 149 - 178, The _generate_summary function currently calls get_llm_client() with no provider which can route requests to the wrong backend for the given model_id; update _generate_summary to resolve the provider (e.g., via a helper like resolve_provider_from_model(model_id) or by reading the source's/summary record config) and pass that provider into get_llm_client(provider=resolved_provider) before creating the chat completion; keep the rest of the logic (prompt selection via SUMMARY_PROMPTS/SummaryStyle and response handling with extract_message_text) the same.

coderabbitai · 2026-02-22T21:49:37Z

+        model_id: str,
+        style: str,
+    ) -> str:
+        prompt_template = SUMMARY_PROMPTS.get(SummaryStyle(style), SUMMARY_PROMPTS[SummaryStyle.OVERVIEW])


⚠️ Potential issue | 🟡 Minor

SummaryStyle(style) will raise ValueError for unrecognized style strings.

If style contains a value not in the SummaryStyle enum, this line raises an unhandled ValueError. The fallback to SUMMARY_PROMPTS[SummaryStyle.OVERVIEW] in the .get() never triggers because the exception occurs in the dict key construction.

Proposed fix

- prompt_template = SUMMARY_PROMPTS.get(SummaryStyle(style), SUMMARY_PROMPTS[SummaryStyle.OVERVIEW]) + try: + style_enum = SummaryStyle(style) + except ValueError: + logger.warning(f"Unknown summary style '{style}', falling back to OVERVIEW") + style_enum = SummaryStyle.OVERVIEW + prompt_template = SUMMARY_PROMPTS[style_enum]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/summary_worker.py` at line 156, The current conversion SummaryStyle(style) can raise ValueError for unknown strings causing the fallback in SUMMARY_PROMPTS.get to never run; change the logic around prompt_template to safely map style to a SummaryStyle by attempting conversion in a try/except (catch ValueError) or checking membership (e.g., SummaryStyle.__members__) and default to SummaryStyle.OVERVIEW on failure, then use that safe enum value to index SUMMARY_PROMPTS (i.e., ensure you compute a safe_style variable from style before doing SUMMARY_PROMPTS[safe_style] and assign it to prompt_template).

Completes the restartable RAG pipeline by adding worker invocation mechanisms and integrating with the existing crawl service. New Components: - ingestion_api.py: 6 API endpoints for worker control and health checks - POST /api/ingestion/process-embeddings - POST /api/ingestion/process-summaries - GET /api/ingestion/health/{source_id} - GET /api/ingestion/health - POST /api/ingestion/retry-failed-embeddings - POST /api/ingestion/retry-failed-summaries Crawl Integration: - Added use_new_pipeline parameter to KnowledgeItemRequest - Integrated pipeline orchestrator in document_storage_operations.py - New pipeline creates blobs, chunks, and queues async jobs Bug Fixes: - Fixed PostgREST query syntax in get_chunks_by_source() - Added batching for large blob queries (URI length limit) - Added url_to_full_document for backward compatibility Testing: - Manual testing completed (see TEST_RESULTS.md) - Pipeline successfully creates 275 blobs, chunks, and embedding sets - Worker infrastructure validated (picks up jobs, processes, handles errors) - All core components proven working Status: Architecture validated, ready for production testing. Config issue: embedding dimension mismatch (easily resolved). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 18

🧹 Nitpick comments (5)

test_new_pipeline.md (1)

337-337: Minor wording nit: "fix" → "resolve" for formal documentation tone.

-**Solution:** Check error_info in database, fix issue, use retry endpoint.
+**Solution:** Check error_info in database, resolve the issue, then use the retry endpoint.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test_new_pipeline.md` at line 337, Update the wording in the sentence that
reads "Check error_info in database, fix issue, use retry endpoint." to use a
more formal tone by replacing "fix" with "resolve" so it reads "Check error_info
in database, resolve issue, use retry endpoint."; locate and edit the exact
fragment in the document (the line containing "Check error_info in database, fix
issue, use retry endpoint.") to make this single-word substitution.

python/src/server/services/crawling/document_storage_operations.py (2)

635-635: Line exceeds 120-character limit.

This line is well over 120 characters. Consider breaking it up for readability and guideline compliance.

♻️ Suggested formatting

-            response = self.supabase_client.table("archon_sources").select("source_id").eq("source_id", source_id).execute()
+            response = (
+                self.supabase_client.table("archon_sources")
+                .select("source_id")
+                .eq("source_id", source_id)
+                .execute()
+            )

As per coding guidelines: "Use Python 3.12 with 120 character line length in backend code."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/crawling/document_storage_operations.py` at line
635, The line calling
self.supabase_client.table("archon_sources").select("source_id").eq("source_id",
source_id).execute() exceeds 120 chars; refactor it by splitting the chained
call across multiple lines (e.g., assign the table call or the query to an
intermediate variable, or use parentheses to break after .table(...).select(...)
and .eq(...)) so the resulting assignment to response in the
DocumentStorageOperations method remains under 120 characters while preserving
the same call chain and behavior.

561-562: safe_logfire_error for "no valid documents" is misleading — this is a normal edge case, not an error.

When all crawl results have empty content, this is an expected condition (e.g., failed fetches), not a system error. Using error-level logging here will generate noise in error monitoring.

♻️ Suggested change

         if not documents:
-            safe_logfire_error(f"No valid documents to process | source_id={original_source_id}")
+            safe_logfire_info(f"No valid documents to process — skipping pipeline | source_id={original_source_id}")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/crawling/document_storage_operations.py` around
lines 561 - 562, Replace the error-level logging call for the empty-documents
edge case with a non-error level and clearer message: in the block checking "if
not documents:" swap safe_logfire_error(...) for a safe_logfire_info(...) or
safe_logfire_warning(...) call and update the message to indicate this is an
expected/no-content result (include original_source_id for context). This change
should be made where the variable original_source_id is available so the log
still records the source but no longer floods error monitoring.

python/src/server/api_routes/ingestion_api.py (1)

18-42: POST trigger endpoints expose control parameters as URL query strings rather than a request body.

In FastAPI, bare scalar parameters on a POST handler resolve to query parameters (e.g. POST /process-embeddings?max_batch_size=20&embedder_id=foo). Request bodies are better suited for sending larger amounts of structured data to your API. Using Pydantic request models would give you schema-validated JSON bodies, keep parameters out of server logs/URLs, and produce cleaner OpenAPI docs.

♻️ Example refactor for process-embeddings

+from pydantic import BaseModel
+
+class ProcessEmbeddingsRequest(BaseModel):
+    max_batch_size: int = Field(default=10, ge=1)
+    embedder_id: str | None = None
+    provider: str | None = None
+
 `@router.post`("/process-embeddings")
 async def process_pending_embeddings(
-    max_batch_size: int = 10,
-    embedder_id: str | None = None,
-    provider: str | None = None,
+    request: ProcessEmbeddingsRequest,
     supabase: Client = Depends(get_supabase_client),
 ):
     worker = get_embedding_worker(supabase)
     result = await worker.process_pending_embeddings(
-        embedder_id=embedder_id,
-        max_batch_size=max_batch_size,
-        provider=provider,
+        embedder_id=request.embedder_id,
+        max_batch_size=request.max_batch_size,
+        provider=request.provider,
     )
     return result

Also applies to: 45-69, 101-117, 120-141

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/src/server/api_routes/ingestion_api.py` around lines 18 - 42, The
endpoint process_pending_embeddings currently declares bare scalar parameters
which FastAPI exposes as query params; replace them with a Pydantic request
model (e.g., ProcessEmbeddingsRequest with fields max_batch_size, embedder_id,
provider) and accept that model as the single body parameter in the
process_pending_embeddings handler, then pass model fields into
get_embedding_worker(...).process_pending_embeddings; update any
validation/defaults on the model rather than function args. Apply the same
refactor to the other POST handlers in this module (the other router.post
endpoints) so they all accept validated Pydantic body models and use those
fields when invoking their respective worker methods.

python/src/server/services/ingestion/ingestion_state_service.py (1)

64-119: Consider using enum types in dataclass fields instead of str.

DocumentBlob.download_status, EmbeddingSet.status, Summary.status, and Summary.style are typed as bare str, losing the type-safety provided by the enums defined just above. MyPy won't catch invalid status strings at assignment time.

♻️ Proposed change (excerpt)

 `@dataclass`
 class DocumentBlob:
     ...
-    download_status: str = "pending"
+    download_status: DownloadStatus = DownloadStatus.PENDING

 `@dataclass`
 class EmbeddingSet:
     ...
-    status: str = "pending"
+    status: EmbeddingStatus = EmbeddingStatus.PENDING

 `@dataclass`
 class Summary:
     ...
-    style: str = "overview"
+    style: SummaryStyle = SummaryStyle.OVERVIEW
-    status: str = "pending"
+    status: SummaryStatus = SummaryStatus.PENDING

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/src/server/services/ingestion/ingestion_state_service.py` around lines
64 - 119, Replace the bare str types for status/style fields with the
appropriate enum types and default enum members: change
DocumentBlob.download_status to use the DownloadStatus enum (default
DownloadStatus.PENDING), EmbeddingSet.status to the EmbeddingSetStatus enum
(default e.g., EmbeddingSetStatus.PENDING), Summary.status to SummaryStatus
(default SummaryStatus.PENDING) and Summary.style to SummaryStyle (default
SummaryStyle.OVERVIEW); update the dataclass annotations and default values to
the enum members and ensure any code that serializes/deserializes these fields
uses .value or casts to/from the enum (e.g., when reading from DB or JSON) so
runtime and MyPy type-safety are preserved.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/src/server/api_routes/ingestion_api.py`:
- Around line 18-42: The POST endpoints (e.g., process_pending_embeddings)
currently have no auth and must require a service/auth key; update their route
signatures to add the existing auth dependency (for example require_service_key)
alongside get_supabase_client so only authorized callers can invoke
state-mutating operations, and apply the same change to the other operational
endpoints mentioned (retry_failed_embeddings, retry_failed_summaries,
process_summaries) to check the service-role token before performing any work;
ensure you import the auth dependency (e.g., from ..utils import
require_service_key) and use it as a FastAPI Depends parameter in each affected
function signature.
- Line 20: The max_batch_size parameter currently lacks lower-bound validation
in both endpoints; update the parameter declaration to use FastAPI's Query with
ge=1 (e.g. change max_batch_size: int = 10 to max_batch_size: int = Query(10,
ge=1)) so invalid values like 0 or negatives are rejected before reaching the
worker layer, and add the Query import from fastapi if not already present;
apply the same change to both occurrences of max_batch_size in this module.

In `@python/src/server/api_routes/knowledge_api.py`:
- Around line 757-759: The code currently treats an empty docs_response.data as
an error by calling await tracker.error(...); instead, treat an empty-but-valid
state as completion: replace await tracker.error("No documents found for
source") with await tracker.complete("No documents found for source") (or a
clearer message) in the function handling the normal summarize flow (the block
checking if not docs_response.data), and make the identical replacement in the
resummarize path (the similar check around lines 932–934). Ensure you only
change the tracker method call and message text so the operation is marked
complete rather than failed.
- Around line 722-723: The background tasks created with asyncio.create_task for
_perform_revectorize_with_progress (and similarly for
_perform_resummarize_with_progress) are being discarded and can be GC'd and
cannot be cancelled; change the creation to retain and manage the Task just like
upload tasks by adding the created Task to the existing active_crawl_tasks set
(or an equivalent managed set), ensure you remove the Task from that set when it
completes, and use that stored Task for cancellation in the stop endpoint so
revectorize/resummarize tasks have proper lifecycle and cancellation support.
- Around line 784-863: The revectorize routine collects per-document failures in
the local errors list but never surfaces them; update the code that calls
tracker.complete (and the outer exception handler) to include the errors list in
the completion payload (e.g., include "errors": errors and "documents_failed":
len(errors)) and log it via logger.error/info as appropriate; also change the
outer except block to call logger.error with exc_info=True in addition to
safe_logfire_error so the stack trace is preserved. Apply the same changes
inside _perform_resummarize_with_progress: ensure its local errors list is
returned/embedded in tracker.complete and logged, and use logger.error(...,
exc_info=True) in its exception handler.
- Around line 1003-1009: The progress payload sent to tracker.complete uses the
wrong key name: change the payload in the tracker.complete call inside the
re-summarization completion block (where total_updated is used) to use
"total_examples": total_updated instead of "examples_total": total_updated so it
matches the progress updates emitted during processing and mirrors the
revectorize path; update the entry in the dictionary passed to tracker.complete
(the call site referencing tracker.complete and the keys
"examples_total"/"examples_processed") accordingly.
- Around line 846-857: Replace the deprecated datetime.utcnow() with
datetime.now(timezone.utc) and add "from datetime import timezone" to imports,
and adjust the final tracker.complete() payload to use
documents_total=total_docs and documents_processed=total_updated (keep
documents_total consistent with mid-loop updates) so the log and counts reflect
total_docs as the total and total_updated as processed; locate the call that
sets "last_vectorized_at" (currently using datetime.utcnow()) and the
tracker.complete(...) invocation that currently passes
documents_total=total_updated to make these changes.
- Line 751: Remove the dead initializer embedding_dimensions = 1536; it is never
used and causes confusion because the code later derives the actual dimension
into embedding_dim (from the returned vector length in the logic around
embedding_dim). Delete the unused assignment and run a quick search for
embedding_dimensions to ensure no other code relies on it; if any remaining
references exist, replace them to use embedding_dim instead.

In `@python/src/server/services/crawling/document_storage_operations.py`:
- Around line 563-569: The returned dicts currently use a Python set for
"urls_stored" which is not JSON-serializable and is inconsistent with an older
code path that omits the key; update all return sites in
document_storage_operations.py (the early-return that returns {"source_id":
original_source_id, "chunk_count": 0, "chunks_stored": 0, "urls_stored": set(),
"url_to_page_id": {}} and the normal return around the other block) to use an
empty list ([]) for "urls_stored" instead of set(), and ensure the same
"urls_stored" key (as a list) exists in all return paths to maintain a
consistent contract (keep "url_to_page_id" as a dict).
- Around line 659-661: The except block in document_storage_operations.py that
catches exceptions during the create/update of archon_sources must log the full
traceback and include the relevant source identifier before re-raising; locate
the try/except around the archon_sources create/update logic (look for
functions/methods that reference "archon_sources",
"create_or_update_archon_source", or similar) and replace or augment the
safe_logfire_error call with a logger.error call that includes a descriptive
message, the source_id (or URL) context, and exc_info=True so the stack trace is
preserved, then re-raise the exception as before.
- Around line 525-535: The _process_with_new_pipeline method accepts a
cancellation_check but never uses it, leaving the new pipeline uncancellable;
either (A) update orchestrator.run_pipeline to accept a cancellation_check (or a
cancellable token) and thread that through to any async tasks/workers invoked by
run_pipeline, then pass the cancellation_check from _process_with_new_pipeline
into orchestrator.run_pipeline, and update other calls noted around lines
587–595 to pass the same check; or (B) remove the cancellation_check parameter
from _process_with_new_pipeline (and its callers) and add a clear docstring
comment in _process_with_new_pipeline and public API docs stating the new
pipeline does not support per-request cancellation. Use the symbols
_process_with_new_pipeline and orchestrator.run_pipeline to locate changes and
ensure unit tests/typing are updated accordingly.
- Around line 653-656: The update is currently passing the literal string
"now()" which will be stored verbatim; replace it by generating an ISO timestamp
in Python and pass that value to updated_at. In the call to
self.supabase_client.table("archon_sources").update(...).eq("source_id",
source_id).execute() create a timestamp with
datetime.datetime.utcnow().isoformat() (or datetime.utcnow().isoformat() + "Z"
if you want explicit UTC) and set "updated_at": timestamp instead of "now()";
ensure you import datetime at top if not already present.

In `@python/src/server/services/ingestion/ingestion_state_service.py`:
- Around line 224-238: The create_chunks implementation in
ingestion_state_service.py silently returns an empty list when response.data is
empty; make it consistent with create_document_blob and create_embedding_set by
checking response.data after the Supabase insert and raising an error if it's
empty (include the response.error or status information in the exception message
for debugging). Locate the create_chunks method and the response variable used
for the Supabase insert, and replace the silent empty-return behavior with an
explicit raise (e.g., RuntimeError or a specific service error) that includes
context (operation name "create_chunks", response.status_code/response.error) so
downstream workers fail fast on data-corruption cases.
- Around line 122-530: The async methods in IngestionStateService (e.g.,
create_document_blob, get_document_blob, get_blobs_by_source, create_chunks,
get_chunks_by_blob, get_chunks_by_source, create_embedding_set,
get_embedding_set, get_pending_embedding_sets, update_embedding_set_status,
store_embeddings, get_embeddings_by_set, create_summary, get_pending_summaries,
update_summary, update_source_pipeline_status) call the synchronous Supabase
Client .execute() directly and block the event loop; either (preferred) replace
the sync Client with Supabase AsyncClient/create_async_client and await each
.execute() call, or (interim) wrap every .execute() invocation in an offloading
call such as await asyncio.to_thread(...) (or loop.run_in_executor) so DB calls
run in a thread pool; update imports, client construction, and all .execute()
usages in this class to match the chosen approach and ensure all returned
responses are awaited appropriately.
- Around line 277-298: The batched queries in get_chunks_by_source currently
append results from multiple .in_() calls without guaranteeing order; after
collecting all_chunks, sort the rows by blob_id and chunk_index (e.g., sort key
(row["blob_id"], row["chunk_index"])) before constructing Chunk objects so
callers receive chunks in deterministic per-blob, ascending chunk_index order;
update the code around all_chunks collection and the Chunk(...) list
comprehension in get_chunks_by_source to perform this sort.

In `@test_new_pipeline.md`:
- Line 15: The prerequisite in test_new_pipeline.md incorrectly references
"migration 014" while this PR adds 012_add_crawl_url_state.sql; update the
prerequisite text so it references the correct migration number(s) — either
change "migration 014" to "migration 012" or explicitly list which migrations
this PR ships (e.g., 012_add_crawl_url_state.sql) versus earlier ones, so
testers know exactly which migrations to verify.
- Line 9: The step that cd's into the hardcoded path
"/home/zebastjan/dev/archon" should be removed; instead update the instruction
to either tell contributors to run the steps from the repository root or replace
the hardcoded path with a repository-root detection approach (e.g., use git to
find the top-level repo) so the command works for everyone and does not
reference the personal workstation path.
- Around line 44-60: The docs assign SOURCE_ID to a literal placeholder and
never show how to extract it from the /api/progress/<progress_id> response,
causing subsequent health checks to target an empty URL; fix by showing an
"Expected Response" for the progress endpoint that includes the source_id field,
add a concrete extraction step that captures the progress JSON (e.g., assign
PROGRESS from the curl response and set SOURCE_ID by parsing PROGRESS for
.source_id), and add a short reminder to re-assign SOURCE_ID at the start of
each test (Tests 2, 3, 4, 6) that depends on it so the health check against
/api/ingestion/health/$SOURCE_ID uses a real value.

---

Duplicate comments:
In `@python/src/server/api_routes/knowledge_api.py`:
- Around line 879-883: The code currently validates the chat LLM provider by
calling credential_service.get_active_provider("llm") before calling
_validate_provider_api_key, but resummarization uses the code summarization
provider key; change the validation to fetch and validate the actual code
summarization provider (use the CODE_SUMMARIZATION_PROVIDER constant or config
key instead of the literal "llm") so that
credential_service.get_active_provider(CODE_SUMMARIZATION_PROVIDER) (or
equivalent settings lookup) is used and then pass that provider value into
_validate_provider_api_key to ensure the provider used for resummarization is
the one validated.
- Around line 686-731: The route revectorize_knowledge_item and its helper
_perform_revectorize_with_progress (and likewise
resummarize_knowledge_item/_perform_resummarize_with_progress) currently perform
Supabase reads/updates directly; refactor so all DB interactions move into
KnowledgeItemService methods (e.g., add service methods like revectorize_item,
get_item_for_revectorize, update_item_embeddings, resummarize_item,
update_item_summary) and have the route call the service to start the operation
and the background helper call service methods for all reads/updates; update
revectorize_knowledge_item to only validate input, create the ProgressTracker
and call asyncio.create_task(...) passing the service instance, and change
_perform_revectorize_with_progress to use the new KnowledgeItemService methods
(and mirror these changes for the resummarize helpers) so the API route →
service → database pattern is followed.
- Around line 793-849: The update to archon_sources uses embedding_dim which may
be unbound if no batch produced embeddings; ensure embedding_dim is initialized
before the batch loop (e.g., embedding_dim = 0 or None) and update it whenever
you successfully determine an embedding dimension inside the inner loop (the
block that sets embedding_column based on len(embedding)); then use that
initialized/last-known embedding_dim when calling
supabase.table("archon_sources").update(...). Also consider using a clearly
named variable like last_embedding_dim to avoid accidental reuse of loop-scoped
variables and keep the tracker.update and supabase update logic unchanged.

In `@python/src/server/services/crawling/document_storage_operations.py`:
- Around line 87-91: In the except block around
url_state_service.initialize_urls (the try/except handling URL state
initialization), preserve the exception stack trace by logging the error with
the logger (e.g., logger.error or module logger) using exc_info=True in addition
to the existing safe_logfire_error call; update the except Exception as e
handler to call logger.error("Failed to initialize URL state", exc_info=True)
(or equivalent logger variable) so the traceback is recorded while keeping
safe_logfire_error for the user-facing message.
- Around line 304-311: The code is marking URLs from unique_doc_urls as embedded
even for documents that were skipped; change the logic so mark_embedded is
called only for URLs that were actually embedded (use all_urls, which is
populated for documents that passed content validation and were chunked/stored)
instead of unique_doc_urls; update the loop to iterate over all_urls and adjust
the success log to report len(all_urls) and keep the same exception handling
around url_state_service.mark_embedded(original_source_id, url) so only truly
embedded URLs are marked.
- Around line 389-407: In update_source_info, fetching provenance
(credential_service.get_credentials_by_category) and parsing
EMBEDDING_DIMENSIONS can raise transient errors and currently abort the primary
path; wrap the credential lookups for "embedding" and "rag_strategy" in their
own try/except blocks (or use safe_get helper) and on exception or missing keys
fall back to sensible defaults (embedding_provider="openai",
embedding_model="text-embedding-3-small", embedding_dimensions=1536,
vectorizer_settings defaults, summarization_model="gpt-4o-mini") while logging
the error, and ensure embedding_dimensions is validated with a safe int
conversion that falls back to 1536 instead of letting ValueError propagate so
provenance failures do not abort update_source_info.

In `@python/src/server/services/ingestion/ingestion_state_service.py`:
- Line 164: Replace the three generic bare raises that use Exception("Failed to
create document blob"), Exception("Failed to create embedding set"), and
Exception("Failed to create summary record") with a more specific error that
preserves context and the original error (for example a custom IngestionError or
ValueError) and chain the original exception using "raise ... from e"; include
relevant contextual information (IDs, request data, or underlying exception
message) in the new exception text so callers can debug (locate these by
searching for the exact strings "Failed to create document blob", "Failed to
create embedding set", and "Failed to create summary record" in
ingestion_state_service.py).
- Around line 516-530: In update_source_pipeline_status, avoid the falsy check
and the unconditional write of pipeline_error when status == "error": change the
error handling to use explicit "if error_info is not None" (matching
update_embedding_set_status and update_summary), only add "pipeline_error" to
update_data when error_info is not None, and remove the redundant elif branch
that overwrites pipeline_error; keep the pipeline_completed_at assignment when
status == "complete" and then call
self.supabase.table("archon_sources").update(update_data).eq("source_id",
source_id).execute() as before.

---

Nitpick comments:
In `@python/src/server/api_routes/ingestion_api.py`:
- Around line 18-42: The endpoint process_pending_embeddings currently declares
bare scalar parameters which FastAPI exposes as query params; replace them with
a Pydantic request model (e.g., ProcessEmbeddingsRequest with fields
max_batch_size, embedder_id, provider) and accept that model as the single body
parameter in the process_pending_embeddings handler, then pass model fields into
get_embedding_worker(...).process_pending_embeddings; update any
validation/defaults on the model rather than function args. Apply the same
refactor to the other POST handlers in this module (the other router.post
endpoints) so they all accept validated Pydantic body models and use those
fields when invoking their respective worker methods.

In `@python/src/server/services/crawling/document_storage_operations.py`:
- Line 635: The line calling
self.supabase_client.table("archon_sources").select("source_id").eq("source_id",
source_id).execute() exceeds 120 chars; refactor it by splitting the chained
call across multiple lines (e.g., assign the table call or the query to an
intermediate variable, or use parentheses to break after .table(...).select(...)
and .eq(...)) so the resulting assignment to response in the
DocumentStorageOperations method remains under 120 characters while preserving
the same call chain and behavior.
- Around line 561-562: Replace the error-level logging call for the
empty-documents edge case with a non-error level and clearer message: in the
block checking "if not documents:" swap safe_logfire_error(...) for a
safe_logfire_info(...) or safe_logfire_warning(...) call and update the message
to indicate this is an expected/no-content result (include original_source_id
for context). This change should be made where the variable original_source_id
is available so the log still records the source but no longer floods error
monitoring.

In `@python/src/server/services/ingestion/ingestion_state_service.py`:
- Around line 64-119: Replace the bare str types for status/style fields with
the appropriate enum types and default enum members: change
DocumentBlob.download_status to use the DownloadStatus enum (default
DownloadStatus.PENDING), EmbeddingSet.status to the EmbeddingSetStatus enum
(default e.g., EmbeddingSetStatus.PENDING), Summary.status to SummaryStatus
(default SummaryStatus.PENDING) and Summary.style to SummaryStyle (default
SummaryStyle.OVERVIEW); update the dataclass annotations and default values to
the enum members and ensure any code that serializes/deserializes these fields
uses .value or casts to/from the enum (e.g., when reading from DB or JSON) so
runtime and MyPy type-safety are preserved.

In `@test_new_pipeline.md`:
- Line 337: Update the wording in the sentence that reads "Check error_info in
database, fix issue, use retry endpoint." to use a more formal tone by replacing
"fix" with "resolve" so it reads "Check error_info in database, resolve issue,
use retry endpoint."; locate and edit the exact fragment in the document (the
line containing "Check error_info in database, fix issue, use retry endpoint.")
to make this single-word substitution.

coderabbitai · 2026-02-23T00:46:09Z

+@router.post("/process-embeddings")
+async def process_pending_embeddings(
+    max_batch_size: int = 10,
+    embedder_id: str | None = None,
+    provider: str | None = None,
+    supabase: Client = Depends(get_supabase_client),
+):
+    """
+    Manually trigger processing of pending embedding sets.
+
+    Args:
+        max_batch_size: Maximum number of embedding sets to process
+        embedder_id: Optional filter by specific embedder
+        provider: Optional embedding provider override
+
+    Returns:
+        Processing results with counts
+    """
+    worker = get_embedding_worker(supabase)
+    result = await worker.process_pending_embeddings(
+        embedder_id=embedder_id,
+        max_batch_size=max_batch_size,
+        provider=provider,
+    )
+    return result


⚠️ Potential issue | 🟠 Major

No authentication on state-mutating operational endpoints — major security gap.

All six endpoints (four POST, two GET) are publicly accessible with zero authentication. The POST endpoints are particularly sensitive:

/retry-failed-embeddings and /retry-failed-summaries silently reset persisted failed state back to pending, potentially triggering re-processing of previously intentionally skipped items.

/process-embeddings and /process-summaries can be hammered to exhaust worker capacity, especially since max_batch_size is unbounded.

The app's CORS policy is allow_origins=["*"], so any browser can reach these endpoints cross-origin. Even for an internal tool, these trigger operations should require at minimum an API key or the same service-role token check used elsewhere. Consider adding an auth dependency — e.g.:

from ..utils import require_service_key # or your existing auth dependency

Also applies to: 45-69, 101-117, 120-141

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/api_routes/ingestion_api.py` around lines 18 - 42, The POST endpoints (e.g., process_pending_embeddings) currently have no auth and must require a service/auth key; update their route signatures to add the existing auth dependency (for example require_service_key) alongside get_supabase_client so only authorized callers can invoke state-mutating operations, and apply the same change to the other operational endpoints mentioned (retry_failed_embeddings, retry_failed_summaries, process_summaries) to check the service-role token before performing any work; ensure you import the auth dependency (e.g., from ..utils import require_service_key) and use it as a FastAPI Depends parameter in each affected function signature.

coderabbitai · 2026-02-23T00:46:09Z

+
+@router.post("/process-embeddings")
+async def process_pending_embeddings(
+    max_batch_size: int = 10,


⚠️ Potential issue | 🟡 Minor

max_batch_size has no lower-bound validation — pass 0 or -1 to silently no-op or misbehave.

Add a Query(ge=1) constraint on both endpoints to guard against nonsensical values before they reach the worker layer:

🛡️ Proposed fix

-from fastapi import APIRouter, Depends +from fastapi import APIRouter, Depends, Query from supabase import Client ... `@router.post`("/process-embeddings") async def process_pending_embeddings( - max_batch_size: int = 10, + max_batch_size: int = Query(default=10, ge=1), ... ): ... `@router.post`("/process-summaries") async def process_pending_summaries( - max_batch_size: int = 10, + max_batch_size: int = Query(default=10, ge=1), ... ):

Also applies to: 47-47

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/api_routes/ingestion_api.py` at line 20, The max_batch_size parameter currently lacks lower-bound validation in both endpoints; update the parameter declaration to use FastAPI's Query with ge=1 (e.g. change max_batch_size: int = 10 to max_batch_size: int = Query(10, ge=1)) so invalid values like 0 or negatives are rejected before reaching the worker layer, and add the Query import from fastapi if not already present; apply the same change to both occurrences of max_batch_size in this module.

coderabbitai · 2026-02-23T00:46:09Z

+        # Start background task with semaphore
+        asyncio.create_task(_perform_revectorize_with_progress(progress_id, source_id, provider, tracker))


⚠️ Potential issue | 🟠 Major

Background task reference dropped — GC risk and no cancellation support.

The asyncio.create_task() result is discarded, meaning the task has no strong reference. "Save a reference to the result of this function, to avoid a task disappearing mid-execution. The event loop only keeps weak references to tasks. A task that isn't referenced elsewhere may get garbage collected at any time, even before it's done."

The same issue is at line 903 for _perform_resummarize_with_progress. By contrast, upload tasks (line 1233) are correctly stored in active_crawl_tasks. Store revectorize/resummarize tasks the same way (or at minimum use the set-based fire-and-forget pattern), so they also get cancellation support via the stop endpoint.

🛡️ Suggested fix

- asyncio.create_task(_perform_revectorize_with_progress(progress_id, source_id, provider, tracker)) + task = asyncio.create_task(_perform_revectorize_with_progress(progress_id, source_id, provider, tracker)) + active_crawl_tasks[progress_id] = task

Apply the same pattern at line 903 for resummarize.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/api_routes/knowledge_api.py` around lines 722 - 723, The background tasks created with asyncio.create_task for _perform_revectorize_with_progress (and similarly for _perform_resummarize_with_progress) are being discarded and can be GC'd and cannot be cancelled; change the creation to retain and manage the Task just like upload tasks by adding the created Task to the existing active_crawl_tasks set (or an equivalent managed set), ensure you remove the Task from that set when it completes, and use that stored Task for cancellation in the stop endpoint so revectorize/resummarize tasks have proper lifecycle and cancellation support.

coderabbitai · 2026-02-23T00:46:09Z

+
+            # Get current embedding settings for provenance
+            embedding_model = await get_embedding_model(provider=provider)
+            embedding_dimensions = 1536


⚠️ Potential issue | 🟡 Minor

embedding_dimensions = 1536 is dead code.

This variable is assigned but never read; all subsequent logic uses embedding_dim (derived from the actual returned vector length at line 801). Remove the stale initialiser to avoid confusion about what the actual dimension will be.

🗑️ Proposed fix

- embedding_dimensions = 1536 - # Fetch all documents for this source

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/api_routes/knowledge_api.py` at line 751, Remove the dead initializer embedding_dimensions = 1536; it is never used and causes confusion because the code later derives the actual dimension into embedding_dim (from the returned vector length in the logic around embedding_dim). Delete the unused assignment and run a quick search for embedding_dimensions to ensure no other code relies on it; if any remaining references exist, replace them to use embedding_dim instead.

coderabbitai · 2026-02-23T00:46:09Z

+            if not docs_response.data:
+                await tracker.error("No documents found for source")
+                return


⚠️ Potential issue | 🟠 Major

tracker.error() misused for an empty-but-valid state.

If a source genuinely has no crawled pages yet, the operation is marked as failed, misleading the user. This is a valid state (e.g., a freshly created source), not an error. Use tracker.complete() with a descriptive message instead.

🛠️ Proposed fix

if not docs_response.data: - await tracker.error("No documents found for source") + await tracker.complete({"log": "No documents found for source — nothing to re-vectorize"}) return

Apply the same change at line 932–934 for the resummarize path.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/api_routes/knowledge_api.py` around lines 757 - 759, The code currently treats an empty docs_response.data as an error by calling await tracker.error(...); instead, treat an empty-but-valid state as completion: replace await tracker.error("No documents found for source") with await tracker.complete("No documents found for source") (or a clearer message) in the function handling the normal summarize flow (the block checking if not docs_response.data), and make the identical replacement in the resummarize path (the similar check around lines 932–934). Ensure you only change the tracker method call and message text so the operation is marked complete rather than failed.

coderabbitai · 2026-02-23T00:46:10Z

+        response = self.supabase.table("archon_chunks").insert(chunk_records).execute()
+
+        return [
+            Chunk(
+                id=uuid.UUID(row["id"]),
+                blob_id=uuid.UUID(row["blob_id"]),
+                chunk_index=row["chunk_index"],
+                start_offset=row.get("start_offset"),
+                end_offset=row.get("end_offset"),
+                content=row["content"],
+                token_count=row.get("token_count"),
+                created_at=row.get("created_at"),
+            )
+            for row in response.data
+        ]


⚠️ Potential issue | 🟡 Minor

create_chunks silently returns [] on empty DB response, inconsistent with other create methods.

create_document_blob and create_embedding_set both raise when response.data is empty; create_chunks does not. A silent empty return means downstream embedding/summary workers see zero chunks with no error signal, violating the fail-fast guideline for data corruption.

🐛 Proposed fix

response = self.supabase.table("archon_chunks").insert(chunk_records).execute() + if not response.data: + raise RuntimeError( + f"Failed to create {len(chunk_records)} chunks for blob_id={blob_id}: no data returned" + ) + return [

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/ingestion_state_service.py` around lines 224 - 238, The create_chunks implementation in ingestion_state_service.py silently returns an empty list when response.data is empty; make it consistent with create_document_blob and create_embedding_set by checking response.data after the Supabase insert and raising an error if it's empty (include the response.error or status information in the exception message for debugging). Locate the create_chunks method and the response variable used for the Supabase insert, and replace the silent empty-return behavior with an explicit raise (e.g., RuntimeError or a specific service error) that includes context (operation name "create_chunks", response.status_code/response.error) so downstream workers fail fast on data-corruption cases.

coderabbitai · 2026-02-23T00:46:10Z

+        for i in range(0, len(blob_ids), batch_size):
+            batch = blob_ids[i : i + batch_size]
+            response = (
+                self.supabase.table("archon_chunks")
+                .select("*")
+                .in_("blob_id", batch)
+                .execute()
+            )
+            all_chunks.extend(response.data)
+
+        return [
+            Chunk(
+                id=uuid.UUID(row["id"]),
+                blob_id=uuid.UUID(row["blob_id"]),
+                chunk_index=row["chunk_index"],
+                start_offset=row.get("start_offset"),
+                end_offset=row.get("end_offset"),
+                content=row["content"],
+                token_count=row.get("token_count"),
+                created_at=row.get("created_at"),
+            )
+            for row in all_chunks


⚠️ Potential issue | 🟡 Minor

Batched queries in get_chunks_by_source return unordered chunks.

get_chunks_by_blob sorts by chunk_index, but the batched variant here does not. Across multiple blobs the results from different .in_() queries are appended in arrival order, leaving the final list unsorted. Callers expecting ordered chunks (e.g. embedding workers, context assembly) will silently process them out of sequence.

🐛 Proposed fix

response = ( self.supabase.table("archon_chunks") .select("*") .in_("blob_id", batch) + .order("chunk_index") .execute() ) all_chunks.extend(response.data) + + # Sort across batches in case blobs interleave + all_chunks.sort(key=lambda r: (r["blob_id"], r["chunk_index"]))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/ingestion_state_service.py` around lines 277 - 298, The batched queries in get_chunks_by_source currently append results from multiple .in_() calls without guaranteeing order; after collecting all_chunks, sort the rows by blob_id and chunk_index (e.g., sort key (row["blob_id"], row["chunk_index"])) before constructing Chunk objects so callers receive chunks in deterministic per-blob, ascending chunk_index order; update the code around all_chunks collection and the Chunk(...) list comprehension in get_chunks_by_source to perform this sort.

coderabbitai · 2026-02-23T00:46:10Z

+
+1. Start the backend service:
+```bash
+cd /home/zebastjan/dev/archon


⚠️ Potential issue | 🟡 Minor

Remove the hardcoded developer path.

/home/zebastjan/dev/archon is a personal workstation path. Contributors running these steps elsewhere will get a confusing No such file or directory error before anything else fails.

✏️ Suggested fix

-cd /home/zebastjan/dev/archon +cd archon # or: cd "$(git rev-parse --show-toplevel)"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

cd /home/zebastjan/dev/archon

cd archon # or: cd "$(git rev-parse --show-toplevel)"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test_new_pipeline.md` at line 9, The step that cd's into the hardcoded path "/home/zebastjan/dev/archon" should be removed; instead update the instruction to either tell contributors to run the steps from the repository root or replace the hardcoded path with a repository-root detection approach (e.g., use git to find the top-level repo) so the command works for everyone and does not reference the personal workstation path.

coderabbitai · 2026-02-23T00:46:10Z

+# cd python && uv run python -m src.server.main
+```
+
+2. Ensure Supabase is running and migration 014 has been applied (pipeline tables exist)


⚠️ Potential issue | 🟡 Minor

Migration number in prerequisites doesn't match the PR.

The PR adds 012_add_crawl_url_state.sql, but this line tells testers to verify migration 014. If the numbering is genuinely ahead of what this PR ships, the prerequisite should explicitly call out which migrations belong to this PR and which were applied earlier — otherwise testers will be looking for a migration that doesn't exist yet (or they'll silently skip a required one).

✏️ Suggested fix (adjust numbers to match actual shipped migrations)

-2. Ensure Supabase is running and migration 014 has been applied (pipeline tables exist) +2. Ensure Supabase is running and migrations 012–014 have been applied (crawl URL state + pipeline tables)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

2. Ensure Supabase is running and migration 014 has been applied (pipeline tables exist)

2. Ensure Supabase is running and migrations 012–014 have been applied (crawl URL state + pipeline tables)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test_new_pipeline.md` at line 15, The prerequisite in test_new_pipeline.md incorrectly references "migration 014" while this PR adds 012_add_crawl_url_state.sql; update the prerequisite text so it references the correct migration number(s) — either change "migration 014" to "migration 012" or explicitly list which migrations this PR ships (e.g., 012_add_crawl_url_state.sql) versus earlier ones, so testers know exactly which migrations to verify.

coderabbitai · 2026-02-23T00:46:10Z

+# Replace <progress_id> with the ID from step 1
+curl http://localhost:8181/api/progress/<progress_id>
+```
+
+**Expected:** Status should progress through stages (discovery → downloading → chunking)
+
+### Step 3: Verify pipeline state
+
+Once crawling completes, check that blobs and chunks were created:
+
+```bash
+# Get source_id from progress response
+SOURCE_ID="<source_id_from_progress>"
+
+# Check health of the source
+curl http://localhost:8181/api/ingestion/health/$SOURCE_ID
+```


⚠️ Potential issue | 🟡 Minor

$SOURCE_ID is never populated — every subsequent health check silently targets an empty URL.

The block at lines 56-57 assigns a placeholder string literal to SOURCE_ID rather than extracting the real value from the API. Because the progress response format is never shown (Step 2 has no "Expected Response" block), a tester has no way to know which field to read. Additionally, $SOURCE_ID is reused across Tests 2, 3, 4, and 6, all of which assume it was set in a prior terminal session.

Suggested improvements:

Show the expected progress response so testers can identify the source_id field.

Provide a concrete extraction command (e.g. jq) instead of the placeholder.

Add a re-assignment reminder at the start of each test that needs the variable.

✏️ Suggested fix for Step 2 + Step 3

### Step 2: Check crawl progress ```bash # Replace <progress_id> with the ID from step 1 -curl http://localhost:8181/api/progress/<progress_id> +PROGRESS=$(curl -s http://localhost:8181/api/progress/<progress_id>) +echo "$PROGRESS" | jq . +SOURCE_ID=$(echo "$PROGRESS" | jq -r '.source_id')

+Expected: Response includes a source_id field alongside the current stage.
+

Step 3: Verify pipeline state

Once crawling completes, check that blobs and chunks were created:

-# Get source_id from progress response -SOURCE_ID="<source_id_from_progress>" - # Check health of the source curl http://localhost:8181/api/ingestion/health/$SOURCE_ID

</details> <details> <summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @test_new_pipeline.md around lines 44 - 60, The docs assign SOURCE_ID to a
literal placeholder and never show how to extract it from the
/api/progress/<progress_id> response, causing subsequent health checks to target
an empty URL; fix by showing an "Expected Response" for the progress endpoint
that includes the source_id field, add a concrete extraction step that captures
the progress JSON (e.g., assign PROGRESS from the curl response and set
SOURCE_ID by parsing PROGRESS for .source_id), and add a short reminder to
re-assign SOURCE_ID at the start of each test (Tests 2, 3, 4, 6) that depends on
it so the health check against /api/ingestion/health/$SOURCE_ID uses a real
value.

</details>  

- Current focus: robust ingestion pipeline - Near-term: batch processing + agent skills - Mid-term: git + IPFS integration - Long-term: database abstraction + knowledge graph

- Add operation_progress table for restartable pipelines - ProgressTracker now persists to database (survives container restarts) - Add pause/resume API endpoints and frontend buttons - Auto-apply migrations on server startup - Fix Summary Ollama detection on page load - Fix settings reload after save to verify persistence - Fix broken test fixtures (renamed helper functions)

Implements foundational pause/resume functionality for web crawl operations with automatic recovery on Docker restart. Download manager-style controls with proper state tracking and UI feedback. ## Changes ### Backend - Pause/Resume State Machine - Add cancellation reason tracking (PAUSED vs STOPPED vs NONE) - Modify CrawlingService.cancel() to accept reason parameter - Add CrawlingService.pause() method for explicit pause operations - Update CancelledError handler to preserve "paused" status - Pause endpoint now calls orchestration.pause() instead of cancel() ### Backend - Auto-Resume on Restart - Add ProgressTracker.auto_resume_paused_operations() method - Server startup now auto-resumes all paused operations - Operations marked as "paused" after Docker restart automatically continue - Set source_id early in crawl initialization for pause/resume support ### Backend - Failed Operations Retryable - Resume endpoint now accepts "failed" status for retry functionality - Failed operations preserve progress data for recovery ### Frontend - UI Improvements - Add "Retry" button for failed operations (blue, RotateCw icon) - Change Resume button icon from StopCircle to Play icon - Update button visibility logic for paused/failed states - Visual distinction between Pause (⏹️), Resume (▶️), and Retry (🔄) ## Known Issues ### Critical - **Cancel button non-functional**: UI does not respond to cancel/stop button - **Resume doesn't restart crawl**: UI responds to resume button but actual crawl operation does not continue (no backend activity detected) - **Auto-resume may not trigger**: Docker restart shows "Restored" message but not always "Auto-resumed" message ### Investigation Needed - Source_id may still be NULL in some cases despite early initialization - Cancellation checks (_check_cancellation) may not be frequent enough during long-running operations like discovery or initial page crawling - Active orchestration registry may not properly track paused operations ## Testing Status - ✅ Pause button changes status to "paused" in database - ✅ UI shows correct badges (Paused/Cancelled/Failed) - ✅ Resume button appears for paused operations - ❌ Resume does not actually restart crawl - ❌ Cancel button does not stop operations - ⚠️ Auto-resume inconsistent on Docker restart ## Next Steps 1. Debug why resume endpoint doesn't restart crawl service 2. Fix cancel button to properly stop operations 3. Ensure _check_cancellation() called frequently in all crawl stages 4. Add more detailed logging for pause/resume/cancel operations 5. Verify source_id persistence across pause/resume cycles Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add robust testing for pause/resume/cancel functionality with critical bug prevention and fix for source creation reliability. Testing Infrastructure: - API endpoint tests (8/9 passing) for pause/resume/cancel endpoints - Integration tests (6/6 passing) for full pause/resume lifecycle - Comprehensive test documentation with examples and patterns - Quick reference guide for running tests Critical Bug Tests: - Resume with missing source_id returns 400 (prevents early pause bug) - Resume with missing source record returns 404 (prevents DB inconsistency) - Resume validates BEFORE updating DB status (prevents corruption) - Pause before source creation fails gracefully with clear error - Progress never goes backwards after resume Source Creation Fix: - Make source record creation REQUIRED (was best-effort) - Add retry logic: 3 attempts with exponential backoff (1s, 2s) - Fail crawl with clear error if source creation fails after retries - Prevents silent breakage where pause works but resume fails Impact: - 14/15 tests passing (93% success rate) - All critical bugs have regression tests - Transient DB errors handled automatically via retry - Permanent DB failures fail fast with clear error messages - No more silent pause/resume breakage Files: - python/tests/test_pause_resume_cancel_api.py (369 lines) - python/tests/progress_tracking/integration/test_pause_resume_flow.py (338 lines) - python/tests/progress_tracking/README.md (437 lines - test docs) - python/tests/RUN_PAUSE_RESUME_TESTS.md (quick reference) - python/src/server/services/crawling/crawling_service.py (retry logic) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

⚠️ WARNING: Code is in non-working state Crawls start but fail/disappear shortly after. Cannot test pause/resume functionality until basic crawls work. See KNOWN_ISSUES.md for: - Current symptoms and impact - What we know/don't know - Debugging steps for next session - TODO list This is WIP debugging code. DO NOT MERGE until crawls work. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The database constraint only allows: idle, downloading, chunking, embedding, summarizing, complete, error. Using 'initializing' caused constraint violations (error code 23514) and prevented source creation, making all crawls fail immediately. Changed to 'idle' which is the database default and semantically correct for a newly created source that hasn't started processing yet. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Wirasm · 2026-04-06T07:24:02Z

🔄 This repository is being replaced by a new version of Archon.

The original Python/MCP codebase is being archived to the archive/v1-python-mcp branch. The new Archon (TypeScript workflow engine for AI coding agents) is replacing it.

This PR is being closed as part of the migration. Thank you for your contribution!

… nodes (#936) Approval nodes can now capture the reviewer's comment as $nodeId.output (via capture_response: true) and optionally retry on rejection instead of cancelling (via on_reject: { prompt, max_attempts }). This enables iterative human-AI review cycles without needing interactive loop nodes. Changes: - Add approvalOnRejectSchema and extend approval node schema with capture_response and on_reject fields - Extend ApprovalContext with captureResponse, onRejectPrompt, onRejectMaxAttempts (stored at pause time for reject handlers) - Add $REJECTION_REASON variable to substituteWorkflowVariables - Extract executeApprovalNode function with rejection resume logic - Update all 4 approve handlers (CLI, command-handler, orchestrator, server API) to use captureResponse and clear rejection state - Update all 3 reject handlers (CLI, command-handler, server API) to check onRejectPrompt and retry instead of cancel when configured - Add 5 tests for approval node behavior (fresh pause, capture_response, on_reject resume, max_attempts exhaustion, max_attempts=1) Fixes #936

… nodes (coleam00#936) Approval nodes can now capture the reviewer's comment as $nodeId.output (via capture_response: true) and optionally retry on rejection instead of cancelling (via on_reject: { prompt, max_attempts }). This enables iterative human-AI review cycles without needing interactive loop nodes. Changes: - Add approvalOnRejectSchema and extend approval node schema with capture_response and on_reject fields - Extend ApprovalContext with captureResponse, onRejectPrompt, onRejectMaxAttempts (stored at pause time for reject handlers) - Add $REJECTION_REASON variable to substituteWorkflowVariables - Extract executeApprovalNode function with rejection resume logic - Update all 4 approve handlers (CLI, command-handler, orchestrator, server API) to use captureResponse and clear rejection state - Update all 3 reject handlers (CLI, command-handler, server API) to check onRejectPrompt and retry instead of cancel when configured - Add 5 tests for approval node behavior (fresh pause, capture_response, on_reject resume, max_attempts exhaustion, max_attempts=1) Fixes coleam00#936

Zebastjan and others added 8 commits February 21, 2026 23:18

Update ADR-001 with accurate implementation state and roadmap

0aaee50

Add unit tests for CrawlUrlStateService (checkpoint/resume)

0f83aa6

coderabbitai Bot reviewed Feb 22, 2026

View reviewed changes

Zebastjan added 2 commits February 22, 2026 03:54

coderabbitai Bot reviewed Feb 22, 2026

View reviewed changes

Zebastjan added 7 commits February 22, 2026 04:36

Add archon_migrations tracking to migration files 001-007, 012-013

e93d42e

Add re-vectorize, re-summarize endpoints and code summarization agent…

f4c868d

… settings

Fix Code Summary tab layout and Config button

a7306d2

- Move Code Summary to second row to prevent overflow - Shorten label to 'Summary' - Add code_summarization to Ollama Config panel with model selection - Update Configuration Summary to handle code summarization

Add Summary tab to LLM Provider Settings

ca3d358

- Add Summary as third tab (row 2) with orange color and LuCode icon - Works alongside Chat (row 1) and Embeddings (row 1)

Fix ternary syntax and add full model selection for Summary tab

e3fef61

coderabbitai Bot reviewed Feb 22, 2026

View reviewed changes

coderabbitai Bot reviewed Feb 23, 2026

View reviewed changes

Zebastjan and others added 5 commits February 22, 2026 19:49

Add roadmap document outlining project direction

a60fe0d

- Current focus: robust ingestion pipeline - Near-term: batch processing + agent skills - Mid-term: git + IPFS integration - Long-term: database abstraction + knowledge graph

Wirasm closed this Apr 6, 2026

This was referenced Apr 8, 2026

feat(web): rename Reject button to Cancel/Revise based on approval node config #969

Open

feat: archon-self-improve workflow — review workflow runs and suggest prompt/workflow improvements #976

Open

133Felix mentioned this pull request Apr 9, 2026

fix(workflows): interactive-prd approval gates missing capture_response #1001

Closed

		created_at TIMESTAMPTZ DEFAULT now(),
		updated_at TIMESTAMPTZ DEFAULT now(),

		@@ -0,0 +1,75 @@
		# ADR-001: Restartable RAG Ingestion Pipeline

		## Status: Proposed

	## Status: Proposed
	## Status: Accepted (Superseded by ADR-002)

		CREATE INDEX IF NOT EXISTS idx_archon_chunks_source_id ON archon_chunks(blob_id, source_id)
		INCLUDE (source_id);

		# Start background task with semaphore
		asyncio.create_task(_perform_revectorize_with_progress(progress_id, source_id, provider, tracker))

	cd /home/zebastjan/dev/archon
	cd archon # or: cd "$(git rev-parse --show-toplevel)"

	2. Ensure Supabase is running and migration 014 has been applied (pipeline tables exist)
	2. Ensure Supabase is running and migrations 012–014 have been applied (crawl URL state + pipeline tables)

Conversation

Zebastjan commented Feb 22, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

New Files

Modified Files

Testing

Notes

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Feb 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 22, 2026

Choose a reason for hiding this comment

Zebastjan commented Feb 22, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Feb 22, 2026 •

edited

Loading