Add crawl checkpoint/resume infrastructure#936
Add crawl checkpoint/resume infrastructure#936Zebastjan wants to merge 25 commits intocoleam00:mainfrom
Conversation
Refactored code summarization prompt from verbose 24-line format to concise 8-line structured format optimized for 1.2B+ parameter models. Changes: - Reduced prompt from ~350 tokens to ~100 tokens (70% reduction) - Replaced extensive examples with direct structured guidance - Maintains same JSON output schema for backward compatibility - Added PURPOSE/PARAMETERS/USE WHEN structure to summary template Testing & Documentation: - Added permanent regression test (tests/prompts/test_code_summary_prompt.py) - Created test infrastructure documentation (tests/prompts/README.md) - Added comprehensive prompt documentation (PRPs/ai_docs/CODE_SUMMARY_PROMPT.md) - Documented data flow (CODE_EXTRACTION_FLOW.md) - Updated CLAUDE.md with testing guidelines Performance Impact: - 3-5x faster inference with small models (tested: Liquid 1.2B Instruct) - 70% cost reduction for code summarization - Only affects code blocks (~5% of content), prose chunks unchanged - Existing markdown fence handling in parser confirmed working The new prompt produces consistent JSON output with structured format while being dramatically more efficient for small language models. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added two integration test suites to validate the optimized code summary prompt:
1. test_code_summary_prompt_quick.py (Quick validation)
- Directly tests the code summarization function
- Validates prompt generates valid JSON with required fields
- Tests Python, TypeScript, and Rust code samples
- ✅ All 3 tests passed with valid summaries
2. test_crawl_validation.py (Full crawl validation)
- Tests complete crawl-to-summary pipeline via API
- Validates contribution guideline URLs
- Currently blocked by backend validation bug:
- Progress status 'discovery' not in allowed enum
- Prevents polling from completing
- Ready for use once backend bug is fixed
Test Results:
- Quick validation: ✅ 3/3 passed
- Summaries generated correctly for all languages
- JSON structure validated
- Demonstrates prompt works with configured LLM
Files added:
- tests/integration/__init__.py
- tests/integration/test_code_summary_prompt_quick.py
- tests/integration/test_crawl_validation.py
- tests/integration/.gitignore (excludes *_results.json)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added:
- PROMPT_TEST_DETAILS.md: Complete test validation report
- Updated PR_DESCRIPTION.md with test results section
Testing Status:
- ✅ Quick validation: 3/3 passed
- ⏸️ Full crawl validation: Ready but blocked by backend bug
Quick test validates:
- Valid JSON generation across Python, TypeScript, Rust
- Meaningful summaries produced
- Prompt works with configured LLM
Full crawl test infrastructure ready for use once backend
progress status enum bug is fixed ('discovery' status).
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…sponse Resolves Pydantic validation error that prevented programmatic crawl progress polling. Changes: - Added 'discovery' to allowed status values in CrawlProgressResponse - Updated documentation to reflect bug fix - Full crawl validation test now functional The 'discovery' status is emitted by the backend during the page discovery phase of crawling. This fix enables automated testing and progress monitoring. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add migration to create archon_crawl_url_state table - Track per-URL status: pending, fetched, embedded, failed - Support retry logic with configurable max_retries - Add CrawlUrlStateService with methods for state management This is the infrastructure for Phase 1 of ADR-001, enabling resumable crawls after interruption.
- Import and initialize CrawlUrlStateService in crawling orchestration - Check for existing crawl state at start of crawl - Log resume info if pending/failed URLs exist - Initialize URL state in document storage before processing - Mark URLs as embedded after successful storage Note: Full resume functionality (skipping already-embedded URLs) requires additional work to filter URLs before crawling.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a restartable, checkpointed ingestion pipeline and per-URL crawl state, provenance tracking, ingestion workers/orchestrator/health checks, resume-aware crawling, code-summary prompt optimization and tests, new API endpoints for revectorize/resummarize, migrations for pipeline/provenance, frontend provenance UI/types, and related docs/tests. Changes
Sequence Diagram(s)sequenceDiagram
participant Orchestrator as Pipeline Orchestrator
participant URLState as Crawl URL State Service
participant DB as Supabase DB
participant Fetcher as Fetcher/Extractor
participant DocStore as Document Storage
participant CodeStore as Code Storage / Summarizer
Orchestrator->>URLState: initialize_urls(source_id, discovered_urls)
URLState->>DB: upsert archon_crawl_url_state (status='pending')
DB-->>URLState: upsert result
URLState-->>Orchestrator: {inserted, skipped}
loop process pending URLs
Orchestrator->>URLState: get_pending_urls(source_id)
URLState->>DB: select pending
DB-->>URLState: url list
Orchestrator->>Fetcher: fetch(url)
Fetcher-->>Orchestrator: crawl_result (content, code blocks)
Orchestrator->>DocStore: store_documents(crawl_result, provenance)
DocStore->>CodeStore: extract_code_examples (if enabled)
CodeStore-->>DocStore: code_summaries
DocStore->>URLState: mark_embedded(source_id, url)
URLState->>DB: update status='embedded'
end
Orchestrator->>URLState: get_crawl_state(source_id)
URLState->>DB: aggregate counts
DB-->>URLState: state
Orchestrator->>URLState: clear_state(source_id) (if finished)
URLState->>DB: delete records
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 15
🧹 Nitpick comments (18)
python/tests/integration/test_code_summary_prompt_quick.py (2)
132-133: Moveimport tracebackto module level.Importing inside an
exceptblock works but is non-idiomatic.tracebackis a stdlib module with no cost to importing at module level.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/tests/integration/test_code_summary_prompt_quick.py` around lines 132 - 133, Move the local import of the standard library module traceback out of the except block and place it at module level near the other imports in python/tests/integration/test_code_summary_prompt_quick.py; locate the except block that currently does "import traceback" and remove that line, then add "import traceback" with the other top-level imports so functions/tests (e.g. the failing test function containing the except) can use traceback without importing inside the exception handler.
130-142: Exception handler swallows failures — pytest can never detect a failing test run.All LLM/API exceptions are caught and stored as
"success": Falseinresults, but nothing is raised. Pytest seestest_prompt_directlyas always passing. Per coding guidelines, batch-style tests should report detailed failures; here at minimum a final assertion should gate the test outcome.🔧 Proposed fix — add a final assertion gate
if success_count == len(results): print("\n🎉 All tests passed!") else: print(f"\n⚠️ {len(results) - success_count} test(s) failed") + # Let pytest see the result + assert success_count == len(results), ( + f"{len(results) - success_count}/{len(results)} code summary test(s) failed" + ) return results🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/tests/integration/test_code_summary_prompt_quick.py` around lines 130 - 142, The current except block appends failures to the results list but never fails the test; update the end of test_prompt_directly to assert that there are no failed runs by inspecting results (e.g., build failures = [r for r in results if not r["success"]]) and raise an AssertionError or use pytest.fail with a concise summary of failures (including name, language, and error) if any exist; keep the existing try/except aggregation but add this final assertion gate to ensure pytest detects failures.python/src/server/services/storage/code_storage_service.py (2)
636-636: Redundant conditional — simplify tocode[:1500].Python slicing never raises when
len(code) <= 1500; the conditional is unnecessary.♻️ Proposed simplification
-{code[:1500] if len(code) > 1500 else code} +{code[:1500]}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/storage/code_storage_service.py` at line 636, Replace the redundant conditional that uses "{code[:1500] if len(code) > 1500 else code}" with the simpler slice "code[:1500]"; update the occurrence that references the variable "code" (in code_storage_service.py) so it always uses the safe slicing expression "code[:1500]" since Python slicing handles shorter strings without error.
631-643:context_before/context_afterare silently dropped from the prompt.Both parameters are still declared in the function signature and documented, but are no longer included in the prompt. If callers rely on context to generate accurate summaries (especially for ambiguous code blocks), this is a silent quality regression for larger models. Consider documenting the intentional omission in the function docstring, or adding a parameter to re-enable context for non-small-model providers.
python/src/server/services/crawling/document_storage_operations.py (1)
69-78: Misleading comment — no "enabled" flag is checked.The comment
# Initialize URL state tracking if enabledimplies a feature flag, but tracking is always attempted. Per coding guidelines, comments should document functionality, not aspirational state.♻️ Proposed fix
- # Initialize URL state tracking if enabled + # Initialize per-URL crawl state to "pending"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/crawling/document_storage_operations.py` around lines 69 - 78, The comment "Initialize URL state tracking if enabled" is misleading because no flag is checked; update the code to either actually check a feature flag or change the comment to accurately reflect behavior. Specifically, either add a conditional around get_crawl_url_state_service/ url_state_service.initialize_urls that checks the existing feature flag (or config) used elsewhere, or replace the comment with a factual one such as "Initialize URL state tracking" so it matches the unconditional calls to get_crawl_url_state_service and url_state_service.initialize_urls; ensure references to url_state_service, get_crawl_url_state_service, and initialize_urls remain correct.python/tests/prompts/README.md (1)
25-25: Hard-coded line reference will drift.
(lines 631-643)in thecode_storage_service.pyreference will become stale as the file evolves. Linking to the function name is more stable.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/tests/prompts/README.md` at line 25, Replace the hard-coded "(lines 631-643)" reference with a stable symbol reference to the function or method in src/server/services/storage/code_storage_service.py (e.g., use the class/method name such as CodeStorageService.<method_name> or the exact def name) so readers can find the code even as line numbers change; update the README.md text to use the function/method name in backticks and remove the numeric line-range.migration/0.1.0/012_add_crawl_url_state.sql (1)
27-29: Redundant single-column index onsource_id.
idx_crawl_url_state_sourceon(source_id)is redundant becauseidx_crawl_url_state_source_statuson(source_id, status)can serve all queries that filter only bysource_id(leftmost prefix). Keeping it adds write overhead without query benefit.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@migration/0.1.0/012_add_crawl_url_state.sql` around lines 27 - 29, The migration creates a redundant single-column index idx_crawl_url_state_source on archon_crawl_url_state(source_id); remove that CREATE INDEX statement and rely on the composite idx_crawl_url_state_source_status (source_id, status) instead; if the single-column index may already exist in deployed DBs, add a DROP INDEX IF EXISTS idx_crawl_url_state_source or an equivalent rollback step to the migration so only the composite index remains and write overhead is avoided.python/tests/integration/test_crawl_validation.py (3)
232-238: Movetracebackimport to the top of the file.
import tracebackis a stdlib module — importing it inside the except block (and again at line 263) is unconventional and duplicated.Suggested fix
Add at the top of the file with other imports:
import tracebackThen remove the inline imports at lines 236 and 263.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/tests/integration/test_crawl_validation.py` around lines 232 - 238, Move the inline "import traceback" out of the except blocks and add a single module-level "import traceback" with the other top-of-file imports; then remove the two inline imports (the one inside the except that sets result["status"] = "error" and the duplicate later) so traceback is imported only once at the top and used from there.
80-82: Consider checking thestatusfield instead of (or in addition to)error.The progress response may have
status: "error"without a separateerrorfield, or theerrorkey might be present but empty. Checking both provides more robust detection. Also, raising a bareExceptionloses the full progress context.Suggested fix
# Check if errored - if progress.get("error"): - raise Exception(f"Crawl failed: {progress.get('error')}") + status = progress.get("status", "") + if progress.get("error") or status in ("error", "failed"): + raise RuntimeError(f"Crawl failed: {progress.get('error', status)}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/tests/integration/test_crawl_validation.py` around lines 80 - 82, Update the failure check to consider the progress "status" field as well as the "error" field and raise with the full progress context; specifically, in the block using progress (the variable named progress in test_crawl_validation.py), detect failure when progress.get("error") is truthy OR progress.get("status") == "error", and then raise an exception that includes the entire progress object (e.g., include progress in the message) instead of raising a bare Exception with only progress.get('error').
22-23: Hardcoded API base URL — consider using an environment variable.
API_BASE = "http://localhost:8181"will break if the server runs on a different host/port (e.g., in CI or Docker-to-Docker networking). A simpleos.environ.get("API_BASE", "http://localhost:8181")keeps the default while allowing overrides.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/tests/integration/test_crawl_validation.py` around lines 22 - 23, The test hardcodes API_BASE = "http://localhost:8181" which breaks in other environments; change API_BASE to read from an environment variable with a default (use os.environ.get("API_BASE", "http://localhost:8181")) and add an import for os at the top of python/tests/integration/test_crawl_validation.py so tests can override API_BASE in CI or Docker without changing code.python/tests/test_crawl_url_state_service.py (2)
328-373: State transition tests are redundant with the individual method tests.
test_pending_to_fetched_transitionandtest_fetched_to_embedded_transition(Lines 331–347) duplicate exactly whattest_marks_url_as_fetchedandtest_marks_url_as_embeddedalready cover. Consider either removing these or making them genuine multi-step transitions that initialize → fetch → embed the same URL with assertions at each stage.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/tests/test_crawl_url_state_service.py` around lines 328 - 373, The tests test_pending_to_fetched_transition and test_fetched_to_embedded_transition in TestStateTransitionLogic are redundant with existing tests (test_marks_url_as_fetched and test_marks_url_as_embedded); either remove these two methods or convert them into a true multi-step integration test that uses the same source_id/url and calls url_state_service.mark_fetched(...) then asserts the intermediate state, followed by url_state_service.mark_embedded(...) and asserts the final state (ensure you reference the same URL and use the same service instance: url_state_service.mark_fetched and url_state_service.mark_embedded); update or delete the redundant test methods accordingly to avoid duplicate coverage.
7-7: Unused import:patch.
patchis imported but never used in any test.-from unittest.mock import MagicMock, patch +from unittest.mock import MagicMock🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/tests/test_crawl_url_state_service.py` at line 7, Remove the unused import "patch" from the test module import line (keep MagicMock if still used) or replace it by using patch where intended; specifically update the import statement that currently reads "from unittest.mock import MagicMock, patch" to drop "patch" (or use "patch" in tests if that was intended) so there are no unused imports.CODE_EXTRACTION_FLOW.md (2)
17-25: Add language specifiers to pseudocode fenced blocks.Lines 17, 33, 73, and 137 use fenced code blocks without a language identifier (flagged by markdownlint MD040). Using
textorplaintextwould satisfy the linter while keeping the content readable.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CODE_EXTRACTION_FLOW.md` around lines 17 - 25, Several fenced code blocks in CODE_EXTRACTION_FLOW.md (e.g., the block showing the numbered list starting "1. Crawl returns markdown documents" and the other pseudocode blocks) are missing language specifiers and trigger markdownlint MD040; update each triple-backtick fence to include a language token such as ```text or ```plaintext so the blocks are recognized as plain text (apply this to the blocks near the "Crawl returns markdown documents" snippet and the other pseudocode fences referenced in the review).
1-8: Documentation uses change-specific language that won't age well.Phrases like "YOUR PROMPT HERE", "Your prompt change", and "Your 1.2B prompt optimization" are conversational and tied to this specific PR context. For long-term project documentation, consider replacing these with neutral descriptions (e.g., "the code summary prompt", "the prompt optimization"). Based on learnings, documentation should focus on functionality and reasoning rather than referencing specific changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CODE_EXTRACTION_FLOW.md` around lines 1 - 8, The documentation in CODE_EXTRACTION_FLOW.md uses PR-specific, conversational phrases ("YOUR PROMPT HERE", "Your prompt change", "Your 1.2B prompt optimization"); update the text to neutral, long-lived terminology by replacing those phrases with stable descriptors such as "the code summary prompt", "the prompt optimization", or "prompt adjustments", and rephrase any remaining change-specific wording so the document describes behavior (e.g., that the code summary runs on extracted code blocks only) rather than referencing this PR or an author-facing instruction; ensure the three explicit strings mentioned are replaced consistently throughout the file and that the executive summary remains factual and version-agnostic.PROMPT_TEST_DETAILS.md (1)
99-105: Fenced code block missing language specifier.Per markdownlint MD040, adding a language identifier (e.g.,
text) would satisfy the linter.-``` +```text pydantic_core._pydantic_core.ValidationError: 1 validation error for CrawlProgressResponse🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@PROMPT_TEST_DETAILS.md` around lines 99 - 105, The fenced code block showing the pydantic ValidationError in PROMPT_TEST_DETAILS.md is missing a language specifier; update the opening fence to include a language identifier (for example change ``` to ```text) so the block containing "pydantic_core._pydantic_core.ValidationError: 1 validation error for CrawlProgressResponse" is annotated, ensuring the fenced code block starts with ```text and ends with the matching ``` fence.python/src/server/services/crawling/crawl_url_state_service.py (3)
323-340: Singleton accessor is not thread-safe and silently ignores a differentsupabase_clientafter first initialization.If
get_crawl_url_state_service(client_A)is called during startup and laterget_crawl_url_state_service(client_B)is called (e.g., in tests or a different context),client_Bis silently discarded. Additionally, concurrent calls can race on theNonecheck and create two instances.Consider either:
- Adding a
resetparameter for testing, and documenting that the client arg is only honored on first call.- Using a lock for thread safety.
Minimal fix — document behavior and add reset capability for tests
-def get_crawl_url_state_service(supabase_client=None) -> CrawlUrlStateService: +def get_crawl_url_state_service(supabase_client=None, *, _reset: bool = False) -> CrawlUrlStateService: """ Get the singleton crawl URL state service instance. Args: - supabase_client: Optional Supabase client + supabase_client: Optional Supabase client (only used on first call or after reset) + _reset: If True, discard existing instance and create a new one (testing only) Returns: CrawlUrlStateService instance """ global crawl_url_state_service - if crawl_url_state_service is None: + if crawl_url_state_service is None or _reset: crawl_url_state_service = CrawlUrlStateService(supabase_client) return crawl_url_state_service🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/crawling/crawl_url_state_service.py` around lines 323 - 340, The singleton accessor get_crawl_url_state_service currently drops subsequent supabase_client args and is not thread-safe; update it to accept an optional reset: bool = False (for tests) and document that supabase_client is honored only on first init unless reset=True, and add a module-level threading.Lock (e.g., crawl_url_state_service_lock) and wrap the None-check and instantiation of crawl_url_state_service (and honoring/reset logic) inside lock.acquire()/release or a context manager to prevent races; ensure you still use the global crawl_url_state_service and instantiate CrawlUrlStateService(supabase_client) only when appropriate.
20-28: Missing type annotation onsupabase_clientparameter.The constructor accepts an optional Supabase client but has no type hint, making it harder for MyPy and IDE tooling to catch misuse.
Proposed fix
- def __init__(self, supabase_client=None): + def __init__(self, supabase_client: "Client | None" = None):(Import
Clientfrom the supabase library or useAnyif the exact type varies.)As per coding guidelines: "Use Mypy for type checking in Python to ensure type safety."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/crawling/crawl_url_state_service.py` around lines 20 - 28, The __init__ of CrawlUrlStateService currently lacks a type annotation for the supabase_client parameter; update the signature of __init__ to annotate supabase_client as either the supabase Client type (import Client from the supabase library) or as typing.Optional[Client] (or typing.Any if the exact client class is uncertain), and add a corresponding attribute annotation for self.supabase_client (e.g., self.supabase_client: Optional[Client]) so MyPy/IDE tooling can validate uses of get_supabase_client() and methods on supabase_client.
74-76: Error logging lacks context (source_id, url) and stack traces, per coding guidelines.Most
exceptblocks log a generic message (e.g.,"Failed to update URL status: {e}") without including thesource_idorurlthat failed, and none preserve the stack trace withexc_info=True. This makes production debugging significantly harder.Example —
_update_status(line 178):Proposed fix pattern (apply similarly to all catch blocks)
except Exception as e: - safe_logfire_error(f"Failed to update URL status: {e}") + logger.error( + f"Failed to update URL status to '{status}' | source_id={source_id} | url={url}", + exc_info=True, + ) return FalseWhere
safe_logfire_erroralone is insufficient for debugging, pair it withlogger.error(..., exc_info=True)so that structured tracing and full stack traces are both available.As per coding guidelines: "Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs."
Also applies to: 150-152, 177-179, 249-251, 276-278, 299-301, 318-320
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/crawling/crawl_url_state_service.py` around lines 74 - 76, The except blocks (e.g., the one around safe_logfire_error in crawl_url_state_service.py and specifically in the _update_status method) must log contextual identifiers and full stack traces: replace the bare safe_logfire_error(f"...{e}") pattern with two actions — call safe_logfire_error with a concise message and then call the module logger (e.g., logger.error) with a descriptive message that includes source_id and url (or other relevant IDs) and pass exc_info=True so the full exception stack is preserved; also narrow the except to specific exception types where feasible and re-raise with "raise" (or "raise from e" if chaining is desired). Ensure these changes are applied to the other catch sites mentioned (around lines 150-152, 177-179, 249-251, 276-278, 299-301, 318-320) and reference the same symbols (safe_logfire_error and _update_status) when editing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CLAUDE.md`:
- Around line 153-173: Remove the stray closing code-fence that follows the
"Prompt Regression Tests" / Quick Workflows bash block so there's only one
closing ``` for that block; specifically, delete the orphaned ``` that currently
precedes the "## Architecture Overview" heading so the bash block closes
correctly and the rest of the document renders as normal.
In `@migration/0.1.0/012_add_crawl_url_state.sql`:
- Around line 43-45: The current RLS policy "Service role full access to
crawl_url_state" on archon_crawl_url_state uses USING (true) WITH CHECK (true)
and thus grants access to any role; update the policy to restrict it to the
service role by either replacing the predicate with USING (auth.role() =
'service_role') WITH CHECK (auth.role() = 'service_role') or by adding TO
service_role so only the service role can read/write; modify the policy
definition for "Service role full access to crawl_url_state" accordingly.
- Around line 21-22: Add the missing UPDATE trigger for updated_at on the
archon_crawl_url_state table: after the table definition that contains the
updated_at TIMESTAMPTZ DEFAULT now() column, create a trigger named
set_updated_at_crawl_url_state that fires BEFORE UPDATE ON
archon_crawl_url_state FOR EACH ROW and EXECUTE FUNCTION
update_updated_at_column(); this mirrors the pattern used for archon_settings,
archon_projects, etc., and ensures updated_at is auto-updated even if the
application layer omits it.
In `@PRPs/ai_docs/CODE_SUMMARY_PROMPT.md`:
- Around line 132-135: The run command in the markdown uses the wrong test path
("uv run python test_code_summary_prompt.py ollama"); update this to point at
the tests/prompts test file used elsewhere by replacing that string with "uv run
python tests/prompts/test_code_summary_prompt.py ollama" so the command runs
from the python/ directory and matches the other references to
tests/prompts/test_code_summary_prompt.py.
In `@python/src/server/services/crawling/crawl_url_state_service.py`:
- Around line 104-152: mark_failed currently does a read-then-update on
retry_count (via
self.supabase_client.table(self.table_name).select(...).match(...)) which causes
a TOCTOU race; replace that two-step logic with an atomic DB-side update (e.g.,
a single SQL UPDATE that sets retry_count = retry_count + 1 and uses RETURNING
to get the new value, or call a Supabase RPC/stored procedure) so concurrent
failures cannot lose increments and so you can decide in the same statement
whether to set status to "failed" or keep "pending"; update mark_failed to call
that atomic updater and then call _update_status only if the returned
retry_count >= max_retries (or incorporate status change into the same RPC), or
if you cannot change DB now, add a short comment in mark_failed documenting this
known limitation.
- Around line 253-278: The exception handler in get_crawl_state can reference
counts before it's defined if the Supabase query raises, causing an
UnboundLocalError; to fix, move or create the counts = {"pending": 0, "fetched":
0, "embedded": 0, "failed": 0, "total": 0} initialization before the try block
(or ensure a default counts is set in the except) so that the except path and
return always have a valid counts; update the get_crawl_state function (the
supabase_client.table(...).select("status").match({"source_id":
source_id}).execute() call and the except block) to use this pre-initialized
counts.
In `@python/src/server/services/crawling/crawling_service.py`:
- Around line 361-362: In _async_orchestrate_crawl, synchronous Supabase I/O
methods on url_state_service (has_existing_state, get_crawl_state, clear_state)
must be run off the event loop; replace direct calls like
url_state_service.has_existing_state(original_source_id) with awaiting
asyncio.to_thread(url_state_service.has_existing_state, original_source_id) (and
do the same for get_crawl_state and clear_state) so these blocking DB calls do
not block the async orchestration (mirror the existing pattern used for
discover_files).
- Around line 360-381: The clear-state branch currently wipes all crawl state
when pending==0 and failed==0, which drops URLs in the "fetched" state; update
the logic in the block that calls
url_state_service.clear_state(original_source_id) (within the
get_crawl_url_state_service / crawl_state handling) to also inspect
crawl_state.get("fetched", 0) (or fetched_count) and only clear state when
fetched_count == 0 (otherwise log resume info and let the crawl continue so
fetched-but-not-yet-embedded URLs are preserved); ensure you reference
url_state_service.clear_state, url_state_service.get_crawl_state, and the
crawl_state keys pending/failed/fetched/total in your change.
In `@python/src/server/services/crawling/document_storage_operations.py`:
- Around line 74-78: The except blocks around url_state_service.initialize_urls
(where safe_logfire_error is called) and the other handler at the location
referencing safe_logfire_error must also call the module logger with
exc_info=True to preserve stack traces; update the except handlers that
currently call only safe_logfire_error(...) to additionally invoke
logger.error("Failed to initialize URL state" or the existing error message,
exc_info=True) (i.e., mirror the established pattern used elsewhere in this file
where safe_logfire_error(...) is paired with logger.error(..., exc_info=True)).
- Around line 291-298: The call to url_state_service.mark_embedded currently
uses unique_doc_urls (which includes URLs for documents that were skipped and
never embedded), corrupting resume state; update the post-storage marking to
iterate over the deduplicated list of actually-embedded URLs by replacing
unique_doc_urls with list(set(all_urls)) so only URLs collected in the
processing loop (all_urls) are marked embedded; look for the block in
document_storage_operations.py that calls url_state_service.mark_embedded and
the variables unique_doc_urls and all_urls within the same function to apply
this change.
In `@python/tests/integration/.gitignore`:
- Around line 1-2: The tests are writing code_summary_test_results.json into
python/tests/prompts/ but only python/tests/integration/ is ignored; add a
.gitignore to prevent committing that output by either creating
python/tests/prompts/.gitignore containing the same pattern (*_results.json or
specifically code_summary_test_results.json) or consolidate by adding
*_results.json to python/tests/.gitignore; update the repository by committing
the new/updated .gitignore so python/tests/prompts/test_code_summary_prompt.py
output is ignored.
In `@python/tests/integration/test_code_summary_prompt_quick.py`:
- Around line 70-77: The test currently swallows ImportError and does a plain
return, masking failures; update the import error handling in
python/tests/integration/test_code_summary_prompt_quick.py where
_generate_code_example_summary_async is imported: replace the bare return with
either pytest.skip("dependency missing: ...") (after importing pytest) when the
import/environment is optional, or re-raise the caught ImportError to fail the
test and surface the real error; ensure the ImportError handler uses the chosen
approach and includes the original exception message for context.
In `@python/tests/prompts/test_code_summary_prompt.py`:
- Around line 118-160: The helper function test_single_summary is being
collected by pytest because it starts with test_ and its parameter sample has no
fixture, and its try/except swallows assertion failures; rename the function
(e.g., test_single_summary -> run_single_summary) so pytest won't collect it,
remove or lift the broad try/except in run_single_summary so assertion errors
propagate to pytest, and update all call sites that referenced
test_single_summary to the new name; keep the original signature
(code/context/language/provider) but ensure no pytest-only fixtures are
expected.
In `@python/tests/test_crawl_url_state_service.py`:
- Around line 145-169: Add assertions in the two tests to verify the payload
passed to the DB update call: for test_marks_url_as_failed_after_max_retries
assert that the update payload includes "status": "failed"; for
test_increments_retry_count_below_max assert the payload includes "status":
"pending" and "retry_count": 2. Locate the mocked update by referencing
mock_client.table.return_value.update.return_value.match.return_value (or the
update(...) call on mock_client.table()) and assert the call arguments (e.g.,
via assert_called_with or inspecting call_args) to ensure the correct dict was
written.
---
Nitpick comments:
In `@CODE_EXTRACTION_FLOW.md`:
- Around line 17-25: Several fenced code blocks in CODE_EXTRACTION_FLOW.md
(e.g., the block showing the numbered list starting "1. Crawl returns markdown
documents" and the other pseudocode blocks) are missing language specifiers and
trigger markdownlint MD040; update each triple-backtick fence to include a
language token such as ```text or ```plaintext so the blocks are recognized as
plain text (apply this to the blocks near the "Crawl returns markdown documents"
snippet and the other pseudocode fences referenced in the review).
- Around line 1-8: The documentation in CODE_EXTRACTION_FLOW.md uses
PR-specific, conversational phrases ("YOUR PROMPT HERE", "Your prompt change",
"Your 1.2B prompt optimization"); update the text to neutral, long-lived
terminology by replacing those phrases with stable descriptors such as "the code
summary prompt", "the prompt optimization", or "prompt adjustments", and
rephrase any remaining change-specific wording so the document describes
behavior (e.g., that the code summary runs on extracted code blocks only) rather
than referencing this PR or an author-facing instruction; ensure the three
explicit strings mentioned are replaced consistently throughout the file and
that the executive summary remains factual and version-agnostic.
In `@migration/0.1.0/012_add_crawl_url_state.sql`:
- Around line 27-29: The migration creates a redundant single-column index
idx_crawl_url_state_source on archon_crawl_url_state(source_id); remove that
CREATE INDEX statement and rely on the composite
idx_crawl_url_state_source_status (source_id, status) instead; if the
single-column index may already exist in deployed DBs, add a DROP INDEX IF
EXISTS idx_crawl_url_state_source or an equivalent rollback step to the
migration so only the composite index remains and write overhead is avoided.
In `@PROMPT_TEST_DETAILS.md`:
- Around line 99-105: The fenced code block showing the pydantic ValidationError
in PROMPT_TEST_DETAILS.md is missing a language specifier; update the opening
fence to include a language identifier (for example change ``` to ```text) so
the block containing "pydantic_core._pydantic_core.ValidationError: 1 validation
error for CrawlProgressResponse" is annotated, ensuring the fenced code block
starts with ```text and ends with the matching ``` fence.
In `@python/src/server/services/crawling/crawl_url_state_service.py`:
- Around line 323-340: The singleton accessor get_crawl_url_state_service
currently drops subsequent supabase_client args and is not thread-safe; update
it to accept an optional reset: bool = False (for tests) and document that
supabase_client is honored only on first init unless reset=True, and add a
module-level threading.Lock (e.g., crawl_url_state_service_lock) and wrap the
None-check and instantiation of crawl_url_state_service (and honoring/reset
logic) inside lock.acquire()/release or a context manager to prevent races;
ensure you still use the global crawl_url_state_service and instantiate
CrawlUrlStateService(supabase_client) only when appropriate.
- Around line 20-28: The __init__ of CrawlUrlStateService currently lacks a type
annotation for the supabase_client parameter; update the signature of __init__
to annotate supabase_client as either the supabase Client type (import Client
from the supabase library) or as typing.Optional[Client] (or typing.Any if the
exact client class is uncertain), and add a corresponding attribute annotation
for self.supabase_client (e.g., self.supabase_client: Optional[Client]) so
MyPy/IDE tooling can validate uses of get_supabase_client() and methods on
supabase_client.
- Around line 74-76: The except blocks (e.g., the one around safe_logfire_error
in crawl_url_state_service.py and specifically in the _update_status method)
must log contextual identifiers and full stack traces: replace the bare
safe_logfire_error(f"...{e}") pattern with two actions — call safe_logfire_error
with a concise message and then call the module logger (e.g., logger.error) with
a descriptive message that includes source_id and url (or other relevant IDs)
and pass exc_info=True so the full exception stack is preserved; also narrow the
except to specific exception types where feasible and re-raise with "raise" (or
"raise from e" if chaining is desired). Ensure these changes are applied to the
other catch sites mentioned (around lines 150-152, 177-179, 249-251, 276-278,
299-301, 318-320) and reference the same symbols (safe_logfire_error and
_update_status) when editing.
In `@python/src/server/services/crawling/document_storage_operations.py`:
- Around line 69-78: The comment "Initialize URL state tracking if enabled" is
misleading because no flag is checked; update the code to either actually check
a feature flag or change the comment to accurately reflect behavior.
Specifically, either add a conditional around get_crawl_url_state_service/
url_state_service.initialize_urls that checks the existing feature flag (or
config) used elsewhere, or replace the comment with a factual one such as
"Initialize URL state tracking" so it matches the unconditional calls to
get_crawl_url_state_service and url_state_service.initialize_urls; ensure
references to url_state_service, get_crawl_url_state_service, and
initialize_urls remain correct.
In `@python/src/server/services/storage/code_storage_service.py`:
- Line 636: Replace the redundant conditional that uses "{code[:1500] if
len(code) > 1500 else code}" with the simpler slice "code[:1500]"; update the
occurrence that references the variable "code" (in code_storage_service.py) so
it always uses the safe slicing expression "code[:1500]" since Python slicing
handles shorter strings without error.
In `@python/tests/integration/test_code_summary_prompt_quick.py`:
- Around line 132-133: Move the local import of the standard library module
traceback out of the except block and place it at module level near the other
imports in python/tests/integration/test_code_summary_prompt_quick.py; locate
the except block that currently does "import traceback" and remove that line,
then add "import traceback" with the other top-level imports so functions/tests
(e.g. the failing test function containing the except) can use traceback without
importing inside the exception handler.
- Around line 130-142: The current except block appends failures to the results
list but never fails the test; update the end of test_prompt_directly to assert
that there are no failed runs by inspecting results (e.g., build failures = [r
for r in results if not r["success"]]) and raise an AssertionError or use
pytest.fail with a concise summary of failures (including name, language, and
error) if any exist; keep the existing try/except aggregation but add this final
assertion gate to ensure pytest detects failures.
In `@python/tests/integration/test_crawl_validation.py`:
- Around line 232-238: Move the inline "import traceback" out of the except
blocks and add a single module-level "import traceback" with the other
top-of-file imports; then remove the two inline imports (the one inside the
except that sets result["status"] = "error" and the duplicate later) so
traceback is imported only once at the top and used from there.
- Around line 80-82: Update the failure check to consider the progress "status"
field as well as the "error" field and raise with the full progress context;
specifically, in the block using progress (the variable named progress in
test_crawl_validation.py), detect failure when progress.get("error") is truthy
OR progress.get("status") == "error", and then raise an exception that includes
the entire progress object (e.g., include progress in the message) instead of
raising a bare Exception with only progress.get('error').
- Around line 22-23: The test hardcodes API_BASE = "http://localhost:8181" which
breaks in other environments; change API_BASE to read from an environment
variable with a default (use os.environ.get("API_BASE",
"http://localhost:8181")) and add an import for os at the top of
python/tests/integration/test_crawl_validation.py so tests can override API_BASE
in CI or Docker without changing code.
In `@python/tests/prompts/README.md`:
- Line 25: Replace the hard-coded "(lines 631-643)" reference with a stable
symbol reference to the function or method in
src/server/services/storage/code_storage_service.py (e.g., use the class/method
name such as CodeStorageService.<method_name> or the exact def name) so readers
can find the code even as line numbers change; update the README.md text to use
the function/method name in backticks and remove the numeric line-range.
In `@python/tests/test_crawl_url_state_service.py`:
- Around line 328-373: The tests test_pending_to_fetched_transition and
test_fetched_to_embedded_transition in TestStateTransitionLogic are redundant
with existing tests (test_marks_url_as_fetched and test_marks_url_as_embedded);
either remove these two methods or convert them into a true multi-step
integration test that uses the same source_id/url and calls
url_state_service.mark_fetched(...) then asserts the intermediate state,
followed by url_state_service.mark_embedded(...) and asserts the final state
(ensure you reference the same URL and use the same service instance:
url_state_service.mark_fetched and url_state_service.mark_embedded); update or
delete the redundant test methods accordingly to avoid duplicate coverage.
- Line 7: Remove the unused import "patch" from the test module import line
(keep MagicMock if still used) or replace it by using patch where intended;
specifically update the import statement that currently reads "from
unittest.mock import MagicMock, patch" to drop "patch" (or use "patch" in tests
if that was intended) so there are no unused imports.
|
|
||
| # Prompt regression tests | ||
| uv run python tests/prompts/test_code_summary_prompt.py # Test code summary prompt | ||
| uv run pytest tests/prompts/ -v # Run all prompt tests with pytest | ||
| ``` | ||
|
|
||
| ### Prompt Regression Tests | ||
|
|
||
| **Location**: `python/tests/prompts/` | ||
| **Documentation**: `@PRPs/ai_docs/CODE_SUMMARY_PROMPT.md` | ||
|
|
||
| Regression tests for AI prompts used in production. These ensure prompt changes don't break output structure or quality. | ||
|
|
||
| **When to run**: | ||
| - Before merging prompt changes | ||
| - When updating LLM providers or models | ||
| - As part of CI/CD pipeline | ||
| - When debugging summary/output quality issues | ||
|
|
||
| See `python/tests/prompts/README.md` for details on adding new prompt tests. | ||
| ``` |
There was a problem hiding this comment.
Orphaned closing fence at Line 173 breaks all downstream markdown rendering.
The Quick Workflows ```bash block originally closed at Line 173 (pre-existing ```). The new changes insert a closing fence at Line 157, correctly ending the bash block. However, the pre-existing ``` at Line 173 remains and is now orphaned — it opens a new unlanguaged code block, causing the entire ## Architecture Overview section and everything below it to render as raw code.
Fix: remove the pre-existing ``` at Line 173, since Line 157 already properly closes the bash block.
🔧 Proposed fix
See `python/tests/prompts/README.md` for details on adding new prompt tests.
-```
## Architecture Overview🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 173-173: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@CLAUDE.md` around lines 153 - 173, Remove the stray closing code-fence that
follows the "Prompt Regression Tests" / Quick Workflows bash block so there's
only one closing ``` for that block; specifically, delete the orphaned ``` that
currently precedes the "## Architecture Overview" heading so the bash block
closes correctly and the rest of the document renders as normal.
| created_at TIMESTAMPTZ DEFAULT now(), | ||
| updated_at TIMESTAMPTZ DEFAULT now(), |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Search for updated_at related triggers and patterns
find . -name "*.sql" -type f | head -20Repository: coleam00/Archon
Length of output: 847
🏁 Script executed:
# Check for update triggers in migration files
rg "update_updated_at|BEFORE UPDATE" --type sqlRepository: coleam00/Archon
Length of output: 1694
🏁 Script executed:
# Search for how Python handles updated_at in service code
find . -path "*/python/src/server/*" -name "*.py" -type f | xargs grep -l "updated_at" 2>/dev/null | head -10Repository: coleam00/Archon
Length of output: 637
🏁 Script executed:
# Check the migration file structure around other tables
find . -name "*.sql" -path "*/migration/*" -type f | sortRepository: coleam00/Archon
Length of output: 847
🏁 Script executed:
# Read the migration file to check if it includes the trigger
cat ./migration/0.1.0/012_add_crawl_url_state.sqlRepository: coleam00/Archon
Length of output: 2249
🏁 Script executed:
# Check the crawl_url_state_service.py to see how it handles updates
cat ./python/src/server/services/crawling/crawl_url_state_service.pyRepository: coleam00/Archon
Length of output: 10346
Add updated_at trigger to match project pattern and prevent accidental omission.
The column defaults to now() at insert time but lacks the auto-update trigger used consistently on other tables (archon_settings, archon_projects, archon_tasks, archon_prompts, archon_configured_repositories, archon_agent_work_orders). While the Python service currently sets updated_at explicitly on every update operation, the trigger provides a defensive layer against future code paths that might forget to do so.
Add the following after the table creation:
Suggested trigger
CREATE TRIGGER set_updated_at_crawl_url_state
BEFORE UPDATE ON archon_crawl_url_state
FOR EACH ROW
EXECUTE FUNCTION update_updated_at_column();(The update_updated_at_column() function is already defined in migration/complete_setup.sql)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@migration/0.1.0/012_add_crawl_url_state.sql` around lines 21 - 22, Add the
missing UPDATE trigger for updated_at on the archon_crawl_url_state table: after
the table definition that contains the updated_at TIMESTAMPTZ DEFAULT now()
column, create a trigger named set_updated_at_crawl_url_state that fires BEFORE
UPDATE ON archon_crawl_url_state FOR EACH ROW and EXECUTE FUNCTION
update_updated_at_column(); this mirrors the pattern used for archon_settings,
archon_projects, etc., and ensures updated_at is auto-updated even if the
application layer omits it.
| -- RLS Policy: Service role has full access | ||
| CREATE POLICY "Service role full access to crawl_url_state" ON archon_crawl_url_state | ||
| FOR ALL USING (true) WITH CHECK (true); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, view the specific migration file lines in question
cat -n migration/0.1.0/012_add_crawl_url_state.sql | sed -n '40,50p'Repository: coleam00/Archon
Length of output: 374
🏁 Script executed:
#!/bin/bash
# Check all RLS policies in SQL files for consistency and pattern
rg -n "CREATE POLICY" --type sql -A 1 | head -100Repository: coleam00/Archon
Length of output: 5445
🏁 Script executed:
#!/bin/bash
# Look for Supabase/PostgreSQL RLS policies in the codebase
find . -name "*.sql" -type f -exec grep -l "CREATE POLICY" {} \; | head -20Repository: coleam00/Archon
Length of output: 271
RLS policy grants access to all roles, not just service_role.
USING (true) WITH CHECK (true) without role restriction allows any authenticated role (including anon) to read/write this table. Other policies in the codebase restrict service role access using either USING (auth.role() = 'service_role') or TO service_role. Restrict this policy consistently:
-CREATE POLICY "Service role full access to crawl_url_state" ON archon_crawl_url_state
- FOR ALL USING (true) WITH CHECK (true);
+CREATE POLICY "Service role full access to crawl_url_state" ON archon_crawl_url_state
+ FOR ALL TO service_role USING (true) WITH CHECK (true);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| -- RLS Policy: Service role has full access | |
| CREATE POLICY "Service role full access to crawl_url_state" ON archon_crawl_url_state | |
| FOR ALL USING (true) WITH CHECK (true); | |
| -- RLS Policy: Service role has full access | |
| CREATE POLICY "Service role full access to crawl_url_state" ON archon_crawl_url_state | |
| FOR ALL TO service_role USING (true) WITH CHECK (true); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@migration/0.1.0/012_add_crawl_url_state.sql` around lines 43 - 45, The
current RLS policy "Service role full access to crawl_url_state" on
archon_crawl_url_state uses USING (true) WITH CHECK (true) and thus grants
access to any role; update the policy to restrict it to the service role by
either replacing the predicate with USING (auth.role() = 'service_role') WITH
CHECK (auth.role() = 'service_role') or by adding TO service_role so only the
service role can read/write; modify the policy definition for "Service role full
access to crawl_url_state" accordingly.
| 3. **Run the test**: | ||
| ```bash | ||
| uv run python test_code_summary_prompt.py ollama | ||
| ``` |
There was a problem hiding this comment.
Inconsistent test file path.
Line 134 uses test_code_summary_prompt.py without a directory path, while lines 51–58 correctly reference tests/prompts/test_code_summary_prompt.py. This will fail if run from the python/ directory as documented.
📝 Suggested fix
3. **Run the test**:
```bash
- uv run python test_code_summary_prompt.py ollama
+ uv run python tests/prompts/test_code_summary_prompt.py ollama
```🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@PRPs/ai_docs/CODE_SUMMARY_PROMPT.md` around lines 132 - 135, The run command
in the markdown uses the wrong test path ("uv run python
test_code_summary_prompt.py ollama"); update this to point at the tests/prompts
test file used elsewhere by replacing that string with "uv run python
tests/prompts/test_code_summary_prompt.py ollama" so the command runs from the
python/ directory and matches the other references to
tests/prompts/test_code_summary_prompt.py.
| Tracks per-URL crawl progress to enable checkpoint/resume functionality. | ||
| """ | ||
|
|
||
| from datetime import UTC |
There was a problem hiding this comment.
UTC is a timezone, not a timestamp — every created_at/updated_at field will store a datetime.timezone.utc object instead of the current time.
from datetime import UTC imports datetime.timezone.utc, which is a timezone constant, not a datetime value. Assigning now = UTC (line 45) and using "updated_at": UTC (lines 143, 168) writes the timezone sentinel into Supabase rather than the current timestamp. This will either corrupt rows or raise a serialization error at runtime.
Proposed fix
-from datetime import UTC
+from datetime import UTC, datetimeThen everywhere UTC is used as a value, replace with datetime.now(UTC):
- now = UTC
+ now = datetime.now(UTC).isoformat()- "updated_at": UTC,
+ "updated_at": datetime.now(UTC).isoformat(),- update_data = {"status": status, "updated_at": UTC}
+ update_data = {"status": status, "updated_at": datetime.now(UTC).isoformat()}As per coding guidelines: "Never accept corrupted data - skip failed items entirely rather than storing corrupted or incomplete data."
Also applies to: 45-45
| # Mark URLs as embedded after successful storage | ||
| if unique_doc_urls: | ||
| try: | ||
| for doc_url in unique_doc_urls: | ||
| url_state_service.mark_embedded(original_source_id, doc_url) | ||
| safe_logfire_info(f"Marked {len(unique_doc_urls)} URLs as embedded") | ||
| except Exception as e: | ||
| safe_logfire_error(f"Failed to mark URLs as embedded: {e}") |
There was a problem hiding this comment.
mark_embedded incorrectly marks URLs that were never actually embedded — corrupts Phase 2 resume state.
unique_doc_urls (Line 71–72) is built from ALL crawl_results entries with a non-empty URL, including documents whose markdown was empty. Those documents are skipped during the processing loop (Lines 107–110: continue) and are never chunked, never added to all_urls, and never passed to add_documents_to_supabase. Calling mark_embedded for them nonetheless marks them as embedded in archon_crawl_url_state.
When Phase 2 resume logic checks this state to skip already-embedded URLs, it will incorrectly skip these never-embedded URLs, causing silent data loss.
Fix: replace unique_doc_urls at Line 292 with list(set(all_urls)) — all_urls is only populated for documents that passed the content check (Line 140) and were actually chunked.
🐛 Proposed fix
- # Mark URLs as embedded after successful storage
- if unique_doc_urls:
+ # Mark only successfully chunked URLs as embedded
+ embedded_urls = list(set(all_urls))
+ if embedded_urls:
try:
- for doc_url in unique_doc_urls:
+ for doc_url in embedded_urls:
url_state_service.mark_embedded(original_source_id, doc_url)
- safe_logfire_info(f"Marked {len(unique_doc_urls)} URLs as embedded")
+ safe_logfire_info(f"Marked {len(embedded_urls)} URLs as embedded")
except Exception as e:
safe_logfire_error(f"Failed to mark URLs as embedded: {e}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Mark URLs as embedded after successful storage | |
| if unique_doc_urls: | |
| try: | |
| for doc_url in unique_doc_urls: | |
| url_state_service.mark_embedded(original_source_id, doc_url) | |
| safe_logfire_info(f"Marked {len(unique_doc_urls)} URLs as embedded") | |
| except Exception as e: | |
| safe_logfire_error(f"Failed to mark URLs as embedded: {e}") | |
| # Mark only successfully chunked URLs as embedded | |
| embedded_urls = list(set(all_urls)) | |
| if embedded_urls: | |
| try: | |
| for doc_url in embedded_urls: | |
| url_state_service.mark_embedded(original_source_id, doc_url) | |
| safe_logfire_info(f"Marked {len(embedded_urls)} URLs as embedded") | |
| except Exception as e: | |
| safe_logfire_error(f"Failed to mark URLs as embedded: {e}") |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/services/crawling/document_storage_operations.py` around
lines 291 - 298, The call to url_state_service.mark_embedded currently uses
unique_doc_urls (which includes URLs for documents that were skipped and never
embedded), corrupting resume state; update the post-storage marking to iterate
over the deduplicated list of actually-embedded URLs by replacing
unique_doc_urls with list(set(all_urls)) so only URLs collected in the
processing loop (all_urls) are marked embedded; look for the block in
document_storage_operations.py that calls url_state_service.mark_embedded and
the variables unique_doc_urls and all_urls within the same function to apply
this change.
| # Test results (generated) | ||
| *_results.json |
There was a problem hiding this comment.
Missing .gitignore in python/tests/prompts/
python/tests/prompts/test_code_summary_prompt.py (Line 213) writes code_summary_test_results.json to python/tests/prompts/. This .gitignore only covers python/tests/integration/, so that output file is unprotected and would be committed.
Add python/tests/prompts/.gitignore with the same pattern, or consolidate at python/tests/.gitignore.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/tests/integration/.gitignore` around lines 1 - 2, The tests are
writing code_summary_test_results.json into python/tests/prompts/ but only
python/tests/integration/ is ignored; add a .gitignore to prevent committing
that output by either creating python/tests/prompts/.gitignore containing the
same pattern (*_results.json or specifically code_summary_test_results.json) or
consolidate by adding *_results.json to python/tests/.gitignore; update the
repository by committing the new/updated .gitignore so
python/tests/prompts/test_code_summary_prompt.py output is ignored.
| try: | ||
| from src.server.services.storage.code_storage_service import ( | ||
| _generate_code_example_summary_async, | ||
| ) | ||
| except ImportError as e: | ||
| print(f"\n❌ Failed to import code summary function: {e}") | ||
| print("\nPlease ensure you're running from the python/ directory") | ||
| return |
There was a problem hiding this comment.
Silent return on ImportError means pytest always reports PASSED even when the import fails.
return (implicitly returns None) after catching ImportError is invisible to pytest; the test appears green. Use pytest.skip for a missing environment, or re-raise to surface the real error.
🔧 Proposed fix
except ImportError as e:
print(f"\n❌ Failed to import code summary function: {e}")
- print("\nPlease ensure you're running from the python/ directory")
- return
+ pytest.skip(f"Cannot import code summary function: {e}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try: | |
| from src.server.services.storage.code_storage_service import ( | |
| _generate_code_example_summary_async, | |
| ) | |
| except ImportError as e: | |
| print(f"\n❌ Failed to import code summary function: {e}") | |
| print("\nPlease ensure you're running from the python/ directory") | |
| return | |
| try: | |
| from src.server.services.storage.code_storage_service import ( | |
| _generate_code_example_summary_async, | |
| ) | |
| except ImportError as e: | |
| print(f"\n❌ Failed to import code summary function: {e}") | |
| pytest.skip(f"Cannot import code summary function: {e}") |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/tests/integration/test_code_summary_prompt_quick.py` around lines 70 -
77, The test currently swallows ImportError and does a plain return, masking
failures; update the import error handling in
python/tests/integration/test_code_summary_prompt_quick.py where
_generate_code_example_summary_async is imported: replace the bare return with
either pytest.skip("dependency missing: ...") (after importing pytest) when the
import/environment is optional, or re-raise the caught ImportError to fail the
test and surface the real error; ensure the ImportError handler uses the chosen
approach and includes the original exception message for context.
| async def test_single_summary(sample: dict, provider: str = None): | ||
| """Test summary generation for a single code sample.""" | ||
| print(f"\n{'='*80}") | ||
| print(f"Testing: {sample['name']}") | ||
| print(f"Language: {sample['language']}") | ||
| print(f"{'='*80}") | ||
|
|
||
| print(f"\nCode snippet (first 200 chars):") | ||
| print(f"{sample['code'][:200]}...") | ||
|
|
||
| try: | ||
| result = await _generate_code_example_summary_async( | ||
| code=sample['code'], | ||
| context_before=sample['context_before'], | ||
| context_after=sample['context_after'], | ||
| language=sample['language'], | ||
| provider=provider | ||
| ) | ||
|
|
||
| print(f"\n✅ SUCCESS - Generated summary:") | ||
| print(f" Example Name: {result['example_name']}") | ||
| print(f" Summary: {result['summary']}") | ||
|
|
||
| # Verify JSON structure | ||
| assert 'example_name' in result, "Missing 'example_name' field" | ||
| assert 'summary' in result, "Missing 'summary' field" | ||
| assert len(result['example_name']) > 0, "Empty 'example_name'" | ||
| assert len(result['summary']) > 0, "Empty 'summary'" | ||
|
|
||
| # Check if summary follows the structured format | ||
| has_purpose = 'PURPOSE:' in result['summary'].upper() or 'purpose' in result['summary'].lower() | ||
| has_params = 'PARAMETERS:' in result['summary'].upper() or 'parameter' in result['summary'].lower() | ||
| has_use = 'USE WHEN:' in result['summary'].upper() or 'use' in result['summary'].lower() | ||
|
|
||
| structure_score = sum([has_purpose, has_params, has_use]) | ||
| print(f" Structure indicators: {structure_score}/3 (PURPOSE/PARAMETERS/USE WHEN)") | ||
|
|
||
| return True, result | ||
|
|
||
| except Exception as e: | ||
| print(f"\n❌ FAILED with error:") | ||
| print(f" {type(e).__name__}: {str(e)}") | ||
| return False, None |
There was a problem hiding this comment.
test_single_summary causes a pytest collection ERROR and silently swallows assertion failures.
Two issues:
-
Pytest collection error:
test_single_summary(sample: dict, ...)starts withtest_so pytest tries to collect it.samplehas no default and no matching pytest fixture, causingERROR: fixture 'sample' not found— theuv run pytest tests/prompts/ -vcommand documented in bothCLAUDE.mdandREADME.mdis broken. -
Silent failures: Assertions at lines 142–145 are inside
try/except Exception, so any assertion failure is caught and returned as(False, None), invisible to pytest.
Rename the helper to avoid collection, and let assertion errors propagate:
🔧 Proposed fix
-async def test_single_summary(sample: dict, provider: str = None):
+async def run_single_summary(sample: dict, provider: str = None):
"""Test summary generation for a single code sample."""
...
try:
result = await _generate_code_example_summary_async(...)
- # Verify JSON structure
- assert 'example_name' in result, "Missing 'example_name' field"
- assert 'summary' in result, "Missing 'summary' field"
- assert len(result['example_name']) > 0, "Empty 'example_name'"
- assert len(result['summary']) > 0, "Empty 'summary'"
+ assert 'example_name' in result, "Missing 'example_name' field"
+ assert 'summary' in result, "Missing 'summary' field"
+ assert len(result['example_name']) > 0, "Empty 'example_name'"
+ assert len(result['summary']) > 0, "Empty 'summary'"
...
return True, result
except AssertionError:
raise # Let pytest see assertion failures
except Exception as e:
...
return False, NoneAnd update all call sites from test_single_summary → run_single_summary.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/tests/prompts/test_code_summary_prompt.py` around lines 118 - 160, The
helper function test_single_summary is being collected by pytest because it
starts with test_ and its parameter sample has no fixture, and its try/except
swallows assertion failures; rename the function (e.g., test_single_summary ->
run_single_summary) so pytest won't collect it, remove or lift the broad
try/except in run_single_summary so assertion errors propagate to pytest, and
update all call sites that referenced test_single_summary to the new name; keep
the original signature (code/context/language/provider) but ensure no
pytest-only fixtures are expected.
| def test_marks_url_as_failed_after_max_retries(self, url_state_service, mock_client): | ||
| """URL marked as failed after exceeding max retries.""" | ||
| mock_select_result = MagicMock() | ||
| mock_select_result.data = [{"retry_count": 3, "max_retries": 3}] | ||
| mock_client.table.return_value.select.return_value.match.return_value.execute.return_value = mock_select_result | ||
|
|
||
| result = url_state_service.mark_failed("source-1", "https://example.com/page1", "Connection timeout") | ||
|
|
||
| assert result is True | ||
|
|
||
| update_call = mock_client.table.return_value.update.return_value.match.return_value | ||
| update_call.execute.assert_called() | ||
|
|
||
| def test_increments_retry_count_below_max(self, url_state_service, mock_client): | ||
| """Retry count incremented when under max retries.""" | ||
| mock_select_result = MagicMock() | ||
| mock_select_result.data = [{"retry_count": 1, "max_retries": 3}] | ||
| mock_client.table.return_value.select.return_value.match.return_value.execute.return_value = mock_select_result | ||
|
|
||
| result = url_state_service.mark_failed("source-1", "https://example.com/page1", "Connection timeout") | ||
|
|
||
| assert result is True | ||
|
|
||
| update_call = mock_client.table.return_value.update.return_value.match.return_value | ||
| update_call.execute.assert_called() |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
mark_failed tests don't assert the status value written to the DB.
Both tests verify that execute was called but don't check what was passed to update(...). For test_marks_url_as_failed_after_max_retries (retry_count == max_retries), you should assert the update payload contains "status": "failed". For test_increments_retry_count_below_max (retry_count < max_retries), assert "status": "pending" and "retry_count": 2.
Without these assertions, the tests pass even if the service writes the wrong status.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/tests/test_crawl_url_state_service.py` around lines 145 - 169, Add
assertions in the two tests to verify the payload passed to the DB update call:
for test_marks_url_as_failed_after_max_retries assert that the update payload
includes "status": "failed"; for test_increments_retry_count_below_max assert
the payload includes "status": "pending" and "retry_count": 2. Locate the mocked
update by referencing
mock_client.table.return_value.update.return_value.match.return_value (or the
update(...) call on mock_client.table()) and assert the call arguments (e.g.,
via assert_called_with or inspecting call_args) to ensure the correct dict was
written.
- Merge ADR-001 into ADR-002, mark ADR-001 as superseded - Add provenance fields to backend API responses (embedding_model, embedding_provider, vectorizer_settings, summarization_model, etc.) - Add VectorizerSettings type and provenance fields to frontend types - Add collapsible 'Processing Details' section to KnowledgeCard showing embedding provider/model, dimensions, summarization model, and timestamps - Add implementation status tracking in ADR-002-IMPLEMENTATION-STATUS.md
- Add checkpoint/resume filtering to crawling service (skip already-embedded URLs) - Add provenance tracking to source management (embedding model, provider, dimensions, vectorizer settings, summarization model, timestamps) - Add migration 013_add_provenance_tracking.sql to add provenance columns - Update document storage operations to capture and store provenance - Update recursive crawling strategy to support resume All changes pass existing tests.
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
python/src/server/services/crawling/strategies/recursive.py (1)
164-174:⚠️ Potential issue | 🔴 CriticalEmbedded URLs are not normalized before being added to
visited, causing resume to miss matches.URLs from
get_embedded_urlsare added tovisitedas-is (line 168), but all subsequent membership checks usenormalize_url(fragment-stripping viaurldefrag) — see line 200. If any stored URL contains a fragment (e.g.,#section), the normalized form won't match the raw form invisited, and the URL will be re-crawled despite being already embedded.Proposed fix — normalize embedded URLs
# If resume filtering is enabled, pre-populate visited with already-embedded URLs if url_state_service and source_id: embedded_urls = url_state_service.get_embedded_urls(source_id) if embedded_urls: - visited.update(embedded_urls) + visited.update(normalize_url(u) for u in embedded_urls) logger.info(f"Resume filtering: pre-loaded {len(embedded_urls)} already-embedded URLs")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/crawling/strategies/recursive.py` around lines 164 - 174, Normalize embedded URLs before adding them to the visited set: when url_state_service.get_embedded_urls(source_id) returns embedded_urls, map each through the existing normalize_url (the urldefrag wrapper) before calling visited.update(...). Update the block that handles resume filtering (the code that calls get_embedded_urls and logger.info) so it uses the normalized forms to ensure membership checks later in the crawl (where normalize_url is used) will correctly detect already-embedded URLs.python/src/server/services/knowledge/knowledge_item_service.py (1)
166-182:⚠️ Potential issue | 🟠 Major
**source_metadataspread at the end silently overrides computed values.In the
list_itemsmethod,**source_metadataon line 181 is placed after the explicit keys (knowledge_type,tags,source_type,status,description, etc.). Ifsource_metadatacontains any of those keys, the raw DB values will silently overwrite the computed/defaulted values above.Compare with
_transform_source_to_item(line 376) where the spread is correctly placed first with a comment explaining the intent.Proposed fix — move the spread to the beginning
"metadata": { + **source_metadata, "knowledge_type": source_metadata.get("knowledge_type", "technical"), "tags": source_metadata.get("tags", []), "source_type": source_type, "status": "active", "description": source_metadata.get("description", source.get("summary", "")), "chunks_count": chunks_count, "word_count": source.get("total_word_count", 0), "estimated_pages": round(source.get("total_word_count", 0) / 250, 1), "pages_tooltip": f"{round(source.get('total_word_count', 0) / 250, 1)} pages (≈ {source.get('total_word_count', 0):,} words)", "last_scraped": source.get("updated_at"), "file_name": source_metadata.get("file_name"), "file_type": source_metadata.get("file_type"), "update_frequency": source_metadata.get("update_frequency", 7), "code_examples_count": code_examples_count, - **source_metadata, },🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/knowledge/knowledge_item_service.py` around lines 166 - 182, In list_items in knowledge_item_service.py the final "**source_metadata" spread overrides the computed metadata keys; move the spread so source_metadata is merged first (i.e., put **source_metadata at the start of the metadata dict) so the explicit computed keys like "knowledge_type", "tags", "source_type", "status", "description", "chunks_count", "word_count", "estimated_pages", "pages_tooltip", "last_scraped", "file_name", "file_type", "update_frequency", and "code_examples_count" take precedence; mirror the ordering used in _transform_source_to_item and keep a short comment explaining that computed values should override raw DB metadata.
🧹 Nitpick comments (5)
python/src/server/services/knowledge/knowledge_item_service.py (1)
450-462:_get_chunks_countfetches full rows when only the count is needed.Using
.select("*", count="exact")withouthead=Trueretrieves all matching row data just to read the count. Addhead=Trueto issue a HEAD-style request, consistent with how the count query is done inlist_items(line 62).Proposed fix
result = ( self.supabase.table("archon_crawled_pages") - .select("*", count="exact") + .select("*", count="exact", head=True) .eq("source_id", source_id) .execute() )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/knowledge/knowledge_item_service.py` around lines 450 - 462, The _get_chunks_count function issues a full SELECT (self.supabase.table("archon_crawled_pages").select("*", count="exact")) which retrieves rows unnecessarily; change the query to use a HEAD-style request by adding head=True (consistent with list_items) so only the count is returned, and return result.count or 0 as before. Ensure you update the call inside _get_chunks_count and keep the same result handling logic.python/src/server/services/source_management_service.py (1)
297-313: Duplicate provenance block — extract a helper and movedatetimeimport to module level.The provenance-field block (lines 297–313) is duplicated nearly verbatim at lines 378–394. Also,
from datetime import datetimeis imported inline in both branches despiteUTCalready being imported at the module top (line 8).
- Move
from datetime import datetimeto the top-level imports alongsideUTC.- Extract the provenance fields + timestamp logic into a small helper to avoid the duplication.
Proposed refactor
Add to top-level imports:
-from datetime import UTC +from datetime import UTC, datetimeExtract a helper (e.g., near line 215):
def _add_provenance_fields( upsert_data: dict, *, embedding_model: str | None, embedding_dimensions: int | None, embedding_provider: str | None, vectorizer_settings: dict | None, summarization_model: str | None, ) -> None: """Add provenance tracking fields to an upsert payload.""" if embedding_model: upsert_data["embedding_model"] = embedding_model if embedding_dimensions is not None: upsert_data["embedding_dimensions"] = embedding_dimensions if embedding_provider: upsert_data["embedding_provider"] = embedding_provider if vectorizer_settings is not None: upsert_data["vectorizer_settings"] = vectorizer_settings if summarization_model: upsert_data["summarization_model"] = summarization_model now = datetime.now(UTC).isoformat() upsert_data["last_crawled_at"] = now upsert_data["last_vectorized_at"] = nowThen replace both inline blocks with a single call to
_add_provenance_fields(upsert_data, ...).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/source_management_service.py` around lines 297 - 313, Extract the duplicated provenance/timestamp logic into a helper named _add_provenance_fields that accepts upsert_data and the provenance params (embedding_model, embedding_dimensions, embedding_provider, vectorizer_settings, summarization_model) and sets the fields plus both last_crawled_at and last_vectorized_at to the same datetime string; replace both inline blocks with calls to _add_provenance_fields(upsert_data, ...). Also move from datetime import datetime to the module-level imports (next to UTC) so the helper and callers use the top-level datetime import rather than importing inline.python/src/server/services/crawling/strategies/recursive.py (1)
45-46:url_state_servicetyped asAny— consider using a narrower type.Using
Anyloses type safety. AProtocolor the concreteCrawlUrlStateServicetype (with a lazy/string-forward-ref import to avoid circular deps) would catch misuse at type-check time.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/crawling/strategies/recursive.py` around lines 45 - 46, The parameter url_state_service is currently annotated as Any which loses type safety; replace it with a narrower type such as CrawlUrlStateService | None (or a Protocol describing the methods you use) and use a forward reference or TYPE_CHECKING import to avoid circular imports. Concretely, update the signature for the function/class in recursive.py to use CrawlUrlStateService | None (or a custom Protocol name) instead of Any, add a conditional import under typing.TYPE_CHECKING (or use a string annotation) so the type checker sees CrawlUrlStateService without creating runtime import cycles, and ensure the code uses only the methods declared on that type/Protocol.python/src/server/services/crawling/crawling_service.py (2)
897-899: Unparameterizedtuplereturn type annotation.
-> tupledoesn't convey the shape of the return value. Consider-> tuple[list[dict[str, Any]], str | None]so callers and type checkers know what to expect.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/crawling/crawling_service.py` around lines 897 - 899, The return type annotation for _crawl_by_url_type is uninformative; change its signature from -> tuple to a precise type such as -> tuple[list[dict[str, Any]], str | None] (or the appropriate concrete element types) so callers and type checkers know the return shape; update any imports (typing.Any already used) if needed and ensure callers that unpack the returned tuple still match the new annotated shape.
269-270: Use a concrete type instead ofAnyforurl_state_service.
url_state_service: Any | None = Noneloses all type safety. Consider importing the actual type (e.g.,CrawlUrlStateService) or using a Protocol for the expected interface.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/crawling/crawling_service.py` around lines 269 - 270, The parameter url_state_service on the function/signature in crawling_service.py is typed as Any which removes type safety; replace it with the concrete service type (e.g., CrawlUrlStateService) or a Protocol that defines the methods you use, import that type at top, and update the parameter annotation from url_state_service: Any | None = None to url_state_service: CrawlUrlStateService | None = None (or the Protocol name) so callers and static checkers can validate usage; ensure any references to url_state_service methods inside the class match the chosen interface.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ADR-002`: Crawl Reliability, Ingestion Quality Control & DB Validation.md:
- Around line 53-66: The ADR example's crawl_url_state schema references
knowledge_sources(id) and uses UUID for source_id which doesn't match the
codebase (archon_sources with source_id TEXT); update the example SQL for
crawl_url_state to match the real migration by changing the reference to
archon_sources and the column type to TEXT (e.g., source_id TEXT NOT NULL
REFERENCES archon_sources(source_id)) or remove the FK and keep the same TEXT
column as in the codebase so the ADR mirrors the actual schema used by functions
that rely on crawl_url_state and archon_sources.
In `@archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx`:
- Around line 330-338: The Vectorizer row currently uses truthy checks which
render "0" for chunk_size===0 and can produce an empty <span>; update the
KnowledgeCard rendering logic for item.vectorizer_settings to build a list of
display parts: include chunk when chunk_size is a number (e.g., typeof
item.vectorizer_settings.chunk_size === "number"), include "contextual" when
item.vectorizer_settings.use_contextual is true, and include "hybrid" when
item.vectorizer_settings.use_hybrid is true, then only render the entire row
(the surrounding <div> and <span>) if the parts array has length > 0; refer to
the item.vectorizer_settings, chunk_size, use_contextual, and use_hybrid
identifiers to locate the code.
- Around line 340-351: The date rendering for item.last_crawled_at and
item.last_vectorized_at in KnowledgeCard.tsx uses format(new Date(...))
unguarded and can throw on invalid inputs; wrap each rendering in a safe check
(e.g., parse the string and verify with date-fns isValid or wrap format(new
Date(...)) in a try/catch) so you only render a formatted string when valid,
otherwise render a fallback (e.g., "Unknown" or omit); alternatively, extract a
shared safeDateFormat utility (safeDateFormat(dateStr): string | null) and call
it from the KnowledgeCard footer and these provenance lines to DRY the logic and
ensure consistent fallback behavior for invalid dates.
In `@python/src/server/services/crawling/crawling_service.py`:
- Around line 364-384: Wrap the resume-state logic that calls
get_crawl_url_state_service(...), has_existing_state(original_source_id),
get_crawl_state(original_source_id), and clear_state(original_source_id) in a
try/except so any DB error does not abort the crawl; on exception log a warning
via safe_logfire_info and continue. Before calling clear_state, verify the
crawl_state includes a "fetched" count (or at minimum ensure pending==0 and
failed==0 and fetched>=total) to avoid clearing prematurely. If these URL state
methods are synchronous and this module is async, call them off the event loop
(e.g., asyncio.to_thread) or use their async equivalents to avoid blocking.
Ensure all caught exceptions are swallowed after logging so resume-state
failures degrade gracefully.
- Around line 863-895: _filter_already_processed_urls is declared async but
calls the synchronous get_embedded_urls (via
get_crawl_url_state_service(self.supabase_client)) which can block the event
loop; either make the method synchronous or run the blocking call in a thread.
Fix by either changing _filter_already_processed_urls to a regular def (and
adjust callers) or wrap the blocking call with asyncio.to_thread, e.g. await
asyncio.to_thread(url_state_service.get_embedded_urls, source_id), then proceed
with building embedded_set, filtering urls, and calling
safe_logfire_info(progress_id=self.progress_id) as before.
In `@python/src/server/services/crawling/document_storage_operations.py`:
- Around line 376-394: The provenance retrieval currently can raise and abort
the primary update_source_info path; wrap the calls to
credential_service.get_credentials_by_category("embedding") and ("rag_strategy")
and the int(...) conversion for embedding_dimensions in a small try/except block
so failures are caught, logged, and the code continues with safe fallback values
(e.g., embedding_provider=None, embedding_model=None, embedding_dimensions=None
or a validated integer default, summarization_model=None) instead of raising;
ensure this block lives just before building
vectorizer_settings/summarization_model (around the
embedding_config/embedding_dimensions and rag_settings retrieval) and that
update_source_info uses those fallback None/validated values so provenance
fields are omitted but the primary path is not aborted.
In `@python/src/server/services/source_management_service.py`:
- Around line 298-301: The guard `if embedding_dimensions:` drops valid zero
values; change it to `if embedding_dimensions is not None:` when setting
upsert_data["embedding_dimensions"] so 0 is preserved—update the occurrence near
the block that sets upsert_data (the same area that checks embedding_model) and
also apply the identical change at the later occurrence around line ~381 where
embedding_dimensions is checked again (both places in
source_management_service.py that populate upsert_data/vectorizer settings).
---
Outside diff comments:
In `@python/src/server/services/crawling/strategies/recursive.py`:
- Around line 164-174: Normalize embedded URLs before adding them to the visited
set: when url_state_service.get_embedded_urls(source_id) returns embedded_urls,
map each through the existing normalize_url (the urldefrag wrapper) before
calling visited.update(...). Update the block that handles resume filtering (the
code that calls get_embedded_urls and logger.info) so it uses the normalized
forms to ensure membership checks later in the crawl (where normalize_url is
used) will correctly detect already-embedded URLs.
In `@python/src/server/services/knowledge/knowledge_item_service.py`:
- Around line 166-182: In list_items in knowledge_item_service.py the final
"**source_metadata" spread overrides the computed metadata keys; move the spread
so source_metadata is merged first (i.e., put **source_metadata at the start of
the metadata dict) so the explicit computed keys like "knowledge_type", "tags",
"source_type", "status", "description", "chunks_count", "word_count",
"estimated_pages", "pages_tooltip", "last_scraped", "file_name", "file_type",
"update_frequency", and "code_examples_count" take precedence; mirror the
ordering used in _transform_source_to_item and keep a short comment explaining
that computed values should override raw DB metadata.
---
Duplicate comments:
In `@python/src/server/services/crawling/document_storage_operations.py`:
- Around line 74-78: The except block around url_state_service.initialize_urls
should log the full exception stack trace: in the except Exception as e handler
for initialize_urls (the block that currently calls safe_logfire_error), add a
call to the module logger.error with exc_info=True (in addition to
safe_logfire_error) so the failure of url_state_service.initialize_urls is
recorded with the full traceback; keep the existing safe_logfire_error message
and call logger.error(f"Failed to initialize URL state: {e}", exc_info=True).
- Around line 291-298: The code marks entries via
url_state_service.mark_embedded using unique_doc_urls which includes any
non-empty URL from crawl_results (including docs that were skipped because their
markdown was empty), causing corrupt resume state; change the marking loop to
iterate over all_urls (the list populated only for actually-processed documents)
instead of unique_doc_urls, update the safe_logfire_info count to len(all_urls),
and keep the same exception handling (safe_logfire_error) and
original_source_id/mark_embedded calls so only truly-processed URLs are marked
embedded.
---
Nitpick comments:
In `@python/src/server/services/crawling/crawling_service.py`:
- Around line 897-899: The return type annotation for _crawl_by_url_type is
uninformative; change its signature from -> tuple to a precise type such as ->
tuple[list[dict[str, Any]], str | None] (or the appropriate concrete element
types) so callers and type checkers know the return shape; update any imports
(typing.Any already used) if needed and ensure callers that unpack the returned
tuple still match the new annotated shape.
- Around line 269-270: The parameter url_state_service on the function/signature
in crawling_service.py is typed as Any which removes type safety; replace it
with the concrete service type (e.g., CrawlUrlStateService) or a Protocol that
defines the methods you use, import that type at top, and update the parameter
annotation from url_state_service: Any | None = None to url_state_service:
CrawlUrlStateService | None = None (or the Protocol name) so callers and static
checkers can validate usage; ensure any references to url_state_service methods
inside the class match the chosen interface.
In `@python/src/server/services/crawling/strategies/recursive.py`:
- Around line 45-46: The parameter url_state_service is currently annotated as
Any which loses type safety; replace it with a narrower type such as
CrawlUrlStateService | None (or a Protocol describing the methods you use) and
use a forward reference or TYPE_CHECKING import to avoid circular imports.
Concretely, update the signature for the function/class in recursive.py to use
CrawlUrlStateService | None (or a custom Protocol name) instead of Any, add a
conditional import under typing.TYPE_CHECKING (or use a string annotation) so
the type checker sees CrawlUrlStateService without creating runtime import
cycles, and ensure the code uses only the methods declared on that
type/Protocol.
In `@python/src/server/services/knowledge/knowledge_item_service.py`:
- Around line 450-462: The _get_chunks_count function issues a full SELECT
(self.supabase.table("archon_crawled_pages").select("*", count="exact")) which
retrieves rows unnecessarily; change the query to use a HEAD-style request by
adding head=True (consistent with list_items) so only the count is returned, and
return result.count or 0 as before. Ensure you update the call inside
_get_chunks_count and keep the same result handling logic.
In `@python/src/server/services/source_management_service.py`:
- Around line 297-313: Extract the duplicated provenance/timestamp logic into a
helper named _add_provenance_fields that accepts upsert_data and the provenance
params (embedding_model, embedding_dimensions, embedding_provider,
vectorizer_settings, summarization_model) and sets the fields plus both
last_crawled_at and last_vectorized_at to the same datetime string; replace both
inline blocks with calls to _add_provenance_fields(upsert_data, ...). Also move
from datetime import datetime to the module-level imports (next to UTC) so the
helper and callers use the top-level datetime import rather than importing
inline.
| ```sql | ||
| CREATE TABLE crawl_url_state ( | ||
| id UUID PRIMARY KEY DEFAULT gen_random_uuid(), | ||
| source_id UUID NOT NULL REFERENCES knowledge_sources(id), | ||
| url TEXT NOT NULL, | ||
| status TEXT NOT NULL, -- pending | fetched | embedded | failed | ||
| chunk_count INTEGER, | ||
| content_hash TEXT, -- for duplicate detection | ||
| error_message TEXT, | ||
| created_at TIMESTAMPTZ DEFAULT now(), | ||
| updated_at TIMESTAMPTZ DEFAULT now(), | ||
| UNIQUE(source_id, url) | ||
| ); | ||
| ``` |
There was a problem hiding this comment.
ADR SQL example references knowledge_sources(id) — actual table is archon_sources.
The example schema shows source_id UUID NOT NULL REFERENCES knowledge_sources(id), but the codebase uses archon_sources with source_id TEXT. This could mislead future implementers reading the ADR. Consider updating the example to match the actual migration.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ADR-002`: Crawl Reliability, Ingestion Quality Control & DB Validation.md
around lines 53 - 66, The ADR example's crawl_url_state schema references
knowledge_sources(id) and uses UUID for source_id which doesn't match the
codebase (archon_sources with source_id TEXT); update the example SQL for
crawl_url_state to match the real migration by changing the reference to
archon_sources and the column type to TEXT (e.g., source_id TEXT NOT NULL
REFERENCES archon_sources(source_id)) or remove the FK and keep the same TEXT
column as in the codebase so the ADR mirrors the actual schema used by functions
that rely on crawl_url_state and archon_sources.
| {item.vectorizer_settings && ( | ||
| <div className="flex items-center gap-2"> | ||
| <span className="text-gray-400">Vectorizer:</span> | ||
| <span> | ||
| {item.vectorizer_settings.chunk_size && `chunk=${item.vectorizer_settings.chunk_size}`} | ||
| {item.vectorizer_settings.use_contextual && " contextual"} | ||
| {item.vectorizer_settings.use_hybrid && " hybrid"} | ||
| </span> | ||
| </div> |
There was a problem hiding this comment.
chunk_size && ... will render literal 0 when chunk_size is 0.
Using && with a number in JSX is a well-known React gotcha — 0 && <element> renders "0". Additionally, if none of chunk_size, use_contextual, or use_hybrid are truthy, the "Vectorizer:" row renders with an empty <span>.
Proposed fix
{item.vectorizer_settings && (
<div className="flex items-center gap-2">
<span className="text-gray-400">Vectorizer:</span>
<span>
- {item.vectorizer_settings.chunk_size && `chunk=${item.vectorizer_settings.chunk_size}`}
+ {item.vectorizer_settings.chunk_size != null && `chunk=${item.vectorizer_settings.chunk_size}`}
{item.vectorizer_settings.use_contextual && " contextual"}
{item.vectorizer_settings.use_hybrid && " hybrid"}
</span>
</div>
)}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| {item.vectorizer_settings && ( | |
| <div className="flex items-center gap-2"> | |
| <span className="text-gray-400">Vectorizer:</span> | |
| <span> | |
| {item.vectorizer_settings.chunk_size && `chunk=${item.vectorizer_settings.chunk_size}`} | |
| {item.vectorizer_settings.use_contextual && " contextual"} | |
| {item.vectorizer_settings.use_hybrid && " hybrid"} | |
| </span> | |
| </div> | |
| {item.vectorizer_settings && ( | |
| <div className="flex items-center gap-2"> | |
| <span className="text-gray-400">Vectorizer:</span> | |
| <span> | |
| {item.vectorizer_settings.chunk_size != null && `chunk=${item.vectorizer_settings.chunk_size}`} | |
| {item.vectorizer_settings.use_contextual && " contextual"} | |
| {item.vectorizer_settings.use_hybrid && " hybrid"} | |
| </span> | |
| </div> | |
| )} |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx` around
lines 330 - 338, The Vectorizer row currently uses truthy checks which render
"0" for chunk_size===0 and can produce an empty <span>; update the KnowledgeCard
rendering logic for item.vectorizer_settings to build a list of display parts:
include chunk when chunk_size is a number (e.g., typeof
item.vectorizer_settings.chunk_size === "number"), include "contextual" when
item.vectorizer_settings.use_contextual is true, and include "hybrid" when
item.vectorizer_settings.use_hybrid is true, then only render the entire row
(the surrounding <div> and <span>) if the parts array has length > 0; refer to
the item.vectorizer_settings, chunk_size, use_contextual, and use_hybrid
identifiers to locate the code.
| {item.last_crawled_at && ( | ||
| <div className="flex items-center gap-2"> | ||
| <span className="text-gray-400">Last crawled:</span> | ||
| <span>{format(new Date(item.last_crawled_at), "M/d/yyyy h:mm a")}</span> | ||
| </div> | ||
| )} | ||
| {item.last_vectorized_at && ( | ||
| <div className="flex items-center gap-2"> | ||
| <span className="text-gray-400">Last vectorized:</span> | ||
| <span>{format(new Date(item.last_vectorized_at), "M/d/yyyy h:mm a")}</span> | ||
| </div> | ||
| )} |
There was a problem hiding this comment.
Unguarded format(new Date(...)) will throw on invalid date strings.
The existing card footer (lines 242–246) wraps its format() call in a try/catch, but the provenance dates here do not. If last_crawled_at or last_vectorized_at contains an unexpected value, this will crash the component.
Proposed fix — add safe formatting
{item.last_crawled_at && (
<div className="flex items-center gap-2">
<span className="text-gray-400">Last crawled:</span>
- <span>{format(new Date(item.last_crawled_at), "M/d/yyyy h:mm a")}</span>
+ <span>
+ {(() => {
+ try {
+ return format(new Date(item.last_crawled_at), "M/d/yyyy h:mm a");
+ } catch {
+ return new Date(item.last_crawled_at).toLocaleString();
+ }
+ })()}
+ </span>
</div>
)}
{item.last_vectorized_at && (
<div className="flex items-center gap-2">
<span className="text-gray-400">Last vectorized:</span>
- <span>{format(new Date(item.last_vectorized_at), "M/d/yyyy h:mm a")}</span>
+ <span>
+ {(() => {
+ try {
+ return format(new Date(item.last_vectorized_at), "M/d/yyyy h:mm a");
+ } catch {
+ return new Date(item.last_vectorized_at).toLocaleString();
+ }
+ })()}
+ </span>
</div>
)}Alternatively, extract a shared safeDateFormat utility to DRY up both this block and the footer.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| {item.last_crawled_at && ( | |
| <div className="flex items-center gap-2"> | |
| <span className="text-gray-400">Last crawled:</span> | |
| <span>{format(new Date(item.last_crawled_at), "M/d/yyyy h:mm a")}</span> | |
| </div> | |
| )} | |
| {item.last_vectorized_at && ( | |
| <div className="flex items-center gap-2"> | |
| <span className="text-gray-400">Last vectorized:</span> | |
| <span>{format(new Date(item.last_vectorized_at), "M/d/yyyy h:mm a")}</span> | |
| </div> | |
| )} | |
| {item.last_crawled_at && ( | |
| <div className="flex items-center gap-2"> | |
| <span className="text-gray-400">Last crawled:</span> | |
| <span> | |
| {(() => { | |
| try { | |
| return format(new Date(item.last_crawled_at), "M/d/yyyy h:mm a"); | |
| } catch { | |
| return new Date(item.last_crawled_at).toLocaleString(); | |
| } | |
| })()} | |
| </span> | |
| </div> | |
| )} | |
| {item.last_vectorized_at && ( | |
| <div className="flex items-center gap-2"> | |
| <span className="text-gray-400">Last vectorized:</span> | |
| <span> | |
| {(() => { | |
| try { | |
| return format(new Date(item.last_vectorized_at), "M/d/yyyy h:mm a"); | |
| } catch { | |
| return new Date(item.last_vectorized_at).toLocaleString(); | |
| } | |
| })()} | |
| </span> | |
| </div> | |
| )} |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx` around
lines 340 - 351, The date rendering for item.last_crawled_at and
item.last_vectorized_at in KnowledgeCard.tsx uses format(new Date(...))
unguarded and can throw on invalid inputs; wrap each rendering in a safe check
(e.g., parse the string and verify with date-fns isValid or wrap format(new
Date(...)) in a try/catch) so you only render a formatted string when valid,
otherwise render a fallback (e.g., "Unknown" or omit); alternatively, extract a
shared safeDateFormat utility (safeDateFormat(dateStr): string | null) and call
it from the KnowledgeCard footer and these provenance lines to DRY the logic and
ensure consistent fallback behavior for invalid dates.
| # Check for existing crawl state and determine if we're resuming | ||
| url_state_service = get_crawl_url_state_service(self.supabase_client) | ||
| has_existing_state = url_state_service.has_existing_state(original_source_id) | ||
|
|
||
| if has_existing_state: | ||
| crawl_state = url_state_service.get_crawl_state(original_source_id) | ||
| pending_count = crawl_state.get("pending", 0) | ||
| embedded_count = crawl_state.get("embedded", 0) | ||
| failed_count = crawl_state.get("failed", 0) | ||
| total_count = crawl_state.get("total", 0) | ||
|
|
||
| # If there are pending or failed URLs, log resume info | ||
| if pending_count > 0 or failed_count > 0: | ||
| safe_logfire_info( | ||
| f"Resuming crawl | source_id={original_source_id} | " | ||
| f"embedded={embedded_count} | pending={pending_count} | failed={failed_count} | total={total_count}" | ||
| ) | ||
| else: | ||
| # All URLs processed - clear old state for fresh crawl | ||
| url_state_service.clear_state(original_source_id) | ||
| safe_logfire_info(f"Cleared completed crawl state for fresh crawl | source_id={original_source_id}") |
There was a problem hiding this comment.
Resume state check is unguarded — a DB failure here kills the entire crawl.
If has_existing_state(), get_crawl_state(), or clear_state() throws (e.g., table doesn't exist yet, transient DB error), the exception propagates uncaught and the crawl fails entirely at the top-level except Exception handler (Line 749). Resume state checking is an optimization, not a prerequisite — it should degrade gracefully.
Additionally, the missing fetched-state check before clear_state and the synchronous DB calls in async context were noted in prior reviews and still apply.
🛡️ Proposed fix — wrap in try/except so crawl proceeds even if state check fails
# Check for existing crawl state and determine if we're resuming
url_state_service = get_crawl_url_state_service(self.supabase_client)
- has_existing_state = url_state_service.has_existing_state(original_source_id)
-
- if has_existing_state:
- crawl_state = url_state_service.get_crawl_state(original_source_id)
- pending_count = crawl_state.get("pending", 0)
- embedded_count = crawl_state.get("embedded", 0)
- failed_count = crawl_state.get("failed", 0)
- total_count = crawl_state.get("total", 0)
-
- # If there are pending or failed URLs, log resume info
- if pending_count > 0 or failed_count > 0:
- safe_logfire_info(
- f"Resuming crawl | source_id={original_source_id} | "
- f"embedded={embedded_count} | pending={pending_count} | failed={failed_count} | total={total_count}"
- )
- else:
- # All URLs processed - clear old state for fresh crawl
- url_state_service.clear_state(original_source_id)
- safe_logfire_info(f"Cleared completed crawl state for fresh crawl | source_id={original_source_id}")
+ has_existing_state = False
+ try:
+ has_existing_state = await asyncio.to_thread(
+ url_state_service.has_existing_state, original_source_id
+ )
+
+ if has_existing_state:
+ crawl_state = await asyncio.to_thread(
+ url_state_service.get_crawl_state, original_source_id
+ )
+ pending_count = crawl_state.get("pending", 0)
+ embedded_count = crawl_state.get("embedded", 0)
+ failed_count = crawl_state.get("failed", 0)
+ total_count = crawl_state.get("total", 0)
+
+ if pending_count > 0 or failed_count > 0:
+ safe_logfire_info(
+ f"Resuming crawl | source_id={original_source_id} | "
+ f"embedded={embedded_count} | pending={pending_count} | "
+ f"failed={failed_count} | total={total_count}"
+ )
+ else:
+ await asyncio.to_thread(
+ url_state_service.clear_state, original_source_id
+ )
+ safe_logfire_info(
+ f"Cleared completed crawl state for fresh crawl | source_id={original_source_id}"
+ )
+ except Exception as e:
+ logger.warning(
+ f"Failed to check crawl resume state for '{original_source_id}': {e}. "
+ "Proceeding with fresh crawl.",
+ exc_info=True,
+ )
+ has_existing_state = False🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/services/crawling/crawling_service.py` around lines 364 -
384, Wrap the resume-state logic that calls get_crawl_url_state_service(...),
has_existing_state(original_source_id), get_crawl_state(original_source_id), and
clear_state(original_source_id) in a try/except so any DB error does not abort
the crawl; on exception log a warning via safe_logfire_info and continue. Before
calling clear_state, verify the crawl_state includes a "fetched" count (or at
minimum ensure pending==0 and failed==0 and fetched>=total) to avoid clearing
prematurely. If these URL state methods are synchronous and this module is
async, call them off the event loop (e.g., asyncio.to_thread) or use their async
equivalents to avoid blocking. Ensure all caught exceptions are swallowed after
logging so resume-state failures degrade gracefully.
| async def _filter_already_processed_urls(self, source_id: str, urls: list[str]) -> list[str]: | ||
| """ | ||
| Filter out URLs that are already embedded. | ||
|
|
||
| async def _crawl_by_url_type(self, url: str, request: dict[str, Any]) -> tuple: | ||
| Args: | ||
| source_id: The source ID | ||
| urls: List of URLs to filter | ||
|
|
||
| Returns: | ||
| List of URLs that have not been embedded yet | ||
| """ | ||
| if not urls: | ||
| return [] | ||
|
|
||
| url_state_service = get_crawl_url_state_service(self.supabase_client) | ||
|
|
||
| # Get embedded URLs | ||
| embedded_urls = url_state_service.get_embedded_urls(source_id) | ||
| embedded_set = set(embedded_urls) | ||
|
|
||
| # Filter | ||
| filtered = [url for url in urls if url not in embedded_set] | ||
|
|
||
| # Log resume info | ||
| if len(filtered) < len(urls): | ||
| skipped = len(urls) - len(filtered) | ||
| safe_logfire_info( | ||
| f"Resume filtering | skipped={skipped} already-embedded URLs | " | ||
| f"remaining={len(filtered)} | source_id={source_id}", | ||
| progress_id=self.progress_id, | ||
| ) | ||
|
|
||
| return filtered |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Find safe_logfire_info function definition
rg "def safe_logfire_info" -A 5 python/Repository: coleam00/Archon
Length of output: 462
🏁 Script executed:
# Find get_embedded_urls function definition
rg "def get_embedded_urls" -A 5 python/Repository: coleam00/Archon
Length of output: 616
🏁 Script executed:
# Check imports in the crawling_service file to understand the context
rg "from.*safe_logfire_info|from.*get_crawl_url_state_service" python/src/server/services/crawling/crawling_service.pyRepository: coleam00/Archon
Length of output: 210
🏁 Script executed:
# Check how _filter_already_processed_urls is called
rg "_filter_already_processed_urls" python/src/server/services/crawling/ -B 2 -A 2Repository: coleam00/Archon
Length of output: 1772
🏁 Script executed:
# Check if there are other patterns in the codebase where sync DB calls are made in async methods
rg "get_embedded_urls|url_state_service\." python/src/server/services/crawling/crawling_service.py | head -20Repository: coleam00/Archon
Length of output: 373
🏁 Script executed:
# Look at the full implementation of safe_logfire_info to understand kwargs usage
cat python/src/server/config/logfire_config.py | grep -A 20 "def safe_logfire_info"Repository: coleam00/Archon
Length of output: 537
🏁 Script executed:
# Check how many other async methods make sync DB calls via url_state_service
rg "async def " python/src/server/services/crawling/crawling_service.py -A 30 | grep -E "async def|url_state_service\.|get_crawl_url_state_service"Repository: coleam00/Archon
Length of output: 1622
🏁 Script executed:
# Check if any await calls are used with get_crawl_url_state_service calls
rg "await.*get_crawl_url_state_service|await.*url_state_service" python/src/server/services/crawling/crawling_service.pyRepository: coleam00/Archon
Length of output: 41
_filter_already_processed_urls is async but contains no await calls — consider whether the method should be synchronous.
The method calls get_embedded_urls(), a synchronous operation, without awaiting it. While safe_logfire_info correctly accepts the progress_id keyword argument via **kwargs, the synchronous database call in an async context could block the event loop. If this pattern is intentional across the service, consider making the method synchronous instead, or use asyncio.to_thread() to run the blocking call on a thread pool.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/services/crawling/crawling_service.py` around lines 863 -
895, _filter_already_processed_urls is declared async but calls the synchronous
get_embedded_urls (via get_crawl_url_state_service(self.supabase_client)) which
can block the event loop; either make the method synchronous or run the blocking
call in a thread. Fix by either changing _filter_already_processed_urls to a
regular def (and adjust callers) or wrap the blocking call with
asyncio.to_thread, e.g. await
asyncio.to_thread(url_state_service.get_embedded_urls, source_id), then proceed
with building embedded_set, filtering urls, and calling
safe_logfire_info(progress_id=self.progress_id) as before.
| try: | ||
| # Get current embedding configuration for provenance tracking | ||
| from ..credential_service import credential_service | ||
|
|
||
| embedding_config = await credential_service.get_credentials_by_category("embedding") | ||
| embedding_provider = embedding_config.get("EMBEDDING_PROVIDER", "openai") | ||
| embedding_model = embedding_config.get("EMBEDDING_MODEL", "text-embedding-3-small") | ||
| embedding_dimensions = int(embedding_config.get("EMBEDDING_DIMENSIONS", "1536")) | ||
|
|
||
| # Get vectorizer settings (placeholder for now, can be enhanced later) | ||
| vectorizer_settings = { | ||
| "use_contextual": False, | ||
| "use_hybrid": False, | ||
| "chunk_size": 5000, | ||
| } | ||
|
|
||
| # Get summarization model from RAG strategy | ||
| rag_settings = await credential_service.get_credentials_by_category("rag_strategy") | ||
| summarization_model = rag_settings.get("MODEL_CHOICE", "gpt-4o-mini") |
There was a problem hiding this comment.
Provenance retrieval failures silently degrade source records to fallback without provenance.
If credential_service.get_credentials_by_category("embedding") throws (e.g., transient network error), the entire primary update_source_info call is skipped and the fallback path (Lines 420–449) creates a bare source record without any provenance fields. This means a single credential-service hiccup loses all provenance for the crawl.
Also, int(embedding_config.get("EMBEDDING_DIMENSIONS", "1536")) on Line 383 will raise ValueError if the config contains a non-numeric string, triggering the same fallback.
Consider isolating provenance retrieval so failures default to None rather than aborting the primary path:
🛡️ Proposed fix — isolate provenance retrieval
try:
- # Get current embedding configuration for provenance tracking
- from ..credential_service import credential_service
-
- embedding_config = await credential_service.get_credentials_by_category("embedding")
- embedding_provider = embedding_config.get("EMBEDDING_PROVIDER", "openai")
- embedding_model = embedding_config.get("EMBEDDING_MODEL", "text-embedding-3-small")
- embedding_dimensions = int(embedding_config.get("EMBEDDING_DIMENSIONS", "1536"))
-
- # Get vectorizer settings (placeholder for now, can be enhanced later)
- vectorizer_settings = {
- "use_contextual": False,
- "use_hybrid": False,
- "chunk_size": 5000,
- }
-
- # Get summarization model from RAG strategy
- rag_settings = await credential_service.get_credentials_by_category("rag_strategy")
- summarization_model = rag_settings.get("MODEL_CHOICE", "gpt-4o-mini")
+ # Get current embedding configuration for provenance tracking
+ from ..credential_service import credential_service
+
+ embedding_provider = None
+ embedding_model = None
+ embedding_dimensions = None
+ vectorizer_settings = None
+ summarization_model = None
+
+ try:
+ embedding_config = await credential_service.get_credentials_by_category("embedding")
+ embedding_provider = embedding_config.get("EMBEDDING_PROVIDER", "openai")
+ embedding_model = embedding_config.get("EMBEDDING_MODEL", "text-embedding-3-small")
+ embedding_dimensions = int(embedding_config.get("EMBEDDING_DIMENSIONS", "1536"))
+ vectorizer_settings = {
+ "use_contextual": False,
+ "use_hybrid": False,
+ "chunk_size": 5000,
+ }
+ rag_settings = await credential_service.get_credentials_by_category("rag_strategy")
+ summarization_model = rag_settings.get("MODEL_CHOICE", "gpt-4o-mini")
+ except Exception as prov_err:
+ logger.warning(
+ f"Failed to retrieve provenance config for '{source_id}': {prov_err}, "
+ "source record will be created without provenance",
+ exc_info=True,
+ )
# Call async update_source_info directly
await update_source_info(🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/services/crawling/document_storage_operations.py` around
lines 376 - 394, The provenance retrieval currently can raise and abort the
primary update_source_info path; wrap the calls to
credential_service.get_credentials_by_category("embedding") and ("rag_strategy")
and the int(...) conversion for embedding_dimensions in a small try/except block
so failures are caught, logged, and the code continues with safe fallback values
(e.g., embedding_provider=None, embedding_model=None, embedding_dimensions=None
or a validated integer default, summarization_model=None) instead of raising;
ensure this block lives just before building
vectorizer_settings/summarization_model (around the
embedding_config/embedding_dimensions and rag_settings retrieval) and that
update_source_info uses those fallback None/validated values so provenance
fields are omitted but the primary path is not aborted.
| if embedding_model: | ||
| upsert_data["embedding_model"] = embedding_model | ||
| if embedding_dimensions: | ||
| upsert_data["embedding_dimensions"] = embedding_dimensions |
There was a problem hiding this comment.
embedding_dimensions=0 would be silently dropped.
if embedding_dimensions: is falsy for 0. While unlikely in practice, the guard is inconsistent with line 304's vectorizer_settings is not None pattern. Use is not None for the integer check.
Proposed fix
- if embedding_dimensions:
+ if embedding_dimensions is not None:Apply the same change at line 381.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/services/source_management_service.py` around lines 298 -
301, The guard `if embedding_dimensions:` drops valid zero values; change it to
`if embedding_dimensions is not None:` when setting
upsert_data["embedding_dimensions"] so 0 is preserved—update the occurrence near
the block that sets upsert_data (the same area that checks embedding_model) and
also apply the identical change at the later occurrence around line ~381 where
embedding_dimensions is checked again (both places in
source_management_service.py that populate upsert_data/vectorizer settings).
…mmarize - Add semaphore-based concurrency limiting (2 concurrent ops max) - Add ProgressTracker integration for real-time progress updates - Return progressId from endpoints for frontend polling - Update frontend to capture progressId and start polling - Add Code Summarization Agent settings section to RAGSettings UI - Provenance display already exists on KnowledgeCard
- Add Code Summary as third tab alongside Chat and Embeddings - Add LuCode icon with orange color for Code Summary tab - Add inline Ollama URL input for Code Summary when Ollama selected - Remove redundant collapsible section that was added incorrectly - Update all context-aware logic to handle code_summarization selection
- Move Code Summary to second row to prevent overflow - Shorten label to 'Summary' - Add code_summarization to Ollama Config panel with model selection - Update Configuration Summary to handle code summarization
- Add Summary as third tab (row 2) with orange color and LuCode icon - Works alongside Chat (row 1) and Embeddings (row 1)
There was a problem hiding this comment.
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
python/src/server/services/storage/code_storage_service.py (2)
1084-1107:⚠️ Potential issue | 🟠 MajorResolve provider before creating the shared LLM client.
generate_code_summaries_batchbuilds a shared client withprovider=None, but_generate_code_example_summary_asynccan resolve a different provider from credentials. That mismatch can break requests.🛠️ Suggested fix
+ if provider is None: + provider = await _get_code_summarization_provider() + # Create a shared LLM client for all summaries (performance optimization) async with get_llm_client(provider=provider) as shared_client:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/storage/code_storage_service.py` around lines 1084 - 1107, The batch routine creates a shared LLM client with provider possibly None which can mismatch what _generate_code_example_summary_async resolves per-item; before calling get_llm_client in generate_code_summaries_batch, resolve the concrete provider to use (e.g., derive from credentials/config) and set the local provider variable, then create the shared_client with that resolved provider and pass that same provider into each _generate_code_example_summary_async call so the client and provider are consistent.
108-137:⚠️ Potential issue | 🟠 MajorDefault model selection should follow the code-summarization provider.
If
CODE_SUMMARIZATION_PROVIDERdiffers from the active LLM provider and no dedicated model is set,_get_model_choicestill chooses defaults for the chat provider, which can produce a provider/model mismatch.🛠️ Suggested adjustment
- provider_config = await credential_service.get_active_provider("llm") - active_provider = provider_config.get("provider", "openai") - model = provider_config.get("chat_model") + provider_config = await credential_service.get_active_provider("llm") + active_provider = provider_config.get("provider", "openai") + model = provider_config.get("chat_model") + code_provider = await _get_code_summarization_provider() + provider_for_defaults = code_provider or active_provider ... - if not model or model.strip() == "": + if not model or model.strip() == "" or provider_for_defaults != active_provider: provider_defaults = { ... } - model = provider_defaults.get(active_provider, "gpt-4o-mini") + model = provider_defaults.get(provider_for_defaults, "gpt-4o-mini")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/storage/code_storage_service.py` around lines 108 - 137, The _get_model_choice function currently uses the active LLM provider to pick provider-specific defaults even when a separate CODE_SUMMARIZATION_PROVIDER is configured; update _get_model_choice to first read CODE_SUMMARIZATION_PROVIDER (via credential_service.get_credential("CODE_SUMMARIZATION_PROVIDER")) and, when no CODE_SUMMARIZATION_MODEL is set, use that provider value (if present and non-empty) as the provider to look up defaults (falling back to provider_config.get("provider", "openai") only if CODE_SUMMARIZATION_PROVIDER is absent/empty); keep the existing provider_defaults map and return/assign model based on the chosen summarization provider so the selected default model matches the code-summarization provider.
🧹 Nitpick comments (2)
archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx (2)
328-387: Consider using Radix Collapsible primitive for the provenance section.The expand/collapse logic is implemented manually with
useState+ conditional rendering (Lines 330–343, 345–385). The coding guidelines recommend using Radix UI primitives fromsrc/features/ui/primitives/, and the ADR-002 implementation status doc itself suggests "Radix Collapsible" for this section. UsingCollapsible.Root/Collapsible.Trigger/Collapsible.Contentwould provide built-in accessibility (aria attributes, keyboard handling) and animation support.As per coding guidelines: "Prefer using Radix UI primitives from src/features/ui/primitives/ when creating UI components."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx` around lines 328 - 387, The provenance expand/collapse uses manual state (showProvenance) and a button click handler in KnowledgeCard (the ChevronRight/Settings2 button and conditional rendering block) which should be replaced with the Radix Collapsible primitive from our UI primitives; import Collapsible.Root, Collapsible.Trigger and Collapsible.Content from src/features/ui/primitives/, wrap the current button markup as Collapsible.Trigger (preserving the ChevronRight rotation via the open state class), move the details panel into Collapsible.Content (remove useState/showProvenance and the onClick handler), and keep all existing item fields (embedding_provider, summarization_model, vectorizer_settings, last_crawled_at, last_vectorized_at) and formatting logic unchanged so accessibility, keyboard handling, and animations are provided by Radix.
190-195:onRevectorizeandonResummarizeare passed unconditionally.Both handlers are always passed to
KnowledgeCardActions, but the child component gates rendering onhasDocuments && onRevectorizeandhasCodeExamples && onResummarize. This means the "Re-vectorize" menu item will appear wheneverdocumentCount > 0regardless of whether re-vectorization makes sense (e.g., the item is currently processing). Not a bug since it works, but consider gating these at the source for clarity — e.g., only passonRevectorizewhen the item isn't already processing.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx` around lines 190 - 195, Only pass the handlers to KnowledgeCardActions when they make sense: conditionally pass onRevectorize and onResummarize instead of unconditionally; e.g., replace the current props so onRevectorize is passed only when documentCount > 0 AND the item is not currently processing (e.g., !isProcessing or !isVectorizing) and similarly pass onResummarize only when codeExamplesCount > 0 AND the item is not processing; use the existing handleRevectorize and handleResummarize functions as the values when those conditions are true.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ADR-002-IMPLEMENTATION-STATUS.md`:
- Around line 96-141: Update the ADR status and checklists to reflect the
frontend and reprocessing work already included in this PR: change "Part 2:
Frontend Implementation" from "⏳ PENDING" to completed and check the items
referencing archon-ui-main/src/features/knowledge/types/knowledge.ts and
archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx (the
KnowledgeSource type and the Processing Details UI are present), and update the
reprocessing section (Part 4) from "❌ Not Started" to completed and check the
items implemented by useRevectorizeKnowledgeItem, useResummarizeKnowledgeItem
hooks, revectorizeKnowledgeItem and resummarizeKnowledgeItem service methods,
the "Re-vectorize"/"Re-summarize" actions in KnowledgeCardActions, and the
"Needs re-vectorization" badge in KnowledgeCard so the document accurately
reflects shipped code.
In `@ADR-002`: Crawl Reliability, Ingestion Quality Control & DB Validation.md:
- Around line 96-103: The ADR phase status is incorrect: Phase 4 is marked "Not
Started" while this PR already adds reprocessing hooks and UI; update ADR-002
and ADR-002-IMPLEMENTATION-STATUS.md to reflect partial implementation by
changing Phase 4 status from "❌ Not Started" to something like "⚠️ Partially
Implemented" and add a short note listing implemented artifacts
(useRevectorizeKnowledgeItem, useResummarizeKnowledgeItem, the corresponding
service methods and UI actions) and any remaining work; ensure the ADR text and
the IMPLEMENTATION-STATUS doc mention which sub-tasks are completed and which
are pending so the docs match the code.
In `@archon-ui-main/src/components/settings/RAGSettings.tsx`:
- Around line 1343-1351: The Summary tab button currently builds Tailwind
classes via a template string using activeSelection === 'code_summarization';
replace this dynamic construction with static class constants and pick the
appropriate constant based on state (e.g. create two predefined className
strings like SUMMARY_ACTIVE_CLASS and SUMMARY_INACTIVE_CLASS and choose one
where the component renders the button), or use a utility like clsx/twMerge to
select between those static constants; update the button rendering in
RAGSettings (the element using activeSelection and the 'code_summarization' key)
to use the selected static className instead of composing classes inline.
- Around line 221-229: The component reads
ragSettings.CODE_SUMMARIZATION_PROVIDER and
ragSettings.CODE_SUMMARIZATION_BASE_URL but RAGSettingsProps (ragSettings)
doesn't declare these fields; update the RAGSettingsProps interface to include
CODE_SUMMARIZATION_PROVIDER: ProviderKey (or optional ProviderKey) and
CODE_SUMMARIZATION_BASE_URL: string (or optional string) so the usages in
showCodeSummarySettings, codeSummaryProvider, and codeSummaryInstanceConfig are
type-safe under strict TS; add them as optional properties if they may be
missing and update any related defaulting logic accordingly.
- Around line 475-486: The component lacks a sync from ragSettings → local code
summary provider state, so add a useEffect that watches
ragSettings.CODE_SUMMARIZATION_PROVIDER and updates the local
codeSummaryProvider (or calls setRagSettings appropriately) to keep the UI from
showing stale defaults; implement it similarly to the existing effect but
inverted: use updateCodeSummaryRagSettingsRef to avoid loops, check if
ragSettings.CODE_SUMMARIZATION_PROVIDER is defined and different from the
current codeSummaryProvider, then call the setter to apply
ragSettings.CODE_SUMMARIZATION_PROVIDER so UI and persisted settings stay in
sync.
In `@archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts`:
- Around line 716-761: The hooks useRevectorizeKnowledgeItem and
useResummarizeKnowledgeItem access properties (documents_updated,
examples_updated, model_used) on the mutation result that are not declared in
the service return types; to fix, either update the service methods
knowledgeService.revectorizeKnowledgeItem and
knowledgeService.resummarizeKnowledgeItem to include those optional fields in
their Promise return types (e.g. add documents_updated?: number,
examples_updated?: number, model_used?: string) or change the onSuccess handlers
in useRevectorizeKnowledgeItem and useResummarizeKnowledgeItem to use
data.message (or guard with data.documents_updated ?? undefined) so TypeScript
strict mode is satisfied and no undeclared property is accessed.
In `@archon-ui-main/src/services/credentialsService.ts`:
- Around line 27-30: RagSettings is missing the CODE_SUMMARIZATION_INSTANCE_NAME
key so values set in RAGSettings.tsx aren't persisted or read back; add
CODE_SUMMARIZATION_INSTANCE_NAME?: string to the RagSettings interface, add a
default entry for CODE_SUMMARIZATION_INSTANCE_NAME in the RagSettings default
values object, and include "CODE_SUMMARIZATION_INSTANCE_NAME" in the
string-fields mapping returned by getRagSettings() so updateRagSettings() and
the loader will persist/return the value (refer to the RagSettings interface,
the defaults object, and getRagSettings() function to locate where to add
these).
In `@migration/0.1.0/003_ollama_add_columns.sql`:
- Around line 33-40: The tracking INSERT into the archon_migrations table
(VALUES ('0.1.0','003_ollama_add_columns') ON CONFLICT DO NOTHING) is executed
after the COMMIT and must be moved inside the same BEGIN…COMMIT transaction that
contains the ALTER TABLE statements so the migration and its tracking record are
atomic; relocate the INSERT so it runs before the COMMIT (within the transaction
block) and keep the existing ON CONFLICT clause and status SELECT unchanged.
In `@migration/0.1.0/004_ollama_migrate_data.sql`:
- Around line 68-75: The tracking INSERT into archon_migrations currently runs
after the transaction COMMIT, leaving DROP COLUMN changes outside the same
atomic transaction; move the INSERT INTO archon_migrations (the statement that
inserts ('0.1.0', '004_ollama_migrate_data') ON CONFLICT DO NOTHING) so it
executes before the COMMIT and inside the same BEGIN…COMMIT block (i.e., place
the INSERT into archon_migrations just prior to the COMMIT in this migration) to
ensure the migration recording is atomic with the schema changes.
In `@migration/0.1.0/005_ollama_create_functions.sql`:
- Around line 170-177: The tracking INSERT into archon_migrations is executed
after the COMMIT, breaking transactional atomicity; move the INSERT ('0.1.0',
'005_ollama_create_functions' ON CONFLICT DO NOTHING) so it runs inside the same
transaction that creates the Ollama functions—i.e., ensure the BEGIN…COMMIT
block that wraps the function creation also contains the INSERT (or wrap both
function creation and the INSERT in a single explicit transaction) so that
either both the functions and the migration record are committed or neither are.
In `@python/src/server/api_routes/knowledge_api.py`:
- Around line 684-723: The route handler revectorize_knowledge_item currently
performs orchestration and Supabase reads/updates; refactor by moving all
orchestration (provider API-key validation via
credential_service.get_active_provider and _validate_provider_api_key, fetching
the item via KnowledgeItemService.get_item, ProgressTracker
initialization/start, and launching the background job
_perform_revectorize_with_progress) into a new service method (e.g.,
KnowledgeItemService.revectorize_with_progress or
KnowledgeOrchestrationService.start_revectorize) that returns the progress_id
and status; then simplify revectorize_knowledge_item to call that service method
and return {"success": True, "progressId": progress_id, "message": ...}. Ensure
the service method uses the same symbols (_perform_revectorize_with_progress,
ProgressTracker, KnowledgeItemService.get_item) and keeps error handling (404
HTTPException) so tests and database access are encapsulated in the service
layer.
- Around line 878-882: The code is validating the active "llm" provider but
resummarize should validate the configured code-summarization provider; replace
the hardcoded "llm" usage with the CODE_SUMMARIZATION_PROVIDER constant so you
fetch the correct provider_config (change the call
credential_service.get_active_provider("llm") to
credential_service.get_active_provider(CODE_SUMMARIZATION_PROVIDER)), then
derive provider from provider_config (provider = provider_config.get("provider",
"<appropriate default>")) and call _validate_provider_api_key(provider) as
before to ensure the code summarization provider's API key is validated.
- Around line 788-848: The code uses embedding_dim (set inside the per-batch
loop when result.embeddings exist) later in the
supabase.table("archon_sources").update call which can raise UnboundLocalError
if no embeddings were returned; to fix, introduce a variable (e.g.,
final_embedding_dim or any name) initialized before the batching loop and update
it when you successfully determine embedding_dim inside the loop (in the block
handling result.embeddings), then use that final variable (or a safe default
like 0 or None) in the archon_sources.update call and only set
embedding_model/provider/vectorizer_settings/last_vectorized_at if embeddings
were produced (or explicitly handle the “no embeddings” case) so archon_sources
update won’t reference an undefined embedding_dim; refer to
create_embeddings_batch, result.embeddings, embedding_dim, and the
supabase.table("archon_sources").update call when making the change.
In `@python/src/server/services/knowledge/knowledge_item_service.py`:
- Around line 384-386: _transform_source_to_item currently reads total words
from source.get("total_words", 0) which misses the DB column "total_word_count";
change the lookup to defensively prefer "total_word_count" with a fallback to
"total_words" (e.g. total_words = source.get("total_word_count",
source.get("total_words", 0))) and then use that total_words variable when
setting "word_count", "estimated_pages", and "pages_tooltip" so single-item
responses reflect the real count.
- Around line 427-462: In _check_nevectorization (async def
_check_needs_revectorization) change the default types passed into
credential_service.get_credential to match stored string formats (e.g., use
"false" for USE_CONTEXTUAL_EMBEDDINGS and "512" for CHUNK_SIZE) and compare
against stored_vectorizer_settings values also read as strings to avoid
type-coercion false positives; also replace the bare except block that silently
returns False with logging the exception (use safe_logfire_error or the project
logger, include source.get("source_id") and exc_info=True) before returning
False so import/credential errors are visible.
---
Outside diff comments:
In `@python/src/server/services/storage/code_storage_service.py`:
- Around line 1084-1107: The batch routine creates a shared LLM client with
provider possibly None which can mismatch what
_generate_code_example_summary_async resolves per-item; before calling
get_llm_client in generate_code_summaries_batch, resolve the concrete provider
to use (e.g., derive from credentials/config) and set the local provider
variable, then create the shared_client with that resolved provider and pass
that same provider into each _generate_code_example_summary_async call so the
client and provider are consistent.
- Around line 108-137: The _get_model_choice function currently uses the active
LLM provider to pick provider-specific defaults even when a separate
CODE_SUMMARIZATION_PROVIDER is configured; update _get_model_choice to first
read CODE_SUMMARIZATION_PROVIDER (via
credential_service.get_credential("CODE_SUMMARIZATION_PROVIDER")) and, when no
CODE_SUMMARIZATION_MODEL is set, use that provider value (if present and
non-empty) as the provider to look up defaults (falling back to
provider_config.get("provider", "openai") only if CODE_SUMMARIZATION_PROVIDER is
absent/empty); keep the existing provider_defaults map and return/assign model
based on the chosen summarization provider so the selected default model matches
the code-summarization provider.
---
Duplicate comments:
In `@ADR-002`: Crawl Reliability, Ingestion Quality Control & DB Validation.md:
- Around line 53-66: The ADR's crawl_url_state table definition is inconsistent
with the codebase: it references knowledge_sources(id) and uses UUID for
source_id, while the actual schema uses archon_sources and source_id is TEXT;
update the CREATE TABLE statement for crawl_url_state to reference
archon_sources (replace REFERENCES knowledge_sources(id)) and change the
source_id column type to TEXT (ensure UNIQUE(source_id, url) still applies) so
the example matches the real schema.
In `@archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx`:
- Around line 362-370: In KnowledgeCard, the expression
item.vectorizer_settings.chunk_size &&
`chunk=${item.vectorizer_settings.chunk_size}` uses && which treats 0 as false
and hides chunk_size=0; replace that conditional with an explicit null/undefined
check (e.g., check chunk_size !== undefined && chunk_size !== null or
Number.isFinite) and render `chunk=${item.vectorizer_settings.chunk_size}` when
the check passes so a value of 0 is shown correctly.
- Around line 372-383: The code renders item.last_crawled_at and
item.last_vectorized_at using format(new Date(...)) which can throw for invalid
date strings; update KnowledgeCard.tsx to guard these calls (same approach used
in the footer) by validating/parsing the date before formatting (e.g., use
date-fns isValid(new Date(...)) or wrap format(new Date(...)) in a try/catch)
and render a safe fallback (empty string or "Invalid date") when the value is
invalid; apply this change to both item.last_crawled_at and
item.last_vectorized_at so format is only called on guaranteed-valid Date
objects.
In `@migration/0.1.0/012_add_crawl_url_state.sql`:
- Around line 43-45: The RLS policy "Service role full access to
crawl_url_state" on archon_crawl_url_state is too permissive; replace USING
(true) WITH CHECK (true) with a service-role-only condition such as USING
(auth.role() = 'service_role') WITH CHECK (auth.role() = 'service_role') so only
the service_role can bypass row restrictions; update the policy definition that
mentions "Service role full access to crawl_url_state" to use this condition.
- Around line 21-22: The migration sets updated_at with a DEFAULT but lacks the
trigger to refresh it on updates; add a BEFORE UPDATE trigger and helper
function that sets NEW.updated_at = now() so updated_at is maintained
automatically. In the same migration (012_add_crawl_url_state.sql) create a
trigger function (e.g., trigger_set_timestamp or set_updated_at) and attach a
BEFORE UPDATE trigger on the table that assigns now() to NEW.updated_at and
returns NEW.
---
Nitpick comments:
In `@archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx`:
- Around line 328-387: The provenance expand/collapse uses manual state
(showProvenance) and a button click handler in KnowledgeCard (the
ChevronRight/Settings2 button and conditional rendering block) which should be
replaced with the Radix Collapsible primitive from our UI primitives; import
Collapsible.Root, Collapsible.Trigger and Collapsible.Content from
src/features/ui/primitives/, wrap the current button markup as
Collapsible.Trigger (preserving the ChevronRight rotation via the open state
class), move the details panel into Collapsible.Content (remove
useState/showProvenance and the onClick handler), and keep all existing item
fields (embedding_provider, summarization_model, vectorizer_settings,
last_crawled_at, last_vectorized_at) and formatting logic unchanged so
accessibility, keyboard handling, and animations are provided by Radix.
- Around line 190-195: Only pass the handlers to KnowledgeCardActions when they
make sense: conditionally pass onRevectorize and onResummarize instead of
unconditionally; e.g., replace the current props so onRevectorize is passed only
when documentCount > 0 AND the item is not currently processing (e.g.,
!isProcessing or !isVectorizing) and similarly pass onResummarize only when
codeExamplesCount > 0 AND the item is not processing; use the existing
handleRevectorize and handleResummarize functions as the values when those
conditions are true.
| ### Frontend Implementation | ||
|
|
||
| **Status:** ⏳ PENDING | ||
|
|
||
| **Files to Modify:** | ||
| 1. ⏳ `archon-ui-main/src/features/knowledge/types/knowledge.ts` | ||
| ```typescript | ||
| export interface KnowledgeSource { | ||
| source_id: string; | ||
| // ... existing fields ... | ||
| embedding_model?: string; | ||
| embedding_dimensions?: number; | ||
| embedding_provider?: string; | ||
| vectorizer_settings?: { | ||
| use_contextual?: boolean; | ||
| use_hybrid?: boolean; | ||
| chunk_size?: number; | ||
| }; | ||
| summarization_model?: string; | ||
| last_crawled_at?: string; | ||
| last_vectorized_at?: string; | ||
| } | ||
| ``` | ||
|
|
||
| 2. ⏳ `archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx` | ||
| - Add expandable "Processing Details" section using Radix Collapsible | ||
| - Display embedding_provider/embedding_model (embedding_dimensions D) | ||
| - Display summarization_model | ||
| - Display formatted last_crawled_at timestamp | ||
| - Use Tron-inspired glassmorphism styling | ||
|
|
||
| **UI Design:** | ||
| ```tsx | ||
| <Collapsible.Root> | ||
| <Collapsible.Trigger className="flex items-center gap-2 text-sm text-gray-400 hover:text-cyan-400"> | ||
| <ChevronRight className="transition-transform" /> | ||
| Processing Details | ||
| </Collapsible.Trigger> | ||
| <Collapsible.Content className="mt-2 text-xs text-gray-400 space-y-1 pl-6"> | ||
| <div>Embeddings: {embedding_provider}/{embedding_model} ({embedding_dimensions}D)</div> | ||
| <div>Summarization: {summarization_model}</div> | ||
| <div>Last crawled: {formatDate(last_crawled_at)}</div> | ||
| </Collapsible.Content> | ||
| </Collapsible.Root> | ||
| ``` | ||
|
|
There was a problem hiding this comment.
Status markers are out of date — frontend implementation is already present in this PR.
The "Part 2: Frontend Implementation" is marked "⏳ PENDING" and the items listed (types in knowledge.ts, provenance display in KnowledgeCard.tsx) are already implemented in this PR. Similarly, the checklist at Lines 274–275 should be checked.
Additionally, Part 4 (Lines 211–256) marks all reprocessing items as "❌ Not Started," but this PR includes:
useRevectorizeKnowledgeItem/useResummarizeKnowledgeItemhooksrevectorizeKnowledgeItem/resummarizeKnowledgeItemservice methods- "Re-vectorize" / "Re-summarize" dropdown actions in
KnowledgeCardActions - "Needs re-vectorization" badge in
KnowledgeCard
Update the status markers and checklists to reflect the code that ships with this PR.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ADR-002-IMPLEMENTATION-STATUS.md` around lines 96 - 141, Update the ADR
status and checklists to reflect the frontend and reprocessing work already
included in this PR: change "Part 2: Frontend Implementation" from "⏳ PENDING"
to completed and check the items referencing
archon-ui-main/src/features/knowledge/types/knowledge.ts and
archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx (the
KnowledgeSource type and the Processing Details UI are present), and update the
reprocessing section (Part 4) from "❌ Not Started" to completed and check the
items implemented by useRevectorizeKnowledgeItem, useResummarizeKnowledgeItem
hooks, revectorizeKnowledgeItem and resummarizeKnowledgeItem service methods,
the "Re-vectorize"/"Re-summarize" actions in KnowledgeCardActions, and the
"Needs re-vectorization" badge in KnowledgeCard so the document accurately
reflects shipped code.
| **Implementation:** | ||
| - Add validation API endpoint | ||
| - Add database validation functions | ||
| - Add MCP validation tool | ||
| - Add frontend validation UI | ||
|
|
||
| **Status:** ❌ Not Started (see ADR-002-IMPLEMENTATION-STATUS.md) | ||
|
|
There was a problem hiding this comment.
Phase 3/4 status markers conflict with implemented code in this PR.
The ADR marks Phase 4 (Reprocessing Tools) as "❌ Not Started," but this PR already includes frontend hooks (useRevectorizeKnowledgeItem, useResummarizeKnowledgeItem), service methods, and UI actions for re-vectorize/re-summarize. The implementation status doc (ADR-002-IMPLEMENTATION-STATUS.md) also marks Phase 4 as not started. Consider updating the status to reflect the partial implementation already present.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ADR-002`: Crawl Reliability, Ingestion Quality Control & DB Validation.md
around lines 96 - 103, The ADR phase status is incorrect: Phase 4 is marked "Not
Started" while this PR already adds reprocessing hooks and UI; update ADR-002
and ADR-002-IMPLEMENTATION-STATUS.md to reflect partial implementation by
changing Phase 4 status from "❌ Not Started" to something like "⚠️ Partially
Implemented" and add a short note listing implemented artifacts
(useRevectorizeKnowledgeItem, useResummarizeKnowledgeItem, the corresponding
service methods and UI actions) and any remaining work; ensure the ADR text and
the IMPLEMENTATION-STATUS doc mention which sub-tasks are completed and which
are pending so the docs match the code.
| // Code Summarization state | ||
| const [showCodeSummarySettings, setShowCodeSummarySettings] = useState(false); | ||
| const [codeSummaryProvider, setCodeSummaryProvider] = useState<ProviderKey>(() => | ||
| (ragSettings.CODE_SUMMARIZATION_PROVIDER as ProviderKey) || 'openai' | ||
| ); | ||
| const [codeSummaryInstanceConfig, setCodeSummaryInstanceConfig] = useState({ | ||
| name: '', | ||
| url: ragSettings.CODE_SUMMARIZATION_BASE_URL || 'http://host.docker.internal:11434/v1' | ||
| }); |
There was a problem hiding this comment.
Extend ragSettings type to include CODE_SUMMARIZATION_* fields.
These new property accesses aren’t declared in RAGSettingsProps, which will fail under strict TypeScript and makes the shape inconsistent.
💡 Suggested interface update
ragSettings: {
...
CODE_EXTRACTION_BATCH_SIZE?: number;
CODE_SUMMARY_MAX_WORKERS?: number;
+ CODE_SUMMARIZATION_BASE_URL?: string;
+ CODE_SUMMARIZATION_INSTANCE_NAME?: string;
+ CODE_SUMMARIZATION_PROVIDER?: ProviderKey;
+ CODE_SUMMARIZATION_MODEL?: string;
};As per coding guidelines, "archon-ui-main/src/**/*.{ts,tsx}: Use TypeScript in strict mode with no implicit any in frontend code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@archon-ui-main/src/components/settings/RAGSettings.tsx` around lines 221 -
229, The component reads ragSettings.CODE_SUMMARIZATION_PROVIDER and
ragSettings.CODE_SUMMARIZATION_BASE_URL but RAGSettingsProps (ragSettings)
doesn't declare these fields; update the RAGSettingsProps interface to include
CODE_SUMMARIZATION_PROVIDER: ProviderKey (or optional ProviderKey) and
CODE_SUMMARIZATION_BASE_URL: string (or optional string) so the usages in
showCodeSummarySettings, codeSummaryProvider, and codeSummaryInstanceConfig are
type-safe under strict TS; add them as optional properties if they may be
missing and update any related defaulting logic accordingly.
| // Update ragSettings when codeSummaryProvider changes | ||
| const updateCodeSummaryRagSettingsRef = useRef(true); | ||
|
|
||
| useEffect(() => { | ||
| if (updateCodeSummaryRagSettingsRef.current && codeSummaryProvider && codeSummaryProvider !== ragSettings.CODE_SUMMARIZATION_PROVIDER) { | ||
| setRagSettings(prev => ({ | ||
| ...prev, | ||
| CODE_SUMMARIZATION_PROVIDER: codeSummaryProvider | ||
| })); | ||
| } | ||
| updateCodeSummaryRagSettingsRef.current = true; | ||
| }, [codeSummaryProvider]); |
There was a problem hiding this comment.
Sync codeSummaryProvider from ragSettings to avoid stale state.
Unlike chat/embedding, there’s no ragSettings → local state sync for code summary. This can leave the UI showing the default provider and overwrite persisted settings on save.
🛠️ Suggested sync effect
+ useEffect(() => {
+ if (
+ ragSettings.CODE_SUMMARIZATION_PROVIDER &&
+ ragSettings.CODE_SUMMARIZATION_PROVIDER !== codeSummaryProvider
+ ) {
+ updateCodeSummaryRagSettingsRef.current = false;
+ setCodeSummaryProvider(ragSettings.CODE_SUMMARIZATION_PROVIDER as ProviderKey);
+ }
+ }, [ragSettings.CODE_SUMMARIZATION_PROVIDER]);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@archon-ui-main/src/components/settings/RAGSettings.tsx` around lines 475 -
486, The component lacks a sync from ragSettings → local code summary provider
state, so add a useEffect that watches ragSettings.CODE_SUMMARIZATION_PROVIDER
and updates the local codeSummaryProvider (or calls setRagSettings
appropriately) to keep the UI from showing stale defaults; implement it
similarly to the existing effect but inverted: use
updateCodeSummaryRagSettingsRef to avoid loops, check if
ragSettings.CODE_SUMMARIZATION_PROVIDER is defined and different from the
current codeSummaryProvider, then call the setter to apply
ragSettings.CODE_SUMMARIZATION_PROVIDER so UI and persisted settings stay in
sync.
| className={`min-w-[180px] px-5 py-3 font-semibold text-white dark:text-white | ||
| border border-orange-400/70 dark:border-orange-400/40 | ||
| bg-black/40 backdrop-blur-md | ||
| shadow-[inset_0_0_16px_rgba(234,88,12,0.38)] | ||
| hover:bg-orange-500/12 dark:hover:bg-orange-500/20 | ||
| hover:border-orange-300/80 hover:shadow-[0_0_24px_rgba(251,146,60,0.52)] | ||
| ${(activeSelection === 'code_summarization') | ||
| ? 'shadow-[0_0_26px_rgba(251,146,60,0.55)] ring-2 ring-orange-400/60' | ||
| : 'shadow-[0_0_15px_rgba(251,146,60,0.25)]'} |
There was a problem hiding this comment.
Avoid dynamic Tailwind class construction for the Summary tab button.
The inline template string builds Tailwind classes dynamically. Prefer static class constants selected by state.
As per coding guidelines, "Do not use dynamic Tailwind class construction in frontend components".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@archon-ui-main/src/components/settings/RAGSettings.tsx` around lines 1343 -
1351, The Summary tab button currently builds Tailwind classes via a template
string using activeSelection === 'code_summarization'; replace this dynamic
construction with static class constants and pick the appropriate constant based
on state (e.g. create two predefined className strings like SUMMARY_ACTIVE_CLASS
and SUMMARY_INACTIVE_CLASS and choose one where the component renders the
button), or use a utility like clsx/twMerge to select between those static
constants; update the button rendering in RAGSettings (the element using
activeSelection and the 'code_summarization' key) to use the selected static
className instead of composing classes inline.
| @router.post("/knowledge-items/{source_id}/revectorize") | ||
| async def revectorize_knowledge_item(source_id: str): | ||
| """Re-generate embeddings for all documents in a knowledge item without re-crawling.""" | ||
| from ..utils.progress.progress_tracker import ProgressTracker | ||
|
|
||
| logger.info(f"🔍 Starting re-vectorize for source_id={source_id}") | ||
|
|
||
| # Generate unique progress ID | ||
| progress_id = str(uuid.uuid4()) | ||
|
|
||
| # Initialize progress tracker | ||
| tracker = ProgressTracker(progress_id, operation_type="revectorize") | ||
|
|
||
| try: | ||
| # Validate API key | ||
| provider_config = await credential_service.get_active_provider("embedding") | ||
| provider = provider_config.get("provider", "openai") | ||
| await _validate_provider_api_key(provider) | ||
|
|
||
| # Get the existing knowledge item | ||
| service = KnowledgeItemService(get_supabase_client()) | ||
| existing_item = await service.get_item(source_id) | ||
|
|
||
| if not existing_item: | ||
| raise HTTPException(status_code=404, detail={"error": f"Knowledge item {source_id} not found"}) | ||
|
|
||
| await tracker.start( | ||
| { | ||
| "status": "starting", | ||
| "progress": 0, | ||
| "log": f"Starting re-vectorization for {existing_item.get('title', source_id)}", | ||
| "documents_total": 0, | ||
| "documents_processed": 0, | ||
| } | ||
| ) | ||
|
|
||
| # Start background task with semaphore | ||
| asyncio.create_task(_perform_revectorize_with_progress(progress_id, source_id, provider, tracker)) | ||
|
|
||
| return {"success": True, "progressId": progress_id, "message": "Re-vectorization started"} |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Route should delegate revectorize/resummarize work to a service layer.
These new endpoints execute Supabase reads/updates directly in api_routes. Please move orchestration into a service class and call it from the route for consistency and testability.
As per coding guidelines, "Follow the API route → Service → Database pattern for new API endpoints".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/api_routes/knowledge_api.py` around lines 684 - 723, The
route handler revectorize_knowledge_item currently performs orchestration and
Supabase reads/updates; refactor by moving all orchestration (provider API-key
validation via credential_service.get_active_provider and
_validate_provider_api_key, fetching the item via KnowledgeItemService.get_item,
ProgressTracker initialization/start, and launching the background job
_perform_revectorize_with_progress) into a new service method (e.g.,
KnowledgeItemService.revectorize_with_progress or
KnowledgeOrchestrationService.start_revectorize) that returns the progress_id
and status; then simplify revectorize_knowledge_item to call that service method
and return {"success": True, "progressId": progress_id, "message": ...}. Ensure
the service method uses the same symbols (_perform_revectorize_with_progress,
ProgressTracker, KnowledgeItemService.get_item) and keeps error handling (404
HTTPException) so tests and database access are encapsulated in the service
layer.
| # Create embeddings | ||
| result = await create_embeddings_batch(contents, provider=provider) | ||
|
|
||
| if result.embeddings: | ||
| # Update documents with new embeddings | ||
| for j, (doc, embedding) in enumerate(zip(batch, result.embeddings, strict=False)): | ||
| doc_id = doc.get("id") | ||
| if not doc_id: | ||
| continue | ||
|
|
||
| # Determine embedding column based on dimension | ||
| embedding_dim = len(embedding) if isinstance(embedding, list) else 0 | ||
| embedding_column = None | ||
| if embedding_dim == 768: | ||
| embedding_column = "embedding_768" | ||
| elif embedding_dim == 1024: | ||
| embedding_column = "embedding_1024" | ||
| elif embedding_dim == 1536: | ||
| embedding_column = "embedding_1536" | ||
| elif embedding_dim == 3072: | ||
| embedding_column = "embedding_3072" | ||
| else: | ||
| errors.append(f"Unsupported dimension {embedding_dim} for doc {doc_id}") | ||
| continue | ||
|
|
||
| try: | ||
| supabase.table("archon_crawled_pages").update( | ||
| { | ||
| embedding_column: embedding, | ||
| "embedding_model": embedding_model, | ||
| "embedding_dimension": embedding_dim, | ||
| } | ||
| ).eq("id", doc_id).execute() | ||
| total_updated += 1 | ||
| except Exception as e: | ||
| errors.append(f"Failed to update doc {doc_id}: {str(e)}") | ||
|
|
||
| # Update progress | ||
| progress = 10 + int((i + len(batch)) / total_docs * 85) | ||
| await tracker.update( | ||
| { | ||
| "status": "processing", | ||
| "progress": progress, | ||
| "log": f"Processed {min(i + len(batch), total_docs)}/{total_docs} documents", | ||
| "documents_total": total_docs, | ||
| "documents_processed": min(i + len(batch), total_docs), | ||
| } | ||
| ) | ||
|
|
||
| # Update source provenance | ||
| supabase.table("archon_sources").update( | ||
| { | ||
| "embedding_model": embedding_model, | ||
| "embedding_dimensions": embedding_dim, | ||
| "embedding_provider": provider, | ||
| "vectorizer_settings": vectorizer_settings, | ||
| "last_vectorized_at": datetime.utcnow().isoformat(), | ||
| "needs_revectorization": False, | ||
| } | ||
| ).eq("id", source_id).execute() | ||
|
|
There was a problem hiding this comment.
Guard against empty embedding batches before updating provenance.
embedding_dim is set inside the loop and later used to update archon_sources. If no embeddings are returned, this raises UnboundLocalError and leaves progress in a failed state.
🛠️ Suggested guard
- # Process documents in batches
+ embedding_dim: int | None = None
+ # Process documents in batches
for i in range(0, len(documents), batch_size):
...
if result.embeddings:
for j, (doc, embedding) in enumerate(zip(batch, result.embeddings, strict=False)):
...
- embedding_dim = len(embedding) if isinstance(embedding, list) else 0
+ embedding_dim = len(embedding) if isinstance(embedding, list) else len(embedding)
...
+
+ if embedding_dim is None:
+ await tracker.error("No embeddings were generated; aborting re-vectorization")
+ return📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Create embeddings | |
| result = await create_embeddings_batch(contents, provider=provider) | |
| if result.embeddings: | |
| # Update documents with new embeddings | |
| for j, (doc, embedding) in enumerate(zip(batch, result.embeddings, strict=False)): | |
| doc_id = doc.get("id") | |
| if not doc_id: | |
| continue | |
| # Determine embedding column based on dimension | |
| embedding_dim = len(embedding) if isinstance(embedding, list) else 0 | |
| embedding_column = None | |
| if embedding_dim == 768: | |
| embedding_column = "embedding_768" | |
| elif embedding_dim == 1024: | |
| embedding_column = "embedding_1024" | |
| elif embedding_dim == 1536: | |
| embedding_column = "embedding_1536" | |
| elif embedding_dim == 3072: | |
| embedding_column = "embedding_3072" | |
| else: | |
| errors.append(f"Unsupported dimension {embedding_dim} for doc {doc_id}") | |
| continue | |
| try: | |
| supabase.table("archon_crawled_pages").update( | |
| { | |
| embedding_column: embedding, | |
| "embedding_model": embedding_model, | |
| "embedding_dimension": embedding_dim, | |
| } | |
| ).eq("id", doc_id).execute() | |
| total_updated += 1 | |
| except Exception as e: | |
| errors.append(f"Failed to update doc {doc_id}: {str(e)}") | |
| # Update progress | |
| progress = 10 + int((i + len(batch)) / total_docs * 85) | |
| await tracker.update( | |
| { | |
| "status": "processing", | |
| "progress": progress, | |
| "log": f"Processed {min(i + len(batch), total_docs)}/{total_docs} documents", | |
| "documents_total": total_docs, | |
| "documents_processed": min(i + len(batch), total_docs), | |
| } | |
| ) | |
| # Update source provenance | |
| supabase.table("archon_sources").update( | |
| { | |
| "embedding_model": embedding_model, | |
| "embedding_dimensions": embedding_dim, | |
| "embedding_provider": provider, | |
| "vectorizer_settings": vectorizer_settings, | |
| "last_vectorized_at": datetime.utcnow().isoformat(), | |
| "needs_revectorization": False, | |
| } | |
| ).eq("id", source_id).execute() | |
| # Initialize before loop to prevent UnboundLocalError | |
| embedding_dim: int | None = None | |
| # Process documents in batches | |
| for i in range(0, len(documents), batch_size): | |
| ... | |
| # Create embeddings | |
| result = await create_embeddings_batch(contents, provider=provider) | |
| if result.embeddings: | |
| # Update documents with new embeddings | |
| for j, (doc, embedding) in enumerate(zip(batch, result.embeddings, strict=False)): | |
| doc_id = doc.get("id") | |
| if not doc_id: | |
| continue | |
| # Determine embedding column based on dimension | |
| embedding_dim = len(embedding) if isinstance(embedding, list) else 0 | |
| embedding_column = None | |
| if embedding_dim == 768: | |
| embedding_column = "embedding_768" | |
| elif embedding_dim == 1024: | |
| embedding_column = "embedding_1024" | |
| elif embedding_dim == 1536: | |
| embedding_column = "embedding_1536" | |
| elif embedding_dim == 3072: | |
| embedding_column = "embedding_3072" | |
| else: | |
| errors.append(f"Unsupported dimension {embedding_dim} for doc {doc_id}") | |
| continue | |
| try: | |
| supabase.table("archon_crawled_pages").update( | |
| { | |
| embedding_column: embedding, | |
| "embedding_model": embedding_model, | |
| "embedding_dimension": embedding_dim, | |
| } | |
| ).eq("id", doc_id).execute() | |
| total_updated += 1 | |
| except Exception as e: | |
| errors.append(f"Failed to update doc {doc_id}: {str(e)}") | |
| # Update progress | |
| progress = 10 + int((i + len(batch)) / total_docs * 85) | |
| await tracker.update( | |
| { | |
| "status": "processing", | |
| "progress": progress, | |
| "log": f"Processed {min(i + len(batch), total_docs)}/{total_docs} documents", | |
| "documents_total": total_docs, | |
| "documents_processed": min(i + len(batch), total_docs), | |
| } | |
| ) | |
| # Guard against empty embedding batches before updating provenance | |
| if embedding_dim is None: | |
| await tracker.error("No embeddings were generated; aborting re-vectorization") | |
| return | |
| # Update source provenance | |
| supabase.table("archon_sources").update( | |
| { | |
| "embedding_model": embedding_model, | |
| "embedding_dimensions": embedding_dim, | |
| "embedding_provider": provider, | |
| "vectorizer_settings": vectorizer_settings, | |
| "last_vectorized_at": datetime.utcnow().isoformat(), | |
| "needs_revectorization": False, | |
| } | |
| ).eq("id", source_id).execute() |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/api_routes/knowledge_api.py` around lines 788 - 848, The
code uses embedding_dim (set inside the per-batch loop when result.embeddings
exist) later in the supabase.table("archon_sources").update call which can raise
UnboundLocalError if no embeddings were returned; to fix, introduce a variable
(e.g., final_embedding_dim or any name) initialized before the batching loop and
update it when you successfully determine embedding_dim inside the loop (in the
block handling result.embeddings), then use that final variable (or a safe
default like 0 or None) in the archon_sources.update call and only set
embedding_model/provider/vectorizer_settings/last_vectorized_at if embeddings
were produced (or explicitly handle the “no embeddings” case) so archon_sources
update won’t reference an undefined embedding_dim; refer to
create_embeddings_batch, result.embeddings, embedding_dim, and the
supabase.table("archon_sources").update call when making the change.
| # Validate API key (uses LLM provider for summarization) | ||
| provider_config = await credential_service.get_active_provider("llm") | ||
| provider = provider_config.get("provider", "openai") | ||
| await _validate_provider_api_key(provider) | ||
|
|
There was a problem hiding this comment.
Validate the code-summarization provider instead of the chat provider.
resummarize validates the active LLM provider, but summarization may use CODE_SUMMARIZATION_PROVIDER. If they differ, validation passes and the background task fails later.
🛠️ Suggested fix
- provider_config = await credential_service.get_active_provider("llm")
- provider = provider_config.get("provider", "openai")
+ from ..services.storage.code_storage_service import _get_code_summarization_provider
+ provider = await _get_code_summarization_provider()
await _validate_provider_api_key(provider)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/api_routes/knowledge_api.py` around lines 878 - 882, The
code is validating the active "llm" provider but resummarize should validate the
configured code-summarization provider; replace the hardcoded "llm" usage with
the CODE_SUMMARIZATION_PROVIDER constant so you fetch the correct
provider_config (change the call credential_service.get_active_provider("llm")
to credential_service.get_active_provider(CODE_SUMMARIZATION_PROVIDER)), then
derive provider from provider_config (provider = provider_config.get("provider",
"<appropriate default>")) and call _validate_provider_api_key(provider) as
before to ensure the code summarization provider's API key is validated.
| "word_count": source.get("total_words", 0), | ||
| "estimated_pages": round( | ||
| source.get("total_words", 0) / 250, 1 | ||
| ), # Average book page = 250 words | ||
| "estimated_pages": round(source.get("total_words", 0) / 250, 1), # Average book page = 250 words | ||
| "pages_tooltip": f"{round(source.get('total_words', 0) / 250, 1)} pages (≈ {source.get('total_words', 0):,} words)", |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for total_word occurrences in Python files
echo "=== Python files with total_word references ==="
rg -n "total_word" --type py -C 2
echo -e "\n=== SQL files with total_word references ==="
rg -n "total_word" --type sql -C 2Repository: coleam00/Archon
Length of output: 13365
🏁 Script executed:
#!/bin/bash
# Find the knowledge_item_service.py file and examine the specific lines
fd "knowledge_item_service.py" --type f -x wc -l {}
echo "=== Context around lines 170-180 (list_items area) ==="
fd "knowledge_item_service.py" --type f -x sed -n '170,180p' {}
echo -e "\n=== Context around lines 310-320 (get_available_sources area) ==="
fd "knowledge_item_service.py" --type f -x sed -n '310,320p' {}
echo -e "\n=== Context around lines 382-390 (_transform_source_to_item area) ==="
fd "knowledge_item_service.py" --type f -x sed -n '382,390p' {}Repository: coleam00/Archon
Length of output: 2862
total_words field lookup will always return 0 if the source contains the actual database column total_word_count
_transform_source_to_item uses source.get("total_words", 0) (lines 384–386), but all data producers create the "total_word_count" key per the database schema. The unchanged list_items path (line 173) correctly uses source.get("total_word_count", 0), and get_available_sources (line 312) defensively falls back to total_word_count. This method does not—meaning word_count and estimated_pages will silently be 0 in single-item responses.
🐛 Proposed fix to align with the defensive pattern in `get_available_sources`
- "word_count": source.get("total_words", 0),
- "estimated_pages": round(source.get("total_words", 0) / 250, 1), # Average book page = 250 words
- "pages_tooltip": f"{round(source.get('total_words', 0) / 250, 1)} pages (≈ {source.get('total_words', 0):,} words)",
+ "word_count": source.get("total_words", source.get("total_word_count", 0)),
+ "estimated_pages": round(source.get("total_words", source.get("total_word_count", 0)) / 250, 1),
+ "pages_tooltip": f"{round(source.get('total_words', source.get('total_word_count', 0)) / 250, 1)} pages (≈ {source.get('total_words', source.get('total_word_count', 0)):,} words)",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "word_count": source.get("total_words", 0), | |
| "estimated_pages": round( | |
| source.get("total_words", 0) / 250, 1 | |
| ), # Average book page = 250 words | |
| "estimated_pages": round(source.get("total_words", 0) / 250, 1), # Average book page = 250 words | |
| "pages_tooltip": f"{round(source.get('total_words', 0) / 250, 1)} pages (≈ {source.get('total_words', 0):,} words)", | |
| "word_count": source.get("total_words", source.get("total_word_count", 0)), | |
| "estimated_pages": round(source.get("total_words", source.get("total_word_count", 0)) / 250, 1), | |
| "pages_tooltip": f"{round(source.get('total_words', source.get('total_word_count', 0)) / 250, 1)} pages (≈ {source.get('total_words', source.get('total_word_count', 0)):,} words)", |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/services/knowledge/knowledge_item_service.py` around lines
384 - 386, _transform_source_to_item currently reads total words from
source.get("total_words", 0) which misses the DB column "total_word_count";
change the lookup to defensively prefer "total_word_count" with a fallback to
"total_words" (e.g. total_words = source.get("total_word_count",
source.get("total_words", 0))) and then use that total_words variable when
setting "word_count", "estimated_pages", and "pages_tooltip" so single-item
responses reflect the real count.
| async def _check_needs_revectorization(self, source: dict[str, Any]) -> bool: | ||
| """Check if re-vectorization is needed by comparing current settings with stored provenance.""" | ||
| try: | ||
| from ..credential_service import credential_service | ||
|
|
||
| stored_embedding_model = source.get("embedding_model") | ||
| stored_embedding_provider = source.get("embedding_provider") | ||
| stored_vectorizer_settings = source.get("vectorizer_settings") or {} | ||
|
|
||
| if not stored_embedding_model: | ||
| return False | ||
|
|
||
| current_embedding_model = await credential_service.get_credential("EMBEDDING_MODEL") | ||
| current_embedding_provider_config = await credential_service.get_active_provider("embedding") | ||
| current_embedding_provider = current_embedding_provider_config.get("provider", "openai") | ||
|
|
||
| if current_embedding_model and stored_embedding_model != current_embedding_model: | ||
| return True | ||
|
|
||
| if stored_embedding_provider and stored_embedding_provider != current_embedding_provider: | ||
| return True | ||
|
|
||
| current_use_contextual = await credential_service.get_credential("USE_CONTEXTUAL_EMBEDDINGS", False) | ||
| stored_use_contextual = stored_vectorizer_settings.get("use_contextual", False) | ||
| if current_use_contextual != stored_use_contextual: | ||
| return True | ||
|
|
||
| current_chunk_size = await credential_service.get_credential("CHUNK_SIZE", 512) | ||
| stored_chunk_size = stored_vectorizer_settings.get("chunk_size", 512) | ||
| if current_chunk_size != stored_chunk_size: | ||
| return True | ||
|
|
||
| return False | ||
|
|
||
| except Exception: | ||
| return False |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
fd credential_service -t fRepository: coleam00/Archon
Length of output: 153
🏁 Script executed:
find . -name "*credential*service*" -type fRepository: coleam00/Archon
Length of output: 157
🏁 Script executed:
rg "def get_credential" -A 10 -B 2Repository: coleam00/Archon
Length of output: 6762
🏁 Script executed:
rg "async def get_credential" -A 15 -B 2Repository: coleam00/Archon
Length of output: 9540
🏁 Script executed:
cat -n python/src/server/services/credential_service.py | head -100Repository: coleam00/Archon
Length of output: 4162
🏁 Script executed:
rg "load_all_credentials" -A 30 python/src/server/services/credential_service.pyRepository: coleam00/Archon
Length of output: 5482
🏁 Script executed:
rg "USE_CONTEXTUAL_EMBEDDINGS|CHUNK_SIZE" -B 2 -A 2Repository: coleam00/Archon
Length of output: 12416
🏁 Script executed:
head -50 python/src/server/services/knowledge/knowledge_item_service.pyRepository: coleam00/Archon
Length of output: 1517
🏁 Script executed:
rg "logfire|logger" python/src/server/services/knowledge/knowledge_item_service.py | head -20Repository: coleam00/Archon
Length of output: 1251
_check_needs_revectorization: Type coercion false positives and silent exception handler
Two correctness issues:
1. Type mismatch causes false positives (lines 449–457)
Credentials are stored in the database as strings ('false', '512'), but the comparison defaults are native Python types. When a credential is missing from the database, get_credential() returns the default (False bool or 512 int), which will never equal the string value stored in the database or vectorizer_settings:
# If CHUNK_SIZE is in DB: '512' (str) != 512 (int default) → True (incorrectly flagged)
# If CHUNK_SIZE is missing: 512 (int default) != '512' (str from vectorizer_settings) → True (incorrectly flagged)Pass string defaults to match the storage format:
current_use_contextual = await credential_service.get_credential("USE_CONTEXTUAL_EMBEDDINGS", "false")
stored_use_contextual = stored_vectorizer_settings.get("use_contextual", "false")
if current_use_contextual != stored_use_contextual:
return True
current_chunk_size = await credential_service.get_credential("CHUNK_SIZE", "512")
stored_chunk_size = stored_vectorizer_settings.get("chunk_size", "512")
if current_chunk_size != stored_chunk_size:
return True2. Silent except Exception: return False violates logging guidelines (lines 461–462)
Swallowing all exceptions (including ImportError from the deferred import and credential service failures) means errors are invisible. Log the error before returning:
except Exception as e:
safe_logfire_error(
f"Failed to check revectorization needs | source_id={source.get('source_id')} | error={str(e)}",
exc_info=True,
)
return False🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/services/knowledge/knowledge_item_service.py` around lines
427 - 462, In _check_nevectorization (async def _check_needs_revectorization)
change the default types passed into credential_service.get_credential to match
stored string formats (e.g., use "false" for USE_CONTEXTUAL_EMBEDDINGS and "512"
for CHUNK_SIZE) and compare against stored_vectorizer_settings values also read
as strings to avoid type-coercion false positives; also replace the bare except
block that silently returns False with logging the exception (use
safe_logfire_error or the project logger, include source.get("source_id") and
exc_info=True) before returning False so import/credential errors are visible.
BREAKING CHANGE / HEAVILY IN DEVELOPMENT - NOT FOR PRODUCTION USE This commit adds a new restartable RAG ingestion pipeline with the following features: - New database tables for state tracking: archon_document_blobs, archon_chunks, archon_embedding_sets, archon_embeddings, archon_summaries - Each pipeline stage (download, chunk, embed, summarize) is now separable - Full metadata tracking: embedder_id, embedder_version, embedder_config for embeddings; summarizer_model_id, prompt_hash, style for summaries - Checkpointing: each stage has explicit states (pending/in_progress/done/failed) - Support for multiple embedders and summarization styles per source - Health check utilities for validating pipeline state New services: - services/ingestion/ingestion_state_service.py - State management - services/ingestion/pipeline_orchestrator.py - Main orchestration - services/ingestion/embedding_worker.py - Async embedding processor - services/ingestion/summary_worker.py - Async summarization processor - services/ingestion/health_check.py - Health validation Migration: migration/0.1.0/014_add_pipeline_tables.sql Documentation: docs/ADRs/001-restartable-rag-pipeline.md The old pipeline remains unchanged for backwards compatibility.
There was a problem hiding this comment.
Actionable comments posted: 14
🧹 Nitpick comments (10)
python/src/server/services/ingestion/summary_worker.py (2)
112-119: Only first 3 content parts used — most source content silently dropped for multi-blob sources.
content_parts[:3]limits summarization input to chunks from (at most) 3 blobs. For sources with many documents, the summary will be based on a small, possibly unrepresentative subset. This is a significant design decision that should at least be documented or logged.Consider logging a warning when content is truncated so operators can tune the limit.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/summary_worker.py` around lines 112 - 119, The current loop collects chunk contents into content_parts but then restricts input to content_parts[:3], silently dropping most content; update the logic in summary_worker.py to use all gathered parts (remove the hard slice) or implement a deterministic selection (e.g., sample/first N across blobs) instead of content_parts[:3], and add a warning log when you truncate combined_content to 25000 characters so operators know data was dropped; refer to content_parts, content_parts[:3], combined_content, and the async get_chunks_by_blob call on state_service to locate and change the code and add the logging.
100-100: Missing type annotation forsummaryparameter.Similar to the
EmbeddingWorkerissue,summaryshould be typed asSummary.- async def _process_summary(self, summary) -> bool: + async def _process_summary(self, summary: Summary) -> bool:As per coding guidelines,
python/src/**/*.py: "Use Mypy for type checking in Python to ensure type safety".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/summary_worker.py` at line 100, The _process_summary method signature lacks a type annotation for the summary parameter; update the signature of async def _process_summary(self, summary) -> bool to annotate summary as Summary (e.g., async def _process_summary(self, summary: Summary) -> bool), and add or ensure the appropriate import for Summary is present at the top of the file (matching the Summary class/type used elsewhere) so Mypy can validate it.python/src/server/services/ingestion/embedding_worker.py (3)
117-128:retry_failed_embeddingsbypassesstate_servicefor the query, inconsistent with the rest of the class.All other methods delegate DB operations to
self.state_service, but this method directly queriesself.supabase. Consider adding a method toIngestionStateService(e.g.,get_failed_embedding_sets) and using it here for consistency. Also, ifupdate_embedding_set_statusfails mid-loop, there's no error handling — partial resets could go unnoticed.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/embedding_worker.py` around lines 117 - 128, The method retry_failed_embeddings directly queries self.supabase instead of using the IngestionStateService like the rest of the class; add a method on IngestionStateService (e.g., get_failed_embedding_sets or similar) that returns failed embedding set rows and use it inside retry_failed_embeddings, then iterate those rows and call self.state_service.update_embedding_set_status(uuid.UUID(row["id"]), EmbeddingStatus.PENDING) for each; also wrap the per-row update in try/except so failures are logged/collected and do not silently leave partial resets (return counts of successfully reset items and any failures).
67-67: Missing type annotation forembedding_setparameter.Should be typed as
EmbeddingSetfor clarity and MyPy compliance.- async def _process_embedding_set(self, embedding_set, provider: str | None = None) -> bool: + async def _process_embedding_set(self, embedding_set: EmbeddingSet, provider: str | None = None) -> bool:This requires importing
EmbeddingSetfromingestion_state_service.As per coding guidelines,
python/src/**/*.py: "Use Mypy for type checking in Python to ensure type safety".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/embedding_worker.py` at line 67, Update the method signature of _process_embedding_set to type the first parameter as EmbeddingSet (e.g., async def _process_embedding_set(self, embedding_set: EmbeddingSet, provider: str | None = None) -> bool) and add an import for EmbeddingSet from ingestion_state_service at the top of the file; this aligns with MyPy and the project's typing conventions and keeps the existing provider and return types unchanged.
56-63: Error logging missing stack traces and context per coding guidelines.The
exceptblock logs the error string but doesn't preserve the full stack trace. Per coding guidelines, errors should includeexc_info=Trueand context about what was being attempted.Proposed fix
except Exception as e: - safe_logfire_error(f"Error processing embedding set {embedding_set.id}: {e}") + logger.error( + f"Error processing embedding set {embedding_set.id}: {e}", + exc_info=True, + ) await self.state_service.update_embedding_set_status(The same applies to line 109.
As per coding guidelines,
python/src/**/*.py: "Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/embedding_worker.py` around lines 56 - 63, The except block in embedding_worker.py is only logging str(e) and missing stack traces and operation context; update the safe_logfire_error call(s) (the ones around processing an embedding set and the other exception later in this file) to include exc_info=True and a clearer message describing the attempted operation (e.g., "Error processing embedding set {embedding_set.id} while [operation]"). Also ensure the error_info passed to state_service.update_embedding_set_status (and related calls) includes the same contextual fields (operation/stage and full error string) so full traceability is preserved; reference safe_logfire_error, update_embedding_set_status, and EmbeddingStatus when making the changes.python/src/server/services/ingestion/ingestion_state_service.py (2)
381-394: Large unbounded batch insert for embeddings may hit payload size limits.
store_embeddingsinserts all chunk embeddings in a single request. For large documents, this could mean thousands of rows each containing high-dimensional float vectors (e.g., 1536 floats). PostgREST and Supabase have request payload size limits that this could exceed.Consider batching the inserts (e.g., in groups of 100–500 records).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/ingestion_state_service.py` around lines 381 - 394, The store_embeddings method currently inserts all records at once which can exceed Supabase/PostgREST payload limits; change store_embeddings to split chunk_embeddings into batches (e.g., batch_size = 100–500), build records per batch (same shape as current records), and call self.supabase.table("archon_embeddings").insert(batch_records).execute() for each batch (awaiting each call), summing the number of inserted rows from each response.data to return the total; ensure you handle empty/None response.data and surface/raise errors if any batch insert fails.
217-217: Token count estimation is a rough heuristic.
len(content.split()) * 4 // 3approximates token count from word count. This can be significantly off for code, URLs, or non-English text. Consider documenting this is an estimate, or using a proper tokenizer if accuracy matters for downstream decisions.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/ingestion_state_service.py` at line 217, The current token estimation in ingestion_state_service.py uses a rough heuristic ("token_count": len(content.split()) * 4 // 3) which is unreliable for code, URLs, or non‑English text; update the implementation in the function/method where this appears (search for token_count assignment in IngestionStateService or related functions) to either (A) replace the heuristic with a proper tokenizer (e.g., use tiktoken or the project's canonical tokenizer to compute token count from content) or (B) if you must keep a fast heuristic, add a clear inline comment and change the field name or metadata (e.g., token_count_estimate) and document in the service/class docstring that this is an estimate and may be inaccurate for certain content types so downstream logic can treat it accordingly.python/src/server/services/ingestion/health_check.py (1)
69-69: Magic number heuristic for expected chunk count is fragile and confusing.
sum(1 for _ in blobs) * 10is equivalent tolen(blobs) * 10. The hardcoded10assumes each blob produces exactly 10 chunks, which is unlikely to hold. This makes the "low chunk count" warning unreliable — it could be a false positive or miss real issues depending on document sizes.Consider either removing this heuristic, making the multiplier configurable, or computing expected counts from actual blob content lengths and chunk sizes.
- total_expected_chunks = sum(1 for _ in blobs) * 10 + total_expected_chunks = len(blobs) * 10 # TODO: derive from actual content length / chunk_size🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/health_check.py` at line 69, The line computing total_expected_chunks using a magic multiplier is fragile; replace sum(1 for _ in blobs) * 10 with a robust approach in health_check.py: either (A) compute expected chunks per blob from actual blob content lengths (e.g., use each blob's byte/character length and the chunk size constant to calculate chunks as ceil(length / CHUNK_SIZE) and sum those) or (B) make the multiplier configurable by introducing an EXPECTED_CHUNKS_PER_BLOB constant/setting and use len(blobs) * EXPECTED_CHUNKS_PER_BLOB; update references to total_expected_chunks accordingly and ensure you use the blob attribute that holds length (e.g., blob.size or blob.length) and the existing CHUNK_SIZE constant if present.python/src/server/services/ingestion/pipeline_orchestrator.py (1)
105-110: Progress callback valuemin(50, total_chunks)doesn't behave like a percentage.
min(50, total_chunks)clamps the progress value at 50, which appears intended as a percentage (chunking = first 50%). But oncetotal_chunks > 50, progress stays stuck at 50 for every subsequent document. Consider computing actual progress as a proportion of total documents processed.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/pipeline_orchestrator.py` around lines 105 - 110, The progress value is clamped with min(50, total_chunks) which isn't a percentage; compute a proportional percentage for the "chunking" phase instead. Replace the hard clamp with something like chunk_percent = int(50 * processed_chunks / max(1, total_chunks)) (or if you track processed_documents use processed_documents in place of processed_chunks), then call progress_callback("chunking", chunk_percent, f"Processed {total_blobs} documents, {total_chunks} chunks") so the progress moves from 0→50 as chunking progresses; reference progress_callback, total_chunks and total_blobs (and processed_chunks/processed_documents variable you already maintain) to locate and apply the change.migration/0.1.0/014_add_pipeline_tables.sql (1)
23-24:updated_atcolumns have no auto-update trigger
archon_document_blobs.updated_atandarchon_embedding_sets.updated_atare initialized toNOW()at INSERT but will silently stale on UPDATE unless the application always passes the column explicitly. Consider adding a trigger:CREATE OR REPLACE FUNCTION set_updated_at() RETURNS TRIGGER LANGUAGE plpgsql AS $$ BEGIN NEW.updated_at = NOW(); RETURN NEW; END; $$; CREATE TRIGGER trg_archon_document_blobs_updated_at BEFORE UPDATE ON archon_document_blobs FOR EACH ROW EXECUTE FUNCTION set_updated_at(); -- repeat for archon_embedding_sets, archon_summaries🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@migration/0.1.0/014_add_pipeline_tables.sql` around lines 23 - 24, The migration creates updated_at columns that default to NOW() but never auto-update on UPDATE; add a plpgsql trigger function (e.g. set_updated_at) that sets NEW.updated_at = NOW() and attach BEFORE UPDATE triggers (e.g. trg_archon_document_blobs_updated_at, trg_archon_embedding_sets_updated_at, trg_archon_summaries_updated_at) to the tables archon_document_blobs, archon_embedding_sets, and archon_summaries so updated_at is refreshed automatically on row updates.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/ADRs/001-restartable-rag-pipeline.md`:
- Line 3: Update the ADR status from "Proposed" to "Accepted" (or "Superseded")
and add a short note in the ADR body stating that the migration
014_add_pipeline_tables.sql and accompanying services are being shipped in this
PR and therefore the decision is implemented; also add a line that ADR-002
supersedes this ADR (or mark this ADR as superseded by ADR-002) so the document
accurately reflects current implementation and provenance.
In `@migration/0.1.0/014_add_pipeline_tables.sql`:
- Around line 77-88: The archon_embeddings table currently hardcodes vector
dimensionality with "vector VECTOR(1536)" which breaks multi-embedder support
and contradicts archon_embedding_sets.embedding_dimension; update the schema so
the embedding vector column is dimension-agnostic (use a dimensionless VECTOR
column) or implement one of the suggested alternatives (per-dimension tables or
restrict system to 1536 and remove embedding_dimension) — specifically modify
the archon_embeddings table definition (the vector column) and ensure any
related index/constraints referencing archon_embedding_sets.embedding_dimension
are adjusted to match the chosen approach so different embedding dimensions can
be stored without insert failures.
- Line 67: The UNIQUE constraints (UNIQUE(source_id, embedder_id,
embedder_version) and UNIQUE(source_id, summarizer_model_id, prompt_hash,
style)) are ineffective when embedder_version or prompt_hash can be NULL; change
them to enforce uniqueness by either making those columns NOT NULL with a
sentinel default (e.g., set embedder_version and prompt_hash to NOT NULL DEFAULT
'' and adjust any inserts) or replace the UNIQUE constraints with explicit
unique indexes using COALESCE (for example create unique index on (source_id,
embedder_id, COALESCE(embedder_version, '')) and similarly on (source_id,
summarizer_model_id, COALESCE(prompt_hash, ''), style)) so NULLs are treated as
the sentinel and true uniqueness is enforced.
- Around line 100-107: The archon_summaries table schema is incorrect for
pipeline usage: change summary_content to allow inserts before content is
generated (make summary_content nullable or add DEFAULT '') and add DEFAULT
NOW() to updated_at to match other tables; update the CREATE TABLE definition
for archon_summaries so the column summary_content is not NOT NULL (or has
DEFAULT '') and updated_at TIMESTAMPTZ has DEFAULT NOW() to avoid NOT NULL
violations when inserting queued/pending summaries.
- Around line 47-48: The index creation for idx_archon_chunks_source_id is
invalid because archon_chunks has no source_id column and PostgreSQL forbids
listing a column both as key and in INCLUDE; remove or replace this statement:
drop the CREATE INDEX IF NOT EXISTS idx_archon_chunks_source_id ON
archon_chunks(blob_id, source_id) INCLUDE (source_id) and either delete it
entirely or create/ensure only idx_archon_chunks_blob_id exists to support
queries that must follow blob_id → archon_document_blobs.source_id (join through
archon_document_blobs to access source_id) so migrations run without error.
In `@python/src/server/services/ingestion/embedding_worker.py`:
- Around line 89-95: The loop is using zip(..., strict=False) which can silently
misalign chunk_ids and result.embeddings and is coupled with an unused
enumerate; change the logic in embedding_worker.py to first verify lengths match
and skip the entire batch on mismatch (log via safe_logfire_error with details)
to avoid storing corrupted data, then iterate with a plain zip(chunk_ids,
result.embeddings) (no strict flag) and append only non-empty embeddings to
successful_embeddings; remove the unused enumerate/_i and ensure you return or
continue when lengths differ so no partial/misaligned pairs are persisted.
In `@python/src/server/services/ingestion/health_check.py`:
- Around line 177-196: The check_all_sources method currently calls
check_source_health for each source without per-source error handling, so a
raised exception aborts the whole batch; update check_all_sources to wrap the
await self.check_source_health(source["source_id"]) call in a try/except, catch
Exception as e, and for failures append a result dict for that source (include
source_id, healthy: False, and an error field with the exception message or
traceback) so processing continues for remaining sources and the final returned
structure (total_sources, healthy_sources, unhealthy_sources, results) reflects
both successes and detailed failures; keep existing variables (sources_response,
results, healthy_count) and ensure unhealthy_sources calculation uses the
appended failure entries.
In `@python/src/server/services/ingestion/ingestion_state_service.py`:
- Line 164: Replace the generic raises in create_document_blob,
create_embedding_set, and create_summary with a more specific exception that
includes context and relevant IDs (e.g., raise IngestionError(f"Failed to create
document blob for source_id={source_id} blob_id={blob_id}: {e}") from e); if
IngestionError doesn't exist add a simple custom subclass of Exception
(IngestionError) near other service errors and import it, and ensure you chain
the original exception (using "from e") or include str(e) so the full stack
trace and underlying error are preserved in the logs and error flow.
- Around line 494-508: In update_source_pipeline_status, avoid the
redundant/unwanted overwrite of pipeline_error and use the same None-check
pattern as elsewhere: change the initial check to if error_info is not None:
update_data["pipeline_error"] = error_info, and remove the unconditional
assignment inside the status == "error" branch so pipeline_error is only written
when a non-None error_info is provided; keep the pipeline_completed_at
assignment for status == "complete" and then call the supabase update with
update_data as before.
In `@python/src/server/services/ingestion/pipeline_orchestrator.py`:
- Around line 96-100: The current start_offset logic in create_chunks (used
after smart_chunk_text_async) incorrectly assumes chunks are contiguous;
instead, compute accurate offsets by finding each chunk's substring position in
the original content: iterate chunks and call original_content.find(chunk,
search_start), set start_offset to the found index, set search_start =
found_index + len(chunk) to handle repeated content, and if find returns -1
(trimmed/modified chunk or overlap) fall back to a sentinel (e.g., None or -1)
or skip offset assignment; update create_chunks (and any callers that rely on
start_offsets) to use these real positions or remove/mark offsets as approximate
if you prefer not to compute them.
- Around line 144-178: The _queue_embedding function currently logs failures and
returns None (violating the "never return None to indicate failure" rule);
change it to raise a descriptive exception instead of returning None so failures
propagate to the pipeline error handler; capture the original exception from the
try block, call safe_logfire_error with context, then re-raise (or raise a new
exception that wraps the original) so callers get a raised error; apply the same
change to _queue_summary; reference functions: _queue_embedding, _queue_summary,
and state_service.create_embedding_set (and the equivalent create_summary_set)
when making the change.
In `@python/src/server/services/ingestion/summary_worker.py`:
- Around line 149-178: The _generate_summary function currently calls
get_llm_client() with no provider which can route requests to the wrong backend
for the given model_id; update _generate_summary to resolve the provider (e.g.,
via a helper like resolve_provider_from_model(model_id) or by reading the
source's/summary record config) and pass that provider into
get_llm_client(provider=resolved_provider) before creating the chat completion;
keep the rest of the logic (prompt selection via SUMMARY_PROMPTS/SummaryStyle
and response handling with extract_message_text) the same.
- Line 156: The current conversion SummaryStyle(style) can raise ValueError for
unknown strings causing the fallback in SUMMARY_PROMPTS.get to never run; change
the logic around prompt_template to safely map style to a SummaryStyle by
attempting conversion in a try/except (catch ValueError) or checking membership
(e.g., SummaryStyle.__members__) and default to SummaryStyle.OVERVIEW on
failure, then use that safe enum value to index SUMMARY_PROMPTS (i.e., ensure
you compute a safe_style variable from style before doing
SUMMARY_PROMPTS[safe_style] and assign it to prompt_template).
- Around line 81-96: The code calls
self.state_service.update_summary(summary.id, ...) without guarding for
summary.id possibly being None (Summary.id can be None); before calling
update_summary in the exception handler (and other places around lines handling
failures), add a guard like in EmbeddingWorker: if summary.id is None then
safe_logfire_error a clear message including summary (or summary.external_id),
increment results["failed"], and skip calling update_summary/update DB (do not
pass None to update_summary); otherwise call
self.state_service.update_summary(str(summary.id), SummaryStatus.FAILED,
error_info=...) as before. Ensure the same guard is applied wherever
update_summary is invoked for summary.id (including after _process_summary
failures).
---
Nitpick comments:
In `@migration/0.1.0/014_add_pipeline_tables.sql`:
- Around line 23-24: The migration creates updated_at columns that default to
NOW() but never auto-update on UPDATE; add a plpgsql trigger function (e.g.
set_updated_at) that sets NEW.updated_at = NOW() and attach BEFORE UPDATE
triggers (e.g. trg_archon_document_blobs_updated_at,
trg_archon_embedding_sets_updated_at, trg_archon_summaries_updated_at) to the
tables archon_document_blobs, archon_embedding_sets, and archon_summaries so
updated_at is refreshed automatically on row updates.
In `@python/src/server/services/ingestion/embedding_worker.py`:
- Around line 117-128: The method retry_failed_embeddings directly queries
self.supabase instead of using the IngestionStateService like the rest of the
class; add a method on IngestionStateService (e.g., get_failed_embedding_sets or
similar) that returns failed embedding set rows and use it inside
retry_failed_embeddings, then iterate those rows and call
self.state_service.update_embedding_set_status(uuid.UUID(row["id"]),
EmbeddingStatus.PENDING) for each; also wrap the per-row update in try/except so
failures are logged/collected and do not silently leave partial resets (return
counts of successfully reset items and any failures).
- Line 67: Update the method signature of _process_embedding_set to type the
first parameter as EmbeddingSet (e.g., async def _process_embedding_set(self,
embedding_set: EmbeddingSet, provider: str | None = None) -> bool) and add an
import for EmbeddingSet from ingestion_state_service at the top of the file;
this aligns with MyPy and the project's typing conventions and keeps the
existing provider and return types unchanged.
- Around line 56-63: The except block in embedding_worker.py is only logging
str(e) and missing stack traces and operation context; update the
safe_logfire_error call(s) (the ones around processing an embedding set and the
other exception later in this file) to include exc_info=True and a clearer
message describing the attempted operation (e.g., "Error processing embedding
set {embedding_set.id} while [operation]"). Also ensure the error_info passed to
state_service.update_embedding_set_status (and related calls) includes the same
contextual fields (operation/stage and full error string) so full traceability
is preserved; reference safe_logfire_error, update_embedding_set_status, and
EmbeddingStatus when making the changes.
In `@python/src/server/services/ingestion/health_check.py`:
- Line 69: The line computing total_expected_chunks using a magic multiplier is
fragile; replace sum(1 for _ in blobs) * 10 with a robust approach in
health_check.py: either (A) compute expected chunks per blob from actual blob
content lengths (e.g., use each blob's byte/character length and the chunk size
constant to calculate chunks as ceil(length / CHUNK_SIZE) and sum those) or (B)
make the multiplier configurable by introducing an EXPECTED_CHUNKS_PER_BLOB
constant/setting and use len(blobs) * EXPECTED_CHUNKS_PER_BLOB; update
references to total_expected_chunks accordingly and ensure you use the blob
attribute that holds length (e.g., blob.size or blob.length) and the existing
CHUNK_SIZE constant if present.
In `@python/src/server/services/ingestion/ingestion_state_service.py`:
- Around line 381-394: The store_embeddings method currently inserts all records
at once which can exceed Supabase/PostgREST payload limits; change
store_embeddings to split chunk_embeddings into batches (e.g., batch_size =
100–500), build records per batch (same shape as current records), and call
self.supabase.table("archon_embeddings").insert(batch_records).execute() for
each batch (awaiting each call), summing the number of inserted rows from each
response.data to return the total; ensure you handle empty/None response.data
and surface/raise errors if any batch insert fails.
- Line 217: The current token estimation in ingestion_state_service.py uses a
rough heuristic ("token_count": len(content.split()) * 4 // 3) which is
unreliable for code, URLs, or non‑English text; update the implementation in the
function/method where this appears (search for token_count assignment in
IngestionStateService or related functions) to either (A) replace the heuristic
with a proper tokenizer (e.g., use tiktoken or the project's canonical tokenizer
to compute token count from content) or (B) if you must keep a fast heuristic,
add a clear inline comment and change the field name or metadata (e.g.,
token_count_estimate) and document in the service/class docstring that this is
an estimate and may be inaccurate for certain content types so downstream logic
can treat it accordingly.
In `@python/src/server/services/ingestion/pipeline_orchestrator.py`:
- Around line 105-110: The progress value is clamped with min(50, total_chunks)
which isn't a percentage; compute a proportional percentage for the "chunking"
phase instead. Replace the hard clamp with something like chunk_percent = int(50
* processed_chunks / max(1, total_chunks)) (or if you track processed_documents
use processed_documents in place of processed_chunks), then call
progress_callback("chunking", chunk_percent, f"Processed {total_blobs}
documents, {total_chunks} chunks") so the progress moves from 0→50 as chunking
progresses; reference progress_callback, total_chunks and total_blobs (and
processed_chunks/processed_documents variable you already maintain) to locate
and apply the change.
In `@python/src/server/services/ingestion/summary_worker.py`:
- Around line 112-119: The current loop collects chunk contents into
content_parts but then restricts input to content_parts[:3], silently dropping
most content; update the logic in summary_worker.py to use all gathered parts
(remove the hard slice) or implement a deterministic selection (e.g.,
sample/first N across blobs) instead of content_parts[:3], and add a warning log
when you truncate combined_content to 25000 characters so operators know data
was dropped; refer to content_parts, content_parts[:3], combined_content, and
the async get_chunks_by_blob call on state_service to locate and change the code
and add the logging.
- Line 100: The _process_summary method signature lacks a type annotation for
the summary parameter; update the signature of async def _process_summary(self,
summary) -> bool to annotate summary as Summary (e.g., async def
_process_summary(self, summary: Summary) -> bool), and add or ensure the
appropriate import for Summary is present at the top of the file (matching the
Summary class/type used elsewhere) so Mypy can validate it.
| @@ -0,0 +1,75 @@ | |||
| # ADR-001: Restartable RAG Ingestion Pipeline | |||
|
|
|||
| ## Status: Proposed | |||
There was a problem hiding this comment.
ADR status should reflect that implementation is in progress
The status Proposed is inconsistent — the corresponding migration (014_add_pipeline_tables.sql) and services are being shipped in this very PR. Standard ADR conventions expect the status to be updated to Accepted (or Superseded) before the decision is implemented. Additionally, per the PR context, ADR-002 supersedes this document, which should be noted here.
📝 Suggested update
-## Status: Proposed
+## Status: Accepted (Superseded by ADR-002)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ## Status: Proposed | |
| ## Status: Accepted (Superseded by ADR-002) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/ADRs/001-restartable-rag-pipeline.md` at line 3, Update the ADR status
from "Proposed" to "Accepted" (or "Superseded") and add a short note in the ADR
body stating that the migration 014_add_pipeline_tables.sql and accompanying
services are being shipped in this PR and therefore the decision is implemented;
also add a line that ADR-002 supersedes this ADR (or mark this ADR as superseded
by ADR-002) so the document accurately reflects current implementation and
provenance.
| CREATE INDEX IF NOT EXISTS idx_archon_chunks_source_id ON archon_chunks(blob_id, source_id) | ||
| INCLUDE (source_id); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
fd 'migration' --type dRepository: coleam00/Archon
Length of output: 118
🏁 Script executed:
fd '014_add_pipeline_tables.sql' -type fRepository: coleam00/Archon
Length of output: 229
🏁 Script executed:
find . -path "*/migration/*" -name "*.sql" | head -20Repository: coleam00/Archon
Length of output: 943
🏁 Script executed:
cat -n migration/0.1.0/014_add_pipeline_tables.sql | head -60Repository: coleam00/Archon
Length of output: 3203
🏁 Script executed:
# Check if there are any other references to source_id in the migration file
cat -n migration/0.1.0/014_add_pipeline_tables.sql | grep -i source_idRepository: coleam00/Archon
Length of output: 904
🏁 Script executed:
# Verify the complete archon_chunks table definition to ensure no source_id exists
sed -n '34,44p' migration/0.1.0/014_add_pipeline_tables.sqlRepository: coleam00/Archon
Length of output: 457
🏁 Script executed:
# Verify PostgreSQL behavior: check if referencing a non-existent column in an index causes an error
# Also check if having a column in both key and INCLUDE is valid
grep -A5 "INCLUDE" migration/0.1.0/014_add_pipeline_tables.sqlRepository: coleam00/Archon
Length of output: 296
🏁 Script executed:
# Check if this migration has been run or if there are any other migrations that might add source_id to archon_chunks
grep -r "source_id" migration/0.1.0/ | grep -i "archon_chunks"Repository: coleam00/Archon
Length of output: 195
🏁 Script executed:
# Verify PostgreSQL index rules by checking documentation or testing the syntax
# Let's also check complete_setup.sql to see how the base schema might define things
head -250 migration/complete_setup.sql | tail -100 | grep -A10 -B10 "archon_chunks"Repository: coleam00/Archon
Length of output: 41
🏁 Script executed:
# Check if there are any ALTER TABLE statements that might add source_id to archon_chunks
grep -r "ALTER TABLE archon_chunks" migration/Repository: coleam00/Archon
Length of output: 41
Migration will fail: source_id does not exist in archon_chunks
The archon_chunks table (lines 34–44) contains only id, blob_id, chunk_index, start_offset, end_offset, content, token_count, created_at, and a UNIQUE(blob_id, chunk_index) constraint. The index definition on lines 47–48 references source_id, which does not exist in this table. PostgreSQL will raise an error during migration execution: ERROR: column "source_id" does not exist.
Additionally, if source_id existed, PostgreSQL would reject having it appear in both the key list and the INCLUDE clause (columns in INCLUDE cannot also be key columns).
To reach source_id, queries must traverse through blob_id → archon_document_blobs.source_id. Drop this index or replace it with the already-defined idx_archon_chunks_blob_id if the access pattern can be satisfied by that index alone.
Proposed fix
-CREATE INDEX IF NOT EXISTS idx_archon_chunks_source_id ON archon_chunks(blob_id, source_id)
- INCLUDE (source_id);
+-- blob_id-only index (idx_archon_chunks_blob_id above) already covers this access pattern.
+-- No additional index needed here.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@migration/0.1.0/014_add_pipeline_tables.sql` around lines 47 - 48, The index
creation for idx_archon_chunks_source_id is invalid because archon_chunks has no
source_id column and PostgreSQL forbids listing a column both as key and in
INCLUDE; remove or replace this statement: drop the CREATE INDEX IF NOT EXISTS
idx_archon_chunks_source_id ON archon_chunks(blob_id, source_id) INCLUDE
(source_id) and either delete it entirely or create/ensure only
idx_archon_chunks_blob_id exists to support queries that must follow blob_id →
archon_document_blobs.source_id (join through archon_document_blobs to access
source_id) so migrations run without error.
| total_chunk_count INTEGER DEFAULT 0, | ||
| created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | ||
| updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | ||
| UNIQUE(source_id, embedder_id, embedder_version) |
There was a problem hiding this comment.
UNIQUE constraints with nullable columns won't prevent duplicates when the nullable part is NULL
In PostgreSQL, NULL values are considered distinct from each other in B-tree unique indexes. This means:
UNIQUE(source_id, embedder_id, embedder_version)(line 67): two rows with the samesource_id/embedder_idbutembedder_version = NULLwill not violate the constraint.UNIQUE(source_id, summarizer_model_id, prompt_hash, style)(line 106): same issue with nullableprompt_hash.
If these columns being NULL represents a meaningful "no version" or "no prompt hash" state that should still be unique, use COALESCE in a partial unique index or replace nullability with a sentinel value (e.g., '').
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@migration/0.1.0/014_add_pipeline_tables.sql` at line 67, The UNIQUE
constraints (UNIQUE(source_id, embedder_id, embedder_version) and
UNIQUE(source_id, summarizer_model_id, prompt_hash, style)) are ineffective when
embedder_version or prompt_hash can be NULL; change them to enforce uniqueness
by either making those columns NOT NULL with a sentinel default (e.g., set
embedder_version and prompt_hash to NOT NULL DEFAULT '' and adjust any inserts)
or replace the UNIQUE constraints with explicit unique indexes using COALESCE
(for example create unique index on (source_id, embedder_id,
COALESCE(embedder_version, '')) and similarly on (source_id,
summarizer_model_id, COALESCE(prompt_hash, ''), style)) so NULLs are treated as
the sentinel and true uniqueness is enforced.
| CREATE TABLE IF NOT EXISTS archon_embeddings ( | ||
| id UUID PRIMARY KEY DEFAULT gen_random_uuid(), | ||
| chunk_id UUID NOT NULL REFERENCES archon_chunks(id) ON DELETE CASCADE, | ||
| embedding_set_id UUID NOT NULL REFERENCES archon_embedding_sets(id) ON DELETE CASCADE, | ||
| vector VECTOR(1536), | ||
| created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | ||
| UNIQUE(chunk_id, embedding_set_id) | ||
| ); | ||
|
|
||
| CREATE INDEX IF NOT EXISTS idx_archon_embeddings_chunk_id ON archon_embeddings(chunk_id); | ||
| CREATE INDEX IF NOT EXISTS idx_archon_embeddings_set_id ON archon_embeddings(embedding_set_id); | ||
|
|
There was a problem hiding this comment.
Hardcoded VECTOR(1536) prevents multi-embedder support
The stated design goal is "multiple embedders can coexist for the same source", and archon_embedding_sets.embedding_dimension exists to track the actual dimensionality per embedder. However, archon_embeddings.vector VECTOR(1536) enforces a fixed size — any embedder that produces a different dimension (e.g., nomic-embed-text at 768, text-embedding-3-large at 3072) will fail on INSERT.
Options to consider:
- Drop the dimension constraint entirely:
vector VECTOR(pgvector supports dimensionless columns, though this sacrifices type safety and requires different IVFFlat/HNSW index strategies). - Use a separate
archon_embeddings_{dim}table per dimension family, referenced fromarchon_embedding_sets. - Constrain the system to 1536-dim only for now and remove
embedding_dimensionfromarchon_embedding_setsto avoid the false impression of multi-dimension support.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@migration/0.1.0/014_add_pipeline_tables.sql` around lines 77 - 88, The
archon_embeddings table currently hardcodes vector dimensionality with "vector
VECTOR(1536)" which breaks multi-embedder support and contradicts
archon_embedding_sets.embedding_dimension; update the schema so the embedding
vector column is dimension-agnostic (use a dimensionless VECTOR column) or
implement one of the suggested alternatives (per-dimension tables or restrict
system to 1536 and remove embedding_dimension) — specifically modify the
archon_embeddings table definition (the vector column) and ensure any related
index/constraints referencing archon_embedding_sets.embedding_dimension are
adjusted to match the chosen approach so different embedding dimensions can be
stored without insert failures.
| status TEXT NOT NULL DEFAULT 'pending' | ||
| CHECK (status IN ('pending', 'in_progress', 'done', 'failed')), | ||
| error_info JSONB, | ||
| summary_content TEXT NOT NULL, | ||
| updated_at TIMESTAMPTZ NOT NULL, | ||
| created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | ||
| UNIQUE(source_id, summarizer_model_id, prompt_hash, style) | ||
| ); |
There was a problem hiding this comment.
Two schema bugs in archon_summaries prevent correct pipeline usage
1. summary_content TEXT NOT NULL — The pipeline creates summary records at pending status before content is available. Without a default or nullable summary_content, an INSERT for a queued-but-not-yet-generated summary will fail. Either make the column nullable or add DEFAULT ''.
2. updated_at TIMESTAMPTZ NOT NULL missing DEFAULT NOW() — This is inconsistent with archon_document_blobs (line 24) and archon_embedding_sets (line 66), which both have DEFAULT NOW(). Any INSERT that omits updated_at will raise a NOT NULL violation.
🐛 Proposed fix
- summary_content TEXT NOT NULL,
- updated_at TIMESTAMPTZ NOT NULL,
+ summary_content TEXT,
+ updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| status TEXT NOT NULL DEFAULT 'pending' | |
| CHECK (status IN ('pending', 'in_progress', 'done', 'failed')), | |
| error_info JSONB, | |
| summary_content TEXT NOT NULL, | |
| updated_at TIMESTAMPTZ NOT NULL, | |
| created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | |
| UNIQUE(source_id, summarizer_model_id, prompt_hash, style) | |
| ); | |
| status TEXT NOT NULL DEFAULT 'pending' | |
| CHECK (status IN ('pending', 'in_progress', 'done', 'failed')), | |
| error_info JSONB, | |
| summary_content TEXT, | |
| updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | |
| created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | |
| UNIQUE(source_id, summarizer_model_id, prompt_hash, style) | |
| ); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@migration/0.1.0/014_add_pipeline_tables.sql` around lines 100 - 107, The
archon_summaries table schema is incorrect for pipeline usage: change
summary_content to allow inserts before content is generated (make
summary_content nullable or add DEFAULT '') and add DEFAULT NOW() to updated_at
to match other tables; update the CREATE TABLE definition for archon_summaries
so the column summary_content is not NOT NULL (or has DEFAULT '') and updated_at
TIMESTAMPTZ has DEFAULT NOW() to avoid NOT NULL violations when inserting
queued/pending summaries.
| start_offsets = [] | ||
| current_offset = 0 | ||
| for chunk in chunks: | ||
| start_offsets.append(current_offset) | ||
| current_offset += len(chunk) |
There was a problem hiding this comment.
Start offsets assume chunks are contiguous substrings of the original content, which may not hold.
The offset calculation sums chunk lengths sequentially, but smart_chunk_text_async may produce overlapping chunks, chunks with trimmed whitespace, or chunks that don't partition the original text exactly. This would make start_offset and end_offset (computed in create_chunks) incorrect and misleading for consumers.
If offsets need to be accurate, compute them by searching for each chunk's position in the original content. Otherwise, consider documenting that offsets are approximate, or not computing them at all.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/services/ingestion/pipeline_orchestrator.py` around lines
96 - 100, The current start_offset logic in create_chunks (used after
smart_chunk_text_async) incorrectly assumes chunks are contiguous; instead,
compute accurate offsets by finding each chunk's substring position in the
original content: iterate chunks and call original_content.find(chunk,
search_start), set start_offset to the found index, set search_start =
found_index + len(chunk) to handle repeated content, and if find returns -1
(trimmed/modified chunk or overlap) fall back to a sentinel (e.g., None or -1)
or skip offset assignment; update create_chunks (and any callers that rely on
start_offsets) to use these real positions or remove/mark offsets as approximate
if you prefer not to compute them.
| async def _queue_embedding( | ||
| self, | ||
| source_id: str, | ||
| total_chunks: int, | ||
| embedder_id: str | None, | ||
| ): | ||
| try: | ||
| rag_settings = await credential_service.get_credentials_by_category("rag_strategy") | ||
| embedding_provider = rag_settings.get("EMBEDDING_PROVIDER", "openai") | ||
|
|
||
| if not embedder_id: | ||
| embedder_id = await get_embedding_model(provider=embedding_provider) | ||
|
|
||
| embedding_dimensions = int(rag_settings.get("EMBEDDING_DIMENSIONS", "1536")) | ||
|
|
||
| embedding_config = { | ||
| "provider": embedding_provider, | ||
| "dimensions": embedding_dimensions, | ||
| } | ||
|
|
||
| embedding_set = await self.state_service.create_embedding_set( | ||
| source_id=source_id, | ||
| embedder_id=embedder_id, | ||
| embedder_version=None, | ||
| embedder_config=embedding_config, | ||
| total_chunk_count=total_chunks, | ||
| embedding_dimension=embedding_dimensions, | ||
| ) | ||
|
|
||
| safe_logfire_info(f"Created embedding set {embedding_set.id} for source {source_id}") | ||
| return embedding_set | ||
|
|
||
| except Exception as e: | ||
| safe_logfire_error(f"Failed to queue embedding: {e}") | ||
| return None |
There was a problem hiding this comment.
_queue_embedding returns None on failure, violating the "never return None to indicate failure" guideline.
When _queue_embedding fails, it logs and returns None. The caller at line 131 silently accepts this (str(embedding_set.id) if embedding_set else None), meaning the pipeline reports success with embedding_set_id: None — the user has no indication that embedding queueing failed entirely.
Consider either raising the exception (to trigger the pipeline error handler at line 136) or returning an explicit error result that is surfaced in the response.
The same applies to _queue_summary (lines 180–206).
As per coding guidelines, python/src/**/*.py: "never return None to indicate failure".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/services/ingestion/pipeline_orchestrator.py` around lines
144 - 178, The _queue_embedding function currently logs failures and returns
None (violating the "never return None to indicate failure" rule); change it to
raise a descriptive exception instead of returning None so failures propagate to
the pipeline error handler; capture the original exception from the try block,
call safe_logfire_error with context, then re-raise (or raise a new exception
that wraps the original) so callers get a raised error; apply the same change to
_queue_summary; reference functions: _queue_embedding, _queue_summary, and
state_service.create_embedding_set (and the equivalent create_summary_set) when
making the change.
| for summary in pending[:max_batch_size]: | ||
| try: | ||
| success = await self._process_summary(summary) | ||
| if success: | ||
| results["processed"] += 1 | ||
| results["summaries_processed"].append(str(summary.id)) | ||
| else: | ||
| results["failed"] += 1 | ||
| except Exception as e: | ||
| safe_logfire_error(f"Error processing summary {summary.id}: {e}") | ||
| await self.state_service.update_summary( | ||
| summary.id, | ||
| SummaryStatus.FAILED, | ||
| error_info={"error": str(e), "stage": "summary_processing"}, | ||
| ) | ||
| results["failed"] += 1 |
There was a problem hiding this comment.
summary.id may be None — no guard before passing to update_summary.
The Summary dataclass has id: uuid.UUID | None = None. If summary.id is None, it's passed directly to self.state_service.update_summary(summary.id, ...) at line 91, which calls .eq("id", str(summary_id)) — str(None) becomes "None", a silent no-op update. The same issue exists at line 101 and 129.
Add a guard similar to what EmbeddingWorker does at line 46:
for summary in pending[:max_batch_size]:
+ if not summary.id:
+ results["failed"] += 1
+ continue
try:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/services/ingestion/summary_worker.py` around lines 81 - 96,
The code calls self.state_service.update_summary(summary.id, ...) without
guarding for summary.id possibly being None (Summary.id can be None); before
calling update_summary in the exception handler (and other places around lines
handling failures), add a guard like in EmbeddingWorker: if summary.id is None
then safe_logfire_error a clear message including summary (or
summary.external_id), increment results["failed"], and skip calling
update_summary/update DB (do not pass None to update_summary); otherwise call
self.state_service.update_summary(str(summary.id), SummaryStatus.FAILED,
error_info=...) as before. Ensure the same guard is applied wherever
update_summary is invoked for summary.id (including after _process_summary
failures).
| async def _generate_summary( | ||
| self, | ||
| source_id: str, | ||
| content: str, | ||
| model_id: str, | ||
| style: str, | ||
| ) -> str: | ||
| prompt_template = SUMMARY_PROMPTS.get(SummaryStyle(style), SUMMARY_PROMPTS[SummaryStyle.OVERVIEW]) | ||
| prompt = prompt_template.format(content=content, source_id=source_id) | ||
|
|
||
| async with get_llm_client() as client: | ||
| response = await client.chat.completions.create( | ||
| model=model_id, | ||
| messages=[ | ||
| { | ||
| "role": "system", | ||
| "content": "You are a helpful assistant that provides concise library/tool/framework summaries.", | ||
| }, | ||
| {"role": "user", "content": prompt}, | ||
| ], | ||
| ) | ||
|
|
||
| if not response or not response.choices: | ||
| raise Exception("Empty response from LLM") | ||
|
|
||
| summary_text, _, _ = extract_message_text(response.choices[0]) | ||
| if not summary_text: | ||
| raise Exception("LLM returned empty content") | ||
|
|
||
| return summary_text.strip() |
There was a problem hiding this comment.
get_llm_client() called without specifying a provider — may use wrong backend for the given model_id.
_generate_summary receives a model_id that may belong to a specific provider (e.g., an Anthropic model), but get_llm_client() is called with no provider argument and defaults to whatever is globally configured. This could send the request to the wrong provider, causing failures or unexpected behavior.
Consider resolving the provider from model_id or from the summary record's associated configuration.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/services/ingestion/summary_worker.py` around lines 149 -
178, The _generate_summary function currently calls get_llm_client() with no
provider which can route requests to the wrong backend for the given model_id;
update _generate_summary to resolve the provider (e.g., via a helper like
resolve_provider_from_model(model_id) or by reading the source's/summary record
config) and pass that provider into get_llm_client(provider=resolved_provider)
before creating the chat completion; keep the rest of the logic (prompt
selection via SUMMARY_PROMPTS/SummaryStyle and response handling with
extract_message_text) the same.
| model_id: str, | ||
| style: str, | ||
| ) -> str: | ||
| prompt_template = SUMMARY_PROMPTS.get(SummaryStyle(style), SUMMARY_PROMPTS[SummaryStyle.OVERVIEW]) |
There was a problem hiding this comment.
SummaryStyle(style) will raise ValueError for unrecognized style strings.
If style contains a value not in the SummaryStyle enum, this line raises an unhandled ValueError. The fallback to SUMMARY_PROMPTS[SummaryStyle.OVERVIEW] in the .get() never triggers because the exception occurs in the dict key construction.
Proposed fix
- prompt_template = SUMMARY_PROMPTS.get(SummaryStyle(style), SUMMARY_PROMPTS[SummaryStyle.OVERVIEW])
+ try:
+ style_enum = SummaryStyle(style)
+ except ValueError:
+ logger.warning(f"Unknown summary style '{style}', falling back to OVERVIEW")
+ style_enum = SummaryStyle.OVERVIEW
+ prompt_template = SUMMARY_PROMPTS[style_enum]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/services/ingestion/summary_worker.py` at line 156, The
current conversion SummaryStyle(style) can raise ValueError for unknown strings
causing the fallback in SUMMARY_PROMPTS.get to never run; change the logic
around prompt_template to safely map style to a SummaryStyle by attempting
conversion in a try/except (catch ValueError) or checking membership (e.g.,
SummaryStyle.__members__) and default to SummaryStyle.OVERVIEW on failure, then
use that safe enum value to index SUMMARY_PROMPTS (i.e., ensure you compute a
safe_style variable from style before doing SUMMARY_PROMPTS[safe_style] and
assign it to prompt_template).
Completes the restartable RAG pipeline by adding worker invocation
mechanisms and integrating with the existing crawl service.
New Components:
- ingestion_api.py: 6 API endpoints for worker control and health checks
- POST /api/ingestion/process-embeddings
- POST /api/ingestion/process-summaries
- GET /api/ingestion/health/{source_id}
- GET /api/ingestion/health
- POST /api/ingestion/retry-failed-embeddings
- POST /api/ingestion/retry-failed-summaries
Crawl Integration:
- Added use_new_pipeline parameter to KnowledgeItemRequest
- Integrated pipeline orchestrator in document_storage_operations.py
- New pipeline creates blobs, chunks, and queues async jobs
Bug Fixes:
- Fixed PostgREST query syntax in get_chunks_by_source()
- Added batching for large blob queries (URI length limit)
- Added url_to_full_document for backward compatibility
Testing:
- Manual testing completed (see TEST_RESULTS.md)
- Pipeline successfully creates 275 blobs, chunks, and embedding sets
- Worker infrastructure validated (picks up jobs, processes, handles errors)
- All core components proven working
Status: Architecture validated, ready for production testing.
Config issue: embedding dimension mismatch (easily resolved).
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 18
🧹 Nitpick comments (5)
test_new_pipeline.md (1)
337-337: Minor wording nit: "fix" → "resolve" for formal documentation tone.-**Solution:** Check error_info in database, fix issue, use retry endpoint. +**Solution:** Check error_info in database, resolve the issue, then use the retry endpoint.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test_new_pipeline.md` at line 337, Update the wording in the sentence that reads "Check error_info in database, fix issue, use retry endpoint." to use a more formal tone by replacing "fix" with "resolve" so it reads "Check error_info in database, resolve issue, use retry endpoint."; locate and edit the exact fragment in the document (the line containing "Check error_info in database, fix issue, use retry endpoint.") to make this single-word substitution.python/src/server/services/crawling/document_storage_operations.py (2)
635-635: Line exceeds 120-character limit.This line is well over 120 characters. Consider breaking it up for readability and guideline compliance.
♻️ Suggested formatting
- response = self.supabase_client.table("archon_sources").select("source_id").eq("source_id", source_id).execute() + response = ( + self.supabase_client.table("archon_sources") + .select("source_id") + .eq("source_id", source_id) + .execute() + )As per coding guidelines: "Use Python 3.12 with 120 character line length in backend code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/crawling/document_storage_operations.py` at line 635, The line calling self.supabase_client.table("archon_sources").select("source_id").eq("source_id", source_id).execute() exceeds 120 chars; refactor it by splitting the chained call across multiple lines (e.g., assign the table call or the query to an intermediate variable, or use parentheses to break after .table(...).select(...) and .eq(...)) so the resulting assignment to response in the DocumentStorageOperations method remains under 120 characters while preserving the same call chain and behavior.
561-562:safe_logfire_errorfor "no valid documents" is misleading — this is a normal edge case, not an error.When all crawl results have empty content, this is an expected condition (e.g., failed fetches), not a system error. Using
error-level logging here will generate noise in error monitoring.♻️ Suggested change
if not documents: - safe_logfire_error(f"No valid documents to process | source_id={original_source_id}") + safe_logfire_info(f"No valid documents to process — skipping pipeline | source_id={original_source_id}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/crawling/document_storage_operations.py` around lines 561 - 562, Replace the error-level logging call for the empty-documents edge case with a non-error level and clearer message: in the block checking "if not documents:" swap safe_logfire_error(...) for a safe_logfire_info(...) or safe_logfire_warning(...) call and update the message to indicate this is an expected/no-content result (include original_source_id for context). This change should be made where the variable original_source_id is available so the log still records the source but no longer floods error monitoring.python/src/server/api_routes/ingestion_api.py (1)
18-42: POST trigger endpoints expose control parameters as URL query strings rather than a request body.In FastAPI, bare scalar parameters on a
POSThandler resolve to query parameters (e.g.POST /process-embeddings?max_batch_size=20&embedder_id=foo). Request bodies are better suited for sending larger amounts of structured data to your API. Using Pydantic request models would give you schema-validated JSON bodies, keep parameters out of server logs/URLs, and produce cleaner OpenAPI docs.♻️ Example refactor for process-embeddings
+from pydantic import BaseModel + +class ProcessEmbeddingsRequest(BaseModel): + max_batch_size: int = Field(default=10, ge=1) + embedder_id: str | None = None + provider: str | None = None + `@router.post`("/process-embeddings") async def process_pending_embeddings( - max_batch_size: int = 10, - embedder_id: str | None = None, - provider: str | None = None, + request: ProcessEmbeddingsRequest, supabase: Client = Depends(get_supabase_client), ): worker = get_embedding_worker(supabase) result = await worker.process_pending_embeddings( - embedder_id=embedder_id, - max_batch_size=max_batch_size, - provider=provider, + embedder_id=request.embedder_id, + max_batch_size=request.max_batch_size, + provider=request.provider, ) return resultAlso applies to: 45-69, 101-117, 120-141
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/api_routes/ingestion_api.py` around lines 18 - 42, The endpoint process_pending_embeddings currently declares bare scalar parameters which FastAPI exposes as query params; replace them with a Pydantic request model (e.g., ProcessEmbeddingsRequest with fields max_batch_size, embedder_id, provider) and accept that model as the single body parameter in the process_pending_embeddings handler, then pass model fields into get_embedding_worker(...).process_pending_embeddings; update any validation/defaults on the model rather than function args. Apply the same refactor to the other POST handlers in this module (the other router.post endpoints) so they all accept validated Pydantic body models and use those fields when invoking their respective worker methods.python/src/server/services/ingestion/ingestion_state_service.py (1)
64-119: Consider using enum types in dataclass fields instead ofstr.
DocumentBlob.download_status,EmbeddingSet.status,Summary.status, andSummary.styleare typed as barestr, losing the type-safety provided by the enums defined just above. MyPy won't catch invalid status strings at assignment time.♻️ Proposed change (excerpt)
`@dataclass` class DocumentBlob: ... - download_status: str = "pending" + download_status: DownloadStatus = DownloadStatus.PENDING `@dataclass` class EmbeddingSet: ... - status: str = "pending" + status: EmbeddingStatus = EmbeddingStatus.PENDING `@dataclass` class Summary: ... - style: str = "overview" + style: SummaryStyle = SummaryStyle.OVERVIEW - status: str = "pending" + status: SummaryStatus = SummaryStatus.PENDING🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/src/server/services/ingestion/ingestion_state_service.py` around lines 64 - 119, Replace the bare str types for status/style fields with the appropriate enum types and default enum members: change DocumentBlob.download_status to use the DownloadStatus enum (default DownloadStatus.PENDING), EmbeddingSet.status to the EmbeddingSetStatus enum (default e.g., EmbeddingSetStatus.PENDING), Summary.status to SummaryStatus (default SummaryStatus.PENDING) and Summary.style to SummaryStyle (default SummaryStyle.OVERVIEW); update the dataclass annotations and default values to the enum members and ensure any code that serializes/deserializes these fields uses .value or casts to/from the enum (e.g., when reading from DB or JSON) so runtime and MyPy type-safety are preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/src/server/api_routes/ingestion_api.py`:
- Around line 18-42: The POST endpoints (e.g., process_pending_embeddings)
currently have no auth and must require a service/auth key; update their route
signatures to add the existing auth dependency (for example require_service_key)
alongside get_supabase_client so only authorized callers can invoke
state-mutating operations, and apply the same change to the other operational
endpoints mentioned (retry_failed_embeddings, retry_failed_summaries,
process_summaries) to check the service-role token before performing any work;
ensure you import the auth dependency (e.g., from ..utils import
require_service_key) and use it as a FastAPI Depends parameter in each affected
function signature.
- Line 20: The max_batch_size parameter currently lacks lower-bound validation
in both endpoints; update the parameter declaration to use FastAPI's Query with
ge=1 (e.g. change max_batch_size: int = 10 to max_batch_size: int = Query(10,
ge=1)) so invalid values like 0 or negatives are rejected before reaching the
worker layer, and add the Query import from fastapi if not already present;
apply the same change to both occurrences of max_batch_size in this module.
In `@python/src/server/api_routes/knowledge_api.py`:
- Around line 757-759: The code currently treats an empty docs_response.data as
an error by calling await tracker.error(...); instead, treat an empty-but-valid
state as completion: replace await tracker.error("No documents found for
source") with await tracker.complete("No documents found for source") (or a
clearer message) in the function handling the normal summarize flow (the block
checking if not docs_response.data), and make the identical replacement in the
resummarize path (the similar check around lines 932–934). Ensure you only
change the tracker method call and message text so the operation is marked
complete rather than failed.
- Around line 722-723: The background tasks created with asyncio.create_task for
_perform_revectorize_with_progress (and similarly for
_perform_resummarize_with_progress) are being discarded and can be GC'd and
cannot be cancelled; change the creation to retain and manage the Task just like
upload tasks by adding the created Task to the existing active_crawl_tasks set
(or an equivalent managed set), ensure you remove the Task from that set when it
completes, and use that stored Task for cancellation in the stop endpoint so
revectorize/resummarize tasks have proper lifecycle and cancellation support.
- Around line 784-863: The revectorize routine collects per-document failures in
the local errors list but never surfaces them; update the code that calls
tracker.complete (and the outer exception handler) to include the errors list in
the completion payload (e.g., include "errors": errors and "documents_failed":
len(errors)) and log it via logger.error/info as appropriate; also change the
outer except block to call logger.error with exc_info=True in addition to
safe_logfire_error so the stack trace is preserved. Apply the same changes
inside _perform_resummarize_with_progress: ensure its local errors list is
returned/embedded in tracker.complete and logged, and use logger.error(...,
exc_info=True) in its exception handler.
- Around line 1003-1009: The progress payload sent to tracker.complete uses the
wrong key name: change the payload in the tracker.complete call inside the
re-summarization completion block (where total_updated is used) to use
"total_examples": total_updated instead of "examples_total": total_updated so it
matches the progress updates emitted during processing and mirrors the
revectorize path; update the entry in the dictionary passed to tracker.complete
(the call site referencing tracker.complete and the keys
"examples_total"/"examples_processed") accordingly.
- Around line 846-857: Replace the deprecated datetime.utcnow() with
datetime.now(timezone.utc) and add "from datetime import timezone" to imports,
and adjust the final tracker.complete() payload to use
documents_total=total_docs and documents_processed=total_updated (keep
documents_total consistent with mid-loop updates) so the log and counts reflect
total_docs as the total and total_updated as processed; locate the call that
sets "last_vectorized_at" (currently using datetime.utcnow()) and the
tracker.complete(...) invocation that currently passes
documents_total=total_updated to make these changes.
- Line 751: Remove the dead initializer embedding_dimensions = 1536; it is never
used and causes confusion because the code later derives the actual dimension
into embedding_dim (from the returned vector length in the logic around
embedding_dim). Delete the unused assignment and run a quick search for
embedding_dimensions to ensure no other code relies on it; if any remaining
references exist, replace them to use embedding_dim instead.
In `@python/src/server/services/crawling/document_storage_operations.py`:
- Around line 563-569: The returned dicts currently use a Python set for
"urls_stored" which is not JSON-serializable and is inconsistent with an older
code path that omits the key; update all return sites in
document_storage_operations.py (the early-return that returns {"source_id":
original_source_id, "chunk_count": 0, "chunks_stored": 0, "urls_stored": set(),
"url_to_page_id": {}} and the normal return around the other block) to use an
empty list ([]) for "urls_stored" instead of set(), and ensure the same
"urls_stored" key (as a list) exists in all return paths to maintain a
consistent contract (keep "url_to_page_id" as a dict).
- Around line 659-661: The except block in document_storage_operations.py that
catches exceptions during the create/update of archon_sources must log the full
traceback and include the relevant source identifier before re-raising; locate
the try/except around the archon_sources create/update logic (look for
functions/methods that reference "archon_sources",
"create_or_update_archon_source", or similar) and replace or augment the
safe_logfire_error call with a logger.error call that includes a descriptive
message, the source_id (or URL) context, and exc_info=True so the stack trace is
preserved, then re-raise the exception as before.
- Around line 525-535: The _process_with_new_pipeline method accepts a
cancellation_check but never uses it, leaving the new pipeline uncancellable;
either (A) update orchestrator.run_pipeline to accept a cancellation_check (or a
cancellable token) and thread that through to any async tasks/workers invoked by
run_pipeline, then pass the cancellation_check from _process_with_new_pipeline
into orchestrator.run_pipeline, and update other calls noted around lines
587–595 to pass the same check; or (B) remove the cancellation_check parameter
from _process_with_new_pipeline (and its callers) and add a clear docstring
comment in _process_with_new_pipeline and public API docs stating the new
pipeline does not support per-request cancellation. Use the symbols
_process_with_new_pipeline and orchestrator.run_pipeline to locate changes and
ensure unit tests/typing are updated accordingly.
- Around line 653-656: The update is currently passing the literal string
"now()" which will be stored verbatim; replace it by generating an ISO timestamp
in Python and pass that value to updated_at. In the call to
self.supabase_client.table("archon_sources").update(...).eq("source_id",
source_id).execute() create a timestamp with
datetime.datetime.utcnow().isoformat() (or datetime.utcnow().isoformat() + "Z"
if you want explicit UTC) and set "updated_at": timestamp instead of "now()";
ensure you import datetime at top if not already present.
In `@python/src/server/services/ingestion/ingestion_state_service.py`:
- Around line 224-238: The create_chunks implementation in
ingestion_state_service.py silently returns an empty list when response.data is
empty; make it consistent with create_document_blob and create_embedding_set by
checking response.data after the Supabase insert and raising an error if it's
empty (include the response.error or status information in the exception message
for debugging). Locate the create_chunks method and the response variable used
for the Supabase insert, and replace the silent empty-return behavior with an
explicit raise (e.g., RuntimeError or a specific service error) that includes
context (operation name "create_chunks", response.status_code/response.error) so
downstream workers fail fast on data-corruption cases.
- Around line 122-530: The async methods in IngestionStateService (e.g.,
create_document_blob, get_document_blob, get_blobs_by_source, create_chunks,
get_chunks_by_blob, get_chunks_by_source, create_embedding_set,
get_embedding_set, get_pending_embedding_sets, update_embedding_set_status,
store_embeddings, get_embeddings_by_set, create_summary, get_pending_summaries,
update_summary, update_source_pipeline_status) call the synchronous Supabase
Client .execute() directly and block the event loop; either (preferred) replace
the sync Client with Supabase AsyncClient/create_async_client and await each
.execute() call, or (interim) wrap every .execute() invocation in an offloading
call such as await asyncio.to_thread(...) (or loop.run_in_executor) so DB calls
run in a thread pool; update imports, client construction, and all .execute()
usages in this class to match the chosen approach and ensure all returned
responses are awaited appropriately.
- Around line 277-298: The batched queries in get_chunks_by_source currently
append results from multiple .in_() calls without guaranteeing order; after
collecting all_chunks, sort the rows by blob_id and chunk_index (e.g., sort key
(row["blob_id"], row["chunk_index"])) before constructing Chunk objects so
callers receive chunks in deterministic per-blob, ascending chunk_index order;
update the code around all_chunks collection and the Chunk(...) list
comprehension in get_chunks_by_source to perform this sort.
In `@test_new_pipeline.md`:
- Line 15: The prerequisite in test_new_pipeline.md incorrectly references
"migration 014" while this PR adds 012_add_crawl_url_state.sql; update the
prerequisite text so it references the correct migration number(s) — either
change "migration 014" to "migration 012" or explicitly list which migrations
this PR ships (e.g., 012_add_crawl_url_state.sql) versus earlier ones, so
testers know exactly which migrations to verify.
- Line 9: The step that cd's into the hardcoded path
"/home/zebastjan/dev/archon" should be removed; instead update the instruction
to either tell contributors to run the steps from the repository root or replace
the hardcoded path with a repository-root detection approach (e.g., use git to
find the top-level repo) so the command works for everyone and does not
reference the personal workstation path.
- Around line 44-60: The docs assign SOURCE_ID to a literal placeholder and
never show how to extract it from the /api/progress/<progress_id> response,
causing subsequent health checks to target an empty URL; fix by showing an
"Expected Response" for the progress endpoint that includes the source_id field,
add a concrete extraction step that captures the progress JSON (e.g., assign
PROGRESS from the curl response and set SOURCE_ID by parsing PROGRESS for
.source_id), and add a short reminder to re-assign SOURCE_ID at the start of
each test (Tests 2, 3, 4, 6) that depends on it so the health check against
/api/ingestion/health/$SOURCE_ID uses a real value.
---
Duplicate comments:
In `@python/src/server/api_routes/knowledge_api.py`:
- Around line 879-883: The code currently validates the chat LLM provider by
calling credential_service.get_active_provider("llm") before calling
_validate_provider_api_key, but resummarization uses the code summarization
provider key; change the validation to fetch and validate the actual code
summarization provider (use the CODE_SUMMARIZATION_PROVIDER constant or config
key instead of the literal "llm") so that
credential_service.get_active_provider(CODE_SUMMARIZATION_PROVIDER) (or
equivalent settings lookup) is used and then pass that provider value into
_validate_provider_api_key to ensure the provider used for resummarization is
the one validated.
- Around line 686-731: The route revectorize_knowledge_item and its helper
_perform_revectorize_with_progress (and likewise
resummarize_knowledge_item/_perform_resummarize_with_progress) currently perform
Supabase reads/updates directly; refactor so all DB interactions move into
KnowledgeItemService methods (e.g., add service methods like revectorize_item,
get_item_for_revectorize, update_item_embeddings, resummarize_item,
update_item_summary) and have the route call the service to start the operation
and the background helper call service methods for all reads/updates; update
revectorize_knowledge_item to only validate input, create the ProgressTracker
and call asyncio.create_task(...) passing the service instance, and change
_perform_revectorize_with_progress to use the new KnowledgeItemService methods
(and mirror these changes for the resummarize helpers) so the API route →
service → database pattern is followed.
- Around line 793-849: The update to archon_sources uses embedding_dim which may
be unbound if no batch produced embeddings; ensure embedding_dim is initialized
before the batch loop (e.g., embedding_dim = 0 or None) and update it whenever
you successfully determine an embedding dimension inside the inner loop (the
block that sets embedding_column based on len(embedding)); then use that
initialized/last-known embedding_dim when calling
supabase.table("archon_sources").update(...). Also consider using a clearly
named variable like last_embedding_dim to avoid accidental reuse of loop-scoped
variables and keep the tracker.update and supabase update logic unchanged.
In `@python/src/server/services/crawling/document_storage_operations.py`:
- Around line 87-91: In the except block around
url_state_service.initialize_urls (the try/except handling URL state
initialization), preserve the exception stack trace by logging the error with
the logger (e.g., logger.error or module logger) using exc_info=True in addition
to the existing safe_logfire_error call; update the except Exception as e
handler to call logger.error("Failed to initialize URL state", exc_info=True)
(or equivalent logger variable) so the traceback is recorded while keeping
safe_logfire_error for the user-facing message.
- Around line 304-311: The code is marking URLs from unique_doc_urls as embedded
even for documents that were skipped; change the logic so mark_embedded is
called only for URLs that were actually embedded (use all_urls, which is
populated for documents that passed content validation and were chunked/stored)
instead of unique_doc_urls; update the loop to iterate over all_urls and adjust
the success log to report len(all_urls) and keep the same exception handling
around url_state_service.mark_embedded(original_source_id, url) so only truly
embedded URLs are marked.
- Around line 389-407: In update_source_info, fetching provenance
(credential_service.get_credentials_by_category) and parsing
EMBEDDING_DIMENSIONS can raise transient errors and currently abort the primary
path; wrap the credential lookups for "embedding" and "rag_strategy" in their
own try/except blocks (or use safe_get helper) and on exception or missing keys
fall back to sensible defaults (embedding_provider="openai",
embedding_model="text-embedding-3-small", embedding_dimensions=1536,
vectorizer_settings defaults, summarization_model="gpt-4o-mini") while logging
the error, and ensure embedding_dimensions is validated with a safe int
conversion that falls back to 1536 instead of letting ValueError propagate so
provenance failures do not abort update_source_info.
In `@python/src/server/services/ingestion/ingestion_state_service.py`:
- Line 164: Replace the three generic bare raises that use Exception("Failed to
create document blob"), Exception("Failed to create embedding set"), and
Exception("Failed to create summary record") with a more specific error that
preserves context and the original error (for example a custom IngestionError or
ValueError) and chain the original exception using "raise ... from e"; include
relevant contextual information (IDs, request data, or underlying exception
message) in the new exception text so callers can debug (locate these by
searching for the exact strings "Failed to create document blob", "Failed to
create embedding set", and "Failed to create summary record" in
ingestion_state_service.py).
- Around line 516-530: In update_source_pipeline_status, avoid the falsy check
and the unconditional write of pipeline_error when status == "error": change the
error handling to use explicit "if error_info is not None" (matching
update_embedding_set_status and update_summary), only add "pipeline_error" to
update_data when error_info is not None, and remove the redundant elif branch
that overwrites pipeline_error; keep the pipeline_completed_at assignment when
status == "complete" and then call
self.supabase.table("archon_sources").update(update_data).eq("source_id",
source_id).execute() as before.
---
Nitpick comments:
In `@python/src/server/api_routes/ingestion_api.py`:
- Around line 18-42: The endpoint process_pending_embeddings currently declares
bare scalar parameters which FastAPI exposes as query params; replace them with
a Pydantic request model (e.g., ProcessEmbeddingsRequest with fields
max_batch_size, embedder_id, provider) and accept that model as the single body
parameter in the process_pending_embeddings handler, then pass model fields into
get_embedding_worker(...).process_pending_embeddings; update any
validation/defaults on the model rather than function args. Apply the same
refactor to the other POST handlers in this module (the other router.post
endpoints) so they all accept validated Pydantic body models and use those
fields when invoking their respective worker methods.
In `@python/src/server/services/crawling/document_storage_operations.py`:
- Line 635: The line calling
self.supabase_client.table("archon_sources").select("source_id").eq("source_id",
source_id).execute() exceeds 120 chars; refactor it by splitting the chained
call across multiple lines (e.g., assign the table call or the query to an
intermediate variable, or use parentheses to break after .table(...).select(...)
and .eq(...)) so the resulting assignment to response in the
DocumentStorageOperations method remains under 120 characters while preserving
the same call chain and behavior.
- Around line 561-562: Replace the error-level logging call for the
empty-documents edge case with a non-error level and clearer message: in the
block checking "if not documents:" swap safe_logfire_error(...) for a
safe_logfire_info(...) or safe_logfire_warning(...) call and update the message
to indicate this is an expected/no-content result (include original_source_id
for context). This change should be made where the variable original_source_id
is available so the log still records the source but no longer floods error
monitoring.
In `@python/src/server/services/ingestion/ingestion_state_service.py`:
- Around line 64-119: Replace the bare str types for status/style fields with
the appropriate enum types and default enum members: change
DocumentBlob.download_status to use the DownloadStatus enum (default
DownloadStatus.PENDING), EmbeddingSet.status to the EmbeddingSetStatus enum
(default e.g., EmbeddingSetStatus.PENDING), Summary.status to SummaryStatus
(default SummaryStatus.PENDING) and Summary.style to SummaryStyle (default
SummaryStyle.OVERVIEW); update the dataclass annotations and default values to
the enum members and ensure any code that serializes/deserializes these fields
uses .value or casts to/from the enum (e.g., when reading from DB or JSON) so
runtime and MyPy type-safety are preserved.
In `@test_new_pipeline.md`:
- Line 337: Update the wording in the sentence that reads "Check error_info in
database, fix issue, use retry endpoint." to use a more formal tone by replacing
"fix" with "resolve" so it reads "Check error_info in database, resolve issue,
use retry endpoint."; locate and edit the exact fragment in the document (the
line containing "Check error_info in database, fix issue, use retry endpoint.")
to make this single-word substitution.
| @router.post("/process-embeddings") | ||
| async def process_pending_embeddings( | ||
| max_batch_size: int = 10, | ||
| embedder_id: str | None = None, | ||
| provider: str | None = None, | ||
| supabase: Client = Depends(get_supabase_client), | ||
| ): | ||
| """ | ||
| Manually trigger processing of pending embedding sets. | ||
|
|
||
| Args: | ||
| max_batch_size: Maximum number of embedding sets to process | ||
| embedder_id: Optional filter by specific embedder | ||
| provider: Optional embedding provider override | ||
|
|
||
| Returns: | ||
| Processing results with counts | ||
| """ | ||
| worker = get_embedding_worker(supabase) | ||
| result = await worker.process_pending_embeddings( | ||
| embedder_id=embedder_id, | ||
| max_batch_size=max_batch_size, | ||
| provider=provider, | ||
| ) | ||
| return result |
There was a problem hiding this comment.
No authentication on state-mutating operational endpoints — major security gap.
All six endpoints (four POST, two GET) are publicly accessible with zero authentication. The POST endpoints are particularly sensitive:
/retry-failed-embeddingsand/retry-failed-summariessilently reset persistedfailedstate back topending, potentially triggering re-processing of previously intentionally skipped items./process-embeddingsand/process-summariescan be hammered to exhaust worker capacity, especially sincemax_batch_sizeis unbounded.
The app's CORS policy is allow_origins=["*"], so any browser can reach these endpoints cross-origin. Even for an internal tool, these trigger operations should require at minimum an API key or the same service-role token check used elsewhere. Consider adding an auth dependency — e.g.:
from ..utils import require_service_key # or your existing auth dependencyAlso applies to: 45-69, 101-117, 120-141
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/api_routes/ingestion_api.py` around lines 18 - 42, The POST
endpoints (e.g., process_pending_embeddings) currently have no auth and must
require a service/auth key; update their route signatures to add the existing
auth dependency (for example require_service_key) alongside get_supabase_client
so only authorized callers can invoke state-mutating operations, and apply the
same change to the other operational endpoints mentioned
(retry_failed_embeddings, retry_failed_summaries, process_summaries) to check
the service-role token before performing any work; ensure you import the auth
dependency (e.g., from ..utils import require_service_key) and use it as a
FastAPI Depends parameter in each affected function signature.
|
|
||
| @router.post("/process-embeddings") | ||
| async def process_pending_embeddings( | ||
| max_batch_size: int = 10, |
There was a problem hiding this comment.
max_batch_size has no lower-bound validation — pass 0 or -1 to silently no-op or misbehave.
Add a Query(ge=1) constraint on both endpoints to guard against nonsensical values before they reach the worker layer:
🛡️ Proposed fix
-from fastapi import APIRouter, Depends
+from fastapi import APIRouter, Depends, Query
from supabase import Client
...
`@router.post`("/process-embeddings")
async def process_pending_embeddings(
- max_batch_size: int = 10,
+ max_batch_size: int = Query(default=10, ge=1),
...
):
...
`@router.post`("/process-summaries")
async def process_pending_summaries(
- max_batch_size: int = 10,
+ max_batch_size: int = Query(default=10, ge=1),
...
):Also applies to: 47-47
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/api_routes/ingestion_api.py` at line 20, The max_batch_size
parameter currently lacks lower-bound validation in both endpoints; update the
parameter declaration to use FastAPI's Query with ge=1 (e.g. change
max_batch_size: int = 10 to max_batch_size: int = Query(10, ge=1)) so invalid
values like 0 or negatives are rejected before reaching the worker layer, and
add the Query import from fastapi if not already present; apply the same change
to both occurrences of max_batch_size in this module.
| # Start background task with semaphore | ||
| asyncio.create_task(_perform_revectorize_with_progress(progress_id, source_id, provider, tracker)) |
There was a problem hiding this comment.
Background task reference dropped — GC risk and no cancellation support.
The asyncio.create_task() result is discarded, meaning the task has no strong reference. "Save a reference to the result of this function, to avoid a task disappearing mid-execution. The event loop only keeps weak references to tasks. A task that isn't referenced elsewhere may get garbage collected at any time, even before it's done."
The same issue is at line 903 for _perform_resummarize_with_progress. By contrast, upload tasks (line 1233) are correctly stored in active_crawl_tasks. Store revectorize/resummarize tasks the same way (or at minimum use the set-based fire-and-forget pattern), so they also get cancellation support via the stop endpoint.
🛡️ Suggested fix
- asyncio.create_task(_perform_revectorize_with_progress(progress_id, source_id, provider, tracker))
+ task = asyncio.create_task(_perform_revectorize_with_progress(progress_id, source_id, provider, tracker))
+ active_crawl_tasks[progress_id] = taskApply the same pattern at line 903 for resummarize.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/api_routes/knowledge_api.py` around lines 722 - 723, The
background tasks created with asyncio.create_task for
_perform_revectorize_with_progress (and similarly for
_perform_resummarize_with_progress) are being discarded and can be GC'd and
cannot be cancelled; change the creation to retain and manage the Task just like
upload tasks by adding the created Task to the existing active_crawl_tasks set
(or an equivalent managed set), ensure you remove the Task from that set when it
completes, and use that stored Task for cancellation in the stop endpoint so
revectorize/resummarize tasks have proper lifecycle and cancellation support.
|
|
||
| # Get current embedding settings for provenance | ||
| embedding_model = await get_embedding_model(provider=provider) | ||
| embedding_dimensions = 1536 |
There was a problem hiding this comment.
embedding_dimensions = 1536 is dead code.
This variable is assigned but never read; all subsequent logic uses embedding_dim (derived from the actual returned vector length at line 801). Remove the stale initialiser to avoid confusion about what the actual dimension will be.
🗑️ Proposed fix
- embedding_dimensions = 1536
-
# Fetch all documents for this source🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/api_routes/knowledge_api.py` at line 751, Remove the dead
initializer embedding_dimensions = 1536; it is never used and causes confusion
because the code later derives the actual dimension into embedding_dim (from the
returned vector length in the logic around embedding_dim). Delete the unused
assignment and run a quick search for embedding_dimensions to ensure no other
code relies on it; if any remaining references exist, replace them to use
embedding_dim instead.
| if not docs_response.data: | ||
| await tracker.error("No documents found for source") | ||
| return |
There was a problem hiding this comment.
tracker.error() misused for an empty-but-valid state.
If a source genuinely has no crawled pages yet, the operation is marked as failed, misleading the user. This is a valid state (e.g., a freshly created source), not an error. Use tracker.complete() with a descriptive message instead.
🛠️ Proposed fix
if not docs_response.data:
- await tracker.error("No documents found for source")
+ await tracker.complete({"log": "No documents found for source — nothing to re-vectorize"})
returnApply the same change at line 932–934 for the resummarize path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/api_routes/knowledge_api.py` around lines 757 - 759, The
code currently treats an empty docs_response.data as an error by calling await
tracker.error(...); instead, treat an empty-but-valid state as completion:
replace await tracker.error("No documents found for source") with await
tracker.complete("No documents found for source") (or a clearer message) in the
function handling the normal summarize flow (the block checking if not
docs_response.data), and make the identical replacement in the resummarize path
(the similar check around lines 932–934). Ensure you only change the tracker
method call and message text so the operation is marked complete rather than
failed.
| response = self.supabase.table("archon_chunks").insert(chunk_records).execute() | ||
|
|
||
| return [ | ||
| Chunk( | ||
| id=uuid.UUID(row["id"]), | ||
| blob_id=uuid.UUID(row["blob_id"]), | ||
| chunk_index=row["chunk_index"], | ||
| start_offset=row.get("start_offset"), | ||
| end_offset=row.get("end_offset"), | ||
| content=row["content"], | ||
| token_count=row.get("token_count"), | ||
| created_at=row.get("created_at"), | ||
| ) | ||
| for row in response.data | ||
| ] |
There was a problem hiding this comment.
create_chunks silently returns [] on empty DB response, inconsistent with other create methods.
create_document_blob and create_embedding_set both raise when response.data is empty; create_chunks does not. A silent empty return means downstream embedding/summary workers see zero chunks with no error signal, violating the fail-fast guideline for data corruption.
🐛 Proposed fix
response = self.supabase.table("archon_chunks").insert(chunk_records).execute()
+ if not response.data:
+ raise RuntimeError(
+ f"Failed to create {len(chunk_records)} chunks for blob_id={blob_id}: no data returned"
+ )
+
return [🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/services/ingestion/ingestion_state_service.py` around lines
224 - 238, The create_chunks implementation in ingestion_state_service.py
silently returns an empty list when response.data is empty; make it consistent
with create_document_blob and create_embedding_set by checking response.data
after the Supabase insert and raising an error if it's empty (include the
response.error or status information in the exception message for debugging).
Locate the create_chunks method and the response variable used for the Supabase
insert, and replace the silent empty-return behavior with an explicit raise
(e.g., RuntimeError or a specific service error) that includes context
(operation name "create_chunks", response.status_code/response.error) so
downstream workers fail fast on data-corruption cases.
| for i in range(0, len(blob_ids), batch_size): | ||
| batch = blob_ids[i : i + batch_size] | ||
| response = ( | ||
| self.supabase.table("archon_chunks") | ||
| .select("*") | ||
| .in_("blob_id", batch) | ||
| .execute() | ||
| ) | ||
| all_chunks.extend(response.data) | ||
|
|
||
| return [ | ||
| Chunk( | ||
| id=uuid.UUID(row["id"]), | ||
| blob_id=uuid.UUID(row["blob_id"]), | ||
| chunk_index=row["chunk_index"], | ||
| start_offset=row.get("start_offset"), | ||
| end_offset=row.get("end_offset"), | ||
| content=row["content"], | ||
| token_count=row.get("token_count"), | ||
| created_at=row.get("created_at"), | ||
| ) | ||
| for row in all_chunks |
There was a problem hiding this comment.
Batched queries in get_chunks_by_source return unordered chunks.
get_chunks_by_blob sorts by chunk_index, but the batched variant here does not. Across multiple blobs the results from different .in_() queries are appended in arrival order, leaving the final list unsorted. Callers expecting ordered chunks (e.g. embedding workers, context assembly) will silently process them out of sequence.
🐛 Proposed fix
response = (
self.supabase.table("archon_chunks")
.select("*")
.in_("blob_id", batch)
+ .order("chunk_index")
.execute()
)
all_chunks.extend(response.data)
+
+ # Sort across batches in case blobs interleave
+ all_chunks.sort(key=lambda r: (r["blob_id"], r["chunk_index"]))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/src/server/services/ingestion/ingestion_state_service.py` around lines
277 - 298, The batched queries in get_chunks_by_source currently append results
from multiple .in_() calls without guaranteeing order; after collecting
all_chunks, sort the rows by blob_id and chunk_index (e.g., sort key
(row["blob_id"], row["chunk_index"])) before constructing Chunk objects so
callers receive chunks in deterministic per-blob, ascending chunk_index order;
update the code around all_chunks collection and the Chunk(...) list
comprehension in get_chunks_by_source to perform this sort.
|
|
||
| 1. Start the backend service: | ||
| ```bash | ||
| cd /home/zebastjan/dev/archon |
There was a problem hiding this comment.
Remove the hardcoded developer path.
/home/zebastjan/dev/archon is a personal workstation path. Contributors running these steps elsewhere will get a confusing No such file or directory error before anything else fails.
✏️ Suggested fix
-cd /home/zebastjan/dev/archon
+cd archon # or: cd "$(git rev-parse --show-toplevel)"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cd /home/zebastjan/dev/archon | |
| cd archon # or: cd "$(git rev-parse --show-toplevel)" |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test_new_pipeline.md` at line 9, The step that cd's into the hardcoded path
"/home/zebastjan/dev/archon" should be removed; instead update the instruction
to either tell contributors to run the steps from the repository root or replace
the hardcoded path with a repository-root detection approach (e.g., use git to
find the top-level repo) so the command works for everyone and does not
reference the personal workstation path.
| # cd python && uv run python -m src.server.main | ||
| ``` | ||
|
|
||
| 2. Ensure Supabase is running and migration 014 has been applied (pipeline tables exist) |
There was a problem hiding this comment.
Migration number in prerequisites doesn't match the PR.
The PR adds 012_add_crawl_url_state.sql, but this line tells testers to verify migration 014. If the numbering is genuinely ahead of what this PR ships, the prerequisite should explicitly call out which migrations belong to this PR and which were applied earlier — otherwise testers will be looking for a migration that doesn't exist yet (or they'll silently skip a required one).
✏️ Suggested fix (adjust numbers to match actual shipped migrations)
-2. Ensure Supabase is running and migration 014 has been applied (pipeline tables exist)
+2. Ensure Supabase is running and migrations 012–014 have been applied (crawl URL state + pipeline tables)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| 2. Ensure Supabase is running and migration 014 has been applied (pipeline tables exist) | |
| 2. Ensure Supabase is running and migrations 012–014 have been applied (crawl URL state + pipeline tables) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test_new_pipeline.md` at line 15, The prerequisite in test_new_pipeline.md
incorrectly references "migration 014" while this PR adds
012_add_crawl_url_state.sql; update the prerequisite text so it references the
correct migration number(s) — either change "migration 014" to "migration 012"
or explicitly list which migrations this PR ships (e.g.,
012_add_crawl_url_state.sql) versus earlier ones, so testers know exactly which
migrations to verify.
| # Replace <progress_id> with the ID from step 1 | ||
| curl http://localhost:8181/api/progress/<progress_id> | ||
| ``` | ||
|
|
||
| **Expected:** Status should progress through stages (discovery → downloading → chunking) | ||
|
|
||
| ### Step 3: Verify pipeline state | ||
|
|
||
| Once crawling completes, check that blobs and chunks were created: | ||
|
|
||
| ```bash | ||
| # Get source_id from progress response | ||
| SOURCE_ID="<source_id_from_progress>" | ||
|
|
||
| # Check health of the source | ||
| curl http://localhost:8181/api/ingestion/health/$SOURCE_ID | ||
| ``` |
There was a problem hiding this comment.
$SOURCE_ID is never populated — every subsequent health check silently targets an empty URL.
The block at lines 56-57 assigns a placeholder string literal to SOURCE_ID rather than extracting the real value from the API. Because the progress response format is never shown (Step 2 has no "Expected Response" block), a tester has no way to know which field to read. Additionally, $SOURCE_ID is reused across Tests 2, 3, 4, and 6, all of which assume it was set in a prior terminal session.
Suggested improvements:
- Show the expected progress response so testers can identify the
source_idfield. - Provide a concrete extraction command (e.g.
jq) instead of the placeholder. - Add a re-assignment reminder at the start of each test that needs the variable.
✏️ Suggested fix for Step 2 + Step 3
### Step 2: Check crawl progress
```bash
# Replace <progress_id> with the ID from step 1
-curl http://localhost:8181/api/progress/<progress_id>
+PROGRESS=$(curl -s http://localhost:8181/api/progress/<progress_id>)
+echo "$PROGRESS" | jq .
+SOURCE_ID=$(echo "$PROGRESS" | jq -r '.source_id')+Expected: Response includes a source_id field alongside the current stage.
+
Step 3: Verify pipeline state
Once crawling completes, check that blobs and chunks were created:
-# Get source_id from progress response
-SOURCE_ID="<source_id_from_progress>"
-
# Check health of the source
curl http://localhost:8181/api/ingestion/health/$SOURCE_ID</details>
<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against the current code and only fix it if needed.
In @test_new_pipeline.md around lines 44 - 60, The docs assign SOURCE_ID to a
literal placeholder and never show how to extract it from the
/api/progress/<progress_id> response, causing subsequent health checks to target
an empty URL; fix by showing an "Expected Response" for the progress endpoint
that includes the source_id field, add a concrete extraction step that captures
the progress JSON (e.g., assign PROGRESS from the curl response and set
SOURCE_ID by parsing PROGRESS for .source_id), and add a short reminder to
re-assign SOURCE_ID at the start of each test (Tests 2, 3, 4, 6) that depends on
it so the health check against /api/ingestion/health/$SOURCE_ID uses a real
value.
</details>
<!-- fingerprinting:phantom:triton:churro -->
<!-- This is an auto-generated comment by CodeRabbit -->
- Current focus: robust ingestion pipeline - Near-term: batch processing + agent skills - Mid-term: git + IPFS integration - Long-term: database abstraction + knowledge graph
- Add operation_progress table for restartable pipelines - ProgressTracker now persists to database (survives container restarts) - Add pause/resume API endpoints and frontend buttons - Auto-apply migrations on server startup - Fix Summary Ollama detection on page load - Fix settings reload after save to verify persistence - Fix broken test fixtures (renamed helper functions)
Implements foundational pause/resume functionality for web crawl operations with automatic recovery on Docker restart. Download manager-style controls with proper state tracking and UI feedback. ## Changes ### Backend - Pause/Resume State Machine - Add cancellation reason tracking (PAUSED vs STOPPED vs NONE) - Modify CrawlingService.cancel() to accept reason parameter - Add CrawlingService.pause() method for explicit pause operations - Update CancelledError handler to preserve "paused" status - Pause endpoint now calls orchestration.pause() instead of cancel() ### Backend - Auto-Resume on Restart - Add ProgressTracker.auto_resume_paused_operations() method - Server startup now auto-resumes all paused operations - Operations marked as "paused" after Docker restart automatically continue - Set source_id early in crawl initialization for pause/resume support ### Backend - Failed Operations Retryable - Resume endpoint now accepts "failed" status for retry functionality - Failed operations preserve progress data for recovery ### Frontend - UI Improvements - Add "Retry" button for failed operations (blue, RotateCw icon) - Change Resume button icon from StopCircle to Play icon - Update button visibility logic for paused/failed states - Visual distinction between Pause (⏹️), Resume (▶️ ), and Retry (🔄) ## Known Issues ### Critical - **Cancel button non-functional**: UI does not respond to cancel/stop button - **Resume doesn't restart crawl**: UI responds to resume button but actual crawl operation does not continue (no backend activity detected) - **Auto-resume may not trigger**: Docker restart shows "Restored" message but not always "Auto-resumed" message ### Investigation Needed - Source_id may still be NULL in some cases despite early initialization - Cancellation checks (_check_cancellation) may not be frequent enough during long-running operations like discovery or initial page crawling - Active orchestration registry may not properly track paused operations ## Testing Status - ✅ Pause button changes status to "paused" in database - ✅ UI shows correct badges (Paused/Cancelled/Failed) - ✅ Resume button appears for paused operations - ❌ Resume does not actually restart crawl - ❌ Cancel button does not stop operations -⚠️ Auto-resume inconsistent on Docker restart ## Next Steps 1. Debug why resume endpoint doesn't restart crawl service 2. Fix cancel button to properly stop operations 3. Ensure _check_cancellation() called frequently in all crawl stages 4. Add more detailed logging for pause/resume/cancel operations 5. Verify source_id persistence across pause/resume cycles Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add robust testing for pause/resume/cancel functionality with critical bug prevention and fix for source creation reliability. Testing Infrastructure: - API endpoint tests (8/9 passing) for pause/resume/cancel endpoints - Integration tests (6/6 passing) for full pause/resume lifecycle - Comprehensive test documentation with examples and patterns - Quick reference guide for running tests Critical Bug Tests: - Resume with missing source_id returns 400 (prevents early pause bug) - Resume with missing source record returns 404 (prevents DB inconsistency) - Resume validates BEFORE updating DB status (prevents corruption) - Pause before source creation fails gracefully with clear error - Progress never goes backwards after resume Source Creation Fix: - Make source record creation REQUIRED (was best-effort) - Add retry logic: 3 attempts with exponential backoff (1s, 2s) - Fail crawl with clear error if source creation fails after retries - Prevents silent breakage where pause works but resume fails Impact: - 14/15 tests passing (93% success rate) - All critical bugs have regression tests - Transient DB errors handled automatically via retry - Permanent DB failures fail fast with clear error messages - No more silent pause/resume breakage Files: - python/tests/test_pause_resume_cancel_api.py (369 lines) - python/tests/progress_tracking/integration/test_pause_resume_flow.py (338 lines) - python/tests/progress_tracking/README.md (437 lines - test docs) - python/tests/RUN_PAUSE_RESUME_TESTS.md (quick reference) - python/src/server/services/crawling/crawling_service.py (retry logic) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
⚠️ WARNING: Code is in non-working state Crawls start but fail/disappear shortly after. Cannot test pause/resume functionality until basic crawls work. See KNOWN_ISSUES.md for: - Current symptoms and impact - What we know/don't know - Debugging steps for next session - TODO list This is WIP debugging code. DO NOT MERGE until crawls work. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The database constraint only allows: idle, downloading, chunking, embedding, summarizing, complete, error. Using 'initializing' caused constraint violations (error code 23514) and prevented source creation, making all crawls fail immediately. Changed to 'idle' which is the database default and semantically correct for a newly created source that hasn't started processing yet. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
🔄 This repository is being replaced by a new version of Archon. The original Python/MCP codebase is being archived to the This PR is being closed as part of the migration. Thank you for your contribution! |
… nodes (#936) Approval nodes can now capture the reviewer's comment as $nodeId.output (via capture_response: true) and optionally retry on rejection instead of cancelling (via on_reject: { prompt, max_attempts }). This enables iterative human-AI review cycles without needing interactive loop nodes. Changes: - Add approvalOnRejectSchema and extend approval node schema with capture_response and on_reject fields - Extend ApprovalContext with captureResponse, onRejectPrompt, onRejectMaxAttempts (stored at pause time for reject handlers) - Add $REJECTION_REASON variable to substituteWorkflowVariables - Extract executeApprovalNode function with rejection resume logic - Update all 4 approve handlers (CLI, command-handler, orchestrator, server API) to use captureResponse and clear rejection state - Update all 3 reject handlers (CLI, command-handler, server API) to check onRejectPrompt and retry instead of cancel when configured - Add 5 tests for approval node behavior (fresh pause, capture_response, on_reject resume, max_attempts exhaustion, max_attempts=1) Fixes #936
… nodes (coleam00#936) Approval nodes can now capture the reviewer's comment as $nodeId.output (via capture_response: true) and optionally retry on rejection instead of cancelling (via on_reject: { prompt, max_attempts }). This enables iterative human-AI review cycles without needing interactive loop nodes. Changes: - Add approvalOnRejectSchema and extend approval node schema with capture_response and on_reject fields - Extend ApprovalContext with captureResponse, onRejectPrompt, onRejectMaxAttempts (stored at pause time for reject handlers) - Add $REJECTION_REASON variable to substituteWorkflowVariables - Extract executeApprovalNode function with rejection resume logic - Update all 4 approve handlers (CLI, command-handler, orchestrator, server API) to use captureResponse and clear rejection state - Update all 3 reject handlers (CLI, command-handler, server API) to check onRejectPrompt and retry instead of cancel when configured - Add 5 tests for approval node behavior (fresh pause, capture_response, on_reject resume, max_attempts exhaustion, max_attempts=1) Fixes coleam00#936
… nodes (coleam00#936) Approval nodes can now capture the reviewer's comment as $nodeId.output (via capture_response: true) and optionally retry on rejection instead of cancelling (via on_reject: { prompt, max_attempts }). This enables iterative human-AI review cycles without needing interactive loop nodes. Changes: - Add approvalOnRejectSchema and extend approval node schema with capture_response and on_reject fields - Extend ApprovalContext with captureResponse, onRejectPrompt, onRejectMaxAttempts (stored at pause time for reject handlers) - Add $REJECTION_REASON variable to substituteWorkflowVariables - Extract executeApprovalNode function with rejection resume logic - Update all 4 approve handlers (CLI, command-handler, orchestrator, server API) to use captureResponse and clear rejection state - Update all 3 reject handlers (CLI, command-handler, server API) to check onRejectPrompt and retry instead of cancel when configured - Add 5 tests for approval node behavior (fresh pause, capture_response, on_reject resume, max_attempts exhaustion, max_attempts=1) Fixes coleam00#936
Summary
archon_crawl_url_statetable to track per-URL crawl statusCrawlUrlStateServicewith methods to track URL states (pending/fetched/embedded/failed)Changes
New Files
migration/0.1.0/012_add_crawl_url_state.sql- Database table for URL statepython/src/server/services/crawling/crawl_url_state_service.py- Service classpython/tests/test_crawl_url_state_service.py- Unit testsModified Files
python/src/server/services/crawling/crawling_service.py- Added state check at crawl startpython/src/server/services/crawling/document_storage_operations.py- Added URL state initialization and markingTesting
Notes
This is Phase 1 of the crawl checkpoint/resume feature. Next steps:
Summary by CodeRabbit
New Features
Bug Fixes
Performance
Documentation
Tests
Chores