Skip to content

feat: Docker deployment improvements, async fixes, and resilience#928

Closed
steveant wants to merge 5 commits intocoleam00:mainfrom
AeyeOps:fix/docker-deployment-improvements
Closed

feat: Docker deployment improvements, async fixes, and resilience#928
steveant wants to merge 5 commits intocoleam00:mainfrom
AeyeOps:fix/docker-deployment-improvements

Conversation

@steveant
Copy link
Copy Markdown

@steveant steveant commented Jan 15, 2026

Summary

Consolidation of battle-tested improvements from production Docker deployments. These changes address real-world issues encountered when running Archon in Docker Compose environments with self-hosted Supabase.

Changes (13 files)

Health & Resilience

python/src/server/main.py - Non-blocking health check

  • Health endpoint now uses asyncio.wait_for with 2s timeout
  • Prevents health checks from hanging when database is slow
  • Gracefully degrades with timeout/error flags instead of failing hard

python/src/server/services/credential_service.py - Startup resilience

  • Added retry logic with exponential backoff for PGRST205 errors
  • Handles Supabase schema cache not ready during startup
  • Prevents startup failures when Supabase is still initializing

python/src/server/services/storage/document_storage_service.py - Idempotent upserts

  • Changed insert to upsert with on_conflict="url,chunk_number"
  • Prevents duplicate key errors when re-crawling same URLs
  • Integrates with embedding schema fallback for mixed deployments

Async Correctness

python/src/mcp_server/mcp_server.py - Event loop safety

  • Changed threading.Lock to asyncio.Lock to avoid blocking event loop
  • Improved double-check pattern with proper null safety
  • Prevents deadlocks in async contexts

python/src/server/services/storage/code_storage_service.py - Proper async function

  • Converted generate_code_example_summary from sync to async
  • Eliminates asyncio.run() which can cause "event loop already running" errors
  • Function not called externally, safe breaking change

python/src/server/api_routes/knowledge_api.py - Extensible callbacks

  • Progress callback now accepts **extra_fields for extensibility
  • Uses cached schema check function for better performance
  • PEP 604 type hints (dict | None syntax)

Bug Fixes

python/src/agents/document_agent.py - Pydantic AI compatibility

  • Updated result_type to output_type for pydantic-ai 1.0.x compatibility
  • Fixes agent initialization failure with newer pydantic-ai versions

python/src/server/services/crawling/discovery_service.py - Soft 404 detection

  • Added content-type validation to _check_url_exists()
  • Detects soft 404s where servers return HTTP 200 with HTML for missing files
  • Prevents false positives when checking for llms.txt and similar files

Docker Compatibility

python/src/server/config/config.py - Container hostname support

  • Allow hostnames ending with _supabase (e.g., supabase_kong_supabase)
  • Fixes URL validation for Docker Compose Supabase deployments
  • Maintains security by only allowing known container naming patterns

python/src/server/services/storage/embedding_schema_support.py - NEW

  • Schema compatibility helpers for embedding columns
  • Supports both legacy single-column and per-dimension schemas
  • Graceful fallback when dimension columns don't exist

Migrations & Maintenance

migration/complete_setup.sql - Idempotent migrations

  • Trigger creation now checks if exists before creating
  • Added ON CONFLICT DO NOTHING to INSERT statements
  • Makes migrations safely re-runnable for disaster recovery

migration/cleanup_stale_credentials.py - NEW

  • Utility to remove undecryptable credentials after SUPABASE_SERVICE_KEY rotation
  • Detects credentials encrypted with old key and removes them
  • Prevents startup failures after key rotation

Frontend

archon-ui-main/src/App.tsx - React Router v7 preparation

  • Added future flags: v7_startTransition, v7_relativeSplatPath
  • Valid forward-compatibility flags for React Router 6.26.2
  • Prepares codebase for Router v7 migration

Test Plan

  • Bootstrap from clean state (sudo ./bootstrap-archon.sh)
  • Document ingestion succeeds (web crawl + manual upload)
  • Health endpoint responds < 2s under load
  • MCP server starts without deadlock
  • Migrations re-run safely without errors

Breaking Changes

None. All changes are backwards compatible.

Notes

  • These changes were developed and tested in a production Docker Compose environment with self-hosted Supabase
  • The _supabase hostname pattern is specific to Docker Compose service naming conventions
  • The embedding schema fallback supports deployments that may not have the latest migration

Summary by CodeRabbit

  • New Features

    • Changelog added and project version bumped to v0.1.1.
  • Improvements

    • Vector embedding fallback for flexible storage schemas.
    • Broader local Supabase URL acceptance for developer setups.
    • Non-blocking, timeout-protected health checks for faster responsiveness.
    • Async-safe server initialization and async-friendly APIs.
    • Retry/backoff and robustness improvements for credential loading and storage.
    • Progress reporting accepts extra contextual fields.
    • Idempotent database initialization to avoid duplicate entries.
    • Content-type validation to detect soft 404s for certain file types.
  • Bug Fixes

    • Utility to clean up stale encrypted credentials.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jan 15, 2026

📝 Walkthrough

Walkthrough

Replaces React Router usage with v7 experimental flags; adds stale-credential cleanup script and makes migration SQL idempotent; converts MCP server init to async with serialized lock; adds cached, timeout-bounded DB schema health checks; adds retries for credential loading; introduces embedding-schema detection, upsert storage and fallback; makes a code-summary function async; expands Supabase URL local validation; minor version bumps and changelog/README updates.

Changes

Cohort / File(s) Summary
Frontend Routing
archon-ui-main/src/App.tsx
Swap <Router> for Router future={{ v7_startTransition: true, v7_relativeSplatPath: true }} to enable React Router v7 experimental flags.
Migration / Cleanup
migration/cleanup_stale_credentials.py, migration/complete_setup.sql
Add script to remove stale encrypted credentials when the service key cannot decrypt; make trigger creation and initial inserts idempotent (DO pre-check, ON CONFLICT DO NOTHING).
MCP Server Init (async)
python/src/mcp_server/mcp_server.py
Replace threading.Lock with asyncio.Lock; serialize async initialization inside async with and publish shared context after initialization.
Health Check / Cached Schema
python/src/server/main.py, python/src/server/api_routes/knowledge_api.py
Introduce _check_database_schema_cached() and _perform_database_schema_check(...); use asyncio.wait_for(..., 2s) in health path, expose schema_check_timeout, and have knowledge API call cached check.
Credential Loading Retries
python/src/server/services/credential_service.py
Add retry loop (up to 5 attempts) with backoff for transient PGRST205 API errors; re-raise on exhaustion; cache on success.
Embedding Schema Support & Storage
python/src/server/services/storage/embedding_schema_support.py, python/src/server/services/storage/document_storage_service.py
New module to detect per-dimension embedding columns and track multi-dim support; document storage now uses upsert(..., on_conflict="url,chunk_number"), consults schema helpers, and retries/falls back to legacy embedding column on schema errors.
Code Storage Async Boundary
python/src/server/services/storage/code_storage_service.py
generate_code_example_summary changed from a synchronous wrapper using asyncio.run to async def that awaits the async implementation (callers must now await).
Crawling Content-Type Validation
python/src/server/services/crawling/discovery_service.py
Add EXPECTED_CONTENT_TYPES mapping and validate response Content-Type for certain extensions to detect soft-404s.
Config Validation
python/src/server/config/config.py
Expand Supabase URL local validation to accept hostnames ending with _supabase in addition to existing local patterns.
Agents API Rename Usage
python/src/agents/document_agent.py
Use output_type=DocumentOperation when constructing Agent instead of result_type.
Version & Docs
CHANGELOG.md, README.md, archon-ui-main/package.json, python/pyproject.toml, python/src/agent_work_orders/__init__.py, python/src/agent_work_orders/server.py, python/src/server/config/version.py
Add changelog and bump versions to 0.1.1 across UI and Python packages; update README release line and service responses.

Sequence Diagram(s)

sequenceDiagram
    participant App as DocumentStorageService
    participant EmbedSupport as EmbeddingSchemaSupport
    participant DB as SupabaseDB
    participant RetryLogic as Retry/Fallback

    App->>EmbedSupport: determine_embedding_column(dimension)
    EmbedSupport-->>App: chosen column (per-dim or legacy)
    App->>DB: upsert(batch_data with chosen column)
    DB-->>App: error (missing per-dim column)
    App->>RetryLogic: should_retry_with_legacy_column(error, records)
    RetryLogic->>EmbedSupport: _convert_records_to_legacy(records)
    EmbedSupport-->>RetryLogic: records converted (in-place)
    RetryLogic-->>App: retry = true
    App->>DB: upsert(batch_data with legacy column)
    DB-->>App: success
    App->>EmbedSupport: note_multi_dim_success()
Loading
sequenceDiagram
    participant Client as MCP Client
    participant Server as MCP Server
    participant Lock as Async Init Lock
    participant Init as Services Init

    Client->>Server: connection request
    Server->>Server: check _shared_context
    alt context exists
        Server-->>Client: yield cached context
    else
        Server->>Lock: acquire async lock
        Lock->>Init: initialize session manager & services
        Init-->>Server: store _shared_context
        Lock-->>Server: release lock
        Server-->>Client: yield initialized context
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

Possibly related PRs

Suggested reviewers

  • leex279
  • coleam00
  • TimothiousAI

Poem

🐇 Hopped through code and async locks tonight,

Upserts learned new columns and fell back right,
Stale keys cleared where secrets couldn't sing,
Health checks timeboxed — steady as spring,
A rabbit nods: resilient code takes flight.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The PR title accurately summarizes the main themes of the changeset: Docker deployment improvements, async fixes, and resilience enhancements across multiple services.
Description check ✅ Passed The PR description comprehensively covers summary, 13 file changes organized by category, specific technical details, test plan with checkmarks, and notes. It follows the template structure with clear sections.
Docstring Coverage ✅ Passed Docstring coverage is 89.66% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@python/src/server/services/credential_service.py`:
- Around line 156-173: The except blocks in credential_service.py currently log
errors without stack traces and leave an unreachable RuntimeError at the end;
update the two logger.error(...) calls inside the APIError and generic except
blocks to pass exc_info=True so the full traceback is preserved, and remove the
final raise RuntimeError("Failed to load credentials after retries") since
control either returns, continues, or re-raises inside the loop; keep existing
re-raise behavior for the exceptions (do not swallow them).

In `@python/src/server/services/storage/document_storage_service.py`:
- Around line 475-478: The retry loop currently uses continue when
should_retry_with_legacy_column(e, batch_data) mutates batch_data, which can
skip the actual retry if that happens on the final iteration; change the logic
so that when should_retry_with_legacy_column(...) returns True you set a boolean
flag (e.g., force_one_more_retry) and, if on the last configured retry
iteration, arrange for one additional loop iteration (for example by checking
force_one_more_retry in the loop condition or incrementing the retry limit) so
the mutated batch_data is actually retried; ensure the flag is cleared after the
forced retry so normal retry behavior resumes.
🧹 Nitpick comments (7)
python/src/mcp_server/mcp_server.py (1)

150-198: Solid async initialization pattern with one minor logging improvement.

The structure is well-designed:

  • Local ctx variable avoids race conditions between lock release and yield
  • Yielding outside the lock enables concurrent connections
  • Double-check pattern prevents redundant initialization

Per coding guidelines, prefer exc_info=True over separate traceback.format_exc() call to keep the stack trace in the same log entry.

♻️ Suggested logging improvement
             except Exception as e:
-                logger.error(f"💥 Critical error in lifespan setup: {e}")
-                logger.error(traceback.format_exc())
+                logger.error(f"💥 Critical error in lifespan setup: {e}", exc_info=True)
                 raise
migration/cleanup_stale_credentials.py (3)

42-52: Simplify exception handling.

The broad Exception catch on line 51 is redundant since InvalidToken and ValueError already cover the specific failure cases. Consider removing it or logging the unexpected exception type for debugging.

♻️ Suggested simplification
 def try_decrypt(encrypted_value: str, fernet: Fernet) -> bool:
     """Attempt to decrypt a value, return True if successful."""
     if not encrypted_value:
         return True  # Empty values are fine
 
     try:
         encrypted_bytes = base64.urlsafe_b64decode(encrypted_value.encode("utf-8"))
         fernet.decrypt(encrypted_bytes)
         return True
-    except (InvalidToken, ValueError, Exception):
+    except (InvalidToken, ValueError):
         return False

122-133: Consider adding a dry-run mode for safer operation.

The script permanently deletes credentials that can't be decrypted. While logging provides an audit trail, a --dry-run flag would allow operators to preview what would be deleted before executing, reducing risk of accidental data loss.


55-55: Add return type annotation.

The function returns int but lacks a type annotation. Per coding guidelines for Python backend, type hints should be used.

♻️ Suggested fix
-def main():
+def main() -> int:
python/src/server/main.py (1)

395-398: Consider caching unknown errors briefly.

Unknown errors return {"valid": False} without caching, which could lead to repeated database queries if the error persists. Consider caching these briefly (e.g., 5-10 seconds) to prevent query storms during database issues.

python/src/server/services/storage/embedding_schema_support.py (2)

40-47: Consider simplifying the fallback logic for clarity.

The nested .get() on line 47 is correct but slightly obscure. Since 1536 is guaranteed to exist in _DIMENSION_TO_COLUMN, the inner fallback to LEGACY_COLUMN is unreachable.

🔧 Optional simplification
-    return _DIMENSION_TO_COLUMN.get(dimension, _DIMENSION_TO_COLUMN.get(1536, LEGACY_COLUMN))
+    return _DIMENSION_TO_COLUMN.get(dimension, "embedding_1536")

87-97: Potential data loss if a record contains multiple dimension columns.

If a record somehow has more than one dimension column (e.g., both embedding_768 and embedding_1536), the current logic pops all of them but only preserves the first one via setdefault. Subsequent embeddings are silently discarded.

In practice, records should only have one embedding column, but adding a break after conversion would make the intent explicit and prevent accidental data loss.

🔧 Suggested defensive improvement
     for record in records:
         for column in DIMENSION_COLUMNS:
             if column in record:
                 # Do not clobber an existing legacy value if the caller already set it.
                 record.setdefault(LEGACY_COLUMN, record.pop(column))
                 converted = True
+                break  # Only one dimension column expected per record
     return converted
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ecaece4 and 8274043.

📒 Files selected for processing (11)
  • archon-ui-main/src/App.tsx
  • migration/cleanup_stale_credentials.py
  • migration/complete_setup.sql
  • python/src/mcp_server/mcp_server.py
  • python/src/server/api_routes/knowledge_api.py
  • python/src/server/config/config.py
  • python/src/server/main.py
  • python/src/server/services/credential_service.py
  • python/src/server/services/storage/code_storage_service.py
  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/services/storage/embedding_schema_support.py
🧰 Additional context used
📓 Path-based instructions (8)
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: Fail fast and loud for service startup failures, missing configuration, database connection failures, authentication/authorization failures, data corruption, critical dependency unavailability, and invalid data that would corrupt state
Never accept corrupted data - skip failed items entirely rather than storing corrupted or incomplete data

Files:

  • python/src/server/services/credential_service.py
  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/config/config.py
  • python/src/server/services/storage/code_storage_service.py
  • python/src/server/main.py
  • python/src/server/services/storage/embedding_schema_support.py
  • python/src/server/api_routes/knowledge_api.py
python/src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration

python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety

Files:

  • python/src/server/services/credential_service.py
  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/config/config.py
  • python/src/server/services/storage/code_storage_service.py
  • python/src/server/main.py
  • python/src/mcp_server/mcp_server.py
  • python/src/server/services/storage/embedding_schema_support.py
  • python/src/server/api_routes/knowledge_api.py
python/src/server/services/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

python/src/server/services/**/*.py: For batch processing and background tasks, complete what you can and report detailed failures for each item rather than crashing
Never accept corrupted data - skip failed items entirely rather than storing corrupted data in batch operations

Files:

  • python/src/server/services/credential_service.py
  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/services/storage/code_storage_service.py
  • python/src/server/services/storage/embedding_schema_support.py
python/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations

Files:

  • python/src/server/services/credential_service.py
  • python/src/server/services/storage/document_storage_service.py
  • python/src/server/config/config.py
  • python/src/server/services/storage/code_storage_service.py
  • python/src/server/main.py
  • python/src/mcp_server/mcp_server.py
  • python/src/server/services/storage/embedding_schema_support.py
  • python/src/server/api_routes/knowledge_api.py
python/src/server/main.py

📄 CodeRabbit inference engine (AGENTS.md)

Service startup failures, missing configuration, database connection failures, and authentication/authorization failures should crash with clear errors

Files:

  • python/src/server/main.py
python/src/server/api_routes/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/api_routes/**/*.py: Follow the API route → Service → Database pattern for new API endpoints
Use database values directly without frontend mapping to maintain type-safe end-to-end communication from backend upward

Files:

  • python/src/server/api_routes/knowledge_api.py
archon-ui-main/src/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

archon-ui-main/src/**/*.{ts,tsx}: Use TypeScript in strict mode with no implicit any in frontend code
Do not use dynamic Tailwind class construction in frontend components

Files:

  • archon-ui-main/src/App.tsx
archon-ui-main/src/**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (AGENTS.md)

Use ESLint for legacy frontend code outside /src/features directory with standard React rules

Files:

  • archon-ui-main/src/App.tsx
🧠 Learnings (13)
📚 Learning: 2025-08-20T19:38:04.097Z
Learnt from: Chillbruhhh
Repo: coleam00/Archon PR: 378
File: python/src/server/services/storage/document_storage_service.py:304-306
Timestamp: 2025-08-20T19:38:04.097Z
Learning: The archon_crawled_pages table in the Archon project has a table-level unique constraint on (url, chunk_number) defined inline in the CREATE TABLE statement in migration/complete_setup.sql at line 202, which allows upsert operations with on_conflict="url,chunk_number" to work properly without requiring additional migrations.

Applied to files:

  • python/src/server/services/storage/document_storage_service.py
  • migration/complete_setup.sql
📚 Learning: 2025-08-20T19:38:04.097Z
Learnt from: Chillbruhhh
Repo: coleam00/Archon PR: 378
File: python/src/server/services/storage/document_storage_service.py:304-306
Timestamp: 2025-08-20T19:38:04.097Z
Learning: The archon_crawled_pages table in the Archon project has a table-level unique constraint on (url, chunk_number) defined inline in the CREATE TABLE statement in migration/complete_setup.sql at line 202, which allows upsert operations with on_conflict="url,chunk_number" to work properly.

Applied to files:

  • python/src/server/services/storage/document_storage_service.py
  • migration/complete_setup.sql
📚 Learning: 2025-11-29T19:24:46.265Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: archon-example-workflow/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:46.265Z
Learning: Check if Archon MCP server is available before doing anything else in task management scenarios

Applied to files:

  • python/src/mcp_server/mcp_server.py
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to archon-ui-main/src/!(features)/**/*.{ts,tsx} : Use ESLint with standard React rules for legacy code in the frontend

Applied to files:

  • archon-ui-main/src/App.tsx
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to archon-ui-main/src/features/**/components/**/*.{ts,tsx} : Use Radix UI primitives from src/features/ui/primitives/ for new UI components in features directory

Applied to files:

  • archon-ui-main/src/App.tsx
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to archon-ui-main/src/**/*.{ts,tsx,js,jsx} : Use ESLint for legacy frontend code outside /src/features directory with standard React rules

Applied to files:

  • archon-ui-main/src/App.tsx
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to archon-ui-main/src/features/**/*.{ts,tsx} : TypeScript strict mode enabled with no implicit any, and use Biome with 120 character lines and double quotes in /src/features directory

Applied to files:

  • archon-ui-main/src/App.tsx
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to archon-ui-main/src/features/**/components/**/*.{ts,tsx} : Prefer using Radix UI primitives from src/features/ui/primitives/ when creating UI components

Applied to files:

  • archon-ui-main/src/App.tsx
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to archon-ui-main/src/features/**/*.{ts,tsx} : Apply Tron-inspired glassmorphism styling with Tailwind and follow responsive design patterns with mobile-first approach

Applied to files:

  • archon-ui-main/src/App.tsx
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to archon-ui-main/src/features/**/components/**/*.{ts,tsx} : Apply Tron-inspired glassmorphism styling with Tailwind in features directory UI components

Applied to files:

  • archon-ui-main/src/App.tsx
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to archon-ui-main/**/*.test.{ts,tsx} : Use Vitest with React Testing Library for frontend testing

Applied to files:

  • archon-ui-main/src/App.tsx
📚 Learning: 2025-09-04T16:30:05.227Z
Learnt from: stevepresley
Repo: coleam00/Archon PR: 573
File: archon-ui-main/src/config/api.ts:15-25
Timestamp: 2025-09-04T16:30:05.227Z
Learning: Archon UI API config: Prefer lazy getters getApiFullUrl() and getWsUrl() over module-load constants to avoid SSR/test crashes. Avoid CommonJS exports patterns (Object.defineProperty(exports,…)) in ESM. Add typeof window guards with VITE_API_URL fallback inside getApiUrl()/getWebSocketUrl() when SSR safety is required.

Applied to files:

  • archon-ui-main/src/App.tsx
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to archon-ui-main/src/**/*.{test,spec}.{ts,tsx} : Use Vitest with React Testing Library for frontend testing

Applied to files:

  • archon-ui-main/src/App.tsx
🧬 Code graph analysis (4)
python/src/server/services/credential_service.py (1)
python/src/server/utils/progress/progress_tracker.py (1)
  • error (196-219)
python/src/server/services/storage/document_storage_service.py (1)
python/src/server/services/storage/embedding_schema_support.py (4)
  • determine_embedding_column (40-47)
  • legacy_column_in_use (100-103)
  • note_multi_dim_success (50-55)
  • should_retry_with_legacy_column (58-79)
python/src/mcp_server/mcp_server.py (1)
python/src/server/utils/progress/progress_tracker.py (1)
  • error (196-219)
python/src/server/api_routes/knowledge_api.py (1)
python/src/server/main.py (2)
  • main (404-423)
  • _check_database_schema_cached (297-336)
🔇 Additional comments (22)
python/src/server/config/config.py (1)

139-146: LGTM! Secure patterns for Docker Compose compatibility.

The hostname allowlisting is secure:

  • .localhost is a reserved TLD (RFC 6761) that cannot be registered publicly.
  • _supabase suffix is safe because underscores are invalid in DNS hostnames (RFC 1123), meaning only Docker's internal DNS can resolve such names—no subdomain bypass risk from public domains.
python/src/mcp_server/mcp_server.py (3)

17-17: LGTM!

Necessary import for the asyncio.Lock used in the initialization logic.


67-72: Well-documented lock selection.

The comment clearly explains the tradeoff: asyncio.Lock is appropriate for single Uvicorn worker and avoids blocking the event loop, while noting the constraint for multi-worker deployments. This is the correct choice for FastMCP's intended deployment model.


144-148: LGTM!

The quick-path optimization is correct. The explicit is not None check and the double-check pattern inside the lock (line 156) ensure thread-safety in the async context.

python/src/server/services/credential_service.py (2)

8-8: LGTM!

The imports for asyncio and APIError are correctly added to support the new retry logic with exponential backoff.

Also applies to: 21-21


138-154: LGTM!

The success path correctly processes credentials and updates the cache. The fail-fast behavior on corrupted data (e.g., missing key field) is appropriate for service startup per coding guidelines. The logging of loaded credential count provides good observability.

archon-ui-main/src/App.tsx (1)

105-105: LGTM! Good forward-compatibility preparation.

Adding the React Router v7 future flags (v7_startTransition, v7_relativeSplatPath) is a best practice for incremental migration. This enables the new behaviors early, allowing you to catch any issues before the v7 release.

migration/complete_setup.sql (2)

91-107: LGTM! Correct idempotent INSERT pattern.

Adding ON CONFLICT (key) DO NOTHING to seed data INSERTs ensures migrations can be safely re-run without duplicate key errors. This appropriately preserves user-customized values while allowing initial seeding.


51-70: Good idempotent pattern, with potential simplification for PostgreSQL 14+.

The trigger existence check with DROP/CREATE pattern ensures safe re-runs. For PostgreSQL 14+, you could simplify this using CREATE OR REPLACE TRIGGER alone, as it already handles the "replace if exists" case. The current approach is more verbose but works on older PostgreSQL versions.

python/src/server/main.py (2)

219-260: LGTM! Well-designed non-blocking health check.

The timeout-wrapped schema check prevents health endpoint timeouts when the database is slow. The fallback to "assume valid" on timeout/error is appropriate for health checks - it keeps the service running while logging the issue. The nested timeout structure (2s outer, 1.5s inner) provides appropriate headroom.


297-336: LGTM! Smart caching strategy.

The caching approach is well-designed:

  • Permanently cache valid schema (common case, no need to re-check)
  • Cache failures for 30 seconds (allow recovery without hammering DB)
  • Don't cache timeouts (allow immediate retry)

This minimizes database load while ensuring the health check remains responsive.

python/src/server/api_routes/knowledge_api.py (2)

1037-1055: LGTM! Good extensibility improvement.

Adding **extra_fields to the progress callback allows richer contextual data to flow through without changing the signature again. The PEP 604 type hint (dict | None) is appropriate for Python 3.12.


1263-1265: LGTM! Correctly updated to use async cached schema check.

The change to use _check_database_schema_cached with await is correct and maintains consistency with the main health endpoint. Both endpoints now benefit from the non-blocking, cached schema validation.

migration/cleanup_stale_credentials.py (1)

28-39: Encryption parameters are correctly synchronized with credential_service.py. All values match exactly: salt, iterations, and default key are consistent between both files.

python/src/server/services/storage/embedding_schema_support.py (3)

1-28: LGTM! Module structure and caching approach are sound.

The tri-state cache (None → undetermined, True → multi-dim supported, False → legacy only) elegantly handles runtime schema detection without repeated database introspection.


58-79: LGTM! Error-driven fallback pattern is well-designed.

The function correctly detects PostgREST column-missing errors and transparently migrates records to the legacy schema. The in-place mutation is clearly documented and avoids unnecessary copying.


82-84: LGTM!

Helper functions are minimal and correct.

Also applies to: 100-103

python/src/server/services/storage/document_storage_service.py (4)

14-19: LGTM! Clean import of schema support utilities.

The new imports consolidate embedding column logic into a dedicated module, improving maintainability.


394-401: LGTM! Good fallback with appropriate warning.

Using determine_embedding_column centralizes the logic, and the warning correctly alerts when an unexpected dimension falls back to a default column.


442-447: LGTM! Upsert prevents duplicate key errors.

The switch from insert to upsert with on_conflict="url,chunk_number" aligns with the table's unique constraint and prevents failures during re-crawls. Based on learnings, the archon_crawled_pages table has this constraint defined.


507-513: LGTM! Individual record fallback uses consistent upsert semantics.

The per-record fallback path now matches the batch upsert behavior, ensuring duplicate-key safety even when batch operations fail.

python/src/server/services/storage/code_storage_service.py (1)

576-593: Good async correction. No external callers found.

Converting to native async is the right fix—calling asyncio.run() from within an already-running event loop would raise RuntimeError. A comprehensive search of the codebase confirms generate_code_example_summary has no external callers; the only internal usage (line 1065) correctly awaits the function. The async conversion is safe and poses no breaking changes to the current codebase.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +156 to +173
except APIError as api_error:
if api_error.code == "PGRST205" and attempt < retries:
logger.warning(
"Supabase schema cache not ready (attempt %s/%s); retrying in %ss",
attempt,
retries,
delay,
)
await asyncio.sleep(delay)
delay = min(delay * 2, 10)
continue
logger.error(f"Error loading credentials: {api_error}")
raise
except Exception as e:
logger.error(f"Error loading credentials: {e}")
raise

except Exception as e:
logger.error(f"Error loading credentials: {e}")
raise
raise RuntimeError("Failed to load credentials after retries")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n python/src/server/services/credential_service.py | sed -n '130,180p'

Repository: coleam00/Archon

Length of output: 2484


🏁 Script executed:

fd credential_service.py | head -5

Repository: coleam00/Archon

Length of output: 153


🏁 Script executed:

head -50 python/src/server/services/credential_service.py

Repository: coleam00/Archon

Length of output: 1362


🌐 Web query:

postgrest-py APIError code attribute type

💡 Result:

Short answer: APIError.code is not a fixed Python type — it's whatever value PostgREST returns in the JSON error payload (so usually a string, but can be an int in some responses). The postgrest-py APIError is raised with the raw r.json() dict, so check the runtime type before using it. [1][2][3]

Sources

  • postgrest-py docs (execute() raises APIError with the response JSON). [1]
  • Issue/trace showing APIError containing SQLSTATE-style string code ('42804'). [2]
  • Discussion showing APIError containing a numeric code (204) in some cases. [3]

Unreachable code and missing stack trace preservation in error logging.

This code has two issues per the coding guidelines:

  1. Line 173 is unreachable: The loop always exits via return (success on line 154), raise (APIError on final attempt or line 168, or other exceptions on line 171), or continue (retry). The RuntimeError on line 173 can never be executed and should be removed.

  2. Lines 167 and 170 missing exc_info=True: Per coding guidelines, error logging must preserve full stack traces for debugging. Both logger.error() calls need exc_info=True.

Proposed fix
             except APIError as api_error:
                 if api_error.code == "PGRST205" and attempt < retries:
                     logger.warning(
                         "Supabase schema cache not ready (attempt %s/%s); retrying in %ss",
                         attempt,
                         retries,
                         delay,
                     )
                     await asyncio.sleep(delay)
                     delay = min(delay * 2, 10)
                     continue
-                logger.error(f"Error loading credentials: {api_error}")
+                logger.error(f"Error loading credentials after {attempt} attempts: {api_error}", exc_info=True)
                 raise
             except Exception as e:
-                logger.error(f"Error loading credentials: {e}")
+                logger.error(f"Error loading credentials: {e}", exc_info=True)
                 raise
-
-        raise RuntimeError("Failed to load credentials after retries")
🤖 Prompt for AI Agents
In `@python/src/server/services/credential_service.py` around lines 156 - 173, The
except blocks in credential_service.py currently log errors without stack traces
and leave an unreachable RuntimeError at the end; update the two
logger.error(...) calls inside the APIError and generic except blocks to pass
exc_info=True so the full traceback is preserved, and remove the final raise
RuntimeError("Failed to load credentials after retries") since control either
returns, continues, or re-raises inside the loop; keep existing re-raise
behavior for the exceptions (do not swallow them).

Comment on lines 475 to +478
except Exception as e:
if should_retry_with_legacy_column(e, batch_data):
retry_delay = 1.0
continue
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Bug: Legacy column fallback may skip the actual retry on the last attempt.

If should_retry_with_legacy_column returns True during the final retry iteration (retry=2), the continue statement exits the current iteration, but the for loop ends without executing another iteration. The records are converted to use the legacy column but never actually retried.

Example scenario:

  1. retry=2 (last attempt) → upsert fails with "column not found"
  2. should_retry_with_legacy_column(e, batch_data) returns True, mutates batch_data
  3. continue → loop ends (no retry=3 exists)
  4. Converted records are never persisted
🐛 Suggested fix using a flag to force one more retry
             max_retries = 3
             retry_delay = 1.0
+            force_legacy_retry = False

-            for retry in range(max_retries):
+            retry = 0
+            while retry < max_retries or force_legacy_retry:
+                force_legacy_retry = False  # Reset after consuming the forced retry
                 # Check for cancellation before each retry attempt
                 if cancellation_check:
                     try:
                         cancellation_check()
                     except asyncio.CancelledError:
                         # ... existing cancellation handling ...
                         raise

                 try:
                     client.table("archon_crawled_pages").upsert(
                         batch_data,
                         on_conflict="url,chunk_number"
                     ).execute()
                     if not legacy_column_in_use():
                         note_multi_dim_success()
                     total_chunks_stored += len(batch_data)
                     # ... rest of success handling ...
                     break

                 except Exception as e:
                     if should_retry_with_legacy_column(e, batch_data):
                         retry_delay = 1.0
+                        force_legacy_retry = True
+                        retry += 1
                         continue
-                    if retry < max_retries - 1:
+                    elif retry < max_retries - 1:
                         search_logger.warning(
                             f"Error inserting batch (attempt {retry + 1}/{max_retries}): {e}"
                         )
                         await asyncio.sleep(retry_delay)
                         retry_delay *= 2  # Exponential backoff
+                        retry += 1
                     else:
                         # ... existing fallback to individual inserts ...
+                        break
🤖 Prompt for AI Agents
In `@python/src/server/services/storage/document_storage_service.py` around lines
475 - 478, The retry loop currently uses continue when
should_retry_with_legacy_column(e, batch_data) mutates batch_data, which can
skip the actual retry if that happens on the final iteration; change the logic
so that when should_retry_with_legacy_column(...) returns True you set a boolean
flag (e.g., force_one_more_retry) and, if on the last configured retry
iteration, arrange for one additional loop iteration (for example by checking
force_one_more_retry in the loop condition or incrementing the retry limit) so
the mutated batch_data is actually retried; ensure the flag is cleared after the
forced retry so normal retry behavior resumes.

This PR consolidates several improvements for Docker-based deployments:

## Non-blocking Health Check (main.py)
- Health endpoint now uses asyncio.wait_for with 2s timeout
- Prevents health checks from hanging when database is slow
- Gracefully degrades with timeout/error flags instead of failing

## Docker Container URL Validation (config.py)
- Allow hostnames ending with _supabase (e.g., supabase_kong_supabase)
- Fixes URL validation for Docker Compose Supabase deployments
- Maintains security by only allowing known container naming patterns

## MCP Server Async Improvements (mcp_server.py)
- Changed threading.Lock to asyncio.Lock to avoid blocking event loop
- Improved double-check pattern with proper null safety
- Better documentation of thread safety considerations

## Credential Service Resilience (credential_service.py)
- Added retry logic with exponential backoff for PGRST205 errors
- Handles Supabase schema cache not ready during startup
- Prevents startup failures when Supabase is still initializing

## Document Storage Improvements (document_storage_service.py)
- Changed insert to upsert with on_conflict="url,chunk_number"
- Prevents duplicate key errors when re-crawling same URLs
- Added embedding schema fallback support for mixed deployments

## New Files
- embedding_schema_support.py: Schema compatibility helpers for embedding columns
- cleanup_stale_credentials.py: Utility to remove undecryptable credentials
  after SUPABASE_SERVICE_KEY rotation
## Additional Changes (from audit)

### knowledge_api.py
- Extensible progress callback with **extra_fields parameter
- Uses cached schema check function (_check_database_schema_cached)
- PEP 604 type hints for better code quality

### code_storage_service.py
- Converted generate_code_example_summary from sync to async
- Eliminates asyncio.run() which can cause event loop conflicts
- Function not called externally, safe breaking change

### complete_setup.sql
- Idempotent trigger creation (DROP IF EXISTS before CREATE)
- Added ON CONFLICT DO NOTHING to INSERT statements
- Makes migrations safely re-runnable

### App.tsx
- Added React Router v7 future flags (v7_startTransition, v7_relativeSplatPath)
- Prepares codebase for Router v7 migration
- Valid flags for React Router 6.26.2
@steveant steveant force-pushed the fix/docker-deployment-improvements branch from 8274043 to 29ac5e1 Compare January 15, 2026 03:01
steveant and others added 2 commits January 14, 2026 22:06
Rename `result_type` to `output_type` in DocumentAgent to fix
compatibility with pydantic-ai 1.0.x.

The Agent class renamed this parameter in the 1.0 release.
Servers sometimes return HTTP 200 with text/html for missing files
instead of a proper 404 status. This caused the crawler to incorrectly
detect llms.txt files that don't exist.

Added content-type validation to _check_url_exists() that verifies:
- .txt/.md files return text/plain or text/markdown (not text/html)
- .xml files return text/xml or application/xml variants

When a soft 404 is detected, the method now returns False and logs
an informative message, allowing the crawler to fall back to HTML
crawling correctly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@Wirasm
Copy link
Copy Markdown
Collaborator

Wirasm commented Apr 6, 2026

🔄 This repository is being replaced by a new version of Archon.

The original Python/MCP codebase is being archived to the archive/v1-python-mcp branch. The new Archon (TypeScript workflow engine for AI coding agents) is replacing it.

This PR is being closed as part of the migration. Thank you for your contribution!

@Wirasm Wirasm closed this Apr 6, 2026
coleam00 pushed a commit that referenced this pull request Apr 7, 2026
…-587

fix(git,isolation): worktree lifecycle gaps — ENOENT distinction, corruption warning, orphan cleanup
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request Apr 16, 2026
…ix-issue-587

fix(git,isolation): worktree lifecycle gaps — ENOENT distinction, corruption warning, orphan cleanup
joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…ix-issue-587

fix(git,isolation): worktree lifecycle gaps — ENOENT distinction, corruption warning, orphan cleanup
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants