fix(mcp): resolve SSE connection deadlock with asyncio.Lock #909
steveant wants to merge 1 commit into coleam00:main from
Conversation
Replace threading.Lock with asyncio.Lock in the lifespan manager to prevent event loop blocking when multiple SSE connections are made concurrently.

Problem:
- threading.Lock blocks the entire event loop thread
- Yielding while holding the lock prevented other connections
- Server became unresponsive after 2+ concurrent SSE sessions

Solution:
- Use asyncio.Lock (yields to other coroutines instead of blocking)
- Yield the context outside the lock block
- Capture the context in a local variable to prevent race conditions
- Add an explicit None guard before yield

Tested with 20 concurrent connections - all passed, single initialization confirmed.
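Not the actual file contents: a minimal sketch of the pattern this commit describes, with hypothetical names (ArchonContext, build_context) standing in for the real context and initializers.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass


@dataclass
class ArchonContext:  # hypothetical stand-in for the real shared context
    ready: bool = True


_init_lock = asyncio.Lock()
_shared_context: ArchonContext | None = None
_initialized = False


def build_context() -> ArchonContext:  # hypothetical; real code builds session manager, service client, etc.
    return ArchonContext()


@asynccontextmanager
async def lifespan(_server):
    global _shared_context, _initialized

    # Fast path: skip the lock entirely once initialization has completed.
    ctx: ArchonContext | None = _shared_context if _initialized else None

    if ctx is None:
        # asyncio.Lock suspends this coroutine instead of blocking the event
        # loop thread, so other SSE connections keep making progress.
        async with _init_lock:
            # Double-check: another connection may have initialized while we waited.
            if not _initialized:
                _shared_context = build_context()
                _initialized = True
            ctx = _shared_context

    if ctx is None:
        raise RuntimeError("Shared context failed to initialize")

    # Yield OUTSIDE the lock so holding a connection open never blocks others.
    yield ctx


async def demo() -> None:
    async with lifespan(None) as ctx:
        print(ctx)


if __name__ == "__main__":
    asyncio.run(demo())
```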
Walkthrough

The initialization synchronization mechanism was refactored to replace a threading-based lock with an asyncio-compatible lock. The lifespan initialization flow now uses an async context manager with a double-check pattern, captures the shared context in a local variable, and yields it outside the lock to avoid race conditions during context access.

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12–18 minutes
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
python/src/mcp_server/mcp_server.py (1)
163-176: Synchronous calls inside async lock may delay concurrent connections.

The comments correctly note that synchronous calls to get_session_manager() and get_mcp_service_client() inside the lock don't cause deadlock (asyncio.Lock yields control), but they may delay concurrent connection attempts during initialization. For current usage this is acceptable, but if these become I/O-bound bottlenecks under high concurrent load, consider wrapping them with asyncio.to_thread() or loop.run_in_executor().

🔎 Optional optimization if initialization becomes a bottleneck
If profiling shows these sync calls are slow:
  # Initialize session manager
- # NOTE: These sync calls run inside the lock. They don't cause deadlock
- # (asyncio.Lock yields to other coroutines), but slow init here delays
- # concurrent connection attempts. Consider run_in_executor if these
- # become I/O-bound bottlenecks under load.
  logger.info("🔐 Initializing session manager...")
- session_manager = get_session_manager()
+ session_manager = await asyncio.to_thread(get_session_manager)
  logger.info("✓ Session manager initialized")

  # Initialize service client for HTTP calls
  logger.info("🌐 Initializing service client...")
- service_client = get_mcp_service_client()
+ service_client = await asyncio.to_thread(get_mcp_service_client)
  logger.info("✓ Service client initialized")

Note: Only apply if these functions are I/O-bound and measurements show they're bottlenecks.
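For context, a standalone sketch of the asyncio.to_thread approach the suggestion uses; slow_init is a hypothetical stand-in for a blocking initializer such as get_session_manager().

```python
import asyncio
import time


def slow_init() -> dict:
    # Hypothetical blocking initializer standing in for get_session_manager().
    time.sleep(0.5)
    return {"ready": True}


async def lifespan_init() -> dict:
    # Run the blocking call in a worker thread so the event loop stays free
    # to service other connections while initialization is in progress.
    return await asyncio.to_thread(slow_init)


if __name__ == "__main__":
    print(asyncio.run(lifespan_init()))
```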
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
python/src/mcp_server/mcp_server.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
python/src/mcp_server/mcp_server.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/src/mcp_server/mcp_server.py
🧠 Learnings (1)
📚 Learning: 2025-11-29T19:24:46.265Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: archon-example-workflow/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:46.265Z
Learning: Check if Archon MCP server is available before doing anything else in task management scenarios
Applied to files:
python/src/mcp_server/mcp_server.py
🧬 Code graph analysis (1)
python/src/mcp_server/mcp_server.py (1)
python/src/server/utils/progress/progress_tracker.py (1)
error(196-219)
🔇 Additional comments (6)
python/src/mcp_server/mcp_server.py (6)
17-17: LGTM: Asyncio import added correctly.

The asyncio import is necessary for asyncio.Lock() usage and is properly placed with other standard library imports.
144-148: LGTM: Optimistic check pattern is efficient.

The quick check without acquiring the lock (lines 144-148) optimizes the common case where initialization is already complete, allowing subsequent SSE connections to proceed immediately. The explicit is not None check is good defensive programming.
150-158: LGTM: Double-check pattern correctly implemented.

The double-check locking pattern is properly implemented:
- Local variable ctx avoids race conditions between lock release and yield
- Type hint ArchonContext | None is explicit
- Async lock context manager is correct
- Double-check inside lock prevents duplicate initialization
185-192: LGTM: Atomic assignment and comprehensive error handling.

The context is assigned to _shared_context last (line 186) for atomicity, and the flag is set after successful initialization (line 187). Error handling properly logs the exception with full traceback per coding guidelines.
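Tangential to the hunk, a hedged illustration of the error-handling style the repo guidelines ask for (context about what was attempted, exc_info=True, specific exception types); the initializer and exception names here are made up, not the PR's actual code.

```python
import logging

logger = logging.getLogger("mcp_server.lifespan")


class ContextInitError(RuntimeError):
    """Raised when the shared MCP context cannot be built."""


def build_context():  # hypothetical initializer, not the PR's actual code
    raise ConnectionError("backend service unreachable")


try:
    ctx = build_context()
except ConnectionError as exc:
    # Say what was being attempted and preserve the full stack trace.
    logger.error("Lifespan init failed while building the shared MCP context", exc_info=True)
    raise ContextInitError("shared context initialization failed") from exc
```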
194-203: LGTM: Yielding outside lock prevents blocking (the key fix for the deadlock).

This is the critical fix: yielding the context outside the lock (line 200) ensures other concurrent connections aren't blocked while this connection is active. The None guard (lines 197-198) is good defensive programming, though unlikely to trigger in normal operation since initialization either succeeds or raises.
67-70: Excellent fix addressing event-loop blocking with clear documentation.

The change from threading.Lock to asyncio.Lock correctly resolves the event-loop blocking issue that caused SSE connection deadlocks. The inline documentation clearly explains the event-loop binding constraint and deployment assumptions.

Single-worker deployment is enforced by default: neither the Dockerfile.mcp nor docker-compose.yml configuration specifies worker processes, so Uvicorn runs with a single worker. The code comment provides clear guidance for any future multi-worker scenarios.
- Add supabase-network to archon-agent-work-orders service (required for local Supabase CLI connectivity)
- Fix MCP SSE deadlock by using asyncio.Lock instead of threading.Lock (PR coleam00#909 submitted upstream)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Superseded by #928, which consolidates this fix along with other improvements.
Summary
Replace threading.Lock with asyncio.Lock in the MCP server lifespan manager to prevent event loop blocking when multiple SSE connections are made concurrently.

Problem
The original code used threading.Lock, which blocks the entire event loop thread. When yielding while holding the lock, other connections could not proceed, causing the server to become unresponsive after 2+ concurrent SSE sessions.
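A hedged, self-contained repro of this failure mode (not the server code): the second coroutine's blocking lock.acquire() freezes the event loop thread, so the first coroutine can never resume and release the lock. The watchdog thread exists only to make the hang visible and exit.

```python
import asyncio
import os
import threading
import time

lock = threading.Lock()


async def connection(i: int) -> None:
    with lock:  # old pattern: hold a thread lock across an await/yield point
        print(f"connection {i}: acquired lock")
        await asyncio.sleep(0.1)
    print(f"connection {i}: released lock")


def watchdog() -> None:
    time.sleep(2)
    print("deadlock: event loop is stuck inside lock.acquire()")
    os._exit(1)


async def main() -> None:
    # connection(2) calls lock.acquire() synchronously, blocking the whole
    # event loop thread while connection(1) still holds the lock.
    await asyncio.gather(connection(1), connection(2))


threading.Thread(target=watchdog, daemon=True).start()
asyncio.run(main())
```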
Solution

- Use asyncio.Lock (yields to other coroutines instead of blocking)
- Yield the context outside the lock block
- Capture the context in a local variable to prevent race conditions
- Add an explicit None guard before yield
Testing

Tested with 20 concurrent connections: all passed, single initialization confirmed.
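A sketch of how such a check could look (not the author's actual test harness): open many "connections" concurrently against an asyncio.Lock-guarded initializer and assert initialization ran exactly once.

```python
import asyncio

init_count = 0
init_lock = asyncio.Lock()
initialized = False


async def connect() -> None:
    global init_count, initialized
    if not initialized:
        async with init_lock:
            if not initialized:
                await asyncio.sleep(0.05)  # simulate slow initialization
                init_count += 1
                initialized = True
    await asyncio.sleep(0.01)  # simulate an active SSE session


async def main() -> None:
    await asyncio.gather(*(connect() for _ in range(20)))
    assert init_count == 1, init_count
    print("20 concurrent connections, initialized once")


asyncio.run(main())
```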
Notes
asyncio.Lock is event-loop bound, which is safe for a single Uvicorn worker (FastMCP's intended deployment)

Summary by CodeRabbit