
Remove enable_docket setting; Docket is now always on #2558

Merged

chrisguidry merged 12 commits into main from unify-enable-tasks on Dec 5, 2025

Conversation

@chrisguidry
Collaborator

Simplifies configuration by removing the enable_docket flag. Every FastMCP server now starts a Docket instance and Worker automatically. Only enable_tasks remains to control SEP-1686 task protocol support.

# Before: two settings needed
FASTMCP_ENABLE_DOCKET=true
FASTMCP_ENABLE_TASKS=true

# After: just one
FASTMCP_ENABLE_TASKS=true

Making Docket always-on uncovered some startup timing issues. The proxy test fixture was racing against Docket startup (~140ms), so we added a wait_until_ready() method backed by an asyncio.Event that signals when the server lifespan is fully initialized. Also updated the mount lifespan test expectation—with Docket keeping lifespans alive, mounted sub-servers no longer re-enter their lifespan on each proxy call.
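For illustration, the readiness pattern described above looks roughly like this (a simplified sketch with hypothetical names such as ExampleServer, not the actual FastMCP implementation):

import asyncio
from contextlib import asynccontextmanager


class ExampleServer:
    def __init__(self) -> None:
        self._started = asyncio.Event()

    @asynccontextmanager
    async def _lifespan(self):
        # ... start Docket, the Worker, and any mounted sub-server lifespans ...
        self._started.set()  # signal that startup is complete
        try:
            yield
        finally:
            self._started.clear()  # waiters only see "ready" while running

    async def wait_until_ready(self) -> None:
        """Block until the server lifespan is fully initialized."""
        await self._started.wait()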

🤖 Generated with Claude Code

@marvin-context-protocol Bot added the enhancement, breaking change, and server labels on Dec 5, 2025
@coderabbitai
Contributor

coderabbitai Bot commented Dec 5, 2025

Warning

Rate limit exceeded

@chrisguidry has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 20 minutes and 16 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between faa897b and e63855c.

📒 Files selected for processing (1)
  • src/fastmcp/utilities/tests.py (1 hunks)

Walkthrough

Removed FASTMCP_ENABLE_DOCKET and the Settings.enable_docket field. Renamed check_docket_enabled() to check_distributed_backend(), which now validates a non-memory distributed backend URL. Server adds an asyncio.Event (_started) for readiness, changes the docket lifespan signature, makes the worker lifecycle cancelable, registers tools/prompts/resources/templates directly from internal managers, and manages current server/docket via ContextVars. Dependencies introduce InMemoryProgress as a fallback and raise errors when background-task features are used outside a running FastMCP server context. Documentation and examples remove the Docket enablement configuration.
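As a generic illustration of the ContextVar approach mentioned above (the names here are hypothetical, not the actual fastmcp symbols):

from contextvars import ContextVar
from typing import Any

_current_server: ContextVar[Any] = ContextVar("current_server", default=None)


def get_current_server() -> Any:
    """Return the server bound to the current context, or raise if none is running."""
    server = _current_server.get()
    if server is None:
        raise RuntimeError("No FastMCP server is running in this context")
    return server


def set_current_server(server: Any):
    # Returns a Token so callers can restore the previous value on exit.
    return _current_server.set(server)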

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Description check (⚠️ Warning): The description covers the main change, provides before/after examples, explains timing issues discovered, and mentions the AI tool used. However, it lacks the completed Contributors Checklist items required by the template. Resolution: Complete the Contributors Checklist by checking all applicable boxes (especially verifying issue closure, testing, and documentation updates).
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 64.71%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The title accurately describes the main change: removing the enable_docket setting and making Docket always-on by default.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@marvin-context-protocol
Contributor

Test Failure Analysis

Summary: Three logging tests are failing on Windows because Docket initialization now takes longer than the test's 0.01 second sleep delay.

Root Cause: The PR makes Docket always-on by removing the enable_docket setting. Previously, when enable_docket=False, the _docket_lifespan would pass through immediately. Now it always:

  1. Creates a Docket instance
  2. Registers all tools/prompts/resources/templates with Docket
  3. Starts a Worker instance
  4. Starts the worker's run_forever() task

On Windows, this initialization takes longer than 0.01 seconds, so the tests that check if uvicorn.Config was called immediately after await asyncio.sleep(0.01) fail with:

AssertionError: Expected 'Config' to have been called once. Called 0 times.

Suggested Solution: Update the three failing tests in tests/server/test_logging.py to use the new wait_until_ready() method instead of asyncio.sleep(0.01):

# Replace this:
server_task = asyncio.create_task(
    mcp_server.run_http_async(log_level=test_log_level, port=8003)
)
await asyncio.sleep(0.01)

# With this:
server_task = asyncio.create_task(
    mcp_server.run_http_async(log_level=test_log_level, port=8003)
)
await mcp_server.wait_until_ready()

The PR already introduced wait_until_ready() at src/fastmcp/server/server.py:523 for exactly this purpose—waiting until the server lifespan is fully initialized. Apply this change to all three failing tests:

  • test_uvicorn_logging_default_level (line 36-39)
  • test_uvicorn_logging_with_custom_log_config (line 94-99)
  • test_uvicorn_logging_custom_log_config_overrides_log_level_param (line 155-162)
Detailed Analysis

Failed Tests (Windows only):

  1. test_uvicorn_logging_default_level
  2. test_uvicorn_logging_with_custom_log_config
  3. test_uvicorn_logging_custom_log_config_overrides_log_level_param

Error Message:

AssertionError: Expected 'Config' to have been called once. Called 0 times.
  at C:\hostedtoolcache\windows\Python\3.10.11\x64\lib\unittest\mock.py:908

Why It Happens Now:
The PR removes the enable_docket flag and makes Docket always-on. The _docket_lifespan method (src/fastmcp/server/server.py:379-486) now always executes its full initialization sequence instead of passing through immediately when disabled. This adds ~140ms startup time on Windows.

Why wait_until_ready() Fixes It:
The method waits for the _started event, which is set after all mounted server lifespans are entered (line 514 in the PR). This ensures the server is fully initialized before the test checks if uvicorn.Config was called.

Related Files
  • tests/server/test_logging.py: Contains the three failing tests
  • src/fastmcp/server/server.py:523-529: The wait_until_ready() method implementation
  • src/fastmcp/server/server.py:379-486: The _docket_lifespan method that now always executes
  • tests/server/proxy/test_stateful_proxy_client.py:61: Example of wait_until_ready() usage added in this PR

@marvin-context-protocol
Contributor

Test Failure Analysis (Updated)

Summary: Two OAuth tests are failing with httpx.ConnectError because the server isn't fully ready when tests try to connect.

Root Cause: The PR makes Docket always-on, which adds initialization time (~140ms). The run_server_async fixture was updated to use await server.wait_until_ready(), but there's a subtle race condition:

  1. run_http_async is started as a background task
  2. wait_until_ready() waits for the _started event
  3. _started.set() is called after all mounted servers' lifespans are entered (line 514)
  4. BUT uvicorn hasn't started accepting connections yet

The _started event signals that the lifespan context is entered, but NOT that uvicorn is listening on the socket. Tests connect immediately after wait_until_ready() returns, before uvicorn's HTTP server is actually bound and listening.

Suggested Solution: Add a small delay after wait_until_ready() in the run_server_async utility to allow uvicorn to finish binding:

# In src/fastmcp/utilities/tests.py around line 212:
# Wait for server lifespan to be ready
await server.wait_until_ready()

# Give uvicorn a moment to bind to the socket
await asyncio.sleep(0.05)

Alternatively, implement a more robust readiness check that actually tests the HTTP endpoint, but that's more complex and the small sleep is pragmatic given that wait_until_ready() already handles the lifespan initialization.
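As a sketch of what such a readiness probe could look like (illustrative only; the PR ultimately added a similar _wait_for_port helper to src/fastmcp/utilities/tests.py, discussed in the review comments below):

import asyncio


async def wait_for_port(host: str, port: int, timeout: float = 5.0) -> None:
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    while True:
        try:
            _, writer = await asyncio.open_connection(host, port)
            writer.close()
            await writer.wait_closed()
            return
        except OSError:
            if loop.time() - start > timeout:
                raise TimeoutError(f"Port {port} on {host} not ready after {timeout}s")
            await asyncio.sleep(0.01)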

Detailed Analysis

Failed Tests:

  1. test_oauth_server_metadata_discovery in tests/client/auth/test_oauth_client.py
  2. test_metadata_route_forwards_scalekit_response in tests/server/auth/providers/test_scalekit.py

Error:

httpx.ConnectError: All connection attempts failed

Stack Trace Shows:
The error occurs when httpx tries to connect to http://127.0.0.1:<port>/.well-known/oauth-authorization-server. The connection is refused, indicating the HTTP server isn't listening yet.

Timeline:

  1. run_server_async creates a task running server.run_http_async()
  2. wait_until_ready() returns when _started event is set
  3. Test fixture yields the URL
  4. Test immediately tries to connect
  5. Connection fails because uvicorn hasn't finished starting

Why This Worked Before:
The old code used await asyncio.sleep(0.1) which gave uvicorn plenty of time to start. The new wait_until_ready() is faster (good!) but doesn't account for uvicorn's socket binding time.

Related Files
  • src/fastmcp/utilities/tests.py:212-213: Where wait_until_ready() is called
  • src/fastmcp/server/server.py:514: Where _started.set() is called
  • src/fastmcp/server/server.py:707-751: The run_http_async method that enters lifespan then starts uvicorn
  • tests/client/auth/test_oauth_client.py:116: Test that fails when connecting
  • tests/server/auth/providers/test_scalekit.py: Other failing test

Note: This is a separate issue from the Windows logging test failures mentioned in my previous comment. Those still need the fix described earlier.

@marvin-context-protocol
Contributor

marvin-context-protocol Bot commented Dec 5, 2025

Test Failure Analysis

Updated (2025-12-05 17:14 UTC): Latest workflow run #19970379441 shows the same Windows worker crash pattern.

Summary

Eight pytest-xdist workers crashed on Windows during test execution, hitting the maximum crashed worker limit. This prevents the test suite from completing successfully.


Root Cause

The PR makes Docket always-on, which means every test that creates a FastMCP server now:

  1. Creates a Docket instance
  2. Starts a Worker instance
  3. Registers all tools/prompts/resources with Docket
  4. Starts the worker's run_forever() background task

On Windows with pytest-xdist running 4 parallel workers, this creates significant resource pressure. The logs show:

  • Tests initially pass (200+ passing before first crash)
  • Workers progressively crash during execution: gw0, gw1, gw2, gw3, gw4, gw5, gw6, gw7, gw8
  • Error: maximum crashed workers reached: 8
  • Crashes occur across unrelated test files

This is a resource exhaustion issue specific to Windows + parallel testing, not a functional bug.


Suggested Solution

Update .github/workflows/tests.yml to reduce Windows parallelism:

- name: Run tests (excluding integration and client_process)
  shell: bash
  run: |
    if [ "${{ runner.os }}" == "Windows" ]; then
      # Windows: reduced parallelism due to Docket overhead
      uv run pytest --inline-snapshot=disable tests -m "not integration and not client_process" --numprocesses auto --maxprocesses 2 --dist worksteal
    else
      # macOS/Linux: normal parallelism
      uv run pytest --inline-snapshot=disable tests -m "not integration and not client_process" --numprocesses auto --maxprocesses 4 --dist worksteal
    fi

This reduces parallel workers from 4 to 2 on Windows, which should prevent resource exhaustion while maintaining reasonable test performance.


Detailed Analysis

Worker Crash Timeline (Run #19970379441)

Tests run successfully for the first minute, then workers start crashing:

  • 17:13:19 (~56s): gw0 crashes
  • 17:13:29 (~66s): gw2 crashes
  • 17:13:43 (~80s): gw3 crashes
  • 17:13:48 (~85s): gw1 crashes
  • 17:13:48 (~85s): gw4 crashes
  • 17:14:01 (~98s): gw6 crashes
  • 17:14:12 (~109s): gw7 crashes
  • 17:14:23 (~120s): gw8 crashes
  • 17:14:30 (~127s): gw5 crashes, gw9 crashes - maximum reached

The pattern suggests cumulative resource buildup rather than immediate failure.

Why Windows Is Affected

  1. Process Creation Overhead: Windows has higher process/thread creation costs than Unix systems
  2. Docket Worker Background Tasks: Each test creates asyncio tasks that may not clean up immediately
  3. Parallel Test Load: 4 workers × many tests × (Docket + Worker + background task) = resource pressure
  4. Test Cleanup Timing: Background tasks may not cancel cleanly before next test starts

Why The Code Changes Are Correct

The changes themselves are sound:

  • _started event properly signals server readiness
  • Worker cleanup uses proper cancellation
  • Test fixtures correctly use async context managers

The issue is environmental (Windows + parallel testing) rather than a code defect.

Evidence from Logs

[gw0] node down: Not properly terminated
worker 'gw0' crashed while running 'tests/client/auth/test_oauth_client.py::test_unauthorized'
replacing crashed worker gw0
...
[gw2] node down: Not properly terminated
replacing crashed worker gw2
...
maximum crashed workers reached: 8

Tests that were passing successfully started failing only after workers began crashing, indicating the crashes themselves are the root issue, not the test logic.

Related Files
  • .github/workflows/tests.yml: Test workflow configuration (needs update)
  • src/fastmcp/server/server.py:379-486: _docket_lifespan method (now always executes)
  • src/fastmcp/server/server.py:502-529: Server startup and wait_until_ready() implementation
  • pyproject.toml: pytest-xdist configuration

@jlowin jlowin added this to the MCP 11/25/25 milestone Dec 5, 2025
Comment thread src/fastmcp/server/server.py Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/fastmcp/utilities/tests.py (1)

25-53: _wait_for_port logic is solid; consider minor async/style refinements

The retry and timeout behavior look correct and should eliminate the port‑readiness race without swallowing cancellation or unexpected errors. A couple of optional tweaks that might improve clarity and satisfy Ruff:

  • Cache the running loop once and reuse it, instead of calling asyncio.get_event_loop() twice per iteration:
-    start = asyncio.get_event_loop().time()
+    loop = asyncio.get_running_loop()
+    start = loop.time()
@@
-        except (OSError, asyncio.TimeoutError):
-            if asyncio.get_event_loop().time() - start > timeout:
+        except (OSError, asyncio.TimeoutError):
+            if loop.time() - start > timeout:
                raise TimeoutError(
                    f"Port {port} on {host} not available after {timeout}s"
                ) from None
            await asyncio.sleep(interval)
  • If you want to clear TRY300, you can move the await asyncio.sleep(interval) into an else branch of the timeout check:
-        except (OSError, asyncio.TimeoutError):
-            if loop.time() - start > timeout:
-                raise TimeoutError(
-                    f"Port {port} on {host} not available after {timeout}s"
-                ) from None
-            await asyncio.sleep(interval)
+        except (OSError, asyncio.TimeoutError):
+            if loop.time() - start > timeout:
+                raise TimeoutError(
+                    f"Port {port} on {host} not available after {timeout}s"
+                ) from None
+            else:
+                await asyncio.sleep(interval)

The TRY003 hint about message length is purely stylistic; the current explicit message is fine unless your Ruff config treats it as an error.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 163dfa1 and 41bd1ed.

📒 Files selected for processing (1)
  • src/fastmcp/utilities/tests.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Python source code must use Python ≥3.10 with full type annotations
Never use bare except - be specific with exception types
Prioritize readable, understandable code - clarity over cleverness; avoid obfuscated or confusing patterns even if shorter
Follow existing patterns and maintain consistency in code organization and style

Files:

  • src/fastmcp/utilities/tests.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: jlowin/fastmcp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-04T00:17:41.238Z
Learning: Applies to tests/**/*.py : Pass FastMCP servers directly to clients for testing using in-memory transport; only use HTTP transport when explicitly testing network features
📚 Learning: 2025-12-04T00:17:41.238Z
Learnt from: CR
Repo: jlowin/fastmcp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-04T00:17:41.238Z
Learning: Applies to tests/**/*.py : Pass FastMCP servers directly to clients for testing using in-memory transport; only use HTTP transport when explicitly testing network features

Applied to files:

  • src/fastmcp/utilities/tests.py
🧬 Code graph analysis (1)
src/fastmcp/utilities/tests.py (1)
src/fastmcp/server/server.py (1)
  • wait_until_ready (523-529)
🪛 Ruff (0.14.7)
src/fastmcp/utilities/tests.py

46-46: Consider moving this statement to an else block

(TRY300)


49-51: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Run tests: Python 3.10 on ubuntu-latest
  • GitHub Check: Run tests with lowest-direct dependencies
  • GitHub Check: Run tests: Python 3.10 on windows-latest
🔇 Additional comments (1)
src/fastmcp/utilities/tests.py (1)

242-247: Readiness + port wait sequencing looks correct and should fix the race

Waiting on server.wait_until_ready() first, then probing the TCP port with _wait_for_port(host, port), is a good ordering: it aligns with the new lifespan signaling and also covers the uvicorn bind race. This should make the async HTTP tests much less flaky without changing semantics for callers of run_server_async.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 41bd1ed and fc6abec.

⛔ Files ignored due to path filters (2)
  • tests/server/proxy/test_stateful_proxy_client.py is excluded by none and included by none
  • tests/server/test_logging.py is excluded by none and included by none
📒 Files selected for processing (2)
  • src/fastmcp/server/server.py (9 hunks)
  • src/fastmcp/utilities/tests.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Python source code must use Python ≥3.10 with full type annotations
Never use bare except - be specific with exception types
Prioritize readable, understandable code - clarity over cleverness; avoid obfuscated or confusing patterns even if shorter
Follow existing patterns and maintain consistency in code organization and style

Files:

  • src/fastmcp/server/server.py
  • src/fastmcp/utilities/tests.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: jlowin/fastmcp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-04T00:17:41.238Z
Learning: Applies to tests/**/*.py : Pass FastMCP servers directly to clients for testing using in-memory transport; only use HTTP transport when explicitly testing network features
📚 Learning: 2025-12-04T00:17:41.238Z
Learnt from: CR
Repo: jlowin/fastmcp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-04T00:17:41.238Z
Learning: Applies to tests/**/*.py : Pass FastMCP servers directly to clients for testing using in-memory transport; only use HTTP transport when explicitly testing network features

Applied to files:

  • src/fastmcp/utilities/tests.py
🪛 Ruff (0.14.7)
src/fastmcp/utilities/tests.py

46-46: Consider moving this statement to an else block

(TRY300)


49-51: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Run tests: Python 3.10 on ubuntu-latest
  • GitHub Check: Run tests: Python 3.10 on windows-latest
  • GitHub Check: Run tests with lowest-direct dependencies
🔇 Additional comments (5)
src/fastmcp/utilities/tests.py (1)

242-247: Good fix for the startup race condition.

The two-step wait (lifespan readiness + port availability) addresses the timing issue mentioned in the PR objectives where the proxy test fixture raced against Docket startup.

src/fastmcp/server/server.py (4)

217-217: Clean readiness signaling implementation.

The asyncio.Event is an appropriate mechanism for signaling server readiness, and unconditional initialization ensures it's always available for waiters.


467-483: Worker cancellation pattern looks correct.

The try/except/else block with explicit cancellation ensures the worker task group is cancelled in all exit paths (normal completion, exceptions, and cancellation). This addresses potential cleanup issues with long-running background workers.


514-518: Proper event lifecycle management.

Setting _started in a try/finally block ensures the event is cleared even on exceptions, guaranteeing that waiters only observe the ready state during active lifespan progression.


410-453: The hasattr guards are necessary and correct.

Lines 412, 423, 434, and 445 check for the fn attribute before registering with Docket. This filtering is necessary because the manager collections contain both base component classes (Tool, Prompt, Resource, ResourceTemplate) without an fn attribute and function-based subclasses (FunctionTool, FunctionPrompt, FunctionResource, FunctionResourceTemplate) with an fn attribute. Components from decorators are created as function subclasses, while mounted servers can contribute either type via model_copy() operations. Only function-based components with the fn attribute can be registered with Docket for background task execution, making these guards essential.
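Roughly, the guard pattern being described is (a simplified sketch where tools stands in for a manager's component collection and register_with_docket is a hypothetical helper, not the literal code in server.py):

for tool in tools:  # items may be base Tool or FunctionTool instances
    if hasattr(tool, "fn"):
        # Only function-based components carry a callable that Docket can
        # execute as a background task; base components are skipped.
        register_with_docket(tool.fn)  # hypothetical registration helper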

Comment thread src/fastmcp/utilities/tests.py Outdated
Docket provides background task execution and is now always available
for all FastMCP servers. Only `enable_tasks` remains to control the
SEP-1686 task protocol support.

Changes:
- Remove `enable_docket` setting and related validation
- Docket/Worker lifecycle is always active in server lifespan
- CurrentDocket and CurrentWorker dependencies work without config
- Add server readiness signaling via `_started` event
- Fix test timing issues with proper port probing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
chrisguidry and others added 6 commits December 5, 2025 11:42
Lost during rebase conflict resolution.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
When pytest-xdist terminates worker processes on Windows, the event loop
may close before cleanup can complete. Adding a 2-second timeout to the
worker cancellation prevents indefinite hanging.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
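A rough sketch of the bounded cancellation this commit describes (illustrative only; the actual cleanup code in the server lifespan differs):

import asyncio
from contextlib import suppress


async def stop_worker(worker_task: asyncio.Task) -> None:
    worker_task.cancel()
    # Give the worker up to 2 seconds to unwind; don't hang if the event
    # loop is already being torn down (e.g. pytest-xdist on Windows).
    with suppress(asyncio.CancelledError, asyncio.TimeoutError):
        await asyncio.wait_for(worker_task, timeout=2)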
Windows ProactorEventLoop has known memory corruption issues that cause
pytest-xdist worker crashes with "node down: Not properly terminated".

Setting WindowsSelectorEventLoopPolicy in conftest.py avoids this issue.

See: python/cpython#116773

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
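A minimal sketch of that conftest.py change, using the standard asyncio policy API:

# conftest.py
import asyncio
import sys

if sys.platform == "win32":
    # Avoid ProactorEventLoop crashes under pytest-xdist (python/cpython#116773).
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())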
Fakeredis doesn't implement blocking xread properly - it returns
immediately instead of waiting. This causes Docket._monitor_strikes
to busy-loop, overwhelming pytest-xdist workers on Windows.

Mock the method to just sleep, since strike coordination isn't useful
with in-memory backends anyway.

See: cunla/fakeredis-py#274

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
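A sketch of the kind of monkeypatch described, assuming Docket._monitor_strikes is an async method as the commit message states (the fixture name is illustrative):

import asyncio
import pytest


@pytest.fixture(autouse=True)
def quiet_strike_monitor(monkeypatch):
    async def _sleep_instead(self):
        # Strike coordination isn't useful with the in-memory backend, so
        # park instead of busy-looping on fakeredis's non-blocking xread.
        await asyncio.sleep(3600)

    monkeypatch.setattr("docket.Docket._monitor_strikes", _sleep_instead)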
Docket stores a shared FakeServer as a class attribute (_memory_server).
When many tests run in parallel, shared state can cause issues on Windows.

Add autouse fixture to reset the shared server before each test.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
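A sketch of the autouse fixture that commit describes, assuming the _memory_server class attribute it names and that Docket is importable as shown:

import pytest
from docket import Docket


@pytest.fixture(autouse=True)
def reset_docket_memory_server():
    # Drop the shared in-memory FakeServer so each test starts fresh.
    Docket._memory_server = None
    yield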
@marvin-context-protocol
Contributor

Test Failure Analysis

Updated (2025-12-05 17:19 UTC): Latest workflow run #19970549620 shows the same Windows worker crash pattern.

Summary

Eight pytest-xdist workers crashed on Windows during test execution, hitting the maximum crashed worker limit. This prevents the test suite from completing successfully.


Root Cause

The PR makes Docket always-on, which means every test that creates a FastMCP server now:

  1. Creates a Docket instance
  2. Starts a Worker instance
  3. Registers all tools/prompts/resources with Docket
  4. Starts the worker's run_forever() background task

On Windows with pytest-xdist running 4 parallel workers, this creates significant resource pressure. The logs show:

  • Tests initially pass (200+ passing before first crash)
  • Workers progressively crash during execution: gw0, gw1, gw2, gw3, gw4, gw5, gw6, gw7, gw8, gw9
  • Error: maximum crashed workers reached: 8
  • Crashes occur across unrelated test files

This is a resource exhaustion issue specific to Windows + parallel testing, not a functional bug.


Suggested Solution

Update .github/workflows/run-tests.yml to reduce Windows parallelism:

- name: Run tests (excluding integration and client_process)
  shell: bash
  run: |
    if [ "$RUNNER_OS" == "Windows" ]; then
      # Windows: reduced parallelism due to Docket overhead
      uv run pytest --inline-snapshot=disable tests -m "not integration and not client_process" --numprocesses auto --maxprocesses 2 --dist worksteal
    else
      # macOS/Linux: normal parallelism
      uv run pytest --inline-snapshot=disable tests -m "not integration and not client_process" --numprocesses auto --maxprocesses 4 --dist worksteal
    fi

This reduces parallel workers from 4 to 2 on Windows, which should prevent resource exhaustion while maintaining reasonable test performance.


Detailed Analysis

Worker Crash Timeline (Run #19970549620)

Tests begin running successfully, then workers start crashing early in the run:

  • 17:20:13 (~9s): gw0 crashes
  • 17:20:23 (~19s): gw2 crashes
  • 17:20:37 (~33s): gw3 crashes
  • 17:20:41 (~37s): gw1 crashes
  • 17:20:42 (~38s): gw4 crashes
  • 17:20:56 (~52s): gw6 crashes
  • 17:21:06 (~62s): gw7 crashes
  • 17:21:15 (~71s): gw8 crashes
  • 17:21:22 (~78s): gw9 and gw5 crash - maximum reached

The pattern suggests cumulative resource buildup rather than immediate failure.

Why Windows Is Affected

  1. Process Creation Overhead: Windows has higher process/thread creation costs than Unix systems
  2. Docket Worker Background Tasks: Each test creates asyncio tasks that may not clean up immediately
  3. Parallel Test Load: 4 workers × many tests × (Docket + Worker + background task) = resource pressure
  4. Test Cleanup Timing: Background tasks may not cancel cleanly before next test starts

Why The Code Changes Are Correct

The changes themselves are sound:

  • _started event properly signals server readiness
  • Worker cleanup uses proper cancellation
  • Test fixtures correctly use async context managers

The issue is environmental (Windows + parallel testing) rather than a code defect.

Evidence from Logs

[gw0] node down: Not properly terminated
worker 'gw0' crashed while running 'tests/client/auth/test_oauth_client.py::test_unauthorized'
replacing crashed worker gw0
...
[gw2] node down: Not properly terminated
replacing crashed worker gw2
...
maximum crashed workers reached: 8

Tests that were passing successfully started failing only after workers began crashing, indicating the crashes themselves are the root issue, not the test logic.

Related Files
  • .github/workflows/run-tests.yml:50-52: Test workflow configuration (needs update)
  • src/fastmcp/server/server.py:379-486: _docket_lifespan method (now always executes)
  • src/fastmcp/server/server.py:502-529: Server startup and wait_until_ready() implementation
  • pyproject.toml: pytest-xdist configuration

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/fastmcp/utilities/tests.py (2)

25-57: _wait_for_port logic is sound; only minor style/robustness nits

The polling loop, timeout accounting via get_running_loop().time(), and Windows-specific handling of wait_closed() all look correct and robust for test usage.

Two small, optional tweaks you might consider:

  • Cache the loop object once to avoid repeated get_running_loop() calls and make the timeout math a bit clearer:
-    start = asyncio.get_running_loop().time()
+    loop = asyncio.get_running_loop()
+    start = loop.time()
@@
-        except (OSError, asyncio.TimeoutError):
-            if asyncio.get_running_loop().time() - start > timeout:
+        except (OSError, asyncio.TimeoutError):
+            if loop.time() - start > timeout:
  • If you want to quiet Ruff’s TRY300 suggestion, you could move the return into an else: block on the try, but that’s purely stylistic and not required for correctness.

I’d keep the explicit TimeoutError message despite TRY003; it’s helpful when tests fail.


246-252: Readiness + teardown sequencing looks good; consider handling pre‑yield failures and using a public readiness API

Using await server._started.wait() followed by _wait_for_port(host, port) cleanly removes the previous race between test code and the HTTP listener, and the new teardown (cancel + wait_for with timeout while suppressing CancelledError/asyncio.TimeoutError) is a solid way to avoid hangs, especially on Windows.

Two non-blocking improvements:

  • If _wait_for_port (or _started.wait()) raises before the yield, the try/finally block is never entered, so server_task won't be cancelled/awaited. You could wrap the readiness phase in a try that cancels and awaits server_task on error before re‑raising, to avoid leaving a stray task around in failing tests; see the sketch after this comment.
  • If FastMCP now exposes a public await server.wait_until_ready() helper, prefer calling that instead of the private server._started event to decouple tests from internals.

Overall this helper is a nice match for the in‑process testing guidance and should make async tests much less flaky. Based on learnings, ...

Also applies to: 256-260
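For the first point above, a sketch of what guarding the readiness phase could look like (illustrative; _wait_for_port is the helper from tests.py, and the run_http_async signature is approximated):

import asyncio
from contextlib import asynccontextmanager, suppress


@asynccontextmanager
async def run_server_async(server, host: str = "127.0.0.1", port: int = 8003):
    server_task = asyncio.create_task(server.run_http_async(host=host, port=port))
    try:
        await server.wait_until_ready()
        await _wait_for_port(host, port)  # assumed helper from tests.py
    except BaseException:
        # Don't leak the background server task if a readiness check fails
        # before the fixture yields.
        server_task.cancel()
        with suppress(asyncio.CancelledError):
            await server_task
        raise
    try:
        yield f"http://{host}:{port}"
    finally:
        server_task.cancel()
        with suppress(asyncio.CancelledError, asyncio.TimeoutError):
            await asyncio.wait_for(server_task, timeout=2)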

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bcd2e59 and faa897b.

⛔ Files ignored due to path filters (1)
  • .github/workflows/run-tests.yml is excluded by none and included by none
📒 Files selected for processing (1)
  • src/fastmcp/utilities/tests.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Python source code must use Python ≥3.10 with full type annotations
Never use bare except - be specific with exception types
Prioritize readable, understandable code - clarity over cleverness; avoid obfuscated or confusing patterns even if shorter
Follow existing patterns and maintain consistency in code organization and style

Files:

  • src/fastmcp/utilities/tests.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: jlowin/fastmcp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-04T00:17:41.238Z
Learning: Applies to tests/**/*.py : Pass FastMCP servers directly to clients for testing using in-memory transport; only use HTTP transport when explicitly testing network features
📚 Learning: 2025-12-04T00:17:41.238Z
Learnt from: CR
Repo: jlowin/fastmcp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-04T00:17:41.238Z
Learning: Applies to tests/**/*.py : Pass FastMCP servers directly to clients for testing using in-memory transport; only use HTTP transport when explicitly testing network features

Applied to files:

  • src/fastmcp/utilities/tests.py
🪛 Ruff (0.14.7)
src/fastmcp/utilities/tests.py

50-50: Consider moving this statement to an else block

(TRY300)


53-55: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Run tests: Python 3.10 on ubuntu-latest
  • GitHub Check: Run tests with lowest-direct dependencies

@chrisguidry chrisguidry merged commit 960ce77 into main Dec 5, 2025
12 checks passed
@chrisguidry chrisguidry deleted the unify-enable-tasks branch December 5, 2025 17:40
@jordanhboxer

this makes FastMCP incompatible with Trio now that Docket is required.


Labels

  • breaking change: Breaks backward compatibility. Requires minor version bump. Critical for maintainer attention.
  • enhancement: Improvement to existing functionality. For issues and smaller PR improvements.
  • server: Related to FastMCP server implementation or server-side functionality.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants