engine: keep SimpleEngine serialized across cancellation #8
Conversation
Review Summary by Qodo
Keep SimpleEngine serialized across request cancellation
Walkthrough

Description
• Introduce _run_blocking_serialized() helper to prevent MLX/Metal race conditions
• Ensure the generation lock is held until the worker thread completes, even on cancellation
• Replace direct asyncio.to_thread() calls with the cancellation-safe wrapper
• Add a regression test verifying no concurrent MLX access after request cancellation

Diagram

```mermaid
flowchart LR
    A["Request cancelled"] --> B["_run_blocking_serialized acquires lock"]
    B --> C["asyncio.to_thread starts worker"]
    C --> D["Cancellation caught"]
    D --> E["Wait for worker to finish"]
    E --> F["Release lock safely"]
    F --> G["Next request enters MLX"]
```
File Changes
1. tests/test_simple_engine_cancel_serialization.py
Code Review by Qodo
1. MTP stream deadlock
```diff
         cleanup_rope(model)
-        all_resps = await asyncio.to_thread(_run_all)
+        all_resps = await self._run_blocking_serialized(_run_all)
```
1. MTP stream deadlock 🐞 Bug ⛯ Reliability
_stream_generate_text() holds _generation_lock and now awaits _run_blocking_serialized(), which tries to acquire the same lock again, deadlocking MLLM+MTP text-only streaming. Text-only MLLM requests routed to this path will hang indefinitely.
Agent Prompt
### Issue description
`_stream_generate_text()` currently acquires `self._generation_lock` and then calls `_run_blocking_serialized()`, which also acquires `self._generation_lock`. This deadlocks.
### Issue Context
This path is used for MLLM+MTP routing (text-only requests). The PR moved lock acquisition into `_run_blocking_serialized()`, so call sites that already hold the lock must be adjusted.
### Fix Focus Areas
- vllm_mlx/engine/simple.py[229-246]
- vllm_mlx/engine/simple.py[1012-1014]
- vllm_mlx/engine/simple.py[1194-1201]
### Implementation direction
Refactor `_stream_generate_text()` so `_run_all` (and thus MLX/Metal calls) run under `_run_blocking_serialized()` without an additional surrounding `async with self._generation_lock:` block; or add a dedicated cancellation-safe helper that does **not** reacquire the lock when the caller already holds it.
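The underlying hazard is that `asyncio.Lock` is not reentrant, so the nested acquisition described above simply hangs. A minimal, self-contained reproduction of that hazard (illustrative code, not taken from the PR):

```python
import asyncio

async def nested_acquire() -> None:
    lock = asyncio.Lock()
    async with lock:
        # Simulates _stream_generate_text() holding _generation_lock and then
        # awaiting a helper that tries to take the same lock again. The inner
        # acquire can never succeed, so we bail out with a short timeout.
        await asyncio.wait_for(lock.acquire(), timeout=0.2)

# asyncio.run(nested_acquire()) raises asyncio.TimeoutError; without the
# timeout guard this coroutine would hang forever, which is exactly the
# MLLM+MTP streaming deadlock described above.
```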
Benchmark: Simple Engine vs Continuous Batching
Qwen3.5-2B on Apple Silicon (MLX), 128 max_tokens, streaming, 5 iterations per configuration.
Aggregate throughput (tok/s) — total completion tokens / wall clock
Wall clock (seconds) — time from first request to last response
Per-request throughput (tok/s) — individual request completion tokens / elapsed
TTFT — time to first content token. Measured from request send to first non-empty SSE delta. Includes prompt processing.
Analysis
Throughput scaling:
Wall clock:
Per-request throughput:
TTFT behavior:
Bottom line: the simple engine serialization from this PR is correct behavior for the non-batched code path. Users needing concurrent request handling should use the continuous batching engine.
* fix: keep simple engine serialized across cancellation (#8)
* fix: avoid nested simple engine generation locks
* fix: catch BaseException in cancellation handler, fix async test markers

  _run_blocking_serialized catches CancelledError (a BaseException subclass) from the outer scope, but the inner try/except used Exception, which would let a second CancelledError during `await task` escape unhandled. Changed to BaseException to suppress any exception from the draining await.

  Also fix test_simple_engine.py to use pytest.mark.anyio instead of pytest.mark.asyncio (pytest-asyncio is not configured), and add the anyio_backend fixture to conftest.py restricting to asyncio only, since trio is not installed.
* fix: preserve prompt token accounting after upstream refresh
* fix: restore specprefill fallback helper scope
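The Exception-vs-BaseException distinction matters because CancelledError has been a direct BaseException subclass since Python 3.8. A small standalone illustration of the difference (not the engine code itself):

```python
import asyncio

async def drain_with(handler: type) -> str:
    # Start a task and cancel it immediately, then try to suppress the
    # resulting error while draining, the way the cancellation handler does.
    task = asyncio.ensure_future(asyncio.sleep(10))
    task.cancel()
    try:
        await task
    except handler:
        return "suppressed"
    return "not reached"

# drain_with(Exception) lets CancelledError escape the handler entirely;
# drain_with(BaseException) suppresses it as intended.
```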
Summary
On cancellation, wait until asyncio.to_thread(...) workers actually finish before releasing the generation lock.

Reproducer
FortBench hit this on Apple Silicon during local Qwen runs:
- A /v1/responses request was cancelled by the disconnect guard
- The next request then hit a failed Metal assertion: `A command encoder is already encoding to this command buffer`

Fix
SimpleEngine now routes blocking MLX calls through a cancellation-safe helper that:
- acquires _generation_lock before entering MLX
- runs the blocking work via asyncio.to_thread(...)
- on cancellation, waits for the worker thread to finish before releasing the lock

This keeps MLX/Metal single-threaded even when the client disconnects.
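A minimal sketch of what such a helper can look like (hypothetical names and structure, assuming only asyncio and a plain threading.Event; the actual implementation lives in vllm_mlx/engine/simple.py and may differ):

```python
import asyncio
import threading

class SimpleEngine:
    """Sketch: only the serialization helper, not the real engine."""

    def __init__(self) -> None:
        self._generation_lock = asyncio.Lock()

    async def _run_blocking_serialized(self, fn, *args):
        async with self._generation_lock:
            done = threading.Event()

            def _worker():
                try:
                    return fn(*args)  # the blocking MLX/Metal call
                finally:
                    done.set()

            try:
                return await asyncio.to_thread(_worker)
            except asyncio.CancelledError:
                # asyncio cannot interrupt the worker thread, so wait for it
                # to finish before releasing the lock; otherwise the next
                # request would enter MLX while this one is still encoding.
                try:
                    await asyncio.shield(asyncio.to_thread(done.wait))
                except BaseException:
                    # A second cancellation can interrupt the draining await;
                    # fall back to a synchronous wait as a last resort.
                    done.wait()
                raise
```

The key property is that the `async with` block cannot exit on the cancellation path until the worker thread has signalled completion, so the lock release always happens-after the last MLX call of the cancelled request.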
Tests
python3 -m unittest tests.test_simple_engine_cancel_serialization -v
python3 -m compileall vllm_mlx tests