engine: keep SimpleEngine serialized across cancellation #220
Conversation
This is a real fix for a real race condition. We've been tracking the same class of Metal command buffer corruption bugs through mlx#3216 -- cancellation releasing the asyncio lock while a Metal kernel is still executing in the worker thread is exactly the window that causes it. Also good that it eliminates the nested lock deadlocks in the specprefill and text MTP streaming paths. +1 from us. We run SimpleEngine on an M2 Ultra (128 GB) with Qwen3.5-122B and have hit cancellation-related crashes in production (concurrent requests from coding agents that cancel mid-generation).
Strong +1. We hit the exact same race in BatchedEngine and fixed it with the same pattern; our production data validates this fix class. The root cause is the same in both engines. Our BatchedEngine fix is in patch 022 (not yet PRed upstream); this PR covers SimpleEngine. Both should merge -- the pattern is validated in production on both engine paths. See also mlx issue #3317 and PR #3318 for the complementary upstream MLX fix (deferred error propagation from Metal completion handlers, which makes the resulting GPU error non-fatal).
Thump604 left a comment
Looks good. The serialization approach is sound and addresses the core problem correctly.
Strengths

- **Lock scope is correct.** The `async with self._generation_lock:` block encompasses the entire operation, including the re-await in the cancellation handler. This guarantees the lock is held until the background worker finishes, preventing Metal command-buffer overlap.
- **Shield + re-await pattern is robust.**
  - Shield prevents the awaiter from being cancelled, giving the handler a chance to run
  - The except handler re-awaits the unshielded task, ensuring synchronous completion before lock release
  - This is tighter than pure shield (which could allow early exit) and safer than naive lock-around-to_thread, which releases on cancellation (see the sketch after this list)
- **Unified lock ownership.** All paths (generate, chat, stream_chat, specprefill, text_mtp) now use `_run_blocking_serialized()` as the sole lock owner. This eliminates nested-lock deadlock risk in helper methods.
- **Test coverage is targeted.** The regression tests validate:
  - Cancellation doesn't release the lock before the worker finishes (max_concurrent=1 assertion)
  - Specprefill and text_mtp helper paths don't pre-acquire the lock
  - These are the exact failure modes this PR prevents
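For contrast, a minimal sketch of the naive pattern that bullet warns against (names are illustrative, not the PR's code):

```python
import asyncio

async def run_blocking_naive(lock: asyncio.Lock, func, *args):
    # Anti-pattern: if the caller is cancelled at this await, the
    # `async with` unwinds and releases the lock immediately, even though
    # the thread running `func` is still executing -- exactly the window
    # that lets a second request overlap on the same Metal command buffer.
    async with lock:
        return await asyncio.to_thread(func, *args)
```

The shield + re-await variant instead keeps the lock held until the worker thread has actually returned.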
Minor note
The `tokenizer.py` change (adding a missing return statement) is a real bug fix, but it's orthogonal to this PR. Consider splitting it into a separate commit for clarity.
LGTM, approve.
@waybarrios, @krystophny: independent technical review of this PR.

Verification

Read against current upstream main (b4fa030). The diff introduces a new `_run_blocking_serialized` helper.

Why the fix is correct

The core code:

```python
async def _run_blocking_serialized(self, func, /, *args, **kwargs):
    async with self._generation_lock:
        task = asyncio.create_task(asyncio.to_thread(func, *args, **kwargs))
        try:
            return await asyncio.shield(task)
        except asyncio.CancelledError:
            try:
                await task
            except BaseException:
                pass
            raise
```

This is exactly the right fix for the bug. This is the same race shape as Metal SIGABRT bugs that have been observed in production with the same async-lock-around-executor-future pattern. The fix is the canonical asyncio idiom for "shield a worker but propagate cancellation to the caller while waiting for the worker to finish."

On the secondary fix (nested lock prevention)

The other half of the PR is equally important: refactoring the specprefill and text-only MTP streaming helpers to NOT pre-acquire the lock themselves, so `_run_blocking_serialized()` remains the sole lock owner. Catching `BaseException` in the drain also covers a second `CancelledError` raised during `await task`.
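For concreteness, a sketch of how an entry point delegates to the runner; `generate()` is named in the PR, but `_generate_blocking` and the signature are assumptions for illustration:

```python
# Hypothetical wiring: `_generate_blocking` is an assumed name for the
# blocking MLX call. Entry points never touch the generation lock
# themselves; the serialized runner is the sole lock owner.
async def generate(self, prompt, **opts):
    return await self._run_blocking_serialized(self._generate_blocking, prompt, **opts)
```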
Test coverage

The conftest adds an `anyio_backend` fixture restricted to the asyncio backend.

Status note

PR currently shows CONFLICTING merge status. Likely needs a rebase on current main.

Recommendation

Strong merge candidate after rebase. Real fix to a real production race (Metal command buffer corruption when a follow-up request enters MLX while the cancelled-request worker is still mid-call), the canonical asyncio cancellation-safe pattern, a comprehensive refactor across all SimpleEngine entry points, and a deterministic regression test that catches the race.
This fix aligns with a race I hit independently on the SimpleEngine SpecPrefill path: a client disconnect mid-prefill releases the generation lock while the MLX worker thread is still running, and a subsequent request then spawns a parallel MLX worker on the same Metal command buffer. Once this rebases cleanly against main, I'm happy to run it against a 128K-context SpecPrefill regression harness I'm putting together (Qwen 3.5 122B target + 2B draft, streaming + non-streaming at 16K/64K/128K). That will exercise the race window under real prefill work rather than mocked timing, and I can post the run output here as an independent signal for the maintainer. No urgency on the rebase from my side. Shield + await is the right design.
`_run_blocking_serialized` catches `CancelledError` (a `BaseException` subclass) from the outer scope, but the inner try/except used `Exception`, which would let a second `CancelledError` during `await task` escape unhandled. Changed to `BaseException` to suppress any exception from the draining await. Also fix `test_simple_engine.py` to use `pytest.mark.anyio` instead of `pytest.mark.asyncio` (pytest-asyncio is not configured), and add the `anyio_backend` fixture to `conftest.py`, restricting to asyncio only since trio is not installed.
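The class-hierarchy fact this commit relies on can be checked in a few lines:

```python
import asyncio

# Since Python 3.8, CancelledError derives from BaseException directly,
# so an `except Exception` clause in the drain would let it escape.
assert issubclass(asyncio.CancelledError, BaseException)
assert not issubclass(asyncio.CancelledError, Exception)
```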
Force-pushed from 4cf801e to aedb342.
Force-pushed a refresh onto current upstream main (b4fa030).
Refresh confirmed on head f2526ff against upstream main b4fa030. The accidental tokenizer hunk is gone and the file surface is now `tests/conftest.py` (+6), `tests/test_simple_engine.py` (+7/-5), `tests/test_simple_engine_cancel_serialization.py` (+143 new), `vllm_mlx/engine/simple.py` (+375/-370).

The +375/-370 on simple.py is the rebase onto current upstream; spot-checking `_run_blocking_serialized` confirms the asyncio.shield + await-on-CancelledError pattern is intact across `generate()`, `chat()` (both branches), `stream_chat()`, `_stream_generate_specprefill()`, and `_stream_generate_text()`.

Same race class addressed in BatchedEngine with the same pattern (asyncio.shield wrapping asyncio.to_thread, CancelledError handler awaiting the shielded task before releasing the lock). Five days of production data on the BatchedEngine side: zero recurrences of the Metal command buffer SIGABRT after that pattern landed. The SimpleEngine refactor brings the same guarantees to the single-request serving path.

CI green across the full matrix. Prior APPROVED review at 4cf801e stands for the refreshed head.
Summary
Keep `SimpleEngine` serialized until cancellation-driven worker cleanup has finished, and avoid nested lock reacquisition in the streaming helper paths that now use the same serialized runner.

Why
A cancelled request could release the async lock before the background worker thread had fully unwound, allowing a follow-up request to overlap on the same Metal command buffer.
At the same time, the serialized helper must be the only place that acquires the generation lock; otherwise specprefill and text-only MTP streaming paths can self-deadlock on a second acquisition.
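A minimal sketch of that self-deadlock (asyncio.Lock is not reentrant):

```python
import asyncio

async def main():
    lock = asyncio.Lock()
    async with lock:
        # Second acquisition by the same task waits forever:
        # asyncio.Lock is not reentrant, so this self-deadlocks.
        async with lock:
            pass

# asyncio.run(main())  # would hang indefinitely
```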
What changed
- Ensure the `asyncio.to_thread(...)` worker has fully unwound after cancellation
- `_run_blocking_serialized(...)` as the sole lock owner
- `BaseException` (not `Exception`) in the cancellation drain to handle double-cancellation safely

Status
- Rebased onto `main` (b4fa030) on 2026-04-09
- #215
- `_run_blocking_serialized(...)`

Files to review
- `vllm_mlx/engine/simple.py`
- `tests/test_simple_engine_cancel_serialization.py`

Validation
- `python -m pytest tests/test_simple_engine_cancel_serialization.py tests/test_simple_engine.py -q` → 8 passed
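For reference, a self-contained sketch of the shape of the cancellation regression test (helper names and wiring are assumptions, not the PR's code; it inlines a simplified serialized runner and the `anyio_backend` fixture, and assumes the anyio pytest plugin is installed):

```python
import asyncio
import threading

import pytest


@pytest.fixture
def anyio_backend():
    # Restrict @pytest.mark.anyio tests to asyncio; trio is not installed.
    return "asyncio"


@pytest.mark.anyio
async def test_cancel_does_not_release_lock_early():
    running = threading.Event()
    release = threading.Event()
    lock = asyncio.Lock()

    def slow_worker():
        running.set()            # the thread is now inside the worker
        release.wait(timeout=5)  # simulate a long-running MLX call

    async def serialized(func):
        # Simplified stand-in for _run_blocking_serialized.
        async with lock:
            task = asyncio.create_task(asyncio.to_thread(func))
            try:
                return await asyncio.shield(task)
            except asyncio.CancelledError:
                try:
                    await task  # drain: wait for the thread to unwind
                except BaseException:
                    pass
                raise

    first = asyncio.create_task(serialized(slow_worker))
    while not running.is_set():
        await asyncio.sleep(0.01)

    first.cancel()
    await asyncio.sleep(0.05)
    # The race this PR fixes: the lock must still be held here,
    # because the worker thread has not unwound yet.
    assert lock.locked()

    release.set()
    with pytest.raises(asyncio.CancelledError):
        await first
    assert not lock.locked()
```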