
engine: keep SimpleEngine serialized across cancellation#220

Merged
Thump604 merged 5 commits into waybarrios:main from
computor-org:fix/simple-engine-cancel-serialization-upstream
Apr 10, 2026

Conversation

@krystophny
Contributor

@krystophny krystophny commented Mar 24, 2026

Summary

Keep SimpleEngine serialized until cancellation-driven worker cleanup has finished, and avoid nested lock reacquisition in the streaming helper paths that now use the same serialized runner.

Why

A cancelled request could release the async lock before the background worker thread had fully unwound, allowing a follow-up request to overlap on the same Metal command buffer.

At the same time, the serialized helper must be the only place that acquires the generation lock; otherwise specprefill and text-only MTP streaming paths can self-deadlock on a second acquisition.

What changed

  • keep the async critical section held until the asyncio.to_thread(...) worker has fully unwound after cancellation
  • make specprefill and text-only MTP helpers rely on _run_blocking_serialized(...) as the sole lock owner
  • catch BaseException (not Exception) in the cancellation drain to handle double-cancellation safely
  • add focused regressions for cancellation overlap and the two nested-lock call sites

Status

  • refreshed onto current upstream main (b4fa030) on 2026-04-09
  • removed the accidental tokenizer change from this PR; the tokenizer fix stays isolated in #215
  • preserved the current upstream non-stream prompt-token accounting while rebasing the chat path onto _run_blocking_serialized(...)

Files to review

  • vllm_mlx/engine/simple.py
  • tests/test_simple_engine_cancel_serialization.py

Validation

  • python -m pytest tests/test_simple_engine_cancel_serialization.py tests/test_simple_engine.py -q -> 8 passed

@Thump604
Collaborator

This is a real fix for a real race condition. We've been tracking the same class of Metal command-buffer corruption bugs through mlx#3216 -- cancellation releasing the asyncio lock while a Metal kernel is still executing in the worker thread is exactly the window that causes it.

The asyncio.shield + await-on-cancel pattern is the right approach. One lock acquisition point, cancellation waits for worker completion. Clean.

Also good that it eliminates the nested lock deadlocks in the specprefill and text MTP streaming paths.

+1 from us. We run SimpleEngine on M2 Ultra 128GB with Qwen3.5-122B and have hit cancellation-related crashes in production (concurrent requests from coding agents that cancel mid-generation).

@Thump604
Collaborator

Strong +1. We hit the exact same race in BatchedEngine and fixed it with the same pattern -- asyncio.shield() wrapping asyncio.to_thread(), with a CancelledError handler that awaits the shielded task before releasing the lock.

Our production data validates this fix class:

  • Before fix: ~1x/day SIGABRT from tryCoalescingPreviousComputeCommandEncoder assertion (two threads hitting Metal concurrently after lock released during cancellation)
  • After fix: Zero crashes over 5 days unattended production + 20/20 stress test clean (M2 Ultra, Qwen3.5-122B, BatchedEngine)

The root cause is the same in both engines: asyncio.Lock.__aexit__ runs when CancelledError propagates through async with, releasing the lock while to_thread worker is still executing Metal ops. The next request acquires the freed lock, and two threads overlap on the same Metal command buffer.

Our BatchedEngine fix is in patch 022 (not yet PRed upstream). This PR covers SimpleEngine. Both should merge -- the pattern is validated in production on both engine paths.

See also: mlx issue #3317 and PR #3318 for the complementary upstream MLX fix (deferred error propagation from Metal completion handlers, which makes the resulting GPU error non-fatal).

Collaborator

@Thump604 Thump604 left a comment


Looks good. The serialization approach is sound and addresses the core problem correctly.

Strengths

  1. Lock scope is correct. The async with self._generation_lock: block encompasses the entire operation, including the re-await in the cancellation handler. This guarantees the lock is held until the background worker finishes, preventing Metal command-buffer overlap.

  2. Shield + re-await pattern is robust.

    • Shield prevents the awaiter from being cancelled, giving the handler a chance to run
    • The except handler re-awaits the unshielded task, ensuring synchronous completion before lock release
    • This is tighter than pure shield (which could allow early exit) and safer than naive lock-around-to_thread (which releases on cancellation)
  3. Unified lock ownership. All paths (generate, chat, stream_chat, specprefill, text_mtp) now use _run_blocking_serialized() as the sole lock owner. This eliminates nested-lock deadlock risk in helper methods.

  4. Test coverage is targeted. The regression tests validate:

    • Cancellation doesn't release lock before worker finishes (max_concurrent=1 assertion)
    • Specprefill and text_mtp helper paths don't pre-acquire the lock
    • These are the exact failure modes this PR prevents

Minor note

The tokenizer.py change (adding missing return statement) is a real bug fix, but it's orthogonal to this PR. Consider splitting it to a separate commit for clarity.

LGTM, approve.

@Thump604
Collaborator

Thump604 commented Apr 7, 2026

@waybarrios, @krystophny: independent technical review of this PR.

Verification

Read against current upstream main (b4fa030). The diff introduces a new _run_blocking_serialized() helper on SimpleEngine and refactors generate(), chat() (both LLM and MLLM branches), stream_chat(), _stream_generate_specprefill(), and _stream_generate_text() to route through it. The helper becomes the sole owner of self._generation_lock for all blocking MLX operations.

Why the fix is correct

The core code:

async def _run_blocking_serialized(self, func, /, *args, **kwargs):
    async with self._generation_lock:
        task = asyncio.create_task(asyncio.to_thread(func, *args, **kwargs))
        try:
            return await asyncio.shield(task)
        except asyncio.CancelledError:
            try:
                await task
            except BaseException:
                pass
            raise

This is exactly the right fix for the bug. await asyncio.to_thread(fn) does NOT cancel the underlying executor future once it has started running, so when the outer task is cancelled the worker thread continues and the lock release has to wait for the worker to actually exit. asyncio.shield(task) keeps the worker future alive across outer cancellation, and the except branch awaits the worker through to completion before re-raising. The lock stays held until the worker has truly unwound. A follow-up request that acquires the lock will not overlap on Metal command buffers.

This is the same race shape as Metal SIGABRT bugs that have been observed in production with the same async-lock-around-executor-future pattern. The fix is the canonical asyncio idiom for "shield a worker but propagate cancellation to the caller while waiting for the worker to finish."
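The guarantee can be checked end to end with a self-contained sketch (toy names, a reimplementation of the helper under the same shield + drain shape as the PR's `_run_blocking_serialized`, not the engine's code): after cancellation, the cancelled request stays pending and the lock stays held until the worker thread has actually unwound.

```python
import asyncio
import threading

async def demonstrate_fix() -> bool:
    lock = asyncio.Lock()
    worker_running = threading.Event()
    worker_done = threading.Event()
    events: list[str] = []

    def blocking_work():
        worker_running.set()
        worker_done.wait(timeout=5)
        events.append("worker finished")

    async def run_blocking_serialized(func):
        # Same shape as the PR's helper: shield the worker, and on outer
        # cancellation drain it to completion before releasing the lock.
        async with lock:
            task = asyncio.create_task(asyncio.to_thread(func))
            try:
                return await asyncio.shield(task)
            except asyncio.CancelledError:
                try:
                    await task            # wait for the worker to truly unwind
                except BaseException:
                    pass
                raise

    outer = asyncio.create_task(run_blocking_serialized(blocking_work))
    await asyncio.to_thread(worker_running.wait, 5)
    outer.cancel()
    await asyncio.sleep(0.1)  # let cancellation propagate into the drain
    # The request is cancelled but still pending: the drain holds the lock.
    still_held = lock.locked() and not outer.done()
    worker_done.set()                     # now let the worker finish
    try:
        await outer
    except asyncio.CancelledError:
        pass
    return still_held and events == ["worker finished"] and not lock.locked()

print(asyncio.run(demonstrate_fix()))  # prints True
```

Contrast with the buggy shape: here the lock is only released after the worker has appended "worker finished", so a follow-up request can never observe a live worker.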

On the secondary fix (nested lock prevention)

The other half of the PR is equally important: refactoring specprefill and text-only MTP streaming helpers to NOT pre-acquire the lock themselves, so _run_blocking_serialized can be the sole owner. Without this, those helpers would self-deadlock when they call into the serialized runner under their own outer lock.

Catching BaseException (rather than Exception) in the cancellation drain also handles double-cancellation safely, which is important for clients that race the disconnect-then-retry pattern.
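For reference, since Python 3.8 `CancelledError` derives directly from `BaseException`, not `Exception`, so an `except Exception` around the draining `await task` would let a second cancellation escape mid-drain. A quick check:

```python
import asyncio

# Since Python 3.8, CancelledError is not an Exception subclass; only
# `except BaseException` (or a bare except) suppresses it during the drain.
print(issubclass(asyncio.CancelledError, Exception))      # False
print(issubclass(asyncio.CancelledError, BaseException))  # True
```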

Test coverage

tests/test_simple_engine_cancel_serialization.py adds three focused regressions:

  1. test_cancellation_does_not_release_lock_before_worker_finishes uses a real SimpleEngine, two tasks racing through a mocked model with a barrier, asserts _max_concurrent == 1 after cancellation. This is the deterministic race detector I would write to verify the fix.
  2. test_specprefill_path_does_not_prelock_serialized_runner asserts the lock is NOT already held when _run_blocking_serialized is invoked from the specprefill path.
  3. test_text_mtp_path_does_not_prelock_serialized_runner is the same assertion for the text-only MTP path.

The conftest adds an anyio_backend fixture so the tests run under pytest-anyio.
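The first regression's shape can be sketched as a self-contained race detector (toy names, plain asyncio rather than the repo's pytest-anyio fixtures): a concurrency counter incremented inside the worker, a cancelled first request, an immediately-queued second request, and an assertion that the maximum observed concurrency never exceeds 1.

```python
import asyncio
import threading

async def race_detector() -> int:
    lock = asyncio.Lock()
    concurrent = 0
    max_concurrent = 0
    counter_guard = threading.Lock()
    started = threading.Event()
    release = threading.Event()

    def worker():
        nonlocal concurrent, max_concurrent
        with counter_guard:
            concurrent += 1
            max_concurrent = max(max_concurrent, concurrent)
        started.set()
        release.wait(timeout=5)     # hold the "Metal kernel" open
        with counter_guard:
            concurrent -= 1

    async def serialized():
        # Shield + drain, same shape as the PR's helper.
        async with lock:
            task = asyncio.create_task(asyncio.to_thread(worker))
            try:
                await asyncio.shield(task)
            except asyncio.CancelledError:
                try:
                    await task
                except BaseException:
                    pass
                raise

    first = asyncio.create_task(serialized())
    await asyncio.to_thread(started.wait, 5)
    first.cancel()                       # cancel mid-"kernel"
    second = asyncio.create_task(serialized())
    await asyncio.sleep(0.1)             # window where a buggy lock would overlap
    release.set()                        # let the first worker unwind
    try:
        await first
    except asyncio.CancelledError:
        pass
    await second
    return max_concurrent                # 1 with the fix; 2 with the buggy shape

print(asyncio.run(race_detector()))  # prints 1
```

With the pre-PR pattern the second task would acquire the lock during the window and drive the counter to 2, which is the deterministic failure the upstream test asserts against.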

Status note

PR currently shows CONFLICTING merge status. Likely needs a rebase on current main since vllm_mlx/engine/simple.py has had other changes since this branch was created. The fix itself remains valid, just needs the rebase to apply cleanly.

Recommendation

Strong merge candidate after rebase. Real fix to a real production race (Metal command buffer corruption when a follow-up request enters MLX while the cancelled-request worker is still mid-call), the canonical asyncio cancellation-safe pattern, comprehensive refactor across all SimpleEngine entry points, and a deterministic regression test that catches the race.

@Thump604
Collaborator

Thump604 commented Apr 8, 2026

This fix aligns with a race I hit independently on the SimpleEngine SpecPrefill path: a client disconnect mid-prefill releases the generation lock while the MLX worker thread is still running, and a subsequent request then spawns a parallel MLX worker on the same Metal command buffer. The ensure_future + shield + await worker on CancelledError shape in _run_blocking_serialized is the exact pattern I ended up with for the BatchedEngine MLLM path.

Once this rebases cleanly against main, I'm happy to run it against a 128K-context SpecPrefill regression harness I'm putting together (Qwen 3.5 122B target + 2B draft, streaming + non-streaming at 16K/64K/128K). That will exercise the race window under real prefill work rather than mocked timing, and I can post the run output here as an independent signal for the maintainer.

No urgency on the rebase from my side. Shield + await is the right design.

_run_blocking_serialized catches CancelledError (a BaseException subclass)
from the outer scope, but the inner try/except used Exception which would
let a second CancelledError during await task escape unhandled. Changed to
BaseException to suppress any exception from the draining await.

Also fix test_simple_engine.py to use pytest.mark.anyio instead of
pytest.mark.asyncio (pytest-asyncio is not configured), and add the
anyio_backend fixture to conftest.py restricting to asyncio only since
trio is not installed.
@krystophny krystophny force-pushed the fix/simple-engine-cancel-serialization-upstream branch from 4cf801e to aedb342 on April 9, 2026 at 06:35
@krystophny
Contributor Author

Force-pushed a refresh onto current upstream main (b4fa030). I dropped the accidental tokenizer hunk from this PR and kept the current upstream prompt-token accounting in the rebased non-stream chat path. Validation: python -m pytest tests/test_simple_engine_cancel_serialization.py tests/test_simple_engine.py -q -> 8 passed.

@Thump604
Collaborator

Thump604 commented Apr 9, 2026

Refresh confirmed on head f2526ff against upstream main b4fa030. The accidental tokenizer hunk is gone and the file surface is now tests/conftest.py (+6), tests/test_simple_engine.py (+7/-5), tests/test_simple_engine_cancel_serialization.py (+143 new), vllm_mlx/engine/simple.py (+375/-370). The +375/-370 on simple.py is the rebase onto current upstream; spot-checking _run_blocking_serialized confirms the asyncio.shield + await-on-CancelledError pattern is intact across generate(), chat() (both branches), stream_chat(), _stream_generate_specprefill(), and _stream_generate_text().

Same race class addressed in BatchedEngine with the same pattern (asyncio.shield wrapping asyncio.to_thread, CancelledError handler awaiting the shielded task before releasing the lock). Five days of production data on the BatchedEngine side: zero recurrences of the Metal command buffer SIGABRT after that pattern landed. The SimpleEngine refactor brings the same guarantees to the single-request serving path.

CI green across the full matrix. Prior APPROVED review at 4cf801e stands for the refreshed head.

@Thump604 Thump604 merged commit 55cff4d into waybarrios:main Apr 10, 2026
7 checks passed
