[Enhancement] Engine runtime errors #2426
Conversation
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
816f281 to a35010f
hsliuustc0106
left a comment
@fake0fan @chickeyton PTAL
Pull request overview
This PR improves how generation-time failures (especially fatal engine/stage failures like worker death/OOM) propagate through the diffusion runtime and OpenAI-compatible serving layer, aiming to prevent hangs and better mirror upstream vLLM error-handling patterns.
Changes:
- Introduces structured per-request error outputs (OmniRequestOutput.error) and propagates stage errors as EngineGenerateError vs fatal EngineDeadError.
- Adds diffusion worker death detection (monitor thread + failure callbacks) and plumbs “engine dead” state through stage clients, orchestrator, and health checks.
- Updates OpenAI API server endpoints and /health behavior, plus adds targeted regression tests for these failure modes.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| vllm_omni/outputs.py | Adds error field + from_error() constructor for failed request outputs. |
| vllm_omni/entrypoints/openai/api_server.py | Re-raises engine exceptions; updates /health to return 503 on EngineDeadError. |
| vllm_omni/entrypoints/omni_base.py | Extends check_health() to query stage clients. |
| vllm_omni/entrypoints/async_omni.py | Differentiates recoverable vs fatal stage errors; updates errored-state detection and output loop behavior. |
| vllm_omni/entrypoints/async_omni_diffusion.py | Wraps diffusion generation exceptions as EngineGenerateError while preserving fatal errors. |
| vllm_omni/engine/orchestrator.py | Detects diffusion error outputs and stage-dead conditions; forwards error outputs and escalates fatal failures. |
| vllm_omni/engine/async_omni_engine.py | Marks orchestrator-thread crashes as fatal messages (fatal: True). |
| vllm_omni/diffusion/stage_diffusion_client.py | Tracks _engine_dead, queues error outputs, and registers worker-death callbacks. |
| vllm_omni/diffusion/executor/multiproc_executor.py | Adds worker monitor thread, failure callbacks, and EngineDeadError propagation. |
| vllm_omni/diffusion/diffusion_engine.py | Tightens error raising semantics and preserves EngineDeadError. |
| vllm_omni/diffusion/offloader/layerwise_backend.py | Adds __future__ annotations import (typing). |
| tests/* | Adds regression tests for fatal/non-fatal error propagation and /health behavior. |
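The worker-death detection added in multiproc_executor.py could look roughly like the following self-contained sketch; `WorkerMonitor`, `FakeWorker`, and the polling details are illustrative stand-ins, not the PR's actual implementation:

```python
import threading
import time


class EngineDeadError(RuntimeError):
    """Fatal error: an engine worker process died and cannot recover."""


class WorkerMonitor:
    # Sketch of a daemon monitor thread: poll worker liveness and fire
    # failure callbacks once, when the first worker death is observed.
    def __init__(self, workers, poll_interval: float = 0.01):
        self.workers = workers          # objects exposing is_alive()
        self.callbacks = []             # each called with the fatal error
        self.poll_interval = poll_interval
        self._dead = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        while not self._dead.is_set():
            if any(not w.is_alive() for w in self.workers):
                self._dead.set()
                for cb in self.callbacks:
                    cb(EngineDeadError("worker process died"))
                return
            time.sleep(self.poll_interval)


class FakeWorker:
    """Stand-in for a worker process handle."""
    def __init__(self):
        self.alive = True

    def is_alive(self) -> bool:
        return self.alive


seen = []
monitor = WorkerMonitor([FakeWorker()])
monitor.callbacks.append(seen.append)
monitor.workers[0].alive = False
monitor.start()
monitor._thread.join(timeout=2)
print(type(seen[0]).__name__)  # EngineDeadError
```

Using `daemon=True` matches the reliability note in the review below: the monitor never blocks interpreter shutdown.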
Comments suppressed due to low confidence (1)
vllm_omni/diffusion/stage_diffusion_client.py:140
- In add_batch_request_async, the triple-quoted string is placed after runtime statements, so it is not treated as the function docstring (it becomes a no-op string literal). Move the docstring so it is the first statement in the function body (before the _engine_dead check), or convert it to a comment, to avoid losing documentation and confusing static doc tools.
async def add_batch_request_async(
    self,
    request_id: str,
    prompts: list[OmniPromptType],
    sampling_params: OmniDiffusionSamplingParams,
) -> None:
    if self._engine_dead:
        raise EngineDeadError()
    """Submit a list of prompts as a single batched engine call.

    All prompts are processed in one ``DiffusionEngine.step()`` call
    and the combined result is placed on the output queue with a single
    *request_id*.
    """
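The docstring issue is easy to demonstrate in isolation: Python only records a string literal as `__doc__` when it is the first statement of the body. The function names below are made up for the demo:

```python
def documented():
    """This docstring is the first statement, so Python keeps it."""
    flag = True
    return flag


def undocumented():
    flag = True
    """Placed after a statement, this is a discarded string literal."""
    return flag


print(documented.__doc__ is not None)  # True
print(undocumented.__doc__)            # None
```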
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
fake0fan
left a comment
left a few comments. One thing I’m curious about is whether the current changes are compatible with the LLM engine’s dead-worker detection.
@pi314ever, I am a little confused about what this PR fixes.
lishunyang12
left a comment
left a few comments, mostly around the worker monitor logic
Co-authored-by: Chenguang Zheng <645327136@qq.com> Signed-off-by: Daniel Huang <pilotflyer824@gmail.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
@fake0fan @lishunyang12 I have resolved the outstanding comments. PTAL!
@bjf-frz are you coordinating with @pi314ever?
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
@david6666666 PR is ready. The AMD CI failure is not related.
I tested this PR: the server side encountered a "prompt too long" error, and the client side received an error status which shows: The
I fixed this bug in the v1/videos API; you may check this commit:
hsliuustc0106
left a comment
PR Overview
This PR addresses runtime error propagation through the diffusion engine, orchestrator, and OpenAI-compatible serving layer — specifically for worker death/OOM/fatal failures that previously caused hanging. It introduces structured error handling with EngineDeadError (fatal) vs EngineGenerateError (recoverable), subprocess death monitoring, and Omni-aware exception handlers.
Scope: 21 files, ~710 LOC production + ~835 LOC tests.
Blocker Scan
| Category | Result |
|---|---|
| Correctness | ISSUE: EngineGenerateError() raised without message at omni_base.py:301 — see inline comment |
| Reliability/Safety | PASS — monitor threads use daemon=True, weakref cleanup, and OSError guards |
| Breaking Changes | PASS — OmniBase.errored is a new property; AsyncOmni.errored semantics broadened (was orchestrator-only, now includes stage clients) but all callers check via .errored which still returns bool |
| Test Coverage | PASS — 18 targeted tests covering worker monitor, death sentinel, error propagation through all layers, and /health endpoint |
| Documentation | PASS — inline docstrings are thorough; no new public API exposed |
| Security | PASS |
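The /health behavior described above can be illustrated with a minimal sketch; `FakeEngine` and the bare `health()` coroutine are hypothetical stand-ins for the real FastAPI endpoint and engine client:

```python
import asyncio


class EngineDeadError(RuntimeError):
    """Fatal: the engine is dead and cannot serve further requests."""


class FakeEngine:
    """Stand-in for an engine exposing check_health()."""
    def __init__(self):
        self.dead = False

    async def check_health(self) -> None:
        if self.dead:
            raise EngineDeadError("engine is dead")


async def health(engine) -> int:
    # Map health-check outcome to an HTTP status code:
    # 200 while healthy, 503 once the engine is dead.
    try:
        await engine.check_health()
    except EngineDeadError:
        return 503
    return 200


engine = FakeEngine()
print(asyncio.run(health(engine)))  # 200
engine.dead = True
print(asyncio.run(health(engine)))  # 503
```

Returning 503 (rather than raising) lets load balancers and orchestrators treat the process as unhealthy without parsing a traceback.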
Blocking Issue
Empty error message for recoverable failures (omni_base.py:301):
EngineGenerateError() is raised without the error text. The app-level handler calls create_error_response(exc) which uses str(exc), producing an empty "message": "" in the client response. This matches the bug reported in the PR thread where a "prompt too long" error became {"code": "EngineGenerateError", "message": ""}.
Fix: raise EngineGenerateError(error_text) instead of raise EngineGenerateError() from RuntimeError(error_text).
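The root cause is easy to reproduce: with `raise X() from Y`, the message lives on the cause (`__cause__`), so `str()` of the raised exception is empty. A minimal demonstration, with `create_error_response` as a simplified stand-in for the serving layer's handler:

```python
class EngineGenerateError(Exception):
    """Recoverable per-request generation failure."""


def create_error_response(exc: Exception) -> dict:
    # Simplified stand-in for the serving layer's error handler.
    return {"code": type(exc).__name__, "message": str(exc)}


error_text = "prompt too long"

# Buggy pattern: the message is on the cause, not the raised error,
# so str(exc) is empty.
try:
    raise EngineGenerateError() from RuntimeError(error_text)
except EngineGenerateError as exc:
    buggy = create_error_response(exc)

# Fixed pattern: attach the text to the exception the handler will see.
try:
    raise EngineGenerateError(error_text)
except EngineGenerateError as exc:
    fixed = create_error_response(exc)

print(buggy)  # {'code': 'EngineGenerateError', 'message': ''}
print(fixed)  # {'code': 'EngineGenerateError', 'message': 'prompt too long'}
```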
Non-Blocking Observations
- Copilot's earlier concern about EngineDeadError at orchestrator line 279 — you've already addressed this by switching to check_health(). Good.
- _wait_for_orchestrator_init polling loop — the 1-second poll interval during startup is reasonable. The early detection of a dead orchestrator thread is a meaningful improvement over the previous single blocking future.result().
- Video background task terminate_if_errored — correctly signals shutdown from background tasks that can't propagate exceptions to FastAPI handlers. The app_state threading is clean.
- O(n_stages) check on every error in _check_engine_output_error — you've documented this with a NOTE. Acceptable for the typical small number of stages.
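The broadened errored semantics noted in the Blocker Scan (orchestrator state plus stage clients) can be sketched as a simple boolean aggregation; `AsyncOmniSketch` and `StageClient` here are illustrative stand-ins, not the PR's classes:

```python
class StageClient:
    """Stand-in for a stage client tracking engine-dead state."""
    def __init__(self):
        self._engine_dead = False

    @property
    def errored(self) -> bool:
        return self._engine_dead


class AsyncOmniSketch:
    """Illustrative aggregation: orchestrator state OR any dead stage."""
    def __init__(self, stage_clients):
        self._orchestrator_errored = False
        self._stage_clients = stage_clients

    @property
    def errored(self) -> bool:
        # Previously orchestrator-only; a dead stage client now also
        # flips the engine-wide errored flag.
        return self._orchestrator_errored or any(
            client.errored for client in self._stage_clients
        )


clients = [StageClient(), StageClient()]
engine = AsyncOmniSketch(clients)
print(engine.errored)  # False
clients[1]._engine_dead = True
print(engine.errored)  # True
```

Because the property still returns a plain bool, callers that only check `.errored` are unaffected by the broadened semantics.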
Verdict
1 blocker (empty error message). Once EngineGenerateError(error_text) is fixed, this is ready to approve.
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com> Signed-off-by: Daniel Huang <pilotflyer824@gmail.com>
@hsliuustc0106 I have addressed the blocking issue.
please fix CI |
we should consider this @pi314ever, thx |
Signed-off-by: bjf-frz <frz123db@gmail.com> Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
@david6666666 I cherry-picked the commit in; it should be resolved now. I also fixed CI.
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
please resolve conflicts, thx
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
LGTM now
@hsliuustc0106 @Gaohan123 ptal thx
Signed-off-by: Daniel Huang <daniel1.huang@intel.com> Signed-off-by: Daniel Huang <pilotflyer824@gmail.com> Signed-off-by: bjf-frz <frz123db@gmail.com> Co-authored-by: Chenguang Zheng <645327136@qq.com> Co-authored-by: Zhou Taichang <tzhouam@connect.ust.hk> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com> Co-authored-by: bjf-frz <frz123db@gmail.com>
Purpose
This PR fixes multiple issues where runtime errors caused the server to hang, tracked in #1346. PR #2390 exists but does not match the patterns vLLM sets for generation-time exceptions.
Test Plan
Mock testing covering various engine states and injected errors throughout the generation pipeline.
Test Result
All tests pass.