[Enhancement] Engine runtime errors#2426

Merged
david6666666 merged 48 commits into vllm-project:main from pi314ever:engine-runtime-errors on Apr 21, 2026

Conversation

@pi314ever
Contributor

Purpose

This PR fixes multiple runtime-error issues that caused hangs, tracked in #1346. PR #2390 exists but does not match the patterns vLLM sets for generation-time exceptions.

Test Plan

Mock tests covering various states and injected errors throughout the generation pipeline.

```shell
pytest -v -s \
  tests/diffusion/test_multiproc_engine_concurrency.py::TestMultiprocExecutorRaisesEngineDeadError \
  tests/diffusion/test_multiproc_engine_concurrency.py::TestExecutorFailureCallback \
  tests/diffusion/test_multiproc_engine_concurrency.py::TestDiffusionEngineDeadErrorPassthrough \
  tests/diffusion/test_multiproc_engine_concurrency.py::TestStageDiffusionClientErrorPropagation \
  tests/engine/test_async_omni_engine_outputs.py::test_fatal_error_message_surfaces_through_try_get_output \
  tests/engine/test_async_omni_engine_outputs.py::test_fatal_error_message_surfaces_through_try_get_output_async \
  tests/entrypoints/test_omni_entrypoints.py::test_fatal_error_raises_engine_dead \
  tests/entrypoints/test_omni_entrypoints.py::test_non_fatal_error_raises_runtime \
  tests/entrypoints/test_omni_entrypoints.py::test_async_omni_errored_property_alive \
  tests/entrypoints/test_omni_entrypoints.py::test_async_omni_errored_property_dead_engine \
  tests/entrypoints/test_omni_entrypoints.py::test_async_omni_errored_property_dead_stage \
  tests/entrypoints/test_omni_entrypoints.py::test_async_omni_propagates_engine_error \
  tests/entrypoints/test_omni_entrypoints.py::test_async_omni_propagates_engine_dead_error \
  tests/entrypoints/test_omni_entrypoints.py::test_async_omni_propagates_engine_generate_error \
  tests/entrypoints/openai_api/test_image_server.py::test_health_endpoint \
  tests/entrypoints/openai_api/test_image_server.py::test_health_endpoint_no_engine \
  tests/entrypoints/openai_api/test_image_server.py::test_health_endpoint_dead_engine \
  tests/entrypoints/openai_api/test_image_server.py::test_health_endpoint_after_generate_error
```

Test Result

All tests pass.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test-file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
@pi314ever force-pushed the engine-runtime-errors branch from 816f281 to a35010f on April 1, 2026 at 17:51
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
@pi314ever pi314ever marked this pull request as ready for review April 1, 2026 19:47
@pi314ever pi314ever requested a review from hsliuustc0106 as a code owner April 1, 2026 19:47
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Contributor

Copilot AI left a comment


Pull request overview

This PR improves how generation-time failures (especially fatal engine/stage failures like worker death/OOM) propagate through the diffusion runtime and OpenAI-compatible serving layer, aiming to prevent hangs and better mirror upstream vLLM error-handling patterns.

Changes:

  • Introduces structured per-request error outputs (OmniRequestOutput.error) and propagates stage errors as EngineGenerateError vs fatal EngineDeadError.
  • Adds diffusion worker death detection (monitor thread + failure callbacks) and plumbs “engine dead” state through stage clients, orchestrator, and health checks.
  • Updates OpenAI API server endpoints and /health behavior, plus adds targeted regression tests for these failure modes.
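The worker-death detection mentioned above can be sketched as a small monitor thread. Everything below (class and method names, the polling interval) is hypothetical and only illustrates the general pattern, not the PR's actual implementation:

```python
import threading
import time


class EngineDeadError(RuntimeError):
    """Fatal: a worker process died; the engine cannot recover."""


class WorkerMonitor:
    """Hypothetical sketch of a worker-death monitor thread."""

    def __init__(self, workers, on_worker_death, poll_interval_s=0.05):
        self._workers = workers
        self._on_worker_death = on_worker_death  # failure callback
        self._poll_interval_s = poll_interval_s
        self._stop = threading.Event()
        # daemon=True so the monitor never blocks interpreter shutdown
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            dead = [w for w in self._workers if not w.is_alive()]
            if dead:
                # Escalate once, then exit: the engine is dead from here on.
                self._on_worker_death(dead)
                return
            time.sleep(self._poll_interval_s)

    def close(self):
        self._stop.set()
        self._thread.join(timeout=1.0)
```

In this sketch, a stage client would register a callback that flips an `_engine_dead` flag and fails pending requests with `EngineDeadError`.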

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `vllm_omni/outputs.py` | Adds `error` field and a `from_error()` constructor for failed request outputs. |
| `vllm_omni/entrypoints/openai/api_server.py` | Re-raises engine exceptions; updates `/health` to return 503 on `EngineDeadError`. |
| `vllm_omni/entrypoints/omni_base.py` | Extends `check_health()` to query stage clients. |
| `vllm_omni/entrypoints/async_omni.py` | Differentiates recoverable vs. fatal stage errors; updates errored-state detection and output-loop behavior. |
| `vllm_omni/entrypoints/async_omni_diffusion.py` | Wraps diffusion generation exceptions as `EngineGenerateError` while preserving fatal errors. |
| `vllm_omni/engine/orchestrator.py` | Detects diffusion error outputs and stage-dead conditions; forwards error outputs and escalates fatal failures. |
| `vllm_omni/engine/async_omni_engine.py` | Marks orchestrator-thread crashes as fatal messages (`fatal: True`). |
| `vllm_omni/diffusion/stage_diffusion_client.py` | Tracks `_engine_dead`, queues error outputs, and registers worker-death callbacks. |
| `vllm_omni/diffusion/executor/multiproc_executor.py` | Adds a worker monitor thread, failure callbacks, and `EngineDeadError` propagation. |
| `vllm_omni/diffusion/diffusion_engine.py` | Tightens error-raising semantics and preserves `EngineDeadError`. |
| `vllm_omni/diffusion/offloader/layerwise_backend.py` | Adds `from __future__ import annotations` (typing). |
| `tests/*` | Adds regression tests for fatal/non-fatal error propagation and `/health` behavior. |
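As a rough illustration of the structured per-request error output named in the table, a `from_error()` constructor might look like the following. The field names and the error-string format here are guesses for illustration, not the PR's actual definitions:

```python
from __future__ import annotations

from dataclasses import dataclass, field


class EngineDeadError(RuntimeError):
    """Fatal: the engine is gone; no further requests can succeed."""


class EngineGenerateError(RuntimeError):
    """Recoverable: this request failed, but the engine stays alive."""


@dataclass
class OmniRequestOutput:
    request_id: str
    outputs: list = field(default_factory=list)
    error: str | None = None  # structured per-request error, None on success

    @classmethod
    def from_error(cls, request_id: str, exc: BaseException) -> OmniRequestOutput:
        # Capture the exception type and message so downstream code can
        # surface or re-raise it without holding the live traceback.
        return cls(request_id=request_id, error=f"{type(exc).__name__}: {exc}")
```

With a shape like this, the orchestrator can forward error outputs through the normal output queue instead of letting the request hang.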
Comments suppressed due to low confidence (1)

vllm_omni/diffusion/stage_diffusion_client.py:140

  • In add_batch_request_async, the triple-quoted string is placed after runtime statements, so it is not treated as the function docstring (it becomes a no-op string literal). Move the docstring to be the first statement in the function body (before the _engine_dead check) or convert it to a comment to avoid losing documentation and confusing static doc tools.
```python
async def add_batch_request_async(
    self,
    request_id: str,
    prompts: list[OmniPromptType],
    sampling_params: OmniDiffusionSamplingParams,
) -> None:
    if self._engine_dead:
        raise EngineDeadError()
    """Submit a list of prompts as a single batched engine call.

    All prompts are processed in one ``DiffusionEngine.step()`` call
    and the combined result is placed on the output queue with a single
    *request_id*.
    """
```


Comment thread vllm_omni/engine/orchestrator.py
Comment thread vllm_omni/entrypoints/async_omni.py
Comment thread vllm_omni/entrypoints/openai/api_server.py Outdated
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Contributor

@fake0fan fake0fan left a comment


Left a few comments. One thing I'm curious about is whether the current changes are compatible with the LLM engine's dead-worker detection.

Comment thread vllm_omni/diffusion/executor/multiproc_executor.py Outdated
Comment thread vllm_omni/diffusion/executor/multiproc_executor.py
@xuechendi
Contributor

@pi314ever, I am a little confused about what this PR fixes.
Could you also check PR #2006, which will bring refactoring to the entrypoints?

Collaborator

@lishunyang12 lishunyang12 left a comment


Left a few comments, mostly around the worker-monitor logic.

Comment thread vllm_omni/diffusion/executor/multiproc_executor.py
Comment thread vllm_omni/engine/orchestrator.py
Comment thread vllm_omni/entrypoints/async_omni.py Outdated
pi314ever and others added 6 commits April 2, 2026 10:02
Co-authored-by: Chenguang Zheng <645327136@qq.com>
Signed-off-by: Daniel Huang <pilotflyer824@gmail.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
@pi314ever
Contributor Author

@fake0fan @lishunyang12 I have resolved the outstanding comments. PTAL!

@hsliuustc0106
Collaborator

@bjf-frz, are you coordinating with @pi314ever?

Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
@pi314ever
Contributor Author

@david6666666 The PR is ready. The AMD CI failure is unrelated.

Comment thread vllm_omni/entrypoints/omni_base.py Outdated
@bjf-frz
Contributor

bjf-frz commented Apr 18, 2026

I tested this PR: the server side encountered a "prompt too long" error, and the client received an error status showing:

```json
{
  "code": "EngineGenerateError",
  "message": ""
}
```

The `code` field did not contain a proper HTTP error code, and the `message` field was empty instead of carrying the specific error text.

@bjf-frz
Copy link
Copy Markdown
Contributor

bjf-frz commented Apr 18, 2026

> I tested this PR: the "prompt too long" error reached the client as `{"code": "EngineGenerateError", "message": ""}`, with an empty message and no proper HTTP error code.

I fixed this bug in the v1/videos API; you may check this commit:
debug-pr-2426

Comment thread vllm_omni/entrypoints/omni_base.py Outdated
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


PR Overview

This PR addresses runtime-error propagation through the diffusion engine, orchestrator, and OpenAI-compatible serving layer, specifically for worker death, OOM, and other fatal failures that previously caused hangs. It introduces structured error handling with EngineDeadError (fatal) vs EngineGenerateError (recoverable), subprocess death monitoring, and Omni-aware exception handlers.

Scope: 21 files, ~710 LOC production + ~835 LOC tests.


Blocker Scan

| Category | Result |
| --- | --- |
| Correctness | ISSUE: `EngineGenerateError()` raised without a message at `omni_base.py:301` (see inline comment) |
| Reliability/Safety | PASS: monitor threads use `daemon=True`, weakref cleanup, and `OSError` guards |
| Breaking Changes | PASS: `OmniBase.errored` is a new property; `AsyncOmni.errored` semantics broadened (was orchestrator-only, now includes stage clients) but all callers check via `.errored`, which still returns `bool` |
| Test Coverage | PASS: 18 targeted tests covering the worker monitor, death sentinel, error propagation through all layers, and the `/health` endpoint |
| Documentation | PASS: inline docstrings are thorough; no new public API exposed |
| Security | PASS |

Blocking Issue

Empty error message for recoverable failures (omni_base.py:301):

EngineGenerateError() is raised without the error text. The app-level handler calls create_error_response(exc) which uses str(exc), producing an empty "message": "" in the client response. This matches the bug reported in the PR thread where a "prompt too long" error became {"code": "EngineGenerateError", "message": ""}.

Fix: `raise EngineGenerateError(error_text)` instead of `raise EngineGenerateError() from RuntimeError(error_text)`.
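The difference is easy to demonstrate: `raise X() from Y` sets the exception's `__cause__` but leaves its own message empty, which is exactly what produced the empty `"message": ""` above. This is a minimal sketch with `EngineGenerateError` stubbed as a plain `RuntimeError` subclass:

```python
class EngineGenerateError(RuntimeError):
    pass


def message_via_chaining(error_text: str) -> str:
    # Buggy pattern: the cause carries the text, but str(exc) stays empty.
    try:
        raise EngineGenerateError() from RuntimeError(error_text)
    except EngineGenerateError as exc:
        assert isinstance(exc.__cause__, RuntimeError)  # the text lives here
        return str(exc)


def message_via_argument(error_text: str) -> str:
    # Fixed pattern: the message is on the exception itself.
    try:
        raise EngineGenerateError(error_text)
    except EngineGenerateError as exc:
        return str(exc)


print(repr(message_via_chaining("prompt too long")))   # ''
print(repr(message_via_argument("prompt too long")))   # 'prompt too long'
```

A handler that builds the response from `str(exc)` therefore only works with the second form.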


Non-Blocking Observations

  1. Copilot's earlier concern about EngineDeadError at orchestrator line 279 — you've already addressed this by switching to check_health(). Good.

  2. _wait_for_orchestrator_init polling loop — the 1-second poll interval during startup is reasonable. The early-detection of dead orchestrator thread is a meaningful improvement over the previous single-blocking future.result().

  3. Video background task terminate_if_errored — correctly signals shutdown from background tasks that can't propagate exceptions to FastAPI handlers. The app_state threading is clean.

  4. O(n_stages) check on every error in _check_engine_output_error — you've documented this with a NOTE. Acceptable for the typical small number of stages.


Verdict

1 blocker (empty error message). Once EngineGenerateError(error_text) is fixed, this is ready to approve.

Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: Daniel Huang <pilotflyer824@gmail.com>
@pi314ever
Contributor Author

@hsliuustc0106 I have addressed the blocking issue.

@david6666666
Collaborator

Please fix CI.

@david6666666
Collaborator

> I tested this PR: the "prompt too long" error reached the client as `{"code": "EngineGenerateError", "message": ""}`, with an empty message and no proper HTTP error code. I fixed this bug in the v1/videos API; you may check this commit: debug-pr-2426

We should consider this, @pi314ever. Thanks!

pi314ever and others added 2 commits April 20, 2026 10:20
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Signed-off-by: bjf-frz <frz123db@gmail.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
@pi314ever
Contributor Author

@david6666666 I cherry-picked the commit in; it should be resolved now. I also fixed CI.

Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
@david6666666
Collaborator

Please resolve conflicts, thanks.

Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Comment thread vllm_omni/entrypoints/openai/api_server.py
Comment thread vllm_omni/entrypoints/openai/api_server.py
@david6666666
Collaborator

LGTM now

@david6666666
Collaborator

@hsliuustc0106 @Gaohan123 PTAL, thanks.

@david6666666 david6666666 merged commit 52b5336 into vllm-project:main Apr 21, 2026
8 checks passed
qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Signed-off-by: Daniel Huang <pilotflyer824@gmail.com>
Signed-off-by: bjf-frz <frz123db@gmail.com>
Co-authored-by: Chenguang Zheng <645327136@qq.com>
Co-authored-by: Zhou Taichang <tzhouam@connect.ust.hk>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: bjf-frz <frz123db@gmail.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

Labels

ready (label to trigger Buildkite CI)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants