[docs] RFC: Async D2H + Stage Pipeline Async Transfer for Qwen3-Omni #3508

Closed
tzhouam wants to merge 24 commits into main from dev/vllm-align

@tzhouam
Collaborator

@tzhouam tzhouam commented May 11, 2026

Summary

This PR adds an architectural RFC proposing two optimizations for Qwen3-Omni's multi-stage inference pipeline: asynchronous device-to-host (D2H) copies and asynchronous inter-stage pipeline transfer.

Motivation

The current Qwen3-Omni pipeline has two synchronous bottlenecks:

  1. Intra-step D2H: hidden_states.to("cpu") blocks the default CUDA stream, preventing the next forward pass from overlapping with the copy.
  2. Inter-stage transfer: Each chunk traverses pickle → SHM write → SHM read → unpickle → sync H2D sequentially, dominating first-audio latency.
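The cost of keeping these steps sequential can be estimated with a simple two-stage latency model. This is a minimal sketch, not part of the RFC: it assumes fixed per-step compute and copy times, enough staging buffers that the copy never stalls compute, and hypothetical numbers chosen only for illustration.

```python
def sequential_ms(steps: int, compute_ms: float, copy_ms: float) -> float:
    """Each step waits for its own copy before the next compute starts."""
    return steps * (compute_ms + copy_ms)


def overlapped_ms(steps: int, compute_ms: float, copy_ms: float) -> float:
    """Copy of step i runs concurrently with compute of step i+1.

    Classic two-stage pipeline makespan: fill with one compute, then the
    slower of the two stages paces the remaining steps, then drain one copy.
    """
    if steps == 0:
        return 0.0
    return compute_ms + (steps - 1) * max(compute_ms, copy_ms) + copy_ms


# Hypothetical numbers: 4 steps, 10 ms compute, 2 ms copy per step.
seq = sequential_ms(4, 10.0, 2.0)
ovl = overlapped_ms(4, 10.0, 2.0)
print(f"sequential={seq} ms, overlapped={ovl} ms")
```

With these made-up inputs the copy time is hidden behind compute for every step except the last, which is the effect the async D2H path aims for.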

Proposed Design

  • AsyncD2HController with dedicated S_d2h stream + PinnedRingPool
  • BatchPayloadFuture / BatchPayloadSliceRef ownership model
  • StagePayloadHandle + AsyncH2DController for receiver-side H2D overlap
  • tensor_blob zero-copy wire protocol (msgspec header + SHM blob, replaces pickle)
  • Phased rollout (P0–P5-N) with per-phase env switches for independent enable/disable
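The header-plus-blob idea behind tensor_blob can be sketched with the standard library alone. This is an illustration under stated assumptions, not the RFC's wire format: `json` stands in for the msgspec header, a `bytearray` stands in for the SHM segment, and the field names (`typecode`, `shape`, `nbytes`) and the 4-byte length prefix are hypothetical.

```python
import json
import struct
from array import array

_LEN = struct.Struct("<I")  # hypothetical 4-byte little-endian header length


def pack_blob(values: array, shape: tuple) -> bytearray:
    """Write [header length][JSON header][raw payload bytes] into one buffer."""
    raw = memoryview(values).cast("B")  # byte view of the payload, no copy yet
    header = json.dumps(
        {"typecode": values.typecode, "shape": list(shape), "nbytes": raw.nbytes}
    ).encode("utf-8")
    segment = bytearray(_LEN.size + len(header) + raw.nbytes)
    _LEN.pack_into(segment, 0, len(header))
    segment[_LEN.size:_LEN.size + len(header)] = header
    segment[_LEN.size + len(header):] = raw  # the single copy, into the segment
    return segment


def unpack_blob(segment: bytearray):
    """Parse the small header, then return a typed view over the payload."""
    view = memoryview(segment)
    (header_len,) = _LEN.unpack_from(view, 0)
    start = _LEN.size
    header = json.loads(bytes(view[start:start + header_len]))
    body = view[start + header_len:start + header_len + header["nbytes"]]
    # cast() reinterprets the bytes in place; the payload is not copied again.
    return header, body.cast(header["typecode"])


segment = pack_blob(array("f", [1.0, 2.0, 3.0, 4.0]), (2, 2))
header, data = unpack_blob(segment)
print(header["shape"], list(data))
```

Only the small header goes through a serializer; the payload is read as a view over the segment, which is the property that lets this kind of layout avoid the pickle and unpickle steps.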

Performance Targets

| Stage | First-audio TTFA p50 | Improvement |
|---|---|---|
| Baseline (PR #3164) | 600 ms | |
| + Intra-step async D2H (P1a–P3) | 580 ms | -3% |
| + Inter-stage async (P4b) | 480 ms | -20% |
| + Zero-copy wire (P5-W) | 450 ms | -25% |
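The percentages appear to be cumulative against the 600 ms baseline rather than against the preceding row. A quick check, assuming that reading and rounding to whole percent, reproduces the stated figures:

```python
BASELINE_MS = 600

# Target p50 values copied from the performance targets above.
targets_ms = {
    "Intra-step async D2H (P1a–P3)": 580,
    "Inter-stage async (P4b)": 480,
    "Zero-copy wire (P5-W)": 450,
}


def improvement_pct(baseline_ms: float, target_ms: float) -> int:
    """Reduction relative to the baseline, rounded to a whole percent."""
    return round(100 * (baseline_ms - target_ms) / baseline_ms)


improvements = {
    name: improvement_pct(BASELINE_MS, ms) for name, ms in targets_ms.items()
}
for name, pct in improvements.items():
    print(f"{name}: -{pct}%")
```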

For text-only traffic, every phase is a no-op.

🤖 Generated with Claude Code

tzhouam and others added 24 commits May 2, 2026 19:23
…ependencies

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
…ngine_core_client.py, async_omni.py

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
…executable for command execution, modify test_orchestrator to accept kwargs in shutdown method, and add new parameters to OmniRequest for improved functionality.

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
The urllib.parse.quote() with safe=':/' was encoding the + in
git version strings (e.g. v0.20.2rc1+g54dc64d5d) as %2B, which
uv pip install cannot parse as a valid PEP 440 version.
Changed safe=':/' to safe=':/+' to keep + literal.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
…t and related functions

- Removed `prompt_is_token_ids` from `OmniRequest` and its usage in `_upgrade_to_omni_request` and `_apply_omni_final_stage_metadata`.
- Updated imports in `npu_ar_model_runner` and `npu_generation_model_runner` to directly import `preprocess_mamba`.
- Simplified import statements in `hunyuan_image3.py` by removing the try-except block for `SharedFusedMoE`.
- Added docstring to `Qwen3TTSTokenizerV2DecoderTransformerModel` for `cache_position` parameter.
- Enhanced GPU model runners to handle ngram GPU tensor updates and prevent scheduler output modifications.

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
… minutes and add null check for request output in Moss TTS Nano tests

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
# Conflicts:
#	vllm_omni/engine/async_omni_engine.py
#	vllm_omni/engine/orchestrator.py
…ndling in OmniRequest and schedulers. Add Gemma4Proposer to model runners and ensure proper shutdown of stage clients.
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Design document proposing async D2H controller, pinned ring pool,
BatchPayloadFuture ownership model, StagePayloadHandle, zero-copy
tensor_blob wire protocol, and phased rollout for Qwen3-Omni's
multi-stage inference pipeline (Thinker → Talker → Code2Wav).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ac4c7d7e7e

priority: int = 0,
data_parallel_rank: int | None = None,
reasoning_ended: bool | None = None,
reasoning_parser_kwargs: dict[str, Any] | None = None,

[P2] Forward reasoning parser kwargs to the engine

When async callers pass reasoning_parser_kwargs, this new parameter is silently ignored: generate() never forwards it to self.engine.add_request_async(...), and AsyncOmniEngine.add_request/_build_add_request_message still have no matching parameter, so the new propagation added in OmniRequest/OmniEngineCoreRequest is never exercised for normal async requests. This breaks requests that rely on per-request reasoning parser options while appearing to accept them.
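A minimal sketch of the forwarding this comment asks for. The classes below are hypothetical stand-ins, not the vllm_omni code: only the names `generate`, `add_request_async`, and `reasoning_parser_kwargs` come from the review, and the signatures are assumed for illustration.

```python
import asyncio
from typing import Any, Optional


class StubEngine:
    """Hypothetical stand-in for the engine; records what it receives."""

    def __init__(self) -> None:
        self.received: dict = {}

    async def add_request_async(
        self,
        request_id: str,
        prompt: str,
        *,
        reasoning_parser_kwargs: Optional[dict] = None,
    ) -> None:
        self.received[request_id] = reasoning_parser_kwargs


class StubFrontend:
    """Hypothetical async entry point that accepts the new parameter."""

    def __init__(self, engine: StubEngine) -> None:
        self.engine = engine

    async def generate(
        self,
        request_id: str,
        prompt: str,
        *,
        reasoning_parser_kwargs: Optional[dict] = None,
    ) -> None:
        # Pass the per-request options through instead of dropping them.
        await self.engine.add_request_async(
            request_id,
            prompt,
            reasoning_parser_kwargs=reasoning_parser_kwargs,
        )


engine = StubEngine()
options: dict = {"mode": "strict"}  # made-up option for the example
asyncio.run(
    StubFrontend(engine).generate("req-0", "hi", reasoning_parser_kwargs=options)
)
print(engine.received)
```

A test along these lines, asserting that the engine sees the same dictionary the caller passed, would catch the silent drop described above.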


Comment on lines +83 to +86
# NSpid:\t<container_pid>\t<host_pid>
parts = line.split()
if len(parts) >= 2:
    return int(parts[-1])

[P2] Use the outer PID from NSpid for NVML lookups

In a nested PID namespace, NSpid is reported from the outer namespace toward the inner one (for example <host_pid> <container_pid>), so returning parts[-1] gives the container PID rather than the host PID that NVML reports. In containers without --pid=host, this makes get_process_gpu_memory(..., pid=host_pid) look up the wrong PID and fall back to profiling instead of using process-scoped memory accounting.
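A minimal sketch of reading the outermost entry instead. The helper name and the sample status text are hypothetical. Per proc(5), the leftmost NSpid value is relative to the PID namespace of the process that mounted that procfs, so whether it equals the PID NVML reports depends on where /proc was mounted; treat this as an illustration of the field order, not a verified fix.

```python
from __future__ import annotations


def outer_pid_from_status(status_text: str) -> int | None:
    """Return the leftmost (outermost) NSpid entry, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            fields = line.split()[1:]  # drop the "NSpid:" label
            if fields:
                return int(fields[0])  # outermost namespace comes first
    return None


# Made-up /proc/<pid>/status excerpt: outer PID 4242, inner PID 17.
sample = "Name:\tpython\nNSpid:\t4242\t17\nThreads:\t1\n"
print(outer_pid_from_status(sample))
```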
