[docs] RFC: Async D2H + Stage Pipeline Async Transfer for Qwen3-Omni #3508

Closed

tzhouam wants to merge 24 commits into main

Collaborator

tzhouam commented May 11, 2026

Summary

This PR adds an architectural RFC proposing two optimizations for Qwen3-Omni's multi-stage inference pipeline: asynchronous device-to-host (D2H) copies and asynchronous inter-stage pipeline transfers.

Motivation

The current Qwen3-Omni pipeline has two synchronous bottlenecks:

Intra-step D2H: hidden_states.to("cpu") is a synchronous copy issued on the default CUDA stream. It blocks the host until the transfer finishes, so the next forward pass cannot overlap with the copy.
Inter-stage transfer: Each chunk traverses pickle → SHM write → SHM read → unpickle → sync H2D sequentially, dominating first-audio latency.

Proposed Design

AsyncD2HController with dedicated S_d2h stream + PinnedRingPool
BatchPayloadFuture / BatchPayloadSliceRef ownership model
StagePayloadHandle + AsyncH2DController for receiver-side H2D overlap
tensor_blob zero-copy wire protocol (msgspec header + SHM blob, replaces pickle)
Phased rollout (P0–P5-N) with per-phase env switches for independent enable/disable
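
The header-plus-blob idea behind the tensor_blob item can be illustrated with a minimal sketch. It uses only the Python standard library: json stands in for the msgspec header, a flat array stands in for a tensor, and write_blob / read_blob are hypothetical names. The RFC's actual wire layout is not shown on this page, so treat every field name below as an assumption.

```python
import json
from array import array
from multiprocessing import shared_memory


def write_blob(values: array, shape):
    """Copy raw element bytes into a new SHM segment; return (header, segment).

    Only the small header is serialized. The payload bytes are never pickled.
    """
    raw = memoryview(values).cast("B")  # byte view of the array, no copy
    shm = shared_memory.SharedMemory(create=True, size=len(raw))
    shm.buf[: len(raw)] = raw  # single copy into shared memory
    header = json.dumps(
        {
            "shm": shm.name,              # hypothetical field names
            "typecode": values.typecode,
            "shape": list(shape),
            "nbytes": len(raw),
        }
    ).encode()
    return header, shm


def read_blob(header: bytes):
    """Attach to the segment named in the header and view the payload in place."""
    meta = json.loads(header)
    shm = shared_memory.SharedMemory(name=meta["shm"])
    try:
        # The segment may be rounded up to a page size, so slice to nbytes.
        with shm.buf[: meta["nbytes"]] as window, window.cast(meta["typecode"]) as typed:
            data = typed.tolist()  # copied out only so the demo can close the segment
    finally:
        shm.close()
    return data, meta["shape"]
```

The sender keeps ownership of the segment and must call close() and unlink() on it once the receiver is done; a real receiver would hand the typed view to the H2D copy instead of materializing a list.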

Performance Targets

Configuration	First-audio latency (TTFA) p50	Improvement vs. baseline
Baseline (PR #3164)	600 ms	—
+ Intra-step async D2H (P1a–P3)	580 ms	-3%
+ Inter-stage async (P4b)	480 ms	-20%
+ Zero-copy wire (P5-W)	450 ms	-25%
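
The improvement column appears to be the relative change against the 600 ms baseline, rounded to a whole percent. A short check of that arithmetic, with improvement_pct as a hypothetical helper name:

```python
def improvement_pct(baseline_ms: float, new_ms: float) -> float:
    """Relative latency change versus the baseline, in percent (negative is faster)."""
    return (new_ms - baseline_ms) / baseline_ms * 100.0


BASELINE_MS = 600
targets_ms = {"P1a–P3": 580, "P4b": 480, "P5-W": 450}

# Round to whole percent, as the table does.
rows = {phase: round(improvement_pct(BASELINE_MS, ms)) for phase, ms in targets_ms.items()}
```

Under this reading each row is cumulative relative to the baseline rather than relative to the previous row.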

Text-only traffic is 100% no-op across all phases.

🤖 Generated with Claude Code

tzhouam and others added 24 commits

May 2, 2026 19:23


          [CI] Update Dockerfile for vLLM with CUDA 13.0 build and additional dependencies

eb831d2

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>


          ci: bump wheel URLs in Dockerfile.ci

415fd8f

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>


          Merge remote-tracking branch 'origin/main' into dev/vllm-align

609104f


          dirty_worktree: commit rebase modifications to Dockerfile.ci, stage_engine_core_client.py, async_omni.py

e5383cf

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>


          Merge remote-tracking branch 'origin/main' into dev/vllm-align

b0e6885


          Enhance tests and request handling: Update test_serve_cli to use sys.executable for command execution, modify test_orchestrator to accept kwargs in shutdown method, and add new parameters to OmniRequest for improved functionality.

284f59d

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>


          Merge remote-tracking branch 'origin/main' into dev/vllm-align

f280d36


          rebase: align vllm-omni with vLLM 54dc64d5d399

5b09f52

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>


          fix: prevent URL-encoding of + in wheel version string

a0eeff0

The urllib.parse.quote() with safe=':/' was encoding the + in
git version strings (e.g. v0.20.2rc1+g54dc64d5d) as %2B, which
uv pip install cannot parse as a valid PEP 440 version.
Changed safe=':/' to safe=':/+' to keep + literal.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
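
The encoding behaviour described in this commit message can be reproduced with a small sketch. The URL below is a made-up placeholder; only the version string comes from the commit message.

```python
from urllib.parse import quote

# Hypothetical wheel URL containing a PEP 440 local version with '+'.
url = "https://wheels.example/v0.20.2rc1+g54dc64d5d/pkg.whl"

encoded_old = quote(url, safe=":/")   # '+' is not in safe, so it is percent-encoded
encoded_new = quote(url, safe=":/+")  # '+' is kept literal
```

With safe=':/' the plus sign becomes %2B, which is the form the commit message reports uv pip install rejecting; adding '+' to safe leaves the URL unchanged.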


          fix: also prevent re-encoding of % in wheel URL

5c8c5f8

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>


          Merge remote-tracking branch 'origin/main' into dev/vllm-align

eaebe81


          Update Dockerfile.ci wheel URL bump

c76f0d0

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>


          Merge remote-tracking branch 'origin/main' into dev/vllm-align

110ba72

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>


          Merge remote-tracking branch 'origin/main' into dev/vllm-align

d595c16


          Refactor: Remove unused prompt_is_token_ids parameter from OmniRequest and related functions

364e6ea

- Removed `prompt_is_token_ids` from `OmniRequest` and its usage in `_upgrade_to_omni_request` and `_apply_omni_final_stage_metadata`.
- Updated imports in `npu_ar_model_runner` and `npu_generation_model_runner` to directly import `preprocess_mamba`.
- Simplified import statements in `hunyuan_image3.py` by removing the try-except block for `SharedFusedMoE`.
- Added docstring to `Qwen3TTSTokenizerV2DecoderTransformerModel` for `cache_position` parameter.
- Enhanced GPU model runners to handle ngram GPU tensor updates and prevent scheduler output modifications.

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>


          Merge remote-tracking branch 'origin/main' into dev/vllm-align

5a014da


          rebase: align vllm-omni with vLLM 132765e35606

7189aed

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>


          fix: address CI failures (debug round 1)

b11db7d

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>


          chore: increase timeout for Diffusion Model CPU offloading test to 30 minutes and add null check for request output in Moss TTS Nano tests

39b2a91

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>


          Merge remote-tracking branch 'origin/main' into dev/vllm-align

923d133

# Conflicts:
#	vllm_omni/engine/async_omni_engine.py
#	vllm_omni/engine/orchestrator.py


          Update Dockerfile for precompiled wheel commit and enhance request handling in OmniRequest and schedulers. Add Gemma4Proposer to model runners and ensure proper shutdown of stage clients.

f53025c


          rebase: align vllm-omni with vLLM 1acd67a795eb

d0e1fe6

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>


          fix: address CI failures (debug round 1)

fb6e9d4

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>


          [docs] Add RFC for Async D2H + Stage Pipeline Async Transfer

ac4c7d7

Design document proposing async D2H controller, pinned ring pool,
BatchPayloadFuture ownership model, StagePayloadHandle, zero-copy
tensor_blob wire protocol, and phased rollout for Qwen3-Omni's
multi-stage inference pipeline (Thinker → Talker → Code2Wav).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

tzhouam requested review from Isotr0py, ZeldaHuang, gcanlin, linyueqian, princepride and yuanheng-zhao as code owners

May 11, 2026 13:27

tzhouam requested review from Gaohan123, RuixiangMa, SamitHuang, ZJY0516, congw729, david6666666, hsliuustc0106, wtomin, yenuo26 and ywang96 as code owners

May 11, 2026 13:27

tzhouam closed this

chatgpt-codex-connector Bot reviewed

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ac4c7d7e7e

vllm_omni/entrypoints/async_omni.py

                       priority: int = 0,
                       data_parallel_rank: int | None = None,
                       reasoning_ended: bool | None = None,
+                      reasoning_parser_kwargs: dict[str, Any] | None = None,

chatgpt-codex-connector Bot May 11, 2026

Forward reasoning parser kwargs to the engine

When async callers pass reasoning_parser_kwargs, this new parameter is silently ignored: generate() never forwards it to self.engine.add_request_async(...), and AsyncOmniEngine.add_request/_build_add_request_message still have no matching parameter, so the new propagation added in OmniRequest/OmniEngineCoreRequest is never exercised for normal async requests. This breaks requests that rely on per-request reasoning parser options while appearing to accept them.
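
The failure mode in this comment, a parameter that is accepted but never forwarded, can be shown with a minimal sketch. FakeEngine and Frontend are hypothetical stand-ins, not the real AsyncOmni or AsyncOmniEngine classes, and their signatures are assumptions for illustration only.

```python
import asyncio
from typing import Any, Dict, List, Optional


class FakeEngine:
    """Stand-in engine that records what each request carried."""

    def __init__(self) -> None:
        self.received: List[Dict[str, Any]] = []

    async def add_request_async(
        self,
        prompt: str,
        *,
        reasoning_parser_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.received.append(
            {"prompt": prompt, "reasoning_parser_kwargs": reasoning_parser_kwargs}
        )


class Frontend:
    """Stand-in entrypoint with a dropping and a forwarding variant of generate()."""

    def __init__(self, engine: FakeEngine) -> None:
        self.engine = engine

    async def generate_dropping(
        self, prompt: str, reasoning_parser_kwargs: Optional[Dict[str, Any]] = None
    ) -> None:
        # Accepts the argument but never passes it on: the caller sees no error.
        await self.engine.add_request_async(prompt)

    async def generate_forwarding(
        self, prompt: str, reasoning_parser_kwargs: Optional[Dict[str, Any]] = None
    ) -> None:
        # Forwards the argument explicitly so the engine can act on it.
        await self.engine.add_request_async(
            prompt, reasoning_parser_kwargs=reasoning_parser_kwargs
        )
```

A test that asserts on what the engine actually received, rather than on the entrypoint's signature, would catch the dropping variant.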

vllm_omni/worker/gpu_memory_utils.py

Comment on lines +83 to +86

+                                  # NSpid:\t<container_pid>\t<host_pid>
+                                  parts = line.split()
+                                  if len(parts) >= 2:
+                                      return int(parts[-1])

chatgpt-codex-connector Bot May 11, 2026

Use the outer PID from NSpid for NVML lookups

In a nested PID namespace, NSpid is reported from the outer namespace toward the inner one (for example <host_pid> <container_pid>), so returning parts[-1] gives the container PID rather than the host PID that NVML reports. In containers without --pid=host, this makes get_process_gpu_memory(..., pid=host_pid) look up the wrong PID and fall back to profiling instead of using process-scoped memory accounting.
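
A minimal parsing sketch of the ordering described in this comment. It assumes the status text was read from a procfs mounted in the outer namespace, where the leftmost NSpid value is the outermost PID and the rightmost is the innermost. The function names are hypothetical and are not the ones in gpu_memory_utils.py.

```python
from typing import List, Optional


def nspid_fields(status_text: str) -> List[int]:
    """Return the NSpid values from /proc/<pid>/status text, outermost first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


def outer_pid(status_text: str) -> Optional[int]:
    """Leftmost NSpid entry: the PID as seen from the procfs mount's namespace."""
    fields = nspid_fields(status_text)
    return fields[0] if fields else None


def inner_pid(status_text: str) -> Optional[int]:
    """Rightmost NSpid entry: the PID inside the innermost namespace."""
    fields = nspid_fields(status_text)
    return fields[-1] if fields else None


# Example status excerpt for a process that is PID 4242 outside and 17 inside.
sample = "Name:\tpython\nPid:\t4242\nNSpid:\t4242\t17\n"
```

When the line has a single value, both helpers return the same PID, which is the case for a process that is not in a nested PID namespace.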

Reviewers

chatgpt-codex-connector[bot] left review comments

Awaiting requested review (code owners): gcanlin, linyueqian, ZeldaHuang, princepride, yuanheng-zhao, Isotr0py, SamitHuang, wtomin, ZJY0516, RuixiangMa, david6666666, yenuo26, congw729, hsliuustc0106, Gaohan123, ywang96

Labels

None yet