[docs] RFC: Async D2H + Stage Pipeline Async Transfer for Qwen3-Omni#3508
tzhouam wants to merge 24 commits into
…ependencies Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
…ngine_core_client.py, async_omni.py Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
…executable for command execution, modify test_orchestrator to accept kwargs in shutdown method, and add new parameters to OmniRequest for improved functionality. Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
The urllib.parse.quote() with safe=':/' was encoding the + in git version strings (e.g. v0.20.2rc1+g54dc64d5d) as %2B, which uv pip install cannot parse as a valid PEP 440 version. Changed safe=':/' to safe=':/+' to keep + literal. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
…t and related functions - Removed `prompt_is_token_ids` from `OmniRequest` and its usage in `_upgrade_to_omni_request` and `_apply_omni_final_stage_metadata`. - Updated imports in `npu_ar_model_runner` and `npu_generation_model_runner` to directly import `preprocess_mamba`. - Simplified import statements in `hunyuan_image3.py` by removing the try-except block for `SharedFusedMoE`. - Added docstring to `Qwen3TTSTokenizerV2DecoderTransformerModel` for `cache_position` parameter. - Enhanced GPU model runners to handle ngram GPU tensor updates and prevent scheduler output modifications. Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
… minutes and add null check for request output in Moss TTS Nano tests Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
# Conflicts: # vllm_omni/engine/async_omni_engine.py # vllm_omni/engine/orchestrator.py
…ndling in OmniRequest and schedulers. Add Gemma4Proposer to model runners and ensure proper shutdown of stage clients.
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Design document proposing async D2H controller, pinned ring pool, BatchPayloadFuture ownership model, StagePayloadHandle, zero-copy tensor_blob wire protocol, and phased rollout for Qwen3-Omni's multi-stage inference pipeline (Thinker → Talker → Code2Wav). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
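The `urllib.parse.quote()` change described in the commit list above can be checked with the standard library alone. The sketch below uses a made-up URL (only the version string `0.20.2rc1+g54dc64d5d` comes from the commit message) and shows how the `safe` argument decides whether `+` survives quoting.

```python
from urllib.parse import quote

# Hypothetical wheel URL for illustration; the host and path are not from the PR.
url = "https://example.invalid/wheels/vllm-0.20.2rc1+g54dc64d5d-py3-none-any.whl"

# With safe=":/" the '+' is not in the safe set, so it is percent-encoded.
before = quote(url, safe=":/")

# Adding '+' to the safe set keeps the local version separator literal.
after = quote(url, safe=":/+")

print(before)
print(after)
```

With `safe=":/+"` the quoted string is identical to the input here, because every other character in this URL is already left untouched by `quote()`.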
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ac4c7d7e7e
    priority: int = 0,
    data_parallel_rank: int | None = None,
    reasoning_ended: bool | None = None,
    reasoning_parser_kwargs: dict[str, Any] | None = None,
Forward reasoning parser kwargs to the engine
When async callers pass reasoning_parser_kwargs, this new parameter is silently ignored: generate() never forwards it to self.engine.add_request_async(...), and AsyncOmniEngine.add_request/_build_add_request_message still have no matching parameter, so the new propagation added in OmniRequest/OmniEngineCoreRequest is never exercised for normal async requests. This breaks requests that rely on per-request reasoning parser options while appearing to accept them.
    # NSpid:\t<container_pid>\t<host_pid>
    parts = line.split()
    if len(parts) >= 2:
        return int(parts[-1])
Use the outer PID from NSpid for NVML lookups
In a nested PID namespace, NSpid is reported from the outer namespace toward the inner one (for example <host_pid> <container_pid>), so returning parts[-1] gives the container PID rather than the host PID that NVML reports. In containers without --pid=host, this makes get_process_gpu_memory(..., pid=host_pid) look up the wrong PID and fall back to profiling instead of using process-scoped memory accounting.
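A minimal sketch of the parsing this comment asks for, assuming the input is the text of a `/proc/<pid>/status` file. The function name `outermost_pid_from_status` and the sample values are hypothetical and are not taken from the PR. Per proc(5), the leftmost `NSpid` entry is the PID in the namespace of the procfs instance being read, and later entries belong to successively nested namespaces, so the sketch takes the first value after the label rather than the last.

```python
from typing import Optional


def outermost_pid_from_status(status_text: str) -> Optional[int]:
    """Return the leftmost (outermost) PID listed on the NSpid line, if any."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            parts = line.split()
            # parts[0] is the "NSpid:" label; parts[1] is the outermost PID,
            # parts[-1] would be the innermost (container-side) PID.
            if len(parts) >= 2:
                return int(parts[1])
    return None


# Made-up sample: outer PID 12345, PID 7 inside the nested namespace.
sample = "Name:\tpython\nNSpid:\t12345\t7\n"
print(outermost_pid_from_status(sample))
```

Note that a process reading its own `/proc` from inside a container without `--pid=host` may see only a single `NSpid` value, in which case no parsing choice recovers the host PID.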
Summary
This PR adds an architectural RFC proposing async D2H + inter-stage pipeline async transfer optimizations for Qwen3-Omni's multi-stage inference pipeline.
Motivation
The current Qwen3-Omni pipeline has two synchronous bottlenecks:
- hidden_states.to("cpu") blocks the default CUDA stream, preventing the next forward pass from overlapping with the copy.
- Inter-stage transfer runs pickle → SHM write → SHM read → unpickle → sync H2D sequentially, which dominates first-audio latency.
Proposed Design
- AsyncD2HController with a dedicated S_d2h stream + PinnedRingPool
- BatchPayloadFuture / BatchPayloadSliceRef ownership model
- StagePayloadHandle + AsyncH2DController for receiver-side H2D overlap
- tensor_blob zero-copy wire protocol (msgspec header + SHM blob, replaces pickle)
Performance Targets
Text-only traffic is 100% no-op across all phases.
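To make the "header + blob" idea from the proposed design concrete, here is a rough sketch of such a framing. It is not the RFC's actual format: `json` stands in for the msgspec header, a `bytearray` stands in for the SHM segment, and the names `pack_tensor_blob` / `unpack_tensor_blob` and the header fields are assumptions made for illustration. The point is only that the receiver reads a small header and then takes a `memoryview` slice over the raw bytes instead of unpickling a copy.

```python
import json
import struct

# 4-byte little-endian prefix holding the header length (assumed layout).
_LEN = struct.Struct("<I")


def pack_tensor_blob(dtype: str, shape: tuple, payload: bytes) -> bytearray:
    """Write [header_len][json header][raw payload] into one buffer."""
    header = json.dumps(
        {"dtype": dtype, "shape": list(shape), "nbytes": len(payload)}
    ).encode("utf-8")
    buf = bytearray(_LEN.size + len(header) + len(payload))
    _LEN.pack_into(buf, 0, len(header))
    buf[_LEN.size:_LEN.size + len(header)] = header
    buf[_LEN.size + len(header):] = payload
    return buf


def unpack_tensor_blob(buf):
    """Return (header dict, memoryview over the payload) without copying the payload."""
    view = memoryview(buf)
    (header_len,) = _LEN.unpack_from(view, 0)
    start = _LEN.size
    header = json.loads(bytes(view[start:start + header_len]))
    data_start = start + header_len
    # Slicing a memoryview yields another view onto the same memory.
    data = view[data_start:data_start + header["nbytes"]]
    return header, data


# Four float32 values standing in for a 2x2 hidden-state tensor.
payload = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
buf = pack_tensor_blob("float32", (2, 2), payload)
header, data = unpack_tensor_blob(buf)
print(header, struct.unpack("<4f", data))
```

In a real pipeline the payload view would presumably be wrapped as a tensor over pinned or shared memory before the H2D copy; that step is outside this sketch.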
🤖 Generated with Claude Code