[Perf] Reduce IPC overhead for single-stage diffusion serving for Wan2.2 #1715
Signed-off-by: samithuang <285365963@qq.com>
Two optimizations that eliminate ~6.5s of IPC serialization overhead for single-stage diffusion pipelines (e.g. Wan2.2 I2V/T2V) in online serving mode:

Phase 1 – Inline diffusion (eliminate Hop3): When there is exactly one diffusion stage in async mode, initialize OmniDiffusion directly in the orchestrator process instead of spawning a stage worker subprocess. This removes the entire Hop3 serialization path (pickle + mp.Queue/SHM) between the stage worker and orchestrator. GPU workers for tensor parallelism are still spawned by DiffusionExecutor.

Phase 2 – SHM tensor transfer (optimize Hop1): Replace pickle-based serialization of large tensors through MessageQueue with POSIX shared memory. The worker copies tensor data into a named SHM segment and enqueues only lightweight metadata; the scheduler reconstructs the tensor from SHM. This reduces Hop1 overhead from ~3.4s to ~1.5s.

Measured on Wan2.2-I2V-A14B (TP=2, 1280x720, 5s@16fps, 1 step):
- Before: e2e = 37.5s
- Phase 1: e2e = 33.1s (−4.4s)
- Phase 2: e2e = 31.0s (−2.1s)
- Total: e2e = 31.0s (−6.5s, −17.5%)

Made-with: Cursor
Signed-off-by: samithuang <285365963@qq.com>
perf: reduce IPC overhead for single-stage diffusion serving (~6.5s, 17.5%)
@wuhang2014 PTAL
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: samithuang <285365963@qq.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dd4468cbb2
There are many timing-log changes in this PR; I think we need to remove them.
lishunyang12 left a comment
Left a question inline.
the |
Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: samithuang <285365963@qq.com>
Remove the redundant logs.
Fixed.
from vllm_omni.platforms import current_omni_platform

logger = logging.getLogger(__name__)
DEBUG_PERF = False
Suggest dropping DEBUG_PERF to keep the code clean.
We still need it for the ongoing perf tuning. It only adds logs for Wan2.2, which should be OK. We can remove it once perf tuning is finished.
…2.2 (vllm-project#1715) Signed-off-by: samithuang <285365963@qq.com> Signed-off-by: Samit <285365963@qq.com> Signed-off-by: lishunyang <lishunyang12@163.com>
…fusion IPC design into scheduler refactor Signed-off-by: jader <yjader@foxmail.com>
…2.2 (vllm-project#1715) Signed-off-by: samithuang <285365963@qq.com> Signed-off-by: Samit <285365963@qq.com>
Purpose
This PR eliminates ~6.5 seconds of IPC serialization overhead for single-stage diffusion pipelines (e.g. Wan2.2 I2V/T2V) in online serving, reducing e2e latency from 37.5s to 31.0s (−17.5%) with zero impact on GPU computation.
Related issue: #1712
For comparison, SGLang collocates its scheduler and GPU worker in a single process, resulting in only one process boundary and near-zero IPC overhead for the main pipeline.
Changes
Phase 1 - Inline diffusion mode (eliminates Hop3, saves ~4.4s)
When there is exactly one diffusion stage in async mode, `OmniDiffusion` is initialized directly in the orchestrator process instead of spawning a separate stage worker subprocess. This completely removes the Hop3 serialization path (pickle + `mp.Queue`/SHM) between the stage worker and orchestrator.
- `omni.py`: Detects single-stage diffusion in `_initialize_stages()` and calls `_init_inline_diffusion_engine()` to set up the engine in-process, bypassing `_start_stages()` and `_wait_for_stages_ready()`.
- `async_omni.py`: Adds `_generate_inline()`, which runs `OmniDiffusion.generate()` in a thread executor (non-blocking for asyncio) and yields results directly - no queues, no serialization.
- GPU workers for tensor parallelism are still spawned by `DiffusionExecutor` as separate processes. Multi-stage pipelines (e.g. LLM + diffusion) fall back to the original subprocess path.

Phase 2 - SHM tensor transfer (optimizes Hop1, saves ~2.1s)
Replaces pickle-based serialization of large tensors through `MessageQueue` (Hop1: GPU worker to scheduler) with POSIX shared memory:
- `data.py`: Adds `pack_diffusion_output_shm()` / `unpack_diffusion_output_shm()` helpers that transfer tensors >1 MB via named SHM segments, sending only lightweight metadata through the queue.
- `diffusion_worker.py`: Calls `pack_diffusion_output_shm()` before `result_mq.enqueue()`.
- `scheduler.py`: Calls `unpack_diffusion_output_shm()` after `result_mq.dequeue()`.

Hop1 overhead drops from ~3.4s (pickle serialize + deserialize) to ~1.5s (memcpy to/from SHM).
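
To make the Phase 2 idea concrete, here is a minimal sketch of threshold-based shared-memory transfer using only Python's standard library. It is not the PR's `pack_diffusion_output_shm()` / `unpack_diffusion_output_shm()` implementation: the names `pack_payload`, `unpack_payload`, and `SHM_THRESHOLD_BYTES` are hypothetical, the payload is raw bytes rather than a tensor (a real version would also carry dtype and shape in the metadata), and it assumes POSIX semantics where a segment persists until it is unlinked.

```python
import pickle
from multiprocessing import shared_memory

# Hypothetical constant mirroring the ">1 MB" rule in the description.
SHM_THRESHOLD_BYTES = 1 << 20


def pack_payload(data: bytes) -> dict:
    """Producer side: return a small, picklable descriptor for `data`."""
    if len(data) <= SHM_THRESHOLD_BYTES:
        # Small payloads stay inline and travel through the queue as usual.
        return {"kind": "inline", "data": data}
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    shm.buf[: len(data)] = data  # one memcpy into the named segment
    name = shm.name
    shm.close()  # drop the producer's mapping; on POSIX the segment persists
    # The segment may be rounded up to a page size, so record the real length.
    return {"kind": "shm", "name": name, "nbytes": len(data)}


def unpack_payload(desc: dict) -> bytes:
    """Consumer side: rebuild the payload, then release the segment."""
    if desc["kind"] == "inline":
        return desc["data"]
    shm = shared_memory.SharedMemory(name=desc["name"])
    try:
        return bytes(shm.buf[: desc["nbytes"]])  # one memcpy out of the segment
    finally:
        shm.close()
        shm.unlink()  # the consumer owns cleanup in this sketch


if __name__ == "__main__":
    frames = bytes(2 * 1024 * 1024)  # stand-in for a 2 MiB tensor buffer
    desc = pack_payload(frames)
    wire = pickle.dumps(desc)  # only this small descriptor crosses the queue
    assert unpack_payload(pickle.loads(wire)) == frames
```

Only the descriptor is pickled, so queue traffic stays small regardless of payload size. In this sketch the consumer unlinks the segment after reading, which means each segment can be read exactly once.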
Test Plan
Test Result
Before
After
Wan2.2-I2V-A14B, TP=2, 1280x720, 5s@16fps, 1 denoising step:
Backward Compatibility
- … `Omni` class is unaffected (`is_async` guard).
- … `MessageQueue`.
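
The `is_async` guard and the Phase 1 inline path can be illustrated with a rough sketch. Everything here is an assumption made for illustration: `StageConfig`, `should_inline`, `FakeDiffusionEngine`, and `generate_inline` are hypothetical names, not the code in `omni.py` or `async_omni.py`. The sketch only shows the pattern described in the PR: inline mode applies when the engine is async and there is exactly one diffusion stage, and the blocking `generate()` call runs in a thread executor so the event loop stays responsive.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StageConfig:
    """Hypothetical stand-in for a stage description."""
    stage_type: str


def should_inline(stages: list, is_async: bool) -> bool:
    """Inline only for async mode with exactly one diffusion stage."""
    return is_async and len(stages) == 1 and stages[0].stage_type == "diffusion"


class FakeDiffusionEngine:
    """Hypothetical engine whose generate() blocks the calling thread."""

    def generate(self, prompt: str) -> str:
        return f"video for {prompt!r}"


async def generate_inline(engine: FakeDiffusionEngine, prompt: str):
    """Run the blocking call in the default thread pool and yield the result."""
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(None, engine.generate, prompt)
    yield result  # handed back directly: no queue, no serialization


async def collect(prompt: str) -> list:
    """Small helper that drains the async generator into a list."""
    engine = FakeDiffusionEngine()
    return [out async for out in generate_inline(engine, prompt)]


if __name__ == "__main__":
    print(should_inline([StageConfig("diffusion")], is_async=True))
    print(asyncio.run(collect("a cat")))
```

A multi-stage list such as `[StageConfig("llm"), StageConfig("diffusion")]`, or any call with `is_async=False`, makes `should_inline` return `False`, which corresponds to falling back to the original subprocess path.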