[Perf] Reduce IPC overhead for single-stage diffusion serving for Wan2.2#1715

Merged
hsliuustc0106 merged 15 commits into
vllm-project:mainfrom
SamitHuang:main
Mar 9, 2026

@SamitHuang
Collaborator

@SamitHuang SamitHuang commented Mar 6, 2026

Purpose

This PR eliminates ~6.5 seconds of IPC serialization overhead for single-stage diffusion pipelines (e.g. Wan2.2 I2V/T2V) in online serving, reducing e2e latency from 37.5s to 31.0s (−17.5%) with zero impact on GPU computation.

Related issue: #1712

API Server (Orchestrator) --Hop3--> Stage Worker subprocess --Hop1--> GPU Workers
                          <--Hop3--                         <--Hop1--

For comparison, SGLang collocates its scheduler and GPU worker in a single process, resulting in only one process boundary and near-zero IPC overhead for the main pipeline.
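To get a feel for why each process boundary can cost seconds, the sketch below estimates the size of one video result and times a pickle round trip on a small buffer. The tensor layout (uncompressed RGB, 4 bytes per value, 5 s x 16 fps = 80 frames) is an assumption for illustration and is not stated in this PR; the function names are hypothetical.

```python
import pickle
import time


def frame_payload_bytes(width: int, height: int, frames: int,
                        channels: int = 3, bytes_per_value: int = 4) -> int:
    """Size of an uncompressed video tensor (assumed layout, not from the PR)."""
    return width * height * frames * channels * bytes_per_value


def pickle_roundtrip_seconds(num_bytes: int) -> float:
    """Time one serialize + deserialize of a buffer, as one process hop would."""
    payload = bytearray(num_bytes)
    start = time.perf_counter()
    restored = pickle.loads(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))
    elapsed = time.perf_counter() - start
    assert len(restored) == num_bytes
    return elapsed


# 1280x720, 80 frames (assumed), float32 RGB (assumed)
size = frame_payload_bytes(1280, 720, frames=5 * 16)
print(f"{size / 1e6:.0f} MB per request")  # prints: 885 MB per request
print(f"{pickle_roundtrip_seconds(8 * 1024 * 1024) * 1000:.2f} ms for an 8 MB round trip")
```

Scaling the measured 8 MB figure to the estimated payload gives only a rough lower bound, since a real hop also pays for queue and pipe transfer.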

Changes

Phase 1 - Inline diffusion mode (eliminates Hop3, saves ~4.5s)

When there is exactly one diffusion stage in async mode, OmniDiffusion is initialized directly in the orchestrator process instead of spawning a separate stage worker subprocess. This completely removes the Hop3 serialization path (pickle + mp.Queue/SHM) between the stage worker and orchestrator.

  • omni.py: Detects single-stage diffusion in _initialize_stages() and calls _init_inline_diffusion_engine() to set up the engine in-process, bypassing _start_stages() and _wait_for_stages_ready().
  • async_omni.py: Adds _generate_inline() which runs OmniDiffusion.generate() in a thread executor (non-blocking for asyncio) and yields results directly - no queues, no serialization.
  • GPU workers for tensor parallelism are still spawned by DiffusionExecutor as separate processes. Multi-stage pipelines (e.g. LLM + diffusion) fall back to the original subprocess path.
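The inline path described above can be sketched as follows. This is a minimal illustration of running a blocking `generate()` call in a thread executor and yielding the result in-process, not the code in `async_omni.py`; `FakeDiffusionEngine` and `generate_inline` are hypothetical names.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from typing import AsyncIterator


class FakeDiffusionEngine:
    """Hypothetical stand-in for a blocking diffusion engine."""

    def generate(self, prompt: str) -> dict:
        time.sleep(0.05)  # pretend this is GPU work that blocks the thread
        return {"prompt": prompt, "frames": 80}


async def generate_inline(
    engine: FakeDiffusionEngine,
    prompt: str,
    executor: ThreadPoolExecutor,
) -> AsyncIterator[dict]:
    """Run the blocking call in a worker thread and yield the result in-process."""
    loop = asyncio.get_running_loop()
    # The event loop stays free to serve other requests while generate() blocks.
    result = await loop.run_in_executor(executor, engine.generate, prompt)
    # Same process, so the result is handed over by reference: no pickle, no queue.
    yield result


async def main() -> list:
    engine = FakeDiffusionEngine()
    with ThreadPoolExecutor(max_workers=1) as executor:
        return [out async for out in generate_inline(engine, "a rabbit", executor)]


outputs = asyncio.run(main())
print(outputs)
```

Because the engine lives in the orchestrator process, the only remaining process boundary in this sketch would be the one to the GPU workers.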

Phase 2 - SHM tensor transfer (optimizes Hop1, saves ~2.1s)

Replaces pickle-based serialization of large tensors through MessageQueue (Hop1: GPU worker to scheduler) with POSIX shared memory:

  • data.py: Adds pack_diffusion_output_shm() / unpack_diffusion_output_shm() helpers that transfer tensors >1 MB via named SHM segments, sending only lightweight metadata through the queue.
  • diffusion_worker.py: Calls pack_diffusion_output_shm() before result_mq.enqueue().
  • scheduler.py: Calls unpack_diffusion_output_shm() after result_mq.dequeue().

Hop1 overhead drops from ~3.4s (pickle serialize + deserialize) to ~1.5s (memcpy to/from SHM).
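The pack/unpack idea can be sketched with the standard library alone. The version below works on raw bytes rather than tensors, and `pack_payload` / `unpack_payload` are hypothetical names, not the helpers in `data.py`; the 1 MB threshold and the fallback to the regular queue path follow the description in this PR.

```python
import pickle
from multiprocessing import shared_memory

SHM_THRESHOLD_BYTES = 1 << 20  # 1 MB, matching the threshold named in the PR


def pack_payload(data: bytes) -> dict:
    """Return a small message; large payloads travel via a named SHM segment."""
    if len(data) <= SHM_THRESHOLD_BYTES:
        return {"kind": "inline", "data": data}
    try:
        shm = shared_memory.SharedMemory(create=True, size=len(data))
    except OSError:
        # Fallback: let the queue pickle the whole payload as before.
        return {"kind": "inline", "data": data}
    shm.buf[: len(data)] = data  # single memcpy into the segment
    name = shm.name
    shm.close()  # drop this handle; the segment lives on until unlink()
    # The segment may be rounded up to a page size, so record the true length.
    return {"kind": "shm", "name": name, "size": len(data)}


def unpack_payload(message: dict) -> bytes:
    """Rebuild the payload on the receiving side and free the segment."""
    if message["kind"] == "inline":
        return message["data"]
    shm = shared_memory.SharedMemory(name=message["name"])
    try:
        return bytes(shm.buf[: message["size"]])  # single memcpy out
    finally:
        shm.close()
        shm.unlink()  # the receiver owns cleanup in this sketch


big = bytes(range(256)) * (2 * SHM_THRESHOLD_BYTES // 256)
msg = pack_payload(big)
wire = pickle.dumps(msg)  # what a queue would serialize: metadata only
restored = unpack_payload(pickle.loads(wire))
print(msg["kind"], len(wire), restored == big)
```

Only the segment name and length cross the queue, so the pickled message stays small regardless of payload size.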

Test Plan

export VLLM_TORCH_PROFILER_DIR=./profiles

MODEL="${MODEL:-Wan-AI/Wan2.2-I2V-A14B-Diffusers}"
PORT="${PORT:-3000}"
CACHE_BACKEND="${CACHE_BACKEND:-none}"
ENABLE_CACHE_DIT_SUMMARY="${ENABLE_CACHE_DIT_SUMMARY:-0}"

echo "Starting Wan2.2 I2V server..."
echo "Model: $MODEL"
echo "Port: $PORT"
echo "Cache backend: $CACHE_BACKEND"
if [ "$ENABLE_CACHE_DIT_SUMMARY" != "0" ]; then
    echo "Cache-DiT summary: enabled"
fi

CACHE_BACKEND_FLAG=""
if [ "$CACHE_BACKEND" != "none" ]; then
    CACHE_BACKEND_FLAG="--cache-backend $CACHE_BACKEND"
fi

vllm serve "$MODEL" --omni \
    --port "$PORT" \
    --num-gpus 2 \
    --tensor-parallel-size 2 \
    --log-stats \
    $CACHE_BACKEND_FLAG \
    $(if [ "$ENABLE_CACHE_DIT_SUMMARY" != "0" ]; then echo "--enable-cache-dit-summary"; fi)

#!/bin/bash

INPUT_IMAGE="${INPUT_IMAGE:-/home/user/rabbit.jpeg}"
OUTPUT_PATH="${OUTPUT_PATH:-wan22_i2v_output.mp4}"
PORT="${PORT:-3000}"

if [ ! -f "$INPUT_IMAGE" ]; then
    echo "Input image not found: $INPUT_IMAGE"
    exit 1
fi

curl -X POST http://localhost:${PORT}/v1/videos \
  -H "Accept: application/json" \
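  -F "prompt=A frontal close-up of a brown wild hare, shot from a low, upward-looking angle that creates an intimate yet majestic visual impact. The hare's large, round, jet-black eyes gaze straight into the lens, its look mixing a wild animal's alertness with an indescribable trace of gentle curiosity, as if opening a silent dialogue across species with the viewer. Its coat shows a richly layered brown gradient, shifting from a light cream belly to a dark brown back, with the texture of every hair clearly visible and a silky sheen under the side light. It has three pairs of long, slender white whiskers that quiver slightly with the rhythm of its breathing and occasionally sway gently as they pick up information from the air currents.
Its two signature long ears stand fully erect. The outer side of each ear is covered in short, dense brown fur, while the inner side reveals a tender pink network of blood vessels, with the flow of blood faintly visible beneath skin as thin as a cicada's wing. The ears rotate now and then in tiny movements, precisely locating the direction of sounds. The background is a clear azure sky, where fluffy white cumulus clouds drift slowly sideways, cloud shadows alternate above the hare's head, and the light brightens and dims accordingly. Bright sunshine on a clear day pours down from the upper left of the frame at a 45-degree angle, forming a soft Rembrandt-style shadow on the right side of the hare's face and enhancing the three-dimensionality of the face and the texture of the fur.
The hare's moist black nose twitches rapidly and continuously, three to four times per second, an instinctive action for sensing chemical signals. Its pink, three-lobed mouth opens slightly in step, revealing white incisors as it chews its cud, the lower jaw grinding from side to side at a steady rhythm. The shot uses a wide aperture and shallow depth of field, with focus locked firmly on the focal plane along the line between the hare's eyes. The background sky and distant green vegetation blur into round, colorful bokeh, and a few tender green blades of grass in the foreground intrude at the edge of the frame, swaying in slow arcs with the wind and hinting at a gentle spring breeze.
The whole scene is suffused with a tranquil, pastoral poetry, with warm, saturated colors full of vitality. After holding eye contact for five seconds, the hare completes one full nictitating-membrane blink, a typical lagomorph trait, in which the third eyelid slides horizontally across the eyeball from the inner side. It then slowly tilts its head fifteen degrees to the right, a behavior that in ethology represents cognitive processing and an expression of curiosity, with its ears tilting in the same direction. Finally it returns to a forward-facing pose with its whiskers relaxed, completing this brief and precious natural record." \
  -F "negative_prompt=Garish tones, overexposed, static, blurry details, subtitles, style, artwork, painting, picture, still, overall gray cast, worst quality, low quality, JPEG compression artifacts, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, motionless frame, cluttered background, three legs, many people in the background, walking backwards" \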
  -F "input_reference=@${INPUT_IMAGE}" \
  -F "size=1280x720" \
  -F "seconds=5" \
  -F "fps=16" \
  -F "num_inference_steps=1" \
  -F "guidance_scale=3.5" \
  -F "guidance_scale_2=3.5" \
  -F "boundary_ratio=0.875" \
  -F "flow_shift=5.0" \
  -F "seed=42" | jq -r '.data[0].b64_json' | base64 -d > "${OUTPUT_PATH}"

echo "Saved video to ${OUTPUT_PATH}"

Test Result

Before

03-06 08:35:47 [pipeline_wan2_2_i2v.py:591] Pipeline stage timing summary: TextEncoding=61.34 ms, ImageEncoding=0.01 ms, LatentPreparation=4743.45 ms, Denoising=16448.06 ms (1 steps), Decoding=7608.98 ms, StagesSum=28861.84 ms, PipelineWall=28862.50 ms, Unaccounted=0.66 ms
[Stage-0] INFO 03-06 08:35:49 [diffusion_worker.py:380] Hop1 worker→scheduler: result_mq.enqueue took 1507.57 ms (rank 0)
[Stage-0] INFO 03-06 08:35:51 [scheduler.py:74] Hop1 scheduler←worker: result_mq.dequeue took 32314.12 ms (includes waiting for generation + enqueue serialization)
[Stage-0] INFO 03-06 08:35:51 [diffusion_engine.py:85] Generation completed successfully.
[Stage-0] INFO 03-06 08:35:52 [diffusion_engine.py:107] Post-processing completed in 0.7679 seconds
[Stage-0] INFO 03-06 08:35:52 [diffusion_engine.py:110] DiffusionEngine.step breakdown: preprocess=20.24 ms, add_req_and_wait=32333.22 ms, postprocess=767.92 ms, total=33121.94 ms
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502]
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] [Overall Summary]
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] +-----------------------------+------------+
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] | Field                       |      Value |
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] +-----------------------------+------------+
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] | e2e_requests                |          1 |
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] | e2e_wall_time_ms            | 38,255.226 |
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] | e2e_avg_time_per_request_ms | 38,255.226 |
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] | e2e_stage_0_wall_time_ms    | 38,252.134 |
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] +-----------------------------+------------+

After

INFO 03-06 11:04:50 [pipeline_wan2_2_i2v.py:583] Pipeline stage timing summary: TextEncoding=59.23 ms, ImageEncoding=0.01 ms, LatentPreparation=4740.84 ms, Denoising=16440.97 ms (1 steps), Decoding=7608.52 ms, StagesSum=28849.57 ms, PipelineWall=28849.99 ms, Unaccounted=0.41 ms
INFO 03-06 11:04:51 [diffusion_worker.py:390] Hop1 worker→scheduler: shm_pack=1105.69 ms, mq.enqueue=0.13 ms, total=1105.82 ms (rank 0)
(APIServer pid=424673) INFO 03-06 11:04:52 [scheduler.py:82] Hop1 scheduler←worker: mq.dequeue=29977.74 ms, shm_unpack=408.63 ms (dequeue includes generation wait)
(APIServer pid=424673) INFO 03-06 11:04:52 [diffusion_engine.py:85] Generation completed successfully.
(APIServer pid=424673) INFO 03-06 11:04:52 [diffusion_engine.py:107] Post-processing completed in 0.5268 seconds
(APIServer pid=424673) INFO 03-06 11:04:52 [diffusion_engine.py:110] DiffusionEngine.step breakdown: preprocess=26.36 ms, add_req_and_wait=30405.40 ms, postprocess=526.84 ms, total=30958.92 ms
(APIServer pid=424673) INFO 03-06 11:04:52 [omni_diffusion.py:128] OmniDiffusion.generate total: 30961.16 ms
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502]
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] [Overall Summary]
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] +-----------------------------+------------+
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] | Field                       |      Value |
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] +-----------------------------+------------+
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] | e2e_requests                |          1 |
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] | e2e_wall_time_ms            | 30,962.290 |
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] | e2e_avg_time_per_request_ms | 30,962.290 |
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] | e2e_stage_0_wall_time_ms    | 30,962.273 |
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] +-----------------------------+------------+

Wan2.2-I2V-A14B, TP=2, 1280x720, 5s@16fps, 1 denoising step:

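| Metric                     | Before    | Phase 1        | Phase 1+2      |
|----------------------------|-----------|----------------|----------------|
| e2e_wall_time_ms           | 37,546 ms | 33,078 ms      | 30,962 ms      |
| PipelineWall (GPU)         | 28,881 ms | 28,895 ms      | 28,850 ms      |
| Hop1 (worker to scheduler) | 1,530 ms  | 1,495 ms       | 1,106 ms       |
| Hop3 (stage worker IPC)    | ~5,000 ms | 0 (eliminated) | 0 (eliminated) |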
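As a quick check, the headline savings follow directly from the e2e_wall_time_ms figures reported above; the variable names are only for illustration.

```python
# e2e_wall_time_ms values as reported for Before, Phase 1, and Phase 1+2
before_ms, phase1_ms, phase12_ms = 37_546, 33_078, 30_962

phase1_saving = before_ms - phase1_ms    # Hop3 removed
phase2_saving = phase1_ms - phase12_ms   # Hop1 moved to SHM
total_saving = before_ms - phase12_ms
reduction_pct = 100 * total_saving / before_ms

print(phase1_saving, phase2_saving, total_saving, f"{reduction_pct:.1f}%")
# prints: 4468 2116 6584 17.5%
```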

Backward Compatibility

  • Multi-stage pipelines (LLM + diffusion) are unaffected - they continue to use the stage worker subprocess path.
  • The sync Omni class is unaffected (is_async guard).
  • SHM pack/unpack has graceful fallback: if SHM fails, it falls back to regular pickle through MessageQueue.
  • No API changes.

Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: samithuang <285365963@qq.com>
Two optimizations that eliminate ~6.5s of IPC serialization overhead
for single-stage diffusion pipelines (e.g. Wan2.2 I2V/T2V) in online
serving mode:

Phase 1 – Inline diffusion (eliminate Hop3):
When there is exactly one diffusion stage in async mode, initialize
OmniDiffusion directly in the orchestrator process instead of spawning
a stage worker subprocess. This removes the entire Hop3 serialization
path (pickle + mp.Queue/SHM) between the stage worker and orchestrator.
GPU workers for tensor parallelism are still spawned by DiffusionExecutor.

Phase 2 – SHM tensor transfer (optimize Hop1):
Replace pickle-based serialization of large tensors through MessageQueue
with POSIX shared memory. The worker copies tensor data into a named SHM
segment and enqueues only lightweight metadata; the scheduler reconstructs
the tensor from SHM. This reduces Hop1 overhead from ~3.4s to ~1.5s.

Measured on Wan2.2-I2V-A14B (TP=2, 1280x720, 5s@16fps, 1 step):
  Before:  e2e = 37.5s
  Phase 1: e2e = 33.1s  (−4.4s)
  Phase 2: e2e = 31.0s  (−2.1s)
  Total:   e2e = 31.0s  (−6.5s, −17.5%)

Made-with: Cursor

Signed-off-by: samithuang <285365963@qq.com>
…17.5%)

perf: reduce IPC overhead for single-stage diffusion serving (~6.5s, 17.5%)
@hsliuustc0106
Collaborator

@wuhang2014 PTAL

Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: samithuang <285365963@qq.com>
@SamitHuang SamitHuang marked this pull request as ready for review March 6, 2026 16:33

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dd4468cbb2

Comment thread vllm_omni/diffusion/diffusion_engine.py
Comment thread vllm_omni/entrypoints/omni.py
@hsliuustc0106
Collaborator

there are many time log changes in this PR, I think we need to rm them

Collaborator

@lishunyang12 lishunyang12 left a comment

left a question inline

@lishunyang12
Collaborator

the time.perf_counter() / logger.info("... took %.3f s") instrumentation is spread across ~11 files. are these meant to stay permanently? if so, might be better at DEBUG level to avoid spamming production logs.
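One way to keep such timing instrumentation without spamming production logs is a small DEBUG-level context manager. This is a sketch of the suggestion above, not code from this PR; the `timed` helper and the `perf_sketch` logger name are hypothetical.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("perf_sketch")


@contextmanager
def timed(label: str, level: int = logging.DEBUG):
    """Log how long the wrapped block took, at DEBUG level by default."""
    if not logger.isEnabledFor(level):
        yield  # skip the clock reads entirely when the level is off
        return
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.log(level, "%s took %.2f ms", label, elapsed_ms)


# Usage: nothing is emitted unless the logger is set to DEBUG.
with timed("result_mq.enqueue"):
    time.sleep(0.01)
```

With this shape, the same call sites serve both perf tuning and normal serving, and the log level decides whether the timings appear.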

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Mar 9, 2026
Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: samithuang <285365963@qq.com>
@SamitHuang
Collaborator Author

there are many time log changes in this PR, I think we need to rm them

rm the redundant logs

@SamitHuang
Collaborator Author

left a question inline

fixed

Signed-off-by: samithuang <285365963@qq.com>
from vllm_omni.platforms import current_omni_platform

logger = logging.getLogger(__name__)
DEBUG_PERF = False
Contributor

suggest to drop DEBUG_PERF to keep code clean.

Collaborator Author

@SamitHuang SamitHuang Mar 9, 2026

We still need it for the ongoing perf tuning. It only adds logs for Wan2.2, which should be OK. We can remove it once perf tuning is finished.

Comment thread vllm_omni/diffusion/data.py Outdated
Signed-off-by: samithuang <285365963@qq.com>
Comment thread vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py Outdated
Comment thread vllm_omni/diffusion/data.py Outdated
Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: samithuang <285365963@qq.com>
@hsliuustc0106 hsliuustc0106 merged commit 155856f into vllm-project:main Mar 9, 2026
6 of 7 checks passed
lishunyang12 pushed a commit to lishunyang12/vllm-omni that referenced this pull request Mar 11, 2026
…2.2 (vllm-project#1715)

Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
yJader added a commit to omni-nicelab/vllm-omni-batching that referenced this pull request Mar 11, 2026
…fusion IPC design into scheduler refactor

Signed-off-by: jader <yjader@foxmail.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…2.2 (vllm-project#1715)

Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>