[Perf] Reduce IPC overhead for single-stage diffusion serving for Wan2.2 by SamitHuang · Pull Request #1715 · vllm-project/vllm-omni

SamitHuang · 2026-03-06T11:30:08Z

Purpose

This PR eliminates ~6.5 seconds of IPC serialization overhead for single-stage diffusion pipelines (e.g. Wan2.2 I2V/T2V) in online serving, reducing e2e latency from 37.5s to 31.0s (−17.5%) with zero impact on GPU computation.

Related issue: #1712

API Server (Orchestrator) --Hop3--> Stage Worker subprocess --Hop1--> GPU Workers
                          <--Hop3--                         <--Hop1--

For comparison, SGLang collocates its scheduler and GPU worker in a single process, resulting in only one process boundary and near-zero IPC overhead for the main pipeline.

Changes

Phase 1 - Inline diffusion mode (eliminates Hop3, saves ~4.5s)

When there is exactly one diffusion stage in async mode, OmniDiffusion is initialized directly in the orchestrator process instead of spawning a separate stage worker subprocess. This completely removes the Hop3 serialization path (pickle + mp.Queue/SHM) between the stage worker and orchestrator.

omni.py: Detects single-stage diffusion in _initialize_stages() and calls _init_inline_diffusion_engine() to set up the engine in-process, bypassing _start_stages() and _wait_for_stages_ready().
async_omni.py: Adds _generate_inline() which runs OmniDiffusion.generate() in a thread executor (non-blocking for asyncio) and yields results directly - no queues, no serialization.
GPU workers for tensor parallelism are still spawned by DiffusionExecutor as separate processes. Multi-stage pipelines (e.g. LLM + diffusion) fall back to the original subprocess path.
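
The inline path described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `FakeDiffusionEngine`, `generate_inline`, and `collect` are hypothetical names, and the engine is a stand-in that sleeps instead of running a model. It only shows the pattern of running a blocking `generate()` in a thread executor and yielding the result in-process.

```python
import asyncio
import time
from typing import AsyncIterator


class FakeDiffusionEngine:
    """Hypothetical stand-in for an engine that lives in the orchestrator process."""

    def generate(self, prompt: str) -> dict:
        time.sleep(0.05)  # placeholder for the blocking, GPU-bound call
        return {"prompt": prompt, "status": "done"}


async def generate_inline(engine: FakeDiffusionEngine, prompt: str) -> AsyncIterator[dict]:
    """Run the blocking generate() in a worker thread and yield its result."""
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor, so the event
    # loop keeps serving other coroutines while generate() blocks.
    result = await loop.run_in_executor(None, engine.generate, prompt)
    # Same process: the result is handed over by reference, with no pickle
    # round trip and no inter-process queue.
    yield result


async def collect(prompt: str) -> list[dict]:
    engine = FakeDiffusionEngine()
    return [out async for out in generate_inline(engine, prompt)]


if __name__ == "__main__":
    print(asyncio.run(collect("a rabbit")))
```

Because the result never leaves the process, the cost that remains on this path is the thread hand-off rather than serialization of the video tensor.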

Phase 2 - SHM tensor transfer (optimizes Hop1, saves ~2.1s)

Replaces pickle-based serialization of large tensors through MessageQueue (Hop1: GPU worker to scheduler) with POSIX shared memory:

data.py: Adds pack_diffusion_output_shm() / unpack_diffusion_output_shm() helpers that transfer tensors >1 MB via named SHM segments, sending only lightweight metadata through the queue.
diffusion_worker.py: Calls pack_diffusion_output_shm() before result_mq.enqueue().
scheduler.py: Calls unpack_diffusion_output_shm() after result_mq.dequeue().

Hop1 overhead drops from ~3.4s (pickle serialize + deserialize) to ~1.5s (memcpy to/from SHM).
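
As a rough sketch of the idea (not the actual `pack_diffusion_output_shm()` / `unpack_diffusion_output_shm()` helpers), the snippet below uses only the standard library and plain `bytes` instead of tensors. The names `pack_payload`, `unpack_payload`, and `SHM_THRESHOLD_BYTES` are assumptions made for illustration; the 1 MB threshold and the fallback to the regular queue path mirror what the description states.

```python
import pickle
from multiprocessing import shared_memory

SHM_THRESHOLD_BYTES = 1 << 20  # 1 MB, mirroring the threshold described above


def pack_payload(data: bytes) -> dict:
    """Return a small message; large payloads are parked in a named SHM segment."""
    if len(data) <= SHM_THRESHOLD_BYTES:
        return {"kind": "inline", "data": data}
    try:
        shm = shared_memory.SharedMemory(create=True, size=len(data))
    except OSError:
        # Fallback: ship the full payload through the queue as before.
        return {"kind": "inline", "data": data}
    shm.buf[: len(data)] = data  # single copy into shared memory
    name = shm.name
    shm.close()  # drop this mapping; the named segment itself persists
    return {"kind": "shm", "name": name, "nbytes": len(data)}


def unpack_payload(msg: dict) -> bytes:
    """Rebuild the payload, reading from SHM when the message points there."""
    if msg["kind"] == "inline":
        return msg["data"]
    shm = shared_memory.SharedMemory(name=msg["name"])
    try:
        return bytes(shm.buf[: msg["nbytes"]])  # single copy out
    finally:
        shm.close()
        shm.unlink()  # the receiver frees the segment


if __name__ == "__main__":
    big = bytes(2 * 1024 * 1024)
    msg = pack_payload(big)
    print(msg["kind"], len(pickle.dumps(msg)))
    assert unpack_payload(msg) == big
```

Only the metadata dictionary would travel through the queue, so what gets pickled stays tiny regardless of payload size. A real cross-process version needs extra care around segment lifetime and cleanup, which this sketch does not cover.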

Test Plan

export VLLM_TORCH_PROFILER_DIR=./profiles

MODEL="${MODEL:-Wan-AI/Wan2.2-I2V-A14B-Diffusers}"
PORT="${PORT:-3000}"
CACHE_BACKEND="${CACHE_BACKEND:-none}"
ENABLE_CACHE_DIT_SUMMARY="${ENABLE_CACHE_DIT_SUMMARY:-0}"

echo "Starting Wan2.2 I2V server..."
echo "Model: $MODEL"
echo "Port: $PORT"
echo "Cache backend: $CACHE_BACKEND"
if [ "$ENABLE_CACHE_DIT_SUMMARY" != "0" ]; then
    echo "Cache-DiT summary: enabled"
fi

CACHE_BACKEND_FLAG=""
if [ "$CACHE_BACKEND" != "none" ]; then
    CACHE_BACKEND_FLAG="--cache-backend $CACHE_BACKEND"
fi

vllm serve "$MODEL" --omni \
    --port "$PORT" \
    --num-gpus 2 \
    --tensor-parallel-size 2 \
    --log-stats \
    $CACHE_BACKEND_FLAG \
    $(if [ "$ENABLE_CACHE_DIT_SUMMARY" != "0" ]; then echo "--enable-cache-dit-summary"; fi)

#!/bin/bash

INPUT_IMAGE="${INPUT_IMAGE:-/home/user/rabbit.jpeg}"
OUTPUT_PATH="${OUTPUT_PATH:-wan22_i2v_output.mp4}"
PORT="${PORT:-3000}"

if [ ! -f "$INPUT_IMAGE" ]; then
    echo "Input image not found: $INPUT_IMAGE"
    exit 1
fi

curl -X POST http://localhost:${PORT}/v1/videos \
  -H "Accept: application/json" \
  -F "prompt=一只棕色野兔的正面特写镜头，采用低角度仰拍视角，营造亲密而庄严的视觉冲击。兔子一双圆润漆黑的大眼睛直视镜头深处，眼神中交织着野生动物的警觉与一丝难以言喻的温柔好奇，仿佛在与观者建立跨越物种的静默对话。它毛色呈现层次丰富的棕褐渐变，从浅奶油色腹部过渡到深棕背部，每根毛发纹理清晰可辨，在侧光下泛着丝绸般的光泽。细长洁白的胡须共有三对，随呼吸节奏微微颤动，偶尔因捕捉气流信息而轻轻摇摆。
两只标志性的长耳完全竖立，耳廓外侧覆盖短密棕毛，内侧则露出粉嫩的血管网络，薄如蝉翼的皮肤下血液流动隐约可见，耳朵以细微幅度不时转动，精准定位声源方向。背景是一片澄澈的蔚蓝天空，形态蓬松的白色积云以缓慢速度横向漂移，云影在兔子头顶交替变化，光线随之明暗流转。晴朗天气的明媚阳光从画面左上方45度角倾泻而下，在兔脸右侧形成柔和的伦勃朗式阴影，强化了面部立体感和皮毛质感。
兔子湿润的黑鼻子持续进行每秒三至四次的快速抽动，这是它们感知化学信号的本能动作，粉色三瓣嘴随之轻启，露出正在反刍的洁白门齿，下颌以稳定节奏左右研磨。摄影采用大光圈浅景深，焦点牢牢锁定在兔子双眼连线所在的焦平面，背景天空和远景绿色植被虚化成圆润的彩色光斑，前景几根嫩绿草叶闯入画面边缘，以缓慢弧线随风摇曳，暗示着和煦的春日微风。
整个场景弥漫着宁静致远的田园诗意，色彩温暖饱和，充满生命力。兔子在持续五秒的对视后，以典型 lagomorph 特征完成一次完整的瞬膜眨眼——第三眼睑从内侧横向滑过眼球，继而缓缓歪头向右十五度，这个行为在动物行为学中代表认知加工和好奇心表达，耳朵随之向同一方向倾斜，最终恢复正视姿态，胡须舒展，完成这段短暂而珍贵的自然纪录。" \
  -F "negative_prompt=色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走" \
  -F "input_reference=@${INPUT_IMAGE}" \
  -F "size=1280x720" \
  -F "seconds=5" \
  -F "fps=16" \
  -F "num_inference_steps=1" \
  -F "guidance_scale=3.5" \
  -F "guidance_scale_2=3.5" \
  -F "boundary_ratio=0.875" \
  -F "flow_shift=5.0" \
  -F "seed=42" | jq -r '.data[0].b64_json' | base64 -d > "${OUTPUT_PATH}"

echo "Saved video to ${OUTPUT_PATH}"
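
For readers who prefer to post-process the response in Python, the `jq -r '.data[0].b64_json' | base64 -d` step above corresponds roughly to the following. The function name `save_video_from_response` is hypothetical; the only assumption about the response is the `data[0].b64_json` field that the script itself reads.

```python
import base64
import json
from pathlib import Path


def save_video_from_response(body: str, output_path: str) -> int:
    """Decode data[0].b64_json from a JSON response body and write it to disk.

    Returns the number of bytes written.
    """
    payload = json.loads(body)
    video_bytes = base64.b64decode(payload["data"][0]["b64_json"])
    return Path(output_path).write_bytes(video_bytes)
```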

Test Result

Before

03-06 08:35:47 [pipeline_wan2_2_i2v.py:591] Pipeline stage timing summary: TextEncoding=61.34 ms, ImageEncoding=0.01 ms, LatentPreparation=4743.45 ms, Denoising=16448.06 ms (1 steps), Decoding=7608.98 ms, StagesSum=28861.84 ms, PipelineWall=28862.50 ms, Unaccounted=0.66 ms
[Stage-0] INFO 03-06 08:35:49 [diffusion_worker.py:380] Hop1 worker→scheduler: result_mq.enqueue took 1507.57 ms (rank 0)
[Stage-0] INFO 03-06 08:35:51 [scheduler.py:74] Hop1 scheduler←worker: result_mq.dequeue took 32314.12 ms (includes waiting for generation + enqueue serialization)
[Stage-0] INFO 03-06 08:35:51 [diffusion_engine.py:85] Generation completed successfully.
[Stage-0] INFO 03-06 08:35:52 [diffusion_engine.py:107] Post-processing completed in 0.7679 seconds
[Stage-0] INFO 03-06 08:35:52 [diffusion_engine.py:110] DiffusionEngine.step breakdown: preprocess=20.24 ms, add_req_and_wait=32333.22 ms, postprocess=767.92 ms, total=33121.94 ms
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502]
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] [Overall Summary]
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] +-----------------------------+------------+
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] | Field                       |      Value |
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] +-----------------------------+------------+
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] | e2e_requests                |          1 |
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] | e2e_wall_time_ms            | 38,255.226 |
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] | e2e_avg_time_per_request_ms | 38,255.226 |
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] | e2e_stage_0_wall_time_ms    | 38,252.134 |
(APIServer pid=379060) INFO 03-06 08:35:57 [stats.py:502] +-----------------------------+------------+

After

INFO 03-06 11:04:50 [pipeline_wan2_2_i2v.py:583] Pipeline stage timing summary: TextEncoding=59.23 ms, ImageEncoding=0.01 ms, LatentPreparation=4740.84 ms, Denoising=16440.97 ms (1 steps), Decoding=7608.52 ms, StagesSum=28849.57 ms, PipelineWall=28849.99 ms, Unaccounted=0.41 ms
INFO 03-06 11:04:51 [diffusion_worker.py:390] Hop1 worker→scheduler: shm_pack=1105.69 ms, mq.enqueue=0.13 ms, total=1105.82 ms (rank 0)
(APIServer pid=424673) INFO 03-06 11:04:52 [scheduler.py:82] Hop1 scheduler←worker: mq.dequeue=29977.74 ms, shm_unpack=408.63 ms (dequeue includes generation wait)
(APIServer pid=424673) INFO 03-06 11:04:52 [diffusion_engine.py:85] Generation completed successfully.
(APIServer pid=424673) INFO 03-06 11:04:52 [diffusion_engine.py:107] Post-processing completed in 0.5268 seconds
(APIServer pid=424673) INFO 03-06 11:04:52 [diffusion_engine.py:110] DiffusionEngine.step breakdown: preprocess=26.36 ms, add_req_and_wait=30405.40 ms, postprocess=526.84 ms, total=30958.92 ms
(APIServer pid=424673) INFO 03-06 11:04:52 [omni_diffusion.py:128] OmniDiffusion.generate total: 30961.16 ms
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502]
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] [Overall Summary]
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] +-----------------------------+------------+
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] | Field                       |      Value |
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] +-----------------------------+------------+
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] | e2e_requests                |          1 |
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] | e2e_wall_time_ms            | 30,962.290 |
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] | e2e_avg_time_per_request_ms | 30,962.290 |
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] | e2e_stage_0_wall_time_ms    | 30,962.273 |
(APIServer pid=424673) INFO 03-06 11:04:52 [stats.py:502] +-----------------------------+------------+

Wan2.2-I2V-A14B, TP=2, 1280x720, 5s@16fps, 1 denoising step:

| Metric (all values in ms) | Before | Phase 1 | Phase 1+2 |
|---|---|---|---|
| e2e_wall_time_ms | 37,546 | 33,078 | 30,962 |
| PipelineWall (GPU) | 28,881 | 28,895 | 28,850 |
| Hop1 (worker to scheduler) | 1,530 | 1,495 | 1,106 |
| Hop3 (stage worker IPC) | ~5,000 | 0 (eliminated) | 0 (eliminated) |
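
The headline savings follow directly from the e2e and PipelineWall rows of the table above. The short script below just redoes that arithmetic; the variable names are illustrative, and the "non-GPU remainder" is simply e2e minus PipelineWall, not a quantity reported by the server.

```python
# Values in milliseconds, copied from the table above.
E2E_MS = {"before": 37_546, "phase1": 33_078, "phase1+2": 30_962}
PIPELINE_WALL_MS = {"before": 28_881, "phase1": 28_895, "phase1+2": 28_850}

phase1_saving_ms = E2E_MS["before"] - E2E_MS["phase1"]    # 4468
phase2_saving_ms = E2E_MS["phase1"] - E2E_MS["phase1+2"]  # 2116
total_saving_ms = E2E_MS["before"] - E2E_MS["phase1+2"]   # 6584
relative_saving = total_saving_ms / E2E_MS["before"]      # about 0.175

# Time spent outside the GPU pipeline in each configuration.
non_gpu_ms = {k: E2E_MS[k] - PIPELINE_WALL_MS[k] for k in E2E_MS}

if __name__ == "__main__":
    print(f"Phase 1 saves {phase1_saving_ms} ms, Phase 2 saves {phase2_saving_ms} ms")
    print(f"Total: {total_saving_ms} ms ({relative_saving:.1%})")
    print(f"Non-GPU remainder per configuration: {non_gpu_ms}")
```

PipelineWall stays within a few tens of milliseconds across the three columns, so the whole reduction comes from the non-GPU remainder.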

Backward Compatibility

Multi-stage pipelines (LLM + diffusion) are unaffected - they continue to use the stage worker subprocess path.
The sync Omni class is unaffected (is_async guard).
SHM pack/unpack has graceful fallback: if SHM fails, it falls back to regular pickle through MessageQueue.
No API changes.
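
The compatibility rules above amount to a small decision: use the inline path only for async mode with exactly one diffusion stage, and keep the subprocess path otherwise. A hedged sketch of that rule follows; `StageConfig`, `ExecutionPath`, and `choose_execution_path` are hypothetical names and do not reflect how `_initialize_stages()` is actually written.

```python
from dataclasses import dataclass
from enum import Enum


class ExecutionPath(Enum):
    INLINE = "inline"          # engine lives in the orchestrator process
    SUBPROCESS = "subprocess"  # original stage worker subprocess path


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical, minimal description of one pipeline stage."""

    stage_type: str  # e.g. "llm" or "diffusion"


def choose_execution_path(stages: list[StageConfig], is_async: bool) -> ExecutionPath:
    """Pick the inline path only for async mode with a single diffusion stage."""
    single_diffusion = len(stages) == 1 and stages[0].stage_type == "diffusion"
    if is_async and single_diffusion:
        return ExecutionPath.INLINE
    return ExecutionPath.SUBPROCESS


if __name__ == "__main__":
    diffusion = StageConfig("diffusion")
    llm = StageConfig("llm")
    print(choose_execution_path([diffusion], is_async=True))
    print(choose_execution_path([llm, diffusion], is_async=True))
    print(choose_execution_path([diffusion], is_async=False))
```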

perf: reduce IPC overhead for single-stage diffusion serving (~6.5s, 17.5%)

Two optimizations that eliminate ~6.5s of IPC serialization overhead for single-stage diffusion pipelines (e.g. Wan2.2 I2V/T2V) in online serving mode:

Phase 1 – Inline diffusion (eliminate Hop3): When there is exactly one diffusion stage in async mode, initialize OmniDiffusion directly in the orchestrator process instead of spawning a stage worker subprocess. This removes the entire Hop3 serialization path (pickle + mp.Queue/SHM) between the stage worker and orchestrator. GPU workers for tensor parallelism are still spawned by DiffusionExecutor.

Phase 2 – SHM tensor transfer (optimize Hop1): Replace pickle-based serialization of large tensors through MessageQueue with POSIX shared memory. The worker copies tensor data into a named SHM segment and enqueues only lightweight metadata; the scheduler reconstructs the tensor from SHM. This reduces Hop1 overhead from ~3.4s to ~1.5s.

Measured on Wan2.2-I2V-A14B (TP=2, 1280x720, 5s@16fps, 1 step):
Before: e2e = 37.5s
Phase 1: e2e = 33.1s (−4.4s)
Phase 2: e2e = 31.0s (−2.1s)
Total: e2e = 31.0s (−6.5s, −17.5%)

Made-with: Cursor
Signed-off-by: samithuang <285365963@qq.com>

hsliuustc0106 · 2026-03-06T13:50:17Z

@wuhang2014 PTAL

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dd4468cbb2

hsliuustc0106 · 2026-03-06T23:33:55Z

there are many time log changes in this PR, I think we need to rm them

lishunyang12

left a question inline

lishunyang12 · 2026-03-09T01:55:17Z

the time.perf_counter() / logger.info("... took %.3f s") instrumentation is spread across ~11 files. are these meant to stay permanently? if so, might be better at DEBUG level to avoid spamming production logs.
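
One way to act on this suggestion is a small timing helper that logs at DEBUG by default, so the instrumentation stays available for tuning without appearing in production logs. This is a sketch under that assumption; `log_duration` and the logger name `perf_sketch` are hypothetical and not part of this PR.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("perf_sketch")


@contextmanager
def log_duration(label: str, level: int = logging.DEBUG):
    """Time the wrapped block and log the duration at the given level."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Skip the logging call entirely unless this level is enabled.
        if logger.isEnabledFor(level):
            logger.log(level, "%s took %.3f s", label, time.perf_counter() - start)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    with log_duration("result_mq.enqueue"):
        time.sleep(0.01)
```

With the logger left at INFO the helper emits nothing; switching that one logger to DEBUG brings the timings back during a tuning session.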

SamitHuang · 2026-03-09T04:28:13Z

> there are many time log changes in this PR, I think we need to rm them

rm the redundant logs

SamitHuang · 2026-03-09T04:28:24Z

> left a question inline

fixed

zhtmike · 2026-03-09T06:51:16Z

 from vllm_omni.platforms import current_omni_platform

 logger = logging.getLogger(__name__)
+DEBUG_PERF = False


suggest to drop DEBUG_PERF to keep code clean.

SamitHuang · 2026-03-09

We still need it for the ongoing perf tuning. It only adds logs for Wan2.2, which should be OK. We can remove it once perf tuning is finished.

SamitHuang added 4 commits March 6, 2026 10:14

add time cost log for different stages

26a4e8d

Signed-off-by: samithuang <285365963@qq.com>

reduce hop3 overhead

b3b70a8

Signed-off-by: samithuang <285365963@qq.com>

perf: reduce IPC overhead for single-stage diffusion serving (~6.5s, …

bf2ddb0

…17.5%) perf: reduce IPC overhead for single-stage diffusion serving (~6.5s, 17.5%)

SamitHuang added 2 commits March 6, 2026 23:43

Merge branch 'main' into main

735b2ca

Signed-off-by: Samit <285365963@qq.com>

fix conflicts

dd4468c

Signed-off-by: samithuang <285365963@qq.com>

SamitHuang marked this pull request as ready for review March 6, 2026 16:33

SamitHuang requested a review from hsliuustc0106 as a code owner March 6, 2026 16:33

chatgpt-codex-connector Bot reviewed Mar 6, 2026

Comment thread vllm_omni/diffusion/diffusion_engine.py

Comment thread vllm_omni/entrypoints/omni.py

lishunyang12 reviewed Mar 9, 2026

hsliuustc0106 added the ready label (label to trigger buildkite CI) Mar 9, 2026

SamitHuang added 4 commits March 9, 2026 03:25

rm redundancy

ff62a1e

Signed-off-by: samithuang <285365963@qq.com>

Merge branch 'main' into main

870963e

Signed-off-by: Samit <285365963@qq.com>

rm logs

5414a42

Signed-off-by: samithuang <285365963@qq.com>

fix inline

e3dec54

Signed-off-by: samithuang <285365963@qq.com>

fix ci

2cd9f9f

Signed-off-by: samithuang <285365963@qq.com>

zhtmike reviewed Mar 9, 2026

ZJY0516 reviewed Mar 9, 2026

Comment thread vllm_omni/diffusion/data.py Outdated

fix ci

172040a

Signed-off-by: samithuang <285365963@qq.com>

wtomin reviewed Mar 9, 2026

Comment thread vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py Outdated

wtomin reviewed Mar 9, 2026

Comment thread vllm_omni/diffusion/data.py Outdated

SamitHuang added 3 commits March 9, 2026 08:14

fix log

0a86fc5

Signed-off-by: samithuang <285365963@qq.com>

fix

9b9c597

Signed-off-by: samithuang <285365963@qq.com>

fix log

bda0f2d

Signed-off-by: samithuang <285365963@qq.com>

david6666666 approved these changes Mar 9, 2026

hsliuustc0106 merged commit 155856f into vllm-project:main Mar 9, 2026
6 of 7 checks passed

ApsarasX approved these changes Mar 9, 2026

hsliuustc0106 mentioned this pull request Mar 10, 2026

[Performance]: Online serving adds ~1.3s overhead vs offline for image generation #1783

Open

1 task

yJader added a commit to omni-nicelab/vllm-omni-batching that referenced this pull request Mar 11, 2026

Resolve upstream merge conflicts by integrating vllm-project#1715 dif…

88a6362

…fusion IPC design into scheduler refactor Signed-off-by: jader <yjader@foxmail.com>

asukaqaq-s mentioned this pull request Mar 11, 2026

[Feat] Support step-boundary abort in diffusion #1769

Merged

10 tasks

SamitHuang mentioned this pull request Apr 13, 2026

[Perf] Eliminate Hop 3 IPC overhead for single-stage diffusion via inline execution #2736

Merged

5 tasks

hsliuustc0106 merged 15 commits into vllm-project:main from SamitHuang:main
