
[Feature] Add support for Pipeline Parallel and integrate it into Wan 2.2 #2322

Open

hadipash wants to merge 28 commits into vllm-project:main from hadipash:wan_pipe_parallel

Conversation

@hadipash (Contributor) commented Mar 30, 2026

Purpose

  • Add support for Pipeline Parallelism through the PipelineParallelMixin class (a generic sketch of the pipeline-parallel relay pattern follows this list).
  • Integrate pipeline parallelism into Wan 2.2.
  • Delete the unused Wan22TI2VPipeline to avoid confusion.
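
For context, here is a minimal, generic sketch of the pipeline-parallel relay that the mixin encapsulates. It is an illustration only, not the PipelineParallelMixin API: the function name, stage layout, and use of plain `dist.send`/`dist.recv` are assumptions.

```python
# Illustration only: a generic pipeline-parallel denoising relay, not the
# PipelineParallelMixin added by this PR. Stage layout, group handles, and
# buffer shapes are hypothetical.
import torch.distributed as dist


def run_pp_stage(blocks, hidden_states, pp_rank, pp_world_size):
    """Run this rank's slice of transformer blocks and relay activations."""
    if pp_rank > 0:
        # Non-first stages receive activations from the previous stage into a
        # pre-allocated buffer (the real code also exchanges shape metadata).
        dist.recv(hidden_states, src=pp_rank - 1)
    for block in blocks:
        hidden_states = block(hidden_states)
    if pp_rank < pp_world_size - 1:
        # Non-last stages forward activations downstream and return nothing;
        # only the last stage hands the predicted noise to the scheduler.
        dist.send(hidden_states, dst=pp_rank + 1)
        return None
    return hidden_states
```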

Test Plan

pytest tests/diffusion/distributed/test_pipeline_parallel.py

Test Result

| Model | Mode | Single Device | PP=2 | PP=4 |
|---|---|---|---|---|
| Wan2.2-TI2V-5B-Diffusers | T2V | t2v_5B_single.mp4 | t2v_5B_pp2.mp4 | t2v_5B_pp4.mp4 |
| | Time | 46.3 s | 37.9 s | 34.1 s |
| | Memory | 52.0 GB | 47.4 GB | 44.9 GB |
| Wan2.2-T2V-A14B-Diffusers | T2V | t2v_14B_single.mp4 | t2v_14B_pp2.mp4 | t2v_14B_pp4.mp4 |
| | Time | 208.9 s | 171.3 s | 152.5 s |
| | Memory | 76.6 GB | 50.3 GB | 37.2 GB |
| Wan2.2-I2V-A14B-Diffusers | I2V | i2v_14B_single.mp4 | i2v_14B_pp2.mp4 | i2v_14B_pp4.mp4 |
| | Time | 256.1 s | 211.1 s | 186.9 s |
| | Memory | 76.4 GB | 50.1 GB | 37.0 GB |

Hybrid Parallel with TP and SP

| Model | Mode | Single Device | PP=2 SP=2 | PP=2 TP=2 |
|---|---|---|---|---|
| Wan2.2-TI2V-5B-Diffusers | T2V | t2v_5B_pp1_cfg1.mp4 | t2v_5B_pp2_cfg1_sp2.mp4 | t2v_5B_pp2_cfg1_tp2.mp4 |
| | Time | 45.0 s | 34.4 s | 31.7 s |
| | Memory | 45.4 GB | 43.9 GB | 39.4 GB |
Ascend NPU

Tested on Atlas A2 with Wan2.2-TI2V-5B-Diffusers.

| Parallel Mode | Result | Time |
|---|---|---|
| Single | t2v_5B_single.mp4 | 151.6 s |
| PP=4 | t2v_5B_pp4.mp4 | 100.2 s |
| PP=2 SP=2 | t2v_5B_pp2_sp2.mp4 | 75.2 s |
| PP=2 TP=2 | t2v_5B_pp2_tp2.mp4 | 102.4 s |
| PP=4 CFG=2 | t2v_5B_pp4_cfg2.mp4 | 81.1 s |
| PP=8 | t2v_5B_pp8.mp4 | 90.9 s |


@hadipash (Contributor, Author) commented Apr 1, 2026

@zzhang-fr added async latents transfer to the first rank.

@lishunyang12 (Collaborator) left a comment
Left a couple of comments, mostly around the async send lifetime.

Comment thread vllm_omni/diffusion/distributed/pp_parallel.py Outdated
Comment thread vllm_omni/diffusion/distributed/pipeline_parallel.py
Comment thread vllm_omni/diffusion/distributed/pipeline_parallel.py
Comment thread vllm_omni/diffusion/distributed/autoencoders/distributed_vae_executor.py Outdated
@hadipash hadipash marked this pull request as ready for review April 10, 2026 06:27
@hadipash hadipash requested a review from hsliuustc0106 as a code owner April 10, 2026 06:27
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@hadipash hadipash force-pushed the wan_pipe_parallel branch from cce7bff to eb9d9ac Compare April 10, 2026 08:10
@hadipash (Contributor, Author)

Rebased onto the main branch.

hadipash added 2 commits May 4, 2026 10:45
Comment thread vllm_omni/diffusion/distributed/parallel_state.py
Comment thread docs/user_guide/diffusion/parallelism/pipeline_parallel.md
Comment thread vllm_omni/diffusion/cache/cache_dit_backend.py
Comment thread vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py Outdated
Comment thread vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py Outdated
Comment thread vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py Outdated
hadipash added 5 commits May 13, 2026 09:53
@wtomin (Collaborator) left a comment
All of my comments were addressed. LGTM.

Would also like to hear thoughts from @lishunyang12 @RuixiangMa @SamitHuang @david6666666 @yuanheng-zhao @hsliuustc0106.

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label May 15, 2026
@hsliuustc0106 (Collaborator) left a comment

Review Summary

Well-structured PR with clean mixin design and thorough test coverage. The PP communication pattern (async isend/irecv with Gloo metadata + NCCL tensors) follows existing vLLM-Omni conventions. A few items to address — see inline comments.
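
For readers unfamiliar with that pattern, below is a minimal sketch of the metadata-over-Gloo / tensors-over-NCCL split. The `isend_tensor_dict` / `irecv_tensor_dict` names come from this PR, but the signatures, group handles, and handling of the returned work objects here are assumptions, not the actual implementation.

```python
# Sketch only: assumed signatures and group handles, not the PR's code.
import torch
import torch.distributed as dist


def isend_tensor_dict(tensors, dst, cpu_group, device_group):
    """Send shapes/dtypes over Gloo, then launch async NCCL sends."""
    meta = [(key, tuple(t.shape), t.dtype) for key, t in tensors.items()]
    dist.send_object_list([meta], dst=dst, group=cpu_group)  # small, blocking
    return [dist.isend(t, dst=dst, group=device_group) for t in tensors.values()]


def irecv_tensor_dict(device, src, cpu_group, device_group):
    """Receive metadata, allocate buffers, then post async receives."""
    holder = [None]
    dist.recv_object_list(holder, src=src, group=cpu_group)
    tensors, works = {}, []
    for key, shape, dtype in holder[0]:
        buf = torch.empty(shape, dtype=dtype, device=device)
        works.append(dist.irecv(buf, src=src, group=device_group))
        tensors[key] = buf
    return tensors, works  # callers must wait on works before reading tensors
```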

```python
if current_omni_platform.is_npu():
    assert pipeline_parallel_size == 1, "Current pipefusion is not ready for NPU"

dit_parallel_size = (
```

The NPU guard was removed here. The original FIXME specifically mentions that NPU async P2P differs from CUDA in torch. Since this PR replaces the old pipefusion with a new implementation using isend_tensor_dict/irecv_tensor_dict, was this validated on NPU? If not, it would be safer to keep the guard (possibly updated for the new PP implementation) to avoid silent hangs on NPU hardware.

```python
# Previously:
# FIXME: Since the async p2p communication operation of NPU is not same as cuda in torch,
# the pipefusion is not ready for npu yet
if current_omni_platform.is_npu():
    assert pipeline_parallel_size == 1, "Current pipefusion is not ready for NPU"
```

```diff
         Non-last Pipeline Parallel stages return IntermediateTensors instead of final noise tensors.
         """
-        return self.transformer(*args, **kwargs)[0]
+        result = self.transformer(*args, **kwargs)
```

This changes the return contract of CFGParallelMixin.predict_noise globally — previously it always did result[0], now it passes through non-tuple results. This is necessary for PP (where IntermediateTensors is returned on non-last stages), but it subtly changes behavior for all pipelines inheriting CFGParallelMixin.

Consider adding a brief comment here explaining the PP motivation, e.g.:

```python
# Support Pipeline Parallel: non-last PP stages return IntermediateTensors
# instead of a tuple, so we must handle both cases.
```

This makes the intent clear to future maintainers who might wonder why the simple result[0] was changed.
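
A short sketch of the dual-return handling described above; the method body and surrounding class are assumptions, apart from the `IntermediateTensors` pass-through that the comment references.

```python
def predict_noise(self, *args, **kwargs):
    result = self.transformer(*args, **kwargs)
    # Support Pipeline Parallel: non-last PP stages return IntermediateTensors
    # instead of a tuple, so handle both cases.
    if isinstance(result, tuple):
        return result[0]
    return result  # IntermediateTensors from a non-last PP stage
```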


```python
def _wrapped_vae_decode(self) -> None:
    orig_decode = self.vae.decode
    vae_distributed = hasattr(self.vae, "is_distributed_enabled") and self.vae.is_distributed_enabled()
```

vae_distributed is captured once at __init__ time and baked into the closure. If the VAE's distributed state changes after construction (e.g. dynamic enable/disable), the wrapper would use stale state.

This is likely fine given current usage, but worth noting with a comment or assertion. Alternatively, you could check self.vae.is_distributed_enabled() on each decode call if the state could change.
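
One option, sketched under the names visible in the quoted snippet (the wrapper body is otherwise hypothetical), is to re-evaluate the flag on every call instead of baking it into the closure:

```python
def _wrapped_vae_decode(self) -> None:
    orig_decode = self.vae.decode

    def decode(*args, **kwargs):
        # Re-check per call so a VAE that toggles distributed mode after
        # __init__ is not served stale state.
        vae_distributed = (
            hasattr(self.vae, "is_distributed_enabled")
            and self.vae.is_distributed_enabled()
        )
        # ... branch on vae_distributed exactly as the current wrapper does ...
        return orig_decode(*args, **kwargs)

    self.vae.decode = decode
```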

```python
    all_kwargs = [positive_kwargs if get_classifier_free_guidance_rank() == 0 else negative_kwargs]
else:
    # Sequential CFG (or no CFG): this PP pipeline handles all branches.
    all_kwargs = [positive_kwargs] + ([negative_kwargs] if do_true_cfg else [])
```

For sequential CFG (cfg_parallel_size=1) with PP, this doubles the communication volume per denoising step — each step now runs two full forward chains through the PP pipeline (positive + negative branch).

This is correct, but it would be worth adding a note in the design doc about this perf characteristic so users are aware that PP + sequential CFG has higher communication cost than PP + CFG-Parallel.

```python
)
# Patch embedding only on first PP stage; other stages receive hidden_states via P2P
if is_pipeline_first_stage():
    self.patch_embedding = Conv3dLayer(
```

is_pipeline_first_stage() is called here at __init__ time. This requires initialize_model_parallel() to have been called before the transformer is constructed. The ordering is presumably guaranteed by the pipeline initialization flow, but a defensive assertion would help catch misordering:

```python
assert get_pipeline_parallel_world_size() > 0 or is_pipeline_first_stage(), \
    "initialize_model_parallel must be called before WanTransformer3DModel construction"
```

Alternatively, consider documenting this prerequisite in the transformer's docstring.

```python
self.proj_out = nn.Linear(inner_dim, out_channels * math.prod(patch_size))
# 4. Output norm & projection — only on the last PP stage
if is_pipeline_last_stage():
    self.norm_out = AdaLayerNorm(inner_dim, elementwise_affine=False, eps=eps)
```

Same note as above — is_pipeline_last_stage() is called at init time. If pipeline_parallel_size > num_layers, some PP ranks would get zero layers (empty [start_layer, end_layer) range) and would be neither first nor last stage for output projection. This edge case should either be validated or documented.
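
A minimal validation along the lines suggested above; the function name and the place it would be called from are assumptions:

```python
def validate_pp_partition(num_layers: int, pipeline_parallel_size: int) -> None:
    """Reject configurations where some PP ranks would own zero layers."""
    if pipeline_parallel_size > num_layers:
        raise ValueError(
            f"pipeline_parallel_size={pipeline_parallel_size} exceeds "
            f"num_layers={num_layers}; some stages would get an empty "
            "[start_layer, end_layer) range."
        )
```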

```diff
@@ -50,9 +45,6 @@
     "Wan22S2VPipeline",
```

The removal of Wan22TI2VPipeline is a breaking change for anyone using it. The PR description says it's "unused" — confirming there are no external consumers (and no references in docs or examples beyond this module) would be good.

Also, the deleted test_wan22_ti2v_pipeline.py had real test coverage (preprocess validation, timestep expansion, I2V latent preparation). Were those test cases verified to be covered elsewhere, or intentionally dropped?

Labels: ready label to trigger buildkite CI

4 participants