[Bagel]: Support SP #1903
Conversation
Signed-off-by: princepride <wangzhipeng628@gmail.com>
@wtomin PTAL
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4245ced511
Signed-off-by: princepride <wangzhipeng628@gmail.com>
princepride
left a comment
Review Summary
Validated:
- All gates pass (DCO, pre-commit, build, docs)
- Visual quality evidence provided (4-way image comparison)
- SP entry/exit context tracking via `_sp_enter`/`_sp_exit` and `_sp_shard_depth` is consistent with the forward-context contract
- `_forward_sp_gen` output slicing is correct: `joint_strategy="front"` prepends the joint query, so `attn_out[:text_len]` is text and `attn_out[text_len:]` is VAE — confirmed against Ulysses `post_attention`
- Registry addition of `"bagel"` to `transformer_attrs` with `_sp_plan = {}` correctly triggers `sp_plan_hooks_applied=True`, so `sp_active` is gated by `_sp_shard_depth`
Issues requiring fixes before merge:
- No automated test. `tests/e2e/offline_inference/test_sequence_parallel.py` exists, but BAGEL SP cases (ulysses_degree=2, ring_degree=2, combined) are not added. The manual script in the PR body is also broken — `name` is not defined in the loop scope (should be `t["name"]`) — so it cannot have been run as written.
- `_split_vae_for_sp` `else` branch reconstructs wrong indices. See inline comment at line 1106.
- `local_packed_indexes` silently discards the caller's `packed_indexes`. See inline comment at line 2069.
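The scoping bug called out in the first issue can be illustrated with a minimal sketch (the test-case dicts and their keys are assumptions; only `t["name"]` comes from the review):

```python
def case_labels(tests):
    labels = []
    for t in tests:
        # Bug in the PR-body script: a bare `name` here raises NameError
        # because `name` is never bound; the fix is to read it off `t`.
        labels.append(t["name"])
    return labels

tests = [
    {"name": "ulysses2", "ulysses_degree": 2, "ring_degree": 1},
    {"name": "ring2", "ulysses_degree": 1, "ring_degree": 2},
]
assert case_labels(tests) == ["ulysses2", "ring2"]
```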
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Removed comments about KV cache handling in the inference context. Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Please provide the e2e latency and VRAM in your PR body. Also, you may update the two files:
Can you verify whether SP works along with TeaCache or Cache-DiT? These two features are hook-based designs.
Sure, I will do it later. Can you help me approve #1998? I split the BAGEL e2e test into L2 and L3; it may help us download the HF weights.
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Any CI test and doc updates? Perf improvement and memory consumption?
Will update L4 test. |
hsliuustc0106
left a comment
Review Summary
This PR adds Ulysses & Ring sequence parallelism support for BAGEL with a well-structured implementation. The visual test evidence demonstrates correctness across SP configurations.
✅ Validated
- Implementation follows existing patterns
- Visual test results show consistent output quality
- Stage config for 4-GPU setup included
- Broadcast-aware KV transfer for multi-GPU stages
🔧 Changes Needed
Missing automated tests - For a distributed feature like sequence parallelism, E2E tests are essential for CI coverage. Please add tests covering:
- Basic SP correctness test - Verify output with SP matches non-SP baseline (or document expected differences)
- Multi-GPU SP test - At minimum, a test for
ulysses=2on 2 GPUs
Suggested test location: tests/e2e/offline_inference/test_bagel_sp.py
Example test structure:
```python
@pytest.mark.parametrize("sp_config", ["ulysses2", "ring2"])
@hardware_test(res={"cuda": "H100"}, num_cards={"cuda": 2})
def test_bagel_sp_correctness(sp_config):
    # Compare SP output against non-SP baseline
```
📝 Minor (non-blocking)
- The KV transfer method change (`receive_multi_kv_cache` → `receive_multi_kv_cache_distributed`) affects all diffusion models. Consider verifying other models still work correctly.
- The single-image batch limitation is asserted but could be documented in a README.
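For the shared KV-transfer path flagged above, one defensive pattern is to have the distributed method degrade to the original single-process behavior. A minimal sketch, assuming a `world_size` attribute and list-like caches (only the class and the two method names come from this thread):

```python
class OmniKVTransferManager:
    """Minimal sketch; the real manager carries much more state."""

    def __init__(self, world_size: int):
        self.world_size = world_size

    def receive_multi_kv_cache(self, caches):
        # Original single-process receive path.
        return list(caches)

    def receive_multi_kv_cache_distributed(self, caches):
        # Degrade to the original path when there is nothing to
        # broadcast, so single-GPU diffusion models are unaffected.
        if self.world_size == 1:
            return self.receive_multi_kv_cache(caches)
        # Multi-GPU: broadcast from the source rank (elided here).
        return list(caches)
```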
Thanks for the comprehensive test evidence in the PR description!
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
@wtomin @hsliuustc0106 PTAL. I updated the benchmark results in this PR's description. I also updated the docs and the L4 unit test.
lishunyang12
left a comment
Left a few comments. Main concern is the `receive_multi_kv_cache_distributed` change in the model runner — that's a shared path.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: princepride <wangzhipeng628@gmail.com>
It is good to see E2E latency and VRAM reported in the PR body. Various combinations of acceleration features have been tested, plus the online serving test. It's almost perfect to me. I'm just wondering if there is a reason that this PR only includes
Because I think the other YAML configs can easily be created from
@wtomin @hsliuustc0106 Can someone help me approve it?
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
@princepride SP and TeaCache are intertwined. Please check my recent bugfix for Qwen-Image #2101 and verify whether SP and TeaCache are compatible for BAGEL.
Purpose
Related: #1217
Implement Ulysses & Ring Sequence Parallelism for BAGEL model.
Key Changes
- Broadcast-aware KV transfer via `broadcast_object_list`
- `receive_multi_kv_cache_distributed` added to `OmniKVTransferManager` to minimize changes to `diffusion_model_runner.py`
- VAE splitting for SP (`_split_vae_for_sp`)
Benchmarks
BAGEL-7B-MoT, 1024×1024 images, 50 inference steps, CFG (text_scale=4.0, img_scale=1.5).
Each diffusion worker uses ~27.20 GiB model VRAM.
Test Plan
Test Result
SP Only
SP + TeaCache
SP + Cache-DiT (Fn=4, W=8)