
[Bagel]: Support SP #1903

Merged
hsliuustc0106 merged 16 commits into vllm-project:main from princepride:bagel-support-sp
Mar 23, 2026

Conversation

@princepride
Collaborator

@princepride princepride commented Mar 15, 2026

Purpose

Related: #1217
Implement Ulysses & Ring sequence parallelism for the BAGEL model.
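For context: Ulysses SP shards the sequence across ranks and uses an all-to-all so each rank attends over the full sequence for a subset of heads, while Ring SP keeps sequence shards and circulates KV blocks. A single-process NumPy sketch of the Ulysses re-shard (illustrative names only, not this PR's API):

```python
# Single-process simulation of the Ulysses all-to-all re-shard:
# each rank starts with (seq/sp, heads, dim) and ends with (seq, heads/sp, dim).
import numpy as np

def ulysses_reshard(shards, sp):
    """Each receiving rank r collects the sequence shards of head group r
    from every rank and concatenates them along the sequence axis."""
    seq_part, heads, dim = shards[0].shape
    assert heads % sp == 0
    group_size = heads // sp
    out = []
    for r in range(sp):
        group = [s[:, r * group_size:(r + 1) * group_size, :] for s in shards]
        out.append(np.concatenate(group, axis=0))  # (seq, heads // sp, dim)
    return out

sp, seq, heads, dim = 2, 8, 4, 16
full = np.random.rand(seq, heads, dim)
shards = [full[r * seq // sp:(r + 1) * seq // sp] for r in range(sp)]
resharded = ulysses_reshard(shards, sp)
assert resharded[0].shape == (seq, heads // sp, dim)
assert np.allclose(resharded[1], full[:, heads // 2:, :])
```

After local attention over the full sequence, the inverse all-to-all restores the original sequence sharding.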

Key Changes

  • Add SP support (Ulysses, Ring, hybrid Ulysses+Ring) to BAGEL transformer
  • Multi-stage KV cache broadcast: rank 0 receives KV cache via SharedMemory connector and broadcasts to all SP ranks via broadcast_object_list
  • receive_multi_kv_cache_distributed added to OmniKVTransferManager to minimize changes to diffusion_model_runner.py
  • VAE token splitting across SP ranks (_split_vae_for_sp)
  • Compatible with TeaCache and Cache-DiT hook-based acceleration
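The multi-stage KV cache broadcast above can be sketched as follows; `receive_kv_from_connector` and `sp_group` are stand-ins for the actual SharedMemory connector call and SP process group in the PR:

```python
# Sketch (not the PR's exact code): rank 0 pulls the KV cache from the
# connector, then broadcasts the Python object to every SP rank.
import torch.distributed as dist

def broadcast_kv_to_sp_ranks(sp_group, receive_kv_from_connector):
    """Rank 0 receives the KV cache; all other ranks get it via
    broadcast_object_list on the SP process group."""
    if dist.get_rank(sp_group) == 0:
        payload = [receive_kv_from_connector()]  # only rank 0 touches the connector
    else:
        payload = [None]  # placeholder, filled in by the broadcast
    dist.broadcast_object_list(payload, src=0, group=sp_group)
    return payload[0]
```

`broadcast_object_list` pickles the cache, so this favors simplicity over bandwidth; tensor-wise `dist.broadcast` would avoid serialization at the cost of more bookkeeping.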

Benchmarks

BAGEL-7B-MoT, 1024×1024 images, 50 inference steps, CFG (text_scale=4.0, img_scale=1.5).
Each diffusion worker uses ~27.20 GiB model VRAM.

| Configuration | Diffusion GPUs | E2E Latency | Speedup |
|---|---|---|---|
| Baseline (no SP) | 1 | 19.04s | 1.0x |
| Ulysses=2 | 2 | 13.92s | 1.37x |
| Ring=2 | 2 | 13.93s | 1.37x |
| Ulysses=2 + Ring=2 | 4 | 14.55s | 1.31x |
| Ulysses=2 + TeaCache | 2 | 14.41s | 1.32x |
| Ulysses=2 + Cache-DiT (Fn=4, W=8) | 2 | 9.26s | 2.06x |
| Ring=2 + TeaCache | 2 | 14.33s | 1.33x |
| Ring=2 + Cache-DiT (Fn=4, W=8) | 2 | 8.72s | 2.18x |
| Ulysses=2 + Ring=2 + TeaCache | 4 | 15.21s | 1.25x |
| Ulysses=2 + Ring=2 + Cache-DiT (Fn=4, W=8) | 4 | 10.91s | 1.74x |

Note: At 1024×1024, 2-GPU configs outperform 4-GPU configs due to communication overhead. Higher resolutions would benefit more from 4-GPU parallelism.

Note: SP + Cache-DiT quality depends on cache parameters. Use conservative settings (higher Fn_compute_blocks and max_warmup_steps) for better image quality.
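The VAE token split mentioned in Key Changes (_split_vae_for_sp) can be sketched as a contiguous per-rank shard with padding to a multiple of the SP world size; the PR's actual implementation may differ in detail:

```python
# Illustrative sketch of splitting VAE tokens across SP ranks
# (names and padding scheme are assumptions, not the PR's exact code).
import torch

def split_vae_for_sp(vae_tokens: torch.Tensor, sp_rank: int, sp_size: int):
    """Pad the token dimension to a multiple of sp_size, then return
    this rank's contiguous shard plus the pad length for later removal."""
    n = vae_tokens.shape[0]
    pad = (-n) % sp_size
    if pad:
        vae_tokens = torch.cat(
            [vae_tokens, vae_tokens.new_zeros(pad, *vae_tokens.shape[1:])]
        )
    chunk = vae_tokens.shape[0] // sp_size
    return vae_tokens[sp_rank * chunk:(sp_rank + 1) * chunk], pad
```

The returned pad length lets the caller strip the zero rows after the shards are gathered back.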

Test Plan

"""Test BAGEL Sequence Parallel: ulysses=2, ring=2, hybrid 2x2."""

from vllm_omni.entrypoints.omni import Omni

MODEL = "ByteDance-Seed/BAGEL-7B-MoT"
PROMPT = "<|im_start|>A cute cat<|im_end|>"

# Example: Ulysses=2 + Ring=2 with stage config
omni = Omni(
    model=MODEL,
    stage_configs_path="vllm_omni/model_executor/stage_configs/bagel_usp2_ring2.yaml",
)

params_list = omni.default_sampling_params_list
params_list[0].max_tokens = 1
params_list[1].num_inference_steps = 50
params_list[1].seed = 52
params_list[1].width = 1024
params_list[1].height = 1024
params_list[1].extra_args = {
    "cfg_text_scale": 4.0,
    "cfg_img_scale": 1.5,
}

outputs = list(omni.generate(
    prompts=[{"prompt": PROMPT, "modalities": ["image"]}],
    sampling_params_list=params_list,
))

for req_output in outputs:
    images = getattr(req_output, "images", None)
    if images:
        for j, img in enumerate(images):
            img.save(f"bagel_sp_{j}.png")
            print(f"Saved: bagel_sp_{j}.png")

omni.close()

Test Result

SP Only

Image grid: baseline (no SP) | Ulysses=2 | Ring=2 | Ulysses=2 + Ring=2

SP + TeaCache

Image grid: Ulysses=2 + TeaCache | Ring=2 + TeaCache | Ulysses=2 + Ring=2 + TeaCache

SP + Cache-DiT (Fn=4, W=8)

Image grid: Ulysses=2 + Cache-DiT | Ring=2 + Cache-DiT | Ulysses=2 + Ring=2 + Cache-DiT

Signed-off-by: princepride <wangzhipeng628@gmail.com>
@princepride
Collaborator Author

@wtomin PTAL


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4245ced511


Comment thread vllm_omni/diffusion/models/bagel/bagel_transformer.py
Comment thread vllm_omni/diffusion/models/bagel/bagel_transformer.py
@princepride princepride mentioned this pull request Mar 15, 2026
14 tasks
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Comment thread vllm_omni/diffusion/models/bagel/bagel_transformer.py Outdated
Collaborator Author

@princepride princepride left a comment


Review Summary

Validated:

  • All gates pass (DCO, pre-commit, build, docs)
  • Visual quality evidence provided (4-way image comparison)
  • SP entry/exit context tracking via _sp_enter/_sp_exit and _sp_shard_depth is consistent with the forward-context contract
  • _forward_sp_gen output slicing is correct: joint_strategy="front" prepends joint query, so attn_out[:text_len] is text and attn_out[text_len:] is VAE — confirmed against Ulysses post_attention
  • Registry addition of "bagel" to transformer_attrs with _sp_plan = {} correctly triggers sp_plan_hooks_applied=True so sp_active is gated by _sp_shard_depth
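The slicing contract checked above (joint_strategy="front" prepends the text query, so the output splits at text_len) can be illustrated with a tiny stand-alone example; names are illustrative:

```python
# Minimal illustration of the "front" joint-strategy slicing:
# the text (joint) tokens are prepended to the VAE tokens before attention,
# so the combined output is recovered by slicing at text_len.
import torch

def split_front_joint(attn_out: torch.Tensor, text_len: int):
    """Undo a joint_strategy='front' concatenation of [text; vae] tokens."""
    return attn_out[:text_len], attn_out[text_len:]

text, vae = torch.randn(7, 8), torch.randn(13, 8)
out_text, out_vae = split_front_joint(torch.cat([text, vae]), text.shape[0])
assert torch.equal(out_text, text) and torch.equal(out_vae, vae)
```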

Issues requiring fixes before merge:

  1. No automated test. tests/e2e/offline_inference/test_sequence_parallel.py exists but BAGEL SP cases (ulysses_degree=2, ring_degree=2, combined) are not added. The manual script in the PR body is also broken — name is not defined in the loop scope (should be t["name"]) — so it cannot have been run as written.

  2. _split_vae_for_sp else branch reconstructs wrong indices. See inline comment at line 1106.

  3. local_packed_indexes silently discards the caller's packed_indexes. See inline comment at line 2069.

Comment thread vllm_omni/diffusion/models/bagel/bagel_transformer.py
Comment thread vllm_omni/diffusion/models/bagel/bagel_transformer.py
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Removed comments about KV cache handling in the inference context.

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
@princepride princepride added the ready label to trigger buildkite CI label Mar 19, 2026
Comment thread vllm_omni/model_executor/stage_configs/bagel_usp2.yaml
@wtomin
Collaborator

wtomin commented Mar 19, 2026

Please provide the e2e latency and VRAM in your PR body. Also, you may update these two files:

  • docs/user_guide/diffusion_acceleration.md
  • docs/user_guide/diffusion/parallelism_acceleration.md

Can you verify that SP works together with TeaCache or Cache-DiT? These two features use a hook-based design.

@princepride
Collaborator Author

princepride commented Mar 19, 2026

> Please provide the e2e latency and VRAM in your PR body. Also, you may update these two files:
>
> • docs/user_guide/diffusion_acceleration.md
> • docs/user_guide/diffusion/parallelism_acceleration.md
>
> Can you verify that SP works together with TeaCache or Cache-DiT? These two features use a hook-based design.

Sure, I will do it later. Can you help me approve #1998? I split the BAGEL e2e tests into L2 and L3; it may help us download the HF weights.

@princepride princepride removed the ready label to trigger buildkite CI label Mar 19, 2026
Signed-off-by: princepride <wangzhipeng628@gmail.com>
@hsliuustc0106
Collaborator

Any CI tests and doc updates? Perf improvement and memory consumption numbers?

@princepride
Collaborator Author

Will update the L4 test.

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Review Summary

This PR adds Ulysses & Ring sequence parallelism support for BAGEL with a well-structured implementation. The visual test evidence demonstrates correctness across SP configurations.

✅ Validated

  • Implementation follows existing patterns
  • Visual test results show consistent output quality
  • Stage config for 4-GPU setup included
  • Broadcast-aware KV transfer for multi-GPU stages

🔧 Changes Needed

Missing automated tests - For a distributed feature like sequence parallelism, E2E tests are essential for CI coverage. Please add tests covering:

  1. Basic SP correctness test - Verify output with SP matches non-SP baseline (or document expected differences)
  2. Multi-GPU SP test - At minimum, a test for ulysses=2 on 2 GPUs

Suggested test location: tests/e2e/offline_inference/test_bagel_sp.py

Example test structure:

@pytest.mark.parametrize("sp_config", ["ulysses2", "ring2"])
@hardware_test(res={"cuda": "H100"}, num_cards={"cuda": 2})
def test_bagel_sp_correctness(sp_config):
    # Compare SP output against the non-SP baseline
    ...

📝 Minor (non-blocking)

  • The KV transfer method change (receive_multi_kv_cache → receive_multi_kv_cache_distributed) affects all diffusion models. Consider verifying that other models still work correctly.
  • Single-image batch limitation is asserted but could be documented in a README.

Thanks for the comprehensive test evidence in the PR description!

Comment thread vllm_omni/diffusion/worker/diffusion_model_runner.py
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
@princepride princepride changed the title [Bagel]: Support sp [Bagel]: Support SP Mar 19, 2026
@princepride
Collaborator Author

@wtomin @hsliuustc0106 PTAL. I updated the benchmark results in this PR's description, and I also updated the docs and the L4 unit test.

Collaborator

@lishunyang12 lishunyang12 left a comment


Left a few comments. Main concern is the receive_multi_kv_cache_distributed change in model runner — that's a shared path.

Comment thread vllm_omni/diffusion/worker/diffusion_model_runner.py
Comment thread vllm_omni/diffusion/models/bagel/bagel_transformer.py
Comment thread vllm_omni/diffusion/models/bagel/bagel_transformer.py
Comment thread vllm_omni/diffusion/models/bagel/bagel_transformer.py
Comment thread vllm_omni/distributed/omni_connectors/kv_transfer_manager.py
Comment thread vllm_omni/distributed/omni_connectors/kv_transfer_manager.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Collaborator Author

@princepride princepride left a comment


@wtomin
Collaborator

wtomin commented Mar 22, 2026

It is good to see E2E latency and VRAM reported in the PR body. Various combinations of acceleration features have been tested, plus the online serving test. It looks almost perfect to me.

I'm just wondering whether there is a reason this PR only includes bagel_usp2_ring2.yaml, because usp2+ring2 does not seem to be the best combo.

@princepride
Collaborator Author

> It is good to see E2E latency and VRAM reported in the PR body. Various combinations of acceleration features have been tested, plus the online serving test. It looks almost perfect to me.
>
> I'm just wondering whether there is a reason this PR only includes bagel_usp2_ring2.yaml, because usp2+ring2 does not seem to be the best combo.

Because I think the other YAML configs can easily be created from bagel_usp2_ring2.yaml 😂

@princepride
Collaborator Author

@wtomin @hsliuustc0106 Can someone help me approve it?

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Collaborator

@wtomin wtomin left a comment


LGTM.

Comment thread vllm_omni/model_executor/stage_configs/bagel_usp2.yaml Outdated
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
@wtomin wtomin added the ready label to trigger buildkite CI label Mar 23, 2026
Signed-off-by: princepride <wangzhipeng628@gmail.com>
@princepride princepride enabled auto-merge (squash) March 23, 2026 12:00
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
@hsliuustc0106 hsliuustc0106 disabled auto-merge March 23, 2026 13:52
@hsliuustc0106 hsliuustc0106 merged commit 77d773a into vllm-project:main Mar 23, 2026
7 of 8 checks passed
@wtomin
Collaborator

wtomin commented Mar 23, 2026

@princepride SP and TeaCache are intertwined. Please check my recent bugfix for Qwen-Image #2101 and verify if SP and teacache are compatible for bagel.
