[Bagel]: Support SP #1903
Conversation
Signed-off-by: princepride <wangzhipeng628@gmail.com>
@wtomin PTAL
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4245ced511
Signed-off-by: princepride <wangzhipeng628@gmail.com>
princepride
left a comment
Review Summary
Validated:
- All gates pass (DCO, pre-commit, build, docs)
- Visual quality evidence provided (4-way image comparison)
- SP entry/exit context tracking via `_sp_enter`/`_sp_exit` and `_sp_shard_depth` is consistent with the forward-context contract
- `_forward_sp_gen` output slicing is correct: `joint_strategy="front"` prepends the joint query, so `attn_out[:text_len]` is text and `attn_out[text_len:]` is VAE — confirmed against Ulysses `post_attention`
- Registry addition of `"bagel"` to `transformer_attrs` with `_sp_plan = {}` correctly triggers `sp_plan_hooks_applied=True`, so `sp_active` is gated by `_sp_shard_depth`
Issues requiring fixes before merge:
- No automated test. `tests/e2e/offline_inference/test_sequence_parallel.py` exists, but BAGEL SP cases (ulysses_degree=2, ring_degree=2, combined) are not added. The manual script in the PR body is also broken — `name` is not defined in the loop scope (should be `t["name"]`) — so it cannot have been run as written.
- `_split_vae_for_sp` `else` branch reconstructs wrong indices. See inline comment at line 1106.
- `local_packed_indexes` silently discards the caller's `packed_indexes`. See inline comment at line 2069.
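The scoping bug called out in the first issue can be illustrated with a minimal sketch (the test-case dicts and their keys are assumptions; only `t["name"]` comes from the review):

```python
def case_labels(tests):
    labels = []
    for t in tests:
        # Bug in the PR-body script: a bare `name` here raises NameError
        # because `name` is never bound; the fix is to read it off `t`.
        labels.append(t["name"])
    return labels

tests = [
    {"name": "ulysses2", "ulysses_degree": 2, "ring_degree": 1},
    {"name": "ring2", "ulysses_degree": 1, "ring_degree": 2},
]
assert case_labels(tests) == ["ulysses2", "ring2"]
```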
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Removed comments about KV cache handling in the inference context. Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Please provide the e2e latency and VRAM in your PR body. Also, you may update the two files:
Can you verify whether SP works along with TeaCache or Cache-DiT? These two features are hook-based designs.
Sure, I will do it later. Can you help me approve #1998? I split the BAGEL e2e test into L2 and L3; it may help us download the HF weights.
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Any CI test and doc updates? Perf improvement and memory consumption?
Will update L4 test. |
hsliuustc0106
left a comment
Review Summary
This PR adds Ulysses & Ring sequence parallelism support for BAGEL with a well-structured implementation. The visual test evidence demonstrates correctness across SP configurations.
✅ Validated
- Implementation follows existing patterns
- Visual test results show consistent output quality
- Stage config for 4-GPU setup included
- Broadcast-aware KV transfer for multi-GPU stages
🔧 Changes Needed
Missing automated tests - For a distributed feature like sequence parallelism, E2E tests are essential for CI coverage. Please add tests covering:
- Basic SP correctness test - Verify output with SP matches non-SP baseline (or document expected differences)
- Multi-GPU SP test - At minimum, a test for
ulysses=2on 2 GPUs
Suggested test location: tests/e2e/offline_inference/test_bagel_sp.py
Example test structure:
```python
@pytest.mark.parametrize("sp_config", ["ulysses2", "ring2"])
@hardware_test(res={"cuda": "H100"}, num_cards={"cuda": 2})
def test_bagel_sp_correctness(sp_config):
    # Compare SP output against non-SP baseline
```
📝 Minor (non-blocking)
- The KV transfer method change (`receive_multi_kv_cache` → `receive_multi_kv_cache_distributed`) affects all diffusion models. Consider verifying other models still work correctly.
- The single-image batch limitation is asserted but could be documented in a README.
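For the shared KV-transfer path flagged above, one defensive pattern is to have the distributed method degrade to the original single-process behavior. A minimal sketch, assuming a `world_size` attribute and list-like caches (only the class and the two method names come from this thread):

```python
class OmniKVTransferManager:
    """Minimal sketch; the real manager carries much more state."""

    def __init__(self, world_size: int):
        self.world_size = world_size

    def receive_multi_kv_cache(self, caches):
        # Original single-process receive path.
        return list(caches)

    def receive_multi_kv_cache_distributed(self, caches):
        # Degrade to the original path when there is nothing to
        # broadcast, so single-GPU diffusion models are unaffected.
        if self.world_size == 1:
            return self.receive_multi_kv_cache(caches)
        # Multi-GPU: broadcast from the source rank (elided here).
        return list(caches)
```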
Thanks for the comprehensive test evidence in the PR description!
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
@wtomin @hsliuustc0106 PTAL. I updated the benchmark results in this PR's description. I also updated the docs and the L4 unit test.
lishunyang12
left a comment
Left a few comments. Main concern is the `receive_multi_kv_cache_distributed` change in the model runner — that's a shared path.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: princepride <wangzhipeng628@gmail.com>
It is good to see E2E latency and VRAM reported in the PR body. Various combinations of acceleration features have been tested, plus the online serving test. It's almost perfect to me. I'm just wondering if there is a reason that this PR only includes
Because I think the other YAML configs can easily be created from
@wtomin @hsliuustc0106 Can someone help me approve it?
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
@princepride SP and TeaCache are intertwined. Please check my recent bugfix for Qwen-Image #2101 and verify whether SP and TeaCache are compatible for BAGEL.
Purpose
Related: #1217
Implement Ulysses & Ring Sequence Parallelism for BAGEL model.
Key Changes
- Broadcast-aware KV transfer via `broadcast_object_list`
- `receive_multi_kv_cache_distributed` added to `OmniKVTransferManager` to minimize changes to `diffusion_model_runner.py`
- VAE splitting for SP (`_split_vae_for_sp`)
Benchmarks
BAGEL-7B-MoT, 1024×1024 images, 50 inference steps, CFG (text_scale=4.0, img_scale=1.5).
Each diffusion worker uses ~27.20 GiB model VRAM.
Test Plan
Test Result
SP Only
SP + TeaCache
SP + Cache-DiT (Fn=4, W=8)