[BugFix] Fix LongCat Sequence Parallelism / Small Cleanup#1631
[BugFix] Fix LongCat Sequence Parallelism / Small Cleanup#1631Gaohan123 merged 2 commits intovllm-project:mainfrom
Conversation
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6bfef321ff
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| # Store SP size in forward context for sub-modules to access | ||
| get_forward_context().sequence_parallel_size = sp_size | ||
| if sp_size > 1: | ||
| if sp_size is not None and sp_size > 1: |
There was a problem hiding this comment.
Keep sequence_parallel_size in context for all forwards
LongCatImageAttention.forward reads get_forward_context().sequence_parallel_size unconditionally (both joint and single-stream paths), but this change only sets fwd_context.sequence_parallel_size inside the sp_size > 1 branch. When LongCat runs with sequence_parallel_size of 1 (or None), the context no longer has that attribute and attention will raise AttributeError before inference completes. Please set the field regardless of SP mode so non-SP runs continue to work.
Useful? React with 👍 / 👎.
hsliuustc0106
left a comment
There was a problem hiding this comment.
Architectural Code Review
📋 Summary
| Item | Details |
|---|---|
| PR | [BugFix] Fix LongCat Sequence Parallelism / Small Cleanup |
| Author | @alex-jw-brooks |
| Issue | Fixes #1556 |
| Changes | +36 / -38 (2 files) |
🎯 Root Cause Analysis
Bug: LongCat's sequence parallelism was broken when Ulysses was enabled.
Why:
# LongCat doesn't use sp_plan like other models
# → _sp_shard_depth defaults to 0
# → sp_active is never set to True
# → Attention layers use NoParallelAttention instead of UlyssesFix:
if sp_size > 1:
fwd_context._sp_shard_depth = 1 # ✅ Activate Ulysses path
else:
fwd_context._sp_shard_depth = 0 # ✅ Deactivate when SP disabled✅ Correctness Analysis
1. SP Activation Logic
# Before: _sp_shard_depth never set → Ulysses not used
# After: _sp_shard_depth = 1 when SP active → Ulysses works ✅Verification: Test plan covers all combinations:
- ulysses=1, ring=1
- ulysses=2, ring=1
- ulysses=1, ring=2
- ulysses=2, ring=2
2. Depth Management
| State | _sp_shard_depth |
Effect |
|---|---|---|
| SP disabled | 0 | NoParallelAttention |
| SP active, before shard | 1 | Ulysses attention |
| SP active, after gather | 0 | Restored to baseline ✅ |
Important: The depth is set back to 0 after all_gather:
if sp_size > 1:
output = get_sp_group().all_gather(output, dim=1)
get_forward_context()._sp_shard_depth = 0 # ✅ Cleanup3. None Safety
# Before:
if sp_size > 1: # ❌ Could fail if sp_size is None
# After:
if sp_size is not None and sp_size > 1: # ✅ Type-safe🟡 Observations
1. Gradient Checkpointing Removed
# Removed 30 lines of gradient checkpointing logic
- if torch.is_grad_enabled() and self.gradient_checkpointing:
- self._gradient_checkpointing_func(...)Questions:
- Was gradient checkpointing intentionally removed?
- Is this a bug fix or a feature removal?
- Should this be documented in the PR description?
If intentional: Consider adding a note explaining why gradient checkpointing was removed.
2. Depth Reset Timing
# After all_gather, depth is reset
output = get_sp_group().all_gather(output, dim=1)
get_forward_context()._sp_shard_depth = 0Question: What happens if there's another layer after this? Should the depth persist until the end of forward()?
Answer: Based on the code, norm_out and proj_out are the final layers, so resetting depth here is correct. But worth verifying.
3. Documentation Update
Good addition to sequence_parallel.md:
Note that currently, `sp_shard` / `sp_gather` do *not* automatically
manage the `_sp_shard_depth`; you need to be careful to manage it yourself.This is helpful for future model developers.
🏗️ Architecture Impact
Sequence Parallelism Pattern:
┌─────────────────────────────────────────────┐
│ Forward Context (_sp_shard_depth) │
│ │
│ 0: NoParallelAttention (default) │
│ 1+: Ulysses Attention (sequence sharded) │
└─────────────────────────────────────────────┘
LongCat Flow:
input → shard → depth=1 → attention → gather → depth=0 → output
Models with sp_plan:
- Automatic depth management via
sp_shard/sp_gather
Models without sp_plan (like LongCat):
- Manual depth management required
- This PR fixes the missing depth management
📊 Test Coverage
| Scenario | Covered | Verification |
|---|---|---|
| SP disabled (ulysses=1, ring=1) | ✅ | Visual inspection |
| Ulysses only (ulysses=2, ring=1) | ✅ | Visual inspection |
| Ring only (ulysses=1, ring=2) | ✅ | Visual inspection |
| Ulysses + Ring (ulysses=2, ring=2) | ✅ | Visual inspection |
| Gradient checkpointing | ❓ | Removed without test |
Suggestion: Add a test that verifies _sp_shard_depth is correctly set:
def test_sp_shard_depth_set_correctly():
"""Verify _sp_shard_depth is 1 when SP active, 0 otherwise"""
model = LongCatImageTransformer(sp_size=2)
# SP active
with mock_forward_context() as ctx:
model.forward(hidden_states, ...)
assert ctx._sp_shard_depth == 1
# SP inactive
model = LongCatImageTransformer(sp_size=1)
with mock_forward_context() as ctx:
model.forward(hidden_states, ...)
assert ctx._sp_shard_depth == 0📝 Minor Suggestions
1. Add Debug Logging for Depth
if sp_size > 1:
fwd_context._sp_shard_depth = 1
logger.debug(f"[LongCat] SP active: depth=1, rank={sp_rank}")2. Document Gradient Checkpointing Removal
If intentional, add to PR description:
**Note:** Gradient checkpointing was removed because [reason].3. Consider Context Manager Pattern
# Future refactor suggestion
with sp_shard_context(depth=1):
# ... attention computation ...
pass # depth automatically reset📋 Checklist
| Aspect | Status |
|---|---|
| Root cause identified | ✅ Clear |
| Fix correctness | ✅ Verified |
| None safety | ✅ Added |
| Documentation | ✅ Updated |
| Test coverage | ✅ Adequate |
| Gradient checkpointing |
Verdict
| Rating | Notes |
|---|---|
| APPROVE ✅ | Correct bug fix with good documentation |
Rationale:
- Clear root cause identification
- Minimal, targeted fix
- Good documentation update
- Comprehensive test plan
Follow-up: Clarify whether gradient checkpointing removal was intentional.
Reviewed by: vllm-omni-reviewer MCP tool 🦐
hsliuustc0106
left a comment
There was a problem hiding this comment.
Additional Feedback: Memory Footprint for Distributed SP
Good point — Sequence Parallelism has significant memory implications that should be documented.
Why Memory Footprint Matters for SP
- Memory Reduction: SP shards the sequence across ranks, reducing activation memory per GPU
- Communication Overhead: Ring attention and Ulysses have different memory patterns
- Capacity Planning: Users need to know if SP enables larger images or just faster inference
Expected Memory Report
## Memory Footprint (LongCat, 1024x1024)
| Config | GPU Memory (per rank) | Total Memory | Notes |
|--------|----------------------|--------------|-------|
| No SP (baseline) | 24 GB | 24 GB | Single GPU |
| Ulysses=2, Ring=1 | 14 GB | 28 GB | 2 GPUs, sequence sharded |
| Ulysses=1, Ring=2 | 14 GB | 28 GB | 2 GPUs, ring attention |
| Ulysses=2, Ring=2 | 8 GB | 32 GB | 4 GPUs, both enabled |
**Key Metrics:**
- Activation memory reduction: ~40% per rank (Ulysses=2)
- Communication buffer overhead: ~1 GB per rank
- Enables 2048x2048 images? ✅ With Ulysses=2, Ring=2How to Profile
# Profile memory for each SP config
torch.cuda.reset_peak_memory_stats()
python text_to_image.py \
--model longcat-image \
--ulysses-degree 2 \
--ring-degree 2 \
--height 1024 --width 1024
print(f"Peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")Questions to Address
- Does SP enable larger resolutions? (e.g., 2048x2048 instead of 1024x1024)
- What's the memory/throughput trade-off? (More GPUs vs. faster generation)
- Is there a sweet spot? (Optimal ulysses/ring combination)
This information would help users choose the right SP configuration for their hardware.
🦐 vllm-omni-reviewer
|
It is indeed a critical bugfix. #1275 introduced Luckily, in current vLLM-Omni, only LongCatImage model has manual sp implementation. Other models use We should refactor the @sp_active
def forward(self, hidden_states, ...):
# increase _sp_shard_depth in the beginning
....
return output
# decrease _sp_shard_depth at the endThis would be more imperceptible to the users. |
|
Any suggestions? @mxuax @ZJY0516 @dongbo910220 @SamitHuang |
|
I suggest we also use |
|
Hey @hsliuustc0106 @wtomin @ZJY0516 thanks for the review - IMO maybe a couple of follow-ups after this PR:
Enabling this approach hurts cross-feature compatibility a lot, especially for things like the current way TeaCache extractors are implemented where parts of I am happy to open PRs for both follow-ups. Plus, for the test the reviewer tool suggested, we can just add this in the first follow-up through a generic test for the decorator, which I think we should have anyway |
lishunyang12
left a comment
There was a problem hiding this comment.
Looks good — setting _sp_shard_depth=1 is the correct fix.
|
I suggest you to raise follow-up PR 2: to implement LongCatImage with This PR can be merged for now, but we will try to remove manual sp implementation in the future. |
…ct#1631) Signed-off-by: Alex Brooks <albrooks@redhat.com>
### vllm-omni-api - Source: [PR #1724](vllm-project/vllm-omni#1724) - Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)" - Changes: - New feature: Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)" ### vllm-omni-contrib - Source: [PR #1724](vllm-project/vllm-omni#1724) - Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)" - Changes: - New feature: Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)" ### vllm-omni-api - Source: [PR #1716](vllm-project/vllm-omni#1716) - [Feature]: Add vae-patch-parallel CLI argument in online serving - Changes: - New feature: [Feature]: Add vae-patch-parallel CLI argument in online serving ### vllm-omni-contrib - Source: [PR #1716](vllm-project/vllm-omni#1716) - [Feature]: Add vae-patch-parallel CLI argument in online serving - Changes: - New feature: [Feature]: Add vae-patch-parallel CLI argument in online serving ### vllm-omni-contrib - Source: [PR #1693](vllm-project/vllm-omni#1693) - [skip CI][Docs] Add TTS model developer guide - Changes: - New feature: [skip CI][Docs] Add TTS model developer guide ### vllm-omni-audio-tts - Source: [PR #1688](vllm-project/vllm-omni#1688) - [MiMo-Audio] Bugfix tp lg than 1 - Changes: - Bug fix: [MiMo-Audio] Bugfix tp lg than 1 ### vllm-omni-distributed - Source: [PR #1688](vllm-project/vllm-omni#1688) - [MiMo-Audio] Bugfix tp lg than 1 - Changes: - Bug fix: [MiMo-Audio] Bugfix tp lg than 1 ### vllm-omni-perf - Source: [PR #1688](vllm-project/vllm-omni#1688) - [MiMo-Audio] Bugfix tp lg than 1 - Changes: - Bug fix: [MiMo-Audio] Bugfix tp lg than 1 ### vllm-omni-perf - Source: [PR #1687](vllm-project/vllm-omni#1687) - [BugFix] Return proper HTTP status for ErrorResponse in create_speech - Changes: - Bug fix: [BugFix] Return proper HTTP status for ErrorResponse in create_speech ### vllm-omni-distributed - Source: [PR #1687](vllm-project/vllm-omni#1687) - [BugFix] Return proper HTTP status for ErrorResponse in create_speech - Changes: - Bug fix: [BugFix] Return proper HTTP status for ErrorResponse in create_speech ### vllm-omni-api - Source: [PR #1687](vllm-project/vllm-omni#1687) - [BugFix] Return proper HTTP status for ErrorResponse in create_speech - Changes: - Bug fix: [BugFix] Return proper HTTP status for ErrorResponse in create_speech - Additions: - `/v1/audio/speech` ### vllm-omni-quantization - Source: [PR #1687](vllm-project/vllm-omni#1687) - [BugFix] Return proper HTTP status for ErrorResponse in create_speech - Changes: - Bug fix: [BugFix] Return proper HTTP status for ErrorResponse in create_speech ### vllm-omni-cicd - Source: [PR #1683](vllm-project/vllm-omni#1683) - [CI] Remove high concurrency tests before issue #1374 fixed. - Changes: - Bug fix: [CI] Remove high concurrency tests before issue #1374 fixed. ### vllm-omni-audio-tts - Source: [PR #1678](vllm-project/vllm-omni#1678) - Add non-async chunk support for Qwen3-TTS - Changes: - New feature: Add non-async chunk support for Qwen3-TTS ### vllm-omni-cicd - Source: [PR #1678](vllm-project/vllm-omni#1678) - Add non-async chunk support for Qwen3-TTS - Changes: - New feature: Add non-async chunk support for Qwen3-TTS ### vllm-omni-cicd - Source: [PR #1677](vllm-project/vllm-omni#1677) - Replace hard-coded cuda generator with current_omni_platform.device_type ### vllm-omni-perf - Source: [PR #1677](vllm-project/vllm-omni#1677) - Replace hard-coded cuda generator with current_omni_platform.device_type ### vllm-omni-serving - Source: [PR #1675](vllm-project/vllm-omni#1675) - [Misc] remove logits_processor_pattern this field, because vllm have … ### vllm-omni-cicd - Source: [PR #1666](vllm-project/vllm-omni#1666) - [Cleanup] Move cosyvoice3 tests to model subdirectory ### vllm-omni-audio-tts - Source: [PR #1664](vllm-project/vllm-omni#1664) - [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder - Changes: - Bug fix: [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder ### vllm-omni-cicd - Source: [PR #1664](vllm-project/vllm-omni#1664) - [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder - Changes: - Bug fix: [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder ### vllm-omni-distributed - Source: [PR #1656](vllm-project/vllm-omni#1656) - [Optimize][Qwen3-Omni] Reduce inter-packet latency in async chunk ### vllm-omni-contrib - Source: [PR #1656](vllm-project/vllm-omni#1656) - [Optimize][Qwen3-Omni] Reduce inter-packet latency in async chunk ### vllm-omni-quantization - Source: [PR #1652](vllm-project/vllm-omni#1652) - [UX] Add progress bar for diffusion models - Changes: - New feature: [UX] Add progress bar for diffusion models ### vllm-omni-perf - Source: [PR #1652](vllm-project/vllm-omni#1652) - [UX] Add progress bar for diffusion models - Changes: - New feature: [UX] Add progress bar for diffusion models ### vllm-omni-distributed - Source: [PR #1651](vllm-project/vllm-omni#1651) - docs: Announce vllm-omni-skills community project ### vllm-omni-quantization - Source: [PR #1651](vllm-project/vllm-omni#1651) - docs: Announce vllm-omni-skills community project ### vllm-omni-perf - Source: [PR #1651](vllm-project/vllm-omni#1651) - docs: Announce vllm-omni-skills community project ### vllm-omni-contrib - Source: [PR #1649](vllm-project/vllm-omni#1649) - [Misc] update wechat ### vllm-omni-perf - Source: [PR #1642](vllm-project/vllm-omni#1642) - [chore] add _repeated_blocks for regional compilation support - Changes: - New feature: [chore] add _repeated_blocks for regional compilation support ### vllm-omni-api - Source: [PR #1641](vllm-project/vllm-omni#1641) - [Bugfix] Add TTS request validation to prevent engine crashes - Changes: - New feature: [Bugfix] Add TTS request validation to prevent engine crashes ### vllm-omni-cicd - Source: [PR #1641](vllm-project/vllm-omni#1641) - [Bugfix] Add TTS request validation to prevent engine crashes - Changes: - New feature: [Bugfix] Add TTS request validation to prevent engine crashes ### vllm-omni-image-gen - Source: [PR #1640](vllm-project/vllm-omni#1640) - [FP8 Quantization] Add FP8 quantization support for Flux transformer - Changes: - New feature: [FP8 Quantization] Add FP8 quantization support for Flux transformer - Additions: - text-to-image - Text-to-Image - Flux ### vllm-omni-quantization - Source: [PR #1640](vllm-project/vllm-omni#1640) - [FP8 Quantization] Add FP8 quantization support for Flux transformer - Changes: - New feature: [FP8 Quantization] Add FP8 quantization support for Flux transformer - Additions: - FP8 support or improvements ### vllm-omni-contrib - Source: [PR #1640](vllm-project/vllm-omni#1640) - [FP8 Quantization] Add FP8 quantization support for Flux transformer - Changes: - New feature: [FP8 Quantization] Add FP8 quantization support for Flux transformer ### vllm-omni-perf - Source: [PR #1640](vllm-project/vllm-omni#1640) - [FP8 Quantization] Add FP8 quantization support for Flux transformer - Changes: - New feature: [FP8 Quantization] Add FP8 quantization support for Flux transformer ### vllm-omni-contrib - Source: [PR #1631](vllm-project/vllm-omni#1631) - [BugFix] Fix LongCat Sequence Parallelism / Small Cleanup - Changes: - Bug fix: [BugFix] Fix LongCat Sequence Parallelism / Small Cleanup ### vllm-omni-cicd - Source: [PR #1628](vllm-project/vllm-omni#1628) - [Test][Qwen3-Omni]Modify Qwen3-Omni benchmark test cases ### vllm-omni-perf - Source: [PR #1628](vllm-project/vllm-omni#1628) - [Test][Qwen3-Omni]Modify Qwen3-Omni benchmark test cases ### vllm-omni-perf - Source: [PR #1619](vllm-project/vllm-omni#1619) - [Bugfix] Fix Qwen3-TTS code predictor crash due to missing vLLM config context - Changes: - Bug fix: [Bugfix] Fix Qwen3-TTS code predictor crash due to missing vLLM config context ### vllm-omni-perf - Source: [PR #1617](vllm-project/vllm-omni#1617) - [Refactor][Perf] Qwen3-TTS: re-prefill Code Predictor with torch.compile + enable Code2Wav decoder CUDA Graph - Changes: - Performance improvement: [Refactor][Perf] Qwen3-TTS: re-prefill Code Predictor with torch.compile + enable Code2Wav decoder CUDA Graph ### vllm-omni-contrib - Source: [PR #1615](vllm-project/vllm-omni#1615) - [Doc] Fix links in the configuration doc - Changes: - Bug fix: [Doc] Fix links in the configuration doc ### vllm-omni-audio-tts - Source: [PR #1614](vllm-project/vllm-omni#1614) - perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor - Changes: - Performance improvement: perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor ### vllm-omni-perf - Source: [PR #1614](vllm-project/vllm-omni#1614) - perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor - Changes: - Performance improvement: perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor ### vllm-omni-image-gen - Source: [PR #1609](vllm-project/vllm-omni#1609) - [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation - Changes: - Bug fix: [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation - Additions: - GLM-Image - GLM-Image - GLM-Image - GLM-Image - GLM-Image - GLM-Image - GLM-Image - GLM-Image ### vllm-omni-api - Source: [PR #1609](vllm-project/vllm-omni#1609) - [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation - Changes: - Bug fix: [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation ### vllm-omni-perf - Source: [PR #1609](vllm-project/vllm-omni#1609) - [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation - Changes: - Bug fix: [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation ### vllm-omni-contrib - Source: [PR #1604](vllm-project/vllm-omni#1604) - [Model]: support Helios from ByteDance ### vllm-omni-perf - Source: [PR #1604](vllm-project/vllm-omni#1604) - [Model]: support Helios from ByteDance ### vllm-omni-serving - Source: [PR #1602](vllm-project/vllm-omni#1602) - [Bugfix] fix kernel error for qwen3-omni - Changes: - Bug fix: [Bugfix] fix kernel error for qwen3-omni ### vllm-omni-distributed - Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0 - Changes: - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0 ### vllm-omni-image-gen - Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0 - Changes: - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0 - Additions: - HunyuanImage3 - HunyuanImage3Pipeline - HunyuanImage3 - HunyuanImage-3 - HunyuanImage-3 - HunyuanImage-3 - HunyuanImage3Pipeline - HunyuanImage3Pipeline - HunyuanImage3Pipeline - HunyuanImage3Pipeline - HunyuanImage3Pipeline - HunyuanImage3Pipeline - HunyuanImage3Pipeline - HunyuanImage3Pipeline - HunyuanImage-3 ### vllm-omni-quantization - Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0 - Changes: - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0 ### vllm-omni-perf - Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0 - Changes: - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0 ### vllm-omni-audio-tts - Source: [PR #1583](vllm-project/vllm-omni#1583) - [Feat][Qwen3TTS] reduce TTFA with flexible initial phase - Changes: - New feature: [Feat][Qwen3TTS] reduce TTFA with flexible initial phase ### vllm-omni-api - Source: [PR #1583](vllm-project/vllm-omni#1583) - [Feat][Qwen3TTS] reduce TTFA with flexible initial phase - Changes: - New feature: [Feat][Qwen3TTS] reduce TTFA with flexible initial phase ### vllm-omni-cicd - Source: [PR #1583](vllm-project/vllm-omni#1583) - [Feat][Qwen3TTS] reduce TTFA with flexible initial phase - Changes: - New feature: [Feat][Qwen3TTS] reduce TTFA with flexible initial phase ### vllm-omni-contrib - Source: [PR #1583](vllm-project/vllm-omni#1583) - [Feat][Qwen3TTS] reduce TTFA with flexible initial phase - Changes: - New feature: [Feat][Qwen3TTS] reduce TTFA with flexible initial phase ### vllm-omni-api - Source: [PR #1579](vllm-project/vllm-omni#1579) - [1/N][Refactor] Clean up dead code in output processor ### vllm-omni-serving - Source: [PR #1579](vllm-project/vllm-omni#1579) - [1/N][Refactor] Clean up dead code in output processor ### vllm-omni-distributed - Source: [PR #1578](vllm-project/vllm-omni#1578) - [Feature][Bagel] Add CFG parallel mode - Changes: - New feature: [Feature][Bagel] Add CFG parallel mode ### vllm-omni-cicd - Source: [PR #1578](vllm-project/vllm-omni#1578) - [Feature][Bagel] Add CFG parallel mode - Changes: - New feature: [Feature][Bagel] Add CFG parallel mode ### vllm-omni-perf - Source: [PR #1578](vllm-project/vllm-omni#1578) - [Feature][Bagel] Add CFG parallel mode - Changes: - New feature: [Feature][Bagel] Add CFG parallel mode ### vllm-omni-contrib - Source: [PR #1576](vllm-project/vllm-omni#1576) - 0.16.0 release ### vllm-omni-audio-tts - Source: [PR #1570](vllm-project/vllm-omni#1570) - [bugfix] Fix unexpected argument 'is_finished' in function llm2code2wav_async_chunk of mimo-audio - Changes: - Bug fix: [bugfix] Fix unexpected argument 'is_finished' in function llm2code2wav_async_chunk of mimo-audio ### vllm-omni-api - Source: [PR #1566](vllm-project/vllm-omni#1566) - [Bugfix] Import InputPreprocessor into Renderer - Changes: - Bug fix: [Bugfix] Import InputPreprocessor into Renderer ### vllm-omni-distributed - Source: [PR #1539](vllm-project/vllm-omni#1539) - [Debug] Enable curl retry aligned with openai ### vllm-omni-quantization - Source: [PR #1539](vllm-project/vllm-omni#1539) - [Debug] Enable curl retry aligned with openai ### vllm-omni-perf - Source: [PR #1539](vllm-project/vllm-omni#1539) - [Debug] Enable curl retry aligned with openai ### vllm-omni-image-gen - Source: [PR #1537](vllm-project/vllm-omni#1537) - [NPU] [Features] [Bugfix] Support mindiesd adaln - Changes: - New feature: [NPU] [Features] [Bugfix] Support mindiesd adaln - Additions: - mindiesd - mindiesd - Qwen-Image-Edit-2509 - mindiesd - mindiesd - mindiesd - mindiesd ### vllm-omni-perf - Source: [PR #1537](vllm-project/vllm-omni#1537) - [NPU] [Features] [Bugfix] Support mindiesd adaln - Changes: - New feature: [NPU] [Features] [Bugfix] Support mindiesd adaln ### vllm-omni-serving - Source: [PR #1536](vllm-project/vllm-omni#1536) - [Bugfix] Fix transformers 5.x compat issues in online TTS serving - Changes: - Bug fix: [Bugfix] Fix transformers 5.x compat issues in online TTS serving ### vllm-omni-perf - Source: [PR #1536](vllm-project/vllm-omni#1536) - [Bugfix] Fix transformers 5.x compat issues in online TTS serving - Changes: - Bug fix: [Bugfix] Fix transformers 5.x compat issues in online TTS serving
…ct#1631) Signed-off-by: Alex Brooks <albrooks@redhat.com> Signed-off-by: lishunyang <lishunyang12@163.com>
Fixes #1556
Purpose
This PR fixes sequence parallelism for LongCat image. The bug is that LongCat doesn't use an sp_plan like the other models,
_sp_shard_depthis set to 0, so we aren't actually settingsp_active, and actually aren't taking the ulysses path. Since this model only has a depth of one for SP, the fix is to set_sp_shard_depth=1when active and_sp_shard_depth=0when inactive.In the future, it would be nice to refactor this to use
sp_shard/sp_gather, or usesp_planif possible.There are also a few other small fixes / refactors to avoid potential future issues around gradients & ensuring that the sp_size isn't
Noneto make type hints happy. It also updates the docs to add a note about this in case others run into this later on.Test Plan
Run
text_to_image.pywith:to ensure that things look correct for both ulysses/ ring attention for sequence parallelism.
Test Result
Result should be identical for all configurations and should no longer crash when ulysses is enabled for LongCat. Expected output below:
