[Refactor] Use SP Plan for LongCat Sequence Parallelism #1772
wtomin merged 8 commits into vllm-project:main
Conversation
Signed-off-by: Alex Brooks <albrooks@redhat.com>
don't pass sp on fwd context
Signed-off-by: Alex Brooks <albrooks@redhat.com>
formatting
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 87d6cbacfc
Signed-off-by: Alex Brooks <albrooks@redhat.com>
This SP implementation looks good to me. Please enrich your PR body with some experiment evidence:
## Approach 2: Intrusive Modification (For Complex Cases)

For models with dynamic sharding logic that cannot be expressed via `_sp_plan`, manually insert shard/gather calls. Importantly, when taking this approach, be careful to ensure that you correctly manage the `_sp_shard_depth`; if the sequence parallel shard depth is 0, Ulysses will not be used.
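The intrusive pattern the quoted docs describe can be sketched with simplified stand-ins. `sp_shard`/`sp_gather` below are illustrative only (plain list slicing in place of the real collective ops, not the vLLM Omni API):

```python
# Illustrative-only sketch of the manual shard/gather approach.
# sp_shard/sp_gather and the rank handling are simplified stand-ins
# for the real distributed collectives (not the vLLM Omni API).

def sp_shard(seq, rank, world_size):
    """Give each SP rank a contiguous chunk of the sequence dimension."""
    chunk = len(seq) // world_size
    return seq[rank * chunk:(rank + 1) * chunk]

def sp_gather(chunks):
    """Reassemble the full sequence from per-rank chunks (all-gather analog)."""
    return [tok for chunk in chunks for tok in chunk]

# Simulate a forward pass over 2 SP ranks: shard on entry, run the
# per-rank computation, gather on exit so downstream code sees the
# full sequence again.
full_seq = list(range(8))
sharded = [sp_shard(full_seq, r, 2) for r in range(2)]
processed = [[tok * 2 for tok in chunk] for chunk in sharded]
out = sp_gather(processed)
```

The point of the declarative `_sp_plan` alternative is that these shard/gather boundaries are stated once as data instead of being woven through each model's `forward`.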
I suggest that we still keep the document related to intrusive modification, in case some models have very complicated sharding logic in the future.
Remove `_sp_shard_depth` from the document, as it is not an interface that should be exposed to developers.
Since #1704 is merged, there is no bug with either `_sp_plan` or a manual SP implementation.
@wtomin @lishunyang12 I see - added it back, but my main concern is contributors copy+pasting this approach across models where it isn't needed, and it being an extra maintenance burden with hooks that take a similar approach to TeaCache.
Maybe this is more of an antipattern on the way TeaCache is implemented, but given that we already have to copy so much forward code into the extractors, I think it would be cleaner & a similar amount of work to have a SupportsTeaCache protocol, and just refactor the model .forward boundaries to support it.
Any thoughts on this? I will open a PR so that we can discuss this more concretely, but the combination of SP potentially modifying forward, with forward being mostly copied into extractors, may be a frequent cause of bugs when they are used together.
Indeed, later we can discuss a better interface for TeaCache that is both cleaner and easier to maintain.
lishunyang12
left a comment
Left a couple of nits. Agree with wtomin's suggestion on keeping the intrusive modification docs.
_repeated_blocks = ["LongCatImageTransformerBlock", "LongCatImageSingleTransformerBlock"]

# Sequence Parallelism for LongCat (following diffusers' _cp_plan pattern)
Nit: should this say _sp_plan? _cp_plan is the diffusers name for context parallelism.
No, this is correct - it's saying that the pattern for sequence parallel in vLLM Omni is similar to the pattern Diffusers takes for context parallel
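To make the analogy concrete, a declarative plan in the style discussed here might look like the sketch below. The class, the plan schema, and the submodule names are assumptions for illustration, not the exact vLLM Omni interface:

```python
# Illustrative-only sketch of the declarative plan style (analogous to
# diffusers' _cp_plan for context parallelism). Attribute names and the
# plan schema are assumptions, not the exact vLLM Omni interface.

class LongCatImageTransformerSketch:
    # Repeated blocks that can be treated uniformly by the runtime.
    _repeated_blocks = [
        "LongCatImageTransformerBlock",
        "LongCatImageSingleTransformerBlock",
    ]
    # Map submodule paths to the tensor args whose sequence dim (1)
    # should be sharded across SP ranks; "" denotes the model's own inputs.
    _sp_plan = {
        "": {"hidden_states": 1, "encoder_hidden_states": 1},
        "proj_out": {"hidden_states": 1},
    }

plan = LongCatImageTransformerSketch._sp_plan
```

The appeal of this shape is that shard/gather points are declared once as class-level data, so the runtime can apply them without any SP-specific branches inside `forward`.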
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Thanks for the reviews @wtomin @lishunyang12 - for the inference script, you can just run the I ran it a few times with to compare to main, and the results are similar (metrics for running on H100). Main:
This branch:
For memory, both branches peak around ~35000 MiB on each GPU, for the number of GPUs used for parallelism.
Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com> Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
lishunyang12
left a comment
Looks good overall — clean migration to _sp_plan. Left a couple comments.
# LongCat uses dual-stream (text + image) with joint attention
# Text embeddings should be replicated across SP ranks for correctness
fwd_context.sequence_parallel_size = sp_size
fwd_context.split_text_embed_in_sp = False
This is only set when sp_size > 1, but single-stream attention reads it from forward_ctx unconditionally. The non-SP path relies on the ForwardContext default being False — works now but fragile. Consider always setting it explicitly.
In the attention, it also checks that sp_size isn't None and is > 1, so it is fine since this is an attr on fwd context - the other issues around the forward context were because this model was setting things on the forward context that weren't attributes 🙂
I can open a small follow-up around this, but let's do it separately from this PR since it matches what the current code does - I think setting it as a flag in forward is an antipattern anyway, since the behavior should be consistent within each model, and AFAIK this flag is never True in Omni at the moment anyway.
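The fragility being discussed can be shown with a minimal stand-in for the forward context (the class and field names below are assumptions mirroring the quoted diff, not the real vLLM Omni `ForwardContext`):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ForwardContextStub:
    """Minimal stand-in for the forward context; not the real class."""
    sequence_parallel_size: Optional[int] = None
    split_text_embed_in_sp: bool = False  # the non-SP path leans on this default

def configure_sp(ctx: ForwardContextStub, sp_size: int) -> None:
    # Setting both attrs unconditionally (rather than only when sp_size > 1)
    # keeps the SP and non-SP paths explicit instead of relying on the
    # dataclass default.
    ctx.sequence_parallel_size = sp_size
    ctx.split_text_embed_in_sp = False  # replicate text embeds across SP ranks

ctx = ForwardContextStub()
configure_sp(ctx, 4)
```

Since these are declared dataclass fields rather than ad-hoc attributes, a reader of the non-SP path can see the default without tracing the SP setup code.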
    and text_seq_len is not None
):
    # Ensure that the SP split won't cause out of bounds issues.
    if text_seq_len < 0 or text_seq_len > query.shape[1]:
This allows text_seq_len == query.shape[1] (zero image tokens). Is that intentional? SP would split an empty tensor in that case.
I had wanted to avoid sequence-parallel-specific validation that might be confusing, but good point - I think this will hit a bug in the existing code: with the way it is currently written, it'll throw when applying the rotary embeddings from diffusers before it actually gets to the SP part. I will fix it 🙂
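The tightened bounds check under discussion amounts to something like the hypothetical helper below (the real validation lives inline in the attention path; this just isolates the condition):

```python
# Hypothetical helper illustrating the tightened bounds check: it rejects
# text_seq_len == total_seq_len (zero image tokens), where the SP split of
# the image stream would operate on an empty tensor.

def validate_text_seq_len(text_seq_len: int, total_seq_len: int) -> None:
    """Reject splits that leave the image stream empty or fall out of range."""
    if text_seq_len < 0 or text_seq_len >= total_seq_len:
        raise ValueError(
            f"text_seq_len={text_seq_len} must be in [0, {total_seq_len})"
        )
```

Using `>=` instead of `>` is the one-character difference that turns the "zero image tokens" edge case into an early, explicit error rather than a downstream shape failure.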
self.parallel_config = od_config.parallel_config

self.pos_embed = LongCatImagePosEmbed(theta=10000, axes_dim=axes_dims_rope)
self.rope_preparer = RoPEPreparer(self.pos_embed)
Nit: pos_embed is now accessible via both self.pos_embed and self.rope_preparer.pos_embed. Any risk of duplicate keys in state_dict?
No, since there are no learnable params at the moment. Good point though; this is consistent with what current models are doing, so I think it would be best to refactor them all together if we change this.
…#1772)
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Purpose
Fixes #1692
Refactors LongCat Image to use the SP Plan Approach
It also looks like LongCat Image is broken for the non-SP case due to a missing attr on `forward_ctx`, probably from refactoring in the PR that fixed SP 😞 This PR fixes that as well.
This PR also removes the docs for Intrusive Modification for sequence parallelism, to push contributors towards contributing sequence parallelism with a unified pattern moving forward.
CC @wtomin @ZJY0516, could you PTAL?
I'll also wait for this to be merged before finishing #1487 so that the SP details won't bleed into the teacache extractor for this model