Fix num tokens when sp moe enabled#25828
wuxun-zhang wants to merge 1 commit into vllm-project:main
Conversation
Signed-off-by: Wuxun Zhang <wuxun.zhang@intel.com>
Code Review
This pull request correctly fixes a bug in the GraniteMoeMoE layer where num_tokens was miscalculated when sequence parallelism is enabled with 3D input tensors. The change ensures num_tokens is derived after flattening the hidden_states tensor, which is the correct approach for handling inputs of various dimensions. The fix is sound and resolves the issue described. I have not found any high or critical severity issues in the proposed changes.
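The review above describes the fix: derive `num_tokens` only after flattening `hidden_states`. A minimal pure-Python sketch of that shape logic (hypothetical helper name, not vLLM's actual code) shows why reading `shape[0]` before flattening is wrong for 3D inputs:

```python
from math import prod

def num_tokens_after_flatten(shape: tuple[int, ...]) -> int:
    # shape may be (num_tokens, hidden_dim) on most backends, or
    # (bs, seqlen, hidden_dim) on backends like Gaudi.
    # The buggy path effectively read shape[0] directly, giving bs
    # instead of bs * seqlen for 3D inputs. Flattening all leading
    # dims first is equivalent to taking the product of shape[:-1].
    return prod(shape[:-1])
```

For a 3D input of shape `(2, 4, 8)` this yields `2 * 4 = 8` tokens, whereas `shape[0]` alone would incorrectly report `2`.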
@tlrmchlsmth Hi, could you please take a look at this? I am not able to trigger CI.
@tlrmchlsmth please take a further review, thanks!
Update: for Gaudi, we are going to follow the same assumption (2D hidden states) as the community. Currently we have a flag for this, so this will not be a hard requirement. But I think it's still a valid fix for 1D hidden states (as mentioned in the code comment), though I'm not sure in what scenario hidden states would be 1D.
After vllm-project/vllm#24982 merged, sequence-parallel MOE will be turned on when `enable_expert_parallel=True`, `tp_size > 1` and `dp_size > 1`. Since for Gaudi there is no choice for `VLLM_ALL2ALL_BACKEND`, we cannot easily bypass it. So this PR aims to support the feature.

```python
class ParallelConfig:
    @property
    def use_sequence_parallel_moe(self) -> bool:
        return (envs.VLLM_ALL2ALL_BACKEND in ("allgather_reducescatter",
                                              "naive",
                                              "deepep_high_throughput",
                                              "deepep_low_latency")
                and self.enable_expert_parallel
                and self.tensor_parallel_size > 1
                and self.data_parallel_size > 1)
```

Update: No hard requirement on vllm-project/vllm#25828

---------

Signed-off-by: Wuxun Zhang <wuxun.zhang@intel.com>
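The gating condition quoted above can be exercised in isolation. Here is a free-function sketch of the same predicate (hypothetical standalone form for illustration, not vLLM's actual API):

```python
def use_sequence_parallel_moe(all2all_backend: str,
                              enable_expert_parallel: bool,
                              tp_size: int,
                              dp_size: int) -> bool:
    # Mirrors ParallelConfig.use_sequence_parallel_moe: SP MoE is only
    # enabled for these all2all backends, and only when expert
    # parallelism is on with both TP and DP sizes greater than 1.
    return (all2all_backend in ("allgather_reducescatter", "naive",
                                "deepep_high_throughput",
                                "deepep_low_latency")
            and enable_expert_parallel
            and tp_size > 1
            and dp_size > 1)
```

For example, `use_sequence_parallel_moe("naive", True, 2, 2)` is true, while dropping either parallel dimension to 1 disables it — which is why Gaudi, with no alternative `VLLM_ALL2ALL_BACKEND`, cannot opt out of the feature once EP, TP, and DP are all active.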
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!
For some backends like Gaudi, the input shape of the first MOE layer could be 3D (bs, seqlen, hdim). So when SP MOE is enabled (#24982), `num_tokens` may get a wrong value. @tlrmchlsmth
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.