
[Bugfix] bugfix for moe_mlp in vllm-ascend/v0.11.0-dev#4885

Merged
wangxiyuan merged 6 commits into vllm-project:v0.11.0-dev from Clorist33:bugfix_moe_mlp_for_dev_new
Dec 12, 2025

Conversation

@Clorist33
Contributor

What this PR does / why we need it?

This PR fixes a bug in the moe_mlp module by correcting the arguments passed to the torch_npu.npu_dequant_swiglu_quant function. It properly converts group_list from a cumulative sum to per-group counts for the group_index parameter.
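
The conversion the description refers to can be sketched in plain Python (a torch-free, illustrative stand-in; the actual module operates on torch tensors):

```python
# Illustrative sketch only: converting a cumulative-sum group list to
# per-group counts, the representation group_index expects here.
def cumsum_to_counts(cumsum):
    """[2, 5, 9] (cumulative sums) -> [2, 3, 4] (per-group counts)."""
    counts = []
    prev = 0
    for value in cumsum:
        counts.append(value - prev)
        prev = value
    return counts

print(cumsum_to_counts([2, 5, 9]))  # [2, 3, 4]
```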

Does this PR introduce any user-facing change?

No

Signed-off-by: tanqingshan (A) <50050625@china.huawei.com>

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the cumsum_group_list function to support bidirectional conversions between cumulative sum and count representations of expert groups. While this is a good improvement for code clarity, the refactoring has introduced a critical bug in the conversion from cumulative sum to counts. I've provided a code suggestion to fix this. Additionally, I've pointed out that the test suite is missing coverage for this new conversion path, which is why the bug was not caught. Please add the suggested test case to prevent future regressions.

Comment on lines +42 to +45
    if src_list_type == 0 and dst_list_type == 1:
        group_diff = torch.diff(group_list)
        new_group = torch.cat([group_diff[0].unsqueeze(0), group_diff], dim=0)
        return new_group
Contributor

critical

There is a bug in the logic for converting group_list from cumulative sum (src_list_type=0) to counts (dst_list_type=1). The implementation incorrectly uses group_diff[0] instead of group_list[0] to construct the new group tensor. This will lead to incorrect counts and subsequent errors in npu_dequant_swiglu_quant.

The previous implementation before this refactoring was correct. You should use group_list[0] to get the first element of the cumulative sum, which corresponds to the first count.

Suggested change
-    if src_list_type == 0 and dst_list_type == 1:
-        group_diff = torch.diff(group_list)
-        new_group = torch.cat([group_diff[0].unsqueeze(0), group_diff], dim=0)
-        return new_group
+    if src_list_type == 0 and dst_list_type == 1:
+        group_diff = torch.diff(group_list)
+        new_group = torch.cat([group_list[0].unsqueeze(0), group_diff], dim=0)
+        return new_group
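
A small worked example (pure Python, hypothetical values) shows why restoring the first count from group_diff[0] is wrong:

```python
# For cumulative sums [2, 5, 9] the true per-group counts are [2, 3, 4].
# Differencing drops the first count, so it must be restored from the
# first cumulative sum (group_list[0]), not from the first difference.
group_list = [2, 5, 9]                                            # cumulative sums
group_diff = [b - a for a, b in zip(group_list, group_list[1:])]  # [3, 4]

buggy = [group_diff[0]] + group_diff  # [3, 3, 4] -- first count is wrong
fixed = [group_list[0]] + group_diff  # [2, 3, 4] -- matches the true counts
print(buggy, fixed)
```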

    group_list_type = 0
-    result = cumsum_group_list(group_list, group_list_type)
+    result = cumsum_group_list(group_list, group_list_type, 0)
    self.assertTrue(torch.equal(result, self.group_list))
Contributor

high

The test suite for cumsum_group_list is incomplete. It's missing a test case for converting from src_list_type=0 (cumulative sum) to dst_list_type=1 (counts). Adding this test case would have caught the critical bug introduced in moe_mlp.py.

Please add a test to cover this conversion. For example:

def test_cumsum_group_list_from_type_0_to_1(self):
    group_list_cumsum = self.experts.cumsum(dim=0)
    result = cumsum_group_list(group_list_cumsum, src_list_type=0, dst_list_type=1)
    self.assertTrue(torch.equal(result, self.experts))
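
The property behind that suggested test can also be checked torch-free as a round trip (illustrative helper names, not the project's API): cumulative-summing the per-expert counts and then converting back must recover the original counts.

```python
from itertools import accumulate

# Round-trip sketch: counts -> cumulative sums -> counts.
def counts_to_cumsum(counts):
    return list(accumulate(counts))

def cumsum_to_counts(cumsum):
    return [b - a for a, b in zip([0] + cumsum, cumsum)]

experts = [2, 3, 4]
assert counts_to_cumsum(experts) == [2, 5, 9]
assert cumsum_to_counts(counts_to_cumsum(experts)) == experts
```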

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Signed-off-by: tanqingshan (A) <50050625@china.huawei.com>
@Clorist33 Clorist33 force-pushed the bugfix_moe_mlp_for_dev_new branch from 56bf855 to ac2a72c on December 11, 2025 09:19
@wangxiyuan
Collaborator

wangxiyuan commented Dec 11, 2025

picked from #4822

@wangxiyuan wangxiyuan added the ready (ready for review) and ready-for-test (start test by label for PR) labels Dec 11, 2025
@wangxiyuan wangxiyuan merged commit 4f0dddc into vllm-project:v0.11.0-dev Dec 12, 2025
36 of 40 checks passed

Labels

module:ops, module:tests, ready (ready for review), ready-for-test (start test by label for PR)



4 participants