[Bugfix] bugfix for moe_mlp in vllm-ascend/v0.11.0-dev#4885
wangxiyuan merged 6 commits into vllm-project:v0.11.0-dev from
Conversation
Signed-off-by: tanqingshan (A) <50050625@china.huawei.com>
Code Review
This pull request refactors the cumsum_group_list function to support bidirectional conversions between cumulative sum and count representations of expert groups. While this is a good improvement for code clarity, the refactoring has introduced a critical bug in the conversion from cumulative sum to counts. I've provided a code suggestion to fix this. Additionally, I've pointed out that the test suite is missing coverage for this new conversion path, which is why the bug was not caught. Please add the suggested test case to prevent future regressions.
    if src_list_type == 0 and dst_list_type == 1:
        group_diff = torch.diff(group_list)
        new_group = torch.cat([group_diff[0].unsqueeze(0), group_diff], dim=0)
        return new_group
There is a bug in the logic for converting group_list from cumulative sum (src_list_type=0) to counts (dst_list_type=1). The implementation incorrectly uses group_diff[0] instead of group_list[0] to construct the new group tensor. This will lead to incorrect counts and subsequent errors in npu_dequant_swiglu_quant.
The previous implementation before this refactoring was correct. You should use group_list[0] to get the first element of the cumulative sum, which corresponds to the first count.
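To make the off-by-one concrete, here is a plain-Python sketch of the cumulative-sum-to-counts conversion the reviewer describes (hypothetical data, no torch dependency): the first count is the first cumulative value itself, and each later count is the difference between adjacent cumulative values.

```python
# Plain-Python illustration of converting a cumulative-sum group list to
# per-group counts; the data below is made up for demonstration.
def cumsum_to_counts(cumsum):
    # The first count equals the first cumulative value; later counts are
    # differences of adjacent cumulative values (what torch.diff computes).
    return [cumsum[0]] + [b - a for a, b in zip(cumsum, cumsum[1:])]

counts = [3, 1, 4, 2]            # tokens routed to each expert group
cumsum = [3, 4, 8, 10]           # running total of the counts above
print(cumsum_to_counts(cumsum))  # [3, 1, 4, 2]
```

With the buggy variant, the first element would be the first difference (1) rather than the first cumulative value (3), yielding [1, 1, 4, 2] for this data.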
Current:

    if src_list_type == 0 and dst_list_type == 1:
        group_diff = torch.diff(group_list)
        new_group = torch.cat([group_diff[0].unsqueeze(0), group_diff], dim=0)
        return new_group

Suggested:

    if src_list_type == 0 and dst_list_type == 1:
        group_diff = torch.diff(group_list)
        new_group = torch.cat([group_list[0].unsqueeze(0), group_diff], dim=0)
        return new_group
    group_list_type = 0
    result = cumsum_group_list(group_list, group_list_type)     # before
    result = cumsum_group_list(group_list, group_list_type, 0)  # after
    self.assertTrue(torch.equal(result, self.group_list))
The test suite for cumsum_group_list is incomplete. It's missing a test case for converting from src_list_type=0 (cumulative sum) to dst_list_type=1 (counts). Adding this test case would have caught the critical bug introduced in moe_mlp.py.
Please add a test to cover this conversion. For example:
def test_cumsum_group_list_from_type_0_to_1(self):
group_list_cumsum = self.experts.cumsum(dim=0)
result = cumsum_group_list(group_list_cumsum, src_list_type=0, dst_list_type=1)
    self.assertTrue(torch.equal(result, self.experts))
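A lightweight way to guard both conversion directions is a round-trip property check. The sketch below is plain Python with hypothetical helper names, independent of the vllm-ascend test harness, asserting that counts -> cumulative sum -> counts is the identity.

```python
import itertools

def counts_to_cumsum(counts):
    # counts -> cumulative sum (the dst_list_type=0 direction)
    return list(itertools.accumulate(counts))

def cumsum_to_counts(cumsum):
    # cumulative sum -> counts (the dst_list_type=1 direction)
    return [cumsum[0]] + [b - a for a, b in zip(cumsum, cumsum[1:])]

counts = [2, 5, 1, 7]
assert cumsum_to_counts(counts_to_cumsum(counts)) == counts
print("round trip ok")
```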
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: tanqingshan (A) <50050625@china.huawei.com>
Force-pushed from 56bf855 to ac2a72c
picked from #4822
What this PR does / why we need it?
This PR fixes a bug in the moe_mlp module by correcting the arguments passed to the torch_npu.npu_dequant_swiglu_quant function. It properly converts group_list from a cumulative sum to counts for the group_index parameter.
Does this PR introduce any user-facing change?
No