Conversation
Signed-off-by: ttanzhiqiang <389825161@qq.com>
Force-pushed from 6f37250 to 52d067d.
wangxiyuan left a comment:
Not sure if this change conflicts with #907, but LGTM at least.
```python
layer.w2_weight.data = layer.w2_weight.data.transpose(
    1, 2).contiguous()
torch_npu.npu_format_cast_(layer.w13_weight, 29)
torch_npu.npu_format_cast_(layer.w2_weight, 29)
```
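For context, this snippet swaps the last two dimensions of the grouped w2 weight (so each expert's matrix is stored transposed for the grouped matmul) and then casts both weight tensors to NPU private format 29 (FRACTAL_NZ in ACL's format enumeration, to my understanding). The format cast itself only exists on the device, but the layout change can be sketched in NumPy; the shapes below (`num_experts`, `hidden`, `intermediate`) are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical grouped w2 weight: one (hidden x intermediate) matrix per expert.
num_experts, hidden, intermediate = 4, 8, 16
w2 = np.arange(num_experts * hidden * intermediate, dtype=np.float32)
w2 = w2.reshape(num_experts, hidden, intermediate)

# Analogue of `w2_weight.data.transpose(1, 2).contiguous()` in torch:
# swap the two matrix axes of every expert and materialize the new layout.
w2_t = np.ascontiguousarray(w2.transpose(0, 2, 1))

assert w2_t.shape == (num_experts, intermediate, hidden)
assert w2_t.flags["C_CONTIGUOUS"]
assert w2_t[1, 3, 2] == w2[1, 2, 3]  # each expert matrix is transposed
```

`torch_npu.npu_format_cast_` has no NumPy analogue; on the device it additionally repacks the now-contiguous tensor into the blocked layout that the grouped matmul kernels expect.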
Thanks for the contribution. The transformation of grouped_matmul may lead to accuracy problems until a newer torch_npu release. I think it is good to put this PR on hold temporarily.
This cannot be merged until a new version of torch_npu is available to the community.
### What this PR does / why we need it?
Single machine, 16 cards, deepseek-r1: attention (tp8/dp2) / moe (etp).

Best performance relies on:
- vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
- vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
- #910
- [Reduce _npu_flash_attention mask to 128x128 for memory savings] #1100
- [Reduce memory usage by splitting tokens in fused_experts]

Signed-off-by: ttanzhiqiang <389825161@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Codecov Report ❌ Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main     #910    +/-  ##
========================================
- Coverage   27.39%   27.24%   -0.16%
========================================
  Files          56       56
  Lines        6191     6222      +31
========================================
- Hits         1696     1695       -1
- Misses       4495     4527      +32
```
Flags with carried forward coverage won't be shown.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
I think this one can be closed now. Feel free to create a new one if it's still needed.
What this PR does / why we need it?
Update the weight format to improve TPOT (time per output token) by about 3 ms.
Insert code to convert the weight format of specific layers. Requires updated grouped GEMM support in torch_npu for both TASK_QUEUE_ENABLE=1 and TASK_QUEUE_ENABLE=2.
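The TASK_QUEUE_ENABLE requirement above is an environment switch for torch_npu's dispatch task queue. A minimal, hedged sketch of supplying it from Python (the variable name and values come from the PR text; setting it before torch_npu is imported is an assumption about initialization order):

```python
import os

# TASK_QUEUE_ENABLE selects torch_npu's dispatch task-queue level; the PR
# notes the grouped GEMM needs updated torch_npu support under both 1 and 2.
# It must be set before torch_npu initializes, i.e. before the import.
os.environ["TASK_QUEUE_ENABLE"] = "2"

assert os.environ["TASK_QUEUE_ENABLE"] == "2"
```

In practice this is usually exported in the launch script rather than set in code.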
Does this PR introduce any user-facing change?
How was this patch tested?