[Bugfix] Reduce _npu_flash_attention mask to 128x128 for memory savings #1100
ApsarasX wants to merge 1 commit into vllm-project:main
Conversation
### What this PR does / why we need it?
Single machine, 16 cards, DeepSeek-R1: attention (tp8/dp2) / MoE (ETP). Best performance relies on:
- vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
- vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
- vllm-project#910
- vllm-project#1100 [Reduce _npu_flash_attention mask to 128x128 for memory savings]
- [Reduce memory usage by splitting tokens in fused_experts]

Signed-off-by: ttanzhiqiang <389825161@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Is this PR still needed? If so, please rebase, otherwise please close it. Thanks
Signed-off-by: ApsarasX <apsarax@outlook.com>
ApsarasX force-pushed from d951107 to df4cee4
This PR is still necessary once `torch_npu._npu_flash_attention` supports the compressed 128x128 mask. I have already rebased this PR.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #1100 +/- ##
=======================================
Coverage 76.31% 76.31%
=======================================
Files 116 116
Lines 13238 13238
=======================================
Hits 10102 10102
Misses 3136 3136
Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
### What this PR does / why we need it?
Avoid generating a special `attn_mask` (which could theoretically be as large as a bf16 `[max_model_len, max_model_len]` tensor) when `chunked-prefill` is not enabled and the user input is extremely long. However, this PR depends on modifications to the `torch_npu._npu_flash_attention` operator.
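To put the memory saving in perspective, here is a minimal standalone sketch (not the vllm-ascend implementation; the `causal_mask` / `mask_bytes` helpers and the `max_model_len = 32768` value are illustrative assumptions) comparing a full bf16 `[max_model_len, max_model_len]` mask with a fixed 128x128 tile:

```python
import torch

def causal_mask(size: int, dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    # Upper-triangular mask: nonzero entries mark positions to be masked out.
    return torch.triu(torch.ones(size, size, dtype=dtype), diagonal=1)

def mask_bytes(size: int, dtype: torch.dtype = torch.bfloat16) -> int:
    # Memory footprint of a square [size, size] mask in the given dtype.
    return size * size * torch.finfo(dtype).bits // 8

if __name__ == "__main__":
    max_model_len = 32768  # illustrative value only
    print(f"full bf16 [{max_model_len}, {max_model_len}] mask: "
          f"{mask_bytes(max_model_len) / 2**30:.2f} GiB")  # ~2.00 GiB
    print(f"compressed bf16 [128, 128] mask: "
          f"{mask_bytes(128) / 2**10:.2f} KiB")  # 32.00 KiB
    tile = causal_mask(128)  # small fixed tile the kernel would expand internally
```

With a 32K `max_model_len`, the full mask alone costs on the order of 2 GiB of device memory, whereas the 128x128 tile stays at a constant ~32 KiB regardless of sequence length.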
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
No