
[Bugfix] Reduce _npu_flash_attention mask to 128x128 for memory savings #1100

Closed
ApsarasX wants to merge 1 commit into vllm-project:main from ApsarasX:community-attn_mask

Conversation

ApsarasX (Collaborator) commented Jun 6, 2025

What this PR does / why we need it?

Avoid generating a special attn_mask (which could theoretically be as large as bf16[max_model_len, max_model_len]) when chunked prefill is not enabled and the user input is extremely long.

However, this PR depends on corresponding modifications to torch_npu._npu_flash_attention.
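
To make the saving concrete, a back-of-the-envelope comparison is sketched below. The max_model_len value is a hypothetical 128k-token configuration chosen for illustration; only the bf16 element size and the 128x128 tile size come from the PR itself.

```python
# Back-of-the-envelope mask sizes (pure arithmetic, no NPU required).
max_model_len = 131072          # hypothetical 128k-token context
bf16_bytes = 2                  # bytes per bf16 element

full_mask_bytes = max_model_len * max_model_len * bf16_bytes   # [max_model_len, max_model_len]
tile_mask_bytes = 128 * 128 * bf16_bytes                       # compressed 128x128 tile

print(f"full causal mask : {full_mask_bytes / 1024**3:.0f} GiB")  # 32 GiB
print(f"128x128 tile     : {tile_mask_bytes / 1024:.0f} KiB")     # 32 KiB
```

With the compressed tile, the mask cost no longer grows with max_model_len, which is the memory saving the title refers to.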

Does this PR introduce any user-facing change?

No

How was this patch tested?

No

ttanzhiqiang mentioned this pull request Jun 6, 2025
wangxiyuan pushed a commit that referenced this pull request Jun 11, 2025
### What this PR does / why we need it?
Single machine 16 cards deepseekr1 attention (tp8/dp2) / moe(etp) best performance

rely on:
vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
+ #910 + [Reduce _npu_flash_attention mask to 128x128 for memory savings] #1100 [Reduce memory usage by splitting tokens in fused_experts]

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
github-actions (Contributor) commented

This pull request has conflicts, please resolve those before we can evaluate the pull request.

shiyuan680 pushed a commit to raindaywhu/vllm-ascend that referenced this pull request Jul 7, 2025
Yikun (Member) commented Aug 8, 2025

Is this PR still needed? If so, please rebase, otherwise please close it. Thanks

ApsarasX force-pushed the community-attn_mask branch from d951107 to df4cee4 on August 11, 2025 06:54
ApsarasX (Collaborator, Author) commented Aug 11, 2025

> Is this PR still needed? If so, please rebase, otherwise please close it. Thanks

This PR will still be necessary once torch_npu._npu_flash_attention supports the compressed 128x128 mask.

I have already rebased this PR.
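
For illustration only, here is a minimal sketch of how such a compressed causal tile could be built on the Python side. The exact layout and semantics that torch_npu._npu_flash_attention will expect once it supports the compressed mask are defined by that operator; everything below is an assumption, not the vllm-ascend implementation.

```python
import torch

def build_compressed_causal_mask(tile_size: int = 128,
                                 dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    # Upper-triangular tile: 1 marks positions above the diagonal (future tokens)
    # that must be masked out. A kernel with compressed-mask support is assumed
    # to reuse this single tile for every 128x128 block of the full causal
    # pattern, so the mask never has to be materialized at [seq_len, seq_len].
    tile = torch.triu(torch.ones(tile_size, tile_size), diagonal=1)
    return tile.to(dtype)

mask = build_compressed_causal_mask()
print(mask.shape, mask.numel() * mask.element_size(), "bytes")  # torch.Size([128, 128]) 32768 bytes
```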

codecov bot commented Aug 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.31%. Comparing base (1ab1541) to head (df4cee4).
⚠️ Report is 724 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1100   +/-   ##
=======================================
  Coverage   76.31%   76.31%           
=======================================
  Files         116      116           
  Lines       13238    13238           
=======================================
  Hits        10102    10102           
  Misses       3136     3136           
Flag        Coverage Δ
unittests   76.31% <ø> (ø)

github-actions (Contributor) commented

This pull request has conflicts, please resolve those before we can evaluate the pull request.

ApsarasX closed this Aug 27, 2025
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025