@linfeng-yuan
Collaborator

@linfeng-yuan linfeng-yuan commented Sep 20, 2025

What this PR does / why we need it?

This PR removes the redundant reshape_and_cache call in the prefill stage under torchair graph mode. This reduces prefill latency and fixes an accuracy problem when enable_kv_nz is enabled. Although #2988 fixes an enable_kv_nz accuracy problem, the output tokens with DeepSeek are still inaccurate, which lowers benchmark scores.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

We ran e2e online serving and accuracy tests covering eager mode with enable_shared_expert_dp and torchair graph mode with enable_kv_nz.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly removes a redundant _npu_reshape_and_cache operation that was being called in torchair graph mode. This is a good simplification, as caching is already handled by npu_kv_rmsnorm_rope_cache in that scenario. However, an important assertion checking the kv_cache size was also removed. I've recommended re-adding it to ensure code robustness and prevent potential runtime errors.
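
To make the review point concrete, here is a minimal sketch of the control flow being discussed: keep a guard on the kv_cache layout, and skip the separate cache write when a fused op has already written the cache. Everything in it is a hypothetical stand-in written in plain Python; `toy_reshape_and_cache`, `prefill_cache_step`, and the `cached_by_fused_op` flag are not vllm-ascend or torch_npu names, and the exact form of the removed assertion is assumed, since the diff is not shown here.

```python
from typing import List, Sequence, Tuple

# Hypothetical stand-ins for illustration only; none of these names come from
# vllm-ascend or torch_npu.
KVCache = Tuple[List[float], List[float]]  # (key cache, value cache), indexed by slot


def toy_reshape_and_cache(
    key: Sequence[float],
    value: Sequence[float],
    kv_cache: KVCache,
    slot_mapping: Sequence[int],
) -> None:
    """Toy cache write: copy each token's key/value into its assigned slot."""
    key_cache, value_cache = kv_cache
    for token_idx, slot in enumerate(slot_mapping):
        key_cache[slot] = key[token_idx]
        value_cache[slot] = value[token_idx]


def prefill_cache_step(
    key: Sequence[float],
    value: Sequence[float],
    kv_cache: KVCache,
    slot_mapping: Sequence[int],
    cached_by_fused_op: bool,
) -> int:
    """Return the number of slots written by this step."""
    # Assumed form of the guard: fail early if the cache does not hold
    # both a key cache and a value cache.
    assert len(kv_cache) > 1, "kv_cache must hold both key and value caches"

    if cached_by_fused_op:
        # A fused op upstream already wrote the cache, so a second write
        # would be redundant work.
        return 0

    toy_reshape_and_cache(key, value, kv_cache, slot_mapping)
    return len(slot_mapping)
```

In this sketch the guard runs on both paths, so a malformed cache is caught even when the write itself is skipped.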

@linfeng-yuan linfeng-yuan force-pushed the fix_torchair_kv_nz branch 2 times, most recently from 1aace42 to 3893cd1 Compare September 20, 2025 13:19
@linfeng-yuan linfeng-yuan changed the title from "[bugfix] fix kv_nz accuracy problem and delete redundant reshape_and_cache operation" to "[bugfix] fix kv_nz accuracy problem and remove redundant reshape_and_cache operation" Sep 20, 2025
@codecov

codecov bot commented Sep 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.96%. Comparing base (1bbb20e) to head (d745a8e).
⚠️ Report is 81 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3066      +/-   ##
==========================================
- Coverage   74.76%   71.96%   -2.81%     
==========================================
  Files         150      168      +18     
  Lines       20891    23544    +2653     
==========================================
+ Hits        15620    16943    +1323     
- Misses       5271     6601    +1330     
Flag Coverage Δ
unittests 71.96% <100.00%> (-2.81%) ⬇️

@linfeng-yuan linfeng-yuan added the "ready" (read for review) and "ready-for-test" (start test by label for PR) labels Sep 20, 2025
@linfeng-yuan linfeng-yuan changed the title from "[bugfix] fix kv_nz accuracy problem and remove redundant reshape_and_cache operation" to "[bugfix][torchair] fix kv_nz accuracy problem and remove redundant reshape_and_cache operation" Sep 20, 2025
@jianzs
Collaborator

jianzs commented Sep 21, 2025

@linfeng-yuan Thanks a lot!!!

@wangxiyuan wangxiyuan removed and re-added the "ready-for-test" (start test by label for PR) label Sep 22, 2025
@jianzs
Collaborator

jianzs commented Sep 22, 2025

@linfeng-yuan This pull request fixes only one accuracy problem. Tests show accuracy is fine with KV NZ disabled but still degraded with it enabled, even after applying this change. The GSM-8K benchmark scores are still too low with KV NZ active.

@jianzs
Collaborator

jianzs commented Sep 22, 2025

@linfeng-yuan The torch_npu.atb.npu_ring_mla and torch_npu._npu_flash_attention functions are used in the prefill stage, but the code doesn't seem to adapt them for KV NZ. Could this be the cause of the problem?

@linfeng-yuan linfeng-yuan removed the "ready" (read for review) and "ready-for-test" (start test by label for PR) labels Sep 22, 2025
Collaborator

@jianzs jianzs left a comment

This PR cannot be merged until kv_nz is supported during the prefill phase.


3 participants