-
Notifications
You must be signed in to change notification settings - Fork 629
[bugfix][torchair] fix kv_nz accuracy problem and remove redundant reshape_and_cache operation #3066
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request correctly removes a redundant _npu_reshape_and_cache operation that was being called in torchair graph mode. This is a good simplification, as caching is already handled by npu_kv_rmsnorm_rope_cache in that scenario. However, an important assertion checking the kv_cache size was also removed. I've recommended re-adding it to ensure code robustness and prevent potential runtime errors.
1aace42 to
3893cd1
Compare
…cache operation Signed-off-by: linfeng-yuan <[email protected]>
3893cd1 to
d745a8e
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3066 +/- ##
==========================================
- Coverage 74.76% 71.96% -2.81%
==========================================
Files 150 168 +18
Lines 20891 23544 +2653
==========================================
+ Hits 15620 16943 +1323
- Misses 5271 6601 +1330
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@linfeng-yuan Thanks a lot!!! |
|
@linfeng-yuan This pull request only fixes one accuracy problem. Tests show accuracy is fine without KV NZ enabled, but still problematic when it's on, even after applying this change. The GSM-8K benchmark scores are still too low with KV NZ active.... |
|
@linfeng-yuan The |
jianzs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR cannot be merged until kv_nz is supported during the prefill phase.
What this PR does / why we need it?
This PR removes redundant calling of
reshape_and_cacheoperation at prefilling stage with torchair graph mode. This reduces prefilling latency as well as fixes accuracy problem whileenable_kv_nzis enabled. Although #2988 fixesenable_kv_nzaccuracy problem, the output tokens with deepseek is inaccurate, leading to a decline in benchmark scoring.Does this PR introduce any user-facing change?
No.
How was this patch tested?
We run e2e online serving and accuracy test containing eager mode with
enable_shared_expert_dpand torchair graph mode withenable_kv_nz.