
Conversation

@realliujiaxu (Contributor) commented Sep 17, 2025

What this PR does / why we need it?

When `enable_kv_nz` is true, the output of DeepSeek R1 is invalid:

{"id":"chatcmpl-f01fa84f397b417abf6d3c5243787d38","object":"chat.completion","created":1758106562,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>0I0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":7,"total_tokens":107,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}

The reason is that the decode stage incorrectly uses `torch_npu.npu_kv_rmsnorm_rope_cache`. After the fix:

{"id":"chatcmpl-4a9e79534b2b4945a9496621d1dc53f6","object":"chat.completion","created":1758108051,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\nOkay, the user sent \"Hello, my name is\" and then the conversation ends. They might have intended to complete the sentence but didn't. I should prompt them to provide their name so I can address them properly. Maybe respond with a friendly greeting and ask for their name. Let me make sure to keep it welcoming and open. Something like, \"Hello! It's nice to meet you. Could you please tell me your name?\" That should encourage them to share their name without","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":7,"total_tokens":107,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}

Run the server:

nohup python -m vllm.entrypoints.openai.api_server --model=DeepSeek-R1-W8A8-VLLM \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8009 \
    -tp=8 \
    -dp=2 \
    --max-num-seqs 24 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24], "enable_kv_nz": true},"expert_tensor_parallel_size":16}' \
    --block-size 128 \
    --gpu-memory-utilization 0.96 &> run.log &
disown
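
For readability, the one-line --additional-config JSON above expands to (same content, reformatted):

{
    "torchair_graph_config": {
        "enabled": true,
        "use_cached_graph": true,
        "graph_batch_sizes": [24],
        "enable_kv_nz": true
    },
    "expert_tensor_parallel_size": 16
}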

Send a request with curl:

curl --location 'http://127.0.0.1:8009/v1/chat/completions' --header 'Content-Type: application/json' --data '{
        "top_p": 1,
        "ignore_eos": false,
        "stream": false,
        "max_tokens": 100,
        "stop": "None",
        "top_k": -1,
        "temperature": 0.6,
        "messages": [
            {
                "role": "system",
                "content": "Hello, my name is"
            }
        ]
    }'
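
The same request can also be reproduced from Python. Below is a rough equivalent using the openai client package (an assumption for convenience, not part of the original repro; top_k and ignore_eos are vLLM extensions, so they go through extra_body):

from openai import OpenAI

# Point the client at the local vLLM server started above; vLLM ignores the key.
client = OpenAI(base_url="http://127.0.0.1:8009/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="auto",
    messages=[{"role": "system", "content": "Hello, my name is"}],
    max_tokens=100,
    temperature=0.6,
    top_p=1,
    stream=False,
    # vLLM-specific sampling parameters are passed through extra_body.
    extra_body={"top_k": -1, "ignore_eos": False},
)
print(resp.choices[0].message.content)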

Does this PR introduce any user-facing change?

No

How was this patch tested?

Signed-off-by: realliujiaxu <[email protected]>
@github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a commit message that fulfills the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request addresses a critical bug causing invalid output for Deepseek R1 models when enable_kv_nz is active. The root cause was an incorrect cache_mode ("PA_BLK_NZ") used during the prefill stage in vllm_ascend/attention/mla_v1.py and vllm_ascend/torchair/torchair_mla.py. The fix correctly changes this to "PA_NZ", aligning it with the cache_mode used in the decode stage. This change is correct and effectively resolves the accuracy issue as demonstrated in the pull request description.
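
A minimal sketch of the change the review describes (illustrative, not the verbatim diff; only the two cache_mode strings come from the review, and the tensor arguments at the call site are elided):

# In vllm_ascend/attention/mla_v1.py and vllm_ascend/torchair/torchair_mla.py,
# the prefill-stage call switches cache modes; everything else is unchanged.
torch_npu.npu_kv_rmsnorm_rope_cache(
    ...,                 # unchanged tensor arguments, elided in this sketch
    cache_mode="PA_NZ",  # was "PA_BLK_NZ", which caused the invalid output above
)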

@realliujiaxu realliujiaxu changed the title fix kv nz accuracy bug [Bugfix] fix kv nz accuracy bug Sep 17, 2025
@jianzs jianzs merged commit 723d460 into vllm-project:main Sep 17, 2025
14 checks passed
@Yikun added the ready (read for review) and ready-for-test (start test by label for PR) labels Sep 18, 2025
@Yikun (Member) commented Sep 18, 2025

Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
when `enable_kv_nz` is true, output of Deepseek R1 is invalid.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@2b85697

Signed-off-by: realliujiaxu <[email protected]>
NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025
when `enable_kv_nz` is true, output of Deepseek R1 is invalid.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@2b85697

Signed-off-by: realliujiaxu <[email protected]>
Signed-off-by: nsdie <[email protected]>
