
Conversation

@realliujiaxu (Contributor) commented Sep 17, 2025

What this PR does / why we need it?

When `enable_kv_nz` is true, the output of DeepSeek R1 is invalid:

{"id":"chatcmpl-f01fa84f397b417abf6d3c5243787d38","object":"chat.completion","created":1758106562,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>0I0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0O0","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":7,"total_tokens":107,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}

The reason is that the decode stage incorrectly uses `torch_npu.npu_kv_rmsnorm_rope_cache`. After the fix:

{"id":"chatcmpl-4a9e79534b2b4945a9496621d1dc53f6","object":"chat.completion","created":1758108051,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\nOkay, the user sent \"Hello, my name is\" and then the conversation ends. They might have intended to complete the sentence but didn't. I should prompt them to provide their name so I can address them properly. Maybe respond with a friendly greeting and ask for their name. Let me make sure to keep it welcoming and open. Something like, \"Hello! It's nice to meet you. Could you please tell me your name?\" That should encourage them to share their name without","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":7,"total_tokens":107,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}

Run the server:

nohup python -m vllm.entrypoints.openai.api_server --model=DeepSeek-R1-W8A8-VLLM \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8009 \
    -tp=8 \
    -dp=2 \
    --max-num-seqs 24 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24], "enable_kv_nz": true},"expert_tensor_parallel_size":16}' \
    --block-size 128 \
    --gpu-memory-utilization 0.96 &> run.log &
disown
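
For readability, the one-line --additional-config JSON above expands to (same content, reformatted):

{
    "torchair_graph_config": {
        "enabled": true,
        "use_cached_graph": true,
        "graph_batch_sizes": [24],
        "enable_kv_nz": true
    },
    "expert_tensor_parallel_size": 16
}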

Send a request with curl:

curl --location 'http://127.0.0.1:8009/v1/chat/completions' --header 'Content-Type: application/json' --data '{
        "top_p": 1,
        "ignore_eos": false,
        "stream": false,
        "max_tokens": 100,
        "stop": "None",
        "top_k": -1,
        "temperature": 0.6,
        "messages": [
            {
                "role": "system",
                "content": "Hello, my name is"
            }
        ]
    }'
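
The same request can also be reproduced from Python. Below is a rough equivalent using the openai client package (an assumption for convenience, not part of the original repro; top_k and ignore_eos are vLLM extensions, so they go through extra_body):

from openai import OpenAI

# Point the client at the local vLLM server started above; vLLM ignores the key.
client = OpenAI(base_url="http://127.0.0.1:8009/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="auto",
    messages=[{"role": "system", "content": "Hello, my name is"}],
    max_tokens=100,
    temperature=0.6,
    top_p=1,
    stream=False,
    # vLLM-specific sampling parameters are passed through extra_body.
    extra_body={"top_k": -1, "ignore_eos": False},
)
print(resp.choices[0].message.content)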

Does this PR introduce any user-facing change?

No

How was this patch tested?

Signed-off-by: realliujiaxu <[email protected]>
@github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a commit message that fulfills the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request addresses a critical bug causing invalid output for Deepseek R1 models when enable_kv_nz is active. The root cause was an incorrect cache_mode ("PA_BLK_NZ") used during the prefill stage in vllm_ascend/attention/mla_v1.py and vllm_ascend/torchair/torchair_mla.py. The fix correctly changes this to "PA_NZ", aligning it with the cache_mode used in the decode stage. This change is correct and effectively resolves the accuracy issue as demonstrated in the pull request description.
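
A minimal sketch of the change the review describes (illustrative, not the verbatim diff; only the two cache_mode strings come from the review, and the tensor arguments at the call site are elided):

# In vllm_ascend/attention/mla_v1.py and vllm_ascend/torchair/torchair_mla.py,
# the prefill-stage call switches cache modes; everything else is unchanged.
torch_npu.npu_kv_rmsnorm_rope_cache(
    ...,                 # unchanged tensor arguments, elided in this sketch
    cache_mode="PA_NZ",  # was "PA_BLK_NZ", which caused the invalid output above
)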

@realliujiaxu realliujiaxu changed the title fix kv nz accuracy bug [Bugfix] fix kv nz accuracy bug Sep 17, 2025
@jianzs jianzs merged commit 723d460 into vllm-project:main Sep 17, 2025
14 checks passed
@Yikun added the ready (read for review) and ready-for-test (start test by label for PR) labels Sep 18, 2025
@Yikun (Member) commented Sep 18, 2025

Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
when `enable_kv_nz` is true, output of Deepseek R1 is invalid.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@2b85697

Signed-off-by: realliujiaxu <[email protected]>
NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025
when `enable_kv_nz` is true, output of Deepseek R1 is invalid.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@2b85697

Signed-off-by: realliujiaxu <[email protected]>
Signed-off-by: nsdie <[email protected]>
