
[kernel] Fix FP8 paged MQA fallback for CUDA graph capture#36250

Open
ZJY0516 wants to merge 4 commits intovllm-project:mainfrom
ZJY0516:fix_v32_fallback

Conversation


@ZJY0516 ZJY0516 commented Mar 6, 2026

Purpose

fp8_paged_mqa_logits_torch is not CUDA graph compatible:

(Worker pid=3782129) (Worker_TP1 pid=3782129) ERROR 03-07 11:21:54 [multiproc_executor.py:927]   File "/mnt/data1/zjy/code/vllm-src/vllm/model_executor/layers/sparse_attn_indexer.py", line 193, in sparse_attn_indexer
(Worker pid=3782129) (Worker_TP1 pid=3782129) ERROR 03-07 11:21:54 [multiproc_executor.py:927]     logits = fp8_paged_mqa_logits_torch(
(Worker pid=3782129) (Worker_TP1 pid=3782129) ERROR 03-07 11:21:54 [multiproc_executor.py:927]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=3782129) (Worker_TP1 pid=3782129) ERROR 03-07 11:21:54 [multiproc_executor.py:927]   File "/mnt/data1/zjy/code/vllm-src/vllm/utils/deep_gemm.py", line 510, in fp8_paged_mqa_logits_torch
(Worker pid=3782129) (Worker_TP1 pid=3782129) ERROR 03-07 11:21:54 [multiproc_executor.py:927]     context_len = context_lens[i].item()
(Worker pid=3782129) (Worker_TP1 pid=3782129) ERROR 03-07 11:21:54 [multiproc_executor.py:927]                   ^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=3782129) (Worker_TP1 pid=3782129) ERROR 03-07 11:21:54 [multiproc_executor.py:927] torch.AcceleratorError: CUDA error: operation not permitted when stream is capturing
(Worker pid=3782129) (Worker_TP1 pid=3782129) ERROR 03-07 11:21:54 [multiproc_executor.py:927] Search for `cudaErrorStreamCaptureUnsupported' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.

This PR adds a Triton kernel to replace it.
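The capture-incompatible pattern in the traceback is the per-row host read (context_lens[i].item()); the graph-safe pattern builds the validity mask with tensor ops only. A minimal sketch of the two patterns, using NumPy in place of torch/Triton (function names here are illustrative, not from the PR):

```python
import numpy as np

def mask_loop(context_lens, max_model_len):
    # Per-row loop with a host read of the length (stands in for .item()).
    # On GPU this forces a device->host sync, which is illegal while a
    # CUDA graph is being captured.
    batch = len(context_lens)
    logits = np.full((batch, max_model_len), -np.inf, dtype=np.float32)
    for i in range(batch):
        ctx = int(context_lens[i])
        logits[i, :ctx] = 0.0
    return logits

def mask_vectorized(context_lens, max_model_len):
    # Fully vectorized equivalent: positions < context_len are valid,
    # no host reads, so it is safe under stream capture.
    positions = np.arange(max_model_len)[None, :]
    valid = positions < context_lens[:, None]
    return np.where(valid, 0.0, -np.inf).astype(np.float32)

lens = np.array([3, 5, 1])
assert np.array_equal(mask_loop(lens, 6), mask_vectorized(lens, 6))
```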

Test Plan

VLLM_USE_DEEP_GEMM=0 vllm serve /mnt/data3/DSModels/models/deepseek-ai/DeepSeek-V3.2/ -tp 8 --served-model-name deepseek-ai/DeepSeek-V3.2 --tokenizer-mode deepseek_v32 --enable-auto-tool-choice --tool-call-parser deepseek_v32 --reasoning-parser deepseek_v3

Test Result

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "deepseek-ai/DeepSeek-V3.2",
"messages": [
{
"role": "user",
"content": "Solve this problem step by step: What is 15% of 4800?"
}
],
"max_tokens": 2048
}'
{"id":"chatcmpl-a95500daaf6a3517","object":"chat.completion","created":1772853307,"model":"deepseek-ai/DeepSeek-V3.2","choices":[{"index":0,"message":{"role":"assistant","content":"Alright, let's go step by step.  \n\n---\n\n**Step 1: Understand the question**  \nWe want to find 15% of 4800.  \n\"Percent\" means \"per hundred,\" so 15% means \\( 15 / 100 \\).\n\n---\n\n**Step 2: Convert percentage to decimal**  \n\\[\n15\\% = \\frac{15}{100} = 0.15\n\\]\n\n---\n\n**Step 3: Multiply decimal by the number**  \n\\[\n0.15 \\times 4800\n\\]\n\nFirst, \\( 0.15 = \\frac{15}{100} \\), so:  \n\\[\n0.15 \\times 4800 = \\frac{15}{100} \\times 4800\n\\]\n\n---\n\n**Step 4: Simplify the fraction multiplication**  \n\\[\n\\frac{15 \\times 4800}{100} = 15 \\times 48\n\\]\n(because \\( 4800 \\div 100 = 48 \\))\n\n---\n\n**Step 5: Multiply**  \n\\[\n15 \\times 48 = 720\n\\]\n\n---\n\n**Final answer:**  \n\\[\n\\boxed{720}\n\\]","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":21,"total_tokens":264,"completion_tokens":243,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
@mergify mergify bot added the nvidia and bug labels Mar 6, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the PyTorch fallback implementation for FP8 paged MQA to make it compatible with CUDA graph capture. The changes involve vectorizing the implementation to remove host-device synchronization points and correctly handle the packed FP8 KV cache layout, which improves both correctness and performance. No security vulnerabilities were found.

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
ZJY0516 added 2 commits March 7, 2026 00:13
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
@ZJY0516 ZJY0516 changed the title [Bugfix] Fix FP8 paged MQA fallback for CUDA graph capture [kernel] Fix FP8 paged MQA fallback for CUDA graph capture Mar 7, 2026
Contributor

@LopezCastroRoberto LopezCastroRoberto left a comment

Can you add some e2e performance and accuracy numbers?

k_scale_ptr + physical_block_id * stride_ks_blk + offs_k * stride_ks_pos,
mask=token_valid,
other=0.0,
).to(tl.float16)
Contributor

Is this cast to FP16 allowed accuracy-wise? The old PyTorch fallback used FP32 for dequantization.
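Background on the concern: float16 has a 10-bit mantissa, so long accumulations can silently drop contributions that float32 keeps. Whether this matters here depends on where the accumulation happens (Triton's tl.dot accumulates in fp32 by default; only the loaded values are cast). A self-contained illustration, unrelated to the actual kernel:

```python
import numpy as np

# float16 accumulation stalls once the running sum is large enough that
# the next addend is at most half a ulp: at 2048, ulp(float16) == 2, so
# 2048 + 1 is a round-to-even tie and rounds back to 2048.
acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for _ in range(3000):
    acc16 = np.float16(acc16 + np.float16(1.0))
    acc32 = np.float32(acc32 + np.float32(1.0))

print(float(acc16))  # 2048.0 -- 952 increments silently lost
print(float(acc32))  # 3000.0
```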

scale = scale.contiguous().view(torch.float)


logits = torch.full(
(batch_size * next_n, max_model_len),
float("-inf"),
Contributor

clean_logits=False is now supported, so we shouldn't have to initialize logits to -inf
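For readers following along: the -inf pre-fill exists so positions beyond each sequence's context length can never win the indexer's top-k. A toy sketch of that masking idea (made-up shapes and data, not the PR's code):

```python
import numpy as np

batch, max_len, k = 2, 6, 2
context_lens = np.array([3, 5])
scores = np.arange(batch * max_len, dtype=np.float32).reshape(batch, max_len)

# Start from -inf so out-of-range slots lose every comparison.
logits = np.full((batch, max_len), -np.inf, dtype=np.float32)
valid = np.arange(max_len)[None, :] < context_lens[:, None]
logits[valid] = scores[valid]

# Every selected index falls inside its row's context window.
topk = np.argsort(-logits, axis=1)[:, :k]
assert (topk < context_lens[:, None]).all()
```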

@ZJY0516
Member Author

ZJY0516 commented Mar 10, 2026

Can you add some e2e performance and accuracy numbers, so we can understand the impact of this PR?

The purpose of this PR is to add a fallback for DeepGEMM and to avoid #36519, so performance is not critical here.

@LopezCastroRoberto
Contributor

@ZJY0516 I agree to some extent. It was mainly out of curiosity to get a sense of the cost when DeepGEMM is not installed :)

@ZJY0516
Member Author

ZJY0516 commented Mar 10, 2026

Will add accuracy test results later.

@MatthewBonanni
Collaborator

Thanks for doing this! Since #36519 was merged, could you update this PR to change the reported CG support back to just UNIFORM_BATCH?


Labels

bug (Something isn't working), nvidia, v1
