[Bugfix] Fix TurboQuant KV cache index-out-of-bounds in Triton decode kernel by devarakondasrikanth · Pull Request #40074 · vllm-project/vllm

devarakondasrikanth · 2026-04-16T22:10:06Z

Purpose

Clamp masked-out SIMD lanes to page_idx=0 before block table pointer arithmetic. Triton's bounds checker fires on the address even when the output is masked, causing an index error on long (e.g. 32k) sequences.

Fixes issue #39998

Test Plan

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

… kernel

Clamp masked-out SIMD lanes to page_idx=0 before block table pointer arithmetic. Triton's bounds checker fires on the address even when the output is masked, causing an index error on long (e.g. 32k) sequences.

Signed-off-by: devarakondasrikanth <devarakondasrikanth@ymail.com>

gemini-code-assist

Code Review

This pull request modifies the _tq_decode_stage1 function in vllm/v1/attention/ops/triton_turboquant_decode.py to prevent out-of-bounds pointer arithmetic in Triton. It introduces a safe_page_idx that clamps masked-out lanes to index 0 before loading from the block table, ensuring the bounds checker does not trigger on masked lanes. I have no feedback to provide.

devarakondasrikanth · 2026-04-17T02:34:10Z

@LucasWilkinson and @MatthewBonanni I started working with vLLM and came across this bug, so I implemented a minimal (harmless) fix for it. As this is my first contribution, the check "pre-commit / pre-run-check (pull_request)" is shown as "Failing after 5s". Can you help me get CI to run?

vibhavagarwal5 · 2026-04-18T07:42:59Z

@@ -133,8 +133,12 @@ def _tq_decode_stage1(



A better solution is to clamp kv_offs itself at the source by adding this line:

kv_offs = tl.where(kv_mask, kv_offs, split_start)
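The difference between the two clamps can be sketched in plain Python by emulating the lanes of one tile. This is not the Triton kernel: the relationships `kv_mask = kv_offs < split_end` and `page_idx = kv_offs // BLOCK_SIZE`, the helper `where`, and every constant below are assumptions chosen for illustration.

```python
# Plain-Python emulation of one tile of lanes; not the actual Triton kernel.
# Assumed (hypothetical) relationships:
#   kv_mask  = kv_offs < split_end
#   page_idx = kv_offs // BLOCK_SIZE
BLOCK_SIZE = 16   # tokens per KV page (assumed)
BLOCK_KV = 8      # lanes per tile (assumed)
NUM_PAGES = 2     # block-table entries for this sequence (assumed)

split_start, split_end = 28, 32   # only 4 valid tokens remain in this split


def where(mask, a, b):
    """Lane-wise select over Python lists, mirroring tl.where."""
    return [x if m else y for m, x, y in zip(mask, a, b)]


kv_offs = [split_start + i for i in range(BLOCK_KV)]   # 28..35
kv_mask = [o < split_end for o in kv_offs]             # 4 live, 4 masked

# Unclamped: masked lanes compute a page index past the block table.
raw_page_idx = [o // BLOCK_SIZE for o in kv_offs]

# PR approach: clamp only the page index of masked lanes to 0.
safe_page_idx = where(kv_mask, raw_page_idx, [0] * BLOCK_KV)

# Review suggestion: clamp kv_offs itself so every derived index is in range.
clamped_offs = where(kv_mask, kv_offs, [split_start] * BLOCK_KV)
clamped_page_idx = [o // BLOCK_SIZE for o in clamped_offs]

print("raw:    ", raw_page_idx)
print("PR:     ", safe_page_idx)
print("review: ", clamped_page_idx)
```

Under these assumptions both variants keep every lane's page index inside the block table. Clamping `kv_offs` additionally keeps any other value derived from it (for example an in-page slot offset) in range, which is the argument for fixing it at the source.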

jhsmith409 · 2026-04-20T15:38:03Z

Tried to reproduce the OOB crash on RTX 5090 (sm_120) + vllm/vllm-openai:cu130-nightly (0.19.2rc1.dev21+g893611813), running RedHatAI/Qwen3.6-35B-A3B-NVFP4 with --kv-cache-dtype=turboquant_k8v4 on top of JartX#10 overlay.

Experiment: 8 concurrent requests × 31,632 prompt tokens, 256 decode tokens each, ignore_eos=true:

Image	#40074 applied	Result
treatment	yes	8/8 OK, 57% peak KV usage, 42 tok/s decode
control	no	8/8 OK, 57% peak KV usage, 42 tok/s decode

Could not reproduce #39998's "index out of bounds: 0 <= tmp16 < 40960" error on this config. It likely needs the exact combination from that issue (Qwen3-0.6B + turboquant_4bit_nc + 4090), where Triton's JIT presumably picks a tile/block layout that exposes the masked-lane address. sm_120 with turboquant_k8v4 on a 35B MoE apparently picks a variant that avoids the OOB path.

That said, the fix is provably correct: mask= in tl.load only silences the value, not the address arithmetic, so a masked-out kv_offs lane can still compute Block_table_ptr + bt_base + page_idx outside the tensor's bounds-checker range. Clamping via tl.where(kv_mask, page_idx, 0) before the pointer math is the standard Triton hygiene and the PR's 5-line implementation is exactly that.
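The failure mode described here can be illustrated with a small plain-Python emulation of a bounds-checked masked gather. The helper `checked_masked_load` and the sample values are hypothetical and only model the behaviour described in this thread (the check runs on every lane's index, the mask is applied to values afterwards); this is not Triton's actual checker.

```python
# Emulates a bounds-checked masked gather; not Triton's real implementation.
def checked_masked_load(table, idx, mask, other=0):
    """Check every lane's index first, then apply the mask to the values."""
    for i in idx:
        # The check ignores the mask, matching the behaviour described above.
        if not 0 <= i < len(table):
            raise IndexError(f"index out of bounds: 0 <= {i} < {len(table)}")
    return [table[i] if m else other for i, m in zip(idx, mask)]


block_table = [7, 3]                  # hypothetical physical page ids
page_idx = [1, 1, 2, 2]               # last two lanes point past the table
kv_mask = [True, True, False, False]  # those two lanes are masked out

try:
    checked_masked_load(block_table, page_idx, kv_mask)
    unclamped_ok = True
except IndexError:
    unclamped_ok = False              # masked lanes still trip the check

# Clamp masked lanes to 0 before indexing: tl.where(kv_mask, page_idx, 0).
safe_page_idx = [p if m else 0 for p, m in zip(page_idx, kv_mask)]
pages = checked_masked_load(block_table, safe_page_idx, kv_mask)

print(unclamped_ok, pages)
```

The clamped call returns the same values for the live lanes, so the clamp changes only which addresses the masked lanes compute, not the result.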

+1 to merge; applies cleanly on top of #39931 as well.

(AI-assisted verification run; human submitter reviewed and ran both A/B configurations.)

devarakondasrikanth requested review from LucasWilkinson and MatthewBonanni as code owners April 16, 2026 22:10

mergify Bot added the v1 and bug labels Apr 16, 2026

gemini-code-assist Bot reviewed Apr 16, 2026

Merge branch 'main' into issue_39998_v2

3bd43d3

Merge branch 'main' into issue_39998_v2

2b70652

gaby mentioned this pull request Apr 17, 2026

[Tracking issue]: TurboQuant/HIGGS Attention follow-ups #40069

Open

13 tasks

vibhavagarwal5 reviewed Apr 18, 2026

Merge branch 'main' into issue_39998_v2

d741a05

This was referenced Apr 20, 2026

[Perf] Re-enable dual-stream input projection for Qwen3/Qwen3.5 GDN #39748

Closed

[TurboQuant] enable FA3/FA4 for prefill paths #40092

Merged

Merge branch 'main' into issue_39998_v2

3deb399

Merge branch 'main' into issue_39998_v2

41eeb80

noonghunna mentioned this pull request Apr 24, 2026

[Bug]: TurboQuant KV × any speculative decoding (MTP or ngram) produces degenerate token loops — confirmed across dense and hybrid attention #40831

Closed

devarakondasrikanth wants to merge 6 commits into vllm-project:main from devarakondasrikanth:issue_39998_v2

3 participants
