[Bugfix] Fix TurboQuant KV cache index-out-of-bounds in Triton decode kernel #40074

Open
devarakondasrikanth wants to merge 6 commits into vllm-project:main from devarakondasrikanth:issue_39998_v2

@devarakondasrikanth devarakondasrikanth commented Apr 16, 2026

Purpose

Clamp masked-out SIMD lanes to page_idx=0 before block table pointer arithmetic. Triton's bounds checker fires on the address even when the output is masked, causing an index error on long (e.g. 32k) sequences.

Fixes #39998

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

… kernel

Clamp masked-out SIMD lanes to page_idx=0 before block table pointer
arithmetic. Triton's bounds checker fires on the address even when the
output is masked, causing an index error on long (e.g. 32k) sequences.

Signed-off-by: devarakondasrikanth <devarakondasrikanth@ymail.com>
@mergify (bot) added the labels "v1" and "bug" (Something isn't working) on Apr 16, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request modifies the _tq_decode_stage1 function in vllm/v1/attention/ops/triton_turboquant_decode.py to prevent out-of-bounds pointer arithmetic in Triton. It introduces a safe_page_idx that clamps masked-out lanes to index 0 before loading from the block table, ensuring the bounds checker does not trigger on masked lanes. I have no feedback to provide.

@devarakondasrikanth
Author

@LucasWilkinson and @MatthewBonanni, I started working with vLLM and came across this bug, so I implemented a minimal (harmless) fix for it. As this is my first contribution, the check "pre-commit / pre-run-check (pull_request)" is failing after 5s. Can you help me get CI to run?

@@ -133,8 +133,12 @@ def _tq_decode_stage1(

Contributor


A better solution is to clamp kv_offs itself at the source by adding this line:

kv_offs = tl.where(kv_mask, kv_offs, split_start) 
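
The effect of clamping at the source can be sketched in plain Python, with lists standing in for Triton lanes. This is an illustration under stated assumptions, not the kernel's actual code: it assumes the page index and in-page offset are both derived from kv_offs by integer division and modulo, and the names BLOCK_SIZE, TILE, num_pages and split_end are hypothetical. It also assumes split_start is itself a valid token position.

```python
def where(mask, a, b):
    """Lane-wise select over Python lists, mirroring the shape of tl.where."""
    return [x if m else y for m, x, y in zip(mask, a, b)]

# Hypothetical sizes chosen only for illustration.
BLOCK_SIZE = 16      # tokens per KV page
num_pages = 3        # block table covers 48 token slots
TILE = 32            # lanes processed per step
split_start, split_end = 32, 40

kv_offs = [split_start + i for i in range(TILE)]   # runs up to offset 63
kv_mask = [o < split_end for o in kv_offs]         # only 8 lanes are live

# Clamp masked-out lanes to split_start before anything is derived from kv_offs.
kv_offs = where(kv_mask, kv_offs, [split_start] * TILE)

# Every index derived afterwards is in range for all lanes, masked or not.
page_idx = [o // BLOCK_SIZE for o in kv_offs]
page_off = [o % BLOCK_SIZE for o in kv_offs]
```

Because the clamp happens before any derived index is computed, every later address built from kv_offs is covered by the same single line, rather than each load needing its own guard.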

@jhsmith409
Contributor

Tried to reproduce the OOB crash on RTX 5090 (sm_120) + vllm/vllm-openai:cu130-nightly (0.19.2rc1.dev21+g893611813), running RedHatAI/Qwen3.6-35B-A3B-NVFP4 with --kv-cache-dtype=turboquant_k8v4 on top of JartX#10 overlay.

Experiment: 8 concurrent × 31,632 prompt tokens, 256 decode tokens each, ignore_eos=true:

| Image | #40074 applied | Result |
|---|---|---|
| treatment | yes | 8/8 OK, 57% peak KV usage, 42 tok/s decode |
| control | no | 8/8 OK, 57% peak KV usage, 42 tok/s decode |

Could not reproduce #39998's index out of bounds: 0 <= tmp16 < 40960 on this config — likely needs the exact combination from that issue (Qwen3-0.6B + turboquant_4bit_nc + 4090, where Triton's JIT presumably picks a tile/block layout that exposes the masked-lane address). sm_120 with turboquant_k8v4 on a 35 B MoE apparently picks a variant that avoids the OOB path.

That said, the fix is provably correct: mask= in tl.load only silences the value, not the address arithmetic, so a masked-out kv_offs lane can still compute Block_table_ptr + bt_base + page_idx outside the tensor's bounds-checker range. Clamping via tl.where(kv_mask, page_idx, 0) before the pointer math is the standard Triton hygiene and the PR's 5-line implementation is exactly that.
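
The argument above can be illustrated with a small pure-Python emulation. It is a sketch of the behaviour described in this thread, not of Triton itself: checked_masked_load is a hypothetical stand-in for a masked load whose bounds check looks at every lane's index regardless of the mask, and the sizes are made up for the example.

```python
def where(mask, a, b):
    """Lane-wise select over Python lists, mirroring the shape of tl.where."""
    return [x if m else y for m, x, y in zip(mask, a, b)]

def checked_masked_load(table, idx, mask, other=0):
    """Hypothetical masked load: the bounds check ignores the mask,
    while the mask only decides which values are returned."""
    for i in idx:
        if not 0 <= i < len(table):
            raise IndexError(f"index out of bounds: 0 <= {i} < {len(table)}")
    return [table[i] if m else other for i, m in zip(idx, mask)]

BLOCK_SIZE = 16                 # tokens per KV page (illustrative)
block_table = [7, 3, 9]         # 3 pages, so valid page indices are 0..2
seq_len = 40

kv_offs = list(range(32, 64))   # a 32-lane tile that overruns the sequence
kv_mask = [o < seq_len for o in kv_offs]
page_idx = [o // BLOCK_SIZE for o in kv_offs]   # masked lanes reach page 3

# Without the clamp, a masked-out lane still trips the bounds check.
try:
    checked_masked_load(block_table, page_idx, kv_mask)
    unclamped_ok = True
except IndexError:
    unclamped_ok = False

# With the clamp, masked lanes point at page 0 and the check passes.
safe_page_idx = where(kv_mask, page_idx, [0] * len(page_idx))
pages = checked_masked_load(block_table, safe_page_idx, kv_mask)
```

The live lanes read the same page in both cases; the clamp only changes which index the masked-out lanes carry, so the loaded values are unaffected.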

+1 to merge; applies cleanly on top of #39931 as well.

(AI-assisted verification run; human submitter reviewed and ran both A/B configurations.)
