Support non-contiguous KV cache in TRTLLM fp8 dequant kernel #36867
pavanimajety merged 4 commits into vllm-project:main
Conversation
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Code Review
This pull request introduces support for non-contiguous KV cache tensors in the trtllm_prefill_attn_kvfp8_dequant Triton kernel. The kernel is updated to use tensor strides for memory access, correctly handling layouts like those from cross-layer unified allocation. A comprehensive new test suite is also added, which thoroughly validates the changes against a reference implementation for both contiguous and non-contiguous cases, including numerous corner cases. The implementation appears correct and robust, and the tests provide strong confidence in the fix.
@Etelis What do you think of this fix? Seems reasonable to me to remove the stride check for kv_cache throughout.
Looks good to me — using actual strides in the dequant kernel is the right fix. With this merged I can drop the
Co-authored-by: Pavani Majety <pavanimajety@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
Hi @vadiklyutiy, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
If there are no further comments, could we merge this PR?
…oject#36867) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: Pavani Majety <pavanimajety@gmail.com>
Summary
Fix the trtllm_prefill_attn_kvfp8_dequant Triton kernel to support non-contiguous KV cache tensors (e.g. the cross-layer unified allocation used by KV offloading). The kernel previously computed flat pointer offsets from the tensor shape, assuming contiguity. With cross-layer KV caches the per-layer view is non-contiguous (its strides skip over other layers' data), causing incorrect memory reads. The kernel now uses the actual tensor strides for the page, K/V, and head dimensions.
Related: vllm-project/vllm#34158 — relaxes the same assertion and moves the contiguity check closer to the dequant kernel. Our change goes further by fixing the dequant kernel itself to work with non-contiguous inputs.
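A minimal pure-Python sketch (not the actual kernel code) of why shape-derived offsets break for a non-contiguous per-layer view: the layout, dimension names, and stride values below are illustrative assumptions, not taken from the vLLM source.

```python
# Hypothetical illustration: flat offset computed from the shape alone vs.
# offset computed from the view's actual strides.

def offset_from_shape(page, kv, head, num_kv, num_heads, head_size):
    # Contiguity assumption: elements are packed with no gaps.
    return ((page * num_kv + kv) * num_heads + head) * head_size

def offset_from_strides(page, kv, head, stride_page, stride_kv, stride_head):
    # Stride-based offset: correct for any layout, including per-layer
    # views whose page stride skips over other layers' data.
    return page * stride_page + kv * stride_kv + head * stride_head

# Assumed unified allocation: [pages, layers, kv, heads, head_size],
# viewing layer 0 of a 2-layer cache.
num_kv, num_heads, head_size, num_layers = 2, 8, 64, 2
stride_head = head_size                          # 64
stride_kv = num_heads * stride_head              # 512
stride_page = num_layers * num_kv * stride_kv    # 2048, skips the other layer

shape_off = offset_from_shape(1, 0, 0, num_kv, num_heads, head_size)
stride_off = offset_from_strides(1, 0, 0, stride_page, stride_kv, stride_head)
print(shape_off, stride_off)  # 1024 vs 2048: the shape-based offset reads the wrong page
```

For a contiguous tensor the two formulas agree; the divergence appears exactly when the view's strides differ from the packed ones, which is the case this PR fixes.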
Test
New standalone test: tests/kernels/attention/test_trtllm_kvfp8_dequant.py
71 tests total (64 parametrized + 7 corner cases), all passing:
num_kv_heads={1,8}, head_size={64,128}, block_size={16,32}, batch_size={1,4}, num_pages_per_seq={3,8}, contiguous={True,False}
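As a sanity check on the count above, the six two-valued parameters yield a 2^6 grid. A sketch of that arithmetic (the actual test file's parametrization structure may differ):

```python
import itertools

# Parameter grid from the PR description; each axis has two values.
params = {
    "num_kv_heads": [1, 8],
    "head_size": [64, 128],
    "block_size": [16, 32],
    "batch_size": [1, 4],
    "num_pages_per_seq": [3, 8],
    "contiguous": [True, False],
}

combos = list(itertools.product(*params.values()))
print(len(combos))  # 64 parametrized cases; plus 7 corner cases = 71 tests
```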