
fix(mla): widen page index to int64_t to avoid 32-bit overflow#3136

Open
Tracin wants to merge 1 commit into flashinfer-ai:main from Tracin:fix_mla

Conversation


@Tracin Tracin commented Apr 21, 2026

📌 Description

In the MLA decode/prefill KV load path, indices[q] * ckv_stride_page was computed in 32-bit because IdType is int32_t and *_stride_page is uint32_t; the product wraps modulo 2^32 before any widening to int64_t (Hopper) or pointer arithmetic (FA2). For large page pools (e.g. page_idx ~1M with page_size=32, kv_lora_rank=512, stride=16384) the true product exceeds 2^32 and the kernel reads the wrong page, producing all-zero outputs. Cast the selected page index to int64_t at all three sites (mla.cuh NUM_MMA_KV==1 and !=1 branches, and mla_hopper.cuh prefetch_offset) so the multiply executes in 64-bit.

🔍 Related Issues

#3130

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes
    • Fixed potential integer overflow in memory address calculations for attention operations with large sequences.

…address computation


Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough

Walkthrough

Two CUDA attention kernel files are modified to prevent integer overflow in pointer arithmetic. The load_kv function in mla.cuh and the prefetch_offset function in mla_hopper.cuh now cast computed KV page indices to int64_t before multiplying by stride values, ensuring correct calculations for large indices.

Changes

Cohort / File(s) Summary
MLA KV Loading
include/flashinfer/attention/mla.cuh
Widened KV page index arithmetic to int64_t before stride multiplication in load_kv function across both code paths (single and multiple MMA KV cases).
MLA Hopper Prefetching
include/flashinfer/attention/mla_hopper.cuh
Widened computed page-offset term to int64_t in prefetch_offset function before multiplying by stride values.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested labels

run-ci, op: attention

Suggested reviewers

  • yzh119
  • saltyminty
  • bkryu
  • nv-yunzheq

Poem

🐰 Hop along with careful stride,
Where index numbers multiply wide,
From thirty-two to sixty-four,
No overflow knocks down the door,
Attention kernels, safe and spry!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: widening the page index to int64_t to prevent 32-bit overflow in MLA operations. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description provides clear context about a 32-bit overflow bug, explains the issue with concrete examples, and mentions the related GitHub issue. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request addresses potential 32-bit integer overflows in KV cache offset calculations within mla.cuh and mla_hopper.cuh by casting page indices to int64_t. Feedback suggests that the entire offset calculation should be promoted to 64-bit to prevent overflows in subsequent additions and to improve future-proofing.

Comment thread: include/flashinfer/attention/mla.cuh
@qsang-nv
Collaborator

Nice fix — cast is in the right place and all three call sites are covered.

One suggestion: add a minimal regression test that forces page_idx * stride_page > 2^32. This is a silent-output-corruption bug (wrong output, no crash), so without a guard test, a future refactor of IdType or *_stride_page types could easily reintroduce it.

You don't need a huge KV cache — a sparse kv_indices with a few large index values pointing at a small real allocation should hit the overflow path in ~30 lines in tests/attention/test_mla_decode_kernel.py.

@Tracin
Author

Tracin commented Apr 22, 2026

@qsang-nv Thanks for the review! However, I don't see how large index values can point at a small real allocation. I suppose we need real backing memory at the address given by page_idx * stride_page.

