[mxfp8 etc.]Fix int32 overflow in TRTLLM fused MoE permute/finalize kernels#2643
charlotte12l wants to merge 1 commit into flashinfer-ai:main
Conversation
…kernels (all quant paths)
**Summary of Changes** (Gemini Code Assist): This pull request addresses potential integer overflow issues in several TRTLLM fused MoE kernels by explicitly casting intermediate index calculations to `int64_t`.
Activity
Code Review
This pull request correctly addresses several potential int32 overflow issues in the MoE kernels by casting indices to int64_t before multiplication. The changes in permuteKernel, finalizeKernel, and finalizeDeepSeekKernel are consistent with the goal of preventing overflows with large tensor dimensions. I have one suggestion for an additional location where an overflow could occur, for the sake of completeness and to make the fix more robust.
```diff
@@ -597,7 +597,7 @@ __global__ void permuteKernel(KernelParams params) {
     int const permutedIdx = params.expandedIdxToPermutedIdx[expandedIdx];

     int const idx_in = tokenIdx + params.numTokens * scaleIdx;
```
For consistency and to prevent potential integer overflow with large `numTokens` and `hiddenDim`, it's safer to use `int64_t` for this index calculation as well, similar to the change for `idx_out`. The product `params.numTokens * scaleIdx` could overflow if both `numTokens` and `hiddenDim` (which determines the range of `scaleIdx`) are large.

```cpp
int64_t const idx_in = (int64_t)tokenIdx + (int64_t)params.numTokens * scaleIdx;
```
…el (#2642)

## 📌 Description

Fix CUDA Illegal Memory Access (IMA) caused by int32 overflow in `activationKernel` and `activationDeepSeekKernel` in the TRTLLM fused MoE pipeline.

Root cause: the index computation `permutedIdx * params.innerDim + hiddenIdx` uses int32 arithmetic. With large MoE configurations (e.g. 256 global experts, topK=8, DP=2, EP=2), the values can exceed INT32_MAX:

- num_tokens = 65536 (max_num_batched_tokens * DP)
- totalNumPaddedTokens up to 524,288 (65536 * 8, worst case: all tokens route to local experts)
- innerDim = 2 * intermediate_size; suppose it is > 5k
- 524,287 * innerDim may be > INT32_MAX (2,147,483,647)

The overflow produces a negative index, causing out-of-bounds memory access.

Fix: cast `permutedIdx` to `int64_t` before the multiplication in both `activationKernel` (line 82) and `activationDeepSeekKernel` (line 337). The overflow may also cause issues in other places, e.g. #2643, but I don't have time to validate #2643 yet.

## 🚀 Pull Request Checklist

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

## 🧪 Tests

Verified locally with the same model; works.

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Summary by CodeRabbit

- **Bug Fixes**
  - Fixed integer overflow issues in tensor indexing calculations, enabling proper support for larger tensor dimensions without overflow errors. Improves stability for large-scale tensor processing operations.
📌 Description

Extend the int32 overflow fix from #2642 to the remaining standalone kernels in trtllm_fused_moe_dev_kernel.cu: `permuteKernel`, `finalizeKernel`, and `finalizeDeepSeekKernel`. The same root cause applies: `permutedIdx * dimension` can exceed INT32_MAX with large MoE configurations. This PR fixes all remaining overflow sites:

- `permuteKernel`: `permutedIdx * hiddenDim` (line 642)
- `permuteKernel` (DeepSeek scale path): `permutedIdx + totalNumPaddedTokens * scaleIdx` (line 653)
- `finalizeKernel`: `permutedIdx * hiddenDimPadded` (lines 718, 720)
- `finalizeDeepSeekKernel`: `permutedIdx + totalNumPaddedTokens * (hiddenIdx / 128)` (line 982)
- `finalizeDeepSeekKernel`: `permutedIdx * hiddenDimPadded` (line 989)

🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
We suspect this may be the root cause of another IMA we observed, but have not validated that yet.
- Tests have been added or updated as needed.
- All tests are passing (`unittest`, etc.).

Reviewer Notes