[fp8_blockwise] Fix int32 overflow in TRTLLM fused MoE activation kernel #2642
Conversation
📝 Walkthrough

The pull request modifies a CUDA kernel file, converting 32-bit index calculations to 64-bit (int64_t) across multiple index computations.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: 2 passed, 1 failed (1 warning).
Summary of Changes

Hello @charlotte12l, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request resolves a critical int32 overflow in the TRTLLM fused MoE activation kernels.
Code Review
This pull request correctly fixes an int32 overflow bug in activationKernel and activationDeepSeekKernel by promoting index calculations to int64_t. The changes are accurate and address the reported issue. However, my review of csrc/trtllm_fused_moe_dev_kernel.cu indicates that several other kernels in the file suffer from the same potential for int32 overflow in index calculations. These kernels include permuteKernel, finalizeKernel, and finalizeDeepSeekKernel. For example, in permuteKernel, expressions like tokenIdx * params.hiddenDim and permutedIdx * params.hiddenDim can overflow with large numTokens and hiddenDim values. I strongly recommend applying similar int64_t casting to all such index computations throughout the file to ensure the fused MoE pipeline is fully robust against these critical memory access bugs.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
csrc/trtllm_fused_moe_dev_kernel.cu (1)
81-82: Please add a regression test for the >INT32_MAX indexing path. Given this is a correctness fix for a prior IMA, a targeted large-shape test would help prevent regressions in future kernel refactors.
Also applies to: 265-273, 311-312, 329-329
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@csrc/trtllm_fused_moe_dev_kernel.cu` around lines 81 - 82, Add a regression test that exercises the >INT32_MAX indexing path by constructing input tensors/shapes so that (permutedIdx * params.innerDim) > INT32_MAX and verifying correctness; specifically, create a large-shape unit/integration test that invokes the kernels in csrc/trtllm_fused_moe_dev_kernel.cu (the code paths using baseIdx = (int64_t)permutedIdx * params.innerDim + hiddenIdx and the related sections around lines 265-273, 311-312, 329) to ensure those int64_t indices are exercised and results match the expected output (or a smaller reference computation). Ensure the test runs in CI, is deterministic, and includes both forward and backward (if applicable) checks to catch regressions in future refactors.
/bot run

[FAILED] Pipeline #44998402: 10/20 passed

@ChristinaZ for viz
IwakuraRein left a comment:

Thanks for the findings!
@flashinfer-ci-bot run
ChristinaZ left a comment:

Thanks for your work. The modification looks good to me.
Fix int32 overflow in TRTLLM fused MoE activation kernel (flashinfer-ai#2642)

## 📌 Description

Fix CUDA Illegal Memory Access (IMA) caused by int32 overflow in `activationKernel` and `activationDeepSeekKernel` in the TRTLLM fused MoE pipeline.

Root cause: The index computation `permutedIdx * params.innerDim + hiddenIdx` uses int32 arithmetic. With large MoE configurations (e.g. 256 global experts, topK=8, DP=2, EP=2), the values can exceed INT32_MAX:

- num_tokens = 65536 (max_num_batched_tokens * DP)
- totalNumPaddedTokens up to 524,288 (65536 * 8, worst case: all tokens route to local experts)
- innerDim = 2 * intermediate_size; suppose it is >5k
- 524,287 * innerDim may then be > INT32_MAX (2,147,483,647)

The overflow produces a negative index, causing out-of-bounds memory access.

Fix: Cast `permutedIdx` to `int64_t` before the multiplication in both `activationKernel` (line 82) and `activationDeepSeekKernel` (line 337).

The overflow may also cause issues in other places, e.g. flashinfer-ai#2643, but I don't have time to validate flashinfer-ai#2643 yet.

## 🔍 Related Issues

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

Verified locally with the same model; works.

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

## Summary by CodeRabbit

* **Bug Fixes**
  * Fixed integer overflow issues in tensor indexing calculations, enabling proper support for larger tensor dimensions without overflow errors. Improves stability for large-scale tensor processing operations.

Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com>