Fix unordered map crash with TRTLLM MoE kernels #2695
danisereb wants to merge 5 commits into flashinfer-ai:main
Conversation
Code Review
This pull request addresses a critical crash in the TRTLLM MoE kernels caused by a mismatch in tile-selection logic between the Python autotuner and the C++ kernel launcher. While the fix correctly prevents the `unordered_map::at` exception by creating launchers for all supported tiles, it introduces a potential denial-of-service vector: the code still calls `launchers_map.at(tile_N)` without first checking that the key exists in `launchers_map`, which can crash the process if an invalid tactic is provided. This check is missing in 5 instances. Additionally, there is a suggestion to refactor duplicated code in `csrc/trtllm_fused_moe_kernel_launcher.cu` to improve maintainability.
This PR is still WIP.
## 📌 Description

PR #2617 added a fix that solves "using fallback tactic" for TRTLLM MoE kernels. But after running more tests (`lm_eval`) with flashinfer v0.6.5, another issue was found: an error from the C++ file `csrc/trtllm_fused_moe_kernel_launcher.cu` (key not found in `launchers_map.at(tile_N)`). Fixing this is probably not simple; more details in this draft PR (**NOT** for v0.6.6): #2695

In order to prevent the crash, the change in `_find_nearest_profile` will be reverted (to match flashinfer v0.6.4). The relevant AutoTuner tests were marked with "skip":

```
tests/autotuner/test_autotuner_core.py::test_find_nearest_profile_moe_shared_num_tokens_axis[1000-512] SKIPPED (_find_nearest_profile linked-dimension mapping was reverted;...)
tests/autotuner/test_autotuner_core.py::test_find_nearest_profile_moe_shared_num_tokens_axis[4000-2048] SKIPPED (_find_nearest_profile linked-dimension mapping was reverted...)
tests/autotuner/test_autotuner_core.py::test_find_nearest_profile_moe_shared_num_tokens_axis[8000-4096] SKIPPED (_find_nearest_profile linked-dimension mapping was reverted...)
tests/autotuner/test_autotuner_core.py::test_find_nearest_profile_moe_shared_num_tokens_axis[12000-8192] SKIPPED (_find_nearest_profile linked-dimension mapping was reverte...)
tests/autotuner/test_autotuner_core.py::test_find_nearest_profile_moe_same_bucket_same_profile SKIPPED (_find_nearest_profile linked-dimension mapping was reverted; re-enab...)
tests/autotuner/test_autotuner_core.py::test_find_nearest_profile_maps_all_linked_dims SKIPPED (_find_nearest_profile linked-dimension mapping was reverted; re-enable when ...)
```

The rest of the AutoTuner tests all pass:

```
pytest --tb short tests/autotuner/
=========================== test session starts ===========================
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0
rootdir: /my_home/workspace/dani_flashinfer
configfile: pytest.ini
plugins: anyio-4.12.1
collected 39 items

tests/autotuner/test_autotuner_bmm_fp8.py ............             [ 30%]
tests/autotuner/test_autotuner_core.py ...........ssssss..........  [100%]

====================== 33 passed, 6 skipped in 0.95s ======================
```

**Using this branch, the failure from `trtllm_fused_moe_kernel_launcher.cu` does not happen.**

**vLLM main still uses flashinfer v0.6.4 (which does not include PR #2617).**

**This change should be included in flashinfer v0.6.6 (for use by vLLM).**

## 🔍 Related Issues

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

## Summary by CodeRabbit

* **Tests**
  * Temporarily disabled three autotuner tests pending restoration of linked-dimension bucket propagation functionality. Tests will be re-enabled once related features are restored.
Is it possible to just include all possible launchers in the `launchers_map`? Assuming the overhead of instantiating a launcher is small.
The root cause for the tileN limitation in the trtllm-gen moe launcher is that when tuning the trtllm-gen moe, the search space is the product of all FC1 and FC2 kernels. To prevent the auto tuner from running forever, we limited the available tileN values during tuning. In hindsight, a better approach would have been to expose …
@IwakuraRein if you have a better fix I don't mind if you open a PR (and I can close this one). I don't have a lot of context with the TRTLLM code, so instead I chose to merge this fix:
I'm currently working on a fix. Explanation of the problem:

Different sets of tileN values may be returned from `computeSelectedTileN` for num_tokens values that share the same autotuner bucket. For example:

- `num_tokens=2048`: `2048*22/512 = 88` rounds up to 128, so it would return `(64, 128, 256)`
- `num_tokens=3003`: `3003*22/512 = 129.03` rounds up to 256, so it would return `(128, 256)`

If tileN=64 was found to be the fastest, the autotuner would call the CPP with `[64, someTactic]`, but for `num_tokens=3003` no launcher exists for tileN=64.

So I think the main question here is: how do we want to deal (in this example) with any num_tokens whose selected tileN set does not contain the cached tileN.
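The example can be reproduced with a small Python sketch. This is a hypothetical rendering of `computeSelectedTileN` (the real logic is C++ in `csrc/trtllm_fused_moe_kernel_launcher.cu`); the supported-tile list and helper names here are assumptions for illustration:

```python
# Hypothetical Python rendering of the tileN filtering described above.
# The supported-tile list is an assumption, not taken from the library.
def round_up_to_power_of_two(x: int) -> int:
    p = 1
    while p < x:
        p *= 2
    return p

def selected_tile_n(num_tokens, top_k, num_experts,
                    supported=(8, 16, 32, 64, 128, 256)):
    # average tokens per expert, rounded up, then to the next power of two
    avg = round_up_to_power_of_two(-(-num_tokens * top_k // num_experts))
    avg = min(avg, supported[-1])  # max selectable tile is 256
    idx = supported.index(avg)
    # keep the previous tile, the chosen tile, and the next two
    return supported[max(idx - 1, 0): idx + 3]

# Nemotron 3 Super NVFP4: top_k=22, num_experts=512
print(selected_tile_n(2048, 22, 512))  # (64, 128, 256)
print(selected_tile_n(3003, 22, 512))  # (128, 256)
```

Note that the two sets differ even though both num_tokens values fall in the same autotuner bucket `[2048, 4095]`.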
…e launchers for all supported tileN in trtllm fused MoE (#2821)

## 📌 Description

It fixes two autotuner-related bugs:

1. Re-apply the autotuner fix that was reverted in #2697.
2. Fix the issue that #2697 revealed: the trtllm fused MoE kernel launcher crashes when it receives a tileN that is supported but filtered out by `computeSelectedTileN`. The fix is to create kernel launchers for all supported tileN values.

This PR continues the work in #2695 by @danisereb to revert bugfix 1 and to fix bug 2.

More technical details:

### Bug 1

When given a num_tokens that isn't a power of 2, the autotuner (Python side) fails to find its appropriate entry in the autotuner cache, so it falls back to the default, which means passing `[-1, -1]` as the `(tileN, tactic)` to the CPP. It was fixed in [this PR](https://github.com/flashinfer-ai/flashinfer/pull/2617/changes#diff-1964ab957d8185d04b0d5f0cb02d0c7c0a3260ac0a6c573167af6875ab0b0e87L729-L734) but soon after merge it was reverted in #2697, as it exposed the next bug.

### Bug 2 (exposed after fixing bug 1)

Crash in the fused MoE kernel launcher on the forward pass for some values of num_tokens. The crash is at `launchers_map.at(tile_N)` in `trtllm_fused_moe_kernel_launcher.cu`. It happens because:

The Python side of the autotuner profiles num_tokens values that are powers of 2, and each such value represents the range up to the next power of 2. E.g., the profile for the range `[2048, 4095]` is done at num_tokens=2048.

The `computeSelectedTileN` function in `trtllm_fused_moe_kernel_launcher.cu` reduces the set of supported tileN values (to shrink the autotuner's search space) by choosing specific values from the sorted list of supported tileN: `roundUpToPowerOfTwo(num_tokens * topK / numExperts)`, the value before it, and the next 2 values after it (max value is 256). So num_tokens values in the same range can get different sets of tileN values.

For example, on Nemotron 3 Super NVFP4:

- `num_tokens=2048` -> `2048*22/512 = 88`, which rounds up to 128, so the tileN set is `(64, 128, 256)`
- `num_tokens=3003` -> `3003*22/512 = 129.03`, which rounds up to 256, so the tileN set is `(128, 256)`

In case `tileN=64` was found to be the fastest at `num_tokens=2048` for the range `[2048, 4095]`, when given `num_tokens=3003` the Python side would pass `[64, someTactic]` to the CPP, but for `num_tokens=3003` there is no launcher for `tileN=64`, as `computeSelectedTileN` filtered it out.

## Summary by CodeRabbit

* **Bug Fixes**
  * Stricter MoE tile validation, and ensured all supported tiles are available at launch to avoid missing kernel configurations.
  * Autotuner mapping for linked dynamic dimensions now yields consistent cached bucket values.
* **Tests**
  * Added SM100 MoE autotuner integration tests (including invalid-cached-tactic checks).
  * Re-enabled and expanded autotuner unit tests and added a test utility to reset the autotuner.

Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Co-authored-by: Daniel Serebrenik <daserebrenik@nvidia.com>
PR #2821 fixed the issue, so my PR is no longer needed.
📌 Description
This PR aims to fix a bug in TRTLLM MoE kernels.
The bug was discovered in flashinfer v0.6.5, following a fix that prevents fallback to default autotuner tactic:
#2617
Bug Description
The Python AutoTuner profiles MoE kernels by creating input tensors at power-of-2 "bucket" sizes (1, 2, 4, ..., 4096) and benchmarking different kernel configurations (tactics) for each bucket.
The buckets are generated by the Python function `get_last_power_of_2_num_tokens_buckets`; this function is called in `MoERunner.refine_tuning_config`. And `MoERunner.refine_tuning_config` is called by:

- `trtllm_fp4_block_scale_moe_op`
- `trtllm_fp8_block_scale_moe_op`
- `trtllm_bf16_moe_op`

Each tactic is a pair `[tile_N, config]`, where `tile_N` is a tile size for batching tokens across experts and `config` is a specific kernel variant for that tile. The autotuner picks the fastest tactic per bucket and caches it, keyed by the bucketed input shapes.

During inference, the actual num_tokens (e.g., 1624) is rounded down to the nearest power of 2 (1024) to look up the cached tactic (via `last_positive_power_of_2`), which is then passed as two integers to the C++ launcher.

The C++ launcher is located in the file `csrc/trtllm_fused_moe_kernel_launcher.cu`. For NVFP4, the relevant function is `trtllm_fp4_block_scale_moe`, and the two integers that represent the tactic are `Array<int64_t> config_index`.

The `config_index` is the only tactic-related data passed from Python to C++. Python's full autotune search space/results (candidate tactic list, timing data, ranking, cache internals) are not transferred to C++ as a structure.
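For orientation, here is a minimal Python sketch of the bucketing described above. The function names match the ones mentioned, but the bodies are illustrative reconstructions of the described semantics, not the library code:

```python
# Illustrative sketch of the Python-side bucketing (not the library code).
def last_positive_power_of_2(x: int) -> int:
    # largest power of 2 that is <= x (rounds num_tokens *down* to its bucket)
    p = 1
    while p * 2 <= x:
        p *= 2
    return p

def get_last_power_of_2_num_tokens_buckets(max_num_tokens: int):
    # power-of-2 bucket sizes 1, 2, 4, ..., up to max_num_tokens
    buckets, p = [], 1
    while p <= max_num_tokens:
        buckets.append(p)
        p *= 2
    return buckets

print(get_last_power_of_2_num_tokens_buckets(4096)[:5])  # [1, 2, 4, 8, 16]
print(last_positive_power_of_2(1624))  # 1024 -> cache key for num_tokens=1624
```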
The C++ MoE kernel launcher receives the actual tensors and the tactic, then independently computes which tile sizes are appropriate for the actual num_tokens using the function `computeSelectedTileN`. This function calculates the average tokens per expert, rounds up to the next power of 2, and selects a small neighborhood of tiles around that value. It then builds launcher objects only for those selected tiles in an `unordered_map`, and looks up the tactic's tile_N in that map.
The conflict arises because Python rounds num_tokens down for bucketing while C++ rounds the derived average up for tile selection, and these are applied to different values (raw num_tokens vs. num_tokens * top_k / num_experts).
A tactic cached for the smaller bucketed num_tokens tends to favor smaller tiles, while the C++ launcher for the larger actual num_tokens selects larger tiles and excludes the small ones from its map. When the cached tile_N is not in the C++ launcher's map, unordered_map::at throws, crashing the process.
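As a rough illustration, here is a Python dict standing in for the C++ `launchers_map` (the tile values mirror the Nemotron example discussed in this PR; `dict[key]` raises `KeyError` much as `unordered_map::at` throws `std::out_of_range`):

```python
# Illustrative only: a Python dict as a stand-in for the C++ launchers_map.
launchers_map = {128: "launcher_128", 256: "launcher_256"}  # tiles selected for num_tokens=3003
cached_tile_n = 64  # tactic cached while profiling the bucket at num_tokens=2048

try:
    launcher = launchers_map[cached_tile_n]  # ~ launchers_map.at(tile_N)
except KeyError:
    # In C++ this throw is uncaught and the process crashes.
    print(f"no launcher for tile_N={cached_tile_n}")
```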
The fix builds launchers for all supported tiles so that any tactic the autotuner returns is always found.
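A minimal sketch of that approach, with an assumed supported-tile list and a stand-in launcher factory (the real launchers are C++ objects):

```python
# Sketch of the fix: instantiate a launcher for every supported tile up front,
# so any tile_N the autotuner hands back is guaranteed to be in the map.
# SUPPORTED_TILES and make_launcher are illustrative stand-ins.
SUPPORTED_TILES = (8, 16, 32, 64, 128, 256)

def make_launcher(tile_n: int) -> str:
    return f"launcher_{tile_n}"

launchers_map = {tile: make_launcher(tile) for tile in SUPPORTED_TILES}

# A tactic cached for any bucket now always resolves:
print(launchers_map[64])  # launcher_64
```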
Note
Using the same rounding direction in both Python and C++ does not solve the problem, because they round different values. The Python autotuner rounds num_tokens directly (via `last_positive_power_of_2`) to compute a cache bucket, while the C++ launcher rounds a derived value, `num_tokens * top_k / num_experts` (the average tokens per expert), to select tile sizes. Even if both used the same rounding function, the derived average for the bucketed num_tokens and the actual num_tokens can land on different sides of a power-of-2 boundary whenever top_k / num_experts is not itself a power of 2.
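A quick numeric check of this point, using the top_k=22, num_experts=512 example from earlier in this PR (the rounding helper is an illustrative sketch):

```python
# The bucketed and actual num_tokens produce per-expert averages on
# opposite sides of a power-of-2 boundary, so their tile selections differ.
def round_up_pow2(x):
    p = 1
    while p < x:
        p *= 2
    return p

top_k, num_experts = 22, 512
bucket, actual = 2048, 3003  # 3003 rounds *down* to the 2048 bucket

avg_bucket = bucket * top_k / num_experts  # 88.0   -> rounds up to 128
avg_actual = actual * top_k / num_experts  # ~129.0 -> rounds up to 256
print(round_up_pow2(avg_bucket), round_up_pow2(avg_actual))  # 128 256
```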
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- Tests have been added or updated as needed.
- All tests are passing (`unittest`, etc.).

Reviewer Notes