Conversation
…rt Nemot…" This reverts commit 83cdea3.
📝 Walkthrough

The pull request systematically replaces the generic …

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Summary of Changes

Hello @nv-yunzheq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request fully reverts a prior feature that introduced non-gated Relu2 activation for Fused MoE operations in NVFP4 and FP8, alongside Nemotron support. The revert was made because the feature caused unit-test regressions and imposed an undesirable restriction on the minimum number of experts for the TRT-LLM DeepSeek routing kernel.

Highlights
Code Review
This pull request reverts the feature supporting Fused MoE non-gated Relu2 and Nemotron. The changes are extensive, touching Python benchmarks, tests, and C++/CUDA source code. The revert appears to be thorough and consistent across the codebase, correctly removing the added functionalities and adjusting related logic. I've identified one potential bug in the CUDA kernel logic that could lead to out-of-bounds memory access, which I've detailed in a specific comment.
```diff
   __syncthreads();
   if (warpIdx == 0) {
-    int constexpr NumInterTopKPerThread = (NumInterTopK - 1) / WarpSize + 1;
+    int constexpr NumInterTopKPerThread = (NumInterTopK * NumExpertWarps - 1) / WarpSize + 1;
```
This calculation for `NumInterTopKPerThread` seems incorrect and could lead to an out-of-bounds access in the following loop.

`NumInterTopK` is defined as `NumExpertWarps * MaxNumTopExperts`. The shared memory arrays `smemInterTopScores` and `smemInterTopExperts` are of size `NumInterTopK`. The loop starting at line 195 iterates up to `NumInterTopKPerThread * WarpSize`, which approximates to `NumInterTopK * NumExpertWarps`. If `NumExpertWarps > 1`, this will cause out-of-bounds access to the shared memory arrays.

The previous implementation, `(NumInterTopK - 1) / WarpSize + 1`, seems correct for calculating the number of elements per thread for the reduction. I suggest reverting to that.
```cpp
int constexpr NumInterTopKPerThread = (NumInterTopK - 1) / WarpSize + 1;
```
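The bounds argument above can be checked numerically. The constants below are hypothetical compile-time values chosen for illustration, not the kernel's actual configuration:

```python
WARP_SIZE = 32
MAX_NUM_TOP_EXPERTS = 8   # illustrative value
NUM_EXPERT_WARPS = 4      # illustrative value
NUM_INTER_TOP_K = NUM_EXPERT_WARPS * MAX_NUM_TOP_EXPERTS  # 32 = shared-array size

def ceil_div(a: int, b: int) -> int:
    # Same "(n - 1) / d + 1" pattern as the kernel's constexpr math.
    return (a - 1) // b + 1

# Previous formula: plain ceiling division of elements over warp threads.
per_thread_old = ceil_div(NUM_INTER_TOP_K, WARP_SIZE)                     # 1
# Changed formula: multiplies by NUM_EXPERT_WARPS a second time.
per_thread_new = ceil_div(NUM_INTER_TOP_K * NUM_EXPERT_WARPS, WARP_SIZE)  # 4

# The reduction loop walks per_thread * WARP_SIZE slots of the shared arrays:
print(per_thread_old * WARP_SIZE)  # 32  -> exactly NUM_INTER_TOP_K, in bounds
print(per_thread_new * WARP_SIZE)  # 128 -> far past the array, out of bounds
```

With any `NUM_EXPERT_WARPS > 1`, the changed formula makes the loop's upper bound exceed the shared-array size, matching the out-of-bounds concern raised in the comment.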
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
flashinfer/fused_moe/core.py (1)
1939-1947: ⚠️ Potential issue | 🟡 Minor — Silence unused `gated_act_type` in the fake op.

Static analysis flags this as unused; add a no-op reference to keep the signature aligned with the real op.

🧹 Suggested fix

```diff
 def _fake_trtllm_fp4_block_scale_moe(
     routing_logits: torch.Tensor,
     @@
     gated_act_type: int,
     output: Optional[torch.Tensor],
     tune_max_num_tokens: int,
 ):
+    _ = gated_act_type  # keep signature in sync with real op
     seq_len = hidden_states.shape[0]
     hidden_size = hidden_states.shape[1] if output is None else output.shape[1]
```

tests/moe/test_trtllm_gen_fused_moe.py (1)
2094-2108: ⚠️ Potential issue | 🟡 Minor — Reference paths hardcode SwiGlu; GeGlu cases may miscompare.

If GeGlu is supported for these modes, pass through `args.gated_act_type`. If not, make sure `skip_checks` explicitly skips GeGlu for these impls.

🛠️ Suggested fix (pass through gated_act_type)

```diff
-    GatedActType.SwiGlu.value,  # gated_act_type
+    args.gated_act_type,
```
(The same replacement applies at each of the three call sites.)

Also applies to: 2131-2145, 2162-2176
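The pass-through idea can be sketched in isolation. The `GatedActType` enum values and the `reference_path` helper below are illustrative stand-ins, not the test suite's actual code:

```python
from enum import Enum

class GatedActType(Enum):
    # Values are assumptions for illustration only.
    SwiGlu = 0
    GeGlu = 1

def reference_path(gated_act_type: int) -> str:
    # Dispatch on the requested activation instead of hardcoding SwiGlu,
    # so a GeGlu test case compares against a matching reference.
    return GatedActType(gated_act_type).name

print(reference_path(GatedActType.SwiGlu.value))  # SwiGlu
print(reference_path(GatedActType.GeGlu.value))   # GeGlu
```

Hardcoding `GatedActType.SwiGlu.value` at the call site would silently route GeGlu cases through the SwiGlu reference, which is the miscomparison the comment warns about.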
🧹 Nitpick comments (1)
csrc/trtllm_fused_moe_routing_deepseek.cu (1)
120-192: Reduce per-thread scratch size in the inter-warp top-K merge.

`NumInterTopK` already includes `NumExpertWarps`, so multiplying by `NumExpertWarps` again inflates `NumInterTopKPerThread` and the per-thread arrays. Consider using a simple ceiling division to avoid extra register pressure.

♻️ Suggested adjustment

```diff
- int constexpr NumInterTopKPerThread = (NumInterTopK * NumExpertWarps - 1) / WarpSize + 1;
+ int constexpr NumInterTopKPerThread = (NumInterTopK - 1) / WarpSize + 1;
```
/bot run

[FAILED] Pipeline #42953047: 10/20 passed
…rt Nemotron" (flashinfer-ai#2451)

Reverts flashinfer-ai#2304, as it introduced a regression in unit tests and no longer allowed fewer than 22 experts to run the TRT-LLM DeepSeek routing kernel.

Summary by CodeRabbit

Release Notes

* Refactor
  * Consolidated gated activation type handling across MoE implementations with simplified parameter names and enum naming.
  * Unified intermediate size calculations to consistently use 2x configuration.
  * Streamlined routing logic for improved clarity and maintainability.
* Breaking Changes
  * CLI argument `--activation-type` renamed to `--gated-act` with values "swiglu" or "geglu".
  * API parameter names updated from `activation_type` to `gated_act_type` across public interfaces.
Reverts #2304
As it introduces a regression in unit tests and no longer allows fewer than 22 experts to run the TRT-LLM DeepSeek routing kernel.
Summary by CodeRabbit

Release Notes

* Refactor
  * Consolidated gated activation type handling across MoE implementations with simplified parameter names and enum naming.
  * Unified intermediate size calculations to consistently use 2x configuration.
  * Streamlined routing logic for improved clarity and maintainability.
* Breaking Changes
  * CLI argument `--activation-type` renamed to `--gated-act` with values "swiglu" or "geglu".
  * API parameter names updated from `activation_type` to `gated_act_type` across public interfaces.
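The renamed CLI flag can be sketched with argparse. The flag name and choices come from the release notes above; the parser itself is an illustrative stand-in for the benchmark's actual argument handling:

```python
import argparse

parser = argparse.ArgumentParser()
# Renamed flag per the release notes; choices mirror "swiglu" / "geglu".
parser.add_argument("--gated-act", choices=["swiglu", "geglu"], default="swiglu")

args = parser.parse_args(["--gated-act", "geglu"])
print(args.gated_act)  # geglu

# The old spelling is no longer registered, so this would exit with an error:
# parser.parse_args(["--activation-type", "swiglu"])
```

Note argparse exposes the value as `args.gated_act` (dashes become underscores), which matches the `gated_act_type` naming adopted in the Python API.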