feat: Support Fused MoE non-gated Relu2 NVFP4 & FP8 and support Nemotron, fixed #2462
Conversation
Commits (truncated messages from the PR commit list; all signed off by amitz-nv <203509407+amitz-nv@users.noreply.github.com>):
- … adjusted test
- …gated_activation function in tests to tests/moe/utils.py
- …Type enum from core.py, update docstrings
- …_moe
- …ests skip_checks to skip on non-gated activation with quantizations that don't support it
- …seek routing for nemotron
- …spaceSizeInBytes, getDefaultValidConfigIndex, isValidConfigIndex
- …hmarks
- …fix)
- …tions
- …py for trtllm_fp8_block_scale_moe
- …k.cu
📝 Walkthrough

Replaces GatedActType with a broader ActivationType across Python benchmarks/CLI, test utilities, C++ runners/launchers, headers, and CUDA kernels; threads activation_type through benchmark/autotuner/runner/kernel call paths; adds eltwise activation support and extends DeepSeek top-experts/top-K sizing and launch macros.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI/Bench
    participant Py as Python MoE Routines
    participant Runner as C++ MoE Runner
    participant Kernel as CUDA Kernel
    CLI->>Py: parse --activation-type (enum)
    CLI->>Py: invoke benchmark/autotuner with activation_type
    Py->>Runner: call MoE entry (activation_type.value, numTopExperts, eltwiseActType)
    Runner->>Runner: map activation_type → gated/eltwise, compute intermediateSizeFactor
    Runner->>Kernel: launch kernel with activation_type and top-K params
    Kernel-->>Runner: return timings/results
    Runner-->>Py: return results (include activation_type.name)
    Py-->>CLI: print/store results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: 2 passed, 1 failed (warning).
Summary of Changes

Hello @amitz-nv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the capabilities of fused Mixture-of-Experts (MoE) operations by introducing a more flexible activation function system.
Code Review
This pull request introduces significant feature enhancements and fixes. It adds support for the non-gated Relu2 activation in Fused MoE for NVFP4 and FP8, and support for Nemotron models in DeepSeek routing. A key refactoring is the replacement of GatedActType with a more generic ActivationType enum, which has been consistently applied across both Python and C++ codebases. The changes also include important fixes, such as removing a restrictive check on the number of experts and correcting a bug in the routing kernel. The test suite has been updated to cover these new features, ensuring robustness. Overall, the changes are well-implemented and improve the capabilities and correctness of the library.
```cpp
// For simplicity pass set scaleAct to scaleGateC
gemmData.mInputBuffers.mPtrScaleAct = scaleGateC;
```
The comment "For simplicity pass set scaleAct to scaleGateC" suggests this might be a temporary solution. While this might work for the current set of activation functions (e.g., if Relu2 doesn't use mPtrScaleAct), it could lead to latent bugs if new element-wise activations are added that require a specific scaleAct value different from scaleGateC.
To improve clarity and prevent future issues, consider passing scaleAct as a separate parameter to the run function and setting mPtrScaleAct accordingly. If scaleGateC is indeed the correct value for all cases, a more detailed comment explaining why would be beneficial.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
csrc/trtllm_fused_moe_routing_deepseek.cu (2)
196-216: ⚠️ Potential issue | 🟡 Minor

Inconsistent sentinel values used for invalid expert indices.

Line 200 uses `MaxSupportedExpertCount - 1` (511) as the sentinel for invalid entries, while line 215 uses `KernelParams::MaxNumExperts - 1` for the same purpose. This inconsistency could cause subtle issues during the final top-K reduction if the sentinel values have different sort ordering relative to valid indices. Consider using a consistent sentinel value across both code paths.

🔧 Suggested fix for consistency

```diff
 } else {
   intermidiateScore[ii] = invalidScoreFloat;
-  intermidiateExpert[ii] = KernelParams::MaxNumExperts - 1;
+  intermidiateExpert[ii] = MaxSupportedExpertCount - 1;
 }
```
518-541: ⚠️ Potential issue | 🔴 Critical

Add validation to enforce topK ≤ 8 for non-Nemotron expert counts, or extend the conditional dispatch to all branches.

The `LAUNCH_ROUTING_DEEPSEEK` macro instantiates non-Nemotron expert configurations (MaxNumExpertsUnit, Deepseek, KimiK2) with `MaxNumTopExperts=8`, but the global validation at line 560 only enforces `mTopK <= 22` without checking the expert count. This creates a buffer overflow risk: if a non-Nemotron model passes `mTopK > 8` at runtime, the kernel's `topScores[MaxNumTopExperts]` and `topExperts[MaxNumTopExperts]` arrays (lines 125-126) would overflow when the kernel attempts to write results at indices beyond 7, while kernel logic such as lines 191-196 assumes `MaxNumTopExperts >= mTopK`.

Either:
- add a runtime check preventing non-Nemotron models from requesting `mTopK > 8`, or
- extend the conditional topK dispatch (lines 531-537) to the other expert branches as well.
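The first option above can be sketched in Python pseudocode. The constants and names mirror the C++ kernel parameters discussed in the comment, but this is an illustrative model of the check, not the actual kernel implementation:

```python
MAX_TOP_EXPERTS_DEFAULT = 8    # MaxNumTopExperts used by non-Nemotron branches
MAX_TOP_EXPERTS_NEMOTRON = 22  # larger bound used by the Nemotron branch

def validate_top_k(top_k: int, num_experts: int, is_nemotron: bool) -> None:
    # Reject configurations whose topK exceeds the branch's MaxNumTopExperts,
    # instead of only checking the global bound of 22.
    limit = MAX_TOP_EXPERTS_NEMOTRON if is_nemotron else MAX_TOP_EXPERTS_DEFAULT
    if not (0 < top_k <= limit and top_k <= num_experts):
        raise ValueError(
            f"topK={top_k} not supported (limit={limit}, "
            f"num_experts={num_experts}, nemotron={is_nemotron})"
        )

validate_top_k(8, 256, False)   # DeepSeek-style config: accepted
validate_top_k(22, 512, True)   # Nemotron config: accepted
```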
benchmarks/routines/moe.py (1)
1230-1265: ⚠️ Potential issue | 🟡 Minor

Add validation that `activation_type` is Swiglu for the FP8 block scale MoE benchmark.

The `run_fp8_block_moe` function doesn't validate the activation type, unlike the autotuner, which explicitly rejects non-Swiglu activations. To prevent silent errors and maintain consistency with `bench_trtllm_gen_fused_moe_autotuner.py`, add a check that raises an error if `args.activation_type != ActivationType.Swiglu`.

tests/moe/test_trtllm_gen_fused_moe.py (1)
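The guard suggested for `run_fp8_block_moe` could look roughly like this. This is a sketch with a stand-in enum; the real `ActivationType` lives in flashinfer and its member values may differ:

```python
from enum import Enum

# Hypothetical stand-in for flashinfer's ActivationType; values illustrative.
class ActivationType(Enum):
    Swiglu = 3
    Geglu = 4
    Relu2 = 6

def check_fp8_block_moe_activation(activation_type: ActivationType) -> None:
    # Mirror the autotuner's behavior: FP8 block scale MoE supports only the
    # gated Swiglu path, so fail loudly instead of silently benchmarking an
    # unsupported configuration.
    if activation_type != ActivationType.Swiglu:
        raise ValueError(
            f"FP8 block scale MoE benchmark supports only Swiglu, "
            f"got {activation_type.name}"
        )

check_fp8_block_moe_activation(ActivationType.Swiglu)  # accepted silently
```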
2163-2170: ⚠️ Potential issue | 🟠 Major

Inconsistent `activation_type` passing: `.value` is used in some places but not others.

At line 2167, `args.activation_type.value` is passed to `moe_args_dequant`, but at line 2286 in `run_moe_reference_mxint4`, `args.activation_type` is passed directly (without `.value`). Similarly, lines 2204 and 2235 use `.value`.

Looking at the `moe_args_dequant` constructor and the `run_moe_dequant` function, the `activation_type` is used in a dictionary lookup at lines 1956-1960:

```python
activation_type_to_func = {
    ActivationType.Swiglu: F.silu,
    ActivationType.Geglu: F.gelu,
    ActivationType.Relu2: lambda x: F.relu(x) ** 2,
}
activation_func = activation_type_to_func[activation_type]
```

This expects `ActivationType` enum members, not integers. Using `.value` will cause a `KeyError` since the dict keys are enum members, not integers.

🐛 Proposed fix: remove `.value` from activation_type arguments

```diff
--- a/tests/moe/test_trtllm_gen_fused_moe.py
+++ b/tests/moe/test_trtllm_gen_fused_moe.py
@@ -2164,7 +2164,7 @@ def run_moe_reference_dsfp8(args):
         gemm2_weights_dequant,
         args.permute_info,
         args.use_routing_scales_on_input,
-        args.activation_type.value,
+        args.activation_type,
     )
     return run_moe_dequant(args_dequant, QuantMode.FP8_BLOCK_SCALE), args_dequant
@@ -2201,7 +2201,7 @@ def run_moe_reference_per_tensor_scale_fp8(args):
         gemm2_weights_dequant,
         args.permute_info,
         args.use_routing_scales_on_input,
-        args.activation_type.value,
+        args.activation_type,
     )
     return run_moe_dequant(args_dequant, QuantMode.FP8_PER_TENSOR), args_dequant
@@ -2232,7 +2232,7 @@ def run_moe_reference_bf16(args):
         gemm2_weights_dequant,
         args.permute_info,
         args.use_routing_scales_on_input,
-        args.activation_type.value,
+        args.activation_type,
     )
     return run_moe_dequant(args_dequant, QuantMode.BF16), args_dequant
```
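A minimal, self-contained reproduction of why `.value` breaks the lookup, using a stand-in enum. This assumes `ActivationType` is a plain `enum.Enum`; if it were an `IntEnum`, the integer lookup would still succeed, since `IntEnum` members compare and hash equal to their integer values:

```python
from enum import Enum

# Stand-in enum and activation functions; the real code uses flashinfer's
# ActivationType and torch.nn.functional.
class ActivationType(Enum):
    Swiglu = 3
    Relu2 = 6

activation_type_to_func = {
    ActivationType.Swiglu: lambda x: x,               # placeholder for F.silu
    ActivationType.Relu2: lambda x: max(x, 0.0) ** 2  # relu(x) ** 2
}

def apply_activation(activation_type, x):
    return activation_type_to_func[activation_type](x)

print(apply_activation(ActivationType.Relu2, 3.0))  # → 9.0

# Passing .value hands the dict a plain int, which does not match any
# Enum-member key, so the lookup fails:
try:
    apply_activation(ActivationType.Relu2.value, 3.0)
except KeyError:
    print("KeyError on int lookup")
```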
🤖 Fix all issues with AI agents
In `@csrc/trtllm_fused_moe_runner.cu`:
- Around line 196-222: The function activationTypeToGatedActType is missing a
case for ActivationType::SwigluBias which causes the default-check to fire at
runtime; update activationTypeToGatedActType to handle
ActivationType::SwigluBias (returning the appropriate gated enum, e.g.,
ActType::SwiGlu) alongside ActivationType::Swiglu so that
isGatedActivation-consistent callers (like getOptions) no longer hit the
FLASHINFER_CHECK default branch.
In `@tests/moe/utils.py`:
- Around line 90-94: Update the three routing configuration dicts named kimi_k2,
DSv3, and DSLite in test_dpsk_fused_moe_fp8.py to include the required key
"compatible_activation_types" (a list of activation names). Add a
compatible_activation_types entry that includes the activation(s) used by the
tests (for example ["gelu", "relu", "swish"] or the specific activation(s) the
test suite iterates over) so pytest.skip is not triggered by an empty default.
🧹 Nitpick comments (5)
csrc/trtllm_batched_gemm_runner.cu (1)
226-227: Clarify the scaleAct assignment.

The comment indicates this is a simplification. Consider adding a more detailed comment explaining when `scaleAct` should differ from `scaleGateC`, or whether this is the correct semantic for all element-wise activation types.

include/flashinfer/trtllm/fused_moe/runner.h (1)

171-172: Consider throwing an error for invalid activation types.

The TODO comment indicates this should throw an error. Currently, returning "InvalidActivationType" silently accepts invalid values.

Proposed fix

```diff
       default:
-        return "InvalidActivationType";  // TODO throw error
+        FLASHINFER_CHECK(false, "Invalid activation type: ", static_cast<int64_t>(activationType));
+        return "InvalidActivationType";  // Unreachable
```

csrc/trtllm_fused_moe_kernel_launcher.cu (1)

419-424: Consider exposing activation_type in the BF16 MoE API.

The comment "not exposed in api for now" suggests this is intentional. If non-gated activations (like Relu2) should eventually be supported for BF16 MoE, this would need to be parameterized.
tests/moe/utils.py (1)
105-108: Threshold change from `>= 512` to `> 512`.

This change allows 512 experts with `intermediate_size > 512` to run, where previously they were skipped. Verify this is intentional and that these configurations are supported.
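The boundary shift can be seen with a toy predicate. This is illustrative only; the real skip logic in tests/moe/utils.py involves more fields than shown here:

```python
def should_skip_old(intermediate_size: int) -> bool:
    # previous guard: the 512 boundary itself was skipped
    return intermediate_size >= 512

def should_skip_new(intermediate_size: int) -> bool:
    # new guard: 512 now runs; only strictly larger sizes are skipped
    return intermediate_size > 512

print(should_skip_old(512), should_skip_new(512))  # → True False
```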
590-591: Redundant CUDA availability check.Per repository conventions, tests assume CUDA is available and do not require
torch.cuda.is_available()guards. This check is redundant but not harmful.
```cpp
static inline ActType activationTypeToGatedActType(ActivationType actType) {
  switch (actType) {
    case ActivationType::Swiglu:
      return ActType::SwiGlu;
    case ActivationType::Geglu:
      return ActType::GeGlu;
    default:
      FLASHINFER_CHECK(false, "Unsupported gated activation type ",
                       serializeActivationType(actType), " of enum ",
                       static_cast<int64_t>(actType));
  }
  return ActType::SwiGlu;
}

static inline EltwiseActType activationTypeToEltwiseActType(ActivationType actType) {
  switch (actType) {
    case ActivationType::Relu2:
      return EltwiseActType::Relu2;
    case ActivationType::Identity:
      return EltwiseActType::None;
    default:
      FLASHINFER_CHECK(false, "Unsupported eltwise activation type ",
                       serializeActivationType(actType), " of enum ",
                       static_cast<int64_t>(actType));
  }
  return EltwiseActType::None;
}
```
🧩 Analysis chain (repository: flashinfer-ai/flashinfer)

🏁 Script executed:

```shell
#!/bin/bash
# Search for SwigluBias in the codebase
rg -n "SwigluBias" --type-list | head -20
```

🏁 Script executed:

```shell
#!/bin/bash
# Find all instances of SwigluBias
rg -n "SwigluBias" --type cpp --type cu --type cuh --type py 2>/dev/null | head -50
```

🏁 Script executed:

```shell
#!/bin/bash
# Search for ActivationType enum definition and isGatedActivation function
rg -n "isGatedActivation|enum.*ActivationType" --type cpp --type cu --type cuh --type py 2>/dev/null | head -50
```

🏁 Script executed:

```shell
#!/bin/bash
# Simple search for SwigluBias anywhere in the repository
rg "SwigluBias" -i 2>/dev/null | head -30
```

🏁 Script executed:

```shell
#!/bin/bash
# Look at the actual file to understand the context
cat -n csrc/trtllm_fused_moe_runner.cu | sed -n '150,250p'
```

🏁 Script executed:

```shell
#!/bin/bash
# Search for ActivationType enum and isGatedActivation function
rg "enum.*ActivationType|isGatedActivation" -A 10 2>/dev/null | head -80
```
Add SwigluBias case to activationTypeToGatedActType function.
The isGatedActivation helper correctly identifies SwigluBias as a gated activation, but activationTypeToGatedActType lacks a case for it, causing a runtime check failure when SwigluBias is used in line 234 of the getOptions function.
Proposed fix

```diff
 static inline ActType activationTypeToGatedActType(ActivationType actType) {
   switch (actType) {
     case ActivationType::Swiglu:
       return ActType::SwiGlu;
     case ActivationType::Geglu:
       return ActType::GeGlu;
+    case ActivationType::SwigluBias:
+      return ActType::SwiGlu;
     default:
       FLASHINFER_CHECK(false, "Unsupported gated activation type ",
                        serializeActivationType(actType), " of enum ",
                        static_cast<int64_t>(actType));
   }
   return ActType::SwiGlu;
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
static inline ActType activationTypeToGatedActType(ActivationType actType) {
  switch (actType) {
    case ActivationType::Swiglu:
      return ActType::SwiGlu;
    case ActivationType::Geglu:
      return ActType::GeGlu;
    case ActivationType::SwigluBias:
      return ActType::SwiGlu;
    default:
      FLASHINFER_CHECK(false, "Unsupported gated activation type ",
                       serializeActivationType(actType), " of enum ",
                       static_cast<int64_t>(actType));
  }
  return ActType::SwiGlu;
}

static inline EltwiseActType activationTypeToEltwiseActType(ActivationType actType) {
  switch (actType) {
    case ActivationType::Relu2:
      return EltwiseActType::Relu2;
    case ActivationType::Identity:
      return EltwiseActType::None;
    default:
      FLASHINFER_CHECK(false, "Unsupported eltwise activation type ",
                       serializeActivationType(actType), " of enum ",
                       static_cast<int64_t>(actType));
  }
  return EltwiseActType::None;
}
```
🤖 Prompt for AI Agents
In `@csrc/trtllm_fused_moe_runner.cu` around lines 196 - 222, The function
activationTypeToGatedActType is missing a case for ActivationType::SwigluBias
which causes the default-check to fire at runtime; update
activationTypeToGatedActType to handle ActivationType::SwigluBias (returning the
appropriate gated enum, e.g., ActType::SwiGlu) alongside ActivationType::Swiglu
so that isGatedActivation-consistent callers (like getOptions) no longer hit the
FLASHINFER_CHECK default branch.
tests/moe/utils.py
Outdated
```python
compatible_activation_types = routing_config.get("compatible_activation_types", [])
if activation_type not in compatible_activation_types:
    pytest.skip(
        f"Incompatible: activation_type={activation_type} not in compatible_activation_types ({compatible_activation_types})"
    )
```
🧩 Analysis chain (repository: flashinfer-ai/flashinfer)

🏁 Script executed:

```shell
# Check routing_config definitions and compatible_activation_types usage
rg -n "routing_config" tests/moe/ -B 2 -A 5 | head -100
```

🏁 Script executed:

```shell
# Find all test files to understand routing_config usage patterns
find tests/moe -name "*.py" -type f
```

🏁 Script executed:

```shell
# Check if compatible_activation_types is present in routing configs
rg -n "compatible_activation_types" tests/moe/
```

🏁 Script executed:

```shell
# Find all routing_config dict definitions in test files
rg -n "\"compatible_activation_types\"" tests/moe/ -B 10 | grep -E "(routing_config|compatible_activation_types)" | head -50
```

🏁 Script executed:

```shell
# Check all test files for routing_config definitions without compatible_activation_types
rg -n "{" tests/moe/test_*.py -A 50 | grep -E "(routing_config\s*=|compatible_activation_types)" | head -100
```

🏁 Script executed:

```shell
# Check if flashinfer.utils functions are imported in utils.py
head -30 tests/moe/utils.py
```

🏁 Script executed:

```shell
# Check for all routing_config dict creations in test files
rg -n "routing_config\s*=" tests/moe/test_*.py -A 25 | head -200
```

🏁 Script executed:

```shell
# Check routing_config definitions in test_dpsk_fused_moe_fp8.py more thoroughly
rg -n "routing_config\s*=" tests/moe/test_dpsk_fused_moe_fp8.py -A 15 | head -80
```

🏁 Script executed:

```shell
# Check all routing_config definitions in all test files
rg -n "routing_config\s*=" tests/moe/test_trtllm_cutlass_fused_moe.py -A 20 | head -100
```

🏁 Script executed:

```shell
# Check test_trtllm_gen_routed_fused_moe.py for routing_config definitions
rg -n "routing_config\s*=" tests/moe/test_trtllm_gen_routed_fused_moe.py -A 20 | head -100
```

🏁 Script executed:

```shell
# Check if compatible_activation_types is missing anywhere in routing configs
rg -n "routing_config\s*=\s*\{" tests/moe/ -A 30 | grep -v "compatible_activation_types" | grep -E "(routing_config|\")" | head -80
```

🏁 Script executed:

```shell
# Check the complete skip_checks function to see if flashinfer.utils is used
sed -n '50,130p' tests/moe/utils.py
```

🏁 Script executed:

```shell
# Check test_dpsk_fused_moe_fp8.py routing_config more completely
sed -n '747,762p' tests/moe/test_dpsk_fused_moe_fp8.py
```

🏁 Script executed:

```shell
# Find all parametrize decorators with routing_config in MOE test files
rg -n "@pytest.mark.parametrize" tests/moe/test_dpsk_fused_moe_fp8.py -A 30 | head -150
```

🏁 Script executed:

```shell
# Get the complete routing_config definitions from test_dpsk_fused_moe_fp8.py
sed -n '508,545p' tests/moe/test_dpsk_fused_moe_fp8.py
```

🏁 Script executed:

```shell
# Check if there are any other routing_config params beyond line 545
sed -n '545,600p' tests/moe/test_dpsk_fused_moe_fp8.py
```

🏁 Script executed:

```shell
# Verify imports in skip_checks to confirm flashinfer.utils usage
head -25 tests/moe/utils.py
```
test_dpsk_fused_moe_fp8.py routing configs require the `compatible_activation_types` field

The routing configuration dictionaries in test_dpsk_fused_moe_fp8.py (lines 508-545) are missing the `compatible_activation_types` field. Without it, all tests using these configurations will be unexpectedly skipped, since `compatible_activation_types` defaults to an empty list. Update the three routing config definitions (kimi_k2, DSv3, DSLite) to include this required field.
🤖 Prompt for AI Agents
In `@tests/moe/utils.py` around lines 90 - 94, Update the three routing
configuration dicts named kimi_k2, DSv3, and DSLite in
test_dpsk_fused_moe_fp8.py to include the required key
"compatible_activation_types" (a list of activation names). Add a
compatible_activation_types entry that includes the activation(s) used by the
tests (for example ["gelu", "relu", "swish"] or the specific activation(s) the
test suite iterates over) so pytest.skip is not triggered by an empty default.
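A sketch of one updated config and the compatibility lookup it feeds. The field values here are placeholders, not the real DSv3 parameters:

```python
# Hypothetical routing config carrying the new key; real configs live in
# tests/moe/test_dpsk_fused_moe_fp8.py.
DSv3 = {
    "num_experts": 256,
    "top_k": 8,
    "compatible_activation_types": ["Swiglu"],
}

def is_compatible(routing_config: dict, activation_type: str) -> bool:
    # Same lookup as skip_checks: a missing key falls back to an empty list,
    # which makes every parametrization skip.
    compatible = routing_config.get("compatible_activation_types", [])
    return activation_type in compatible

print(is_compatible(DSv3, "Swiglu"))  # → True
print(is_compatible({}, "Swiglu"))    # → False (missing key skips everything)
```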
…ant launched max value Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@flashinfer-bot run

/bot run

[CANCELING] Pipeline #43117758: canceled

@flashinfer-bot run

/bot run
fa81b33 to df1ae03 (Compare)
[FAILED] Pipeline #43168329: 10/20 passed

LGTM.

The CI is green for this PR: https://github.com/flashinfer-ai/flashinfer/actions/runs/21629745844?pr=2462
…ron, fixed (flashinfer-ai#2462)

## 📌 Description

- Support element-wise activation (relu^2) in fused MoE in NVFP4 and in FP8PerTensor.
- Use the new ActivationType enum class instead of GatedActType.
- Support Nemotron in DeepSeek routing as in NVIDIA/TensorRT-LLM#9792.
- Remove the 'A' suffix from UseShuffledMatrixA.

NOTE: This is the fixed version of flashinfer-ai#2304 that was merged and reverted.

- Replaced the problematic condition in DeepSeek routing that required `NumExperts >= MaxSupportedTopExperts` with `topK <= numExperts`; DeepSeek R1 works with it (tested with vLLM).
- Removed irrelevant test cases.

## Summary by CodeRabbit

* **Refactor**
  * Replaced the old gated-activation API with a unified ActivationType enum (many activation kinds supported).
  * Propagated activation_type across MoE workflows and kernels.
* **New Features**
  * Added CLI option `--activation-type` to select the activation kind for MoE benchmarks.
* **Bug Fixes**
  * Enforced activation compatibility and validation for FP8/FP4 paths.
* **Tests**
  * Updated and expanded tests to cover new activation types and compatibility scenarios.

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
```
- 4: Geglu
- 5: SwigluBias
- 6: Relu2
- 7: Identity
```
In csrc/trtllm_fused_moe_runner.cu, the functions activationTypeToGatedActType and activationTypeToEltwiseActType restrict the activation functions to [Swiglu, Geglu, Relu2, Identity]. The comment needs to be updated accordingly.
Additionally, I doubt that Geglu is actually supported for per-tensor FP8; it seems the corresponding cubins are not generated.
You're right, the docstring here should probably reflect what's supported instead of detailing the entire enum.
📌 Description

- Support element-wise activation (relu^2) in fused MoE in NVFP4 and in FP8PerTensor.
- Use the new ActivationType enum class instead of GatedActType.
- Support Nemotron in DeepSeek routing as in NVIDIA/TensorRT-LLM#9792.
- Remove the 'A' suffix from UseShuffledMatrixA.

NOTE: This is the fixed version of #2304 that was merged and reverted.

- Replaced the problematic condition in DeepSeek routing that required `NumExperts >= MaxSupportedTopExperts` with `topK <= numExperts`; DeepSeek R1 works with it (tested with vLLM).
- Removed irrelevant test cases.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

Reviewer Notes

Summary by CodeRabbit

- Refactor
  - Replaced the old gated-activation API with a unified ActivationType enum.
  - Propagated activation_type across MoE workflows and kernels.
- New Features
  - Added CLI option `--activation-type` to select the activation kind for MoE benchmarks.
- Bug Fixes
  - Enforced activation compatibility and validation for FP8/FP4 paths.
- Tests
  - Updated and expanded tests to cover new activation types and compatibility scenarios.
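The `--activation-type` CLI option described in the summary can be wired roughly as follows. This uses a stand-in enum with illustrative member values; the real definitions live in flashinfer's benchmark code:

```python
import argparse
from enum import Enum

# Hypothetical stand-in for flashinfer's ActivationType enum.
class ActivationType(Enum):
    Swiglu = 3
    Geglu = 4
    SwigluBias = 5
    Relu2 = 6
    Identity = 7

parser = argparse.ArgumentParser()
parser.add_argument(
    "--activation-type",
    type=lambda name: ActivationType[name],  # convert member name to enum
    choices=list(ActivationType),            # validated after conversion
    default=ActivationType.Swiglu,
)

# Parse a fixed argv list instead of sys.argv for demonstration.
args = parser.parse_args(["--activation-type", "Relu2"])
print(args.activation_type.name)  # → Relu2
```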