[feat] Add 2048 experts and 32 Top K #2744
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough
Split monolithic MoE routing into modular DeepSeek and Renormalize backends: added common headers, compile-time dispatch macros, multiple kernel launchers and run() orchestrators; updated KernelParams/top-K support, DevKernel macros, build/JIT source lists, and tests; removed legacy single-file routing implementations.
Sequence Diagram(s)
sequenceDiagram
rect rgba(100,150,240,0.5)
participant Host as Host
end
rect rgba(120,200,150,0.5)
participant Orch as Orchestrator (run)
end
rect rgba(240,180,120,0.5)
participant Macro as Launcher / Dispatch Macro
end
rect rgba(200,120,220,0.5)
participant GPU as GPU Kernels
end
Host->>Orch: call run(Data)
Orch->>Orch: validate inputs & pick backend (DeepSeek / Renormalize)
Orch->>Macro: select LAUNCH_ROUTING_* macro and specific launcher
Macro->>GPU: configure & launch kernel(s) (main/cluster/coop/histogram/offsets)
GPU-->>Macro: kernel(s) complete (optional PDL sync)
Macro-->>Orch: device buffers updated
Orch-->>Host: return / completion
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Warning: Gemini is experiencing higher than usual traffic and was unable to create the summary. Please try again in a few hours.
/bot run
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
include/flashinfer/trtllm/fused_moe/RoutingKernel.h (1)
279-303: ⚠️ Potential issue | 🟠 Major
Thread DoSoftmaxBeforeTopK into the renormalize packed-pointer type.
csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu derives packed entries as PackedScoreIdx<conditional_t<DoSoftmaxBeforeTopK, float, InputT>>, but this KernelParams still stores and casts mPtrTopKPacked as PackedScoreIdx<OutputT>*. When DoSoftmaxBeforeTopK == true and OutputT != float, the packed buffer layout no longer matches the kernel contract.
Suggested fix
 template <typename InputT_, typename OutputT_, int MaxNumExperts_, int MaxNumTopExperts_,
           bool DoSoftmaxBeforeTopK_, bool isPow2_, bool UsePdl_>
 struct KernelParams : public KernelParamsBase<InputT_, OutputT_, MaxNumExperts_,
                                               MaxNumTopExperts_, isPow2_, UsePdl_> {
   using InputT = InputT_;
   using OutputT = OutputT_;
+  using PackedT =
+      PackedScoreIdx<std::conditional_t<DoSoftmaxBeforeTopK_, float, InputT_>>;
   static constexpr bool DoSoftmaxBeforeTopK = DoSoftmaxBeforeTopK_;
-  PackedScoreIdx<OutputT>* mPtrTopKPacked = nullptr;
+  PackedT* mPtrTopKPacked = nullptr;
@@
-  params.mPtrTopKPacked = (PackedScoreIdx<OutputT>*)data.mPtrTopKPacked;
+  params.mPtrTopKPacked = static_cast<PackedT*>(data.mPtrTopKPacked);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/flashinfer/trtllm/fused_moe/RoutingKernel.h` around lines 279 - 303, KernelParams currently stores mPtrTopKPacked as PackedScoreIdx<OutputT>*, but the launched kernel expects PackedScoreIdx<conditional_t<DoSoftmaxBeforeTopK, float, InputT>>; change mPtrTopKPacked's type declaration in struct KernelParams to use the same conditional type (e.g., PackedScoreIdx<std::conditional_t<DoSoftmaxBeforeTopK, float, OutputT>>* or, if the kernel uses InputT, PackedScoreIdx<std::conditional_t<DoSoftmaxBeforeTopK, float, InputT>>*), and update the assignment in KernelParams::setKernelParams to cast data.mPtrTopKPacked to that same conditional pointer type so the stored pointer type matches the kernel contract (refer to symbols KernelParams, mPtrTopKPacked, PackedScoreIdx, and setKernelParams).
🧹 Nitpick comments (1)
flashinfer/jit/moe_utils.py (1)
79-92: Extract the DeepSeek routing manifest into a shared helper/constant.
This file list is now duplicated here and in flashinfer/jit/fused_moe.py. The next TensorRT-LLM sync is likely to touch one list and miss the other, which would make the two JIT modules drift out of sync.
Verify each finding against the current code and only fix it if needed. In `@flashinfer/jit/moe_utils.py` around lines 79 - 92, The DeepSeek routing CUDA source list is duplicated between moe_utils.py and fused_moe.py; extract that list into a single shared constant (e.g., DEEPSEEK_ROUTING_SOURCES or DEEPSEEK_ROUTING_MANIFEST) in a common module (create a small helper or export it from moe_utils.py) and replace the inline lists in both jit modules with imports of that constant so both trtllm_fused_moe routing lists reference the single source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu`:
- Around line 189-195: Before calling cudaTriggerProgrammaticLaunchCompletion()
when KernelParams::UsePdl is true, ensure a system-wide memory fence so writes
to mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrNumNonExitingCtas,
mPtrTotalNumPaddedTokens and other global metadata are visible to the
subsequently-launched kernel; add an appropriate global fence (e.g.,
__threadfence_system() or your platform's equivalent) after the existing
__syncthreads() and before invoking cudaTriggerProgrammaticLaunchCompletion(),
keeping the conditional on KernelParams::UsePdl and preserving existing comments
and ordering.
In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/RoutingDeepSeekCommon.cuh`:
- Around line 39-52: getMaxNumExperts currently returns 0 on unsupported inputs
which allows callers like LAUNCH_ROUTING_DEEPSEEK to divide-by-zero or silently
skip launches; change getMaxNumExperts to fail fast instead of returning 0 by
logging the error and aborting (or throwing/asserting) so an invalid numExperts
causes a deterministic crash, and apply the same change to the analogous
function in the same file (the similar block around lines 72-88) so neither
function ever returns 0 for unsupported expert counts.
In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu`:
- Around line 69-78: The code writes to params.mPtrExpandedIdxToPermutedIdx
unconditionally in the expertIdx == -1 branch; guard that write with a
null-check so we only assign when the optional mapping was provided. Locate the
block handling params.mPtrTopKIds / expertIdx (within the warpIdx/laneIdx loop
using MaxNumExperts and smemKIdx) and wrap the
params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = -1
assignment inside an if (params.mPtrExpandedIdxToPermutedIdx != nullptr) check
(keeping the same index calculation).
In
`@csrc/fused_moe/trtllm_backend/routingRenormalize/RoutingRenormalizeCommon.cuh`:
- Around line 99-126: The macros currently treat any mTopK > NumTop16Experts as
MaxSupportedTopExperts, which causes mTopK > MaxSupportedTopExperts (e.g., 33+)
to fall through and index out-of-bounds; update LAUNCH_ROUTING_WITH_TOPK to
explicitly handle the 32-wide case and reject larger values: add an else-if
branch checking if data.mTopK <= NumTop32Experts and call
LAUNCH_ROUTING_WITH_NUM_EXPERTS with NumTop32Experts, and replace the final else
with an error/log and no-launch path for unsupported data.mTopK (referencing
data.mTopK, LAUNCH_ROUTING_WITH_TOPK, LAUNCH_ROUTING_WITH_NUM_EXPERTS,
NumTop8Experts, NumTop16Experts, NumTop32Experts, MaxSupportedTopExperts).
In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu`:
- Around line 59-60: The code caches the SM count in a static variable (smCount)
which uses tensorrt_llm::common::getMultiProcessorCount() that is
device-specific; remove the static so the SM count is retrieved per-call (e.g.,
make smCount a non-static local variable where numBlocksCoop is computed) and
recompute numBlocksCoop from that per-device value (also guard numBlocksCoop to
not go negative if needed) to avoid pinning launch size to the first device that
hits this code.
- Around line 47-72: Validate data.mTopK before using it in the maxTokensCoop
calculation and ensure mTopK is within valid bounds relative to mNumExperts and
constants: check data.mTopK > 0 and data.mTopK <= data.mNumExperts and
data.mTopK <= MaxSupportedTopExperts early (before computing int const
maxTokensCoop = ... / data.mTopK), and reorder/adjust the later guards involving
data.mPtrTopKIds, data.mNumExperts, MaxSupportedTopExperts, and
MaxSupportedExpertCount so that the precomputed-TopK path cannot execute with
mTopK == 0 or mTopK > mNumExperts; reference the symbols data.mTopK,
data.mNumExperts, data.mPtrTopKIds, maxTokensCoop, MaxSupportedTopExperts, and
MaxSupportedExpertCount.
In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu`:
- Around line 242-247: The code writes
params.mPtrPermutedIdxToExpandedIdx[permutedIdx] without checking isTokenRouted,
which allows writing invalid slots for local-but-unselected experts; update the
condition on the mPtrPermutedIdxToExpandedIdx write to mirror the other
reverse-map guard by requiring both isLocalExpert and isTokenRouted (i.e., only
write params.mPtrPermutedIdxToExpandedIdx[permutedIdx] when isTokenRouted is
true), keeping the existing use of permutedIdx, tokenIdx, finalExpertOffset and
localOffsetToken to compute the index.
In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 870-871: The runtime check using TVM_FFI_ICHECK currently enforces
args->top_k <= 32 but the error string still says "top_k<=10", so update the
diagnostic to match the enforced limit: change the message string in the
TVM_FFI_ICHECK call that references args->top_k to say "top_k<=32 && top_k>0"
(keep the same surrounding text "Current routing kernel (no groups, renormalize)
only supports ..."). Ensure you edit the string associated with the
TVM_FFI_ICHECK invocation so runtime diagnostics accurately reflect the new
limit.
In `@flashinfer/jit/fused_moe.py`:
- Around line 264-297: The sources list in fused_moe.py contains duplicate
basenames (e.g., launchClusterKernel.cu, launchHistogramKernel.cu,
launchInitExpertCounts.cu, launchOffsetsKernel.cu) that will collide when
JitSpec.get_object_paths() and generate_ninja_build_for_op() convert sources to
object names using source.with_suffix(object_suffix).name; fix by either (A)
deduplicating/renaming the entries in the sources list in fused_moe.py to ensure
unique basenames per compilation unit (move into uniquely named files or prefix
filenames by directory), or (B) change the object-naming logic in
JitSpec.get_object_paths() / generate_ninja_build_for_op() to include path
components or a short hash (e.g., incorporate source.parent.name or a hash of
source) when constructing the object filename to disambiguate
otherwise-identical basenames.
In `@include/flashinfer/trtllm/common/cudaUtils.h`:
- Around line 272-277: The getMultiProcessorCount function lacks CUDA error
checking and can return garbage; wrap the cudaGetDevice and
cudaDeviceGetAttribute calls with the same error-checking macro used elsewhere
(e.g., FLASHINFER_CHECK) or check their cudaError_t return values and handle
failures before returning count. Update the function (getMultiProcessorCount) to
call FLASHINFER_CHECK(cudaGetDevice(...)) and
FLASHINFER_CHECK(cudaDeviceGetAttribute(..., cudaDevAttrMultiProcessorCount,
device)) (or equivalent error handling) so failures are logged/handled and a
safe default or abort is used instead of returning an unchecked count.
In `@include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh`:
- Around line 440-453: The code paths that set idx only when mPtrTopKIds or
mPtrTopKWeights are present leave idx uninitialized when mPtrTopKIds==nullptr
and mPtrTopKWeights==nullptr; always decode params.mPtrTopKPacked[expandedIdx]
into scoreIdx and assign idx = scoreIdx.idx (and set mPtrTopKWeights if present)
before computing localExpertIdx/isLocalExpert so
atomicAdd(&smemExpertCount[idx], 1) uses a valid index; update the branch around
params.mPtrTopKIds/mPtrTopKWeights to unconditionally read params.mPtrTopKPacked
and populate idx first, then conditionally write params.mPtrTopKWeights.
In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Around line 2872-2892: This 2048-expert config (num_experts=2048, top_k=32,
RoutingMethodType.Renormalize) must be removed from the large parametrized
matrix and isolated into a small targeted smoke test with autotune disabled:
create a new focused test that uses the same dict but sets enable_autotune=False
(and a distinct id like "RoutingRenormalize_large_experts_smoke"), and remove or
replace the pytest.param entry from the full matrix so the heavy case no longer
participates in the combinatorial num_tokens × zero_hidden_states × moe_impl ×
weight_processing × activation_type runs.
---
Outside diff comments:
In `@include/flashinfer/trtllm/fused_moe/RoutingKernel.h`:
- Around line 279-303: KernelParams currently stores mPtrTopKPacked as
PackedScoreIdx<OutputT>*, but the launched kernel expects
PackedScoreIdx<conditional_t<DoSoftmaxBeforeTopK, float, InputT>>; change
mPtrTopKPacked's type declaration in struct KernelParams to use the same
conditional type (e.g., PackedScoreIdx<std::conditional_t<DoSoftmaxBeforeTopK,
float, OutputT>>* or, if the kernel uses InputT,
PackedScoreIdx<std::conditional_t<DoSoftmaxBeforeTopK, float, InputT>>*), and
update the assignment in KernelParams::setKernelParams to cast
data.mPtrTopKPacked to that same conditional pointer type so the stored pointer
type matches the kernel contract (refer to symbols KernelParams, mPtrTopKPacked,
PackedScoreIdx, and setKernelParams).
---
Nitpick comments:
In `@flashinfer/jit/moe_utils.py`:
- Around line 79-92: The DeepSeek routing CUDA source list is duplicated between
moe_utils.py and fused_moe.py; extract that list into a single shared constant
(e.g., DEEPSEEK_ROUTING_SOURCES or DEEPSEEK_ROUTING_MANIFEST) in a common module
(create a small helper or export it from moe_utils.py) and replace the inline
lists in both jit modules with imports of that constant so both trtllm_fused_moe
routing lists reference the single source of truth.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3886ca25-8b46-4874-8000-8b83bfc701bc
📒 Files selected for processing (31)
- csrc/fused_moe/trtllm_backend/routingDeepSeek/RoutingDeepSeekCommon.cuh
- csrc/fused_moe/trtllm_backend/routingDeepSeek/launchClusterKernel.cu
- csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu
- csrc/fused_moe/trtllm_backend/routingDeepSeek/launchHistogramKernel.cu
- csrc/fused_moe/trtllm_backend/routingDeepSeek/launchInitExpertCounts.cu
- csrc/fused_moe/trtllm_backend/routingDeepSeek/launchMainKernel.cu
- csrc/fused_moe/trtllm_backend/routingDeepSeek/launchOffsetsKernel.cu
- csrc/fused_moe/trtllm_backend/routingRenormalize/RoutingRenormalizeCommon.cuh
- csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu
- csrc/fused_moe/trtllm_backend/routingRenormalize/launchClusterKernel.cu
- csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramKernel.cu
- csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramScoresKernel.cu
- csrc/fused_moe/trtllm_backend/routingRenormalize/launchInitExpertCounts.cu
- csrc/fused_moe/trtllm_backend/routingRenormalize/launchOffsetsKernel.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_dev_kernel.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_renormalize.cu
- csrc/trtllm_fused_moe_kernel_launcher.cu
- csrc/trtllm_fused_moe_routing_deepseek.cu
- csrc/trtllm_fused_moe_routing_renormalize.cu
- csrc/trtllm_fused_moe_runner.cu
- flashinfer/jit/cpp_ext.py
- flashinfer/jit/fused_moe.py
- flashinfer/jit/moe_utils.py
- include/flashinfer/trtllm/common/cudaUtils.h
- include/flashinfer/trtllm/fused_moe/DevKernel.h
- include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh
- include/flashinfer/trtllm/fused_moe/RoutingKernel.h
- include/flashinfer/trtllm/fused_moe/RoutingKernelTopK.cuh
- tests/moe/test_trtllm_gen_fused_moe.py
💤 Files with no reviewable changes (2)
- csrc/trtllm_fused_moe_routing_renormalize.cu
- csrc/trtllm_fused_moe_routing_deepseek.cu
// trigger the secondary kernel when using PDL
// We can't do it earlier because FC1 depends on the mPtrCtaIdxXyToBatchIdx,
// mPtrCtaIdxXyToMnLimit, mPtrNumNonExitingCtas and mPtrTotalNumPaddedTokens
// TODO: this is not sufficient to ensure visibility in the next kernel!
if constexpr (KernelParams::UsePdl) {
  cudaTriggerProgrammaticLaunchCompletion();
}
🧩 Analysis chain
🏁 Scripts executed:
# Locate and examine the file
fd "launchCoopKernel.cu" --type f
# Read the relevant section with context
cat -n csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu | head -250 | tail -100
# Search for PDL usage patterns in the codebase to understand the context
rg "cudaTriggerProgrammaticLaunchCompletion|cudaGridDependencySynchronize" -B 3 -A 3
# Search for cooperative groups usage and grid.sync patterns
rg "grid\.sync|__threadfence|cg::sync" --type cu --type h -B 2 -A 2
# Check kernel launch configuration and block/grid structure in this file
rg "dim3|blockDim|gridDim" csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu -B 2 -A 2
# Search for where FC1 or subsequent kernels use cudaGridDependencySynchronize
rg "cudaGridDependencySynchronize" csrc/fused_moe/trtllm_backend/routingDeepSeek/ -B 5 -A 5
# Check what kernel is launched after routingIndicesCoopKernel and if it synchronizes
rg "LAUNCH_ROUTING_DEEPSEEK|routingIndicesCoopKernel" csrc/fused_moe/trtllm_backend/routingDeepSeek/ -B 3 -A 8
# Look for mPtrCtaIdxXyToBatchIdx and mPtrCtaIdxXyToMnLimit to understand memory scope
rg "mPtrCtaIdxXyToBatchIdx|mPtrCtaIdxXyToMnLimit" csrc/ -B 2 -A 2 | head -80
# Check if __threadfence or __threadfence_system is used anywhere in PDL contexts
rg "__threadfence" csrc/fused_moe/trtllm_backend/ -B 3 -A 3
# Check if kernel launches with single or multiple blocks
rg "launchCoopKernel|numBlocksCoop" csrc/ --type cu -B 2 -A 2 | grep -A 5 "numBlocksCoop"
# Verify CUDA PDL semantics in comments/docs within repo
rg "PDL|Programmatic Dependent Launch" include/ --type h -B 2 -A 2 | head -60
Add explicit memory ordering before PDL trigger to ensure metadata visibility.
Line 192 documents the issue: the current code does not guarantee visibility of mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, and scalar outputs to the next kernel before cudaTriggerProgrammaticLaunchCompletion() releases it.
The block-level __syncthreads() at line 187 is insufficient for global device memory. A system-wide fence is required before signaling PDL completion to ensure all threads' writes to global metadata are visible before the next kernel launches.
🛠️ Proposed fix
// TODO: this is not sufficient to ensure visibility in the next kernel!
if constexpr (KernelParams::UsePdl) {
+ __threadfence_system();
cudaTriggerProgrammaticLaunchCompletion();
 }
Verify this aligns with your CUDA toolkit's PDL memory-ordering contract and whether __threadfence_system() or an alternative primitive (e.g., grid synchronization) is required for your target SM version.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu` around
lines 189 - 195, Before calling cudaTriggerProgrammaticLaunchCompletion() when
KernelParams::UsePdl is true, ensure a system-wide memory fence so writes to
mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrNumNonExitingCtas,
mPtrTotalNumPaddedTokens and other global metadata are visible to the
subsequently-launched kernel; add an appropriate global fence (e.g.,
__threadfence_system() or your platform's equivalent) after the existing
__syncthreads() and before invoking cudaTriggerProgrammaticLaunchCompletion(),
keeping the conditional on KernelParams::UsePdl and preserving existing comments
and ordering.
int constexpr getMaxNumExperts(int32_t numExperts) {
  if (numExperts <= topk::MaxNumExpertsUnit) {
    return topk::MaxNumExpertsUnit;
  } else if (numExperts <= NumDeepseekExperts) {
    return NumDeepseekExperts;
  } else if (numExperts <= NumKimiK2Experts) {
    return NumKimiK2Experts;
  } else if (numExperts <= NumNemotronExperts) {
    return NumNemotronExperts;
  } else {
    TLLM_LOG_ERROR("Unsupported numExperts");
    return 0;
  }
}
Fail fast on unsupported expert counts.
getMaxNumExperts() returns 0, and LAUNCH_ROUTING_DEEPSEEK only logs on the unsupported path. The new launchers use that value as numThreadsHist and as a divisor in grid sizing, so an out-of-range mNumExperts can degrade into divide-by-zero or a silently skipped launch rather than a deterministic failure.
Also applies to: 72-88
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/RoutingDeepSeekCommon.cuh`
around lines 39 - 52, getMaxNumExperts currently returns 0 on unsupported inputs
which allows callers like LAUNCH_ROUTING_DEEPSEEK to divide-by-zero or silently
skip launches; change getMaxNumExperts to fail fast instead of returning 0 by
logging the error and aborting (or throwing/asserting) so an invalid numExperts
causes a deterministic crash, and apply the same change to the analogous
function in the same file (the similar block around lines 72-88) so neither
function ever returns 0 for unsupported expert counts.
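A host-side sketch of the fail-fast variant this finding asks for. The tier constants below are placeholders, not the real values of topk::MaxNumExpertsUnit, NumDeepseekExperts, NumKimiK2Experts, and NumNemotronExperts from RoutingDeepSeekCommon.cuh; the point is the control flow, which never returns 0.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Placeholder tier bounds; the real constants live in RoutingDeepSeekCommon.cuh.
constexpr int32_t kMaxNumExpertsUnit = 128;
constexpr int32_t kNumDeepseekExperts = 256;
constexpr int32_t kNumKimiK2Experts = 384;
constexpr int32_t kNumNemotronExperts = 1024;

// Fail-fast variant: never returns 0, so callers such as the launch macros
// can no longer divide by zero or silently skip a launch on bad input.
int32_t getMaxNumExpertsOrDie(int32_t numExperts) {
  if (numExperts <= kMaxNumExpertsUnit) return kMaxNumExpertsUnit;
  if (numExperts <= kNumDeepseekExperts) return kNumDeepseekExperts;
  if (numExperts <= kNumKimiK2Experts) return kNumKimiK2Experts;
  if (numExperts <= kNumNemotronExperts) return kNumNemotronExperts;
  std::fprintf(stderr, "Unsupported numExperts: %d\n", numExperts);
  std::abort();  // deterministic crash instead of returning 0
}
```

An assertion macro (TLLM_CHECK or FLASHINFER_CHECK) would serve equally well in place of the abort; the essential change is that the unsupported path cannot reach the grid-sizing division.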
if (params.mPtrTopKIds != nullptr) {
  if (validToken) {
    if (laneIdx < params.mTopK) {
      auto expertIdx = params.mPtrTopKIds[warpIdx * params.mTopK + laneIdx];
      if (expertIdx != -1) {
        int offset = warpIdx * MaxNumExperts + expertIdx;
        smemKIdx[offset] = static_cast<int8_t>(laneIdx);
      } else {
        params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
      }
Guard the optional mPtrExpandedIdxToPermutedIdx write.
mPtrExpandedIdxToPermutedIdx is optional, but the expertIdx == -1 branch dereferences it unconditionally. A caller that passes precomputed mPtrTopKIds with padded -1 entries and does not request the expanded-index mapping will crash here.
Suggested fix
- } else {
- params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
+ } else if (params.mPtrExpandedIdxToPermutedIdx != nullptr) {
+ params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu` around
lines 69 - 78, The code writes to params.mPtrExpandedIdxToPermutedIdx
unconditionally in the expertIdx == -1 branch; guard that write with a
null-check so we only assign when the optional mapping was provided. Locate the
block handling params.mPtrTopKIds / expertIdx (within the warpIdx/laneIdx loop
using MaxNumExperts and smemKIdx) and wrap the
params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = -1
assignment inside an if (params.mPtrExpandedIdxToPermutedIdx != nullptr) check
(keeping the same index calculation).
// Helper macro: dispatch on topK tier for a given numExperts tier.
#define LAUNCH_ROUTING_WITH_TOPK(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
                                 stream, extraFlag1, numExperts)                            \
  if (data.mTopK <= NumTop8Experts) {                                                       \
    LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
                                    stream, extraFlag1, numExperts, NumTop8Experts);        \
  } else if (data.mTopK <= NumTop16Experts) {                                               \
    LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
                                    stream, extraFlag1, numExperts, NumTop16Experts);       \
  } else {                                                                                  \
    LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
                                    stream, extraFlag1, numExperts, MaxSupportedTopExperts); \
  }

#define LAUNCH_ROUTING_RENORMALIZE(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
                                   stream, extraFlag1)                                        \
  if (data.mNumExperts <= NumExperts128Experts) {                                             \
    LAUNCH_ROUTING_WITH_TOPK(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, stream, \
                             extraFlag1, NumExperts128Experts);                               \
  } else if (data.mNumExperts <= NumExperts512Experts) {                                      \
    LAUNCH_ROUTING_WITH_TOPK(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, stream, \
                             extraFlag1, NumExperts512Experts);                               \
  } else if (data.mNumExperts <= MaxSupportedExperts) {                                       \
    LAUNCH_ROUTING_WITH_TOPK(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, stream, \
                             extraFlag1, MaxSupportedExperts);                                \
  } else {                                                                                    \
    TLLM_LOG_ERROR("Unsupported numExperts");                                                 \
  }
Reject mTopK > 32 instead of falling through to the 32-wide specialization.
The final else currently catches both the supported 17..32 range and unsupported larger values. When data.mTopK is 33+, the downstream renormalize kernels still index fixed-size [MaxNumTopExperts] buffers with laneIdx < params.mTopK, which will walk past those 32-entry arrays.
🛠️ Proposed fix
#define LAUNCH_ROUTING_WITH_TOPK(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
stream, extraFlag1, numExperts) \
if (data.mTopK <= NumTop8Experts) { \
LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
stream, extraFlag1, numExperts, NumTop8Experts); \
} else if (data.mTopK <= NumTop16Experts) { \
LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
stream, extraFlag1, numExperts, NumTop16Experts); \
- } else { \
+ } else if (data.mTopK <= MaxSupportedTopExperts) { \
LAUNCH_ROUTING_WITH_NUM_EXPERTS(data, coopLaunch, kernel, numBlocks, numThreads, smemSize, \
stream, extraFlag1, numExperts, MaxSupportedTopExperts); \
+ } else { \
+ TLLM_LOG_ERROR("Unsupported topK"); \
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@csrc/fused_moe/trtllm_backend/routingRenormalize/RoutingRenormalizeCommon.cuh`
around lines 99 - 126, The macros currently treat any mTopK > NumTop16Experts as
MaxSupportedTopExperts, which causes mTopK > MaxSupportedTopExperts (e.g., 33+)
to fall through and index out-of-bounds; update LAUNCH_ROUTING_WITH_TOPK to
explicitly handle the 32-wide case and reject larger values: add an else-if
branch checking if data.mTopK <= NumTop32Experts and call
LAUNCH_ROUTING_WITH_NUM_EXPERTS with NumTop32Experts, and replace the final else
with an error/log and no-launch path for unsupported data.mTopK (referencing
data.mTopK, LAUNCH_ROUTING_WITH_TOPK, LAUNCH_ROUTING_WITH_NUM_EXPERTS,
NumTop8Experts, NumTop16Experts, NumTop32Experts, MaxSupportedTopExperts).
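The guarded dispatch this prompt describes can be sketched as a plain host function. The tier values (8/16/32, with NumTop32Experts == MaxSupportedTopExperts) are assumptions mirroring the macro constants, and the function name is illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Assumed tier values mirroring NumTop8Experts / NumTop16Experts /
// NumTop32Experts (== MaxSupportedTopExperts) from the macros above.
constexpr int32_t kNumTop8Experts = 8;
constexpr int32_t kNumTop16Experts = 16;
constexpr int32_t kNumTop32Experts = 32;

// Returns the compile-time topK tier to dispatch to, or -1 for unsupported
// values (<= 0 or > 32) instead of falling through to the 32-wide
// specialization and indexing past the fixed-size [MaxNumTopExperts] buffers.
int32_t selectTopKTier(int32_t topK) {
  if (topK <= 0) return -1;
  if (topK <= kNumTop8Experts) return kNumTop8Experts;
  if (topK <= kNumTop16Experts) return kNumTop16Experts;
  if (topK <= kNumTop32Experts) return kNumTop32Experts;
  return -1;  // caller logs "Unsupported topK" and skips the launch
}
```

In the macro itself this corresponds to adding an explicit `else if (data.mTopK <= NumTop32Experts)` branch and turning the final `else` into a log-and-no-launch path.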
FLASHINFER_CHECK(!data.mUseRoutingSoftmax, "Routing with softmax not implemented yet");
int const numBlocks = data.mNumTokens;
int const numThreadsHist = getMaxNumExperts(data.mNumExperts);

bool const useSingleCluster = data.mNumTokens <= 1024;
if (!useSingleCluster) {
  FLASHINFER_CHECK(data.mPtrExpertCounts != nullptr,
                   "When #tokens is large, `mPtrExpertCounts` is a required input.");
} else {
  data.mPtrExpertCounts = nullptr;
}

static int const smCount = tensorrt_llm::common::getMultiProcessorCount();
int const numBlocksCoop = smCount - 8;

int const maxTokensCoop = (numBlocksCoop * numThreadsHist * 64) / data.mTopK;
if (data.mPtrTopKIds == nullptr) {
  FLASHINFER_CHECK(data.mNumExperts >= MaxSupportedTopExperts,
                   "Routing kernel expects %d to be at most #experts %d", MaxSupportedTopExperts,
                   data.mNumExperts);
  FLASHINFER_CHECK(data.mNumExperts <= MaxSupportedExpertCount,
                   "Routing kernel expects #experts %d <= %d", data.mNumExperts,
                   MaxSupportedExpertCount);
  FLASHINFER_CHECK(data.mTopK <= MaxSupportedTopExperts,
                   "Routing kernel expects topK experts <= %d, got %d", MaxSupportedTopExperts,
                   data.mTopK);
Validate mTopK before using it, and compare it to mNumExperts.
Line 62 divides by mTopK before any bound check, and the later data.mNumExperts >= MaxSupportedTopExperts guard rejects valid smaller-expert configs while still leaving the precomputed-TopK path unvalidated. mTopK == 0 is UB here, and data.mPtrTopKIds != nullptr can still send topK > 32 or topK > mNumExperts into the downstream kernels.
Suggested fix
FLASHINFER_CHECK(!data.mUseRoutingSoftmax, "Routing with softmax not implemented yet");
+ FLASHINFER_CHECK(data.mTopK > 0 && data.mTopK <= MaxSupportedTopExperts,
+ "Routing kernel expects topK in [1, %d], got %d",
+ MaxSupportedTopExperts, data.mTopK);
+ FLASHINFER_CHECK(data.mNumExperts >= data.mTopK,
+                    "Routing kernel expects topK %d <= #experts %d",
+ data.mTopK, data.mNumExperts);
int const numBlocks = data.mNumTokens;
int const numThreadsHist = getMaxNumExperts(data.mNumExperts);
@@
- FLASHINFER_CHECK(data.mNumExperts >= MaxSupportedTopExperts,
- "Routing kernel expects %d to be at most `#experts` %d", MaxSupportedTopExperts,
- data.mNumExperts);
FLASHINFER_CHECK(data.mNumExperts <= MaxSupportedExpertCount,
"Routing kernel expects `#experts` %d <= %d", data.mNumExperts,
MaxSupportedExpertCount);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu` around
lines 47 - 72, Validate data.mTopK before using it in the maxTokensCoop
calculation and ensure mTopK is within valid bounds relative to mNumExperts and
constants: check data.mTopK > 0 and data.mTopK <= data.mNumExperts and
data.mTopK <= MaxSupportedTopExperts early (before computing int const
maxTokensCoop = ... / data.mTopK), and reorder/adjust the later guards involving
data.mPtrTopKIds, data.mNumExperts, MaxSupportedTopExperts, and
MaxSupportedExpertCount so that the precomputed-TopK path cannot execute with
mTopK == 0 or mTopK > mNumExperts; reference the symbols data.mTopK,
data.mNumExperts, data.mPtrTopKIds, maxTokensCoop, MaxSupportedTopExperts, and
MaxSupportedExpertCount.
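The ordering issue the review flags can be seen with a small host-side sketch (Python, illustrative names mirroring `maxTokensCoop` and `MaxSupportedTopExperts`; this is not the actual launcher code): the bound check must be hoisted above the division, otherwise `topK == 0` faults before any diagnostic fires.

```python
MAX_SUPPORTED_TOP_EXPERTS = 32  # stand-in for MaxSupportedTopExperts


def max_tokens_coop(num_blocks_coop, num_threads_hist, top_k):
    """Mirror of the maxTokensCoop computation, with the top-k bound
    checks hoisted before the division so top_k == 0 is rejected
    cleanly instead of dividing by zero."""
    if not (0 < top_k <= MAX_SUPPORTED_TOP_EXPERTS):
        raise ValueError(
            f"topK must be in [1, {MAX_SUPPORTED_TOP_EXPERTS}], got {top_k}")
    return (num_blocks_coop * num_threads_hist * 64) // top_k
```

With the check first, both the cooperative path and the precomputed-TopK path go through the same validation before `top_k` is ever used as a divisor.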
```cpp
TVM_FFI_ICHECK(args->top_k <= 32 && args->top_k > 0)
    << "Current routing kernel (no groups, renormalize) only supports top_k<=10 && top_k>0.";
```
Update the validation message to match the new limit.
Line 871 still says top_k<=10, but Line 870 now accepts top_k<=32. That will produce misleading runtime diagnostics for this path.
Proposed fix

```diff
-  TVM_FFI_ICHECK(args->top_k <= 32 && args->top_k > 0)
-      << "Current routing kernel (no groups, renormalize) only supports top_k<=10 && top_k>0.";
+  TVM_FFI_ICHECK(args->top_k <= 32 && args->top_k > 0)
+      << "Current routing kernel (no groups, renormalize) only supports top_k<=32 && top_k>0.";
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
TVM_FFI_ICHECK(args->top_k <= 32 && args->top_k > 0)
    << "Current routing kernel (no groups, renormalize) only supports top_k<=32 && top_k>0.";
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@csrc/trtllm_fused_moe_kernel_launcher.cu` around lines 870 - 871, The runtime
check using TVM_FFI_ICHECK currently enforces args->top_k <= 32 but the error
string still says "top_k<=10", so update the diagnostic to match the enforced
limit: change the message string in the TVM_FFI_ICHECK call that references
args->top_k to say "top_k<=32 && top_k>0" (keep the same surrounding text
"Current routing kernel (no groups, renormalize) only supports ..."). Ensure you
edit the string associated with the TVM_FFI_ICHECK invocation so runtime
diagnostics accurately reflect the new limit.
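This class of drift is avoidable by deriving both the check and the diagnostic from one constant. A minimal sketch (Python, illustrative names; the real code uses `TVM_FFI_ICHECK`):

```python
MAX_TOP_K = 32  # single source of truth for the limit and the message


def check_top_k(top_k):
    """Validate top_k; the error text interpolates MAX_TOP_K so the
    diagnostic can never disagree with the enforced bound."""
    if not (0 < top_k <= MAX_TOP_K):
        raise ValueError(
            "Current routing kernel (no groups, renormalize) only supports "
            f"top_k<={MAX_TOP_K} && top_k>0, got {top_k}")
    return top_k
```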
```cpp
inline int getMultiProcessorCount() {
  int device, count;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&count, cudaDevAttrMultiProcessorCount, device);
  return count;
}
```
Missing error handling for CUDA API calls.
The function does not check the return values of cudaGetDevice and cudaDeviceGetAttribute. This is inconsistent with the error handling pattern used elsewhere in this file (e.g., isCapturing at Line 115-116 uses FLASHINFER_CHECK). If either call fails, count may contain garbage, leading to incorrect kernel launch configurations.
🛡️ Proposed fix to add error checking

```diff
 inline int getMultiProcessorCount() {
   int device, count;
-  cudaGetDevice(&device);
-  cudaDeviceGetAttribute(&count, cudaDevAttrMultiProcessorCount, device);
+  FLASHINFER_CHECK(cudaGetDevice(&device) == cudaSuccess,
+                   "CUDA error in cudaGetDevice");
+  FLASHINFER_CHECK(cudaDeviceGetAttribute(&count, cudaDevAttrMultiProcessorCount, device) == cudaSuccess,
+                   "CUDA error in cudaDeviceGetAttribute");
   return count;
 }
```

📝 Committable suggestion
```cpp
inline int getMultiProcessorCount() {
  int device, count;
  FLASHINFER_CHECK(cudaGetDevice(&device) == cudaSuccess,
                   "CUDA error in cudaGetDevice");
  FLASHINFER_CHECK(cudaDeviceGetAttribute(&count, cudaDevAttrMultiProcessorCount, device) == cudaSuccess,
                   "CUDA error in cudaDeviceGetAttribute");
  return count;
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/flashinfer/trtllm/common/cudaUtils.h` around lines 272 - 277, The
getMultiProcessorCount function lacks CUDA error checking and can return
garbage; wrap the cudaGetDevice and cudaDeviceGetAttribute calls with the same
error-checking macro used elsewhere (e.g., FLASHINFER_CHECK) or check their
cudaError_t return values and handle failures before returning count. Update the
function (getMultiProcessorCount) to call FLASHINFER_CHECK(cudaGetDevice(...))
and FLASHINFER_CHECK(cudaDeviceGetAttribute(..., cudaDevAttrMultiProcessorCount,
device)) (or equivalent error handling) so failures are logged/handled and a
safe default or abort is used instead of returning an unchecked count.
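The check-every-call pattern generalizes beyond this one function. A host-side sketch (Python, with injected stand-ins for the CUDA API — `get_device`/`get_attribute` are hypothetical, not real bindings) shows why an unchecked status lets garbage propagate:

```python
class CudaError(RuntimeError):
    pass


def checked(status, what):
    """Mimic FLASHINFER_CHECK: raise on a non-success status instead of
    silently continuing with an undefined output value."""
    if status != 0:  # 0 plays the role of cudaSuccess
        raise CudaError(f"CUDA error in {what}: status {status}")


def get_multiprocessor_count(get_device, get_attribute):
    """Sketch of getMultiProcessorCount with both calls checked.
    Each stand-in returns (status, value), mirroring how the CUDA
    runtime returns an error code alongside an output parameter."""
    status, device = get_device()
    checked(status, "cudaGetDevice")
    status, count = get_attribute(device)
    checked(status, "cudaDeviceGetAttribute")
    return count
```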
```diff
 if (params.mPtrTopKIds != nullptr) {
   idx = params.mPtrTopKIds[expandedIdx];
 } else {
   // If params.mPtrTopKIds != nullptr, we don't need to store the weights
   if (params.mPtrTopKWeights != nullptr) {
     scoreIdx = params.mPtrTopKPacked[expandedIdx];
     idx = scoreIdx.idx;
     params.mPtrTopKWeights[expandedIdx] = static_cast<OutputT>(scoreIdx.score);
   }
 }
 // check whether this expert is local to our GPU at all and ignore if not
 auto localExpertIdx = idx - params.mLocalExpertsStartIdx;
 auto isLocalExpert = localExpertIdx >= 0 && localExpertIdx < localExpertExtent &&
-                     (localExpertIdx & params.mLocalExpertsStrideLog2) == 0;
+                     (localExpertIdx & ((1 << params.mLocalExpertsStrideLog2) - 1)) == 0;
 if (isLocalExpert) {
   atomicAdd(&smemExpertCount[idx], 1);
```
Always decode mPtrTopKPacked when mPtrTopKIds is null.
idx is needed for histogramming regardless of whether a standalone weight buffer is requested. In the packed-only path (mPtrTopKIds == nullptr && mPtrTopKWeights == nullptr), this branch leaves idx uninitialized and can send atomicAdd to a garbage expert slot, corrupting counts before the offsets stage runs.
🐛 Proposed fix

```diff
   if (params.mPtrTopKIds != nullptr) {
     idx = params.mPtrTopKIds[expandedIdx];
   } else {
-    if (params.mPtrTopKWeights != nullptr) {
-      scoreIdx = params.mPtrTopKPacked[expandedIdx];
-      idx = scoreIdx.idx;
+    scoreIdx = params.mPtrTopKPacked[expandedIdx];
+    idx = scoreIdx.idx;
+    if (params.mPtrTopKWeights != nullptr) {
       params.mPtrTopKWeights[expandedIdx] = static_cast<OutputT>(scoreIdx.score);
     }
   }
```
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh` around lines 440 -
453, The code paths that set idx only when mPtrTopKIds or mPtrTopKWeights are
present leave idx uninitialized when mPtrTopKIds==nullptr and
mPtrTopKWeights==nullptr; always decode params.mPtrTopKPacked[expandedIdx] into
scoreIdx and assign idx = scoreIdx.idx (and set mPtrTopKWeights if present)
before computing localExpertIdx/isLocalExpert so
atomicAdd(&smemExpertCount[idx], 1) uses a valid index; update the branch around
params.mPtrTopKIds/mPtrTopKWeights to unconditionally read params.mPtrTopKPacked
and populate idx first, then conditionally write params.mPtrTopKWeights.
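The control-flow bug is easy to reproduce in isolation. A Python sketch of the fixed branch (buffer names mirror the kernel params; `PackedScoreIdx` is an illustrative stand-in for the packed score/index struct): `idx` is always decoded from the packed buffer when `top_k_ids` is absent, and only the weight store stays conditional.

```python
from collections import namedtuple

PackedScoreIdx = namedtuple("PackedScoreIdx", ["score", "idx"])


def resolve_expert(expanded_idx, top_k_ids, top_k_packed, top_k_weights):
    """Return the expert index for one expanded slot. The packed buffer
    is decoded whenever top_k_ids is absent, regardless of whether a
    standalone weights buffer was requested, so the histogram index is
    never left undefined."""
    if top_k_ids is not None:
        return top_k_ids[expanded_idx]
    packed = top_k_packed[expanded_idx]
    if top_k_weights is not None:   # the weight store is optional ...
        top_k_weights[expanded_idx] = packed.score
    return packed.idx               # ... but the index never is
```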
```python
pytest.param(
    {
        "num_experts": 2048,
        "top_k": 32,
        "padding": 8,
        "n_groups": None,
        "top_k_groups": None,
        "routed_scaling": None,
        "has_routing_bias": False,
        "routing_method_type": RoutingMethodType.Renormalize,
        "compatible_moe_impls": [
            FP8BlockScaleMoe,
            FP4Moe,
            BF16Moe,
            MxInt4BlockScaleMoe,
        ],
        "compatible_intermediate_size": [384],
        "enable_autotune": True,
    },
    id="RoutingRenormalize_large_experts",
),
```
This 2048-expert case is too large for the full parametrized matrix.
With hidden_size=1024 and gated activations, this single config already allocates about 3 GiB for gemm1_weights plus 1.5 GiB for gemm2_weights in bf16, before quantized/shuffled copies or reference tensors. Because it also participates in the full num_tokens × zero_hidden_states × moe_impl × weight_processing × activation_type matrix with autotune enabled, it is very likely to become an OOM/timeout source on common CI GPUs. Please isolate the 2048/32 coverage into a small targeted smoke test and keep autotune off there.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/moe/test_trtllm_gen_fused_moe.py` around lines 2872 - 2892, This
2048-expert config (num_experts=2048, top_k=32, RoutingMethodType.Renormalize)
must be removed from the large parametrized matrix and isolated into a small
targeted smoke test with autotune disabled: create a new focused test that uses
the same dict but sets enable_autotune=False (and a distinct id like
"RoutingRenormalize_large_experts_smoke"), and remove or replace the
pytest.param entry from the full matrix so the heavy case no longer participates
in the combinatorial num_tokens × zero_hidden_states × moe_impl ×
weight_processing × activation_type runs.
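The GiB figures above follow directly from the tensor shapes; a quick back-of-the-envelope check (assuming hidden_size=1024 and the config's intermediate_size=384, bf16 storage, gated gemm1 — the shapes are the reviewer's stated assumptions, not read from the kernel code):

```python
num_experts = 2048
hidden_size, intermediate_size = 1024, 384  # from the test config
bf16_bytes = 2

# gemm1 is gated, so its weight is [num_experts, 2 * intermediate, hidden]
gemm1_bytes = num_experts * (2 * intermediate_size) * hidden_size * bf16_bytes
# gemm2 projects back: [num_experts, hidden, intermediate]
gemm2_bytes = num_experts * hidden_size * intermediate_size * bf16_bytes

print(gemm1_bytes / 2**30, gemm2_bytes / 2**30)  # prints: 3.0 1.5
```

That is ~4.5 GiB before quantized/shuffled copies or reference tensors, which is why the full parametrized matrix with autotune is risky on common CI GPUs.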
♻️ Duplicate comments (1)
csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu (1)

69-78: ⚠️ Potential issue | 🔴 Critical

Guard the optional `mPtrExpandedIdxToPermutedIdx` write.

`mPtrExpandedIdxToPermutedIdx` is optional, but the `expertIdx == -1` branch at line 77 dereferences it unconditionally. A caller that passes precomputed `mPtrTopKIds` with padded `-1` entries without requesting the expanded-index mapping will crash here. Note: the same pointer is correctly guarded later at lines 242-244.

Suggested fix

```diff
 } else {
-  params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
+  if (params.mPtrExpandedIdxToPermutedIdx != nullptr) {
+    params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
+  }
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu` around lines 69 - 78, The branch handling expertIdx == -1 writes to the optional buffer params.mPtrExpandedIdxToPermutedIdx unguarded; guard that write by checking params.mPtrExpandedIdxToPermutedIdx != nullptr before assigning int32_t{-1} (so when mPtrExpandedIdxToPermutedIdx is null we skip the write), keeping the existing behavior for smemKIdx and using the same null-check pattern as later code; locate this in the block that reads params.mPtrTopKIds, using warpIdx/laneIdx/MaxNumExperts to find the correct index.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu`:
- Around line 69-78: The branch handling expertIdx == -1 writes to the optional
buffer params.mPtrExpandedIdxToPermutedIdx unguarded; guard that write by
checking params.mPtrExpandedIdxToPermutedIdx != nullptr before assigning
int32_t{-1} (so when mPtrExpandedIdxToPermutedIdx is null we skip the write),
keeping the existing behavior for smemKIdx and using the same null-check pattern
as later code; locate this in the block that reads params.mPtrTopKIds, using
warpIdx/laneIdx/MaxNumExperts to find the correct index.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9b5e2c2d-a01d-435d-8308-22fdc910cbf7
📒 Files selected for processing (2)
- csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_renormalize.cu
|
/bot run |
|
[CANCELING] Pipeline #45886576: canceled |
Actionable comments posted: 2
♻️ Duplicate comments (5)
include/flashinfer/trtllm/common/cudaUtils.h (1)

272-277: ⚠️ Potential issue | 🟠 Major

Add CUDA error checks in `getMultiProcessorCount`.

Line 274 and Line 275 ignore CUDA return codes; on failure, `count` can be invalid and propagate bad launch-configuration decisions.

🔧 Proposed fix

```diff
 inline int getMultiProcessorCount() {
   int device, count;
-  cudaGetDevice(&device);
-  cudaDeviceGetAttribute(&count, cudaDevAttrMultiProcessorCount, device);
+  FLASHINFER_CHECK(cudaGetDevice(&device) == cudaSuccess,
+                   "CUDA error in cudaGetDevice");
+  FLASHINFER_CHECK(cudaDeviceGetAttribute(&count, cudaDevAttrMultiProcessorCount, device) == cudaSuccess,
+                   "CUDA error in cudaDeviceGetAttribute");
   return count;
 }
```

```bash
#!/bin/bash
# Verify whether CUDA calls in getMultiProcessorCount are guarded.
rg -n -A8 -B2 'inline int getMultiProcessorCount\(\)' include/flashinfer/trtllm/common/cudaUtils.h
rg -n 'FLASHINFER_CHECK\(cudaGetDevice|FLASHINFER_CHECK\(cudaDeviceGetAttribute' include/flashinfer/trtllm/common/cudaUtils.h
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@include/flashinfer/trtllm/common/cudaUtils.h` around lines 272 - 277, The function getMultiProcessorCount currently calls cudaGetDevice and cudaDeviceGetAttribute without checking return values; wrap those calls with the project's CUDA-check macro (e.g., FLASHINFER_CHECK or equivalent) or explicitly check their cudaError_t results, log or handle failures, and return a safe sentinel (e.g., 0 or -1) when an error occurs instead of returning an uninitialized count; update references to cudaGetDevice and cudaDeviceGetAttribute inside getMultiProcessorCount to use the chosen error-checking pattern so callers never receive an invalid count.

csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu (1)

246-247: ⚠️ Potential issue | 🟠 Major

Guard `mPtrPermutedIdxToExpandedIdx` writes with `isTokenRouted`.

At Line 246, writing reverse-map entries for `isLocalExpert` without requiring routing selection can overwrite valid slots for subsequent routed tokens. Mirror the guard used by the other reverse-map write.

💡 Proposed fix

```diff
-  if (params.mPtrPermutedIdxToExpandedIdx != nullptr && isLocalExpert) {
+  if (params.mPtrPermutedIdxToExpandedIdx != nullptr && isLocalExpert &&
+      isTokenRouted) {
     params.mPtrPermutedIdxToExpandedIdx[permutedIdx] = tokenIdx;
   }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu` around lines 246 - 247, The reverse-map write to params.mPtrPermutedIdxToExpandedIdx guarded only by isLocalExpert can overwrite slots for tokens that were not routed; update the guard to require both isLocalExpert and isTokenRouted (mirror the other reverse-map write) before assigning params.mPtrPermutedIdxToExpandedIdx[permutedIdx] = tokenIdx so only routed tokens create reverse-map entries (use the same condition structure as the other reverse-map write to locate and fix the assignment).

tests/moe/test_trtllm_gen_fused_moe.py (1)

2872-2892: ⚠️ Potential issue | 🟠 Major

Isolate this 2048-expert config into a targeted smoke test.

This config allocates ~5 GiB for weights alone (2048 experts × gated intermediate × hidden_size in bf16) before quantized copies. Since it participates in the full parametrized matrix with `enable_autotune=True`, it risks CI OOM/timeout on standard GPUs.

Consider:
- Move to a separate focused test with `enable_autotune=False`
- Limit to a single `moe_impl` and `weight_processing` combination
- Use the smallest `num_tokens` value

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/moe/test_trtllm_gen_fused_moe.py` around lines 2872 - 2892, This large 2048-expert pytest.param (id "RoutingRenormalize_large_experts") should be pulled out of the full parametrized matrix and converted into a focused smoke test to avoid CI OOMs: create a new dedicated test that uses this dict but set "enable_autotune": False, reduce "compatible_moe_impls" to a single implementation (pick one from the current list, e.g., BF16Moe), restrict any "weight_processing" combinations to a single option, and set the test's input size to the minimum by lowering "num_tokens" to the smallest allowed value; keep the original "id" for traceability and ensure the new test is run separately from the large parametrized suite.

csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu (1)

189-195: ⚠️ Potential issue | 🔴 Critical

Insert a global fence before completing the PDL launch.

Line 192 already calls out the problem: `__syncthreads()` only orders the CTA, but the dependent kernel consumes global metadata written above. Without a device/system fence before `cudaTriggerProgrammaticLaunchCompletion()`, the next kernel can observe stale `mPtrCtaIdxXyToBatchIdx`, `mPtrCtaIdxXyToMnLimit`, `mPtrPermutedIdxSize`, or `mPtrNumNonExitingCtas`.

Suggested fix

```diff
   if constexpr (KernelParams::UsePdl) {
+    __threadfence_system();
     cudaTriggerProgrammaticLaunchCompletion();
   }
```

According to the official NVIDIA CUDA documentation, does `cudaTriggerProgrammaticLaunchCompletion()` require an explicit `__threadfence_system()` (or another memory fence) to guarantee that prior global-memory writes are visible to the dependent kernel before programmatic launch completion?

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu` around lines 189 - 195, The programmatic launch completion lacks a system-wide memory fence: when KernelParams::UsePdl is true, insert a device/system fence (e.g., call __threadfence_system() or equivalent) immediately before the existing cudaTriggerProgrammaticLaunchCompletion() so that prior global writes to mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrPermutedIdxSize, mPtrNumNonExitingCtas, and mPtrTotalNumPaddedTokens are visible to the dependent kernel; update the block guarded by if constexpr (KernelParams::UsePdl) to call the global fence first, then call cudaTriggerProgrammaticLaunchCompletion().

csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu (1)

69-78: ⚠️ Potential issue | 🔴 Critical

Guard the optional expanded-index mapping write.

`mPtrExpandedIdxToPermutedIdx` is optional in this backend, but the `expertIdx == -1` branch dereferences it unconditionally. A top-k-id caller that omits that mapping buffer will hit a null device pointer here.

Suggested fix

```diff
   if (expertIdx != -1) {
     int offset = warpIdx * MaxNumExperts + expertIdx;
     smemKIdx[offset] = static_cast<int8_t>(laneIdx);
-  } else {
+  } else if (params.mPtrExpandedIdxToPermutedIdx != nullptr) {
     params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = int32_t{-1};
   }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu` around lines 69 - 78, The branch handling expertIdx == -1 writes unconditionally to the optional device pointer params.mPtrExpandedIdxToPermutedIdx, which can be null; modify the code in the launchBlockKernel logic so you first test if params.mPtrExpandedIdxToPermutedIdx != nullptr before performing the write (params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = -1); if the pointer is null simply skip that write (leave smemKIdx handling unchanged), ensuring all accesses to params.mPtrExpandedIdxToPermutedIdx are guarded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/launchClusterKernel.cu`:
- Around line 25-27: The __launch_bounds__ macro in the
routingIndicesClusterKernel definition uses KernelParams::MaxNumExperts
unconditionally which can exceed CUDA's 1024-thread-per-block limit; change the
launch bound to cap MaxNumExperts at 1024 (i.e., use a conditional expression
like KernelParams::MaxNumExperts <= 1024 ? KernelParams::MaxNumExperts : 1024)
for the __launch_bounds__ on the routingIndicesClusterKernel function; also
apply the same conditional cap fix to the identical usage in launchCoopKernel.cu
to prevent compile/runtime failures when MaxNumExperts > 1024.
In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu`:
- Around line 127-130: The isLocal predicate incorrectly checks localExpIdx <
params.mNumLocalExperts but must account for the strided extent when expert
interleaving is enabled; change the upper bound to (params.mNumLocalExperts <<
params.mLocalExpertsStrideLog2) (i.e., use the strided local-expert extent) so
that localExpIdx is compared against the true number of interleaved slots, and
update the same logic where it appears (the other occurrences around the blocks
referencing localExpIdx/isLocal at the later sites). Ensure you keep the
existing bitmask check for stride alignment ((localExpIdx & ((1 <<
params.mLocalExpertsStrideLog2) - 1)) == 0) and only replace the plain
mNumLocalExperts bound with the shifted value.
---
Duplicate comments:
In `@csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu`:
- Around line 189-195: The programmatic launch completion lacks a system-wide
memory fence: when KernelParams::UsePdl is true, insert a device/system fence
(e.g., call __threadfence_system() or equivalent) immediately before the
existing cudaTriggerProgrammaticLaunchCompletion() so that prior global writes
to mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrPermutedIdxSize,
mPtrNumNonExitingCtas, and mPtrTotalNumPaddedTokens are visible to the dependent
kernel; update the block guarded by if constexpr (KernelParams::UsePdl) to call
the global fence first, then call cudaTriggerProgrammaticLaunchCompletion().
In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu`:
- Around line 69-78: The branch handling expertIdx == -1 writes unconditionally
to the optional device pointer params.mPtrExpandedIdxToPermutedIdx, which can be
null; modify the code in the launchBlockKernel logic so you first test if
params.mPtrExpandedIdxToPermutedIdx != nullptr before performing the write
(params.mPtrExpandedIdxToPermutedIdx[warpIdx * params.mTopK + laneIdx] = -1); if
the pointer is null simply skip that write (leave smemKIdx handling unchanged),
ensuring all accesses to params.mPtrExpandedIdxToPermutedIdx are guarded.
In `@csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu`:
- Around line 246-247: The reverse-map write to
params.mPtrPermutedIdxToExpandedIdx guarded only by isLocalExpert can overwrite
slots for tokens that were not routed; update the guard to require both
isLocalExpert and isTokenRouted (mirror the other reverse-map write) before
assigning params.mPtrPermutedIdxToExpandedIdx[permutedIdx] = tokenIdx so only
routed tokens create reverse-map entries (use the same condition structure as
the other reverse-map write to locate and fix the assignment).
In `@include/flashinfer/trtllm/common/cudaUtils.h`:
- Around line 272-277: The function getMultiProcessorCount currently calls
cudaGetDevice and cudaDeviceGetAttribute without checking return values; wrap
those calls with the project's CUDA-check macro (e.g., FLASHINFER_CHECK or
equivalent) or explicitly check their cudaError_t results, log or handle
failures, and return a safe sentinel (e.g., 0 or -1) when an error occurs
instead of returning an uninitialized count; update references to cudaGetDevice
and cudaDeviceGetAttribute inside getMultiProcessorCount to use the chosen
error-checking pattern so callers never receive an invalid count.
In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Around line 2872-2892: This large 2048-expert pytest.param (id
"RoutingRenormalize_large_experts") should be pulled out of the full
parametrized matrix and converted into a focused smoke test to avoid CI OOMs:
create a new dedicated test that uses this dict but set "enable_autotune":
False, reduce "compatible_moe_impls" to a single implementation (pick one from
the current list, e.g., BF16Moe), restrict any "weight_processing" combinations
to a single option, and set the test's input size to the minimum by lowering
"num_tokens" to the smallest allowed value; keep the original "id" for
traceability and ensure the new test is run separately from the large
parametrized suite.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: df6f5495-6b6d-40ac-97d5-297d77567843
📒 Files selected for processing (32)
- csrc/fused_moe/trtllm_backend/routingDeepSeek/RoutingDeepSeekCommon.cuh
- csrc/fused_moe/trtllm_backend/routingDeepSeek/launchClusterKernel.cu
- csrc/fused_moe/trtllm_backend/routingDeepSeek/launchCoopKernel.cu
- csrc/fused_moe/trtllm_backend/routingDeepSeek/launchHistogramKernel.cu
- csrc/fused_moe/trtllm_backend/routingDeepSeek/launchInitExpertCounts.cu
- csrc/fused_moe/trtllm_backend/routingDeepSeek/launchMainKernel.cu
- csrc/fused_moe/trtllm_backend/routingDeepSeek/launchOffsetsKernel.cu
- csrc/fused_moe/trtllm_backend/routingRenormalize/RoutingRenormalizeCommon.cuh
- csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu
- csrc/fused_moe/trtllm_backend/routingRenormalize/launchClusterKernel.cu
- csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramKernel.cu
- csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramScoresKernel.cu
- csrc/fused_moe/trtllm_backend/routingRenormalize/launchInitExpertCounts.cu
- csrc/fused_moe/trtllm_backend/routingRenormalize/launchOffsetsKernel.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_dev_kernel.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_llama4.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_renormalize.cu
- csrc/moe_utils_binding.cu
- csrc/trtllm_fused_moe_kernel_launcher.cu
- csrc/trtllm_fused_moe_routing_deepseek.cu
- csrc/trtllm_fused_moe_routing_renormalize.cu
- csrc/trtllm_fused_moe_runner.cu
- flashinfer/jit/cpp_ext.py
- flashinfer/jit/fused_moe.py
- flashinfer/jit/moe_utils.py
- include/flashinfer/trtllm/common/cudaUtils.h
- include/flashinfer/trtllm/fused_moe/DevKernel.h
- include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh
- include/flashinfer/trtllm/fused_moe/RoutingKernel.h
- include/flashinfer/trtllm/fused_moe/RoutingKernelTopK.cuh
- tests/moe/test_trtllm_gen_fused_moe.py
💤 Files with no reviewable changes (3)
- csrc/moe_utils_binding.cu
- csrc/trtllm_fused_moe_routing_renormalize.cu
- csrc/trtllm_fused_moe_routing_deepseek.cu
🚧 Files skipped from review as they are similar to previous changes (7)
- csrc/trtllm_fused_moe_kernel_launcher.cu
- csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramScoresKernel.cu
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_dev_kernel.cu
- csrc/fused_moe/trtllm_backend/routingRenormalize/launchHistogramKernel.cu
- csrc/trtllm_fused_moe_runner.cu
- flashinfer/jit/cpp_ext.py
- csrc/fused_moe/trtllm_backend/trtllm_fused_moe_routing_deepseek.cu
```cpp
int expert = threadIdx.x * ExpertsPerThread + e;
auto localExpIdx = expert - params.mLocalExpertsStartIdx;
auto isLocal = localExpIdx >= 0 && localExpIdx < params.mNumLocalExperts &&
               (localExpIdx & ((1 << params.mLocalExpertsStrideLog2) - 1)) == 0;
```
Use the strided local-expert extent here.
These predicates use localExpIdx < params.mNumLocalExperts, but the other routing paths already treat the local range as mNumLocalExperts << mLocalExpertsStrideLog2. With expert interleaving enabled, this block path will misclassify valid local experts as remote and produce wrong counts, CTA mappings, and permuted indices.
Suggested fix
```diff
+  auto localExpertExtent = params.mNumLocalExperts << params.mLocalExpertsStrideLog2;
   int accExpertCount[ExpertsPerThread];
 #pragma unroll
   for (int e = 0; e < ExpertsPerThread; e++) {
     int expert = threadIdx.x * ExpertsPerThread + e;
     auto localExpIdx = expert - params.mLocalExpertsStartIdx;
-    auto isLocal = localExpIdx >= 0 && localExpIdx < params.mNumLocalExperts &&
+    auto isLocal = localExpIdx >= 0 && localExpIdx < localExpertExtent &&
                    (localExpIdx & ((1 << params.mLocalExpertsStrideLog2) - 1)) == 0;
@@
     int expert = threadIdx.x * ExpertsPerThread + e;
     auto localExpIdx = expert - params.mLocalExpertsStartIdx;
-    auto isLocal = localExpIdx >= 0 && localExpIdx < params.mNumLocalExperts &&
+    auto isLocal = localExpIdx >= 0 && localExpIdx < localExpertExtent &&
                    (localExpIdx & ((1 << params.mLocalExpertsStrideLog2) - 1)) == 0;
@@
     if (smemKIdx[offset] >= 0) {
       auto localExpIdx = expert - params.mLocalExpertsStartIdx;
-      auto isLocal = localExpIdx >= 0 && localExpIdx < params.mNumLocalExperts &&
+      auto isLocal = localExpIdx >= 0 && localExpIdx < localExpertExtent &&
                      (localExpIdx & ((1 << params.mLocalExpertsStrideLog2) - 1)) == 0;
```

Also applies to: 183-185, 233-235
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@csrc/fused_moe/trtllm_backend/routingRenormalize/launchBlockKernel.cu` around
lines 127 - 130, The isLocal predicate incorrectly checks localExpIdx <
params.mNumLocalExperts but must account for the strided extent when expert
interleaving is enabled; change the upper bound to (params.mNumLocalExperts <<
params.mLocalExpertsStrideLog2) (i.e., use the strided local-expert extent) so
that localExpIdx is compared against the true number of interleaved slots, and
update the same logic where it appears (the other occurrences around the blocks
referencing localExpIdx/isLocal at the later sites). Ensure you keep the
existing bitmask check for stride alignment ((localExpIdx & ((1 <<
params.mLocalExpertsStrideLog2) - 1)) == 0) and only replace the plain
mNumLocalExperts bound with the shifted value.
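Why the bound must be shifted is easiest to see with concrete numbers. A Python sketch of the membership predicate (illustrative parameter names mirroring the kernel fields; the `use_strided_extent` flag only exists here to contrast the buggy and fixed bounds):

```python
def is_local_expert(expert, start_idx, num_local, stride_log2,
                    use_strided_extent=True):
    """Membership test for strided expert placement. With interleaving,
    num_local local experts occupy an index extent of
    num_local << stride_log2, so the upper bound must be the shifted
    extent, not num_local itself."""
    local_idx = expert - start_idx
    extent = num_local << stride_log2 if use_strided_extent else num_local
    return (0 <= local_idx < extent
            and (local_idx & ((1 << stride_log2) - 1)) == 0)

# With 4 local experts at stride 2 starting at index 0, the local set is
# {0, 2, 4, 6}; the unshifted bound misclassifies experts 4 and 6 as remote.
```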
|
/bot run |
|
[FAILED] Pipeline #46225337: 9/20 passed |
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com> Made-with: Cursor
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

- Integrate NVIDIA/TensorRT-LLM#11510 to support 2048 num of experts and 32 TopK in renormalize
- Refactor MOE cu files

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **New Features**
  * Expanded MoE routing/renormalize to support up to 2,048 experts and top-k up to 32; backend reorganized to enable larger configurations.
* **Bug Fixes**
  * Clamped token counts in kernel launches to prevent oversized grid launches.
* **Performance**
  * Reworked routing/launch paths for improved scalability and throughput with large expert/top-k settings.
* **Tests**
  * Added test scenarios covering large-expert (2,048) + top-k (32) configurations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
<!-- .github/pull_request_template.md --> ## 📌 Description - Integrate NVIDIA/TensorRT-LLM#11510 to support 2048 num of experts and 32 TopK in renormalize - Refactor MOE cu files ## 🔍 Related Issues <!-- Link any related issues here --> ## 🚀 Pull Request Checklist Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete. ### ✅ Pre-commit Checks - [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method). - [x] I have installed the hooks with `pre-commit install`. - [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues. > If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/). ## 🧪 Tests - [x] Tests have been added or updated as needed. - [x] All tests are passing (`unittest`, etc.). ## Reviewer Notes <!-- Optional: anything you'd like reviewers to focus on, concerns, etc. --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Expanded MoE routing/renormalize to support up to 2,048 experts and top-k up to 32; backend reorganized to enable larger configurations. * **Bug Fixes** * Clamped token counts in kernel launches to prevent oversized grid launches. * **Performance** * Reworked routing/launch paths for improved scalability and throughput with large expert/top-k settings. * **Tests** * Added test scenarios covering large-expert (2,048) + top-k (32) configurations. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com> Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com>
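For reviewers unfamiliar with the routing path: the renormalize method this PR extends computes a softmax over the expert logits, keeps the top-k experts per token, and renormalizes the selected weights to sum to 1. The sketch below is a host-side NumPy reference of that math under the new limits (2048 experts, top-k = 32) — function and variable names are illustrative, not the CUDA kernel implementation:

```python
import numpy as np

def renormalize_topk_routing(logits, top_k):
    """Reference for renormalize MoE routing: softmax over experts,
    select top_k per token, renormalize selected weights to sum to 1."""
    # numerically stable softmax over the expert dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)
    # indices of the top_k highest-probability experts per token
    idx = np.argpartition(probs, -top_k, axis=-1)[:, -top_k:]
    weights = np.take_along_axis(probs, idx, axis=-1)
    # renormalize the selected weights so each token's weights sum to 1
    weights /= weights.sum(axis=-1, keepdims=True)
    return idx, weights

num_experts, top_k = 2048, 32  # the new limits enabled by this PR
logits = np.random.randn(4, num_experts).astype(np.float32)
idx, w = renormalize_topk_routing(logits, top_k)
assert idx.shape == (4, top_k)
assert np.allclose(w.sum(axis=-1), 1.0, atol=1e-5)
```

The kernels implement the same selection/renormalization with the backend-specific launchers added in this PR; this reference is only meant to pin down the expected numerics.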