[None][fix] support topk autotuner input for expert slot per group larger than 32 #9087
Conversation
/bot run --disable-fail-fast
PR_Github #24267 [ run ] triggered by Bot. Commit:
/bot kill
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
23f9e2a to efee075
PR_Github #24271 [ kill ] triggered by Bot. Commit:
PR_Github #24267 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #24271 [ kill ] completed with state
📝 Walkthrough

Routing constraint validation in the DeepSeek kernel is deferred to conditional execution paths rather than unconditional checks. A new helper function manages dummy top-k tensor creation for MOE runners in attention DP scenarios via hook-based adaptation. The test suite is extended with a new parameterized test case.
Sequence Diagram(s)

sequenceDiagram
participant Tuner as Tuner/Profiler
participant Runner as MOE Runner
participant Hook as Input Hook
rect rgb(220, 240, 255)
Note over Tuner,Hook: Prepare Phase
Tuner->>Runner: Call prepare_dummy_topk_and_hook()
Runner->>Runner: Detect attention DP scenario
alt Attention DP Mode
Runner->>Runner: Create dummy topk_weights/topk_ids
Runner->>Runner: Suppress routing_logits
else Normal Routing
Runner->>Runner: Reuse provided topk tensors
end
Runner->>Hook: Register input hook for shape changes
Runner-->>Tuner: Return (routing_logits, topk_weights, topk_ids, tuning_config_with_hook)
end
rect rgb(240, 255, 240)
Note over Tuner,Hook: Tuning Phase (may reshape)
Tuner->>Hook: Token count changes during profiling
Hook->>Hook: Trigger recreate_dummy_topk_if_needed()
Hook->>Hook: Rebuild dummy tensors for new shape
end
rect rgb(255, 240, 240)
Note over Runner: Execution Phase
Runner->>Runner: Replace dummy tensors with actual routing/topk inputs
end
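For illustration, a minimal Python sketch of the flow in the diagram is shown below. The helper and hook names come from the diagram; the signature, the SimpleTuningConfig placeholder, and the tensor layouts are assumptions for this sketch, not the actual implementation in tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py.

```python
# Hedged sketch only: mirrors the prepare/tune flow above under assumed shapes
# (top-k tensors of shape [num_tokens, top_k]); not the real TensorRT-LLM API.
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple

import torch


@dataclass
class SimpleTuningConfig:
    """Placeholder for the real TuningConfig; only the hook field is modeled."""
    inputs_pre_hook: Optional[Callable[[int], Tuple[torch.Tensor, torch.Tensor]]] = None


def make_dummy_topk(num_tokens: int, top_k: int, num_experts: int,
                    device: torch.device) -> Tuple[torch.Tensor, torch.Tensor]:
    # Values are irrelevant for profiling; only shapes and dtypes matter.
    topk_weights = torch.full((num_tokens, top_k), 1.0 / top_k,
                              dtype=torch.bfloat16, device=device)
    topk_ids = (torch.arange(num_tokens * top_k, device=device)
                % num_experts).to(torch.int32).view(num_tokens, top_k)
    return topk_weights, topk_ids


def prepare_dummy_topk_and_hook(routing_logits, topk_weights, topk_ids,
                                hidden_states, top_k, num_experts,
                                tuning_config: SimpleTuningConfig,
                                is_attention_dp: bool):
    """Return (routing_logits, topk_weights, topk_ids, tuning_config) for tuning."""
    device = hidden_states.device

    def recreate_dummy_topk_if_needed(num_tokens: int):
        # Called whenever the tuner reshapes the token dimension during profiling.
        return make_dummy_topk(num_tokens, top_k, num_experts, device)

    if is_attention_dp:
        # Attention-DP: suppress routing logits and profile with dummy top-k tensors.
        routing_logits = None
        topk_weights, topk_ids = recreate_dummy_topk_if_needed(hidden_states.shape[0])
        tuning_config = replace(tuning_config,
                                inputs_pre_hook=recreate_dummy_topk_if_needed)
    # Normal routing: reuse the provided top-k tensors unchanged.
    return routing_logits, topk_weights, topk_ids, tuning_config
```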
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~35 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 2
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (2 hunks)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (14 hunks)
tests/unittest/_torch/thop/parallel/test_moe.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
tests/unittest/_torch/thop/parallel/test_moe.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
tests/unittest/_torch/thop/parallel/test_moe.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tests/unittest/_torch/thop/parallel/test_moe.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
🧠 Learnings (9)
📓 Common learnings
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Applied to files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
Repo: NVIDIA/TensorRT-LLM PR: 7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
Applied to files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
Applied to files:
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
🧬 Code graph analysis (2)
tests/unittest/_torch/thop/parallel/test_moe.py (3)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (8)
param (47-152), param (47-47), param (154-165), param (154-154), param (167-176), param (167-167), param (203-209), param (203-204)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
RoutingMethodType (39-131)
tensorrt_llm/_torch/modules/fused_moe/routing.py (1)
RoutingMethodType (143-155)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (5)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (2)
routing_logits (498-508), routing_logits (580-594)
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (2)
routing_logits (432-442), routing_logits (513-523)
tensorrt_llm/_torch/autotuner.py (2)
TuningConfig (53-101), choose_one (623-778)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (3)
top_k (240-240), num_experts (233-233), local_num_experts (247-247)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
get_tuning_config (675-686)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (11)
tests/unittest/_torch/thop/parallel/test_moe.py (3)
1181-1210: LGTM! Well-structured test for EPLB 288 use case.
The test appropriately exercises the new capability to support expert configurations with >32 experts per group when using topk input. The parameterization covers a good range of token counts (1, 256, 1024) to validate both small and large batch scenarios. A rough sketch of such a case is shown below.
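The following is a hypothetical skeleton only; the real test lives at tests/unittest/_torch/thop/parallel/test_moe.py:1181-1210, and the names and structure here are assumptions, not the actual test code.

```python
# Hypothetical shape of the new parameterized case; the real test exercises the
# full TRTLLM-Gen MOE runner, which is omitted here.
import pytest


@pytest.mark.parametrize("num_tokens", [1, 256, 1024])
def test_eplb_288_topk_input(num_tokens):
    num_experts, n_group, top_k = 288, 8, 8  # EPLB 288 configuration
    experts_per_group = num_experts // n_group
    # 36 experts per group exceeds the 32-thread warp limit enforced by the
    # routing kernel, so topk ids/weights must be provided directly.
    assert experts_per_group == 36 and experts_per_group > 32
    # ... construct hidden_states and precomputed topk_ids/topk_weights for
    # num_tokens tokens, then invoke the MOE op with topk input enabled ...
```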
1242-1245: Appropriate validation for topk input mode.
The guard correctly restricts use_topk_as_input testing to the DeepSeekV3 routing method, aligning with the implementation scope.
1502-1505: Consistent validation across test variants.
Good consistency applying the same routing method constraint in the FP8-FP4 test variant.
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (2)
616-619: Clear documentation of deferred validation strategy.
The comment effectively explains the conditional validation approach, and correctly retains the fundamental mNumExperts % 4 constraint as unconditional since it applies to the broader MOE pipeline.
650-676: Well-designed conditional validation enables EPLB 288 support.
The deferred validation logic correctly distinguishes between:
- Routing execution path (when mPtrTopKIds == nullptr): validates routing-specific constraints, including experts per group ≤ WarpSize.
- Pre-computed topk path (when mPtrTopKIds != nullptr): skips routing constraints since the routing kernel won't execute.
This enables configurations like EPLB 288 (288 experts / 8 groups = 36 experts per group > 32) when topk IDs/weights are provided directly, which is precisely the use case described in the PR objectives. A rough rendering of this branch logic follows.
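For illustration only, a Python rendering of the branch logic described above; the real checks are C++ assertions in RoutingDeepSeek.cu, and the member names are mirrored loosely, not copied from the kernel.

```python
WARP_SIZE = 32


def check_routing_constraints(num_experts: int, n_group: int,
                              topk_ids_provided: bool) -> None:
    # This constraint applies to the whole MOE pipeline, so it stays unconditional.
    assert num_experts % 4 == 0

    if not topk_ids_provided:
        # Routing kernel will execute (mPtrTopKIds == nullptr in the CUDA code):
        # each expert group must fit within one warp.
        assert num_experts // n_group <= WARP_SIZE
    # Otherwise (mPtrTopKIds != nullptr) the routing kernel is skipped, so
    # configurations like 288 experts / 8 groups = 36 per group are allowed.
```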
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (6)
1-1: LGTM: Import addition is appropriate.
The replace import is correctly added and used to create a modified TuningConfig with the hook.
16-123: Well-structured helper function with minor dtype consideration.
The function effectively centralizes dummy tensor creation logic for AutoTuner profiling. The hook-based approach to handle shape changes during profiling is appropriate.
However, note that dummy tensors use a hardcoded torch.bfloat16 dtype (lines 57, 76, 97, 113). Consider whether this matches the actual dtype of tensors in all scenarios, or if it should be derived from hidden_states.dtype or routing_logits.dtype where applicable. While this may be acceptable if the AutoTuner only uses these tensors for shape/kernel selection rather than actual computation, please verify this assumption holds.
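If the dtype were derived rather than hardcoded, a sketch along these lines could work; the helper name and the exact fallback are assumptions, and whether this is needed depends on how the AutoTuner actually uses the tensors.

```python
import torch


def dummy_topk_weights(num_tokens: int, top_k: int,
                       hidden_states: torch.Tensor) -> torch.Tensor:
    # Follow the activation dtype when it is floating point; otherwise fall back
    # to bfloat16 as the current code does.
    dtype = (hidden_states.dtype if hidden_states.dtype.is_floating_point
             else torch.bfloat16)
    return torch.full((num_tokens, top_k), 1.0 / top_k,
                      dtype=dtype, device=hidden_states.device)
```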
380-437: LGTM: Clean refactor with proper separation of tuning and execution inputs.
The refactor correctly uses the helper function and creates a fresh input_tensors list for final execution (lines 420-434), properly replacing dummy values with actual parameters.
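The pattern being endorsed here, sketched in isolation; the variable names and list positions are illustrative assumptions, and the actual list contents in trtllm_gen_custom_ops.py differ.

```python
def build_execution_inputs(tuner_inputs, routing_logits, topk_weights, topk_ids):
    # Start from a copy so the list handed to the AutoTuner (and possibly cached
    # by it) is never mutated, then swap the dummy entries for the real inputs.
    inputs = list(tuner_inputs)
    inputs[0] = routing_logits   # real routing logits (or None) replace the dummy
    inputs[1] = topk_weights     # caller-provided top-k weights
    inputs[2] = topk_ids         # caller-provided top-k ids
    return inputs
```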
1054-1058: Same aliasing issue as in fp8_block_scale_moe_runner.
Lines 1054-1058 have the same in-place modification pattern that could corrupt tuner state. Apply the same fix as suggested for fp8_block_scale_moe_runner (create a fresh list instead of aliasing).
⛔ Skipped due to learnings
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
1566-1571: Same aliasing issue: in-place modification of tuner inputs.
Lines 1567-1571 have the same problematic pattern. Apply the same fix as suggested for other runners.
⛔ Skipped due to learnings
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
1315-1320: Same aliasing issue: in-place modification of tuner inputs.
Despite the clarifying comment at line 1315, lines 1316-1320 still alias and mutate input_tensors_for_tuner. Apply the same fix as suggested earlier.
⛔ Skipped due to learnings
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
PR_Github #24274 [ run ] triggered by Bot. Commit:
PR_Github #24274 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #24299 [ run ] triggered by Bot. Commit:
PR_Github #24299 [ run ] completed with state
hyukn left a comment
LGTM. But note that these shapes are exactly what is stored in the profiling cache, and at inference time they are the shapes we expect to get from the real inputs. Otherwise the cache will miss, and a perf drop will occur due to the use of fallback tactics.
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Thanks @hyukn for your comments and help. After offline discussion, a constraint was added for the topk input to fix the cache-miss issue.
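For context, the idea behind such a constraint can be sketched as follows; this is a hedged illustration only, and the constraint signature and shape-dict keys below are assumptions rather than the actual AutoTuner interface.

```python
def constrain_topk_shapes(profile_shapes: dict) -> dict:
    # Tie the dummy top-k tensor shapes to the token count being profiled, so the
    # shapes recorded in the profiling cache are exactly the shapes the real
    # topk_weights/topk_ids will have at inference time (avoiding cache misses
    # and fallback tactics).
    num_tokens = profile_shapes["hidden_states"][0]
    top_k = profile_shapes["topk_ids"][1]
    profile_shapes["topk_ids"] = (num_tokens, top_k)
    profile_shapes["topk_weights"] = (num_tokens, top_k)
    return profile_shapes
```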
/bot run --disable-fail-fast
PR_Github #24403 [ run ] triggered by Bot. Commit:
PR_Github #24403 [ run ] completed with state
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
/bot run --disable-fail-fast
PR_Github #24466 [ run ] triggered by Bot. Commit:
PR_Github #24466 [ run ] completed with state
…rger than 32 (NVIDIA#9087) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
…rger than 32 (NVIDIA#9087) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Summary by CodeRabbit
Release Notes
Description
PR 8886 enables EPLB for the TRTLLM-Gen backend, but the TRTLLM-Gen backend does not support the EPLB 288 case when using the AutoTuner.
This PR adds support for the EPLB 288 case in the TRTLLM-Gen backend for DeepSeek-R1.
Thanks @ChristinaZ for helping investigate this issue.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.