
Conversation

@sklevtsov-nvidia
Contributor

@sklevtsov-nvidia sklevtsov-nvidia commented Apr 4, 2025

This MR adds a new EVT-based implementation of GEMM2+Finalize fusion that supports both Blackwell and Hopper.

Using trtllm-bench --backend pytorch --model mistralai/Mixtral-8x7B-Instruct-v0.1 with --input-mean 1024 --output-mean 512 --num-requests 2048 on a single B200:

  • before (no fusion):
Request Throughput (req/sec):                     24.9259
Total Output Throughput (tokens/sec):             12762.0376
Per User Output Throughput (tokens/sec/user):     8.5910
Per GPU Output Throughput (tokens/sec/gpu):       12762.0376
Total Latency (ms):                               82163.6824
Average request latency (ms):                     62291.9844
  • after (with fusion):
Request Throughput (req/sec):                     25.9745
Total Output Throughput (tokens/sec):             13298.9388
Per User Output Throughput (tokens/sec/user):     8.9563
Per GPU Output Throughput (tokens/sec/gpu):       13298.9388
Total Latency (ms):                               78846.5916
Average request latency (ms):                     59856.6157

Summary by CodeRabbit

  • New Features

    • Added configuration options in Python and C++ APIs to enable or disable fused finalize optimization in Mixture-of-Experts (MoE) layers.
    • Introduced new environment variable and model configuration fields for controlling finalize fusion behavior.
    • Added support for routing scales and row permutation mappings in MoE kernel operations.
  • Enhancements

    • Improved flexibility in MoE kernel fusion, allowing users to recover deterministic numerical behavior when needed.
    • Updated internal logic and APIs to support additional routing and permutation parameters in MoE operations.
    • Modernized epilogue fusion handling and updated launcher code for better architecture detection and scheduling.
    • Simplified and unified stride and pointer management in fused finalize epilogue structures.
    • Refined supported tile shapes for SM100 architecture and optimized kernel dispatch logic.
  • Bug Fixes

    • Adjusted test tolerances for NVFP4 quantized fused MoE to improve reliability.
    • Corrected spelling in test variable names related to deterministic reduction.
  • Documentation

    • Added descriptions for new configuration fields related to MoE fusion options.
  • Chores

    • Updated internal library version hashes and checksums for consistency.

@sklevtsov-nvidia sklevtsov-nvidia force-pushed the feature/moe-fused-finalize branch from 099c239 to 299ea9e on April 4, 2025 22:28
@djns99 djns99 self-requested a review April 7, 2025 03:57
Collaborator

@djns99 djns99 left a comment


LGTM thanks! See my comments on the internal MR

@sklevtsov-nvidia sklevtsov-nvidia force-pushed the feature/moe-fused-finalize branch from 299ea9e to a281d1a on April 12, 2025 02:26
@sklevtsov-nvidia sklevtsov-nvidia marked this pull request as ready for review April 16, 2025 03:46
@sklevtsov-nvidia sklevtsov-nvidia force-pushed the feature/moe-fused-finalize branch from 7f16376 to 14d0868 on April 16, 2025 21:49
@sklevtsov-nvidia
Contributor Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #2520 [ run ] triggered by Bot

@sklevtsov-nvidia sklevtsov-nvidia force-pushed the feature/moe-fused-finalize branch from 14d0868 to a88ff47 on April 17, 2025 00:26
@sklevtsov-nvidia
Contributor Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #2534 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2520 [ run ] completed with state ABORTED
/LLM/main/L0_MergeRequest_PR pipeline #1807 completed with status: 'FAILURE'

@tensorrt-cicd
Collaborator

PR_Github #2534 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1816 completed with status: 'FAILURE'

@sklevtsov-nvidia sklevtsov-nvidia force-pushed the feature/moe-fused-finalize branch from a88ff47 to 45df19e on April 17, 2025 20:42
@sklevtsov-nvidia
Contributor Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #2684 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2684 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1914 completed with status: 'FAILURE'

@sklevtsov-nvidia
Contributor Author

/bot help

@github-actions

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@sklevtsov-nvidia
Contributor Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #2706 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2706 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1930 completed with status: 'SUCCESS'

@sklevtsov-nvidia sklevtsov-nvidia force-pushed the feature/moe-fused-finalize branch from 45df19e to 9d6d482 on April 19, 2025 02:04
@sklevtsov-nvidia
Contributor Author

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #2808 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2808 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #2706 for commit 9d6d482

@sklevtsov-nvidia sklevtsov-nvidia force-pushed the feature/moe-fused-finalize branch 3 times, most recently from 35828ee to 793c626 on May 7, 2025 18:33
@sklevtsov-nvidia sklevtsov-nvidia force-pushed the feature/moe-fused-finalize branch from ab14f1e to 4443cf9 on August 8, 2025 20:12
@sklevtsov-nvidia
Contributor Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #14643 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (1)

263-265: Macro line still >120 chars – please reflow
Same E501 violation as highlighted earlier; break the argument list over several lines for readability.

🧹 Nitpick comments (7)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl (2)

433-456: Workspace shape mock is pragmatic; document and guard

You already note this is relying on CUTLASS internals. Please:

  • Add a brief comment explaining that only group_count is used by get_workspace_size in current CUTLASS and that can_implement later guards correctness.
  • Optionally assert num_experts > 0 here.
-                typename TmaWarpSpecializedGroupedGemmInput::ProblemShape shape_info{num_experts, nullptr, nullptr};
+                // get_workspace_size only uses group_count in current CUTLASS; arguments are a minimal mock.
+                TLLM_CHECK(num_experts > 0);
+                typename TmaWarpSpecializedGroupedGemmInput::ProblemShape shape_info{num_experts, nullptr, nullptr};

405-426: Remove commented-out legacy type plumbing

The large commented block adds noise and risks bitrot. Please delete it; git history preserves it.

-            /*};
-            //
-            //        using namespace cute;
-            //        using GemmInfo = TmaWarpSpecializedGroupedGemmInfo;<ArchTag, T, WeightType, OutputType,
-            EpilogueTag,
-            //        MmaTileShape,
-            //            ClusterShape, BIAS, FUSION>;
-            //
-            //        using ElementAccumulator = typename GemmInfo::ElementAccumulator;
-            //        using ElementA = typename GemmInfo::ElementA;
-            //        using ElementB = typename GemmInfo::ElementB;
-            //        using ElementC = typename GemmInfo::ElementC;
-            //        using ElementCSafe = typename GemmInfo::ElementCSafe;
-            //        using ElementD = typename GemmInfo::ElementD;
-            //        using ElementFinalOutput = typename GemmInfo::ElementFinalOutput;
-            //        using ElementBias = typename GemmInfo::ElementBias;
-            //
-            //        using CollectiveMainloop = typename GemmInfo::CollectiveMainloop;
-            //        using CollectiveEpilogue = typename GemmInfo::CollectiveEpilogue;
-            //        using GemmKernel = typename GemmInfo::GemmKernel;
-            //        using GemmGrouped = typename GemmInfo::GemmGrouped;*/
+            // (removed: legacy alternative type plumbing retained by VCS history)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp (2)

270-273: Remove unused local tensors gIdx/gIdx_epi

These locals are computed but never used. Please remove to reduce register pressure and improve readability.

-    Tensor mIdx = make_tensor(params_ptr->ptr_index[l], make_shape(M,N,Int<1>{}), StrideIndex{});                           // (M,N,_1)
-    Tensor gIdx = local_tile(mIdx, take<0,2>(args.tile_shape_mnk), make_coord(m,n,Int<0>{}));                               // (CTA_M,CTA_N)
-    Tensor gIdx_epi = flat_divide(gIdx, args.epi_tile);                                                                     // (EPI_TILE_M,EPI_TILE_N,EPI_M,EPI_N)
+    // Index layout tensor not needed beyond gather function; avoid unused locals.

214-252: Reduction op selection is robust; consider documenting fallback

The typed atomic add fallback for non-(half/bfloat16) is sensible. Add a brief comment noting this path trades off performance for generality until vectorized reductions are supported for more types.
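For illustration only, a minimal sketch of the documented trade-off, assuming a hypothetical helper name; in the actual epilogue the half/bfloat16 path uses vectorized red.global stores, so only the generic fallback is sketched here:

// Sketch (hypothetical helper, not the file's actual code): for output types
// other than half/bfloat16 the epilogue falls back to a plain typed atomicAdd
// per element. This trades performance (no vectorized red.global reduction)
// for generality until vectorized reductions are supported for more types.
template <class Element>
__device__ inline void scatter_accumulate_fallback(Element* dst, Element val)
{
    atomicAdd(dst, val);
}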

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (2)

2-2: Update header year to include 2025

Per coding guidelines, headers should include the current year.

- * Copyright (c) 2020-2023, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION.  All rights reserved.

234-235: Simplified setFinalizeFusionParams signature—document parameter semantics

Please update any comments/docs to reflect the following (a hedged doc-comment sketch follows the list):

  • final_output is the scatter target buffer
  • hidden_size maps to num_rows_in_final_output (modulo)
  • num_output_tokens selects the group count
  • use_reduction toggles red.global vs st.global
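A hedged sketch of the kind of doc comment implied by these bullets; the parameter names and types below are assumptions mirroring the bullets, not the actual header:

// Sets up the fused finalize epilogue (sketch only; names and types are assumed).
//   final_output      : the scatter target buffer
//   hidden_size       : maps to num_rows_in_final_output (modulo)
//   num_output_tokens : selects the group count
//   use_reduction     : toggles red.global (atomic accumulate) vs st.global (plain store)
void setFinalizeFusionParams(void* final_output, int hidden_size, int num_output_tokens, bool use_reduction);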
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)

3311-3313: Logic coupling makes has_different_output_type_tma_ws fragile

has_different_output_type_tma_ws is defined as
!using_fused_finalize && using_tma_ws_gemm2.

When using_fused_finalize is later forced off for some edge-case (e.g. LoRA or W4-groupwise), this flag silently flips and the kernel selection drops back to the slow finalize path. Consider factoring the three booleans into a single helper (e.g. shouldRunSeparateFinalize()), to keep the condition consistent and easier to audit.
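A minimal sketch of the suggested helper; the name shouldRunSeparateFinalize and the two flags come from this comment, everything else is an assumption:

// Sketch: keep the finalize-path decision in one place so the flags cannot drift apart.
bool shouldRunSeparateFinalize() const
{
    // The standalone finalize kernel is only needed when GEMM2 runs on the
    // TMA warp-specialized path but the fused finalize epilogue is not used.
    return using_tma_ws_gemm2 && !using_fused_finalize;
}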

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ab14f1e and 4443cf9.

⛔ Files ignored due to path filters (2)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/tensorrt_llm_internal_cutlass_kernels_static.tar.xz is excluded by !**/*.xz
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/tensorrt_llm_internal_cutlass_kernels_static.tar.xz is excluded by !**/*.xz
📒 Files selected for processing (32)
  • cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
  • cpp/tensorrt_llm/common/envUtils.h (1 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/collective/epilogue_moe_finalize.hpp (0 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp (1 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (0 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/util/gather_tensor.hpp (3 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (5 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (6 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl (10 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_mixed_input_launcher.inl (2 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (2 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu (6 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (16 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (2 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (9 hunks)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt (1 hunks)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h (5 hunks)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (13 hunks)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt (1 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1 hunks)
  • cpp/tensorrt_llm/thop/moeOp.cpp (4 hunks)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (7 hunks)
  • pyproject.toml (1 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (6 hunks)
  • tensorrt_llm/_torch/model_config.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
  • tensorrt_llm/llmapi/llm_args.py (2 hunks)
  • tests/unittest/_torch/modules/test_fused_moe.py (2 hunks)
💤 Files with no reviewable changes (2)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/collective/epilogue_moe_finalize.hpp
✅ Files skipped from review due to trivial changes (1)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt
🚧 Files skipped from review as they are similar to previous changes (22)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt
  • pyproject.toml
  • tensorrt_llm/_torch/pyexecutor/config.py
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/util/gather_tensor.hpp
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_mixed_input_launcher.inl
  • tests/unittest/_torch/modules/test_fused_moe.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • tensorrt_llm/llmapi/llm_args.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
🧠 Learnings (21)
📓 Common learnings
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.893Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
📚 Learning: 2025-08-08T05:10:38.893Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.893Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
📚 Learning: 2025-08-08T04:10:18.987Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{h,hpp} : Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Do not use smart pointers that have been deprecated in C++11.

Applied to files:

  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
📚 Learning: 2025-07-22T08:33:49.109Z
Learnt from: yiqingy0
PR: NVIDIA/TensorRT-LLM#5198
File: jenkins/mergeWaiveList.py:0-0
Timestamp: 2025-07-22T08:33:49.109Z
Learning: In the TensorRT-LLM waive list merging system, removed lines are always located at the end of the merge waive lists, which is why the mergeWaiveList.py script uses reverse traversal - it's an optimization for this specific domain constraint.

Applied to files:

  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Avoid the use of #ifdef and #ifndef directives (except for header include guards). Prefer to use #if defined(...) or #if !defined(...) instead.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : C headers should not be used directly. Use C++ standard headers like <cstdint> instead of <stdint.h>.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Avoid dead code in C++.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Avoid dynamic_cast.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : When nesting preprocessor directives, use indentation after the hash mark (#).

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Literal suffixes should be upper case (e.g., 1234L instead of 1234l).

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Use the LLVM clang-format tool for formatting C++ code before submitting a PR.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Use a maximum of 120 characters per line in C++ code.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Avoid declaring large functions as inline, absent a quantifiable benefit.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py

264-264: Line too long (332 > 120)

(E501)


276-276: Line too long (314 > 120)

(E501)


362-362: DataType may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl (2)

347-357: Fused finalize epilogue wiring looks correct

Using PtrArray TMA epilogue with ScaledAccPerRowBiasPerColScaleScatter matches the new visitor-scatter implementation. D element is void in this path (scatter handles output), and LayoutD is still carried for scheduling. Per-row bias (Stride<_1,_0>) and per-col scale (Stride<_0,_1>) are consistent.

Also applies to: 359-361
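For readers unfamiliar with the broadcast-stride convention, a small sketch of what the strides called out above encode; the trailing int64_t batch stride for the bias is an assumption (the per-column scale case is stated explicitly elsewhere in this review):

#include <cute/layout.hpp>
using namespace cute;

// Per-row bias: one value per row (varies along M), broadcast across columns (stride 0 along N).
using StrideBias  = Stride<_1, _0, int64_t>;
// Per-column scale: broadcast across rows (stride 0 along M), one value per column (varies along N),
// matching the fixed per-column-only layout of ScaledAccPerRowBiasPerColScaleScatter.
using StrideScale = Stride<_0, _1, int64_t>;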


501-505: Alpha‐scale pointer usage is correct
The alpha_scale_ptr_array is consistently defined as float const** for per‐expert α scales and is passed as the α‐pointer argument (with the β‐pointer set to nullptr) in both the FINALIZE and non‐FINALIZE paths. No mismatches were found across CUTLASS fusion callbacks.

• File cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl (501–505):
tma_ws_input.alpha_scale_ptr_array → α array
nullptr → β array

All inspected instantiations of cutlass::epilogue::fusion::LinearCombination treat alpha_scale_ptr_array exclusively as α. No fix needed.

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp (1)

507-510: Per-column scale stride fixed correctly

Using Stride<_0,_1,int64_t> for scale aligns with the per-column-only support in this fusion. Good simplification.

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (3)

82-83: D operand layout/stride and pointers promoted—LGTM

Introducing LayoutD/StrideD and promoting stride_d/ptr_d out of the old DefaultEpilogue simplifies the launcher wiring and matches the fused/non-fused split.

Also applies to: 118-119, 145-148


151-163: FusedFinalizeEpilogue pointer types align with fusion op needs

  • StrideFinalOutput via TagToStrideC_t is correct.
  • Double pointers for bias, router scales, and indices match per-group arrays.
  • use_reduction default true is sensible.

222-223: Workspace buffer count increase verified—no action required
All call sites now reference buffers.size() or a matching std::array<…, 20>, so they automatically adapt to the new count. No hard-coded loop bounds or allocations remain that could cause OOB accesses.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)

1168-1169: Potentially incorrect StrideD order

cute::make_shape(gemm_n, gemm_m, 1) flips M/N compared to the A-matrix shape (gemm_m, gemm_k).
For epilogues that later reinterpret D as “token × hidden”, this may be correct, but for the non-fused path you gated on (fusion == NONE) the old (gemm_m, gemm_n, 1) ordering is usually expected.

Please double-check that downstream consumers of ptr_d/stride_d still receive the intended row-major layout, otherwise write-backs will land transposed.


4010-4017: use_fused_finalize_ is re-evaluated but never validated

using_fused_finalize relies on the runner-level flag use_fused_finalize_ plus several runtime checks.
If the workspace was configured without those extra buffers (e.g. when use_fused_finalize_ was true at construction time, but a later call sets use_lora=true), the code still calls setFinalizeFusionParams, leading to out-of-bounds writes.

Add a defensive TLLM_CHECK(using_fused_finalize || gemm2_tma_ws_input.fusion != FINALIZE); right after the decision to ensure the workspace was prepared accordingly.

@tensorrt-cicd
Collaborator

PR_Github #14643 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11055 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@bobboli bobboli dismissed litaotju’s stale review August 9, 2025 17:52

already resolved by sergey

@bobboli bobboli enabled auto-merge (squash) August 9, 2025 17:52
@sklevtsov-nvidia
Contributor Author

Passing CI now. Need codeowners approval @juney-nvidia @nv-guomingz @litaotju @yizhang-nv @Naveassaf @pcastonguay @lucaslie

Collaborator

@Superjomn Superjomn left a comment


LGTM on the llmapi change.

Collaborator

@QiJune QiJune left a comment


LGTM

@litaotju
Collaborator

Could you help to measure DeepSeek R1 performance with this one?

And what about the WIDEEP MoE backend? Can it support this optimization?

@litaotju
Collaborator

The feature is enabled by default. Can it support all the precisions we support in MoE?

BF16, FP8, FP4, and W4A8?

@litaotju
Collaborator

This optimization is done using the public CUTLASS only, right?

Why do we need to update the CUTLASS binary integration .a?

@sklevtsov-nvidia
Contributor Author

The feature is enabled by default. Can it support all the precisions we support in MoE?

BF16, FP8, FP4, and W4A8?

It can only support BF16/FP16/FP32 outputs (due to the use of HW atomics), but these are the only types for MoE plugin output we currently support on the CUTLASS path. It is agnostic w.r.t. input or intermediate precision/quantization.

@sklevtsov-nvidia
Contributor Author

This optimization is done using the public CUTLASS only, right?

Why do we need to update the CUTLASS binary integration .a?

Correct. The only reason the internal library is updated is to avoid code divergence (and because this was originally implemented in the internal library before the kernels were OSS'ed, so I just kept the changes in both places).

@sklevtsov-nvidia
Contributor Author

Could you help to measure DeepSeek R1 performance with this one?

And what about the WIDEEP MoE backend? Can it support this optimization?

Based on my earlier measurements (~2-3 months ago), there was no observable change in DSR1 perf, because the finalize kernel accounts for a very small fraction of the runtime. I will repeat the measurement on the current version of the code if time permits.

IIUC, whenever WIDEEP uses the CUTLASS MoE backend, this fusion is enabled. I have not measured the runtime impact in that case, but it's likely to be minimal.

@litaotju
Collaborator

The feature is enabled by default. Can it support all the precisions we support in MoE?
BF16, FP8, FP4, and W4A8?

It can only support BF16/FP16/FP32 outputs (due to the use of HW atomics), but these are the only types for MoE plugin output we currently support on the CUTLASS path. It is agnostic w.r.t. input or intermediate precision/quantization.

So only the non-quantized output path is supported. Will that cause issues when running a quantized path, given that this is now enabled by default?

@litaotju litaotju requested a review from zongfeijing August 12, 2025 03:11
@sklevtsov-nvidia
Contributor Author

So only the non-quantized output path is supported. Will that cause issues when running a quantized path, given that this is now enabled by default?

If/when work is carried out to support output quantization in the CUTLASS MoE backend, appropriate conditions will need to be inserted to set the fusion flag to false here. At the moment this is not an issue, since a quantized path for the MoE output does not exist in the CUTLASS-based op.
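For illustration only, a hypothetical guard along these lines; the variable names and the quantization condition are assumptions, and no such condition exists in the code today:

// Hypothetical sketch: if output quantization is ever added to the CUTLASS MoE path,
// the fused finalize epilogue would have to be disabled for it, since the scatter
// epilogue only supports BF16/FP16/FP32 outputs via hardware atomics.
bool const output_is_quantized = false; // placeholder for a future condition
bool const using_fused_finalize = use_fused_finalize_ && !output_is_quantized;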

Collaborator

@zongfeijing zongfeijing left a comment


LGTM
