[None][feat] CUTLASS MoE FC2+Finalize fusion #3294
Conversation
Force-pushed 099c239 to 299ea9e (Compare)
djns99
left a comment
LGTM thanks! See my comments on the internal MR
Force-pushed 299ea9e to a281d1a (Compare)
Force-pushed 7f16376 to 14d0868 (Compare)
/bot run
PR_Github #2520 [ run ] triggered by Bot
Force-pushed 14d0868 to a88ff47 (Compare)
/bot run
PR_Github #2534 [ run ] triggered by Bot
PR_Github #2520 [ run ] completed with state
PR_Github #2534 [ run ] completed with state
Force-pushed a88ff47 to 45df19e (Compare)
/bot run
PR_Github #2684 [ run ] triggered by Bot
PR_Github #2684 [ run ] completed with state
/bot help
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.
run
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request.
skip
Skip testing for latest commit on pull request.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
/bot run
PR_Github #2706 [ run ] triggered by Bot
PR_Github #2706 [ run ] completed with state
Force-pushed 45df19e to 9d6d482 (Compare)
/bot reuse-pipeline
PR_Github #2808 [ reuse-pipeline ] triggered by Bot
PR_Github #2808 [ reuse-pipeline ] completed with state
Review threads (outdated, resolved):
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
Force-pushed 35828ee to 793c626 (Compare)
Force-pushed ab14f1e to 4443cf9 (Compare)
/bot run
PR_Github #14643 [ run ] triggered by Bot
Actionable comments posted: 6
♻️ Duplicate comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (1)
263-265: Macro line still >120 chars – please reflow
Same E501 violation as highlighted earlier; break the argument list over several lines for readability.
🧹 Nitpick comments (7)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl (2)
433-456: Workspace shape mock is pragmatic; document and guard
You already note this is relying on CUTLASS internals. Please:
- Add a brief comment explaining that only group_count is used by get_workspace_size in current CUTLASS and that can_implement later guards correctness.
- Optionally assert num_experts > 0 here.

```diff
-    typename TmaWarpSpecializedGroupedGemmInput::ProblemShape shape_info{num_experts, nullptr, nullptr};
+    // get_workspace_size only uses group_count in current CUTLASS; arguments are a minimal mock.
+    TLLM_CHECK(num_experts > 0);
+    typename TmaWarpSpecializedGroupedGemmInput::ProblemShape shape_info{num_experts, nullptr, nullptr};
```
405-426: Remove commented-out legacy type plumbing
The large commented block adds noise and risks bitrot. Please delete it; git history preserves it.

```diff
-    /*};
-    //
-    // using namespace cute;
-    // using GemmInfo = TmaWarpSpecializedGroupedGemmInfo<ArchTag, T, WeightType, OutputType, EpilogueTag,
-    //     MmaTileShape, ClusterShape, BIAS, FUSION>;
-    //
-    // using ElementAccumulator = typename GemmInfo::ElementAccumulator;
-    // using ElementA = typename GemmInfo::ElementA;
-    // using ElementB = typename GemmInfo::ElementB;
-    // using ElementC = typename GemmInfo::ElementC;
-    // using ElementCSafe = typename GemmInfo::ElementCSafe;
-    // using ElementD = typename GemmInfo::ElementD;
-    // using ElementFinalOutput = typename GemmInfo::ElementFinalOutput;
-    // using ElementBias = typename GemmInfo::ElementBias;
-    //
-    // using CollectiveMainloop = typename GemmInfo::CollectiveMainloop;
-    // using CollectiveEpilogue = typename GemmInfo::CollectiveEpilogue;
-    // using GemmKernel = typename GemmInfo::GemmKernel;
-    // using GemmGrouped = typename GemmInfo::GemmGrouped;*/
+    // (removed: legacy alternative type plumbing retained by VCS history)
```

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp (2)
270-273: Remove unused local tensors gIdx/gIdx_epi
These locals are computed but never used. Please remove to reduce register pressure and improve readability.

```diff
-    Tensor mIdx = make_tensor(params_ptr->ptr_index[l], make_shape(M,N,Int<1>{}), StrideIndex{}); // (M,N,_1)
-    Tensor gIdx = local_tile(mIdx, take<0,2>(args.tile_shape_mnk), make_coord(m,n,Int<0>{}));     // (CTA_M,CTA_N)
-    Tensor gIdx_epi = flat_divide(gIdx, args.epi_tile);                                           // (EPI_TILE_M,EPI_TILE_N,EPI_M,EPI_N)
+    // Index layout tensor not needed beyond gather function; avoid unused locals.
```
214-252: Reduction op selection is robust; consider documenting fallback
The typed atomic add fallback for non-(half/bfloat16) is sensible. Add a brief comment noting this path trades off performance for generality until vectorized reductions are supported for more types.
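To make the trade-off concrete, here is a rough sketch of the two write paths being discussed; the function names are invented for illustration, and this is not the visitor's actual code:

```cpp
#include <cuda_fp16.h>

// Fast path: __half2 has a hardware atomicAdd overload (sm_60+), so two
// adjacent fp16 outputs are accumulated with a single vectorized red.global.
__device__ void reduceHalf2(__half2* dst, __half2 v)
{
    atomicAdd(dst, v);
}

// Typed scalar fallback for other element types (e.g. float): still correct,
// but trades peak bandwidth for generality, which is what the suggested
// comment would document.
template <typename T>
__device__ void reduceScalar(T* dst, T v)
{
    atomicAdd(dst, v);
}
```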
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (2)
2-2: Update header year to include 2025
Per coding guidelines, headers should include the current year.

```diff
- * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
```
234-235: Simplified setFinalizeFusionParams signature: document parameter semantics
Please update any comments/docs to reflect (see the sketch after this list):
- final_output is the scatter target buffer
- hidden_size maps to num_rows_in_final_output (modulo)
- num_output_tokens selects the group count
- use_reduction toggles red.global vs st.global
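A minimal sketch of what such a doc comment could look like, using the parameter names quoted above; the actual signature in moe_gemm_kernels.h may differ:

```cpp
#include <cstdint>

/// Configure the fused GEMM2+Finalize epilogue (hypothetical signature).
/// \param final_output      Scatter target buffer written by the epilogue.
/// \param hidden_size       Maps to num_rows_in_final_output (applied modulo).
/// \param num_output_tokens Selects the group count.
/// \param use_reduction     true selects red.global (atomic accumulation);
///                          false selects plain st.global stores.
void setFinalizeFusionParams(
    void* final_output, int64_t hidden_size, int64_t num_output_tokens, bool use_reduction);
```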
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)
3311-3313: Logic coupling makes has_different_output_type_tma_ws fragile
has_different_output_type_tma_ws is defined as !using_fused_finalize && using_tma_ws_gemm2. When using_fused_finalize is later forced off for some edge case (e.g. LoRA or W4-groupwise), this flag silently flips and the kernel selection drops back to the slow finalize path. Consider factoring the three booleans into a single helper (e.g. shouldRunSeparateFinalize(), sketched below) to keep the condition consistent and easier to audit.
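A sketch of the suggested helper; the member names are assumptions derived from the booleans quoted in this comment, not the actual source:

```cpp
struct MoeGemm2PathFlags
{
    bool usingTmaWsGemm2 = false;
    bool usingFusedFinalize = false;

    // Single audit point for kernel selection: the separate (slow) finalize
    // kernel is needed only when TMA WS GEMM2 runs without finalize fusion.
    bool shouldRunSeparateFinalize() const
    {
        return usingTmaWsGemm2 && !usingFusedFinalize;
    }
};
```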
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/tensorrt_llm_internal_cutlass_kernels_static.tar.xz is excluded by !**/*.xz
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/tensorrt_llm_internal_cutlass_kernels_static.tar.xz is excluded by !**/*.xz
📒 Files selected for processing (32)
- cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
- cpp/tensorrt_llm/common/envUtils.h (1 hunks)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/collective/epilogue_moe_finalize.hpp (0 hunks)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp (1 hunks)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (0 hunks)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/util/gather_tensor.hpp (3 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (5 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (6 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl (10 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_mixed_input_launcher.inl (2 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (2 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu (6 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (16 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (2 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (9 hunks)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt (1 hunks)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h (5 hunks)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (13 hunks)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt (1 hunks)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1 hunks)
- cpp/tensorrt_llm/thop/moeOp.cpp (4 hunks)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (7 hunks)
- pyproject.toml (1 hunks)
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (6 hunks)
- tensorrt_llm/_torch/model_config.py (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tests/unittest/_torch/modules/test_fused_moe.py (2 hunks)
💤 Files with no reviewable changes (2)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/collective/epilogue_moe_finalize.hpp
✅ Files skipped from review due to trivial changes (1)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt
🚧 Files skipped from review as they are similar to previous changes (22)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt
- pyproject.toml
- tensorrt_llm/_torch/pyexecutor/config.py
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/util/gather_tensor.hpp
- tensorrt_llm/_torch/model_config.py
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
- cpp/tensorrt_llm/common/envUtils.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h
- tensorrt_llm/_torch/pyexecutor/model_engine.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_mixed_input_launcher.inl
- tests/unittest/_torch/modules/test_fused_moe.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
- cpp/tensorrt_llm/common/envUtils.cpp
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
- tensorrt_llm/llmapi/llm_args.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/thop/moeOp.cpp
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
🧠 Learnings (21)
📓 Common learnings
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.893Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
📚 Learning: 2025-08-08T05:10:38.893Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.893Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
📚 Learning: 2025-08-08T04:10:18.987Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{h,hpp} : Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Do not use smart pointers that have been deprecated in C++11.
Applied to files:
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
📚 Learning: 2025-07-22T08:33:49.109Z
Learnt from: yiqingy0
PR: NVIDIA/TensorRT-LLM#5198
File: jenkins/mergeWaiveList.py:0-0
Timestamp: 2025-07-22T08:33:49.109Z
Learning: In the TensorRT-LLM waive list merging system, removed lines are always located at the end of the merge waive lists, which is why the mergeWaiveList.py script uses reverse traversal - it's an optimization for this specific domain constraint.
Applied to files:
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Avoid the use of #ifdef and #ifndef directives (except for header include guards). Prefer to use #if defined(...) or #if !defined(...) instead.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : C headers should not be used directly. Use C++ standard headers like <cstdint> instead of <stdint.h>.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Avoid dead code in C++.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Avoid dynamic_cast.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : When nesting preprocessor directives, use indentation after the hash mark (#).
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Literal suffixes should be upper case (e.g., 1234L instead of 1234l).
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Use the LLVM clang-format tool for formatting C++ code before submitting a PR.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Use a maximum of 120 characters per line in C++ code.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Avoid declaring large functions as inline, absent a quantifiable benefit.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
264-264: Line too long (332 > 120)
(E501)
276-276: Line too long (314 > 120)
(E501)
362-362: DataType may be undefined, or defined from star imports
(F405)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl (2)
347-357: Fused finalize epilogue wiring looks correct
Using PtrArray TMA epilogue with ScaledAccPerRowBiasPerColScaleScatter matches the new visitor-scatter implementation. D element is void in this path (scatter handles output), and LayoutD is still carried for scheduling. Per-row bias (Stride<_1,_0>) and per-col scale (Stride<_0,_1>) are consistent.
Also applies to: 359-361
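To make the stride convention concrete, a small illustration of how those two CUTE strides broadcast (assumes CUTLASS/CUTE headers on the include path):

```cpp
#include <cute/layout.hpp>

// The linear offset of element (m, n) is m * get<0>(stride) + n * get<1>(stride).
// Stride<_1,_0>: offset depends only on m, i.e. one value per row (per-row bias).
// Stride<_0,_1>: offset depends only on n, i.e. one value per column (per-col scale).
using PerRowBiasStride = cute::Stride<cute::_1, cute::_0>;
using PerColScaleStride = cute::Stride<cute::_0, cute::_1>;
```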
501-505: Alpha-scale pointer usage is correct
The alpha_scale_ptr_array is consistently defined as float const** for per-expert α scales and is passed as the α-pointer argument (with the β pointer set to nullptr) in both the FINALIZE and non-FINALIZE paths. No mismatches were found across CUTLASS fusion callbacks.
In cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl (501-505):
- tma_ws_input.alpha_scale_ptr_array → α array
- nullptr → β array
All inspected instantiations of cutlass::epilogue::fusion::LinearCombination treat alpha_scale_ptr_array exclusively as α. No fix needed.
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp (1)
507-510: Per-column scale stride fixed correctly
Using Stride<_0,_1,int64_t> for scale aligns with the per-column-only support in this fusion. Good simplification.
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (3)
82-83: D operand layout/stride and pointers promoted: LGTM
Introducing LayoutD/StrideD and promoting stride_d/ptr_d out of the old DefaultEpilogue simplifies the launcher wiring and matches the fused/non-fused split.
Also applies to: 118-119, 145-148
151-163: FusedFinalizeEpilogue pointer types align with fusion op needs
- StrideFinalOutput via TagToStrideC_t is correct.
- Double pointers for bias, router scales, and indices match per-group arrays.
- use_reduction default true is sensible.
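Read together, these bullets imply a parameter block roughly like the following; the field names are illustrative, inferred from this review rather than copied from moe_gemm_kernels.h:

```cpp
struct FusedFinalizeEpilogueSketch
{
    void* ptr_final_output = nullptr;             // scatter target buffer
    void const** ptr_bias = nullptr;              // per-group bias arrays
    float const** ptr_router_scales = nullptr;    // per-group router scale arrays
    int const** ptr_source_token_index = nullptr; // per-group scatter indices
    bool use_reduction = true;                    // default: atomically accumulate into final output
};
```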
222-223: Workspace buffer count increase verified; no action required
All call sites now reference buffers.size() or a matching std::array<…, 20>, so they automatically adapt to the new count. No hard-coded loop bounds or allocations remain that could cause OOB accesses.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
1168-1169: Potentially incorrect StrideD order
cute::make_shape(gemm_n, gemm_m, 1) flips M/N compared to the A-matrix shape (gemm_m, gemm_k). For epilogues that later reinterpret D as “token × hidden”, this may be correct, but for the non-fused path you gated on (fusion == NONE) the old (gemm_m, gemm_n, 1) ordering is usually expected.
Please double-check that downstream consumers of ptr_d/stride_d still receive the intended row-major layout, otherwise write-backs will land transposed. A small illustration follows.
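A compact way to see the concern, with names invented for illustration and assuming the usual packed row-major stride derivation (first extent strides by the second, second extent is contiguous):

```cpp
#include <cute/layout.hpp>

// For a packed row-major (rows, cols, batch) shape, the row index strides by
// cols and the column index strides by 1. Swapping the first two extents
// changes which dimension is contiguous, so a consumer expecting
// (gemm_m, gemm_n, 1) would read D transposed.
void illustrateStrideDOrder(int gemm_m, int gemm_n)
{
    auto shapeOld = cute::make_shape(gemm_m, gemm_n, 1); // D viewed as (M, N, batch)
    auto shapeNew = cute::make_shape(gemm_n, gemm_m, 1); // D viewed as (N, M, batch)
    (void) shapeOld;
    (void) shapeNew;
}
```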
4010-4017: use_fused_finalize_ is re-evaluated but never validated
using_fused_finalize relies on the runner-level flag use_fused_finalize_ plus several runtime checks. If the workspace was configured without those extra buffers (e.g. when use_fused_finalize_ was true at construction time, but a later call sets use_lora=true), the code still calls setFinalizeFusionParams, leading to out-of-bounds writes.
Add a defensive TLLM_CHECK(using_fused_finalize || gemm2_tma_ws_input.fusion != FINALIZE); right after the decision to ensure the workspace was prepared accordingly.
Resolved review threads:
- ...t_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
PR_Github #14643 [ run ] completed with state
Passing CI now. Need codeowners approval @juney-nvidia @nv-guomingz @litaotju @yizhang-nv @Naveassaf @pcastonguay @lucaslie
Superjomn
left a comment
LGTM on the llmapi change.
QiJune
left a comment
LGTM
Could you help to measure DeepSeek R1 performance with this one? And what about the WIDEEP MoE backend? Can it support this optimization?
The feature is enabled by default. Can it support all the precisions we support in MoE: BF16, FP8, FP4, and W4A8?
This optimization is done using the public CUTLASS only, right? Why do we need to update the CUTLASS binary integration .a?
It can only support BF16/FP16/FP32 outputs (due to use of HW atomics), but these are the only types for MoE plugin output we currently support on the CUTLASS path. It is agnostic w.r.t. input or intermediate precision/quantization.
Correct. The only reason the internal library is updated is to avoid code divergence (and because this was originally implemented in the internal library before the kernels were OSS'ed, so I just kept the changes in both places).
Based on my earlier measurements (~2-3 months ago), there was no observable change in DSR1 perf due to a very small fraction of runtime taken by the finalize kernel. I will repeat the measurement on the current version of the code if time permits. IIUC, whenever WIDEEP uses the CUTLASS MoE backend, this fusion is enabled. I have not measured the runtime impact in that case, but it's likely to be minimal.
Only the non-quantized output path is supported. Will that cause issues when running the quantized path, given this is now enabled by default?
If/when work is carried out to support output quantization in the CUTLASS MoE backend, appropriate conditions will need to be inserted to set the fusion flag to
zongfeijing
left a comment
LGTM
This reverts commit 27fc351.
This MR adds a new EVT-based implementation of GEMM2+Finalize fusion that supports both Blackwell and Hopper.
Using trtllm-bench --backend pytorch --model mistralai/Mixtral-8x7B-Instruct-v0.1 with --input-mean 1024 --output-mean 512 --num-requests 2048 on a single B200,
Summary by CodeRabbit
New Features
Enhancements
Bug Fixes
Documentation
Chores