[TRTLLM-7318][feat] MnnvlThroughput AlltoAll implementation. #7499
Conversation
Caution: Review failed. An error occurred during the review process. Please try again later.

📝 Walkthrough

Adds a CUDA mixed-precision vec dtypes header, MOE all-to-all dispatch and combine kernels with headers and launchers, PyTorch bindings and build integration, multi-GPU unit tests, and a license entry for the new header.

Changes
Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor PyTorch
  participant THOP as Thop C++ op
  participant Launcher as Launcher
  participant CUDA as CUDA Kernel
  participant WS as Workspace/RecvBuffers
  participant Flags as CompletionFlags
  rect rgb(240,248,255)
    note over PyTorch,THOP: Dispatch flow
    PyTorch->>THOP: moe_a2a_dispatch(token_selected_experts, payloads, workspace,...)
    THOP->>THOP: validate inputs, partition workspace, prepare params
    THOP->>Launcher: moe_a2a_dispatch_launch(params)
    Launcher->>CUDA: moeA2ADispatchKernel<<<...>>>
    CUDA->>WS: vectorized writes to per-rank recv buffers
    CUDA->>Flags: st.release.sys.u32 to signal completion
    CUDA-->>THOP: update send_counters / send_indices
    THOP-->>PyTorch: return recv buffer views, send_counters, send_indices
  end
  rect rgb(245,255,240)
    note over PyTorch,THOP: Combine flow
    PyTorch->>THOP: moe_a2a_combine(send_indices, payload, workspace,...)
    THOP->>THOP: validate, copy local payload slice into workspace, prepare params
    THOP->>Launcher: moe_a2a_combine_launch(params)
    Launcher->>CUDA: moeA2ACombineKernel<T><<<...>>>
    CUDA->>Flags: ld.acquire.sys.u32 wait on completion flags
    CUDA->>WS: read per-rank recv buffers, vectorized-sum into output
    CUDA-->>THOP: output filled
    THOP-->>PyTorch: return combined output
  end
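To make the flow above concrete, here is a minimal usage sketch in Python. It is a sketch only: the wrapper names come from tensorrt_llm/_torch/distributed/ops.py as reviewed in this PR, but the argument order, shapes, and return values are assumptions inferred from the diagram, not the PR's verified API.

```python
# Hypothetical usage sketch of the dispatch -> expert compute -> combine sequence.
# Argument order and return values are assumptions based on the diagram above.
from tensorrt_llm._torch.distributed import moe_a2a_dispatch, moe_a2a_combine


def moe_layer_step(hidden_states, token_final_scales, token_selected_experts,
                   workspace, max_tokens_per_rank, ep_rank, ep_size, top_k, num_experts):
    # Dispatch: copy each token's payloads into the recv buffers of the ranks
    # owning its selected experts, then signal per-rank completion flags.
    recv_buffers, send_counters, send_indices = moe_a2a_dispatch(
        token_selected_experts, [hidden_states, token_final_scales], workspace,
        max_tokens_per_rank, ep_rank, ep_size, top_k, num_experts)

    # Local expert computation on the received tokens (placeholder: identity).
    expert_output = recv_buffers[0]

    # Combine: wait on completion flags, read per-rank recv buffers, and sum
    # contributions back into one output row per local token.
    output = moe_a2a_combine(send_indices, expert_output, workspace,
                             max_tokens_per_rank, ep_rank, ep_size, top_k)
    return output
```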
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers
Actionable comments posted: 20
🧹 Nitpick comments (17)
cpp/tensorrt_llm/common/vec_dtypes.cuh (3)
199-210: Add explicit “unsupported type” handling in get_exponent_bits/get_mantissa_bits.

Prevents “control reaches end of non-void function” in some toolchains if used with an unexpected T.
Apply this diff:
 template <typename T>
 constexpr FLASHINFER_INLINE int get_exponent_bits()
 {
@@
-    }
+    } else {
+        static_assert(!std::is_same_v<T, T>, "Unsupported type in get_exponent_bits");
+        return 0;
+    }
 }

 template <typename T>
 constexpr FLASHINFER_INLINE int get_mantissa_bits()
 {
@@
-    }
+    } else {
+        static_assert(!std::is_same_v<T, T>, "Unsupported type in get_mantissa_bits");
+        return 0;
+    }
 }

Also applies to: 212-224
1851-1854: Unify BF16 naming in vec2_dtype specialization.

The rest of the file uses nv_bfloat16/nv_bfloat162. Specialize on nv_bfloat16 for consistency.
Apply this diff:
-template <>
-struct vec2_dtype<__nv_bfloat16> {
-    using type = __nv_bfloat162;
-};
+template <>
+struct vec2_dtype<nv_bfloat16> {
+    using type = nv_bfloat162;
+};

If both aliases are used across the codebase, consider adding both specializations to be robust.
316-341: Inline PTX cvt instructions: consider adding memory clobbers and documenting constraints.

Minor, but adding “: 'memory'” makes ordering explicit; also add a one-line comment citing the required SM capability (Hopper+).
I can prepare a follow-up patch if you want this change applied throughout these four blocks.
Also applies to: 341-365, 366-397, 399-430
cpp/tensorrt_llm/thop/CMakeLists.txt (1)
1-15: Update header year to 2025 on next touch.

Non-blocking; repository headers are shifting to the current year on edits.
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (6)
45-57: Consider removing commented code or marking it as example documentation.

The commented function compute_nvfp4_workspace_size appears to be unused. If it's meant as documentation for how workspace size is calculated, consider moving it to a docstring or removing it entirely.
92-93: Address the TODO for FP8 testing.

There's a TODO comment indicating that FP8 testing failed. This should be tracked and resolved.
Would you like me to create an issue to track the FP8 testing failure investigation?
106-108: Remove commented debug code.

The commented debug code for constructing data with rank and token indices should be removed if not needed, or uncommented if it's required for debugging.
-    # Construct the data to contain info about send rank and local_token_idx, which is used for debugging
-    # token_final_scales[:, 0] = rank
-    # token_final_scales[:, 1] = torch.linspace(0, local_num_tokens - 1, local_num_tokens, dtype=torch.bfloat16, device='cuda')

Also applies to: 174-176
409-411: Improve error handling for MNNVL initialization.

The generic exception handling could mask specific errors. Consider catching specific exceptions or at least logging the error before skipping.
     try:
         MnnvlMemory.initialize()
         assert MnnvlMemory.supports_mnnvl()
-    except Exception:
-        pytest.skip("MNNVL not supported on this system")
+    except Exception as e:
+        pytest.skip(f"MNNVL not supported on this system: {str(e)}")
493-494: Remove or refactor debug print statements.

There are several debug print statements throughout the test. These should either be removed for production code or converted to use proper logging with appropriate log levels.
Consider using Python's logging module instead of print statements:
- print("Starting dispatch and combine on workers...") + logger.debug("Starting dispatch and combine on workers...")Also applies to: 502-511, 606-611, 617-620, 627-632
584-587: Remove redundant local MPI import and document barrier rationale
The file already imports MPI at the top, so you can remove the inline from mpi4py import MPI. Before each comm.Barrier(), add a brief comment explaining why this synchronization is required (e.g. “ensure all ranks have completed dispatch before combine”).

cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (3)
120-154: Potential integer overflow in bit manipulation.

The use of uint64_t for already_copied limits tracking to 64 ranks. While the code validates ep_size <= kMaxRanks (64), this implicit limitation should be documented or the implementation should be made more robust.

Add a static assertion or comment to document this limitation:
+    static_assert(kMaxRanks <= 64, "already_copied bit tracking requires kMaxRanks <= 64");
     uint64_t already_copied = 0;
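For context on why the cap is 64, here is a small self-contained Python sketch of the dedup pattern this comment refers to; the function name and the contiguous expert-to-rank partitioning are illustrative, while the kernel's actual logic lives in moeAlltoAllKernels.cu.

```python
# Illustrative only: a 64-bit mask can record "already sent to rank r" for r in [0, 64),
# which mirrors why kMaxRanks <= 64 must hold for the uint64_t `already_copied` trick.
def ranks_to_send(token_experts, num_experts_per_rank):
    already_copied = 0  # bitmask over target ranks
    targets = []
    for expert_id in token_experts:
        target_rank = expert_id // num_experts_per_rank  # contiguous expert partitioning
        if already_copied & (1 << target_rank):
            continue  # this token was already copied to that rank
        already_copied |= 1 << target_rank
        targets.append(target_rank)
    return targets


# Experts 3 and 2 live on rank 0, expert 7 on rank 1; the token is copied once per rank.
assert ranks_to_send([3, 7, 2], num_experts_per_rank=4) == [0, 1]
```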
125-127: Add boundary check for target_rank bit shift.

While ep_size is validated to be ≤ 64, it's good practice to add an assertion to catch any potential issues.

+    assert(target_rank < 64); // Ensure bit shift is valid
     if (already_copied & (1ULL << target_rank)) continue;
520-521: Add compile-time verification for unsupported types.

The default case uses a runtime check, but compile-time verification would be better.
Consider using a static_assert in a template specialization pattern or at least making the error message more informative:
-    default: TLLM_CHECK_WITH_INFO(false, "Unsupported data type for moe_a2a_combine");
+    default:
+        TLLM_CHECK_WITH_INFO(false,
+            "Unsupported data type for moe_a2a_combine. Supported types: HALF, BF16, FLOAT");
88-90: Consider using camelCase for function names.

According to the coding guidelines, methods should use lowerCamelCase. Consider renaming the launch functions.
-void moe_a2a_dispatch_launch(MoeA2ADispatchParams const& params);
+void moeA2ADispatchLaunch(MoeA2ADispatchParams const& params);
-void moe_a2a_combine_launch(MoeA2ACombineParams const& params);
+void moeA2ACombineLaunch(MoeA2ACombineParams const& params);

Also applies to: 119-120
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (3)
142-143: Consider using torch::empty instead of torch::zeros for performance.

The TODO comment suggests using torch::empty for better performance since the memory will be overwritten anyway.
-    // TODO: Use torch empty to replace initialization
-    // Create send_counters tensor - tracks number of tokens sent to each target rank
-    torch::Tensor sendCounters = torch::zeros({epSize}, tokenSelectedExperts.options().dtype(torch::kInt32));
+    // Create send_counters tensor - tracks number of tokens sent to each target rank
+    torch::Tensor sendCounters = torch::empty({epSize}, tokenSelectedExperts.options().dtype(torch::kInt32));
+    sendCounters.zero_(); // Explicitly zero if needed for correctness
273-289: Consider adding support for FP8 data types.

Given the AI summary mentions FP8 support in the mixed-precision CUDA header, consider adding support for FP8 data types here as well.
     else if (payload.dtype() == torch::kFloat32)
     {
         nvDtype = nvinfer1::DataType::kFLOAT;
     }
+    else if (payload.dtype() == torch::kFloat8_e4m3fn)
+    {
+        nvDtype = nvinfer1::DataType::kFP8; // Or appropriate FP8 variant
+    }
     else
     {
-        TORCH_CHECK(false, "Unsupported data type for payload");
+        TORCH_CHECK(false, "Unsupported data type for payload. Supported types: float16, bfloat16, float32");
     }
25-27: Consider moving anonymous namespace content after namespace declaration.

For better organization, consider moving the anonymous namespace inside the torch_ext namespace.
namespace torch_ext { - namespace {
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
- cpp/tensorrt_llm/common/vec_dtypes.cuh (1 hunks)
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (1 hunks)
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (1 hunks)
- cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (1 hunks)
- jenkins/license_cpp.json (1 hunks)
- tensorrt_llm/_torch/distributed/ops.py (1 hunks)
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh}: Closing braces of C++ namespaces must include a comment naming the namespace (e.g., } // namespace foo)
Avoid using literals (except 0, nullptr, true, false) directly in logic; use named constants for comparisons
Use Allman brace style in C++
Place semicolon of empty for/while loop on its own line
Use brace-delimited statements for bodies of switch/while/do/for and always brace if/else bodies
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Non-static, externally visible globals use g prefix with lowerCamelCase (e.g., gDontUseGlobalFoos)
Static or anonymous-namespace globals use s prefix with lowerCamelCase (e.g., sMutableStaticGlobal)
Locally visible static variables use s prefix (e.g., static std::once_flag sFlag)
Member variables use m prefix with CamelCase (public may omit but encouraged)
Constants (enums, globals, static consts, function-scope magic numbers) use k prefix with UPPER_SNAKE (e.g., kDIGIT_NUM)
Function-scope non-literal, non-magic constants use normal non-const naming (e.g., const bool pass)
If macros are necessary, name them in UPPER_SNAKE_CASE
Avoid Hungarian notation except allowed app’s hungarian like nb for counts
Constructor parameters conflicting with member names get a trailing underscore (e.g., foo_)
Use uppercase literal suffixes (e.g., 1234L not 1234l)
Format C++ with clang-format (LLVM style), max line length 120; justify any exceptions with clang-format off/on blocks
Use C++-style comments; C comments not allowed except special inline cases; single-line comments use //
Use inline parameter comments in calls when arguments aren’t obvious (e.g., /* checkForErrors = */ false)
Disable code with #if/#endif (optionally mnemonic conditions or no-op macros); do not comment out code; avoid dead code
Use the least forceful C++ cast; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void to T* with static_cas...
Files:
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
- cpp/tensorrt_llm/common/vec_dtypes.cuh
**/*.{h,hpp,hh,hxx}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx}: Prefer const or constexpr over #define for constants in C++ headers
Use Doxygen for documenting interfaces; use //! for comments and //!< for member annotations in C++
Use include guards in headers with symbol format TRTLLM__H (no underscores prefix/suffix; filename only)
Files:
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
**/*
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Filenames compiled into a target must be case-insensitively unique
Files:
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- cpp/tensorrt_llm/thop/CMakeLists.txt
- jenkins/license_cpp.json
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
- tensorrt_llm/_torch/distributed/ops.py
- cpp/tensorrt_llm/common/vec_dtypes.cuh
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces, not tabs; indent 4 spaces
Files:
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
- tensorrt_llm/_torch/distributed/ops.py
- cpp/tensorrt_llm/common/vec_dtypes.cuh
**/*.{cpp,cc,cxx,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
- tensorrt_llm/_torch/distributed/ops.py
- cpp/tensorrt_llm/common/vec_dtypes.cuh
**/*.{cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cc,cpp,cxx,cu}: Prefer const or constexpr variables over #define for constants in C++
Declare variables const if not modified after initialization
Use smart pointers for heap allocation; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only exceptionally; avoid deprecated smart pointers
Avoid declaring large functions inline unless there’s a quantifiable benefit; remember in-class definitions are implicitly inline
Every defined function must be referenced at least once; avoid unused methods
Files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs (Python)
Maintain module namespace on import: prefer from package.subpackage import foo; use foo.Symbol()
Python filenames use snake_case
Python class names use PascalCase
Python functions and methods use snake_case
Python local variables use snake_case; if starting with a number concept, prefix with k (e.g., k_99th_percentile)
Python global variables use G_ prefix with UPPER_SNAKE_CASE
Python constants use UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes
Initialize all externally visible class members in init
For public interfaces, prefer docstrings over comments; comments should be for in-function or file-local interfaces
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes and variables inline with docstrings immediately after assignment
Avoid reflection when a non-reflective approach suffices
Limit except clauses to specific exceptions where possible
When using try/except for duck-typing, keep try body minimal and move logic to else
Files:
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
- tensorrt_llm/_torch/distributed/ops.py
🧠 Learnings (2)
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
🧬 Code graph analysis (4)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (1)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (4)
moe_a2a_dispatch_launch(194-232)
moe_a2a_dispatch_launch(194-194)
moe_a2a_combine_launch(473-522)
moe_a2a_combine_launch(473-473)
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (1)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (4)
moe_a2a_dispatch_launch(194-232)
moe_a2a_dispatch_launch(194-194)
moe_a2a_combine_launch(473-522)
moe_a2a_combine_launch(473-473)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (4)
tensorrt_llm/_mnnvl_utils.py (3)
MnnvlMemory(53-336)
as_torch_strided_tensor(84-88)
supports_mnnvl(330-336)
tensorrt_llm/_torch/distributed/ops.py (2)
moe_a2a_combine(646-651)
moe_a2a_dispatch(637-643)
tensorrt_llm/mapping.py (1)
Mapping(32-513)
tests/unittest/conftest.py (1)
mpi_pool_executor(113-121)
tensorrt_llm/_torch/distributed/ops.py (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
top_k(221-221)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (9)
cpp/tensorrt_llm/common/vec_dtypes.cuh (2)
741-764: Alignment assumptions for vectorized ld/st helpers.

All these paths reinterpret to int4/uint2/float4. Please confirm the pointers are 16-byte aligned at call sites; otherwise add debug asserts or non-vectorized fallbacks.
I can scan call sites and add DCHECK-style alignment guards if helpful.
Also applies to: 1014-1037, 1202-1226, 1409-1432, 1652-1675, 1797-1820
53-77: Verify CUDA architecture targets and fallback compilation
Confirm your build configuration (e.g., CMakeLists or compiler flags) includes compute capability ≥ 90 for these helpers, and test building for sm_80 to ensure the fallback path (plain store + __threadfence_system()) compiles successfully.

cpp/tensorrt_llm/thop/CMakeLists.txt (2)
73-73: Add moeAlltoAllOp.cpp to th_common: LGTM.
34-36: Remove NCCL/NVSHMEM link check for A2A op
moeAlltoAllOp.cpp contains no direct NCCL or NVSHMEM references; NCCL is already linked only for ncclCommunicatorOp under ENABLE_MULTI_DEVICE.

Likely an incorrect or invalid review comment.
tensorrt_llm/_torch/distributed/ops.py (1)
637-643: LGTM! Clean wrapper implementation.

The wrapper functions are well-implemented with clear docstrings explaining their purpose. The parameter forwarding is correct and maintains the same signature as the underlying PyTorch operations.
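For readers unfamiliar with the pattern being praised here, a thin forwarding wrapper typically looks like the sketch below. This is not the PR's code: the torch.ops namespace and the parameter list are assumptions made for illustration.

```python
# Hypothetical sketch of a thin Python wrapper around a registered custom op.
# The real wrappers live in tensorrt_llm/_torch/distributed/ops.py; the
# `torch.ops.trtllm` namespace and argument names here are assumptions.
from typing import List
import torch


def moe_a2a_dispatch(token_selected_experts: torch.Tensor,
                     payloads: List[torch.Tensor],
                     workspace: torch.Tensor,
                     max_tokens_per_rank: int,
                     ep_rank: int, ep_size: int,
                     top_k: int, num_experts: int):
    """Forward to the C++ dispatch op; returns recv buffers, send_counters, send_indices."""
    return torch.ops.trtllm.moe_a2a_dispatch(
        token_selected_experts, payloads, workspace,
        max_tokens_per_rank, ep_rank, ep_size, top_k, num_experts)
```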
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (2)
1-21: LGTM! Clean header structure with proper include guards and licensing.

The copyright header and includes are properly structured. Good organization of the dependencies.
22-23: Add closing brace comment for the namespace.

According to the coding guidelines, closing braces of C++ namespaces must include a comment naming the namespace.
-} // namespace tensorrt_llm::kernels::moe_a2a
+} // namespace moe_a2a
+} // namespace kernels
+} // namespace tensorrt_llm

⛔ Skipped due to learnings
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-01T18:06:21.617Z
Learning: Applies to **/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh} : Closing braces of C++ namespaces must include a comment naming the namespace (e.g., } // namespace foo)
1-16: LGTM! Well-structured file with proper copyright header.

The copyright header and includes follow the coding guidelines.
197-200: No update required for dispatch function name
The call to moe_a2a_dispatch_launch(params) matches the declaration in moeAlltoAllKernels.h (line 89) and no renaming was performed.

Likely an incorrect or invalid review comment.
        
          
Actionable comments posted: 3
🧹 Nitpick comments (4)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (4)
525-525: Remove unused variables flagged by Ruff.

Clean up to keep tests tidy.
Apply:
-    combine_payload = torch.zeros(rank_local_tokens, hidden_size, dtype=dtype, device=device)
-    expert_rank = expert_idx % ep_size
-    all_original_hidden_states = [r[1] for r in all_results]

Also applies to: 551-551, 598-598
88-95: Scaling factors dtype mismatches the comment.

Comment says FP8; code uses float32. If A2A is dtype-agnostic this is fine; otherwise, ensure the expected dtype is used.
If FP8 is required and supported in this test env, switch to torch.float8_e4m3fn; otherwise update the comment to avoid confusion.
390-393: Avoid hardcoding 512 MiB workspace; compute exact size.

Precomputing per-payload bytes improves determinism and reduces memory waste.
Consider reviving compute_nvfp4_workspace_size(...) and using it here, with a safety margin if needed.

Also applies to: 452-455
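As a point of reference, a helper along the following lines could replace the hardcoded 512 MiB. It is a sketch under the layout assumption stated elsewhere in this review (payload slabs packed contiguously, followed by one completion flag per source rank); the flag width, element sizes, and the example numbers are assumptions, and real kernels may add alignment or padding.

```python
# Hedged sketch: compute a per-rank workspace size from payload shapes instead of
# hardcoding 512 MiB. Assumes contiguous packing of payload slabs plus int32 flags.
def compute_workspace_size_per_rank(payload_elem_sizes, payload_elems_per_token,
                                    ep_size, max_tokens_per_rank,
                                    flag_bytes=4, safety_margin=1.0):
    payload_bytes = sum(
        elem_size * elems_per_token * ep_size * max_tokens_per_rank
        for elem_size, elems_per_token in zip(payload_elem_sizes, payload_elems_per_token))
    flags_bytes = ep_size * flag_bytes
    return int((payload_bytes + flags_bytes) * safety_margin)


# Example: bfloat16 hidden states (hidden=7168) + bfloat16 scales (top_k=8),
# ep_size=4, max_tokens_per_rank=1024 -> roughly 56 MiB instead of 512 MiB.
size = compute_workspace_size_per_rank([2, 2], [7168, 8], ep_size=4, max_tokens_per_rank=1024)
```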
457-475: Reduce noisy prints in tests.

Use pytest logging or keep output minimal to avoid cluttering CI logs.
You can replace them with pytest verbosity-guarded logs or drop them after stabilizing.

Also applies to: 477-481
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Filenames compiled into a target must be case-insensitively unique
Files:
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces, not tabs; indent 4 spaces
Files:
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs (Python)
Maintain module namespace on import: prefer from package.subpackage import foo; use foo.Symbol()
Python filenames use snake_case
Python class names use PascalCase
Python functions and methods use snake_case
Python local variables use snake_case; if starting with a number concept, prefix with k (e.g., k_99th_percentile)
Python global variables use G_ prefix with UPPER_SNAKE_CASE
Python constants use UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes
Initialize all externally visible class members in init
For public interfaces, prefer docstrings over comments; comments should be for in-function or file-local interfaces
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes and variables inline with docstrings immediately after assignment
Avoid reflection when a non-reflective approach suffices
Limit except clauses to specific exceptions where possible
When using try/except for duck-typing, keep try body minimal and move logic to else
Files:
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
**/*.{cpp,cc,cxx,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
🧬 Code graph analysis (1)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (4)
tensorrt_llm/_mnnvl_utils.py (3)
MnnvlMemory(53-336)
as_torch_strided_tensor(84-88)
supports_mnnvl(330-336)
tensorrt_llm/_torch/distributed/ops.py (2)
moe_a2a_combine(646-651)
moe_a2a_dispatch(637-643)
tensorrt_llm/mapping.py (1)
Mapping(32-513)
tests/unittest/conftest.py (1)
mpi_pool_executor(113-121)
🪛 Ruff (0.12.2)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py
524-524: Local variable combine_payload is assigned to but never used
Remove assignment to unused variable combine_payload
(F841)
551-551: Local variable expert_rank is assigned to but never used
Remove assignment to unused variable expert_rank
(F841)
597-597: Local variable all_original_hidden_states is assigned to but never used
Remove assignment to unused variable all_original_hidden_states
(F841)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (2)
185-197: Confirm completion_flags_offset layout/alignment.

The offset assumes no per-payload alignment/padding in the workspace. If kernels impose alignment, the slice may be wrong.
Can you confirm the dispatch workspace layout guarantees contiguous packing of all payload slabs with no extra alignment, or expose a helper to compute the flags offset?
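A helper of roughly the shape below is what this question is asking for. It is only a sketch under the same no-padding assumption (payload slabs packed back-to-back, flags immediately after); if the kernels pad or align slabs, this offset would be wrong, which is exactly the concern raised here. The function name is hypothetical.

```python
# Hedged sketch of a helper to compute the completion-flags offset for the test.
# Assumes every payload slab is packed contiguously with no per-payload alignment.
def completion_flags_offset(payloads, ep_size: int, max_tokens_per_rank: int) -> int:
    """Byte offset of the completion flags region within one rank's workspace."""
    offset = 0
    for p in payloads:
        elements_per_token = p.shape[-1]
        offset += ep_size * max_tokens_per_rank * elements_per_token * p.element_size()
    return offset


# Hypothetical usage in the test: slice ep_size int32 flags starting at that offset, e.g.
# flags = workspace[rank, off:off + ep_size * 4].view(torch.int32)
```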
39-43: LGTM: deterministic seed and quiet logger.

Good practice for reproducible multi-GPU tests.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (1)
tensorrt_llm/_torch/distributed/__init__.py (1)
1-1: Add NVIDIA copyright header.

All source files must include the NVIDIA copyright header for the current year.
Apply at the top:
+#
+# Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
+#
+"""Distributed public API exports."""
♻️ Duplicate comments (6)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (3)
383-394: Commented synchronization blocks: remove or re-enable with justification.

Large sections of combine-kernel synchronization are commented out. Either delete dead code or reintroduce it with a clear rationale and tests.
Also applies to: 453-487
176-178: Resolve the TODO and enforce system-wide visibility before signaling.

Cross-GPU visibility of payload writes from other threads requires a system fence prior to release-store signaling.
-    // TODO: Is this membar sufficient/necessary?
-    // __threadfence_system();
+    // Ensure all payload writes by any thread on this device are visible system-wide before signaling completion.
+    __threadfence_system();
427-446: Remove kernel printf or guard with a debug macro.

Kernel-side printf is a perf hazard; gate under a compile-time flag or remove.
-    if (lane_id == 0)
-    {
+    #ifdef DEBUG_MOE_A2A
+    if (lane_id == 0) {
     ...
-    }
+    }
+    #endif

cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (3)
167-173: Gate debug prints behind a macro.

Avoid unconditional device prints in production.
-    // Debug: print workspace addresses
-    if (epRank == 0)
-    {
+    // Debug: print workspace addresses
+    #ifdef DEBUG_MOE_A2A
+    if (epRank == 0) {
         printf("Rank %d: target_rank %d workspace at %p (base %p + stride %ld * %d)\n", ...
-    }
+    }
+    #endif
@@
-    // Debug: print completion flags pointer
-    printf("Dispatch: Rank %d completion_flags[%d] at %p (workspace %p + offset %ld)\n", ...
+    // Debug: print completion flags pointer
+    #ifdef DEBUG_MOE_A2A
+    printf("Dispatch: Rank %d completion_flags[%d] at %p (workspace %p + offset %ld)\n", ...
+    #endif

Also applies to: 186-189
326-326: Allocate localTokenCounter on the correct device.

Currently defaults to CPU; allocate on CUDA to match kernel pointer usage.
-    torch::Tensor localTokenCounter = torch::zeros({1}, torch::TensorOptions().dtype(torch::kInt32));
+    torch::Tensor localTokenCounter = torch::zeros({1}, payload.options().dtype(torch::kInt32));
362-367: Gate combine debug print.

Remove or guard with #ifdef DEBUG_MOE_A2A.

-    // Debug: print completion flags pointer
-    if (true)
-    {
-        printf("Combine: Rank %d completion_flags[%d] at %p (workspace %p + offset %ld)\n", ...
-    }
+    // Debug: print completion flags pointer
+    #ifdef DEBUG_MOE_A2A
+    printf("Combine: Rank %d completion_flags[%d] at %p (workspace %p + offset %ld)\n", ...
+    #endif
🧹 Nitpick comments (5)
tensorrt_llm/_torch/distributed/__init__.py (1)
5-6: Confirm wrappers are imported from ops and keep ordering consistent.

The new exports look correct. For readability, consider importing moe_a2a_dispatch before moe_a2a_combine to match the __all__ order.

-    MoEAllReduceParams, allgather, moe_a2a_combine,
-    moe_a2a_dispatch, reducescatter,
+    MoEAllReduceParams, allgather, moe_a2a_dispatch,
+    moe_a2a_combine, reducescatter,

cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (3)
113-125: Flag initialization: clarify scope and ordering.

Good to init remote flags once. Add a grid-wide single initializer (e.g., blockIdx.x==0 && threadIdx.x==0 is fine) but document that other CTAs may already copy; __threadfence() is device-scoped only; keep the later __threadfence_system() before signaling.

-    // Device-level threadfence prevents reordering the writing of the flags (which might be done by another thread
-    // in the same rank) in front of the initializations above.
-    __threadfence();
+    // Device fence to order this thread's writes; system fence occurs at completion signaling.
+    __threadfence();
16-22: Drop unused includes.

quantization.cuh and cooperative_groups.h appear unused here; remove to speed up builds.

-#include "tensorrt_llm/kernels/quantization.cuh"
-#include <cooperative_groups.h>
214-252: Runtime validations present; add assert for token indices range.

Consider validating expert IDs on host (in the op) to be in [0, num_experts) to avoid OOB in compute_target_rank_id.

Would you like a patch in moeAlltoAllOp.cpp to enforce this?
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (1)
315-325: Prefer non-blocking copy only when streams/devices are compatible.
copy_(..., /*non_blocking=*/true) is fine; ensure streams/devices align. Optionally expose a bool non_blocking param for callers.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (1 hunks)
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (1 hunks)
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (1 hunks)
- tensorrt_llm/_torch/distributed/__init__.py (2 hunks)
- tensorrt_llm/_torch/distributed/ops.py (1 hunks)
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/_torch/distributed/ops.py
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Filenames compiled into a target must be case-insensitively unique
Files:
- tensorrt_llm/_torch/distributed/__init__.py
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces, not tabs; indent 4 spaces
Files:
- tensorrt_llm/_torch/distributed/__init__.py
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs (Python)
Maintain module namespace on import: prefer from package.subpackage import foo; use foo.Symbol()
Python filenames use snake_case
Python class names use PascalCase
Python functions and methods use snake_case
Python local variables use snake_case; if starting with a number concept, prefix with k (e.g., k_99th_percentile)
Python global variables use G_ prefix with UPPER_SNAKE_CASE
Python constants use UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes
Initialize all externally visible class members in init
For public interfaces, prefer docstrings over comments; comments should be for in-function or file-local interfaces
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes and variables inline with docstrings immediately after assignment
Avoid reflection when a non-reflective approach suffices
Limit except clauses to specific exceptions where possible
When using try/except for duck-typing, keep try body minimal and move logic to else
Files:
- tensorrt_llm/_torch/distributed/__init__.py
**/*.{cpp,cc,cxx,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- tensorrt_llm/_torch/distributed/__init__.py
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh}: Closing braces of C++ namespaces must include a comment naming the namespace (e.g., } // namespace foo)
Avoid using literals (except 0, nullptr, true, false) directly in logic; use named constants for comparisons
Use Allman brace style in C++
Place semicolon of empty for/while loop on its own line
Use brace-delimited statements for bodies of switch/while/do/for and always brace if/else bodies
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Non-static, externally visible globals use g prefix with lowerCamelCase (e.g., gDontUseGlobalFoos)
Static or anonymous-namespace globals use s prefix with lowerCamelCase (e.g., sMutableStaticGlobal)
Locally visible static variables use s prefix (e.g., static std::once_flag sFlag)
Member variables use m prefix with CamelCase (public may omit but encouraged)
Constants (enums, globals, static consts, function-scope magic numbers) use k prefix with UPPER_SNAKE (e.g., kDIGIT_NUM)
Function-scope non-literal, non-magic constants use normal non-const naming (e.g., const bool pass)
If macros are necessary, name them in UPPER_SNAKE_CASE
Avoid Hungarian notation except allowed app’s hungarian like nb for counts
Constructor parameters conflicting with member names get a trailing underscore (e.g., foo_)
Use uppercase literal suffixes (e.g., 1234L not 1234l)
Format C++ with clang-format (LLVM style), max line length 120; justify any exceptions with clang-format off/on blocks
Use C++-style comments; C comments not allowed except special inline cases; single-line comments use //
Use inline parameter comments in calls when arguments aren’t obvious (e.g., /* checkForErrors = */ false)
Disable code with #if/#endif (optionally mnemonic conditions or no-op macros); do not comment out code; avoid dead code
Use the least forceful C++ cast; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void to T* with static_cas...
Files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
**/*.{cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cc,cpp,cxx,cu}: Prefer const or constexpr variables over #define for constants in C++
Declare variables const if not modified after initialization
Use smart pointers for heap allocation; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only exceptionally; avoid deprecated smart pointers
Avoid declaring large functions inline unless there’s a quantifiable benefit; remember in-class definitions are implicitly inline
Every defined function must be referenced at least once; avoid unused methods
Files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
🧠 Learnings (3)
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
🧬 Code graph analysis (3)
tensorrt_llm/_torch/distributed/__init__.py (1)
tensorrt_llm/_torch/distributed/ops.py (2)
moe_a2a_combine(684-719)
moe_a2a_dispatch(637-681)
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (1)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (4)
moe_a2a_dispatch_launch(214-252)
moe_a2a_dispatch_launch(214-214)
moe_a2a_combine_launch(493-542)
moe_a2a_combine_launch(493-493)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (2)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (1)
compute_target_rank_id(58-70)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
__syncwarp(703-707)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/distributed/__init__.py (1)
19-20: LGTM: public re-exports added.

Exposing moe_a2a_dispatch and moe_a2a_combine via __all__ aligns with the new wrappers.

cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (2)
146-154: Good: allocate counters/indices on the same device as inputs.
sendCounters, localTokenCounter, and sendIndices correctly inherit device from tokenSelectedExperts.
382-398: Bindings look correct.

Operator schemas match Python wrappers; CUDA impl registered as expected.
        
          
Actionable comments posted: 2
♻️ Duplicate comments (8)
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (8)
58-63: Adopt renamed constants and add epSize upper-bound check.

Update to new constant names and validate epSize against kMaxRanks.
 using tensorrt_llm::kernels::moe_a2a::PayloadDescriptor;
 using tensorrt_llm::kernels::moe_a2a::MoeA2ADispatchParams;
 using tensorrt_llm::kernels::moe_a2a::moe_a2a_dispatch_launch;
-using tensorrt_llm::kernels::moe_a2a::kMaxTopK;
-using tensorrt_llm::kernels::moe_a2a::kMaxPayloads;
+using tensorrt_llm::kernels::moe_a2a::kMAX_TOP_K;
+using tensorrt_llm::kernels::moe_a2a::kMAX_PAYLOADS;
+using tensorrt_llm::kernels::moe_a2a::kMaxRanks;

 TORCH_CHECK(epRank >= 0 && epRank < epSize, "epRank must be in the range [0, epSize)");
-TORCH_CHECK(topK > 0 && topK <= kMaxTopK, "topK must be in the range (0, kMaxTopK]");
+TORCH_CHECK(epSize > 0 && epSize <= kMaxRanks, "epSize must be in the range (0, kMaxRanks]");
+TORCH_CHECK(topK > 0 && topK <= kMAX_TOP_K, "topK must be in the range (0, kMAX_TOP_K]");
 TORCH_CHECK(!inputPayloads.empty(), "inputPayloads must not be empty");
-TORCH_CHECK(inputPayloads.size() <= kMaxPayloads, "Too many input payloads");
+TORCH_CHECK(inputPayloads.size() <= kMAX_PAYLOADS, "Too many input payloads");

Also applies to: 72-76
64-78: Validate expert IDs fall within [0, numExperts).

Prevents invalid routing/UB from out-of-range expert IDs.
     TORCH_CHECK(numExperts >= epSize, "numExperts must be greater than or equal to epSize");
     TORCH_CHECK(numExperts % epSize == 0, "numExperts must be divisible by epSize for contiguous partitioning");
+    TORCH_CHECK(tokenSelectedExperts.min().item<int32_t>() >= 0
+            && tokenSelectedExperts.max().item<int32_t>() < numExperts,
+        "tokenSelectedExperts must be in [0, numExperts).");
138-145: Update PayloadDescriptor member names to new struct API.

Use the renamed fields with m-prefix to match header changes.
-    PayloadDescriptor desc{};
-    desc.src_data = inputPayloads[i].data_ptr();
-    desc.element_size = payloadElementSizes[i];
-    desc.elements_per_token = payloadElementsPerToken[i];
+    PayloadDescriptor desc{};
+    desc.mSrcData = inputPayloads[i].data_ptr();
+    desc.mElementSize = payloadElementSizes[i];
+    desc.mElementsPerToken = payloadElementsPerToken[i];
168-174: Gate debug prints.

Wrap with a debug macro or remove to avoid noisy stdout and perf impact.
-    // Debug: print workspace addresses
-    if (epRank == 0)
-    {
+    // Debug: print workspace addresses
+    #ifdef DEBUG_MOE_A2A
+    if (epRank == 0) {
         printf("Rank %d: target_rank %d workspace at %p (base %p + stride %ld * %d)\n", static_cast<int>(epRank),
             target_rank, target_workspace, workspace_ptr, workspace.stride(0), target_rank);
-    }
+    }
+    #endif
187-190: Gate debug prints.

Same rationale as above.
- // Debug: print completion flags pointer - printf("Dispatch: Rank %d completion_flags[%d] at %p (workspace %p + offset %ld)\n", static_cast<int>(epRank), - target_rank, params.completion_flags[target_rank], target_workspace, offset); + // Debug: print completion flags pointer + #ifdef DEBUG_MOE_A2A + printf("Dispatch: Rank %d completion_flags[%d] at %p (workspace %p + offset %ld)\n", + static_cast<int>(epRank), target_rank, params.completion_flags[target_rank], target_workspace, offset); + #endif
262-266: Ensure payload is contiguous in combine.

Matches dispatch path and avoids unexpected strides.
     CHECK_TH_CUDA(payload);
+    CHECK_CONTIGUOUS(payload);
     TORCH_CHECK(payload.dim() == 3, "payload must be a 3D tensor [ep_size, max_tokens_per_rank, elements_per_token]");
330-330: Allocate localTokenCounter on the correct device.

It currently defaults to CPU; use sendIndices (or payload) options to place it on CUDA.
-    torch::Tensor localTokenCounter = torch::zeros({1}, torch::TensorOptions().dtype(torch::kInt32));
+    torch::Tensor localTokenCounter = torch::zeros({1}, sendIndices.options().dtype(torch::kInt32));
367-371: Gate debug prints.

Replace if (true) with a compile-time guard.

-    // Debug: print completion flags pointer
-    if (true)
-    {
-        printf("Combine: Rank %d completion_flags[%d] at %p (workspace %p + offset %ld)\n",
-            static_cast<int>(epRank), target_rank, params.completion_flags[target_rank], target_workspace, offset);
-    }
+    // Debug: print completion flags pointer
+    #ifdef DEBUG_MOE_A2A
+    printf("Combine: Rank %d completion_flags[%d] at %p (workspace %p + offset %ld)\n",
+        static_cast<int>(epRank), target_rank, params.completion_flags[target_rank], target_workspace, offset);
+    #endif
🧹 Nitpick comments (4)
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (4)
153-155: Avoid magic literal -1 for sendIndices; use a named constant.

Improves readability and aligns with the guideline to avoid literals in logic.
-    torch::Tensor sendIndices
-        = torch::full({localNumTokens, epSize}, -1, tokenSelectedExperts.options().dtype(torch::kInt32));
+    torch::Tensor sendIndices = torch::full(
+        {localNumTokens, epSize}, kINVALID_INDEX, tokenSelectedExperts.options().dtype(torch::kInt32));
// Place near line ~31 inside the anonymous namespace
constexpr int kINVALID_INDEX = -1;
156-161: Confirm param struct member names and stream type.

If headers adopted m-prefixed members, update assignments; also prefer .stream() to pass cudaStream_t explicitly.
Example diff (adjust field names to actual header):
-    params.token_selected_experts = tokenSelectedExperts.data_ptr<int32_t>();
-    params.num_payloads = static_cast<int32_t>(payloadDescriptors.size());
-    std::copy(payloadDescriptors.begin(), payloadDescriptors.end(), &params.payloads[0]);
+    params.mTokenSelectedExperts = tokenSelectedExperts.data_ptr<int32_t>();
+    params.mNumPayloads = static_cast<int32_t>(payloadDescriptors.size());
+    std::copy(payloadDescriptors.begin(), payloadDescriptors.end(), &params.mPayloads[0]);

-    params.max_tokens_per_rank = static_cast<int>(maxTokensPerRank);
-    params.send_counters = sendCounters.data_ptr<int>();
-    params.local_token_counter = localTokenCounter.data_ptr<int>();
-    params.send_indices = sendIndices.data_ptr<int>();
-    params.local_num_tokens = static_cast<int>(localNumTokens);
-    params.ep_size = static_cast<int>(epSize);
-    params.ep_rank = static_cast<int>(epRank);
-    params.top_k = static_cast<int>(topK);
-    params.num_experts_per_rank = static_cast<int>(numExperts) / static_cast<int>(epSize);
-    params.stream = at::cuda::getCurrentCUDAStream();
+    params.mMaxTokensPerRank = static_cast<int>(maxTokensPerRank);
+    params.mSendCounters = sendCounters.data_ptr<int>();
+    params.mLocalTokenCounter = localTokenCounter.data_ptr<int>();
+    params.mSendIndices = sendIndices.data_ptr<int>();
+    params.mLocalNumTokens = static_cast<int>(localNumTokens);
+    params.mEpSize = static_cast<int>(epSize);
+    params.mEpRank = static_cast<int>(epRank);
+    params.mTopK = static_cast<int>(topK);
+    params.mNumExpertsPerRank = static_cast<int>(numExperts) / static_cast<int>(epSize);
+    params.mStream = at::cuda::getCurrentCUDAStream().stream();

Also applies to: 191-201
324-329: Remove blocking syncs or gate behind debug.

cudaDeviceSynchronize() and a global MPI barrier here will stall the pipeline; prefer stream semantics or events. Leave only if required for debugging.
-    // TODO: Copy and sync here should be removed.
-    // Copy the entire payload to workspace using PyTorch tensor operations
-    recvBuffer.copy_(payload, /*non_blocking=*/true);
-    cudaDeviceSynchronize();
-    tensorrt_llm::mpi::MpiComm::world().barrier();
+    // Copy the entire payload to workspace using PyTorch tensor operations
+    recvBuffer.copy_(payload, /*non_blocking=*/true);
+    #ifdef DEBUG_MOE_A2A
+    cudaDeviceSynchronize();
+    tensorrt_llm::mpi::MpiComm::world().barrier();
+    #endif
200-200: Use raw CUDA stream handle explicitly.

Avoid relying on implicit conversions.
-    params.stream = at::cuda::getCurrentCUDAStream();
+    params.stream = at::cuda::getCurrentCUDAStream().stream();
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (1 hunks)
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh}: Closing braces of C++ namespaces must include a comment naming the namespace (e.g., } // namespace foo)
Avoid using literals (except 0, nullptr, true, false) directly in logic; use named constants for comparisons
Use Allman brace style in C++
Place semicolon of empty for/while loop on its own line
Use brace-delimited statements for bodies of switch/while/do/for and always brace if/else bodies
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Non-static, externally visible globals use g prefix with lowerCamelCase (e.g., gDontUseGlobalFoos)
Static or anonymous-namespace globals use s prefix with lowerCamelCase (e.g., sMutableStaticGlobal)
Locally visible static variables use s prefix (e.g., static std::once_flag sFlag)
Member variables use m prefix with CamelCase (public may omit but encouraged)
Constants (enums, globals, static consts, function-scope magic numbers) use k prefix with UPPER_SNAKE (e.g., kDIGIT_NUM)
Function-scope non-literal, non-magic constants use normal non-const naming (e.g., const bool pass)
If macros are necessary, name them in UPPER_SNAKE_CASE
Avoid Hungarian notation except allowed app’s hungarian like nb for counts
Constructor parameters conflicting with member names get a trailing underscore (e.g., foo_)
Use uppercase literal suffixes (e.g., 1234L not 1234l)
Format C++ with clang-format (LLVM style), max line length 120; justify any exceptions with clang-format off/on blocks
Use C++-style comments; C comments not allowed except special inline cases; single-line comments use //
Use inline parameter comments in calls when arguments aren’t obvious (e.g., /* checkForErrors = */ false)
Disable code with #if/#endif (optionally mnemonic conditions or no-op macros); do not comment out code; avoid dead code
Use the least forceful C++ cast; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void to T* with static_cas...
Files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
**/*.{cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cc,cpp,cxx,cu}: Prefer const or constexpr variables over #define for constants in C++
Declare variables const if not modified after initialization
Use smart pointers for heap allocation; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only exceptionally; avoid deprecated smart pointers
Avoid declaring large functions inline unless there’s a quantifiable benefit; remember in-class definitions are implicitly inline
Every defined function must be referenced at least once; avoid unused methods
Files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
**/*
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Filenames compiled into a target must be case-insensitively unique
Files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces, not tabs; indent 4 spaces
Files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
**/*.{cpp,cc,cxx,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
🧠 Learnings (3)
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-09-02T13:42:44.885Z
Learnt from: pcastonguay
PR: NVIDIA/TensorRT-LLM#7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.
Applied to files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
🧬 Code graph analysis (1)
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (1)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (4)
moe_a2a_dispatch_launch(214-252)
moe_a2a_dispatch_launch(214-214)
moe_a2a_combine_launch(493-542)
moe_a2a_combine_launch(493-493)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Actionable comments posted: 1
♻️ Duplicate comments (2)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (2)
483-485: Fix under-sized max_tokens_per_rank (risk: receiver overflow). Bound per receiver, not per sender. Use a safe worst-case upper bound across all senders.

```diff
- max_tokens_per_rank = max(all_num_tokens)
+ # Upper-bound the max tokens any single target rank can receive across all senders.
+ max_tokens_per_rank = sum(all_num_tokens)
```

Also applies to: 545-546
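To make the overflow risk concrete, here is a minimal, self-contained sketch of why sum() is the safe receiver-side bound while max() only covers the largest single sender (the token counts below are illustrative values, not taken from the test):

```python
# Illustrative only: in the worst case every sending rank routes all of its
# tokens to the same target rank, so a receiver may have to hold the sum of
# all senders' token counts; max() only sizes for the largest single sender.
all_num_tokens = [3, 8, 5, 2]  # hypothetical tokens per sending rank

worst_case_per_receiver = sum(all_num_tokens)   # 18 tokens could land on one rank
largest_single_sender = max(all_num_tokens)     # 8 -- under-sized when senders overlap

assert worst_case_per_receiver >= largest_single_sender
print(worst_case_per_receiver, largest_single_sender)
```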
588-588: Use a proper torch.device; current_device() returns an int and breaks tensor allocations. Create a torch.device tied to the rank and pass dtype through.

```diff
- device = torch.cuda.current_device()
+ device = torch.device("cuda", rank)

- rank_experts = create_experts(num_experts_per_rank, hidden_size, rank, device, dtype=torch.bfloat16)
+ rank_experts = create_experts(num_experts_per_rank, hidden_size, rank, device, dtype=dtype)
```

Also applies to: 632-632
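A small sketch of the difference, assuming a CUDA-enabled PyTorch build; `rank` here is just a stand-in for the test's EP rank:

```python
import torch

rank = 0  # hypothetical EP rank for illustration

# torch.cuda.current_device() returns a plain int (the device index),
# not a torch.device object.
idx = torch.cuda.current_device()
print(type(idx))  # <class 'int'>

# Building the device explicitly ties allocations to the intended GPU.
device = torch.device("cuda", rank)
x = torch.empty(4, device=device)
print(x.device)  # cuda:0
```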
🧹 Nitpick comments (5)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (5)
645-647: Compute completion_flags_offset from payload dtype instead of hardcoding bfloat16. Keeps the combine test correct across dtypes and prevents misaligned reads.

```diff
- completion_flags_offset = ep_size * max_tokens_per_rank * hidden_size * 2  # bfloat16 = 2 bytes
+ elements_per_token = hidden_states_recv.shape[-1]
+ completion_flags_offset = (
+     ep_size * max_tokens_per_rank * elements_per_token * hidden_states_recv.element_size()
+ )
```
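A dtype-agnostic way to derive the byte offset, sketched with made-up sizes (the tensor name mirrors the suggestion above; everything else is hypothetical):

```python
import torch

# Hypothetical sizes for illustration.
ep_size, max_tokens_per_rank, hidden_size = 4, 16, 8

hidden_states_recv = torch.zeros(ep_size, max_tokens_per_rank, hidden_size,
                                 dtype=torch.float16)

# element_size() gives bytes per element, so the offset tracks the payload
# dtype instead of hardcoding 2 bytes for bfloat16.
elements_per_token = hidden_states_recv.shape[-1]
completion_flags_offset = (ep_size * max_tokens_per_rank * elements_per_token
                           * hidden_states_recv.element_size())

assert completion_flags_offset == hidden_states_recv.numel() * hidden_states_recv.element_size()
```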
206-209: Parameterize combine payload dtype; test input dtype is currently ignored. Honor the test's dtype parameter so we actually exercise different precisions.

```diff
-def make_bfloat16_payloads(local_num_tokens: int, hidden_size: int, top_k: int,
-                           rank: int,
-                           token_selected_experts: torch.Tensor) -> list:
+def make_bfloat16_payloads(local_num_tokens: int, hidden_size: int, top_k: int,
+                           rank: int,
+                           token_selected_experts: torch.Tensor,
+                           dtype: torch.dtype = torch.bfloat16) -> list:
@@
-    hidden_states = torch.randn(local_num_tokens,
-                                hidden_size,
-                                dtype=torch.bfloat16,
-                                device='cuda')
+    hidden_states = torch.randn(local_num_tokens,
+                                hidden_size,
+                                dtype=dtype,
+                                device='cuda')
@@
-    token_final_scales = torch.rand(local_num_tokens,
-                                    top_k,
-                                    dtype=torch.bfloat16,
-                                    device='cuda')
+    token_final_scales = torch.rand(local_num_tokens,
+                                    top_k,
+                                    dtype=dtype,
+                                    device='cuda')

-    payloads = make_bfloat16_payloads(rank_local_tokens, hidden_size, top_k,
-                                      rank, token_selected_experts)
+    payloads = make_bfloat16_payloads(rank_local_tokens, hidden_size, top_k,
+                                      rank, token_selected_experts, dtype=dtype)
```

Also applies to: 212-219, 224-229, 607-609
676-681: Assert on combine completion flags (don't just print). Tests should fail if flags aren't set as expected; retain prints for context if needed.

```diff
-    if not torch.all(completion_flags == expected_flags).item():
-        print(
-            f"ERROR: Rank {rank} completion flags after combine: {completion_flags.tolist()}, expected: {expected_flags.tolist()}"
-        )
+    assert torch.all(completion_flags == expected_flags).item(), (
+        f"Rank {rank} completion flags after combine: {completion_flags.tolist()}, "
+        f"expected: {expected_flags.tolist()}"
+    )
```
426-426: Typo in comment.

```diff
- atol=0,  # Dispatch is pure copy, should expact exactly the same
+ atol=0,  # Dispatch is pure copy; expect exactly the same
```
91-92: Optional: Use the per-rank torch.device consistently for tensor creation. Prefer a shared device variable (torch.device("cuda", rank)) for all tensors to avoid implicit device reliance.
Also applies to: 173-176, 185-186, 196-197, 213-217, 228-229
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
🧬 Code graph analysis (1)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (5)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (1)
tensorrt_llm(22-123)
tensorrt_llm/_mnnvl_utils.py (3)
MnnvlMemory(53-336)
as_torch_strided_tensor(84-88)
supports_mnnvl(330-336)
tensorrt_llm/_torch/distributed/ops.py (2)
moe_a2a_combine(684-719)
moe_a2a_dispatch(637-681)
tensorrt_llm/mapping.py (1)
Mapping(32-513)
tests/unittest/conftest.py (1)
mpi_pool_executor(113-121)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Signed-off-by: Bo Li <[email protected]>
Combine implementation should be changed to be sending tokens from target_ranks. Signed-off-by: Bo Li <[email protected]>
Signed-off-by: Bo Li <[email protected]>
| /bot run --reuse-test | 
| PR_Github #22494 [ run ] triggered by Bot. Commit:  | 
| PR_Github #22494 [ run ] completed with state  | 
Signed-off-by: Bo Li <[email protected]>
| /bot run | 
| PR_Github #22509 [ run ] triggered by Bot. Commit:  | 
| PR_Github #22509 [ run ] completed with state  | 
| /bot run --reuse-test | 
| PR_Github #22527 [ run ] triggered by Bot. Commit:  | 
| PR_Github #22527 [ run ] completed with state  | 
| /bot run --reuse-test | 
| PR_Github #22535 [ run ] triggered by Bot. Commit:  | 
| PR_Github #22535 [ run ] completed with state  | 
| /bot run | 
| PR_Github #22541 [ run ] triggered by Bot. Commit:  | 
| PR_Github #22541 [ run ] completed with state  | 
Signed-off-by: Bo Li <[email protected]>
| /bot kill | 
| PR_Github #22564 [ kill ] triggered by Bot. Commit:  | 
| PR_Github #22564 [ kill ] completed with state  | 
Signed-off-by: Bo Li <[email protected]>
| /bot run | 
| PR_Github #22570 [ run ] triggered by Bot. Commit:  | 
| PR_Github #22570 [ run ] completed with state  | 
| /bot run --reuse-test | 
| PR_Github #22601 [ run ] triggered by Bot. Commit:  | 
| PR_Github #22601 [ run ] completed with state  | 
Summary by CodeRabbit
…MnnvlLatency, and the new one as MnnvlThroughput. …fused_moe OP.

Description
Perf comparison on gpt-oss-120b DP8EP8:

Test Coverage
tests/unittest/_torch/multi_gpu/test_moe_a2a.py
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.
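For example (a hypothetical combination of the flags documented below), /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast would run only that test stage and continue past individual failures instead of aborting early.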
- --reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.
skip
skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.