[TRTLLM-7318][feat] MnnvlThroughput AlltoAll implementation. #7499
Conversation
Caution: Review failed. An error occurred during the review process. Please try again later.

📝 Walkthrough

Adds a CUDA mixed-precision vec dtypes header, MOE all-to-all dispatch and combine kernels with headers and launchers, PyTorch bindings and build integration, multi-GPU unit tests, and a license entry for the new header.

Changes
Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor PyTorch
  participant THOP as Thop C++ op
  participant Launcher as Launcher
  participant CUDA as CUDA Kernel
  participant WS as Workspace/RecvBuffers
  participant Flags as CompletionFlags
  rect rgb(240,248,255)
    note over PyTorch,THOP: Dispatch flow
    PyTorch->>THOP: moe_a2a_dispatch(token_selected_experts, payloads, workspace,...)
    THOP->>THOP: validate inputs, partition workspace, prepare params
    THOP->>Launcher: moe_a2a_dispatch_launch(params)
    Launcher->>CUDA: moeA2ADispatchKernel<<<...>>>
    CUDA->>WS: vectorized writes to per-rank recv buffers
    CUDA->>Flags: st.release.sys.u32 to signal completion
    CUDA-->>THOP: update send_counters / send_indices
    THOP-->>PyTorch: return recv buffer views, send_counters, send_indices
  end
  rect rgb(245,255,240)
    note over PyTorch,THOP: Combine flow
    PyTorch->>THOP: moe_a2a_combine(send_indices, payload, workspace,...)
    THOP->>THOP: validate, copy local payload slice into workspace, prepare params
    THOP->>Launcher: moe_a2a_combine_launch(params)
    Launcher->>CUDA: moeA2ACombineKernel<T><<<...>>>
    CUDA->>Flags: ld.acquire.sys.u32 wait on completion flags
    CUDA->>WS: read per-rank recv buffers, vectorized-sum into output
    CUDA-->>THOP: output filled
    THOP-->>PyTorch: return combined output
  end
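To make the flow above concrete, here is a minimal usage sketch in Python. It is a sketch only: the wrapper names come from tensorrt_llm/_torch/distributed/ops.py as reviewed in this PR, but the argument order, shapes, and return values are assumptions inferred from the diagram, not the PR's verified API.

```python
# Hypothetical usage sketch of the dispatch -> expert compute -> combine sequence.
# Argument order and return values are assumptions based on the diagram above.
from tensorrt_llm._torch.distributed import moe_a2a_dispatch, moe_a2a_combine


def moe_layer_step(hidden_states, token_final_scales, token_selected_experts,
                   workspace, max_tokens_per_rank, ep_rank, ep_size, top_k, num_experts):
    # Dispatch: copy each token's payloads into the recv buffers of the ranks
    # owning its selected experts, then signal per-rank completion flags.
    recv_buffers, send_counters, send_indices = moe_a2a_dispatch(
        token_selected_experts, [hidden_states, token_final_scales], workspace,
        max_tokens_per_rank, ep_rank, ep_size, top_k, num_experts)

    # Local expert computation on the received tokens (placeholder: identity).
    expert_output = recv_buffers[0]

    # Combine: wait on completion flags, read per-rank recv buffers, and sum
    # contributions back into one output row per local token.
    output = moe_a2a_combine(send_indices, expert_output, workspace,
                             max_tokens_per_rank, ep_rank, ep_size, top_k)
    return output
```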
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers
Actionable comments posted: 20
🧹 Nitpick comments (17)
cpp/tensorrt_llm/common/vec_dtypes.cuh (3)
199-210: Add explicit “unsupported type” handling in get_exponent_bits/get_mantissa_bits.

Prevents “control reaches end of non-void function” in some toolchains if used with an unexpected T.
Apply this diff:
 template <typename T>
 constexpr FLASHINFER_INLINE int get_exponent_bits()
 {
@@
-    }
+    } else {
+        static_assert(!std::is_same_v<T, T>, "Unsupported type in get_exponent_bits");
+        return 0;
+    }
 }

 template <typename T>
 constexpr FLASHINFER_INLINE int get_mantissa_bits()
 {
@@
-    }
+    } else {
+        static_assert(!std::is_same_v<T, T>, "Unsupported type in get_mantissa_bits");
+        return 0;
+    }
 }

Also applies to: 212-224
1851-1854: Unify BF16 naming in vec2_dtype specialization.

The rest of the file uses nv_bfloat16/nv_bfloat162. Specialize on nv_bfloat16 for consistency.
Apply this diff:
-template <>
-struct vec2_dtype<__nv_bfloat16> {
-    using type = __nv_bfloat162;
-};
+template <>
+struct vec2_dtype<nv_bfloat16> {
+    using type = nv_bfloat162;
+};

If both aliases are used across the codebase, consider adding both specializations to be robust.
316-341: Inline PTX cvt instructions: consider adding memory clobbers and documenting constraints.

Minor, but adding “: 'memory'” makes ordering explicit; also add a one-line comment citing the required SM capability (Hopper+).
I can prepare a follow-up patch if you want this change applied throughout these four blocks.
Also applies to: 341-365, 366-397, 399-430
cpp/tensorrt_llm/thop/CMakeLists.txt (1)
1-15: Update header year to 2025 on next touch.

Non-blocking; repository headers are shifting to the current year on edits.
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (6)
45-57: Consider removing commented code or marking it as example documentation.

The commented function compute_nvfp4_workspace_size appears to be unused. If it's meant as documentation for how workspace size is calculated, consider moving it to a docstring or removing it entirely.
92-93: Address the TODO for FP8 testing.

There's a TODO comment indicating that FP8 testing failed. This should be tracked and resolved.
Would you like me to create an issue to track the FP8 testing failure investigation?
106-108: Remove commented debug code.

The commented debug code for constructing data with rank and token indices should be removed if not needed, or uncommented if it's required for debugging.
-    # Construct the data to contain info about send rank and local_token_idx, which is used for debugging
-    # token_final_scales[:, 0] = rank
-    # token_final_scales[:, 1] = torch.linspace(0, local_num_tokens - 1, local_num_tokens, dtype=torch.bfloat16, device='cuda')

Also applies to: 174-176
409-411: Improve error handling for MNNVL initialization.

The generic exception handling could mask specific errors. Consider catching specific exceptions or at least logging the error before skipping.
     try:
         MnnvlMemory.initialize()
         assert MnnvlMemory.supports_mnnvl()
-    except Exception:
-        pytest.skip("MNNVL not supported on this system")
+    except Exception as e:
+        pytest.skip(f"MNNVL not supported on this system: {str(e)}")
493-494: Remove or refactor debug print statements.

There are several debug print statements throughout the test. These should either be removed for production code or converted to use proper logging with appropriate log levels.
Consider using Python's logging module instead of print statements:
- print("Starting dispatch and combine on workers...") + logger.debug("Starting dispatch and combine on workers...")Also applies to: 502-511, 606-611, 617-620, 627-632
584-587: Remove redundant local MPI import and document barrier rationale
The file already imports MPI at the top, so you can remove the inline from mpi4py import MPI. Before each comm.Barrier(), add a brief comment explaining why this synchronization is required (e.g. “ensure all ranks have completed dispatch before combine”).

cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (3)
120-154: Potential integer overflow in bit manipulation.

The use of uint64_t for already_copied limits tracking to 64 ranks. While the code validates ep_size <= kMaxRanks (64), this implicit limitation should be documented or the implementation should be made more robust.

Add a static assertion or comment to document this limitation:
+    static_assert(kMaxRanks <= 64, "already_copied bit tracking requires kMaxRanks <= 64");
     uint64_t already_copied = 0;
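For context on why the cap is 64, here is a small self-contained Python sketch of the dedup pattern this comment refers to; the function name and the contiguous expert-to-rank partitioning are illustrative, while the kernel's actual logic lives in moeAlltoAllKernels.cu.

```python
# Illustrative only: a 64-bit mask can record "already sent to rank r" for r in [0, 64),
# which mirrors why kMaxRanks <= 64 must hold for the uint64_t `already_copied` trick.
def ranks_to_send(token_experts, num_experts_per_rank):
    already_copied = 0  # bitmask over target ranks
    targets = []
    for expert_id in token_experts:
        target_rank = expert_id // num_experts_per_rank  # contiguous expert partitioning
        if already_copied & (1 << target_rank):
            continue  # this token was already copied to that rank
        already_copied |= 1 << target_rank
        targets.append(target_rank)
    return targets


# Experts 3 and 2 live on rank 0, expert 7 on rank 1; the token is copied once per rank.
assert ranks_to_send([3, 7, 2], num_experts_per_rank=4) == [0, 1]
```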
125-127: Add boundary check for target_rank bit shift.

While ep_size is validated to be ≤ 64, it's good practice to add an assertion to catch any potential issues.

+    assert(target_rank < 64); // Ensure bit shift is valid
     if (already_copied & (1ULL << target_rank)) continue;
520-521: Add compile-time verification for unsupported types.

The default case uses a runtime check, but compile-time verification would be better.
Consider using a static_assert in a template specialization pattern or at least making the error message more informative:
-    default: TLLM_CHECK_WITH_INFO(false, "Unsupported data type for moe_a2a_combine");
+    default:
+        TLLM_CHECK_WITH_INFO(false,
+            "Unsupported data type for moe_a2a_combine. Supported types: HALF, BF16, FLOAT");
88-90: Consider using camelCase for function names.

According to the coding guidelines, methods should use lowerCamelCase. Consider renaming the launch functions.
-void moe_a2a_dispatch_launch(MoeA2ADispatchParams const& params);
+void moeA2ADispatchLaunch(MoeA2ADispatchParams const& params);
-void moe_a2a_combine_launch(MoeA2ACombineParams const& params);
+void moeA2ACombineLaunch(MoeA2ACombineParams const& params);

Also applies to: 119-120
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (3)
142-143: Consider using torch::empty instead of torch::zeros for performance.

The TODO comment suggests using torch::empty for better performance since the memory will be overwritten anyway.
-    // TODO: Use torch empty to replace initialization
-    // Create send_counters tensor - tracks number of tokens sent to each target rank
-    torch::Tensor sendCounters = torch::zeros({epSize}, tokenSelectedExperts.options().dtype(torch::kInt32));
+    // Create send_counters tensor - tracks number of tokens sent to each target rank
+    torch::Tensor sendCounters = torch::empty({epSize}, tokenSelectedExperts.options().dtype(torch::kInt32));
+    sendCounters.zero_(); // Explicitly zero if needed for correctness
273-289: Consider adding support for FP8 data types.

Given the AI summary mentions FP8 support in the mixed-precision CUDA header, consider adding support for FP8 data types here as well.
     else if (payload.dtype() == torch::kFloat32)
     {
         nvDtype = nvinfer1::DataType::kFLOAT;
     }
+    else if (payload.dtype() == torch::kFloat8_e4m3fn)
+    {
+        nvDtype = nvinfer1::DataType::kFP8; // Or appropriate FP8 variant
+    }
     else
     {
-        TORCH_CHECK(false, "Unsupported data type for payload");
+        TORCH_CHECK(false, "Unsupported data type for payload. Supported types: float16, bfloat16, float32");
     }
25-27: Consider moving anonymous namespace content after namespace declaration.

For better organization, consider moving the anonymous namespace inside the torch_ext namespace.
namespace torch_ext { - namespace {
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
- cpp/tensorrt_llm/common/vec_dtypes.cuh (1 hunks)
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (1 hunks)
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (1 hunks)
- cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (1 hunks)
- jenkins/license_cpp.json (1 hunks)
- tensorrt_llm/_torch/distributed/ops.py (1 hunks)
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh}: Closing braces of C++ namespaces must include a comment naming the namespace (e.g., } // namespace foo)
Avoid using literals (except 0, nullptr, true, false) directly in logic; use named constants for comparisons
Use Allman brace style in C++
Place semicolon of empty for/while loop on its own line
Use brace-delimited statements for bodies of switch/while/do/for and always brace if/else bodies
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Non-static, externally visible globals use g prefix with lowerCamelCase (e.g., gDontUseGlobalFoos)
Static or anonymous-namespace globals use s prefix with lowerCamelCase (e.g., sMutableStaticGlobal)
Locally visible static variables use s prefix (e.g., static std::once_flag sFlag)
Member variables use m prefix with CamelCase (public may omit but encouraged)
Constants (enums, globals, static consts, function-scope magic numbers) use k prefix with UPPER_SNAKE (e.g., kDIGIT_NUM)
Function-scope non-literal, non-magic constants use normal non-const naming (e.g., const bool pass)
If macros are necessary, name them in UPPER_SNAKE_CASE
Avoid Hungarian notation except allowed app’s hungarian like nb for counts
Constructor parameters conflicting with member names get a trailing underscore (e.g., foo_)
Use uppercase literal suffixes (e.g., 1234L not 1234l)
Format C++ with clang-format (LLVM style), max line length 120; justify any exceptions with clang-format off/on blocks
Use C++-style comments; C comments not allowed except special inline cases; single-line comments use //
Use inline parameter comments in calls when arguments aren’t obvious (e.g., /* checkForErrors = */ false)
Disable code with #if/#endif (optionally mnemonic conditions or no-op macros); do not comment out code; avoid dead code
Use the least forceful C++ cast; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void to T* with static_cas...
Files:
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
- cpp/tensorrt_llm/common/vec_dtypes.cuh
**/*.{h,hpp,hh,hxx}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx}: Prefer const or constexpr over #define for constants in C++ headers
Use Doxygen for documenting interfaces; use //! for comments and //!< for member annotations in C++
Use include guards in headers with symbol format TRTLLM__H (no underscores prefix/suffix; filename only)
Files:
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
**/*
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Filenames compiled into a target must be case-insensitively unique
Files:
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- cpp/tensorrt_llm/thop/CMakeLists.txt
- jenkins/license_cpp.json
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
- tensorrt_llm/_torch/distributed/ops.py
- cpp/tensorrt_llm/common/vec_dtypes.cuh
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces, not tabs; indent 4 spaces
Files:
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
- tensorrt_llm/_torch/distributed/ops.py
- cpp/tensorrt_llm/common/vec_dtypes.cuh
**/*.{cpp,cc,cxx,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
- tensorrt_llm/_torch/distributed/ops.py
- cpp/tensorrt_llm/common/vec_dtypes.cuh
**/*.{cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cc,cpp,cxx,cu}: Prefer const or constexpr variables over #define for constants in C++
Declare variables const if not modified after initialization
Use smart pointers for heap allocation; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only exceptionally; avoid deprecated smart pointers
Avoid declaring large functions inline unless there’s a quantifiable benefit; remember in-class definitions are implicitly inline
Every defined function must be referenced at least once; avoid unused methods
Files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs (Python)
Maintain module namespace on import: prefer from package.subpackage import foo; use foo.Symbol()
Python filenames use snake_case
Python class names use PascalCase
Python functions and methods use snake_case
Python local variables use snake_case; if starting with a number concept, prefix with k (e.g., k_99th_percentile)
Python global variables use G_ prefix with UPPER_SNAKE_CASE
Python constants use UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes
Initialize all externally visible class members in init
For public interfaces, prefer docstrings over comments; comments should be for in-function or file-local interfaces
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes and variables inline with docstrings immediately after assignment
Avoid reflection when a non-reflective approach suffices
Limit except clauses to specific exceptions where possible
When using try/except for duck-typing, keep try body minimal and move logic to else
Files:
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
- tensorrt_llm/_torch/distributed/ops.py
🧠 Learnings (2)
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
🧬 Code graph analysis (4)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (1)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (4)
moe_a2a_dispatch_launch(194-232)
moe_a2a_dispatch_launch(194-194)
moe_a2a_combine_launch(473-522)
moe_a2a_combine_launch(473-473)
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (1)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (4)
moe_a2a_dispatch_launch(194-232)
moe_a2a_dispatch_launch(194-194)
moe_a2a_combine_launch(473-522)
moe_a2a_combine_launch(473-473)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (4)
tensorrt_llm/_mnnvl_utils.py (3)
MnnvlMemory(53-336)
as_torch_strided_tensor(84-88)
supports_mnnvl(330-336)
tensorrt_llm/_torch/distributed/ops.py (2)
moe_a2a_combine(646-651)
moe_a2a_dispatch(637-643)
tensorrt_llm/mapping.py (1)
Mapping(32-513)
tests/unittest/conftest.py (1)
mpi_pool_executor(113-121)
tensorrt_llm/_torch/distributed/ops.py (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
top_k(221-221)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (9)
cpp/tensorrt_llm/common/vec_dtypes.cuh (2)
741-764: Alignment assumptions for vectorized ld/st helpers.

All these paths reinterpret to int4/uint2/float4. Please confirm the pointers are 16-byte aligned at call sites; otherwise add debug asserts or non-vectorized fallbacks.
I can scan call sites and add DCHECK-style alignment guards if helpful.
Also applies to: 1014-1037, 1202-1226, 1409-1432, 1652-1675, 1797-1820
53-77: Verify CUDA architecture targets and fallback compilation
Confirm your build configuration (e.g., CMakeLists or compiler flags) includes compute capability ≥ 90 for these helpers, and test building for sm_80 to ensure the fallback path (plain store + __threadfence_system()) compiles successfully.

cpp/tensorrt_llm/thop/CMakeLists.txt (2)
73-73: Add moeAlltoAllOp.cpp to th_common: LGTM.
34-36: Remove NCCL/NVSHMEM link check for A2A op
moeAlltoAllOp.cpp contains no direct NCCL or NVSHMEM references; NCCL is already linked only for ncclCommunicatorOp under ENABLE_MULTI_DEVICE.

Likely an incorrect or invalid review comment.
tensorrt_llm/_torch/distributed/ops.py (1)
637-643: LGTM! Clean wrapper implementation.

The wrapper functions are well-implemented with clear docstrings explaining their purpose. The parameter forwarding is correct and maintains the same signature as the underlying PyTorch operations.
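For readers unfamiliar with the pattern being praised here, a thin forwarding wrapper typically looks like the sketch below. This is not the PR's code: the torch.ops namespace and the parameter list are assumptions made for illustration.

```python
# Hypothetical sketch of a thin Python wrapper around a registered custom op.
# The real wrappers live in tensorrt_llm/_torch/distributed/ops.py; the
# `torch.ops.trtllm` namespace and argument names here are assumptions.
from typing import List
import torch


def moe_a2a_dispatch(token_selected_experts: torch.Tensor,
                     payloads: List[torch.Tensor],
                     workspace: torch.Tensor,
                     max_tokens_per_rank: int,
                     ep_rank: int, ep_size: int,
                     top_k: int, num_experts: int):
    """Forward to the C++ dispatch op; returns recv buffers, send_counters, send_indices."""
    return torch.ops.trtllm.moe_a2a_dispatch(
        token_selected_experts, payloads, workspace,
        max_tokens_per_rank, ep_rank, ep_size, top_k, num_experts)
```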
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (2)
1-21: LGTM! Clean header structure with proper include guards and licensing.

The copyright header and includes are properly structured. Good organization of the dependencies.
22-23: Add closing brace comment for the namespace.

According to the coding guidelines, closing braces of C++ namespaces must include a comment naming the namespace.
-} // namespace tensorrt_llm::kernels::moe_a2a
+} // namespace moe_a2a
+} // namespace kernels
+} // namespace tensorrt_llm

⛔ Skipped due to learnings
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-01T18:06:21.617Z
Learning: Applies to **/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh} : Closing braces of C++ namespaces must include a comment naming the namespace (e.g., } // namespace foo)
1-16: LGTM! Well-structured file with proper copyright header.

The copyright header and includes follow the coding guidelines.
197-200: No update required for dispatch function name
The call to moe_a2a_dispatch_launch(params) matches the declaration in moeAlltoAllKernels.h (line 89) and no renaming was performed.

Likely an incorrect or invalid review comment.
        
          
Actionable comments posted: 3
🧹 Nitpick comments (4)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (4)
525-525: Remove unused variables flagged by Ruff.

Clean up to keep tests tidy.
Apply:
-    combine_payload = torch.zeros(rank_local_tokens, hidden_size, dtype=dtype, device=device)
-    expert_rank = expert_idx % ep_size
-    all_original_hidden_states = [r[1] for r in all_results]

Also applies to: 551-551, 598-598
88-95: Scaling factors dtype mismatches the comment.

Comment says FP8; code uses float32. If A2A is dtype-agnostic this is fine; otherwise, ensure the expected dtype is used.
If FP8 is required and supported in this test env, switch to torch.float8_e4m3fn; otherwise update the comment to avoid confusion.
390-393: Avoid hardcoding 512 MiB workspace; compute exact size.

Precomputing per-payload bytes improves determinism and reduces memory waste.
Consider reviving compute_nvfp4_workspace_size(...) and using it here, with a safety margin if needed.

Also applies to: 452-455
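As a point of reference, a helper along the following lines could replace the hardcoded 512 MiB. It is a sketch under the layout assumption stated elsewhere in this review (payload slabs packed contiguously, followed by one completion flag per source rank); the flag width, element sizes, and the example numbers are assumptions, and real kernels may add alignment or padding.

```python
# Hedged sketch: compute a per-rank workspace size from payload shapes instead of
# hardcoding 512 MiB. Assumes contiguous packing of payload slabs plus int32 flags.
def compute_workspace_size_per_rank(payload_elem_sizes, payload_elems_per_token,
                                    ep_size, max_tokens_per_rank,
                                    flag_bytes=4, safety_margin=1.0):
    payload_bytes = sum(
        elem_size * elems_per_token * ep_size * max_tokens_per_rank
        for elem_size, elems_per_token in zip(payload_elem_sizes, payload_elems_per_token))
    flags_bytes = ep_size * flag_bytes
    return int((payload_bytes + flags_bytes) * safety_margin)


# Example: bfloat16 hidden states (hidden=7168) + bfloat16 scales (top_k=8),
# ep_size=4, max_tokens_per_rank=1024 -> roughly 56 MiB instead of 512 MiB.
size = compute_workspace_size_per_rank([2, 2], [7168, 8], ep_size=4, max_tokens_per_rank=1024)
```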
457-475: Reduce noisy prints in tests.

Use pytest logging or keep output minimal to avoid cluttering CI logs.
You can replace them with pytest verbosity-guarded logs or drop them after stabilizing.

Also applies to: 477-481
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Filenames compiled into a target must be case-insensitively unique
Files:
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces, not tabs; indent 4 spaces
Files:
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs (Python)
Maintain module namespace on import: prefer from package.subpackage import foo; use foo.Symbol()
Python filenames use snake_case
Python class names use PascalCase
Python functions and methods use snake_case
Python local variables use snake_case; if starting with a number concept, prefix with k (e.g., k_99th_percentile)
Python global variables use G_ prefix with UPPER_SNAKE_CASE
Python constants use UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes
Initialize all externally visible class members in init
For public interfaces, prefer docstrings over comments; comments should be for in-function or file-local interfaces
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes and variables inline with docstrings immediately after assignment
Avoid reflection when a non-reflective approach suffices
Limit except clauses to specific exceptions where possible
When using try/except for duck-typing, keep try body minimal and move logic to else
Files:
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
**/*.{cpp,cc,cxx,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
🧬 Code graph analysis (1)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (4)
tensorrt_llm/_mnnvl_utils.py (3)
MnnvlMemory(53-336)
as_torch_strided_tensor(84-88)
supports_mnnvl(330-336)
tensorrt_llm/_torch/distributed/ops.py (2)
moe_a2a_combine(646-651)
moe_a2a_dispatch(637-643)
tensorrt_llm/mapping.py (1)
Mapping(32-513)
tests/unittest/conftest.py (1)
mpi_pool_executor(113-121)
🪛 Ruff (0.12.2)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py
524-524: Local variable combine_payload is assigned to but never used
Remove assignment to unused variable combine_payload
(F841)
551-551: Local variable expert_rank is assigned to but never used
Remove assignment to unused variable expert_rank
(F841)
597-597: Local variable all_original_hidden_states is assigned to but never used
Remove assignment to unused variable all_original_hidden_states
(F841)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (2)
185-197: Confirm completion_flags_offset layout/alignment.

The offset assumes no per-payload alignment/padding in the workspace. If kernels impose alignment, the slice may be wrong.
Can you confirm the dispatch workspace layout guarantees contiguous packing of all payload slabs with no extra alignment, or expose a helper to compute the flags offset?
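A helper of roughly the shape below is what this question is asking for. It is only a sketch under the same no-padding assumption (payload slabs packed back-to-back, flags immediately after); if the kernels pad or align slabs, this offset would be wrong, which is exactly the concern raised here. The function name is hypothetical.

```python
# Hedged sketch of a helper to compute the completion-flags offset for the test.
# Assumes every payload slab is packed contiguously with no per-payload alignment.
def completion_flags_offset(payloads, ep_size: int, max_tokens_per_rank: int) -> int:
    """Byte offset of the completion flags region within one rank's workspace."""
    offset = 0
    for p in payloads:
        elements_per_token = p.shape[-1]
        offset += ep_size * max_tokens_per_rank * elements_per_token * p.element_size()
    return offset


# Hypothetical usage in the test: slice ep_size int32 flags starting at that offset, e.g.
# flags = workspace[rank, off:off + ep_size * 4].view(torch.int32)
```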
39-43: LGTM: deterministic seed and quiet logger.

Good practice for reproducible multi-GPU tests.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (1)
tensorrt_llm/_torch/distributed/__init__.py (1)
1-1: Add NVIDIA copyright header.

All source files must include the NVIDIA copyright header for the current year.
Apply at the top:
+#
+# Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
+#
+"""Distributed public API exports."""
♻️ Duplicate comments (6)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (3)
383-394: Commented synchronization blocks: remove or re-enable with justification.

Large sections of combine-kernel synchronization are commented out. Either delete dead code or reintroduce it with a clear rationale and tests.
Also applies to: 453-487
176-178: Resolve the TODO and enforce system-wide visibility before signaling.

Cross-GPU visibility of payload writes from other threads requires a system fence prior to release-store signaling.
-    // TODO: Is this membar sufficient/necessary?
-    // __threadfence_system();
+    // Ensure all payload writes by any thread on this device are visible system-wide before signaling completion.
+    __threadfence_system();
427-446: Remove kernel printf or guard with a debug macro.

Kernel-side printf is a perf hazard; gate under a compile-time flag or remove.
-    if (lane_id == 0)
-    {
+    #ifdef DEBUG_MOE_A2A
+    if (lane_id == 0) {
     ...
-    }
+    }
+    #endif

cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (3)
167-173: Gate debug prints behind a macro.

Avoid unconditional device prints in production.
-    // Debug: print workspace addresses
-    if (epRank == 0)
-    {
+    // Debug: print workspace addresses
+    #ifdef DEBUG_MOE_A2A
+    if (epRank == 0) {
         printf("Rank %d: target_rank %d workspace at %p (base %p + stride %ld * %d)\n", ...
-    }
+    }
+    #endif
@@
-    // Debug: print completion flags pointer
-    printf("Dispatch: Rank %d completion_flags[%d] at %p (workspace %p + offset %ld)\n", ...
+    // Debug: print completion flags pointer
+    #ifdef DEBUG_MOE_A2A
+    printf("Dispatch: Rank %d completion_flags[%d] at %p (workspace %p + offset %ld)\n", ...
+    #endif

Also applies to: 186-189
326-326: Allocate localTokenCounter on the correct device.

Currently defaults to CPU; allocate on CUDA to match kernel pointer usage.
-    torch::Tensor localTokenCounter = torch::zeros({1}, torch::TensorOptions().dtype(torch::kInt32));
+    torch::Tensor localTokenCounter = torch::zeros({1}, payload.options().dtype(torch::kInt32));
362-367: Gate combine debug print.

Remove or guard with #ifdef DEBUG_MOE_A2A.

-    // Debug: print completion flags pointer
-    if (true)
-    {
-        printf("Combine: Rank %d completion_flags[%d] at %p (workspace %p + offset %ld)\n", ...
-    }
+    // Debug: print completion flags pointer
+    #ifdef DEBUG_MOE_A2A
+    printf("Combine: Rank %d completion_flags[%d] at %p (workspace %p + offset %ld)\n", ...
+    #endif
🧹 Nitpick comments (5)
tensorrt_llm/_torch/distributed/__init__.py (1)
5-6: Confirm wrappers are imported from ops and keep ordering consistent.

The new exports look correct. For readability, consider importing moe_a2a_dispatch before moe_a2a_combine to match the __all__ order.

-    MoEAllReduceParams, allgather, moe_a2a_combine,
-    moe_a2a_dispatch, reducescatter,
+    MoEAllReduceParams, allgather, moe_a2a_dispatch,
+    moe_a2a_combine, reducescatter,

cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (3)
113-125: Flag initialization: clarify scope and ordering.

Good to init remote flags once. Add a grid-wide single initializer (e.g., blockIdx.x==0 && threadIdx.x==0 is fine) but document that other CTAs may already copy; __threadfence() is device-scoped only; keep the later __threadfence_system() before signaling.

-    // Device-level threadfence prevents reordering the writing of the flags (which might be done by another thread
-    // in the same rank) in front of the initializations above.
-    __threadfence();
+    // Device fence to order this thread's writes; system fence occurs at completion signaling.
+    __threadfence();
16-22: Drop unused includes.

quantization.cuh and cooperative_groups.h appear unused here; remove to speed up builds.

-#include "tensorrt_llm/kernels/quantization.cuh"
-#include <cooperative_groups.h>
214-252: Runtime validations present; add assert for token indices range.

Consider validating expert IDs on host (in the op) to be in [0, num_experts) to avoid OOB in compute_target_rank_id.

Would you like a patch in moeAlltoAllOp.cpp to enforce this?
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (1)
315-325: Prefer non-blocking copy only when streams/devices are compatible.
copy_(..., /*non_blocking=*/true) is fine; ensure streams/devices align. Optionally expose a bool non_blocking param for callers.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (1 hunks)
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (1 hunks)
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (1 hunks)
- tensorrt_llm/_torch/distributed/__init__.py (2 hunks)
- tensorrt_llm/_torch/distributed/ops.py (1 hunks)
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/_torch/distributed/ops.py
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Filenames compiled into a target must be case-insensitively unique
Files:
- tensorrt_llm/_torch/distributed/__init__.py
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces, not tabs; indent 4 spaces
Files:
- tensorrt_llm/_torch/distributed/__init__.py
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs (Python)
Maintain module namespace on import: prefer from package.subpackage import foo; use foo.Symbol()
Python filenames use snake_case
Python class names use PascalCase
Python functions and methods use snake_case
Python local variables use snake_case; if starting with a number concept, prefix with k (e.g., k_99th_percentile)
Python global variables use G_ prefix with UPPER_SNAKE_CASE
Python constants use UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes
Initialize all externally visible class members in init
For public interfaces, prefer docstrings over comments; comments should be for in-function or file-local interfaces
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes and variables inline with docstrings immediately after assignment
Avoid reflection when a non-reflective approach suffices
Limit except clauses to specific exceptions where possible
When using try/except for duck-typing, keep try body minimal and move logic to else
Files:
- tensorrt_llm/_torch/distributed/__init__.py
**/*.{cpp,cc,cxx,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- tensorrt_llm/_torch/distributed/__init__.py
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh}: Closing braces of C++ namespaces must include a comment naming the namespace (e.g., } // namespace foo)
Avoid using literals (except 0, nullptr, true, false) directly in logic; use named constants for comparisons
Use Allman brace style in C++
Place semicolon of empty for/while loop on its own line
Use brace-delimited statements for bodies of switch/while/do/for and always brace if/else bodies
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Non-static, externally visible globals use g prefix with lowerCamelCase (e.g., gDontUseGlobalFoos)
Static or anonymous-namespace globals use s prefix with lowerCamelCase (e.g., sMutableStaticGlobal)
Locally visible static variables use s prefix (e.g., static std::once_flag sFlag)
Member variables use m prefix with CamelCase (public may omit but encouraged)
Constants (enums, globals, static consts, function-scope magic numbers) use k prefix with UPPER_SNAKE (e.g., kDIGIT_NUM)
Function-scope non-literal, non-magic constants use normal non-const naming (e.g., const bool pass)
If macros are necessary, name them in UPPER_SNAKE_CASE
Avoid Hungarian notation except allowed app’s hungarian like nb for counts
Constructor parameters conflicting with member names get a trailing underscore (e.g., foo_)
Use uppercase literal suffixes (e.g., 1234L not 1234l)
Format C++ with clang-format (LLVM style), max line length 120; justify any exceptions with clang-format off/on blocks
Use C++-style comments; C comments not allowed except special inline cases; single-line comments use //
Use inline parameter comments in calls when arguments aren’t obvious (e.g., /* checkForErrors = */ false)
Disable code with #if/#endif (optionally mnemonic conditions or no-op macros); do not comment out code; avoid dead code
Use the least forceful C++ cast; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void to T* with static_cas...
Files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
**/*.{cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cc,cpp,cxx,cu}: Prefer const or constexpr variables over #define for constants in C++
Declare variables const if not modified after initialization
Use smart pointers for heap allocation; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only exceptionally; avoid deprecated smart pointers
Avoid declaring large functions inline unless there’s a quantifiable benefit; remember in-class definitions are implicitly inline
Every defined function must be referenced at least once; avoid unused methods
Files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
🧠 Learnings (3)
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
- cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
🧬 Code graph analysis (3)
tensorrt_llm/_torch/distributed/__init__.py (1)
tensorrt_llm/_torch/distributed/ops.py (2)
moe_a2a_combine(684-719)
moe_a2a_dispatch(637-681)
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (1)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (4)
moe_a2a_dispatch_launch(214-252)
moe_a2a_dispatch_launch(214-214)
moe_a2a_combine_launch(493-542)
moe_a2a_combine_launch(493-493)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (2)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (1)
compute_target_rank_id(58-70)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
__syncwarp(703-707)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/distributed/__init__.py (1)
19-20: LGTM: public re-exports added.

Exposing moe_a2a_dispatch and moe_a2a_combine via __all__ aligns with the new wrappers.

cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (2)
146-154: Good: allocate counters/indices on the same device as inputs.
sendCounters, localTokenCounter, and sendIndices correctly inherit device from tokenSelectedExperts.
382-398: Bindings look correct.

Operator schemas match Python wrappers; CUDA impl registered as expected.
        
          
Actionable comments posted: 2
♻️ Duplicate comments (8)
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (8)
58-63: Adopt renamed constants and add epSize upper-bound check.

Update to new constant names and validate epSize against kMaxRanks.
 using tensorrt_llm::kernels::moe_a2a::PayloadDescriptor;
 using tensorrt_llm::kernels::moe_a2a::MoeA2ADispatchParams;
 using tensorrt_llm::kernels::moe_a2a::moe_a2a_dispatch_launch;
-using tensorrt_llm::kernels::moe_a2a::kMaxTopK;
-using tensorrt_llm::kernels::moe_a2a::kMaxPayloads;
+using tensorrt_llm::kernels::moe_a2a::kMAX_TOP_K;
+using tensorrt_llm::kernels::moe_a2a::kMAX_PAYLOADS;
+using tensorrt_llm::kernels::moe_a2a::kMaxRanks;

 TORCH_CHECK(epRank >= 0 && epRank < epSize, "epRank must be in the range [0, epSize)");
-TORCH_CHECK(topK > 0 && topK <= kMaxTopK, "topK must be in the range (0, kMaxTopK]");
+TORCH_CHECK(epSize > 0 && epSize <= kMaxRanks, "epSize must be in the range (0, kMaxRanks]");
+TORCH_CHECK(topK > 0 && topK <= kMAX_TOP_K, "topK must be in the range (0, kMAX_TOP_K]");
 TORCH_CHECK(!inputPayloads.empty(), "inputPayloads must not be empty");
-TORCH_CHECK(inputPayloads.size() <= kMaxPayloads, "Too many input payloads");
+TORCH_CHECK(inputPayloads.size() <= kMAX_PAYLOADS, "Too many input payloads");

Also applies to: 72-76
64-78: Validate expert IDs fall within [0, numExperts).

Prevents invalid routing/UB from out-of-range expert IDs.
     TORCH_CHECK(numExperts >= epSize, "numExperts must be greater than or equal to epSize");
     TORCH_CHECK(numExperts % epSize == 0, "numExperts must be divisible by epSize for contiguous partitioning");
+    TORCH_CHECK(tokenSelectedExperts.min().item<int32_t>() >= 0
+            && tokenSelectedExperts.max().item<int32_t>() < numExperts,
+        "tokenSelectedExperts must be in [0, numExperts).");
138-145: Update PayloadDescriptor member names to new struct API.

Use the renamed fields with m-prefix to match header changes.
-    PayloadDescriptor desc{};
-    desc.src_data = inputPayloads[i].data_ptr();
-    desc.element_size = payloadElementSizes[i];
-    desc.elements_per_token = payloadElementsPerToken[i];
+    PayloadDescriptor desc{};
+    desc.mSrcData = inputPayloads[i].data_ptr();
+    desc.mElementSize = payloadElementSizes[i];
+    desc.mElementsPerToken = payloadElementsPerToken[i];
168-174: Gate debug prints.

Wrap with a debug macro or remove to avoid noisy stdout and perf impact.
-    // Debug: print workspace addresses
-    if (epRank == 0)
-    {
+    // Debug: print workspace addresses
+    #ifdef DEBUG_MOE_A2A
+    if (epRank == 0) {
         printf("Rank %d: target_rank %d workspace at %p (base %p + stride %ld * %d)\n", static_cast<int>(epRank),
             target_rank, target_workspace, workspace_ptr, workspace.stride(0), target_rank);
-    }
+    }
+    #endif
187-190: Gate debug prints.

Same rationale as above.
- // Debug: print completion flags pointer - printf("Dispatch: Rank %d completion_flags[%d] at %p (workspace %p + offset %ld)\n", static_cast<int>(epRank), - target_rank, params.completion_flags[target_rank], target_workspace, offset); + // Debug: print completion flags pointer + #ifdef DEBUG_MOE_A2A + printf("Dispatch: Rank %d completion_flags[%d] at %p (workspace %p + offset %ld)\n", + static_cast<int>(epRank), target_rank, params.completion_flags[target_rank], target_workspace, offset); + #endif
262-266: Ensure payload is contiguous in combine.

Matches dispatch path and avoids unexpected strides.
     CHECK_TH_CUDA(payload);
+    CHECK_CONTIGUOUS(payload);
     TORCH_CHECK(payload.dim() == 3, "payload must be a 3D tensor [ep_size, max_tokens_per_rank, elements_per_token]");
330-330: Allocate localTokenCounter on the correct device.

It currently defaults to CPU; use sendIndices (or payload) options to place it on CUDA.
-    torch::Tensor localTokenCounter = torch::zeros({1}, torch::TensorOptions().dtype(torch::kInt32));
+    torch::Tensor localTokenCounter = torch::zeros({1}, sendIndices.options().dtype(torch::kInt32));
367-371: Gate debug prints.

Replace if (true) with a compile-time guard.

-    // Debug: print completion flags pointer
-    if (true)
-    {
-        printf("Combine: Rank %d completion_flags[%d] at %p (workspace %p + offset %ld)\n",
-            static_cast<int>(epRank), target_rank, params.completion_flags[target_rank], target_workspace, offset);
-    }
+    // Debug: print completion flags pointer
+    #ifdef DEBUG_MOE_A2A
+    printf("Combine: Rank %d completion_flags[%d] at %p (workspace %p + offset %ld)\n",
+        static_cast<int>(epRank), target_rank, params.completion_flags[target_rank], target_workspace, offset);
+    #endif
🧹 Nitpick comments (4)
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (4)
153-155: Avoid magic literal -1 for sendIndices; use a named constant.

Improves readability and aligns with the guideline to avoid literals in logic.
-    torch::Tensor sendIndices
-        = torch::full({localNumTokens, epSize}, -1, tokenSelectedExperts.options().dtype(torch::kInt32));
+    torch::Tensor sendIndices = torch::full(
+        {localNumTokens, epSize}, kINVALID_INDEX, tokenSelectedExperts.options().dtype(torch::kInt32));
// Place near line ~31 inside the anonymous namespace
constexpr int kINVALID_INDEX = -1;
156-161: Confirm param struct member names and stream type.

If headers adopted m-prefixed members, update assignments; also prefer .stream() to pass cudaStream_t explicitly.
Example diff (adjust field names to actual header):
-    params.token_selected_experts = tokenSelectedExperts.data_ptr<int32_t>();
-    params.num_payloads = static_cast<int32_t>(payloadDescriptors.size());
-    std::copy(payloadDescriptors.begin(), payloadDescriptors.end(), &params.payloads[0]);
+    params.mTokenSelectedExperts = tokenSelectedExperts.data_ptr<int32_t>();
+    params.mNumPayloads = static_cast<int32_t>(payloadDescriptors.size());
+    std::copy(payloadDescriptors.begin(), payloadDescriptors.end(), &params.mPayloads[0]);

-    params.max_tokens_per_rank = static_cast<int>(maxTokensPerRank);
-    params.send_counters = sendCounters.data_ptr<int>();
-    params.local_token_counter = localTokenCounter.data_ptr<int>();
-    params.send_indices = sendIndices.data_ptr<int>();
-    params.local_num_tokens = static_cast<int>(localNumTokens);
-    params.ep_size = static_cast<int>(epSize);
-    params.ep_rank = static_cast<int>(epRank);
-    params.top_k = static_cast<int>(topK);
-    params.num_experts_per_rank = static_cast<int>(numExperts) / static_cast<int>(epSize);
-    params.stream = at::cuda::getCurrentCUDAStream();
+    params.mMaxTokensPerRank = static_cast<int>(maxTokensPerRank);
+    params.mSendCounters = sendCounters.data_ptr<int>();
+    params.mLocalTokenCounter = localTokenCounter.data_ptr<int>();
+    params.mSendIndices = sendIndices.data_ptr<int>();
+    params.mLocalNumTokens = static_cast<int>(localNumTokens);
+    params.mEpSize = static_cast<int>(epSize);
+    params.mEpRank = static_cast<int>(epRank);
+    params.mTopK = static_cast<int>(topK);
+    params.mNumExpertsPerRank = static_cast<int>(numExperts) / static_cast<int>(epSize);
+    params.mStream = at::cuda::getCurrentCUDAStream().stream();

Also applies to: 191-201
324-329: Remove blocking syncs or gate behind debug.

cudaDeviceSynchronize() and a global MPI barrier here will stall the pipeline; prefer stream semantics or events. Leave only if required for debugging.
-    // TODO: Copy and sync here should be removed.
-    // Copy the entire payload to workspace using PyTorch tensor operations
-    recvBuffer.copy_(payload, /*non_blocking=*/true);
-    cudaDeviceSynchronize();
-    tensorrt_llm::mpi::MpiComm::world().barrier();
+    // Copy the entire payload to workspace using PyTorch tensor operations
+    recvBuffer.copy_(payload, /*non_blocking=*/true);
+    #ifdef DEBUG_MOE_A2A
+    cudaDeviceSynchronize();
+    tensorrt_llm::mpi::MpiComm::world().barrier();
+    #endif
200-200: Use raw CUDA stream handle explicitly.

Avoid relying on implicit conversions.
-    params.stream = at::cuda::getCurrentCUDAStream();
+    params.stream = at::cuda::getCurrentCUDAStream().stream();
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (1 hunks)
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh}: Closing braces of C++ namespaces must include a comment naming the namespace (e.g., } // namespace foo)
Avoid using literals (except 0, nullptr, true, false) directly in logic; use named constants for comparisons
Use Allman brace style in C++
Place semicolon of empty for/while loop on its own line
Use brace-delimited statements for bodies of switch/while/do/for and always brace if/else bodies
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Non-static, externally visible globals use g prefix with lowerCamelCase (e.g., gDontUseGlobalFoos)
Static or anonymous-namespace globals use s prefix with lowerCamelCase (e.g., sMutableStaticGlobal)
Locally visible static variables use s prefix (e.g., static std::once_flag sFlag)
Member variables use m prefix with CamelCase (public may omit but encouraged)
Constants (enums, globals, static consts, function-scope magic numbers) use k prefix with UPPER_SNAKE (e.g., kDIGIT_NUM)
Function-scope non-literal, non-magic constants use normal non-const naming (e.g., const bool pass)
If macros are necessary, name them in UPPER_SNAKE_CASE
Avoid Hungarian notation except allowed app’s hungarian like nb for counts
Constructor parameters conflicting with member names get a trailing underscore (e.g., foo_)
Use uppercase literal suffixes (e.g., 1234L not 1234l)
Format C++ with clang-format (LLVM style), max line length 120; justify any exceptions with clang-format off/on blocks
Use C++-style comments; C comments not allowed except special inline cases; single-line comments use //
Use inline parameter comments in calls when arguments aren’t obvious (e.g., /* checkForErrors = */ false)
Disable code with #if/#endif (optionally mnemonic conditions or no-op macros); do not comment out code; avoid dead code
Use the least forceful C++ cast; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void to T* with static_cas...
Files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
**/*.{cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cc,cpp,cxx,cu}: Prefer const or constexpr variables over #define for constants in C++
Declare variables const if not modified after initialization
Use smart pointers for heap allocation; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only exceptionally; avoid deprecated smart pointers
Avoid declaring large functions inline unless there’s a quantifiable benefit; remember in-class definitions are implicitly inline
Every defined function must be referenced at least once; avoid unused methods
Files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
**/*
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Filenames compiled into a target must be case-insensitively unique
Files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces, not tabs; indent 4 spaces
Files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
**/*.{cpp,cc,cxx,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
🧠 Learnings (3)
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
📚 Learning: 2025-09-02T13:42:44.885Z
Learnt from: pcastonguay
PR: NVIDIA/TensorRT-LLM#7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.
Applied to files:
- cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
🧬 Code graph analysis (1)
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (1)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (4)
moe_a2a_dispatch_launch(214-252)
moe_a2a_dispatch_launch(214-214)
moe_a2a_combine_launch(493-542)
moe_a2a_combine_launch(493-493)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Actionable comments posted: 1
♻️ Duplicate comments (2)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (2)
483-485: Fix under-sized max_tokens_per_rank (risk: receiver overflow). Bound per receiver, not per sender. Use a safe worst-case upper bound across all senders.

```diff
- max_tokens_per_rank = max(all_num_tokens)
+ # Upper-bound the max tokens any single target rank can receive across all senders.
+ max_tokens_per_rank = sum(all_num_tokens)
```

Also applies to: 545-546
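To make the overflow risk concrete, here is a minimal, self-contained sketch of why sum() is the safe receiver-side bound while max() only covers the largest single sender (the token counts below are illustrative values, not taken from the test):

```python
# Illustrative only: in the worst case every sending rank routes all of its
# tokens to the same target rank, so a receiver may have to hold the sum of
# all senders' token counts; max() only sizes for the largest single sender.
all_num_tokens = [3, 8, 5, 2]  # hypothetical tokens per sending rank

worst_case_per_receiver = sum(all_num_tokens)   # 18 tokens could land on one rank
largest_single_sender = max(all_num_tokens)     # 8 -- under-sized when senders overlap

assert worst_case_per_receiver >= largest_single_sender
print(worst_case_per_receiver, largest_single_sender)
```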
588-588: Use a proper torch.device; current_device() returns an int and breaks tensor allocations. Create a torch.device tied to the rank and pass dtype through.

```diff
- device = torch.cuda.current_device()
+ device = torch.device("cuda", rank)

- rank_experts = create_experts(num_experts_per_rank, hidden_size, rank, device, dtype=torch.bfloat16)
+ rank_experts = create_experts(num_experts_per_rank, hidden_size, rank, device, dtype=dtype)
```

Also applies to: 632-632
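A small sketch of the difference, assuming a CUDA-enabled PyTorch build; `rank` here is just a stand-in for the test's EP rank:

```python
import torch

rank = 0  # hypothetical EP rank for illustration

# torch.cuda.current_device() returns a plain int (the device index),
# not a torch.device object.
idx = torch.cuda.current_device()
print(type(idx))  # <class 'int'>

# Building the device explicitly ties allocations to the intended GPU.
device = torch.device("cuda", rank)
x = torch.empty(4, device=device)
print(x.device)  # cuda:0
```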
🧹 Nitpick comments (5)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (5)
645-647: Compute completion_flags_offset from payload dtype instead of hardcoding bfloat16. Keeps the combine test correct across dtypes and prevents misaligned reads.

```diff
- completion_flags_offset = ep_size * max_tokens_per_rank * hidden_size * 2  # bfloat16 = 2 bytes
+ elements_per_token = hidden_states_recv.shape[-1]
+ completion_flags_offset = (
+     ep_size * max_tokens_per_rank * elements_per_token * hidden_states_recv.element_size()
+ )
```
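A dtype-agnostic way to derive the byte offset, sketched with made-up sizes (the tensor name mirrors the suggestion above; everything else is hypothetical):

```python
import torch

# Hypothetical sizes for illustration.
ep_size, max_tokens_per_rank, hidden_size = 4, 16, 8

hidden_states_recv = torch.zeros(ep_size, max_tokens_per_rank, hidden_size,
                                 dtype=torch.float16)

# element_size() gives bytes per element, so the offset tracks the payload
# dtype instead of hardcoding 2 bytes for bfloat16.
elements_per_token = hidden_states_recv.shape[-1]
completion_flags_offset = (ep_size * max_tokens_per_rank * elements_per_token
                           * hidden_states_recv.element_size())

assert completion_flags_offset == hidden_states_recv.numel() * hidden_states_recv.element_size()
```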
206-209: Parameterize combine payload dtype; test input dtype is currently ignored. Honor the test's dtype parameter so we actually exercise different precisions.

```diff
-def make_bfloat16_payloads(local_num_tokens: int, hidden_size: int, top_k: int,
-                           rank: int,
-                           token_selected_experts: torch.Tensor) -> list:
+def make_bfloat16_payloads(local_num_tokens: int, hidden_size: int, top_k: int,
+                           rank: int,
+                           token_selected_experts: torch.Tensor,
+                           dtype: torch.dtype = torch.bfloat16) -> list:
@@
-    hidden_states = torch.randn(local_num_tokens,
-                                hidden_size,
-                                dtype=torch.bfloat16,
-                                device='cuda')
+    hidden_states = torch.randn(local_num_tokens,
+                                hidden_size,
+                                dtype=dtype,
+                                device='cuda')
@@
-    token_final_scales = torch.rand(local_num_tokens,
-                                    top_k,
-                                    dtype=torch.bfloat16,
-                                    device='cuda')
+    token_final_scales = torch.rand(local_num_tokens,
+                                    top_k,
+                                    dtype=dtype,
+                                    device='cuda')

-    payloads = make_bfloat16_payloads(rank_local_tokens, hidden_size, top_k,
-                                      rank, token_selected_experts)
+    payloads = make_bfloat16_payloads(rank_local_tokens, hidden_size, top_k,
+                                      rank, token_selected_experts, dtype=dtype)
```

Also applies to: 212-219, 224-229, 607-609
676-681: Assert on combine completion flags (don't just print). Tests should fail if flags aren't set as expected; retain prints for context if needed.

```diff
-    if not torch.all(completion_flags == expected_flags).item():
-        print(
-            f"ERROR: Rank {rank} completion flags after combine: {completion_flags.tolist()}, expected: {expected_flags.tolist()}"
-        )
+    assert torch.all(completion_flags == expected_flags).item(), (
+        f"Rank {rank} completion flags after combine: {completion_flags.tolist()}, "
+        f"expected: {expected_flags.tolist()}"
+    )
```
426-426: Typo in comment.

```diff
- atol=0,  # Dispatch is pure copy, should expact exactly the same
+ atol=0,  # Dispatch is pure copy; expect exactly the same
```
91-92: Optional: Use the per-rank torch.device consistently for tensor creation. Prefer a shared device variable (torch.device("cuda", rank)) for all tensors to avoid implicit device reliance.
Also applies to: 173-176, 185-186, 196-197, 213-217, 228-229
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
- tests/unittest/_torch/multi_gpu/test_moe_a2a.py
🧬 Code graph analysis (1)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (5)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (1)
tensorrt_llm(22-123)
tensorrt_llm/_mnnvl_utils.py (3)
MnnvlMemory(53-336)
as_torch_strided_tensor(84-88)
supports_mnnvl(330-336)
tensorrt_llm/_torch/distributed/ops.py (2)
moe_a2a_combine(684-719)
moe_a2a_dispatch(637-681)
tensorrt_llm/mapping.py (1)
Mapping(32-513)
tests/unittest/conftest.py (1)
mpi_pool_executor(113-121)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Signed-off-by: Bo Li <[email protected]>
Combine implementation should be changed to be sending tokens from target_ranks. Signed-off-by: Bo Li <[email protected]>
Signed-off-by: Bo Li <[email protected]>
| /bot run --reuse-test | 
| PR_Github #22494 [ run ] triggered by Bot. Commit:  | 
| PR_Github #22494 [ run ] completed with state  | 
Signed-off-by: Bo Li <[email protected]>
| /bot run | 
| PR_Github #22509 [ run ] triggered by Bot. Commit:  | 
| PR_Github #22509 [ run ] completed with state  | 
| /bot run --reuse-test | 
| PR_Github #22527 [ run ] triggered by Bot. Commit:  | 
| PR_Github #22527 [ run ] completed with state  | 
| /bot run --reuse-test | 
| PR_Github #22535 [ run ] triggered by Bot. Commit:  | 
| PR_Github #22535 [ run ] completed with state  | 
| /bot run | 
| PR_Github #22541 [ run ] triggered by Bot. Commit:  | 
| PR_Github #22541 [ run ] completed with state  | 
Signed-off-by: Bo Li <[email protected]>
| /bot kill | 
| PR_Github #22564 [ kill ] triggered by Bot. Commit:  | 
| PR_Github #22564 [ kill ] completed with state  | 
Signed-off-by: Bo Li <[email protected]>
| /bot run | 
| PR_Github #22570 [ run ] triggered by Bot. Commit:  | 
| PR_Github #22570 [ run ] completed with state  | 
| /bot run --reuse-test | 
| PR_Github #22601 [ run ] triggered by Bot. Commit:  | 
| PR_Github #22601 [ run ] completed with state  | 
Summary by CodeRabbit
…MnnvlLatency, and the new one as MnnvlThroughput. …fused_moe OP.

Description
Perf comparison on gpt-oss-120b DP8EP8:

Test Coverage
tests/unittest/_torch/multi_gpu/test_moe_a2a.py
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.
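For example (a hypothetical combination of the flags documented below), /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast would run only that test stage and continue past individual failures instead of aborting early.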
- --reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.
skip
skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.