
Conversation

@nv-lschneider
Collaborator

@nv-lschneider nv-lschneider commented Sep 22, 2025

Summary by CodeRabbit

  • New Features

    • Added AllReduce strategy NCCL_DEVICE with device-side fused AllReduce (residual RMSNorm), multimem support, and launch-configuration caching; exposed in Python and runtime config; userbuffers updated to support it.
  • Tests

    • Added multi-GPU NCCL device tests and extended microbenchmark with only-ub option and improved size reporting.
  • Documentation

    • API references updated to include NCCL_DEVICE.
  • Chores

    • Updated spell-check and pre-commit configuration.

Description

This PR introduces a new kernel launch mechanism to support kernels that use the NCCL device API.
It implements one kernel to start with: RESIDUAL_RMS_NORM for fp16 types.

This new kernel is meant to replace or improve on the performance of AllReduce built on the stable NCCL host API.
It is the first of potentially more kernel variations aimed at best performance; the default AR selection strategy is not affected yet.

It is designed to be low latency for small to medium message sizes.

The PR uses the existing NCCLUBAllocator and extends it to hold the necessary persistent resources, such as NCCL-registered memory windows and device communicators.
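
To make repeated launches cheap, the allocator also caches derived launch configurations. The sketch below only illustrates that caching idea; the actual members in `ub_allocator.h` (`LaunchConfigKey`, `mLaunchConfigCache`, `getCachedNCCLDeviceLaunchConfig`, `getNCCLDevComm`) have different signatures, and the `LaunchConfigSketch` struct here is a simplified stand-in.

```cuda
// Sketch only: real code lives in cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.{h,cpp}.
#include <algorithm>
#include <cstdint>
#include <map>
#include <tuple>
#include <cuda_runtime.h>

struct LaunchConfigSketch
{
    void const* kernelPtr = nullptr; // resolved kernel symbol for this dtype/unroll
    dim3 grid{1, 1, 1};
    dim3 block{1, 1, 1};
    size_t dynamicSmemBytes = 0;
};

class NCCLUBAllocatorSketch
{
public:
    // dtype, hidden dimension, and fusion flags fully determine the config, so
    // derive it once (occupancy queries, kernel-pointer lookup) and reuse it.
    using LaunchConfigKey = std::tuple<int, int, uint32_t>;

    LaunchConfigSketch const& getCachedLaunchConfig(int dtype, int hiddenDim, uint32_t flags)
    {
        LaunchConfigKey key{dtype, hiddenDim, flags};
        auto it = mLaunchConfigCache.find(key);
        if (it == mLaunchConfigCache.end())
        {
            LaunchConfigSketch cfg;
            cfg.block = dim3(static_cast<unsigned>(std::min(hiddenDim, 1024)));
            it = mLaunchConfigCache.emplace(key, cfg).first;
        }
        return it->second;
    }

private:
    std::map<LaunchConfigKey, LaunchConfigSketch> mLaunchConfigCache;
};
```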

The AllReduce operation is implemented as a new AllReduceStrategy and launched from the AllReduce op in cpp/tensorrt_llm/thop/allreduceOp.cpp.
It launches its own new kernels, located under cpp/tensorrt_llm/kernels/nccl_device.
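
A rough sketch of what the dispatch amounts to; `runNCCLAllReduceDeviceFusion` and the fallback path are named in the change summary, while the surrounding structure here is a placeholder rather than the real op:

```cuda
#include <functional>

enum class AllReduceStrategyTypeSketch { NCCL, NCCL_SYMMETRIC, NCCL_DEVICE };

struct AllReduceOpSketch
{
    std::function<void()> runNCCLAllReduceDeviceFusion; // fused AR + residual RMSNorm
    std::function<void()> fallbackRunSubsequentOps;     // plain AR, then separate norm ops
    std::function<void()> runNCCLAllReduce;             // host-API ncclAllReduce path

    void run(AllReduceStrategyTypeSketch strategy, bool deviceFusionSupported) const
    {
        if (strategy == AllReduceStrategyTypeSketch::NCCL_DEVICE && deviceFusionSupported)
        {
            runNCCLAllReduceDeviceFusion(); // kernels in cpp/tensorrt_llm/kernels/nccl_device
        }
        else if (strategy == AllReduceStrategyTypeSketch::NCCL_DEVICE)
        {
            fallbackRunSubsequentOps();     // shape/dtype not handled by the device path
        }
        else
        {
            runNCCLAllReduce();             // NCCL / NCCL_SYMMETRIC and friends
        }
    }
};
```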

The kernel itself is highly templated to be flexible for future demands without impeding runtime performance.
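
For illustration, here is a deliberately simplified sketch of the fused AllReduce + residual RMSNorm idea with the one-block-per-token mapping. The real `fusedAllReduceRMSNorm` templates in cpp/tensorrt_llm/kernels/nccl_device/kernels.h carry more template parameters (unroll factor, one-/two-shot, ...) and use the NCCL device communicator and multimem loads rather than the naive peer-pointer loop shown here.

```cuda
#include <cuda_fp16.h>

// Sketch only: one block per token, AllReduce by summing each rank's symmetric
// buffer, then an RMS norm over the hidden dimension. Assumes blockDim.x is a
// multiple of the warp size (32).
template <typename T>
__global__ void fusedAllReduceRMSNormSketch(T* out, T const* const* peerIn, T const* residual,
    T const* gamma, int worldSize, int hidden, float eps)
{
    __shared__ float sWarpSums[32];
    int const token = blockIdx.x; // strict 1 block <-> 1 token mapping
    float sumSq = 0.f;

    for (int i = threadIdx.x; i < hidden; i += blockDim.x)
    {
        size_t const idx = static_cast<size_t>(token) * hidden + i;
        float acc = 0.f;
        for (int r = 0; r < worldSize; ++r) // AllReduce: sum over ranks
            acc += static_cast<float>(peerIn[r][idx]);
        acc += static_cast<float>(residual[idx]); // fused residual add
        out[idx] = static_cast<T>(acc);           // stash the pre-norm value
        sumSq += acc * acc;
    }

    // Block-wide reduction of the sum of squares (warp shuffles + shared memory).
    int const lane = threadIdx.x & 31;
    int const warp = threadIdx.x >> 5;
    for (int offset = 16; offset > 0; offset >>= 1)
        sumSq += __shfl_down_sync(0xffffffff, sumSq, offset);
    if (lane == 0)
        sWarpSums[warp] = sumSq;
    __syncthreads();
    if (warp == 0)
    {
        int const numWarps = (blockDim.x + 31) / 32;
        float total = (lane < numWarps) ? sWarpSums[lane] : 0.f;
        for (int offset = 16; offset > 0; offset >>= 1)
            total += __shfl_down_sync(0xffffffff, total, offset);
        if (lane == 0)
            sWarpSums[0] = rsqrtf(total / hidden + eps);
    }
    __syncthreads();

    // Normalize and apply the RMSNorm weight.
    float const invRms = sWarpSums[0];
    for (int i = threadIdx.x; i < hidden; i += blockDim.x)
    {
        size_t const idx = static_cast<size_t>(token) * hidden + i;
        out[idx] = static_cast<T>(static_cast<float>(out[idx]) * invRms * static_cast<float>(gamma[i]));
    }
}
```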

This PR implements the new kernel as a two-shot fp16 variant first.
It is already competitive in this form; once the kernel is adopted, further modifications and additions can follow:

  1. a one-shot flavor
  2. fp8 and fp4 support

Test Coverage

  • Python unit test: tests/unittest/_torch/multi_gpu/test_nccl_device.py
  • Microbenchmark: tests/microbenchmarks/all_reduce.py

The microbenchmark has been updated slightly. It now includes the new strategy and, optionally, UB for comparison.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

Important Caveat

This change requires NCCL 2.28 to run successfully.
Since the current TRT-LLM dev container does not use 2.28 yet, I would like to gather some feedback before 2.28 becomes available.
A real test will only be possible once version 2.28 is included in the dev container.
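
For reference, a gate along these lines keeps pre-2.28 builds working (the walkthrough mentions NCCL-version gating, and one commit safeguards the NCCL 2.27 build); the macro name and helper below are illustrative, not the exact guard used in the PR:

```cuda
#include <nccl.h>

// Hypothetical guard name; the PR's actual gating may look different.
#if defined(NCCL_VERSION_CODE) && NCCL_VERSION_CODE >= NCCL_VERSION(2, 28, 0)
#define SKETCH_HAS_NCCL_DEVICE_API 1
#else
#define SKETCH_HAS_NCCL_DEVICE_API 0
#endif

bool ncclDeviceAllReduceAvailable()
{
#if SKETCH_HAS_NCCL_DEVICE_API
    // Headers and the loaded libnccl can diverge, so also check at runtime.
    int version = 0;
    if (ncclGetVersion(&version) != ncclSuccess)
        return false;
    return version >= NCCL_VERSION(2, 28, 0);
#else
    return false;
#endif
}
```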

@nv-lschneider nv-lschneider requested review from a team as code owners September 22, 2025 16:53
@coderabbitai coderabbitai bot changed the title [None] @coderabbit title [None] [feat] Add NCCL device kernels; enable NCCL_DEVICE all-reduce title Sep 22, 2025
@nv-lschneider nv-lschneider force-pushed the introducing-nccl-device-ar branch 2 times, most recently from 6e1c6cd to 39b2e16 on September 22, 2025 17:08
@coderabbitai
Contributor

coderabbitai bot commented Sep 22, 2025

📝 Walkthrough

Walkthrough

Adds a new NCCL_DEVICE all-reduce strategy and device-side fusion module (nccl_device): build targets, CUDA kernels, multimem/vector helpers, launch-config factory, runtime dispatch and allocator support, Python/enum plumbing, benchmarks, and tests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Tooling: codespell & pre-commit**<br>`.codespellignore`, `.pre-commit-config.yaml` | Add "commIter" to `.codespellignore` and pass `.codespellignore` to the codespell pre-commit hook. |
| **Build: enable nccl_device module**<br>`cpp/tensorrt_llm/kernels/CMakeLists.txt`, `cpp/tensorrt_llm/kernels/nccl_device/CMakeLists.txt` | Add nccl_device subdirectory and new CUDA library target `tensorrt_llm_nccl_device` with include paths, CUDA properties, link, and install rules. |
| **Enum & bindings**<br>`cpp/tensorrt_llm/kernels/customAllReduceKernels.h`, `cpp/tensorrt_llm/pybind/runtime/bindings.cpp`, `tensorrt_llm/functional.py` | Add `NCCL_DEVICE = 9` to AllReduceStrategyType and expose it in Python bindings and the Python IntEnum. |
| **nccl_device public headers & constants**<br>`cpp/tensorrt_llm/kernels/nccl_device/constants.h`, `.../vector_types.h`, `.../multimem.h`, `.../kernels.h` | Add device constants, vector wrapper types, architecture-gated multimem load/store intrinsics (a rough sketch of such a helper follows after this table), warp/block reduce helpers, and the `fusedAllReduceRMSNorm` kernel templates. |
| **nccl_device launch-config implementation**<br>`cpp/tensorrt_llm/kernels/nccl_device/config.h`, `cpp/tensorrt_llm/kernels/nccl_device/config.cu` | Add LaunchConfig base, TypedLaunchConfig, factory `makeLaunchConfig`, validity/occupancy checks, NCCL-version gating, kernel pointer resolution, and type-specialized kernel launch paths. |
| **Allocator: NCCL device comm & LaunchConfig cache**<br>`cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h`, `.../ub_allocator.cpp` | Add destructor, device communicator create/destroy symbol resolution, getters for new NCCL symbols, per-block dev-comm map, LaunchConfigKey and mLaunchConfigCache, and `getCachedNCCLDeviceLaunchConfig` + `getNCCLDevComm`. |
| **Runtime op dispatch (C++)**<br>`cpp/tensorrt_llm/thop/allreduceOp.cpp` | Add handling for NCCL_DEVICE, device-fusion path `runNCCLAllReduceDeviceFusion`, UB symmetry handling, logging and fallback behavior. |
| **Python runtime wiring**<br>`tensorrt_llm/_torch/model_config.py`, `tensorrt_llm/_torch/pyexecutor/model_engine.py`, `tensorrt_llm/llmapi/llm_args.py` | Map string "NCCL_DEVICE" to enum, treat NCCL_DEVICE like NCCL_SYMMETRIC for enabling user buffers, and extend allowed Literal values to include "NCCL_DEVICE". |
| **Plugins: runtime strategy support**<br>`cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp` | Treat NCCL_DEVICE alongside NCCL_SYMMETRIC in format checks, enqueue, initialization gating, and status/logging branches. |
| **Benchmarks**<br>`tests/microbenchmarks/all_reduce.py` | Add only_ub mode, extend CLI, adjust benchmark loops and metrics, print message size in bytes. |
| **Tests: multi-GPU**<br>`tests/unittest/_torch/multi_gpu/test_nccl_device.py` | New multi-GPU test validating UB + NCCL device RMSNorm all-reduce path with per-rank checks and MPI executor. |
| **API stability reference**<br>`tests/unittest/api_stability/references/llm.yaml` | Allow NCCL_DEVICE in allreduce_strategy Literal. |
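
As referenced in the nccl_device headers row above, a rough, hedged sketch of what an architecture-gated multimem helper can look like; the real helpers in `multimem.h` may use different names, types, and PTX variants, and the multicast pointer must come from an NVLS/symmetric-memory binding:

```cuda
#include <cuda_runtime.h>

// Sketch of a multimem load-reduce: the load adds the value across all devices
// mapped by the multicast address. Requires sm_90 or newer.
__device__ __forceinline__ float4 multimemLoadReduceAddF32x4(float4 const* mcPtr)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    float4 v;
    asm volatile("multimem.ld_reduce.global.add.v4.f32 {%0,%1,%2,%3}, [%4];"
                 : "=f"(v.x), "=f"(v.y), "=f"(v.z), "=f"(v.w)
                 : "l"(mcPtr)
                 : "memory");
    return v;
#else
    // Pre-Hopper (or host-side) fallback: no multimem support.
    return make_float4(0.f, 0.f, 0.f, 0.f);
#endif
}

// Sketch of the matching broadcast store through the multicast address.
__device__ __forceinline__ void multimemStoreF32x4(float4* mcPtr, float4 v)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    asm volatile("multimem.st.global.v4.f32 [%0], {%1,%2,%3,%4};"
                 :
                 : "l"(mcPtr), "f"(v.x), "f"(v.y), "f"(v.z), "f"(v.w)
                 : "memory");
#endif
}
```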

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant PY as Python API
  participant ME as ModelEngine
  participant OP as AllreduceOp (C++)
  participant UB as NCCLUserBufferAllocator
  participant KD as nccl_device::LaunchConfig
  participant KRN as nccl_device Kernel
  participant NCCL as NCCL (host+device)

  PY->>ME: request AllReduce with strategy="NCCL_DEVICE"
  ME->>OP: execute AllReduce (inputs, fusion attrs)
  OP->>UB: getNCCLDevComm(numBarriers)
  UB->>NCCL: resolve/create device communicator
  UB-->>OP: ncclDevComm
  OP->>UB: getCachedNCCLDeviceLaunchConfig(dtype, dims, flags)
  UB-->>OP: LaunchConfig (KD)
  OP->>KRN: KD.launchRMSNorm(..., devComm, stream)
  KRN->>NCCL: device-side allreduce (multimem ld/st)
  KRN-->>OP: outputs written
  OP-->>ME: return tensors
  ME-->>PY: result
```

```mermaid
sequenceDiagram
  autonumber
  participant OP as AllreduceOp (C++)
  participant SYM as UB Symmetric Buffers
  participant F as Fallback Path

  OP->>SYM: Verify symmetric UB buffer
  alt buffer missing
    OP->>SYM: Create symmetric input and copy data
  end
  alt device fusion unsupported or invalid
    OP->>F: fallbackRunSubsequentOps(...)
  end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • nv-guomingz
  • liji-nv
  • shaharmor98
  • Superjomn

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.43%, which is insufficient; the required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: adding NCCL device kernels for AllReduce with RMS normalization fusion. |
| Description check | ✅ Passed | The description provides a clear explanation of what the PR does, why it's needed, test coverage, and relevant caveats about the NCCL 2.28 requirement. |

@nv-lschneider
Collaborator Author

@CodeRabbit review

@nv-lschneider nv-lschneider force-pushed the introducing-nccl-device-ar branch from 6449ab6 to 3f1e163 on October 13, 2025 19:30
@nv-lschneider
Copy link
Collaborator Author

> @nv-lschneider Do we understand why NCCL device performs worse than NCCL symmetric for large message sizes?

I have a strong suspicion:
In the fused kernel approach we follow a strict 1 block <-> 1 token mapping, since that is necessary for an efficient RMS norm calculation.
With NCCL_SYMMETRIC, ncclAllReduce has no such thread-to-data requirement, which allows it to apply optimizations in the communication part that we cannot pursue as easily in the fused kernel.
So ncclAllReduce can optimize bandwidth better for large message sizes, which seems to outweigh the advantage of fusing computation and communication.
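
A tiny illustration of that constraint (not the actual launch code; later commits add a grid-stride loop that relaxes this somewhat):

```cuda
#include <algorithm>
#include <cuda_runtime.h>

// With the fused kernel the grid is tied to the token count (1 block <-> 1 token),
// while a standalone ncclAllReduce is free to pick whatever launch shape
// maximizes bandwidth for the message size.
inline void launchShapeForFusedKernelSketch(int numTokens, int hiddenDim, dim3& grid, dim3& block)
{
    grid = dim3(static_cast<unsigned>(numTokens));                      // fixed by the RMS norm fusion
    block = dim3(static_cast<unsigned>(std::min(hiddenDim, 1024)));     // threads cover one token's hidden dim
    // For large messages (many tokens * large hidden) this leaves little room
    // to re-balance work per block.
}
```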

In general it is difficult to optimize a single kernel for both small and large message sizes, which is why we want to use different strategies for different situations; ncclAllReduce does that for us automatically.

This may be another reason to leave NCCL_SYMMETRIC and NCCL_DEVICE as separate strategies, for different message sizes.

I am collecting data on this and will update the analysis with additional info.

@nv-lschneider nv-lschneider force-pushed the introducing-nccl-device-ar branch from 6dc8812 to 0f55f3f on October 14, 2025 18:18
return false;
}

// 6. Query actual kernel resource usage from kernel pointer for the specific unroll factor
Member

Are there 4 and 5 steps too?

Collaborator Author

Yes, the numbering was out of order.
I removed the numbering, since it isn't necessary and is brittle to change.

Member

The numbering is still incorrect.

target_link_libraries(tensorrt_llm_nccl_device tensorrt_llm_common)

# Install target
install(
Member

Is this required? Can we only link statically to the TRT-LLM library?

Collaborator Author

I don't think this is necessary. I'll remove this section.

@@ -0,0 +1,516 @@
/*************************************************************************
* Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Member

@Tabrizian Tabrizian Nov 4, 2025

Please make the license header consistent with other files.

{
int local_sms = 1;
int dev = -1;
cudaError_t cudaStatus = cudaGetDevice(&dev);
Member

Can we use TLLM_CUDA_CHECK instead here and everywhere?

Suggested change:
- cudaError_t cudaStatus = cudaGetDevice(&dev);
+ TLLM_CUDA_CHECK(cudaGetDevice(&dev));

{
// Get CUDA device properties
int dev = -1;
cudaError_t cudaStatus = cudaGetDevice(&dev);
Member

same as above

return false;
}

// 6. Query actual kernel resource usage from kernel pointer for the specific unroll factor
Member

The numbering is still incorrect.

return false;
}

// 8. Check occupancy
Member

Fix numbering

* Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
Member

Update license please.

Collaborator Author

I updated the license in the two files and double-checked that the other files are OK too.

Comment on lines 430 to 564
goto default_case;
}
Member

@nv-lschneider is it possible to address this comment?

@Tabrizian Tabrizian changed the title [None] [feat] Add NCCL device kernels; enable NCCL_DEVICE all-reduce title [None][feat] Add NCCL device kernels for AR+RMS fusion Nov 4, 2025
@nv-lschneider nv-lschneider requested a review from a team as a code owner November 5, 2025 18:00
nv-lschneider and others added 4 commits November 5, 2025 15:19
… and NEW NCCL device API to use NCCL to fuse RMS Norm with AllReduce.

Signed-off-by: Ludwig Schneider <[email protected]>
…) (NVIDIA#7900)

Signed-off-by: Yan Chunwei <[email protected]>
Signed-off-by: Ludwig Schneider <[email protected]>

pre-commit changes

Signed-off-by: Ludwig Schneider <[email protected]>

clang formatting

Signed-off-by: Ludwig Schneider <[email protected]>

safe guarding NCCL 2.27 build

Signed-off-by: Ludwig Schneider <[email protected]>

fixing precommit formatting

Signed-off-by: Ludwig Schneider <[email protected]>

most of code rabbit comments

Signed-off-by: Ludwig Schneider <[email protected]>

adding missing semi-colon

Signed-off-by: Ludwig Schneider <[email protected]>

removing unused comment lines

Signed-off-by: Ludwig Schneider <[email protected]>

Clarifying the test on how to compre residual chunked and unchunked.

Signed-off-by: Ludwig Schneider <[email protected]>

fixing pre-commit

Signed-off-by: Ludwig Schneider <[email protected]>

fixing pre-commit

Signed-off-by: Ludwig Schneider <[email protected]>

fixing missing variable, rebase complete and tested

Signed-off-by: Ludwig Schneider <[email protected]>

using a grid stride loop with less blocks launched for large message sizes

Signed-off-by: Ludwig Schneider <[email protected]>

using functioning grid stride loop for NCCL_DEVICE. It helps with better performance at larger message sizes

Signed-off-by: Ludwig Schneider <[email protected]>

initial oneshot implementation

Signed-off-by: Ludwig Schneider <[email protected]>

minor tweaks to include one shot

fixes

Signed-off-by: Ludwig Schneider <[email protected]>

enabling grid stride loop, but no perf benefit.

Signed-off-by: Ludwig Schneider <[email protected]>

addressing review feedback

Signed-off-by: Ludwig Schneider <[email protected]>

fix formatting

Signed-off-by: Ludwig Schneider <[email protected]>
Signed-off-by: Ludwig Schneider <[email protected]>

better UB init handling

Signed-off-by: Ludwig Schneider <[email protected]>

accept multiple strategies

Signed-off-by: Ludwig Schneider <[email protected]>

test to debug mnnvl

Signed-off-by: Ludwig Schneider <[email protected]>

rebasing and addressing comments

Signed-off-by: Ludwig Schneider <[email protected]>

remove unneeded type decl

Signed-off-by: Ludwig Schneider <[email protected]>
Signed-off-by: Ludwig Schneider <[email protected]>
@nv-lschneider nv-lschneider force-pushed the introducing-nccl-device-ar branch from 9ead5ce to ca6dd60 on November 5, 2025 22:57
Signed-off-by: Ludwig Schneider <[email protected]>
Collaborator Author

@nv-lschneider nv-lschneider left a comment

I rebased the code and addressed your comments. Thanks for the patience.
Rebasing takes a while.

Comment on lines 430 to 564
goto default_case;
}
Collaborator Author

Yes, using intentional fallthrough without goto instead.
(If we support more cases, we will have to refactor the default case out.)

Comment on lines 66 to 81
k_chunk_size = a.size(1) // tensor_parallel_size
b.size(0) // tensor_parallel_size
Collaborator Author

Removing the unused statement.

* Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
Collaborator Author

I updated the license in the two files and double-checked that the other files are OK too.

Signed-off-by: Ludwig Schneider <[email protected]>
Signed-off-by: Ludwig Schneider <[email protected]>
Signed-off-by: Ludwig Schneider <[email protected]>
Signed-off-by: Ludwig Schneider <[email protected]>
Signed-off-by: Ludwig Schneider <[email protected]>
Signed-off-by: Ludwig Schneider <[email protected]>
Signed-off-by: Ludwig Schneider <[email protected]>
Signed-off-by: Ludwig Schneider <[email protected]>
Signed-off-by: Ludwig Schneider <[email protected]>
@nv-lschneider nv-lschneider force-pushed the introducing-nccl-device-ar branch from a7b677d to 6cc2722 on November 7, 2025 21:00
Member

@Tabrizian Tabrizian left a comment

LGTM, merging is blocked until the PyTorch container upgrades to 2.28 version.

