feat: large-scale EP(part 1: Add MNNVL MoE A2A support) #3504

dongxuy04 · 2025-04-13T13:04:32Z

Add MNNVL MoE AllToAll support for large-scale expert parallelism.

dongxuy04 · 2025-04-13T13:05:00Z

/bot run

tensorrt-cicd · 2025-04-13T13:10:36Z

PR_Github #2049 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-13T15:14:49Z

PR_Github #2049 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1508 completed with status: 'FAILURE'

dongxuy04 · 2025-04-14T05:40:36Z

/bot run

tensorrt-cicd · 2025-04-14T05:46:02Z

PR_Github #2113 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-14T11:42:31Z

PR_Github #2113 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1542 completed with status: 'SUCCESS'

dongxuy04 · 2025-04-15T15:11:59Z

/bot run

dongxuy04 · 2025-04-16T02:15:18Z

/bot run

tensorrt-cicd · 2025-04-16T02:20:41Z

PR_Github #2385 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-16T04:01:56Z

PR_Github #2385 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1716 completed with status: 'FAILURE'

dongxuy04 · 2025-04-16T04:09:58Z

/bot run

tensorrt-cicd · 2025-04-16T04:15:42Z

PR_Github #2408 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-16T04:28:05Z

PR_Github #2408 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1730 completed with status: 'FAILURE'

dongxuy04 · 2025-04-18T12:39:46Z

/bot run

tensorrt-cicd · 2025-04-18T12:45:22Z

PR_Github #2768 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-18T14:00:55Z

PR_Github #2768 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1958 completed with status: 'FAILURE'

dongxuy04 · 2025-04-18T14:11:00Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2025-04-18T14:16:32Z

PR_Github #2773 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-18T21:49:57Z

PR_Github #2773 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1962 completed with status: 'FAILURE'

dongxuy04 · 2025-04-21T01:35:25Z

/bot run --disable-fail-fast

tensorrt_llm/_torch/modules/fused_moe.py

tests/unittest/_torch/thop/test_moe_alltoall.py

tests/unittest/_torch/test_mnnvl_memory.py

dongxuy04 · 2025-04-24T01:09:18Z

/bot run

tensorrt-cicd · 2025-04-24T01:20:24Z

PR_Github #3227 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-24T03:24:21Z

PR_Github #3227 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2244 completed with status: 'FAILURE'

dongxuy04 · 2025-04-24T06:37:52Z

/bot run

tensorrt-cicd · 2025-04-24T06:43:21Z

PR_Github #3269 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-24T08:57:25Z

PR_Github #3269 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2272 completed with status: 'FAILURE'

Signed-off-by: Dongxu Yang <[email protected]>

dongxuy04 · 2025-04-25T02:39:54Z

/bot run

tensorrt-cicd · 2025-04-25T02:46:04Z

PR_Github #3341 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-25T09:20:05Z

PR_Github #3341 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2332 completed with status: 'SUCCESS'

jdemouth-nvidia · 2025-05-14T16:09:43Z

cpp/tensorrt_llm/kernels/moeCommKernels.h

+
+    struct GroupSharedBuffer
+    {
+        int groupIndiceBuffer[GROUP_MAX_INDICE_COUNT];


Note that it should be either `Index` or `Indices` (with an "s"). `Index` is the singular and `Indices` is the plural.

jdemouth-nvidia · 2025-05-14T16:11:21Z

cpp/tensorrt_llm/kernels/moeCommKernels.h

+public:
+    static constexpr int GROUP_COUNT_PER_BLOCK = 8;
+    static_assert(GROUP_COUNT_PER_BLOCK <= 8, "GROUP_COUNT_PER_BLOCK must be less than or equal to 8");
+    static constexpr int WARP_PER_GROUP = 2;


It should be `WARPS_PER_GROUP` or `WARP_COUNT_PER_GROUP`.

jdemouth-nvidia · 2025-05-14T16:12:20Z

cpp/tensorrt_llm/kernels/moeCommKernels.h

+        TLLM_CHECK_WITH_INFO(
+            blockCountPerChannel <= smCount, "GPU should support at lease one channel, usableSmCount=%d", smCount);
+        int perferredChannel = smCount / 2 / blockCountPerChannel; // use half SMs for communication
+        int channelCount = std::max(perferredChannel, 1);          // at lease one channel


In the comment, it must be "at least" (not "at lease").

jdemouth-nvidia · 2025-05-14T16:13:19Z

cpp/tensorrt_llm/kernels/moeCommKernels.h

+    static int computeMoeCommChannelCount(int epSize)
+    {
+        int smCount = getMaxUsableSmCount();
+        int blockCountPerChannel = (epSize + GROUP_COUNT_PER_BLOCK - 1) / GROUP_COUNT_PER_BLOCK;


Isn't there a function in TRT-LLM to compute this? Common names for it are `divUp` or `ceilDiv`.
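For illustration, here is a minimal sketch of what the suggestion could look like, assuming a rounding-up division helper is available. The `ceilDiv` defined below is a hypothetical stand-in written for this example, not necessarily the utility that exists in TRT-LLM, and `computeChannelCount` takes `smCount` as a parameter (the quoted diff obtains it from `getMaxUsableSmCount()`), so the logic can be checked in isolation. It assumes positive integer inputs.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper for this sketch: integer division rounded up.
// Valid for non-negative numerator and positive denominator.
constexpr int ceilDiv(int numerator, int denominator)
{
    return (numerator + denominator - 1) / denominator;
}

// Mirrors GROUP_COUNT_PER_BLOCK from the quoted diff.
constexpr int kGroupCountPerBlock = 8;

// Sketch of the channel-count logic from the quoted diff.
// smCount is passed in here; the original queries it from the device.
inline int computeChannelCount(int epSize, int smCount)
{
    // One group per EP rank, kGroupCountPerBlock groups per block.
    int const blockCountPerChannel = ceilDiv(epSize, kGroupCountPerBlock);
    // The GPU must be able to host at least one channel.
    assert(blockCountPerChannel <= smCount);
    // Use half of the SMs for communication.
    int const preferredChannelCount = smCount / 2 / blockCountPerChannel;
    // Always keep at least one channel.
    return std::max(preferredChannelCount, 1);
}
```

With this shape, the rounding intent is carried by the helper's name rather than by the `(a + b - 1) / b` idiom repeated at each call site.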

juney-nvidia changed the title ~~Add MNNVL MoE A2A support~~ feat: Add MNNVL MoE A2A support Apr 13, 2025

dongxuy04 force-pushed the user/dongxuy/mnnvl_a2a_github branch from 24a44a8 to 163b322 Compare April 16, 2025 02:15

dongxuy04 force-pushed the user/dongxuy/mnnvl_a2a_github branch from 163b322 to bfda091 Compare April 18, 2025 12:38

dongxuy04 marked this pull request as ready for review April 18, 2025 12:39

dongxuy04 requested review from hlu1, yuxianq and zongfeijing April 18, 2025 13:02

hlu1 requested review from HuiGao-NV and QiJune April 18, 2025 19:42

dongxuy04 force-pushed the user/dongxuy/mnnvl_a2a_github branch from bfda091 to 89ec4a5 Compare April 21, 2025 03:38

dongxuy04 force-pushed the user/dongxuy/mnnvl_a2a_github branch from b928430 to b81aae4 Compare April 23, 2025 10:35

yuxianq reviewed Apr 23, 2025

tensorrt_llm/_torch/modules/fused_moe.py (outdated, resolved)

yuxianq reviewed Apr 23, 2025

tests/unittest/_torch/thop/test_moe_alltoall.py (outdated, resolved)

yuxianq reviewed Apr 23, 2025

tests/unittest/_torch/test_mnnvl_memory.py (outdated, resolved)

dongxuy04 force-pushed the user/dongxuy/mnnvl_a2a_github branch from b81aae4 to dd0c525 Compare April 24, 2025 01:06

yuxianq approved these changes Apr 24, 2025

dongxuy04 added 8 commits April 25, 2025 09:51

add MNNVL memory mapping support

65a0c00

Signed-off-by: Dongxu Yang <[email protected]>

add more MPI environment for trtllm-llmapi-launch

5f27700

Signed-off-by: Dongxu Yang <[email protected]>

add MoE communication and prepare kernels

0cfd1aa

Signed-off-by: Dongxu Yang <[email protected]>

add MNNVL AlltoAll support for DeepSeekV3

212050f

Signed-off-by: Dongxu Yang <[email protected]>

add output dump for throughput benchmark

e55ec56

Signed-off-by: Dongxu Yang <[email protected]>

support dynamic kernel launch grid

9be387c

Signed-off-by: Dongxu Yang <[email protected]>

address review comments

72efaf9

Signed-off-by: Dongxu Yang <[email protected]>

address review comments NVIDIA#2

2070dca

Signed-off-by: Dongxu Yang <[email protected]>

dongxuy04 force-pushed the user/dongxuy/mnnvl_a2a_github branch from dd0c525 to 2070dca Compare April 25, 2025 02:39

dongxuy04 merged commit 1653599 into NVIDIA:main Apr 25, 2025
3 checks passed

juney-nvidia changed the title ~~feat: Add MNNVL MoE A2A support~~ feat: large-scale EP(part 1: Add MNNVL MoE A2A support) Apr 29, 2025

juney-nvidia mentioned this pull request May 7, 2025

[Call for contributions]The development plan of large-scale EP support in TensorRT-LLM #4127

Open

jdemouth-nvidia reviewed May 14, 2025
