Conversation

@dongxuy04
Collaborator

Add MNNVL MoE AllToAll support for large-scale expert parallelism.

@dongxuy04
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #2049 [ run ] triggered by Bot

@juney-nvidia changed the title from "Add MNNVL MoE A2A support" to "feat: Add MNNVL MoE A2A support" on Apr 13, 2025
@tensorrt-cicd
Collaborator

PR_Github #2049 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1508 completed with status: 'FAILURE'

@dongxuy04
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #2113 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2113 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1542 completed with status: 'SUCCESS'

@dongxuy04
Collaborator Author

/bot run

@dongxuy04 force-pushed the user/dongxuy/mnnvl_a2a_github branch from 24a44a8 to 163b322 on April 16, 2025 02:15
@dongxuy04
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #2385 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2385 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1716 completed with status: 'FAILURE'

@dongxuy04
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #2408 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2408 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1730 completed with status: 'FAILURE'

@dongxuy04 force-pushed the user/dongxuy/mnnvl_a2a_github branch from 163b322 to bfda091 on April 18, 2025 12:38
@dongxuy04
Collaborator Author

/bot run

@dongxuy04 marked this pull request as ready for review on April 18, 2025 12:39
@tensorrt-cicd
Collaborator

PR_Github #2768 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #2768 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1958 completed with status: 'FAILURE'

@dongxuy04
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #2773 [ run ] triggered by Bot

@hlu1 requested review from HuiGao-NV and QiJune on April 18, 2025 19:42
@tensorrt-cicd
Collaborator

PR_Github #2773 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1962 completed with status: 'FAILURE'

@dongxuy04
Collaborator Author

/bot run --disable-fail-fast

@dongxuy04 force-pushed the user/dongxuy/mnnvl_a2a_github branch from bfda091 to 89ec4a5 on April 21, 2025 03:38
@dongxuy04 force-pushed the user/dongxuy/mnnvl_a2a_github branch from b928430 to b81aae4 on April 23, 2025 10:35
@dongxuy04 force-pushed the user/dongxuy/mnnvl_a2a_github branch from b81aae4 to dd0c525 on April 24, 2025 01:06
@dongxuy04
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #3227 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #3227 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2244 completed with status: 'FAILURE'

@dongxuy04
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #3269 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #3269 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2272 completed with status: 'FAILURE'

@dongxuy04 force-pushed the user/dongxuy/mnnvl_a2a_github branch from dd0c525 to 2070dca on April 25, 2025 02:39
@dongxuy04
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #3341 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #3341 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2332 completed with status: 'SUCCESS'

@dongxuy04 merged commit 1653599 into NVIDIA:main on Apr 25, 2025
3 checks passed
@juney-nvidia changed the title from "feat: Add MNNVL MoE A2A support" to "feat: large-scale EP(part 1: Add MNNVL MoE A2A support)" on Apr 29, 2025

struct GroupSharedBuffer
{
int groupIndiceBuffer[GROUP_MAX_INDICE_COUNT];
Collaborator

Note that it should be either `Index` or `Indices` (with an "s"). `Index` is the singular and `Indices` is the plural.

public:
static constexpr int GROUP_COUNT_PER_BLOCK = 8;
static_assert(GROUP_COUNT_PER_BLOCK <= 8, "GROUP_COUNT_PER_BLOCK must be less than or equal to 8");
static constexpr int WARP_PER_GROUP = 2;
Collaborator

It should be WARPS_PER_GROUP or WARP_COUNT_PER_GROUP.

TLLM_CHECK_WITH_INFO(
blockCountPerChannel <= smCount, "GPU should support at lease one channel, usableSmCount=%d", smCount);
int perferredChannel = smCount / 2 / blockCountPerChannel; // use half SMs for communication
int channelCount = std::max(perferredChannel, 1); // at lease one channel
Collaborator

In the comment, it must be "at least", not "at lease".
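
For reference, the channel-count logic in the snippet under review can be sketched as a standalone function. This is only an illustration under stated assumptions: `computeChannelCount` is a hypothetical name, the SM count is passed in as a parameter rather than queried from the device, and a plain `assert` stands in for `TLLM_CHECK_WITH_INFO`.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical standalone version of the reviewed snippet (not the PR's code):
// smCount is a parameter here instead of being queried from the GPU.
inline int computeChannelCount(int smCount, int blockCountPerChannel)
{
    // Stand-in for the TLLM_CHECK_WITH_INFO precondition:
    // the GPU must be able to host at least one channel.
    assert(blockCountPerChannel > 0 && blockCountPerChannel <= smCount);

    // Use half of the SMs for communication.
    int preferredChannel = smCount / 2 / blockCountPerChannel;

    // Integer division can yield 0, so clamp to at least one channel.
    return std::max(preferredChannel, 1);
}
```

For example, with 132 SMs and 9 blocks per channel the function returns 7, while with 4 SMs and 3 blocks per channel the division gives 0 and the clamp raises the result to 1.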

static int computeMoeCommChannelCount(int epSize)
{
int smCount = getMaxUsableSmCount();
int blockCountPerChannel = (epSize + GROUP_COUNT_PER_BLOCK - 1) / GROUP_COUNT_PER_BLOCK;
Collaborator

Isn't there a function in TRT-LLM to compute this? Something like `divUp` or `ceilDiv` is a common name.
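
To illustrate the suggestion, here is a minimal sketch of such a helper and how the reviewed line would read with it. The name `ceilDiv`, its signature, and the wrapper `blockCountPerChannelFor` are assumptions made for this example; they are not taken from the TRT-LLM code base, whose actual utility may differ.

```cpp
#include <cassert>

// Hypothetical helper for illustration only; the real TRT-LLM utility may
// use a different name, namespace, or signature.
// Returns numerator / denominator rounded up, for numerator >= 0 and denominator > 0.
inline int ceilDiv(int numerator, int denominator)
{
    assert(numerator >= 0 && denominator > 0);
    return (numerator + denominator - 1) / denominator;
}

// Value taken from the class under review.
static constexpr int GROUP_COUNT_PER_BLOCK = 8;

// Same arithmetic as the reviewed line, expressed through the helper.
// blockCountPerChannelFor is a hypothetical wrapper name.
inline int blockCountPerChannelFor(int epSize)
{
    return ceilDiv(epSize, GROUP_COUNT_PER_BLOCK);
}
```

With `GROUP_COUNT_PER_BLOCK = 8`, an `epSize` of 8 needs one block per channel and an `epSize` of 9 needs two, which matches the hand-written `(epSize + GROUP_COUNT_PER_BLOCK - 1) / GROUP_COUNT_PER_BLOCK`.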


4 participants