Merged
Commits
62 commits
a51b974
Initial commit.
bobboli Sep 3, 2025
d74eaaa
Update docstring.
bobboli Sep 3, 2025
080ebca
Correct expert_id -> target_rank mapping logic.
bobboli Sep 4, 2025
ac2887a
Cleanup tests.
bobboli Sep 4, 2025
f217f6d
fake_moe also performs linear projection.
bobboli Sep 5, 2025
4776ae9
New synchronization design.
bobboli Sep 7, 2025
47835d3
Reimplement synchronization in combine.
bobboli Sep 8, 2025
51dc193
Refactor API.
bobboli Sep 9, 2025
4df7703
Add sync to synchronize.
bobboli Sep 9, 2025
46a1cbc
Cleanup variables.
bobboli Sep 9, 2025
7d28f4c
Add recv_counters. Avoid copying invalid tokens in prepare combine.
bobboli Sep 9, 2025
b372ab6
Add sanitize expertID.
bobboli Sep 9, 2025
499b753
Turnoff debug print.
bobboli Sep 10, 2025
fd9db2c
New AlltoAll integration for Cutlass.
bobboli Sep 10, 2025
ed8c5cc
Add 8 GPU test.
bobboli Sep 15, 2025
a5b22e1
Share workspace among layers.
bobboli Sep 15, 2025
9af153b
Let the for loop over peer rank completion flag be the inner loop.
bobboli Sep 16, 2025
9a4239e
Unroll warp vectorized functions.
bobboli Sep 16, 2025
b98d2ee
Optimize combine.
bobboli Sep 17, 2025
975bf76
One block per token.
bobboli Sep 18, 2025
61f24d0
Cleanup thread_idx.
bobboli Sep 19, 2025
0af05d1
Cancel manual unrolling.
bobboli Sep 19, 2025
af737d3
Update test.
bobboli Sep 19, 2025
a09e64c
Unroll topK in combine.
bobboli Sep 19, 2025
8924602
Unroll topK for dispatch.
bobboli Sep 21, 2025
f7972dd
Use smem for topk_target_ranks and topk_send_indices.
bobboli Sep 24, 2025
709d713
Use envvar to control block size.
bobboli Sep 26, 2025
11a152a
Clean up send_indices as it has been replaced by topk_target_ranks an…
bobboli Sep 26, 2025
43bdb09
Support combine without copying to workspace.
bobboli Sep 26, 2025
fe5d1af
Avoid ld.acquire and st.release for recvCounters.
bobboli Sep 26, 2025
3d1fc4b
DISABLE_SYNC_FOR_PROFILING includes st of flags.
bobboli Oct 7, 2025
be54f0f
Use fence for flag ld and st (but hangs).
bobboli Oct 8, 2025
c10ebd9
Use relaxed atomic for sendCounters is enough.
bobboli Oct 8, 2025
43ce9d9
Disable release in dispatch kernel and acquire in combine kernel.
bobboli Oct 8, 2025
5431560
Dispatch and combine should use different completion flags, otherwise…
bobboli Oct 9, 2025
99bdc24
Support the case when combine payload is in workspace.
bobboli Oct 9, 2025
7686d6b
If prepareCombineKernel does not do copy, only one block is enough. O…
bobboli Oct 9, 2025
23ca8e3
Use a warp instead of a thread to do synchronization for the combine …
bobboli Oct 13, 2025
5bf1f11
Refactor sync code in combine.
bobboli Oct 13, 2025
4fbcebe
Let one warp to do synchronization in dispatch instead of one thread.
bobboli Oct 13, 2025
03dd9bb
Merge branch 'main' into alltoall
bobboli Oct 19, 2025
9e77641
Run pre-commit -a
bobboli Oct 19, 2025
1e162d9
Fix merge error.
bobboli Oct 20, 2025
462bd82
Rebrand: Use MnnvlLatency and MnnvlThroughput to differentiate two Al…
bobboli Oct 21, 2025
eda18a1
Rename: sm_topk_send_indices -> smem_topk_send_indices.
bobboli Oct 21, 2025
58c9761
Fix unittest.
bobboli Oct 21, 2025
9114538
moeA2APrepareDispatchKernel only needs ep_size threads.
bobboli Oct 22, 2025
30ef9db
Use lower case for a2a backend names.
bobboli Oct 22, 2025
15d2c94
Fix test_register_fake failure.
bobboli Oct 22, 2025
0f63602
Add alias and mutable mark to the tensors.
bobboli Oct 23, 2025
8431078
Use normal bindings rather than torch OP for moe A2A constants.
bobboli Oct 23, 2025
e544186
Update OP annotation.
bobboli Oct 23, 2025
d043cca
FIx import error. Adjust annotation.
bobboli Oct 24, 2025
0ffeb5c
Should not override test PATH to 20b inside 120b test class.
bobboli Oct 24, 2025
37684e6
Update license of cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
bobboli Oct 24, 2025
a04ac0b
Update license years.
bobboli Oct 24, 2025
3344f0f
compute_moe should pass a None output tensor (WAR to pass the test).
bobboli Oct 24, 2025
01c1c8d
Merge branch 'main' into alltoall
bobboli Oct 24, 2025
fb3c529
Add Attributions to flashinfer.
bobboli Oct 24, 2025
4fc3d81
Fix: Requires 8 GPUs to run the a2a unittest.
bobboli Oct 25, 2025
ea35fd6
Merge branch 'main' into alltoall
bobboli Oct 27, 2025
2c5989e
Pre-commit all.
bobboli Oct 27, 2025
39 changes: 39 additions & 0 deletions cpp/tensorrt_llm/common/envUtils.cpp
@@ -450,4 +450,43 @@ bool getEnvDisableChunkedAttentionInGenPhase()
return getBoolEnv("TRTLLM_DISABLE_CHUNKED_ATTENTION_IN_GEN_PHASE");
}

bool getEnvMoeA2AOneBlockPerToken()
{
// Default true; return false only if env set to "0"
static std::optional<int32_t> const val = getIntEnv("TLLM_MOE_A2A_ONE_BLOCK_PER_TOKEN");
if (!val.has_value())
{
return true;
}
return val.value() != 0;
}

static int sanitizeBlockSize(std::optional<int32_t> const& val)
{
// Default 256 when not set or invalid
int block = val.value_or(256);
// Clamp to sane CUDA bounds and warp multiples
if (block <= 0)
block = 256;
if (block > 1024)
block = 1024;
// Round up to the next multiple of 32 (warp size)
block = (block + 31) / 32 * 32;
if (block == 0)
block = 256;
return block;
}

int getEnvMoeA2ADispatchBlockSize()
{
static int const kBlock = sanitizeBlockSize(getIntEnv("TLLM_MOE_A2A_DISPATCH_BLOCK_SIZE"));
return kBlock;
}

int getEnvMoeA2ACombineBlockSize()
{
static int const kBlock = sanitizeBlockSize(getIntEnv("TLLM_MOE_A2A_COMBINE_BLOCK_SIZE"));
return kBlock;
}

} // namespace tensorrt_llm::common
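
These getters cache the environment read in a function-local static, so each value is fixed for the lifetime of the process. Below is a minimal sketch of how a launch path might consume them; the launcher name, the fallback grid heuristic, and the commented-out kernel launch are illustrative assumptions, not the PR's actual code.

#include <cuda_runtime.h>

#include "tensorrt_llm/common/envUtils.h"

namespace tc = tensorrt_llm::common;

void launchMoeA2ADispatch(int numTokens, cudaStream_t stream)
{
    // Threads per block: TLLM_MOE_A2A_DISPATCH_BLOCK_SIZE, sanitized to a warp
    // multiple in [32, 1024]; defaults to 256 when unset or invalid.
    int const blockSize = tc::getEnvMoeA2ADispatchBlockSize();

    // TLLM_MOE_A2A_ONE_BLOCK_PER_TOKEN (default on) selects a one-block-per-token grid;
    // the fallback below is an assumed heuristic for illustration only.
    int const gridSize = tc::getEnvMoeA2AOneBlockPerToken()
        ? numTokens
        : (numTokens + blockSize - 1) / blockSize;

    // moeA2ADispatchKernel<<<gridSize, blockSize, 0, stream>>>(/* ... */);
    (void) gridSize;
    (void) stream;
}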
9 changes: 9 additions & 0 deletions cpp/tensorrt_llm/common/envUtils.h
@@ -136,4 +136,13 @@ bool getEnvDisaggBenchmarkGenOnly();
// Whether to disable the chunked-attention in the generation phase.
bool getEnvDisableChunkedAttentionInGenPhase();

// Whether to use one block per token for MoE A2A kernels (default true).
bool getEnvMoeA2AOneBlockPerToken();

// TODO: Temporary, for development purposes only.
// Block size (threads per block) for MoE A2A Dispatch kernels (default 256 if unset or invalid)
int getEnvMoeA2ADispatchBlockSize();
// Block size (threads per block) for MoE A2A Combine kernels (default 256 if unset or invalid)
int getEnvMoeA2ACombineBlockSize();

} // namespace tensorrt_llm::common
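
As a usage note, here is a hedged sketch of how the sanitization behaves for an out-of-range value. The setenv call and main() harness are illustrative only; because each getter caches its first result, the variable must be set before the first call, and the combine assertion assumes TLLM_MOE_A2A_COMBINE_BLOCK_SIZE is not otherwise set.

#include <cassert>
#include <cstdlib>

#include "tensorrt_llm/common/envUtils.h"

int main()
{
    // Set before the first call; the getter caches the environment read.
    setenv("TLLM_MOE_A2A_DISPATCH_BLOCK_SIZE", "100", /*overwrite=*/1);

    // 100 is rounded up to the next warp multiple: (100 + 31) / 32 * 32 == 128.
    assert(tensorrt_llm::common::getEnvMoeA2ADispatchBlockSize() == 128);

    // Unset (or invalid) values fall back to the default of 256.
    assert(tensorrt_llm::common::getEnvMoeA2ACombineBlockSize() == 256);
    return 0;
}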