Revert "[MoE Refactor] Remove MoE DP chunking" (#39107)#39853
Revert "[MoE Refactor] Remove MoE DP chunking" (#39107)#39853vllm-agent wants to merge 1 commit intovllm-project:mainfrom
Conversation
This reverts commit e1e318a.
Code Review
This pull request introduces ChunkingMoERunner, a wrapper that enables processing large MoE batches in smaller chunks to maintain lockstep across Data-Parallel ranks. It adds environment variables VLLM_MOE_DP_CHUNK_SIZE and VLLM_ENABLE_MOE_DP_CHUNK for configuration and implements a chunked_sizes context manager in DPMetadata to manage per-rank token counts. The changes also include refactoring test utilities and engine configuration to decouple scheduler defaults from parallel settings. Review feedback identifies critical bugs in the chunking logic related to zero-token scenarios, which could cause negative indexing or failed tensor copies during collective operations.
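As a rough illustration of the chunking scheme described above, the sketch below processes a rank's batch in fixed-size chunks while every Data-Parallel rank runs the same number of iterations. This is a hypothetical sketch, not the vLLM implementation: the names run_moe_in_chunks, moe_forward, and CHUNK_SIZE are invented, with CHUNK_SIZE standing in for VLLM_MOE_DP_CHUNK_SIZE.

import torch

# Hypothetical sketch of DP-lockstep chunking; CHUNK_SIZE stands in for
# VLLM_MOE_DP_CHUNK_SIZE. Not the actual ChunkingMoERunner code.
CHUNK_SIZE = 256

def run_moe_in_chunks(hidden_states: torch.Tensor,
                      max_tokens_across_dp: int,
                      moe_forward) -> torch.Tensor:
    # Every DP rank iterates ceil(max_tokens_across_dp / CHUNK_SIZE) times,
    # even when its local batch is shorter, so the collective kernels inside
    # moe_forward are invoked the same number of times on every rank.
    num_tokens = hidden_states.size(0)
    out = torch.empty_like(hidden_states)
    num_chunks = -(-max_tokens_across_dp // CHUNK_SIZE)  # ceil division

    for i in range(num_chunks):
        chunk_start = min(i * CHUNK_SIZE, num_tokens)
        chunk_end = min(chunk_start + CHUNK_SIZE, num_tokens)
        # Ranks whose real tokens are exhausted still call moe_forward, on an
        # empty slice here; the review comments below cover why that zero-token
        # case needs a dummy token (or a zeroed workspace) instead.
        chunk_out = moe_forward(hidden_states[chunk_start:chunk_end])
        out[chunk_start:chunk_end] = chunk_out
    return out

In the PR itself, whether chunking runs at all and the chunk size are controlled by VLLM_ENABLE_MOE_DP_CHUNK and VLLM_MOE_DP_CHUNK_SIZE.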
assert out_slice.size(0) >= slice_size
out_slice = out_slice[:slice_size, :]
out_slice.copy_(orig_slice, non_blocking=True)
The copy_ operation will fail if orig_slice is empty, which occurs when num_tokens is 0 on a Data Parallel rank. In such cases, the workspace buffer should be zeroed out to provide a valid (though dummy) input for the collective kernels to maintain lockstep across all ranks.
assert out_slice.size(0) >= slice_size
out_slice = out_slice[:slice_size, :]
if orig_slice.numel() > 0:
    out_slice.copy_(orig_slice, non_blocking=True)
else:
    out_slice.zero_()
return out_slice

chunk_start = min(chunk_start, num_tokens - 1)
chunk_end = min(chunk_end, num_tokens)
The clamping logic fails when num_tokens is 0 because num_tokens - 1 becomes -1, leading to negative indices and size mismatches during tensor copying. When num_tokens is 0, the runner should still provide a single dummy token (consistent with _compute_chunked_local_num_tokens in forward_context.py) to ensure all DP ranks participate in collective operations.
# clamp start and end
if num_tokens > 0:
    chunk_start = min(chunk_start, num_tokens - 1)
    chunk_end = min(chunk_end, num_tokens)
    # If the chunk is entirely padding for this rank
    if chunk_start >= chunk_end:
        chunk_start = num_tokens - 1
        chunk_end = num_tokens
else:
    chunk_start = 0
    chunk_end = 1
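For concreteness, here is a small self-contained check of how the suggested clamping behaves. The helper name clamp_chunk and the token counts are invented for this sketch and are not taken from the runner or the test suite.

def clamp_chunk(chunk_start: int, chunk_end: int, num_tokens: int) -> tuple[int, int]:
    # Clamp a global chunk range to this rank's local token count, falling
    # back to a single dummy token when the rank has no tokens at all.
    if num_tokens > 0:
        chunk_start = min(chunk_start, num_tokens - 1)
        chunk_end = min(chunk_end, num_tokens)
        if chunk_start >= chunk_end:  # chunk is entirely padding for this rank
            chunk_start, chunk_end = num_tokens - 1, num_tokens
    else:
        chunk_start, chunk_end = 0, 1  # dummy token so collectives still run
    return chunk_start, chunk_end

# A rank with no tokens still gets a one-token dummy slice.
assert clamp_chunk(256, 512, 0) == (0, 1)
# A padding-only chunk collapses to the last real token.
assert clamp_chunk(256, 512, 100) == (99, 100)
# A partially real chunk clamps its end to num_tokens.
assert clamp_chunk(256, 512, 300) == (256, 300)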
# Store outputs
# TODO(bnell): document when chunk_start >= num_tokens
if chunk_start < num_tokens:
The copy-back condition should use the original chunk start index (chunk_start_) rather than the clamped chunk_start. If chunk_start_ >= num_tokens, the chunk is purely padding for this rank, and its result should not be copied back into the output tensor (which would be redundant or even crash if num_tokens is 0).
- if chunk_start < num_tokens:
+ if chunk_start_ < num_tokens:
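The distinction matters inside the chunk loop, where the unclamped offset chunk_start_ is shared by all DP ranks and chunk_start/chunk_end are the locally clamped slice bounds. Below is a minimal sketch of that structure, reusing the clamp_chunk helper from the sketch above; chunked_copy_back and run_chunk are invented names, not the runner's API.

import torch

def chunked_copy_back(hidden_states: torch.Tensor,
                      output: torch.Tensor,
                      max_tokens_across_dp: int,
                      chunk_size: int,
                      run_chunk) -> None:
    # chunk_start_ is the unclamped offset, identical on every DP rank;
    # chunk_start/chunk_end are clamped to this rank's local tokens.
    num_tokens = hidden_states.size(0)
    for chunk_start_ in range(0, max_tokens_across_dp, chunk_size):
        chunk_end_ = min(chunk_start_ + chunk_size, max_tokens_across_dp)
        chunk_start, chunk_end = clamp_chunk(chunk_start_, chunk_end_, num_tokens)

        chunk_out = run_chunk(hidden_states[chunk_start:chunk_end])

        # Gate the copy-back on the *original* offset: a purely padded chunk
        # (chunk_start_ >= num_tokens) must not overwrite the last real token's
        # output, and when num_tokens == 0 there is nothing to copy into at all.
        if chunk_start_ < num_tokens:
            output[chunk_start:chunk_end].copy_(chunk_out)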
Revert of #39107
This reverts commit e1e318a.
Reason: This PR is suspected of causing 1 new CI failure in build #61308:
test_moe_layer[False-deepep_low_latency-2-1-True] crashed with SIGABRT

The PR modified tests/kernels/moe/test_moe_layer.py and multiple MoE layer/runner files (fused_moe/layer.py, fused_moe/config.py, fused_moe/runner/, etc.).

Original PR: #39107
Auto-generated by CI failure analyzer