[Bugfix] Fix MoE Model DP+TP with NaiveAll2AllManager Bug #32705
River12 wants to merge 3 commits into vllm-project:main
Conversation
Code Review
The pull request effectively addresses a bug in the NaiveAll2AllManager where the broadcast operation was using an incorrect distributed group for MoE models with DP2TP2 configuration. The introduction of the dist_group variable correctly selects between the expert parallel group and the data parallel group based on is_sequence_parallel, ensuring the broadcast operation is performed within the appropriate communication context. The change directly resolves the identified issue, and no new critical or high-severity issues were found in the modified code.
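The group selection described above can be illustrated with a minimal sketch. Note this is not the actual vLLM code: the stand-in `Group` objects and the direction of the selection (expert parallel group when `is_sequence_parallel` is set, data parallel group otherwise) are assumptions based on the review comment.

```python
# Illustrative sketch of the fix: pick the broadcast group based on
# is_sequence_parallel instead of always using one fixed group.
# Group, EP_GROUP, and DP_GROUP are stand-ins for vLLM's internal
# process-group accessors, not real vLLM APIs.
from dataclasses import dataclass


@dataclass
class Group:
    name: str


EP_GROUP = Group("ep")  # stand-in for the expert parallel group
DP_GROUP = Group("dp")  # stand-in for the data parallel group


def select_broadcast_group(is_sequence_parallel: bool) -> Group:
    # Before the fix, the broadcast always ran in the same group;
    # the fix chooses the group matching the communication context.
    return EP_GROUP if is_sequence_parallel else DP_GROUP
```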
@River12 could you add a test plan? cc: @tlrmchlsmth / @mgoin, would you be able to help review this change?
@sarckk Thanks, a detailed test plan has been added. cc @tlrmchlsmth, @mgoin
tlrmchlsmth
left a comment
Thanks for the fix!
Two questions:
- Does the same thing happen with VLLM_ALL2ALL_BACKEND="allgather_reducescatter"?
- Seems like this could have been introduced in #32567. Could you confirm if that seems right?

Looked into it. There is no issue with AG/RS, as it already has the proper selection of the group. I don't think that #32567 introduced this; I think this was just not correctly implemented for Naive before. That being said, we should probably deprecate naive. I'm not sure of the value of it now that we have AG/RS.
Thanks for the reviews.
Summary:
For an MoE model with DP2TP2, the two DP groups produce different responses when using NaiveAll2AllManager, because the broadcast operation runs in an incorrect distributed group.
Signed-off-by: Dezhan Tu <dztu@meta.com>
Test Plan:
Test DP2TP2 with VLLM_ALL2ALL_BACKEND="naive" on an MoE model. The testing script below is modified from `examples/offline_inference/torchrun_dp_example.py`:
- Input the same prompt to both DP groups
- Use the default MoE model `microsoft/Phi-mini-MoE-instruct`
```
import argparse

from vllm import LLM, SamplingParams


def parse_args():
    parser = argparse.ArgumentParser(
        description="Data-parallel inference with torchrun"
    )
    parser.add_argument(
        "--tp-size",
        type=int,
        default=1,
        help="Tensor parallel size (default: 1)",
    )
    parser.add_argument(
        "--pp-size",
        type=int,
        default=1,
        help="Pipeline parallel size (default: 1)",
    )
    parser.add_argument(
        "--dp-size",
        type=int,
        default=2,
        help="Data parallel size (default: 2)",
    )
    parser.add_argument(
        "--enable-ep",
        action="store_true",
        help="Enable expert parallel (default: False)",
    )
    parser.add_argument(
        "--model",
        type=str,
        default="microsoft/Phi-mini-MoE-instruct",
        help="Model name or path (default: microsoft/Phi-mini-MoE-instruct)",
    )
    parser.add_argument(
        "--max-model-len",
        type=int,
        default=4096,
        help="Maximum model length (default: 4096)",
    )
    parser.add_argument(
        "--gpu-memory-utilization",
        type=float,
        default=0.6,
        help="GPU memory utilization (default: 0.6)",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=1,
        help="Random seed (default: 1)",
    )
    return parser.parse_args()


args = parse_args()

# Create prompts, the same across all ranks
prompts = [
    "Hello, my name is",
    "Hello, my name is",
]

# Create sampling parameters, the same across all ranks
sampling_params = SamplingParams(temperature=0.0, top_p=1.0)

# Use `distributed_executor_backend="external_launcher"` so that
# this llm engine/instance only creates one worker.
# It is important to set an explicit seed to make sure that
# all ranks have the same random seed, so that sampling can be
# deterministic across ranks.
llm = LLM(
    model=args.model,
    tensor_parallel_size=args.tp_size,
    data_parallel_size=args.dp_size,
    pipeline_parallel_size=args.pp_size,
    enable_expert_parallel=args.enable_ep,
    distributed_executor_backend="external_launcher",
    max_model_len=args.max_model_len,
    gpu_memory_utilization=args.gpu_memory_utilization,
    seed=args.seed,
)

dp_rank = llm.llm_engine.vllm_config.parallel_config.data_parallel_rank
dp_size = llm.llm_engine.vllm_config.parallel_config.data_parallel_size

# Each DP rank keeps only its own slice of the (identical) prompt list
prompts = [
    f"{idx}.{prompt}" for idx, prompt in enumerate(prompts) if idx % dp_size == dp_rank
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(
        f"DP Rank: {dp_rank} Prompt: {prompt!r}\nGenerated text: {generated_text!r}\n"
    )
```
Running command:
```
FLASHINFER_DISABLE_VERSION_CHECK=1 VLLM_ALL2ALL_BACKEND="naive" \
torchrun --nproc-per-node=4 examples/offline_inference/torchrun_dp_example.py \
--tp-size=2 --dp-size=2
```
Log before the fix (the responses from the 2nd DP group are wrong):
```
DP Rank: 0 Prompt: '0.Hello, my name is'
Generated text: ' 0.Hello, my name is 0.Hello, my name is'
DP Rank: 1 Prompt: '1.Hello, my name is'
Generated text: 'aaaa st sample task SS field Story notion snapshot Reyn final moment Reyn Ku Ent dead'
DP Rank: 0 Prompt: '0.Hello, my name is'
Generated text: ' 0.Hello, my name is 0.Hello, my name is'
DP Rank: 1 Prompt: '1.Hello, my name is'
Generated text: 'aaaa st sample task SS field Story notion snapshot Reyn final moment Reyn Ku Ent dead'
```
Log after the fix:
```
DP Rank: 0 Prompt: '0.Hello, my name is'
Generated text: ' John.\n\n### Instruction 2 (Much more difficult with'
DP Rank: 1 Prompt: '1.Hello, my name is'
Generated text: ' John.\n2.I am a software developer.\n3.I love'
DP Rank: 1 Prompt: '1.Hello, my name is'
Generated text: ' John.\n2.I am a software developer.\n3.I love'
DP Rank: 0 Prompt: '0.Hello, my name is'
Generated text: ' John.\n\n### Instruction 2 (Much more difficult with'
```
Reviewed By: diviramon, mutinifni, wushidonguc
Differential Revision: D91016491