[Model Runner V2] Support DP/EP for spec decoding#35248
TheEpicDolphin wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request adds support for Data Parallelism (DP) in speculative decoding by correctly calculating and passing num_tokens_across_dp in the EagleSpeculator. The changes also include a bug fix in GPUModelRunner to correctly prepare the communication buffer for the speculator's model. The changes look correct and address existing FIXMEs. I have one suggestion to improve code readability and maintainability.
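To illustrate the failure mode this PR fixes (one rank skipping the DP coordination collective while its peers wait), here is a toy sketch using a threading.Barrier as a stand-in for coordinate_batch_across_dp. This is illustrative only, not vLLM code; the helper names and shapes are assumptions:

```python
import threading

def run_rank(rank, num_tokens_across_dp, barrier, results):
    # Stand-in for coordinate_batch_across_dp: every DP rank must join.
    if num_tokens_across_dp is None:
        # Buggy path: this rank never joins the collective.
        results[rank] = "skipped"
        return
    try:
        barrier.wait(timeout=0.5)
        results[rank] = "coordinated"
    except threading.BrokenBarrierError:
        # The peer never arrived, so the waiting rank errors out.
        results[rank] = "timed out"

barrier = threading.Barrier(2)
results = {}
threads = [
    threading.Thread(target=run_rank, args=(0, [4, 4], barrier, results)),
    threading.Thread(target=run_rank, args=(1, None, barrier, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Rank 1 passes None and skips coordination, so rank 0's wait breaks instead of completing — the analogue of the runtime error this PR addresses by always passing a real num_tokens_across_dp.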
num_prefill_tokens_across_dp = make_num_tokens_across_dp(
    self.vllm_config.parallel_config.data_parallel_size, num_tokens
)
To improve readability and reduce duplication, consider storing self.vllm_config.parallel_config.data_parallel_size in a local variable at the beginning of the propose method. This variable can then be used here and in the decode step later in the function (lines 320-322).
For example:
# At the beginning of the propose method
dp_size = self.vllm_config.parallel_config.data_parallel_size
# ... later in the method
num_prefill_tokens_across_dp = make_num_tokens_across_dp(dp_size, num_tokens)
# ... and later
num_decode_tokens_across_dp = make_num_tokens_across_dp(dp_size, num_tokens_padded)
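For readers unfamiliar with the helper, make_num_tokens_across_dp builds the per-rank token-count tensor that the speculator forwards to DP coordination. A minimal hypothetical sketch is shown below; the actual vLLM implementation may differ (e.g. it may gather real counts from each rank rather than repeating one value):

```python
import torch

def make_num_tokens_across_dp(dp_size: int, num_tokens: int) -> torch.Tensor:
    # Hypothetical sketch: assume every DP rank runs the same (padded)
    # batch size, so the per-rank count is simply repeated dp_size times.
    return torch.full((dp_size,), num_tokens, dtype=torch.int32)

counts = make_num_tokens_across_dp(2, 128)
print(counts.tolist())  # → [128, 128]
```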
@TheEpicDolphin Thanks for this work! I found a hang issue when testing with MoE spec models under DP+EP. Repro: GLM-4.7-Flash, DP2 TP1 EP2. The hang is caused by two issues; I've submitted a follow-up PR #35294 to fix them on top of your changes. Also cc @WoosukKwon
Hi @TheEpicDolphin, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
This pull request has merge conflicts that must be resolved before it can be merged.
# [data_parallel_size]
num_prefill_tokens_across_dp: torch.Tensor | None,
Can we just keep the name num_tokens_across_dp?
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
@TheEpicDolphin There is still a hang issue when using MoE-based spec decode models with the current changes. I've incorporated your changes along with the fix in #35294. Would it be possible to review and merge #35294 directly (rather than splitting DP/EP support for spec decoding into two separate PRs)? Or if you have any other suggestions, I'm happy to discuss. Thanks for your time!
@izhuhaoran thanks for the MoE-based draft fix! Makes sense to just merge #35294 directly. I'll chat with Woosuk and see what the next steps should be.
Purpose
Currently, you get an error if you try to run spec decoding with DP + EP using the V2 model runner. The root cause is that None was being passed as the num_tokens_across_dp parameter for both prefill and decode for the draft model (speculator). This causes a failure because the rank serving the request will attempt to coordinate with the other DP ranks (by calling coordinate_batch_across_dp) to microbatch, but won't get a response, resulting in a runtime error.
Test Plan
Non-spec-decoding server
No issues while serving requests.
Spec-decoding server
Client Script
Before
Runtime error:
After
No errors when serving requests. Output:
Non-MoE models also work as expected with MRV2 + spec decode + DP. I verified by serving like so:
The Fix
.modules().