[Model Runner V2] Support DP/EP for spec decoding#35248

Closed
TheEpicDolphin wants to merge 1 commit into vllm-project:main from TheEpicDolphin:gdelfin/mrv2-spec-decode-ep

Conversation

@TheEpicDolphin
Collaborator

@TheEpicDolphin TheEpicDolphin commented Feb 25, 2026

Purpose

Currently, running spec decoding with DP + EP using the V2 model runner produces an error. The root cause is that None was being passed as the num_tokens_across_dp parameter for both prefill and decode in the draft model (speculator). The rank serving the request then attempts to coordinate with the other DP ranks (by calling coordinate_batch_across_dp) in order to microbatch, but never gets a response, resulting in a runtime error.
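The mismatch can be illustrated with a toy model (plain Python threads, not vLLM code): the barrier below stands in for the all_reduce inside coordinate_batch_across_dp, and the idle rank that received num_tokens_across_dp=None never enters it.

```python
import threading

# Toy illustration of the failure mode: a collective op (here, a barrier
# standing in for the gloo all_reduce) blocks until every participant joins.
barrier = threading.Barrier(parties=2)
results = {}

def serving_rank():
    # The rank serving the request coordinates the batch and waits for its
    # DP peer to join the collective.
    try:
        barrier.wait(timeout=0.2)
        results["rank0"] = "coordinated"
    except threading.BrokenBarrierError:
        results["rank0"] = "error: peer never joined the collective"

def idle_rank():
    # With num_tokens_across_dp=None, the draft model's forward skips the
    # coordination path entirely, so this rank never calls wait().
    results["rank1"] = "skipped coordination"

threads = [threading.Thread(target=serving_rank), threading.Thread(target=idle_rank)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results["rank0"])  # error: peer never joined the collective
```

In the real run, the stranded collective surfaces as the gloo "Connection reset by peer" error shown in the Before section.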

Test Plan

Non-spec-decoding server

VLLM_USE_V2_MODEL_RUNNER=1 vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size=1 --data-parallel-size=2 --enable-expert-parallel

No issues while serving requests.

Spec-decoding server

VLLM_USE_V2_MODEL_RUNNER=1 vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size=1 --data-parallel-size=2 --enable-expert-parallel  --speculative-config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 2}'

Client Script

from concurrent.futures import ThreadPoolExecutor, as_completed

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"

MAX_CONCURRENT_REQUESTS = 8

PROMPTS = [
    "Explain the theory of relativity in simple terms.",
    "What is the capital of France?",
    "Write a haiku about coding.",
    "List three benefits of regular exercise.",
    "How does a refrigerator keep food cold?",
    "What is the difference between HTTP and HTTPS?",
    "Suggest a short book to read on a rainy day.",
    "Name two inventions that changed the world.",
]

def run_one(prompt: str, request_id: int) -> tuple[int, str]:
    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return request_id, r.choices[0].message.content or ""


def main() -> None:
    num_workers = min(MAX_CONCURRENT_REQUESTS, len(PROMPTS))
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = {
            executor.submit(run_one, prompt, i): i
            for i, prompt in enumerate(PROMPTS)
        }
        for future in as_completed(futures):
            req_id, content = future.result()
            print(f"[{req_id}] {content}\n")


if __name__ == "__main__":
    main()

Before

Runtime error:

(EngineCore_DP0 pid=926457)                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=926457)     return self.__get_result()
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=926457)     raise self._exception
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/executor/uniproc_executor.py", line 79, in collective_rpc
(EngineCore_DP0 pid=926457)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=926457)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=926457)     return func(*args, **kwargs)
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=926457)     return func(*args, **kwargs)
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/worker/gpu_worker.py", line 633, in sample_tokens
(EngineCore_DP0 pid=926457)     return self.model_runner.sample_tokens(grammar_output)
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=926457)     return func(*args, **kwargs)
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/worker/gpu/model_runner.py", line 1093, in sample_tokens
(EngineCore_DP0 pid=926457)     draft_tokens = self.speculator.propose(
(EngineCore_DP0 pid=926457)                    ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=926457)     return func(*args, **kwargs)
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/worker/gpu/spec_decode/eagle/speculator.py", line 234, in propose
(EngineCore_DP0 pid=926457)     last_hidden_states, hidden_states = self.run_model(
(EngineCore_DP0 pid=926457)                                         ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=926457)     return func(*args, **kwargs)
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/worker/gpu/spec_decode/eagle/speculator.py", line 99, in run_model
(EngineCore_DP0 pid=926457)     with set_forward_context(
(EngineCore_DP0 pid=926457)          ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_DP0 pid=926457)     return next(self.gen)
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/forward_context.py", line 346, in set_forward_context
(EngineCore_DP0 pid=926457)     _, num_tokens_across_dp, _ = coordinate_batch_across_dp(
(EngineCore_DP0 pid=926457)                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/worker/dp_utils.py", line 230, in coordinate_batch_across_dp
(EngineCore_DP0 pid=926457)     _synchronize_dp_ranks(
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/worker/dp_utils.py", line 134, in _synchronize_dp_ranks
(EngineCore_DP0 pid=926457)     tensor = _run_ar(
(EngineCore_DP0 pid=926457)              ^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/worker/dp_utils.py", line 55, in _run_ar
(EngineCore_DP0 pid=926457)     dist.all_reduce(tensor, group=group)
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
(EngineCore_DP0 pid=926457)     return func(*args, **kwargs)
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 3014, in all_reduce
(EngineCore_DP0 pid=926457)     work.wait()
(EngineCore_DP0 pid=926457) RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:538] Read error [172.27.59.29]:39861: Connection reset by peer

After

No errors when serving requests. Output:

[2]  Lines of code weave,
Intricate patterns emerge,
Logic's dance set free.

[4]  A refrigerator keeps food cold through the process of heat transfer. It uses a coolant, which is a substance that can easily convert between liquid and gas states, called a refrigerant. 

The process begins in the compressor, where the refrigerant is compressed into a high-pressure, high

[3]  1. Improved physical health: Regular exercise helps to control weight, reduce the risk of heart disease, strengthen bones and muscles, and improve overall physical fitness. It can also help to manage conditions such as diabetes, high blood pressure, and arthritis.

2. Enhanced mental well-

[0]  The theory of relativity is a concept introduced by Albert Einstein that describes how the universe works at high speeds and in extreme conditions, such as near black holes. It is actually made up of two parts: the special theory of relativity and the general theory of relativity.

1. Special Theory of Relativity

[6]  I would recommend "The Alchemist" by Paulo Coelho. It's a beautifully written, short novel that tells the story of a young Andalusian shepherd named Santiago who dreams of traveling the world in search of a treasure. Along the way, he meets a series of spiritual guides,

[1]  The capital city of France is Paris. Known for its impressive architecture, exquisite art, and delicious cuisine, Paris is one of the most popular tourist destinations in the world. Some of the city's most famous landmarks include the Eiffel Tower, the Louvre Museum, Notre-

[5]  HTTP (Hypertext Transfer Protocol) and HTTPS (Hypertext Transfer Protocol Secure) are both protocols used for transmitting data over the web, but there is a key difference between the two:

HTTP is an unsecured protocol, which means that any data transmitted using HTTP can be

[7]  Two inventions that have significantly changed the world are the printing press and the internet.

The printing press, invented by Johannes Gutenberg in the 15th century, revolutionized the way information was reproduced and disseminated. It enabled the mass production of books, pamphlets, and other

Non-MoE models also work as expected with MRV2 + spec decode + DP. I verified by serving like so:

VLLM_USE_V2_MODEL_RUNNER=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size=1 --data-parallel-size=2 --speculative-config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 2}'

The Fix

  1. The model runner was attempting to prepare the communication buffer with the speculator object, but should use the speculator's model instead. EagleSpeculator is not an nn.Module instance and has no .modules() method.
  2. Compute the draft model's prefill and decode token counts across DP using the existing utility method.
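As a rough sketch of point 2 (plain lists in place of tensors; the real make_num_tokens_across_dp helper in the diff may differ in details such as padding):

```python
def make_num_tokens_across_dp(data_parallel_size: int, num_tokens: int) -> list[int]:
    # Pre-fill every DP rank's slot with the local token count so the draft
    # model's forward context receives a concrete vector instead of None,
    # avoiding the blocking coordinate_batch_across_dp call.
    return [num_tokens] * data_parallel_size

# The speculator builds one vector for the prefill pass and one for decode:
num_prefill_tokens_across_dp = make_num_tokens_across_dp(2, 128)
num_decode_tokens_across_dp = make_num_tokens_across_dp(2, 16)
print(num_prefill_tokens_across_dp)  # [128, 128]
```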

@mergify mergify bot added the v1 label Feb 25, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Data Parallelism (DP) in speculative decoding by correctly calculating and passing num_tokens_across_dp in the EagleSpeculator. The changes also include a bug fix in GPUModelRunner to correctly prepare the communication buffer for the speculator's model. The changes look correct and address existing FIXMEs. I have one suggestion to improve code readability and maintainability.

Comment on lines +231 to +233
num_prefill_tokens_across_dp = make_num_tokens_across_dp(
self.vllm_config.parallel_config.data_parallel_size, num_tokens
)
Contributor


high

To improve readability and reduce duplication, consider storing self.vllm_config.parallel_config.data_parallel_size in a local variable at the beginning of the propose method. This variable can then be used here and in the decode step later in the function (lines 320-322).

For example:

# At the beginning of the propose method
dp_size = self.vllm_config.parallel_config.data_parallel_size

# ... later in the method
num_prefill_tokens_across_dp = make_num_tokens_across_dp(dp_size, num_tokens)

# ... and later
num_decode_tokens_across_dp = make_num_tokens_across_dp(dp_size, num_tokens_padded)

@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-spec-decode-ep branch 3 times, most recently from d621076 to 86ade2e on February 25, 2026 02:00
@TheEpicDolphin TheEpicDolphin marked this pull request as ready for review on February 25, 2026 02:14
@izhuhaoran
Contributor

@TheEpicDolphin Thanks for this work! I found a hang issue when testing with MoE spec models under DP+EP.

Repro: GLM-4.7-Flash, DP2 TP1 EP2, with --speculative-config '{"num_speculative_tokens": 2, "method": "mtp", "model": "<path_to_GLM-4.7-Flash>"}'. Sending a single request triggers the hang.

The hang is caused by two issues:

  1. execute_dummy_batch does not trigger the drafter's propose(), so the idle DP rank only runs the target model forward while the active rank enters speculative decoding — the drafter's EP communication has no counterpart.
  2. num_tokens_across_dp in the speculator is not synced across DP ranks.

The original test with yuhuili/EAGLE-LLaMA3.1-Instruct-8B didn't hit this because that draft model is not MoE, so no EP all-to-all communication is involved.

I've submitted a follow-up PR #35294 to fix this on top of your changes.

also cc @WoosukKwon
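The hang condition described above can be sketched as a mismatch in the sequence of collective "phases" the two DP ranks execute (toy model, not vLLM code; function names are illustrative):

```python
def active_rank_phases(num_speculative_tokens: int) -> list[str]:
    # The rank with a request runs the target forward, then the drafter;
    # for a MoE draft model each draft forward performs EP all-to-all
    # collectives that every DP rank must participate in.
    return ["target_forward"] + ["draft_forward"] * num_speculative_tokens

def idle_rank_phases_buggy() -> list[str]:
    # execute_dummy_batch only runs the target model forward, so the
    # drafter's collectives on the active rank have no counterpart.
    return ["target_forward"]

def idle_rank_phases_fixed(num_speculative_tokens: int) -> list[str]:
    # The fix described in the comment above also runs dummy draft
    # forwards on the idle rank so the phase sequences match.
    return ["target_forward"] + ["draft_forward"] * num_speculative_tokens

print(active_rank_phases(2) == idle_rank_phases_buggy())   # False -> hang
print(active_rank_phases(2) == idle_rank_phases_fixed(2))  # True  -> no hang
```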

@TheEpicDolphin TheEpicDolphin changed the title [WIP][Model Runner V2] Support DP/EP for spec decoding [Model Runner V2] Support DP/EP for spec decoding Feb 25, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-spec-decode-ep branch from 86ade2e to 3460cbb on February 26, 2026 00:08
@mergify

mergify bot commented Feb 26, 2026

Hi @TheEpicDolphin, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-spec-decode-ep branch from 3460cbb to 756b507 on February 26, 2026 00:21
@mergify

mergify bot commented Feb 26, 2026

Hi @TheEpicDolphin, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-spec-decode-ep branch from 756b507 to bdc0ff5 on February 26, 2026 00:33
@mergify

mergify bot commented Feb 26, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @TheEpicDolphin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 26, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-spec-decode-ep branch from bdc0ff5 to 56fd884 on February 27, 2026 02:13
@mergify mergify bot removed the needs-rebase label Feb 27, 2026
Comment on lines +204 to +205
# [data_parallel_size]
num_prefill_tokens_across_dp: torch.Tensor | None,
Collaborator


Can we just keep the name num_tokens_across_dp?

Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-spec-decode-ep branch from 56fd884 to c68670e on February 27, 2026 03:36
@izhuhaoran
Contributor

@TheEpicDolphin There is still a hang issue when using MoE-based spec decode models with the current changes. I've incorporated your changes along with the fix in #35294. Would it be possible to review and merge #35294 directly (rather than splitting DP/EP support for spec decoding into two separate PRs)? Or if you have any other suggestions, I'm happy to discuss. Thanks for your time!

@TheEpicDolphin
Collaborator Author

TheEpicDolphin commented Feb 27, 2026

@izhuhaoran thanks for the MoE-based draft fix! Makes sense to just merge #35294 directly. I'll chat with Woosuk and see what the next steps should be

@mergify

mergify bot commented Mar 4, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @TheEpicDolphin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
