[Model Runner V2] Support DP/EP for spec decoding#35248

Closed
TheEpicDolphin wants to merge 1 commit into vllm-project:main from TheEpicDolphin:gdelfin/mrv2-spec-decode-ep

Conversation

@TheEpicDolphin
Collaborator

@TheEpicDolphin TheEpicDolphin commented Feb 25, 2026

Purpose

Currently, running spec decoding with DP + EP using the V2 model runner produces an error. The root cause is that None was being passed as the num_tokens_across_dp parameter for both prefill and decode in the draft model (speculator). The rank serving the request then attempts to coordinate with the other DP ranks (by calling coordinate_batch_across_dp) in order to microbatch, but never gets a response, resulting in a runtime error.
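The mismatch can be illustrated with a toy model (plain Python threads, not vLLM code): the barrier below stands in for the all_reduce inside coordinate_batch_across_dp, and the idle rank that received num_tokens_across_dp=None never enters it.

```python
import threading

# Toy illustration of the failure mode: a collective op (here, a barrier
# standing in for the gloo all_reduce) blocks until every participant joins.
barrier = threading.Barrier(parties=2)
results = {}

def serving_rank():
    # The rank serving the request coordinates the batch and waits for its
    # DP peer to join the collective.
    try:
        barrier.wait(timeout=0.2)
        results["rank0"] = "coordinated"
    except threading.BrokenBarrierError:
        results["rank0"] = "error: peer never joined the collective"

def idle_rank():
    # With num_tokens_across_dp=None, the draft model's forward skips the
    # coordination path entirely, so this rank never calls wait().
    results["rank1"] = "skipped coordination"

threads = [threading.Thread(target=serving_rank), threading.Thread(target=idle_rank)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results["rank0"])  # error: peer never joined the collective
```

In the real run, the stranded collective surfaces as the gloo "Connection reset by peer" error shown in the Before section.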

Test Plan

Non-spec-decoding server

VLLM_USE_V2_MODEL_RUNNER=1 vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size=1 --data-parallel-size=2 --enable-expert-parallel

No issues while serving requests.

Spec-decoding server

VLLM_USE_V2_MODEL_RUNNER=1 vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size=1 --data-parallel-size=2 --enable-expert-parallel  --speculative-config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 2}'

Client Script

from concurrent.futures import ThreadPoolExecutor, as_completed

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"

MAX_CONCURRENT_REQUESTS = 8

PROMPTS = [
    "Explain the theory of relativity in simple terms.",
    "What is the capital of France?",
    "Write a haiku about coding.",
    "List three benefits of regular exercise.",
    "How does a refrigerator keep food cold?",
    "What is the difference between HTTP and HTTPS?",
    "Suggest a short book to read on a rainy day.",
    "Name two inventions that changed the world.",
]

def run_one(prompt: str, request_id: int) -> tuple[int, str]:
    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return request_id, r.choices[0].message.content or ""


def main() -> None:
    num_workers = min(MAX_CONCURRENT_REQUESTS, len(PROMPTS))
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = {
            executor.submit(run_one, prompt, i): i
            for i, prompt in enumerate(PROMPTS)
        }
        for future in as_completed(futures):
            req_id, content = future.result()
            print(f"[{req_id}] {content}\n")


if __name__ == "__main__":
    main()

Before

Runtime error:

(EngineCore_DP0 pid=926457)                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=926457)     return self.__get_result()
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=926457)     raise self._exception
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/executor/uniproc_executor.py", line 79, in collective_rpc
(EngineCore_DP0 pid=926457)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=926457)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=926457)     return func(*args, **kwargs)
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=926457)     return func(*args, **kwargs)
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/worker/gpu_worker.py", line 633, in sample_tokens
(EngineCore_DP0 pid=926457)     return self.model_runner.sample_tokens(grammar_output)
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=926457)     return func(*args, **kwargs)
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/worker/gpu/model_runner.py", line 1093, in sample_tokens
(EngineCore_DP0 pid=926457)     draft_tokens = self.speculator.propose(
(EngineCore_DP0 pid=926457)                    ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=926457)     return func(*args, **kwargs)
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/worker/gpu/spec_decode/eagle/speculator.py", line 234, in propose
(EngineCore_DP0 pid=926457)     last_hidden_states, hidden_states = self.run_model(
(EngineCore_DP0 pid=926457)                                         ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=926457)     return func(*args, **kwargs)
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/worker/gpu/spec_decode/eagle/speculator.py", line 99, in run_model
(EngineCore_DP0 pid=926457)     with set_forward_context(
(EngineCore_DP0 pid=926457)          ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_DP0 pid=926457)     return next(self.gen)
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/forward_context.py", line 346, in set_forward_context
(EngineCore_DP0 pid=926457)     _, num_tokens_across_dp, _ = coordinate_batch_across_dp(
(EngineCore_DP0 pid=926457)                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/worker/dp_utils.py", line 230, in coordinate_batch_across_dp
(EngineCore_DP0 pid=926457)     _synchronize_dp_ranks(
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/worker/dp_utils.py", line 134, in _synchronize_dp_ranks
(EngineCore_DP0 pid=926457)     tensor = _run_ar(
(EngineCore_DP0 pid=926457)              ^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/vllm/vllm/v1/worker/dp_utils.py", line 55, in _run_ar
(EngineCore_DP0 pid=926457)     dist.all_reduce(tensor, group=group)
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
(EngineCore_DP0 pid=926457)     return func(*args, **kwargs)
(EngineCore_DP0 pid=926457)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=926457)   File "/home/gdelfin/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 3014, in all_reduce
(EngineCore_DP0 pid=926457)     work.wait()
(EngineCore_DP0 pid=926457) RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:538] Read error [172.27.59.29]:39861: Connection reset by peer

After

No errors when serving requests. Output:

[2]  Lines of code weave,
Intricate patterns emerge,
Logic's dance set free.

[4]  A refrigerator keeps food cold through the process of heat transfer. It uses a coolant, which is a substance that can easily convert between liquid and gas states, called a refrigerant. 

The process begins in the compressor, where the refrigerant is compressed into a high-pressure, high

[3]  1. Improved physical health: Regular exercise helps to control weight, reduce the risk of heart disease, strengthen bones and muscles, and improve overall physical fitness. It can also help to manage conditions such as diabetes, high blood pressure, and arthritis.

2. Enhanced mental well-

[0]  The theory of relativity is a concept introduced by Albert Einstein that describes how the universe works at high speeds and in extreme conditions, such as near black holes. It is actually made up of two parts: the special theory of relativity and the general theory of relativity.

1. Special Theory of Relativity

[6]  I would recommend "The Alchemist" by Paulo Coelho. It's a beautifully written, short novel that tells the story of a young Andalusian shepherd named Santiago who dreams of traveling the world in search of a treasure. Along the way, he meets a series of spiritual guides,

[1]  The capital city of France is Paris. Known for its impressive architecture, exquisite art, and delicious cuisine, Paris is one of the most popular tourist destinations in the world. Some of the city's most famous landmarks include the Eiffel Tower, the Louvre Museum, Notre-

[5]  HTTP (Hypertext Transfer Protocol) and HTTPS (Hypertext Transfer Protocol Secure) are both protocols used for transmitting data over the web, but there is a key difference between the two:

HTTP is an unsecured protocol, which means that any data transmitted using HTTP can be

[7]  Two inventions that have significantly changed the world are the printing press and the internet.

The printing press, invented by Johannes Gutenberg in the 15th century, revolutionized the way information was reproduced and disseminated. It enabled the mass production of books, pamphlets, and other

Non-MoE models also work as expected with MRV2 + spec decode + DP. I verified by serving like so:

VLLM_USE_V2_MODEL_RUNNER=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size=1 --data-parallel-size=2 --speculative-config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 2}'

The Fix

  1. The model runner was attempting to prepare the communication buffer with the speculator object, but should use the speculator's model instead. EagleSpeculator is not an nn.Module instance and has no .modules() method.
  2. Compute the draft model's prefill and decode token counts across DP using the existing utility method.
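As a rough sketch of point 2 (plain lists in place of tensors; the real make_num_tokens_across_dp helper in the diff may differ in details such as padding):

```python
def make_num_tokens_across_dp(data_parallel_size: int, num_tokens: int) -> list[int]:
    # Pre-fill every DP rank's slot with the local token count so the draft
    # model's forward context receives a concrete vector instead of None,
    # avoiding the blocking coordinate_batch_across_dp call.
    return [num_tokens] * data_parallel_size

# The speculator builds one vector for the prefill pass and one for decode:
num_prefill_tokens_across_dp = make_num_tokens_across_dp(2, 128)
num_decode_tokens_across_dp = make_num_tokens_across_dp(2, 16)
print(num_prefill_tokens_across_dp)  # [128, 128]
```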

@mergify mergify bot added the v1 label Feb 25, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Data Parallelism (DP) in speculative decoding by correctly calculating and passing num_tokens_across_dp in the EagleSpeculator. The changes also include a bug fix in GPUModelRunner to correctly prepare the communication buffer for the speculator's model. The changes look correct and address existing FIXMEs. I have one suggestion to improve code readability and maintainability.

Comment on lines +231 to +233
num_prefill_tokens_across_dp = make_num_tokens_across_dp(
self.vllm_config.parallel_config.data_parallel_size, num_tokens
)
Contributor


high

To improve readability and reduce duplication, consider storing self.vllm_config.parallel_config.data_parallel_size in a local variable at the beginning of the propose method. This variable can then be used here and in the decode step later in the function (lines 320-322).

For example:

# At the beginning of the propose method
dp_size = self.vllm_config.parallel_config.data_parallel_size

# ... later in the method
num_prefill_tokens_across_dp = make_num_tokens_across_dp(dp_size, num_tokens)

# ... and later
num_decode_tokens_across_dp = make_num_tokens_across_dp(dp_size, num_tokens_padded)

@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-spec-decode-ep branch 3 times, most recently from d621076 to 86ade2e on February 25, 2026 02:00
@TheEpicDolphin TheEpicDolphin marked this pull request as ready for review on February 25, 2026 02:14
@izhuhaoran
Contributor

@TheEpicDolphin Thanks for this work! I found a hang issue when testing with MoE spec models under DP+EP.

Repro: GLM-4.7-Flash, DP2 TP1 EP2, with --speculative-config '{"num_speculative_tokens": 2, "method": "mtp", "model": "<path_to_GLM-4.7-Flash>"}'. Sending a single request triggers the hang.

The hang is caused by two issues:

  1. execute_dummy_batch does not trigger the drafter's propose(), so the idle DP rank only runs the target model forward while the active rank enters speculative decoding — the drafter's EP communication has no counterpart.
  2. num_tokens_across_dp in the speculator is not synced across DP ranks.

The original test with yuhuili/EAGLE-LLaMA3.1-Instruct-8B didn't hit this because that draft model is not MoE, so no EP all-to-all communication is involved.

I've submitted a follow-up PR #35294 to fix this on top of your changes.

also cc @WoosukKwon
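The hang condition described above can be sketched as a mismatch in the sequence of collective "phases" the two DP ranks execute (toy model, not vLLM code; function names are illustrative):

```python
def active_rank_phases(num_speculative_tokens: int) -> list[str]:
    # The rank with a request runs the target forward, then the drafter;
    # for a MoE draft model each draft forward performs EP all-to-all
    # collectives that every DP rank must participate in.
    return ["target_forward"] + ["draft_forward"] * num_speculative_tokens

def idle_rank_phases_buggy() -> list[str]:
    # execute_dummy_batch only runs the target model forward, so the
    # drafter's collectives on the active rank have no counterpart.
    return ["target_forward"]

def idle_rank_phases_fixed(num_speculative_tokens: int) -> list[str]:
    # The fix described in the comment above also runs dummy draft
    # forwards on the idle rank so the phase sequences match.
    return ["target_forward"] + ["draft_forward"] * num_speculative_tokens

print(active_rank_phases(2) == idle_rank_phases_buggy())   # False -> hang
print(active_rank_phases(2) == idle_rank_phases_fixed(2))  # True  -> no hang
```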

@TheEpicDolphin TheEpicDolphin changed the title [WIP][Model Runner V2] Support DP/EP for spec decoding [Model Runner V2] Support DP/EP for spec decoding Feb 25, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-spec-decode-ep branch from 86ade2e to 3460cbb on February 26, 2026 00:08
@mergify

mergify bot commented Feb 26, 2026

Hi @TheEpicDolphin, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-spec-decode-ep branch from 3460cbb to 756b507 on February 26, 2026 00:21
@mergify

mergify bot commented Feb 26, 2026

Hi @TheEpicDolphin, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-spec-decode-ep branch from 756b507 to bdc0ff5 on February 26, 2026 00:33
@mergify

mergify bot commented Feb 26, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @TheEpicDolphin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 26, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-spec-decode-ep branch from bdc0ff5 to 56fd884 on February 27, 2026 02:13
@mergify mergify bot removed the needs-rebase label Feb 27, 2026
Comment on lines +204 to +205
# [data_parallel_size]
num_prefill_tokens_across_dp: torch.Tensor | None,
Collaborator


Can we just keep the name num_tokens_across_dp?

Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-spec-decode-ep branch from 56fd884 to c68670e on February 27, 2026 03:36
@izhuhaoran
Contributor

@TheEpicDolphin There is still a hang issue when using MoE-based spec decode models with the current changes. I've incorporated your changes along with the fix in #35294. Would it be possible to review and merge #35294 directly (rather than splitting DP/EP support for spec decoding into two separate PRs)? Or if you have any other suggestions, I'm happy to discuss. Thanks for your time!

@TheEpicDolphin
Collaborator Author

TheEpicDolphin commented Feb 27, 2026

@izhuhaoran thanks for the MoE-based draft fix! Makes sense to just merge #35294 directly. I'll chat with Woosuk and see what the next steps should be

@mergify

mergify bot commented Mar 4, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @TheEpicDolphin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
