[BugFix] Support online dense model DP without overhead #30739

Merged
youkaichao merged 7 commits into vllm-project:main from njhill:non-moe-dp
Jan 2, 2026

Conversation

@njhill
Member

@njhill njhill commented Dec 16, 2025

Currently, there is unnecessary overhead when running non-MoE models in a data-parallel configuration: steps across the ranks are synchronized with redundant all-reduce ops, and coordination is performed to ensure that "idle" ranks perform dummy forward passes.

This PR changes the parallel config at the worker level to be equivalent to DP=1 for non-MoE models, so each rank operates independently. When internal load-balancing is used, the DP coordinator still runs to propagate stats back from the engines for load balancing purposes, but the step/wave synchronization logic is disabled.

Fixes #24461.
Fixes #30655.

This is supported in the online / AsyncLLM case only.

Offline DP will now fail during startup for non-MoE models (it makes no sense to use DP in that configuration).
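The worker-level change described above can be sketched roughly as follows. This is an illustration only; the class and function names are hypothetical stand-ins, not the PR's actual code. The key idea is that a dense model's worker sees an effective DP size of 1, while the original rank is preserved (as a `data_parallel_index`-style field) so the coordinator can still attribute load-balancing stats.

```python
# Illustrative sketch (names hypothetical, not vLLM's actual API):
# for dense (non-MoE) models, collapse DP at the worker level so each
# rank runs independently, but remember the original rank for stats.
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelConfig:
    data_parallel_size: int
    data_parallel_rank: int
    data_parallel_index: int = 0  # original DP rank, kept for load-balancing stats


def worker_parallel_config(cfg: ParallelConfig, is_moe: bool) -> ParallelConfig:
    """Return the config the worker should actually use."""
    if is_moe or cfg.data_parallel_size == 1:
        return cfg  # MoE models keep real DP coordination
    return ParallelConfig(
        data_parallel_size=1,
        data_parallel_rank=0,
        data_parallel_index=cfg.data_parallel_rank,
    )
```

With this shape, rank 2 of a DP=4 dense deployment would run as if DP=1 while still reporting itself as index 2 to the coordinator.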

Benchmark on 4xH100:

vllm serve Qwen/Qwen3-8B --data-parallel-size 4 --uvicorn-log-level=error
vllm bench serve \
    --backend vllm \
    --model Qwen/Qwen3-8B \
    --dataset-name random \
    --random-input-len 128 \
    --random-output-len 512 \
    --ignore-eos \
    --port 8033 \
    --num-prompts 4000 \
    --max-concurrency 200 \
    --seed 42

Before

============ Serving Benchmark Result ============
Successful requests:                     4000      
Failed requests:                         0         
Maximum request concurrency:             200       
Benchmark duration (s):                  104.41    
Total input tokens:                      512000    
Total generated tokens:                  2048000   
Request throughput (req/s):              38.31     
Output token throughput (tok/s):         19615.26  
Peak output token throughput (tok/s):    21597.00  
Peak concurrent requests:                400.00    
Total token throughput (tok/s):          24519.08  
---------------Time to First Token----------------
Mean TTFT (ms):                          131.78    
Median TTFT (ms):                        124.31    
P99 TTFT (ms):                           404.36    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.93      
Median TPOT (ms):                        9.95      
P99 TPOT (ms):                           10.08     
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.93      
Median ITL (ms):                         9.79      
P99 ITL (ms):                            15.93     
==================================================

After

============ Serving Benchmark Result ============
Successful requests:                     4000      
Failed requests:                         0         
Maximum request concurrency:             200       
Benchmark duration (s):                  99.24     
Total input tokens:                      512000    
Total generated tokens:                  2048000   
Request throughput (req/s):              40.31     
Output token throughput (tok/s):         20636.52  
Peak output token throughput (tok/s):    22454.00  
Peak concurrent requests:                400.00    
Total token throughput (tok/s):          25795.66  
---------------Time to First Token----------------
Mean TTFT (ms):                          88.94     
Median TTFT (ms):                        74.67     
P99 TTFT (ms):                           379.48    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.50      
Median TPOT (ms):                        9.50      
P99 TPOT (ms):                           9.66      
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.50      
Median ITL (ms):                         9.38      
P99 ITL (ms):                            12.86     
==================================================
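Summarizing the deltas between the two runs (numbers copied from the tables above), the win is mostly in TTFT and tail ITL, with a modest throughput gain:

```python
# Compare the before/after benchmark runs above as percentage changes.
def pct_change(before: float, after: float) -> float:
    return (after - before) / before * 100.0


metrics = {
    "duration_s":   (104.41, 99.24),      # benchmark duration
    "output_tok_s": (19615.26, 20636.52), # output token throughput
    "mean_ttft_ms": (131.78, 88.94),      # mean time to first token
    "p99_itl_ms":   (15.93, 12.86),       # P99 inter-token latency
}
for name, (before, after) in metrics.items():
    print(f"{name}: {pct_change(before, after):+.1f}%")
```

Roughly a 5% throughput improvement, a 32% reduction in mean TTFT, and a 19% reduction in P99 ITL.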


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant optimization for running dense (non-MoE) models in a data-parallel configuration by removing unnecessary synchronization overhead. The core idea is to treat each data-parallel rank as an independent worker for dense models, effectively setting their data-parallel size to 1 at the worker level. This avoids redundant all-reduce operations and complex wave synchronization, which are only necessary for MoE models.

The DP coordinator's role is intelligently adapted: for dense models with internal load balancing, it continues to run for statistics propagation, but with wave coordination disabled. For external load balancing, it's disabled entirely for dense models.

The changes are well-structured, with clear separation of concerns. The introduction of data_parallel_index to preserve the original rank is a clean solution. The related configurations and tests, especially the new test_needs_dp_coordination, are thorough and correctly validate the new logic. Overall, this is a solid improvement that should enhance performance for a common use case.
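The coordination decision that the review describes (and that test_needs_dp_coordination validates) can be sketched as below. The function signature and condition names here are an assumption for illustration, not vLLM's actual code:

```python
# Rough illustration of when the DP coordinator is needed (hypothetical
# signature; vLLM's real implementation derives this from its configs):
def needs_dp_coordination(dp_size: int, is_moe: bool, internal_lb: bool) -> bool:
    if dp_size <= 1:
        return False      # no data parallelism, nothing to coordinate
    if is_moe:
        return True       # MoE DP needs step/wave synchronization
    return internal_lb    # dense: coordinator only propagates LB stats
```

Under this sketch, a dense model with external load balancing gets no coordinator at all, matching the review's description.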

@njhill njhill added the ready (ONLY add when PR is ready to merge/full CI is needed) label Dec 16, 2025
Signed-off-by: Nick Hill <nhill@redhat.com>
@youkaichao youkaichao merged commit bd87716 into vllm-project:main Jan 2, 2026
57 checks passed
@njhill njhill deleted the non-moe-dp branch January 2, 2026 16:18
@mgoin
Member

mgoin commented Jan 2, 2026

@njhill do you think we should consider automatically setting api_server_count as we scale DP? For instance, in your benchmarks you seemed to use half the DP size.

wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Jan 6, 2026
### What this PR does / why we need it?

Upgrade vllm commit to 0105 (8be6432bdaf6275664d857b1e5e9bf8ed1ce299e)

1. Remove `maybe_padded_num_tokens` arg in `model_runner_v1.py` since
vllm-project/vllm#31517 deleted the unused arg

2. Remove dense `Qwen/Qwen3-0.6B` in
`tests/e2e/multicard/test_aclgraph_capture_replay.py` and
`tests/e2e/multicard/test_data_parallel.py` due to
vllm-project/vllm#30739
where offline data parallel mode will not be supported/useful for dense
models

3. Adapt `vllm_ascend/worker/worker.py` due to
vllm-project/vllm#31584

4. Adapt `self.block_size` calling due to
vllm-project/vllm#31540

5. Modify `test_mla_v1.py` due to
vllm-project/vllm#28454, which refactored
`get_head_size()`

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@7157596

Signed-off-by: wjunLu <wjunlu217@gmail.com>
LucasWilkinson pushed a commit to neuralmagic/vllm that referenced this pull request Jan 6, 2026
…#30739)

Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: njhill <nickhill123@gmail.com>
Rozwel-dx pushed a commit to Rozwel-dx/vllm-ascend that referenced this pull request Jan 8, 2026
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
…#30739)

Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: njhill <nickhill123@gmail.com>
yma11 pushed a commit to yma11/vllm that referenced this pull request Jan 12, 2026
…llm-project#86)

* Make engine core client handshake timeout configurable  (vllm-project#27444)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

* [BugFix] Support online dense model DP without overhead (vllm-project#30739)

Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: njhill <nickhill123@gmail.com>

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: njhill <nickhill123@gmail.com>
Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
aipaes pushed a commit to aipaes/vllm-ascend that referenced this pull request Jan 15, 2026
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
…#30739)

Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: njhill <nickhill123@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…#30739)

Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: njhill <nickhill123@gmail.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
nrghosh added a commit to nrghosh/ray that referenced this pull request Jan 22, 2026
- Use a MoE model (Deepseek-V2-Lite) because
vllm-project/vllm#30739 changes how vLLM handles
DP ranks: it overrides dp_size=1 and dp_rank=0 for non-MoE models.

- fixes doc/source/llm/doc_code/serve/multi_gpu/dp_basic_example.py and
 doc/source/llm/doc_code/serve/multi_gpu/dp_pd_example.py

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
nrghosh added a commit to nrghosh/ray that referenced this pull request Jan 22, 2026
- Use a MoE model (Deepseek-V2-Lite) because
vllm-project/vllm#30739 changes how vLLM handles
DP ranks - overrides dp_size=1 and dp_rank=0 if non-MoE model

- Fixes doc/source/llm/doc_code/serve/multi_gpu/dp_basic_example.py and
 doc/source/llm/doc_code/serve/multi_gpu/dp_pd_example.py

- vLLM 0.14.0 commit bd877162e optimizes DP for dense models by making each rank independent, preserving DP coordination only for MoE models, where it's needed for expert parallelism

- Impact: Ray's DPServer DP coordination (rank assignment, stats addresses) was ignored for dense models like Qwen2.5-0.5B-Instruct, causing cascading assertion failures

- Fix: The tests now use an MoE model where vLLM's DP coordination is preserved. Outside of this test, dense model deployments should use Ray Serve replicas (num_replicas) instead of vLLM's data_parallel_size.

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
jeffreywang-anyscale pushed a commit to nrghosh/ray that referenced this pull request Jan 26, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…#30739)

Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: njhill <nickhill123@gmail.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels

kv-connector · ready (ONLY add when PR is ready to merge/full CI is needed) · v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Performance]: VLLM with DP performing worst
[BugFix]: Avoid unnecessary coordination for non-MoE data parallel

3 participants