
[BugFix] Fix async scheduling for pooling models#31584

Merged
vllm-bot merged 2 commits into vllm-project:main from njhill:fix-async-pooling
Dec 31, 2025

Conversation

@njhill njhill (Member) commented Dec 31, 2025

Fix race condition for pooling model with async scheduling.

Also:

  • Move the output CPU copy to a dedicated stream, which should hopefully unblock async performance gains
  • Some adjacent optimizations and code simplifications
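The race-condition fix can be illustrated with a minimal, stdlib-only sketch (this is not vLLM's actual code; `StepOutput` and the field names are hypothetical). With async scheduling, the scheduler may mutate shared per-step state while the previous step's output is still being packaged on another thread; snapshotting the mutable fields with `.copy()` before handing the output off removes the race:

```python
# Illustrative sketch of the race fixed here, assuming a scheduler that
# reuses a live list of request ids across steps. StepOutput is a
# hypothetical stand-in for vLLM's ModelRunnerOutput.
from dataclasses import dataclass


@dataclass
class StepOutput:
    req_ids: list[str]  # snapshot of the request ids for this step


def package_output(live_req_ids: list[str]) -> StepOutput:
    # .copy() detaches the output from the scheduler's live list, so a
    # later in-place mutation cannot corrupt an output already in flight.
    return StepOutput(req_ids=live_req_ids.copy())


live_req_ids = ["req-0", "req-1"]
out = package_output(live_req_ids)
live_req_ids.append("req-2")  # scheduler moves on to the next step
assert out.req_ids == ["req-0", "req-1"]  # snapshot is unaffected
```

Without the `.copy()`, `out.req_ids` would alias the live list and silently pick up `"req-2"` after the append.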

Fixes #31570

@mergify mergify bot added the v1 label Dec 31, 2025
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for asynchronous scheduling for pooling models by adding the AsyncGPUPoolingModelRunnerOutput class, which is a commendable improvement. This change aligns the pooling model execution with the existing asynchronous pattern for generation models, enhancing performance by overlapping GPU-to-CPU data transfers. The inclusion of .copy() when creating ModelRunnerOutput is a crucial fix for preventing race conditions in asynchronous mode. Overall, the implementation is solid. I have identified one potential high-severity issue where a None output from a pooler could lead to a worker crash and have provided a suggestion for a fix.
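The reviewer's high-severity concern about a `None` pooler output can be sketched as follows (stdlib-only; `collect_pooler_outputs` is a hypothetical helper, not vLLM's API). A pooler may yield `None` for a request, and forwarding that unchecked would crash the worker when the output is later indexed or copied to CPU; guarding at collection time keeps the worker alive:

```python
# Hypothetical guard against a None pooler output, assuming raw per-request
# outputs arrive as a list where an entry may be None.
def collect_pooler_outputs(raw_outputs: list) -> list:
    safe = []
    for out in raw_outputs:
        if out is None:
            # Substitute an empty result rather than crashing downstream
            # serialization or the GPU-to-CPU copy.
            safe.append([])
        else:
            safe.append(out)
    return safe


assert collect_pooler_outputs([[0.1, 0.2], None]) == [[0.1, 0.2], []]
```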

@njhill njhill added the bug Something isn't working label Dec 31, 2025
@njhill njhill marked this pull request as ready for review December 31, 2025 19:20
@njhill njhill marked this pull request as draft December 31, 2025 19:36
Signed-off-by: njhill <nickhill123@gmail.com>
@vllm-project vllm-project deleted a comment from mergify bot Dec 31, 2025
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 31, 2025
@njhill njhill marked this pull request as ready for review December 31, 2025 20:09
@njhill njhill requested a review from WoosukKwon as a code owner December 31, 2025 20:09
@vllm-bot vllm-bot merged commit 6c2cfb6 into vllm-project:main Dec 31, 2025
58 of 63 checks passed
@njhill njhill deleted the fix-async-pooling branch December 31, 2025 22:49
wjunLu added a commit to wjunLu/vllm-ascend that referenced this pull request Jan 4, 2026
Signed-off-by: wjunLu <wjunlu217@gmail.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Jan 6, 2026
### What this PR does / why we need it?

Upgrade vllm commit to 0105 (8be6432bdaf6275664d857b1e5e9bf8ed1ce299e)

1. Remove the `maybe_padded_num_tokens` arg in `model_runner_v1.py` since
vllm-project/vllm#31517 deleted the unused arg

2. Remove dense `Qwen/Qwen3-0.6B` in
`tests/e2e/multicard/test_aclgraph_capture_replay.py` and
`tests/e2e/multicard/test_data_parallel.py` due to
vllm-project/vllm#30739
where offline data parallel mode will not be supported/useful for dense
models

3. Adapt `vllm_ascend/worker/worker.py` due to
vllm-project/vllm#31584

4. Adapt the `self.block_size` access due to
vllm-project/vllm#31540

5. Modify `test_mla_v1.py` due to
vllm-project/vllm#28454, which refactored
`get_head_size()`

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@7157596

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Rozwel-dx pushed a commit to Rozwel-dx/vllm-ascend that referenced this pull request Jan 8, 2026
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
aipaes pushed a commit to aipaes/vllm-ascend that referenced this pull request Jan 15, 2026
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: njhill <nickhill123@gmail.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels

bug (Something isn't working), ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI Failure]: Pooling models (Classification model) + tp failure in Full CI run

3 participants