[Bugfix][Async] fix update_async_output_token_ids for async + spec #30122
izhuhaoran wants to merge 2 commits into vllm-project:main from
Conversation
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Code Review
The pull request modifies the update_async_output_token_ids function in vllm/v1/worker/gpu_input_batch.py. The change involves removing the .squeeze(-1) operation when processing sampled_token_ids_cpu, and updating the assignment logic for req_output_token_ids. Instead of replacing a single last element, the code now replaces a slice of elements at the end of req_output_token_ids with a list of sampled token IDs, suggesting an adaptation to handle multiple sampled tokens per request.
@njhill could you please have a look at this PR?
Thanks @izhuhaoran for this!
I'm curious though, did you actually hit it in real usage?
AFAIK it shouldn't be possible currently, since this path is only taken when there are requests in the batch which require the output ids during sampling - specifically when there are penalties or "bad_words" specified, or possibly for custom logits processors.
But we don't yet support any of these for the async scheduling + spec decoding combo. We reject any requests with the corresponding parameters when such config is active and so it should never actually reach here.
The reason we don't yet support them however is just because we hadn't yet looked into what additional changes are needed for the combination. It's quite possible that your change here is all that's needed which would be awesome! And we could then remove the validation that blocks those kinds of requests.
(your change here is certainly necessary, we just need to check whether it's sufficient... I'll try to test that soon)
Unfortunately, from some cursory tests it seems it's not sufficient, but still worth getting this merged anyhow.
@njhill Thanks for your time and detailed explanation!
Yes, I hit this while testing async scheduling with spec decode.
I initially thought this was the only blocker, but I agree further debugging is likely needed to fully support penalties/bad_words.
Closed by #30495
Purpose
vllm/vllm/v1/worker/gpu_input_batch.py
Lines 944 to 954 in 6038b1b
In `update_async_output_token_ids`, we incorrectly replace only the last placeholder token in `req_output_token_ids[-1]` with a single sampled token (`sampled_token_ids[prev_index]`). However, in spec decode:

- `req_output_token_ids` is a 1D list (from the 2D `output_token_ids` linked to `req_state.output_token_ids`).
- `sampled_token_ids[prev_index]` is the full list of sampled tokens accepted for the request, not a single token.

The GPU runner extends `output_token_ids` with `num_accepted` placeholders (e.g., `[-1, -1, ...]`), so we must replace all of the placeholders with the entire list of sampled tokens from the prior step.

The current assignment `req_output_token_ids[-1] = sampled_token_ids[prev_index]` is incorrect: it inserts a list into a single position, producing a malformed `output_token_ids` (e.g., nested lists) and leaving some placeholders unreplaced. This PR fixes that issue.
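The bug and fix described above can be sketched with plain Python lists. This is a minimal illustration, not the actual vLLM code; the variable names mirror the PR discussion, and the concrete token values are made up for the example:

```python
# Sentinel the GPU runner uses to reserve slots for not-yet-copied tokens
# (illustrative; the real placeholder value may differ).
PLACEHOLDER = -1

# Spec decode accepted 3 tokens last step, so 3 placeholders were appended.
req_output_token_ids = [11, 22, PLACEHOLDER, PLACEHOLDER, PLACEHOLDER]
sampled_token_ids = [[33, 44, 55]]  # per-request lists of accepted tokens
prev_index = 0

# Buggy version: assigns the whole list into a single position, nesting a
# list inside the output and leaving earlier placeholders untouched.
buggy = list(req_output_token_ids)
buggy[-1] = sampled_token_ids[prev_index]
# buggy == [11, 22, -1, -1, [33, 44, 55]]

# Fixed version: slice-assign over the full run of placeholders so every
# accepted token lands in its own slot.
sampled = sampled_token_ids[prev_index]
fixed = list(req_output_token_ids)
fixed[-len(sampled):] = sampled
# fixed == [11, 22, 33, 44, 55]
```

Slice assignment replaces the trailing `len(sampled)` elements in place, which is exactly the "replace all placeholders with the entire list" behavior the PR needs; in the non-spec case `sampled` has length 1 and this degenerates to replacing the single last element.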