
[Bugfix][Async] fix update_async_output_token_ids for async + spec #30122

Closed
izhuhaoran wants to merge 2 commits into vllm-project:main from izhuhaoran:fix-async-spec-output-tokens

Conversation

@izhuhaoran
Contributor

@izhuhaoran izhuhaoran commented Dec 5, 2025

Purpose

```python
req_output_token_ids = output_token_ids[index]
if not req_output_token_ids or req_output_token_ids[-1] != -1:
    # Final output id is not a placeholder, some tokens must have
    # been discarded after a kv-load failure.
    continue
if sampled_token_ids is None:
    assert self.async_copy_ready_event is not None
    self.async_copy_ready_event.synchronize()
    sampled_token_ids = self.sampled_token_ids_cpu.squeeze(-1).tolist()
# Replace placeholder token id with actual sampled id.
req_output_token_ids[-1] = sampled_token_ids[prev_index]
```

In update_async_output_token_ids, we incorrectly replace only the last placeholder token, req_output_token_ids[-1], with sampled_token_ids[prev_index].

However, in spec decode:

  • req_output_token_ids is a 1D list (from the 2D output_token_ids linked to req_state.output_token_ids).
  • sampled_token_ids[prev_index] is the full list of sampled tokens accepted for the request, not a single token.

The GPU runner extends output_token_ids with num_accepted placeholders (e.g., [-1, -1, ...]), so we must replace all placeholders with the entire list of sampled tokens from the prior step.

The current assignment req_output_token_ids[-1] = sampled_token_ids[prev_index] is therefore incorrect: it inserts a whole list into a single position, producing malformed output_token_ids (e.g., nested lists) and leaving the earlier placeholders unreplaced. This PR fixes that issue.
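The failure mode and the fix can be illustrated with a minimal, self-contained sketch. Plain Python lists stand in for the real request state, and `buggy_update`/`fixed_update` are hypothetical helper names, not actual vLLM functions:

```python
PLACEHOLDER = -1

def buggy_update(req_output_token_ids, accepted):
    # Old behavior: overwrite only the last element, even though `accepted`
    # is a list of tokens when spec decode accepted more than one.
    req_output_token_ids[-1] = accepted
    return req_output_token_ids

def fixed_update(req_output_token_ids, accepted):
    # Fixed behavior: replace all trailing placeholders with the full list
    # of accepted tokens via slice assignment.
    req_output_token_ids[-len(accepted):] = accepted
    return req_output_token_ids

# Two tokens accepted in the prior step, so two placeholders were appended.
print(buggy_update([10, PLACEHOLDER, PLACEHOLDER], [21, 22]))
# -> [10, -1, [21, 22]]  (nested list, one placeholder left behind)
print(fixed_update([10, PLACEHOLDER, PLACEHOLDER], [21, 22]))
# -> [10, 21, 22]
```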

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
@mergify mergify bot added the v1 label Dec 5, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request modifies the update_async_output_token_ids function in vllm/v1/worker/gpu_input_batch.py. The change involves removing the .squeeze(-1) operation when processing sampled_token_ids_cpu, and updating the assignment logic for req_output_token_ids. Instead of replacing a single last element, the code now replaces a slice of elements at the end of req_output_token_ids with a list of sampled token IDs, suggesting an adaptation to handle multiple sampled tokens per request.
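Why removing `.squeeze(-1)` goes hand in hand with the slice assignment can be seen from the tensor shapes, emulated below with nested lists rather than real torch tensors (the shapes follow the PR description; `squeeze_last` is a hypothetical stand-in for `tensor.squeeze(-1).tolist()`):

```python
# Without spec decode, one sampled token per request: shape [num_reqs, 1].
no_spec = [[5], [9]]
# With spec decode, each row holds all accepted tokens: shape [num_reqs, k].
with_spec = [[5, 6, 7], [9, 10, 11]]

def squeeze_last(rows):
    # Mimics tensor.squeeze(-1).tolist() for a [num_reqs, 1] tensor:
    # collapses each single-element row to a scalar.
    return [row[0] for row in rows]

print(squeeze_last(no_spec))  # [5, 9], one int per request
# With spec decode the squeeze must be dropped so that indexing by request
# yields the request's full list of accepted tokens:
print(with_spec[0])  # [5, 6, 7]
```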

@izhuhaoran izhuhaoran marked this pull request as ready for review December 5, 2025 09:00
@izhuhaoran
Contributor Author

@njhill could you please have a look at this PR?

@izhuhaoran izhuhaoran changed the title [Bugfix][Async] fix copy sampled_token_ids to output_token_ids for async + spec [Bugfix][Async] fix update_async_output_token_ids for async + spec Dec 5, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Member

@njhill njhill left a comment


Thanks @izhuhaoran for this!

I'm curious though, did you actually hit it in real usage?

AFAIK it shouldn't be possible currently, since this path is only taken when there are requests in the batch which require the output ids during sampling - specifically when there are penalties or "bad_words" specified, or possibly for custom logits processors.

But we don't yet support any of these for the async scheduling + spec decoding combo. We reject any requests with the corresponding parameters when such config is active and so it should never actually reach here.


The reason we don't yet support them however is just because we hadn't yet looked into what additional changes are needed for the combination. It's quite possible that your change here is all that's needed which would be awesome! And we could then remove the validation that blocks those kinds of requests.

(your change here is certainly necessary, we just need to check whether it's sufficient... I'll try to test that soon)

@njhill
Member

njhill commented Dec 5, 2025

> (your change here is certainly necessary, we just need to check whether it's sufficient... I'll try to test that soon)

Unfortunately from some cursory tests it seems it's not sufficient, but still worth getting this merged anyhow.

@izhuhaoran
Contributor Author

@njhill Thanks for your time and detailed explanation!

> specifically when there are penalties or "bad_words" specified

Yes, I hit this while testing async scheduling with penalties enabled, which is how I found the update_async_output_token_ids bug.

> Unfortunately from some cursory tests it seems it's not sufficient

I initially thought this was the only blocker, but agree further debugging is likely needed to fully support penalties/bad_words.

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
@izhuhaoran
Contributor Author

closed by #30495

@izhuhaoran izhuhaoran closed this Dec 11, 2025