[Bugfix][Async] fix update_async_output_token_ids for async + spec #30122
izhuhaoran wants to merge 2 commits into vllm-project:main from
Conversation
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Code Review
The pull request modifies the update_async_output_token_ids function in vllm/v1/worker/gpu_input_batch.py. The change involves removing the .squeeze(-1) operation when processing sampled_token_ids_cpu, and updating the assignment logic for req_output_token_ids. Instead of replacing a single last element, the code now replaces a slice of elements at the end of req_output_token_ids with a list of sampled token IDs, suggesting an adaptation to handle multiple sampled tokens per request.
@njhill could you please have a look at this PR?
Thanks @izhuhaoran for this!
I'm curious though, did you actually hit it in real usage?
AFAIK it shouldn't be possible currently, since this path is only taken when there are requests in the batch which require the output ids during sampling - specifically when there are penalties or "bad_words" specified, or possibly for custom logits processors.
But we don't yet support any of these for the async scheduling + spec decoding combo. We reject any requests with the corresponding parameters when such config is active and so it should never actually reach here.
The reason we don't yet support them however is just because we hadn't yet looked into what additional changes are needed for the combination. It's quite possible that your change here is all that's needed which would be awesome! And we could then remove the validation that blocks those kinds of requests.
(your change here is certainly necessary, we just need to check whether it's sufficient... I'll try to test that soon)
Unfortunately, from some cursory tests it seems it's not sufficient, but still worth getting this merged anyhow.
@njhill Thanks for your time and detailed explanation!
Yes, I hit this while testing async scheduling with spec decode.
I initially thought this was the only blocker, but I agree further debugging is likely needed to fully support penalties/bad_words.
Closed by #30495
Purpose
vllm/vllm/v1/worker/gpu_input_batch.py
Lines 944 to 954 in 6038b1b
In `update_async_output_token_ids`, we incorrectly replace only the last placeholder token in `req_output_token_ids[-1]` with a single sampled token (`sampled_token_ids[prev_index]`). However, in spec decode:

- `req_output_token_ids` is a 1D list (from the 2D `output_token_ids` linked to `req_state.output_token_ids`).
- `sampled_token_ids[prev_index]` is the full list of sampled tokens accepted for the request, not a single token.

The GPU runner extends `output_token_ids` with `num_accepted` placeholders (e.g., `[-1, -1, ...]`), so we must replace all of the placeholders with the entire list of sampled tokens from the prior step.

The current assignment `req_output_token_ids[-1] = sampled_token_ids[prev_index]` is incorrect: it inserts a list into a single position, producing a malformed `output_token_ids` (e.g., nested lists) and leaving some placeholders unreplaced. This PR fixes that issue.
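The bug and fix described above can be sketched with plain Python lists. This is a minimal illustration, not the actual vLLM code; the variable names mirror the PR discussion, and the concrete token values are made up for the example:

```python
# Sentinel the GPU runner uses to reserve slots for not-yet-copied tokens
# (illustrative; the real placeholder value may differ).
PLACEHOLDER = -1

# Spec decode accepted 3 tokens last step, so 3 placeholders were appended.
req_output_token_ids = [11, 22, PLACEHOLDER, PLACEHOLDER, PLACEHOLDER]
sampled_token_ids = [[33, 44, 55]]  # per-request lists of accepted tokens
prev_index = 0

# Buggy version: assigns the whole list into a single position, nesting a
# list inside the output and leaving earlier placeholders untouched.
buggy = list(req_output_token_ids)
buggy[-1] = sampled_token_ids[prev_index]
# buggy == [11, 22, -1, -1, [33, 44, 55]]

# Fixed version: slice-assign over the full run of placeholders so every
# accepted token lands in its own slot.
sampled = sampled_token_ids[prev_index]
fixed = list(req_output_token_ids)
fixed[-len(sampled):] = sampled
# fixed == [11, 22, 33, 44, 55]
```

Slice assignment replaces the trailing `len(sampled)` elements in place, which is exactly the "replace all placeholders with the entire list" behavior the PR needs; in the non-spec case `sampled` has length 1 and this degenerates to replacing the single last element.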