[CI failure] Revert "[BugFix] Fix the issue of thinker requests being preempted, causing shape mismatch." #3648
Gaohan123 wants to merge 1 commit
…ausing s…" This reverts commit e7ee5de.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 145dc177b4
    task = {
        "pooling_output": pooling_output,
        "request": request,
Restore preemption guard before enqueuing async chunks
save_async now unconditionally enqueues every chunk task, so when async-chunk requests are preempted and request.num_computed_tokens rolls back, stale chunks can be sent again while put_req_chunk continues forward. Those duplicate/stale payloads are then merged on the receiver side via tensor/list concatenation, which can desynchronize thinker/talker sequence shapes and reintroduce the preemption shape-mismatch failure under normal preempt-and-resume scheduling.
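A minimal sketch of the kind of guard this comment asks for, written under stated assumptions: `FakeRequest` and `GuardedChunkQueue` are hypothetical names made up for illustration, not the project's classes, and the real `save_async` likely carries more state. The idea is only to remember how far each request has already been sent and to skip chunks that replay that progress after a preemption rollback.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    # Hypothetical stand-in for the scheduler's request object.
    request_id: str
    num_computed_tokens: int = 0


@dataclass
class GuardedChunkQueue:
    # Highest num_computed_tokens already enqueued, keyed by request id.
    sent_upto: dict = field(default_factory=dict)
    tasks: deque = field(default_factory=deque)

    def save_async(self, pooling_output, request) -> bool:
        """Enqueue a chunk unless it replays progress that was already sent."""
        last = self.sent_upto.get(request.request_id, 0)
        if request.num_computed_tokens <= last:
            # The request was preempted and is recomputing old tokens:
            # drop the stale chunk instead of sending it a second time.
            return False
        self.sent_upto[request.request_id] = request.num_computed_tokens
        self.tasks.append({"pooling_output": pooling_output, "request": request})
        return True
```

With a guard of this shape, a request that rolls back from 8 to 4 computed tokens produces no new tasks until it advances past 8 again, so the receiver never concatenates the same chunk twice.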
    if start_index >= len(thinker_output_token_ids) - 1:
        # When the tokens output by the thinker are exhausted, an EOS token needs to be appended.
Avoid emitting EOS before the final decode embedding
The new exhaustion check uses start_index >= len(thinker_output_token_ids) - 1, which marks decode as exhausted one step early. Because decode handoff starts with start_index=1, chunks with only 1–2 thinker output tokens (a common case for early or short responses) immediately emit EOS/pad and skip projecting the actual decode embeddings, leading to truncated or empty spoken output.
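The off-by-one can be seen by evaluating the quoted condition next to a stricter one for small token counts. This is an illustrative sketch only: `exhausted_strict` is an assumed alternative, not necessarily the condition the project used before or should use, and both function names are hypothetical.

```python
def exhausted_early(start_index: int, num_output_tokens: int) -> bool:
    # Condition as quoted in the review comment.
    return start_index >= num_output_tokens - 1


def exhausted_strict(start_index: int, num_output_tokens: int) -> bool:
    # Assumed alternative: exhausted only once start_index is past the last valid index.
    return start_index >= num_output_tokens


def compare(start_index: int = 1, max_tokens: int = 4):
    # One row per token count: (count, quoted check, assumed stricter check).
    return [
        (n, exhausted_early(start_index, n), exhausted_strict(start_index, n))
        for n in range(1, max_tokens + 1)
    ]


for row in compare():
    print(row)
```

With start_index=1 the two checks disagree only for two output tokens: the quoted check reports exhaustion although index 1 is still a valid, unread position. Whether the stricter form is the right fix depends on how the handoff consumes tokens, which the review does not show.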
Please wait.
check #3650
#3650 merged
Reverts #3147 due to CI failure: https://buildkite.com/vllm/vllm-omni/builds/9719/canvas?sid=019e2aed-c3cc-45e8-8424-71d4506adbd4&tab=output