[Bugfix] Fix token loss in PP mode which causes degraded accuracy#41133
Conversation
Signed-off-by: Jing Wang <jingwang96@qq.com>
|
@yewentao256 Hi, Could you please review this PR? It's an issue related to PP mode. |
There was a problem hiding this comment.
Code Review
This pull request updates the _update_states method in gpu_model_runner.py to adjust how token_ids_cpu is updated for non-last ranks. A logic error was identified where the condition num_new_tokens > 0 likely fails because req_state.num_tokens is incremented before the check, and a simpler check for the presence of new_token_ids was suggested to ensure the update occurs correctly.
There was a problem hiding this comment.
Pull request overview
Fixes a pipeline-parallel (PP) mode state bookkeeping bug that can drop prompt tokens during InputBatch.condense(), degrading accuracy under high concurrency.
Changes:
- Adjust
_update_stateshandling oftoken_ids_cpu/num_tokens_no_specupdates for non-last PP ranks to avoid losing tokens during batch condensation.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Jing Wang <jingwang96@qq.com>
Signed-off-by: Jing Wang <jingwang96@qq.com>
yewentao256
left a comment
There was a problem hiding this comment.
LGTM, thanks for the work!
|
@yewentao256 Thanks for your review and approval! The failed CI checks seem unrelated to this PR. |
yewentao256
left a comment
There was a problem hiding this comment.
Let's retry before we force merge
hsliuustc0106
left a comment
There was a problem hiding this comment.
Thanks for the fix! The core logic of using num_tokens_no_spec as the write offset is correct and well-reasoned. However, I have a few concerns before this is ready to merge:
1. is_token_ids not updated (Medium)
Copilot raised this and the response was "the original code doesn't update it either." But the original code has a bug, so that's not a valid justification. When enable_prompt_embeds is used, the newly written token positions will remain False in is_token_ids, causing _prepare_inputs to treat output tokens as embed-provided tokens.
Could you confirm whether the PP non-last-rank path can ever reach the enable_prompt_embeds branch? If not, please add a comment explaining why. If yes, this should mirror _bookkeeping_sync (L3456) and set:
self.input_batch.is_token_ids[
req_index, start_token_index:end_token_index
] = True2. else branch for num_tokens_no_spec update (Low-Medium)
else:
self.input_batch.num_tokens_no_spec[req_index] = max(
self.input_batch.num_tokens_no_spec[req_index],
num_computed_tokens,
)This handles the case where new_token_ids is empty or num_new_tokens == 0. The logic is correct — in chunked prefill for non-last chunks, num_computed_tokens advances but there are no new sampled tokens. The max() is safe for decode cases too. But please add a brief comment explaining the rationale.
3. Missing regression test (Medium)
The PR shows gsm8k accuracy validation, which is great. However, this bug only manifests under high concurrency + PP, so it would be valuable to add a unit test that verifies token_ids_cpu and num_tokens_no_spec consistency between PP ranks after _update_states, e.g. PP=2, non-async-scheduling, multiple concurrent requests.
Head branch was pushed to by a user without write access
Signed-off-by: Jing Wang <jingwang96@qq.com>
|
@hsliuustc0106 Thanks for your suggestions. |
yewentao256
left a comment
There was a problem hiding this comment.
Thanks @hsliuustc0106 ! Could you take another look if the fix looks good
Also, @starkwj could you re-test the lm_eval result to make sure we don't hurt the acc?
Hi, I have re-tested the |
yewentao256
left a comment
There was a problem hiding this comment.
From my understanding, the comment by @hsliuustc0106 has been resolved, please take another look before we merge
lgtm |
…lm-project#41133) Signed-off-by: Jing Wang <jingwang96@qq.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Signed-off-by: Libin Tang <libin.tang@intel.com>
Under spec-decode + non-async PP, the scheduler can advance
num_computed_tokens by 1 (rejection-bookkeeping for a verified token
that lives only on the last PP rank) while sending an empty
new_token_ids[i] slice to non-last ranks. The previous code crashed
with
IndexError: list index out of range
File "gpu_model_runner.py", line 1305, in _update_states
req_state.output_token_ids.append(new_token_ids[-1])
on the very first chat completion of an MTP+PP=2 deployment. Upstream
PR vllm-project#41133 fixed the parallel `token_ids_cpu` write path but didn't
extend the same tolerance to `output_token_ids` here.
The fix bounds the number of tokens we copy by the actual list
length, which also makes the `==1` and `>1` branches handle empty
input consistently — the elif branch already silently took the
empty slice, only the `==1` branch crashed.
Discovered while shipping MTP+PP on the wi-adam/vllm rebase fork
(2026-05-11). With this guard, the IndexError on first inference is
eliminated; bookkeeping on non-last PP ranks remains correct because
the verified token is only consumed by the last-rank sampler.
Signed-off-by: Adam Winstanley <adam@winstanley.industries>
…lm-project#41133) Signed-off-by: Jing Wang <jingwang96@qq.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
…lm-project#41133) Signed-off-by: Jing Wang <jingwang96@qq.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
…lm-project#41133) Signed-off-by: Jing Wang <jingwang96@qq.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
…lm-project#41133) Signed-off-by: Jing Wang <jingwang96@qq.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Purpose
Issue: The current PP mode causes degraded accuracy with concurrent requests.
gsm8k results with Qwen3-8B, num_concurrent 256:
Maybe related issue
Root cause
vllm/vllm/v1/worker/gpu_model_runner.py
Lines 1338 to 1347 in de3da0b
In gpu_model_runner.py, the
num_tokens_no_specshould indicate the number of current tokens without spec.However, in
_update_stateswith PP mode, the above code incorrectly updates it tonum_computed_tokens.For chunked prefill, the
num_computed_tokensis the number of current computed chunked prompts, which is less than the number of total prompts (i.e., the realnum_tokens_no_spec).And in chunked prefill, the index to update
token_ids_cpuis incorrect too, as it is less thannum_prompts. What this code does in chunked prefill:new_token_idsis empty, sotoken_ids_cpunot touched.new_token_idsis the chunk needs to be computed, andtoken_ids_cpuis covered with the same prompt chunk, so it's also meaningless.This is correct in
_bookkeeping_sync, where it works in non-PP mode and for the last rank in PP mode, as shown below. And actually, in chunked prefill, the below operations are skipped due to no valid sampled tokens (filtered bydiscard_request_mask).code in _bookkeeping_sync
vllm/vllm/v1/worker/gpu_model_runner.py
Lines 3453 to 3463 in de3da0b
Then, in
_update_states,condense()is called, which reorders requests in the input batch, moves the valid requests at the tail to the empty slots at the front.vllm/vllm/v1/worker/gpu_input_batch.py
Lines 711 to 724 in de3da0b
As shown above, tokens are copied to the empty slot.
However, the
num_tokensis calculated withnum_tokens_no_spec.Thus, with the incorrect update in
_update_states, it only copiesnum_computed_tokenstokens rather than all prompt tokens, and the remaining prompt tokens are lost (replaced by invalid tokens in the empty slot).The code is also incorrect for no-async-scheduling. Why its score of gsm8k seems normal?
In no-async-scheduling, for each deocde step, the request is removed from input_batch due to
unscheduledand re-add_request to input_batch beforecondense().The
add_requestwill add the request to empty slot first, leaving less empty slots at the front, thus greatly reduce the occurrences of swaps incondense().However, it still has probability for this bug, especially when prompts are long and
max-num-batched-tokensare relatively small.This PR fixes the incorrect update in
_update_states, adopts code logic similar to that in_bookkeeping_sync.Test Plan
lm_eval with gsm8k
lm_eval --model local-completions --model_args "model=qwen,base_url=http://localhost:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=256,max_length=40960" --tasks gsm8k --num_fewshot 5Also benched performance, no changes were observed.
Test Result
before:
after:
and does not affect no-async-ascheduling:
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.