[Bugfix] missing tokens occur in harmony streaming #30437
chaunceyjiang merged 10 commits into vllm-project:main from
Conversation
Signed-off-by: RioS <aa248424@gmail.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. You can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request addresses a bug in harmony streaming where only the last token's delta was considered when multiple tokens were yielded. The changes correctly accumulate deltas from all tokens. My review includes one critical comment to prevent a potential IndexError when a RequestOutput with no outputs is processed, which would otherwise crash the stream.
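The bug pattern described above can be illustrated with a minimal sketch. The `Parser` class here is a hypothetical stand-in for the Harmony parser, not vLLM's actual class: each processed token yields a text delta, and keeping only the last token's delta drops text when several tokens arrive in one step.

```python
class Parser:
    """Stand-in for the Harmony parser: each token yields a text delta."""
    def __init__(self):
        self.last_content_delta = None

    def process(self, tok):
        # Pretend to decode: each token produces a small text fragment.
        self.last_content_delta = f"<{tok}>"

def buggy_step(parser, token_ids):
    for tok in token_ids:
        parser.process(tok)
    # Only the LAST token's delta survives the loop.
    return parser.last_content_delta or ''

def fixed_step(parser, token_ids):
    delta_text = ''
    for tok in token_ids:
        parser.process(tok)
        # Accumulate every token's delta, as this PR does.
        delta_text += parser.last_content_delta or ''
    return delta_text

tokens = [1, 2, 3]
print(buggy_step(Parser(), tokens))  # '<3>'  -- '<1>' and '<2>' are lost
print(fixed_step(Parser(), tokens))  # '<1><2><3>'
```

With a single token per step the two versions behave identically, which is why the bug only surfaces when the engine yields multiple tokens at once.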
vllm/entrypoints/context.py
Outdated
```python
last_delta_text = ''
for tok in output.outputs[0].token_ids:
    self.parser.process(tok)
    last_delta_text += self.parser.last_content_delta or ''
if last_delta_text:
    self.last_delta = last_delta_text
```
The code directly accesses output.outputs[0] without checking if output.outputs is empty. This could lead to an IndexError if a RequestOutput is processed that contains no outputs, causing a crash. Other parts of the codebase, like _update_decode_token_usage, check for this possibility, indicating it's a valid scenario. It's safer to handle this case gracefully by providing a default empty list for token_ids when output.outputs is empty.
Current:

```python
last_delta_text = ''
for tok in output.outputs[0].token_ids:
    self.parser.process(tok)
    last_delta_text += self.parser.last_content_delta or ''
if last_delta_text:
    self.last_delta = last_delta_text
```

Suggested change:

```python
last_delta_text = ''
token_ids = output.outputs[0].token_ids if output.outputs else []
for tok in token_ids:
    self.parser.process(tok)
    last_delta_text += self.parser.last_content_delta or ''
if last_delta_text:
    self.last_delta = last_delta_text
```
Hi @Ri0S, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically before each commit.
Signed-off-by: Ri0S <aa248424@gmail.com>
@chaunceyjiang Could you please confirm? The issue still occurs in the latest version, v0.13.0, when FastAPI’s throughput fails to match the token generation rate of the engine.
I remember this issue has already been fixed in the latest version. |
@chaunceyjiang The fundamental issue hasn't been fixed in the code: while it occurs less frequently than in the previous version, it still persists. The chat completion path also contains code that accumulates last_content_delta.

vllm/vllm/entrypoints/openai/serving_chat.py, line 809 in b9793e6
chaunceyjiang
left a comment
Thanks~ @Ri0S
Nit
Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: RioS <aa248424@gmail.com>
Signed-off-by: Ri0S <aa248424@gmail.com>
Thanks for putting this together @Ri0S. This fix would really help us. Is there anything still blocking the merge?
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Ri0S <aa248424@gmail.com>
…aming

# Conflicts:
#	vllm/entrypoints/openai/serving_responses.py

Signed-off-by: Ri0S <aa248424@gmail.com>
Bugfix/responses streaming
Head branch was pushed to by a user without write access
@Ri0S Thanks~
Signed-off-by: RioS <aa248424@gmail.com> Signed-off-by: Ri0S <aa248424@gmail.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Signed-off-by: RioS <aa248424@gmail.com> Signed-off-by: Ri0S <aa248424@gmail.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
Fixed an issue where, in harmony streaming mode, when the engine yields more than one token at a time, only the last token's delta is used.
FIX #28635 #30099
Test Plan
Test Result
no missing tokens
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.

Note
Addresses missing tokens when the engine outputs multiple tokens per step in Harmony streaming.
- Adds `last_content_delta` to `StreamingHarmonyContext`; accumulates text across all `token_ids` per step in `append_output` and resets at message start.
- Updates `serving_responses.py` to use `ctx.last_content_delta` (and guard on it) for final, analysis, MCP/code-interpreter, MCP prefix, and function-call argument deltas.

Written by Cursor Bugbot for commit f8d2831. This will update automatically on new commits. Configure here.
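The accumulate-per-step, reset-at-message-start behavior summarized in this note can be sketched as follows. `StreamingContext` and `Parser` are hypothetical stand-ins, not the actual `StreamingHarmonyContext` or Harmony parser, assuming a parser that exposes one text delta per processed token.

```python
class Parser:
    """Stand-in parser: each processed token yields a text delta."""
    def __init__(self):
        self.last_content_delta = None

    def process(self, tok):
        self.last_content_delta = str(tok)

class StreamingContext:
    """Stand-in context accumulating deltas across all tokens per step."""
    def __init__(self):
        self.parser = Parser()
        self.last_content_delta = ''

    def start_message(self):
        # Reset accumulated text when a new message begins.
        self.last_content_delta = ''

    def append_output(self, token_ids):
        # Accumulate the delta of every token in this step, not just the last.
        delta = ''
        for tok in token_ids:
            self.parser.process(tok)
            delta += self.parser.last_content_delta or ''
        self.last_content_delta = delta

ctx = StreamingContext()
ctx.append_output([1, 2])
print(ctx.last_content_delta)  # '12'
ctx.start_message()
print(ctx.last_content_delta)  # ''
```

Consumers in the serving layer would then read `ctx.last_content_delta` once per step and emit it as the stream chunk, guarding on it being non-empty.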