
[Bugfix] missing tokens occur in harmony streaming#30437

Merged
chaunceyjiang merged 10 commits into vllm-project:main from Ri0S:main
Jan 9, 2026

Conversation

Contributor

@Ri0S Ri0S commented Dec 10, 2025

Purpose

Fixed an issue in Harmony streaming mode where, when the engine yields more than one token at a time, only the last token's delta is emitted and the earlier tokens are dropped.

FIX #28635 #30099

Test Plan

uv run api_server.py --model openai/gpt-oss-120b --gpu-memory-utilization 0.95 --port 8000 --served-model-name gptoss120b --disable-log-request --tool-call-parser openai --enable-auto-tool-choice
from openai import AsyncOpenAI
import asyncio
import json

client = AsyncOpenAI(base_url='http://127.0.0.1:8000/v1', api_key='empty')

async def run(semaphore, i):
    async with semaphore:
        for count in range(100):
            print(f'{i}: {count}')
            events = []
            stream = await client.responses.create(model='gptoss120b', input='say something long.', stream=True)
            async for event in stream:
                events.append(event)
            # Reassemble the streamed text from the incremental deltas...
            streamed = ''.join(e.delta for e in events if e.type == 'response.output_text.delta')
            # ...and compare it against the full output text of the final response.
            final = events[-1].response.output_text
            if streamed != final:
                print(f'{i} {count} streaming_output: {json.dumps(streamed, ensure_ascii=False)}')
                print(f'{i} {count} last_output: {json.dumps(final, ensure_ascii=False)}')

async def main():
    semaphore = asyncio.Semaphore(5)
    tasks = [run(semaphore=semaphore, i=i) for i in range(5)]
    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(main())

Test Result

no missing tokens


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Note

Addresses missing tokens when the engine outputs multiple tokens per step in Harmony streaming.

  • Add last_content_delta to StreamingHarmonyContext; accumulate text across all token_ids per step in append_output and reset at message start
  • Switch all streaming delta emitters in serving_responses.py to use ctx.last_content_delta (and guard on it) for final, analysis, MCP/code-interpreter, MCP prefix, and function-call argument deltas
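The accumulation described in the bullets above can be sketched in isolation. This is a minimal toy, not vLLM's actual implementation: `FakeParser` stands in for the Harmony parser, and the two functions contrast the old behavior (only the last token's delta survives) with the fixed behavior (deltas accumulate across all token_ids in one engine step):

```python
class FakeParser:
    """Toy stand-in for the Harmony parser: each processed token
    yields a one-piece content delta."""
    def __init__(self):
        self.last_content_delta = None

    def process(self, token_id):
        self.last_content_delta = f"<{token_id}>"

def delta_buggy(parser, token_ids):
    # Old behavior: process every token, but read the delta only once
    # at the end, so only the last token's text survives.
    for tok in token_ids:
        parser.process(tok)
    return parser.last_content_delta or ""

def delta_fixed(parser, token_ids):
    # Fixed behavior: accumulate the delta after every token.
    text = ""
    for tok in token_ids:
        parser.process(tok)
        text += parser.last_content_delta or ""
    return text

print(delta_buggy(FakeParser(), [1, 2, 3]))  # <3>  -- tokens 1 and 2 lost
print(delta_fixed(FakeParser(), [1, 2, 3]))  # <1><2><3>
```

When the engine yields exactly one token per step, the two functions agree, which is why the bug only surfaces when multiple tokens arrive in a single step.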

Written by Cursor Bugbot for commit f8d2831. This will update automatically on new commits.

Ri0S added 2 commits December 11, 2025 08:21
Signed-off-by: RioS <aa248424@gmail.com>
Signed-off-by: RioS <aa248424@gmail.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug in harmony streaming where only the last token's delta was considered when multiple tokens were yielded. The changes correctly accumulate deltas from all tokens. My review includes one critical comment to prevent a potential IndexError that could occur if a RequestOutput with no outputs is processed, which could lead to a crash.

Comment on lines +787 to +792
last_delta_text = ''
for tok in output.outputs[0].token_ids:
    self.parser.process(tok)
    last_delta_text += self.parser.last_content_delta or ''
if last_delta_text:
    self.last_delta = last_delta_text


critical

The code directly accesses output.outputs[0] without checking if output.outputs is empty. This could lead to an IndexError if a RequestOutput is processed that contains no outputs, causing a crash. Other parts of the codebase, like _update_decode_token_usage, check for this possibility, indicating it's a valid scenario. It's safer to handle this case gracefully by providing a default empty list for token_ids when output.outputs is empty.

Suggested change

Before:

    last_delta_text = ''
    for tok in output.outputs[0].token_ids:
        self.parser.process(tok)
        last_delta_text += self.parser.last_content_delta or ''
    if last_delta_text:
        self.last_delta = last_delta_text

After:

    last_delta_text = ''
    token_ids = output.outputs[0].token_ids if output.outputs else []
    for tok in token_ids:
        self.parser.process(tok)
        last_delta_text += self.parser.last_content_delta or ''
    if last_delta_text:
        self.last_delta = last_delta_text
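The suggested guard can be demonstrated on its own. The dataclasses below are simplified stand-ins whose field names mirror the snippet, not vLLM's full `RequestOutput` definition:

```python
from dataclasses import dataclass, field

@dataclass
class CompletionOutput:
    """Simplified stand-in: carries only the token ids."""
    token_ids: list

@dataclass
class RequestOutput:
    """Simplified stand-in: may carry zero or more completion outputs."""
    outputs: list = field(default_factory=list)

def safe_token_ids(output):
    # Guard against an empty outputs list before indexing [0],
    # avoiding the IndexError flagged in the review comment.
    return output.outputs[0].token_ids if output.outputs else []

print(safe_token_ids(RequestOutput()))                        # []
print(safe_token_ids(RequestOutput([CompletionOutput([5, 6])])))  # [5, 6]
```

With the guard in place, an output-less RequestOutput simply yields an empty token list and the delta loop becomes a no-op instead of crashing.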

@mergify

mergify bot commented Dec 10, 2025

Hi @Ri0S, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Ri0S <aa248424@gmail.com>
@chaunceyjiang chaunceyjiang self-assigned this Dec 12, 2025
@Ri0S
Contributor Author

Ri0S commented Dec 28, 2025

@chaunceyjiang Could you please confirm? The issue still occurs in the latest version, v0.13.0, if FastAPI's throughput fails to match the engine's token generation rate.

@chaunceyjiang
Collaborator

I remember this issue has already been fixed in the latest version.

@Ri0S
Contributor Author

Ri0S commented Dec 29, 2025

@chaunceyjiang The fundamental issue hasn't been fixed in the code, so while it occurs less frequently than in the previous version, it still persists.
In the current code, Responses API streaming only uses the last_content_delta from the Harmony parser. However, if two or more tokens have already been generated by the time the engine yields them, the earlier tokens are lost because they don't accumulate in last_content_delta.

The chat_completion path already contains code that accumulates last_content_delta:

delta_text += harmony_parser.last_content_delta or ""
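The load dependence described above can be illustrated with a toy producer/consumer: when the frontend falls behind, several tokens pile up before the next drain, so one "step" delivers multiple token_ids at once. This sketch is not vLLM's engine loop; the queue and timings are illustrative only:

```python
import asyncio

async def producer(queue):
    # The "engine" generates tokens quickly, one put per token.
    for tok in range(6):
        await queue.put(tok)
        await asyncio.sleep(0)

async def slow_consumer(queue):
    # The "frontend" falls behind, then drains everything available
    # in one go: one step, many tokens.
    steps = []
    await asyncio.sleep(0.01)
    while not queue.empty():
        batch = []
        while not queue.empty():
            batch.append(queue.get_nowait())
        steps.append(batch)
    return steps

async def main():
    q = asyncio.Queue()
    prod = asyncio.create_task(producer(q))
    steps = await slow_consumer(q)
    await prod
    return steps

print(asyncio.run(main()))  # [[0, 1, 2, 3, 4, 5]] -- six tokens in one step
```

A consumer that keeps only the last item of each batch would drop tokens 0 through 4 here, which is exactly the failure mode of reading last_content_delta once per step instead of accumulating it per token.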

Collaborator

@chaunceyjiang chaunceyjiang left a comment


Thanks~ @Ri0S

Nit

@github-project-automation github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Jan 4, 2026
Ri0S and others added 2 commits January 6, 2026 10:07
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Signed-off-by: RioS <aa248424@gmail.com>
Signed-off-by: Ri0S <aa248424@gmail.com>
Collaborator

@chaunceyjiang chaunceyjiang left a comment


LGTM

@chaunceyjiang chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 6, 2026
@chaunceyjiang chaunceyjiang enabled auto-merge (squash) January 6, 2026 02:06
@danladis

danladis commented Jan 8, 2026

Thanks for putting this together @Ri0S. This fix would really help us. Is there anything still blocking the merge?

@mergify

mergify bot commented Jan 9, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ri0S.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 9, 2026
Signed-off-by: Ri0S <aa248424@gmail.com>
Ri0S added 2 commits January 9, 2026 11:03
…aming

# Conflicts:
#	vllm/entrypoints/openai/serving_responses.py

Signed-off-by: Ri0S <aa248424@gmail.com>
auto-merge was automatically disabled January 9, 2026 02:08

Head branch was pushed to by a user without write access

@mergify mergify bot removed the needs-rebase label Jan 9, 2026
@chaunceyjiang chaunceyjiang enabled auto-merge (squash) January 9, 2026 02:10
@chaunceyjiang
Collaborator

@Ri0S Thanks~

@chaunceyjiang chaunceyjiang merged commit e2d49ec into vllm-project:main Jan 9, 2026
51 of 52 checks passed
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
Signed-off-by: RioS <aa248424@gmail.com>
Signed-off-by: Ri0S <aa248424@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: RioS <aa248424@gmail.com>
Signed-off-by: Ri0S <aa248424@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: RioS <aa248424@gmail.com>
Signed-off-by: Ri0S <aa248424@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>

Labels

  • frontend
  • gpt-oss: Related to GPT-OSS models
  • ready: ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: Streaming=True causes missing or scrambled tokens with GPT-OSS 120B on vLLM v0.11.0

3 participants