[Bugfix] missing tokens occur in harmony streaming #30437
chaunceyjiang merged 10 commits into vllm-project:main from
Conversation
Signed-off-by: RioS <aa248424@gmail.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. You can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request addresses a bug in harmony streaming where only the last token's delta was considered when multiple tokens were yielded. The changes correctly accumulate deltas from all tokens. My review includes one critical comment to prevent a potential IndexError when a RequestOutput with no outputs is processed, which would otherwise crash the stream.
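The bug pattern described above can be illustrated with a minimal sketch. The `Parser` class here is a hypothetical stand-in for the Harmony parser, not vLLM's actual class: each processed token yields a text delta, and keeping only the last token's delta drops text when several tokens arrive in one step.

```python
class Parser:
    """Stand-in for the Harmony parser: each token yields a text delta."""
    def __init__(self):
        self.last_content_delta = None

    def process(self, tok):
        # Pretend to decode: each token produces a small text fragment.
        self.last_content_delta = f"<{tok}>"

def buggy_step(parser, token_ids):
    for tok in token_ids:
        parser.process(tok)
    # Only the LAST token's delta survives the loop.
    return parser.last_content_delta or ''

def fixed_step(parser, token_ids):
    delta_text = ''
    for tok in token_ids:
        parser.process(tok)
        # Accumulate every token's delta, as this PR does.
        delta_text += parser.last_content_delta or ''
    return delta_text

tokens = [1, 2, 3]
print(buggy_step(Parser(), tokens))  # '<3>'  -- '<1>' and '<2>' are lost
print(fixed_step(Parser(), tokens))  # '<1><2><3>'
```

With a single token per step the two versions behave identically, which is why the bug only surfaces when the engine yields multiple tokens at once.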
vllm/entrypoints/context.py
Outdated
```python
last_delta_text = ''
for tok in output.outputs[0].token_ids:
    self.parser.process(tok)
    last_delta_text += self.parser.last_content_delta or ''
if last_delta_text:
    self.last_delta = last_delta_text
```
The code directly accesses output.outputs[0] without checking if output.outputs is empty. This could lead to an IndexError if a RequestOutput is processed that contains no outputs, causing a crash. Other parts of the codebase, like _update_decode_token_usage, check for this possibility, indicating it's a valid scenario. It's safer to handle this case gracefully by providing a default empty list for token_ids when output.outputs is empty.
Current:

```python
last_delta_text = ''
for tok in output.outputs[0].token_ids:
    self.parser.process(tok)
    last_delta_text += self.parser.last_content_delta or ''
if last_delta_text:
    self.last_delta = last_delta_text
```

Suggested change:

```python
last_delta_text = ''
token_ids = output.outputs[0].token_ids if output.outputs else []
for tok in token_ids:
    self.parser.process(tok)
    last_delta_text += self.parser.last_content_delta or ''
if last_delta_text:
    self.last_delta = last_delta_text
```
Hi @Ri0S, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically before each commit.
Signed-off-by: Ri0S <aa248424@gmail.com>
@chaunceyjiang Could you please confirm? The issue still occurs in the latest version, v0.13.0, when FastAPI’s throughput fails to match the token generation rate of the engine.
I remember this issue has already been fixed in the latest version. |
@chaunceyjiang The fundamental issue hasn't been fixed in the code: while it occurs less frequently than in the previous version, it still persists. The chat completion path also contains code that accumulates last_content_delta.

vllm/vllm/entrypoints/openai/serving_chat.py, line 809 in b9793e6
chaunceyjiang
left a comment
Thanks~ @Ri0S
Nit
Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: RioS <aa248424@gmail.com>
Signed-off-by: Ri0S <aa248424@gmail.com>
Thanks for putting this together @Ri0S. This fix would really help us. Is there anything still blocking the merge?
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Ri0S <aa248424@gmail.com>
…aming

# Conflicts:
#	vllm/entrypoints/openai/serving_responses.py

Signed-off-by: Ri0S <aa248424@gmail.com>
Bugfix/responses streaming
Head branch was pushed to by a user without write access
@Ri0S Thanks~
Signed-off-by: RioS <aa248424@gmail.com> Signed-off-by: Ri0S <aa248424@gmail.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Signed-off-by: RioS <aa248424@gmail.com> Signed-off-by: Ri0S <aa248424@gmail.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
Fixed an issue where, in harmony streaming mode, when the engine yields more than one token at a time, only the last token's delta is used.
FIX #28635 #30099
Test Plan
Test Result
no missing tokens
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.

Note
Addresses missing tokens when the engine outputs multiple tokens per step in Harmony streaming.
- Adds `last_content_delta` to `StreamingHarmonyContext`; accumulates text across all `token_ids` per step in `append_output` and resets at message start.
- Updates `serving_responses.py` to use `ctx.last_content_delta` (and guard on it) for final, analysis, MCP/code-interpreter, MCP prefix, and function-call argument deltas.

Written by Cursor Bugbot for commit f8d2831. This will update automatically on new commits. Configure here.
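The accumulate-per-step, reset-at-message-start behavior summarized in this note can be sketched as follows. `StreamingContext` and `Parser` are hypothetical stand-ins, not the actual `StreamingHarmonyContext` or Harmony parser, assuming a parser that exposes one text delta per processed token.

```python
class Parser:
    """Stand-in parser: each processed token yields a text delta."""
    def __init__(self):
        self.last_content_delta = None

    def process(self, tok):
        self.last_content_delta = str(tok)

class StreamingContext:
    """Stand-in context accumulating deltas across all tokens per step."""
    def __init__(self):
        self.parser = Parser()
        self.last_content_delta = ''

    def start_message(self):
        # Reset accumulated text when a new message begins.
        self.last_content_delta = ''

    def append_output(self, token_ids):
        # Accumulate the delta of every token in this step, not just the last.
        delta = ''
        for tok in token_ids:
            self.parser.process(tok)
            delta += self.parser.last_content_delta or ''
        self.last_content_delta = delta

ctx = StreamingContext()
ctx.append_output([1, 2])
print(ctx.last_content_delta)  # '12'
ctx.start_message()
print(ctx.last_content_delta)  # ''
```

Consumers in the serving layer would then read `ctx.last_content_delta` once per step and emit it as the stream chunk, guarding on it being non-empty.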