[Bugfix] Fix harmony streaming tool call crash and argument splitting #37070

Open

Pradyun92 wants to merge 1 commit into vllm-project:main from Pradyun92:fix/harmony-streaming-tool-call-crash

Conversation


@Pradyun92 Pradyun92 commented Mar 14, 2026

Purpose

Two fixes for Chat Completions streaming with harmony models (e.g., gpt_oss):

Bug 1: Tool call arguments split across indices

With speculative decoding (Eagle), multi-token batches can span channel boundaries. `extract_harmony_streaming_delta` reads `harmony_parser.messages` to compute tool call indices, so it must be called immediately after each token is processed, not after the entire batch. Otherwise the parser state is fully advanced and `base_index` is wrong for early tokens, causing argument fragments to land in separate tool calls.

For example, a single tool call produces two entries on the client:

Tool[0] get_weather: {"city": "Tokyo"    ← missing closing }
Tool[1] None: }                           ← stray } as separate "tool call"

Fix: Interleave token processing with delta extraction: process one token through `harmony_parser.process()`, immediately call `extract_harmony_streaming_delta` for that token, then process the next token. Merge the per-token deltas into a single `DeltaMessage`.
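The interleaving can be sketched roughly as follows. This is a minimal illustration, not vLLM's actual code: `FakeParser`, `ToolDelta`, and the merge helper are stand-ins; only the process-then-extract call ordering mirrors the fix described above.

```python
from dataclasses import dataclass, field

# Stand-ins for vLLM's types, used only to illustrate the call ordering.
@dataclass
class ToolDelta:
    index: int        # tool call index this fragment belongs to
    arguments: str    # argument text fragment

@dataclass
class DeltaMessage:
    tool_calls: list = field(default_factory=list)

def stream_batch(parser, token_ids, extract_delta):
    """Advance the parser ONE token at a time and extract the streaming
    delta immediately, so tool call indices are computed against the
    parser state for that token; merge the per-token deltas at the end."""
    merged = DeltaMessage()
    for tok in token_ids:
        parser.process(tok)              # advance by a single token
        delta = extract_delta(parser)    # indices are correct at this point
        if delta is not None:
            merged.tool_calls.extend(delta.tool_calls)
    return merged

# Toy parser: just counts processed tokens.
class FakeParser:
    def __init__(self):
        self.pos = 0
    def process(self, tok):
        self.pos += 1

def fake_extract(parser):
    # Pretend every token contributes an argument fragment for tool 0.
    return DeltaMessage(tool_calls=[ToolDelta(0, f"frag{parser.pos}")])

merged = stream_batch(FakeParser(), [101, 102, 103], fake_extract)
```

With batch-level extraction, fragments from early tokens would be attributed to whatever index the fully-advanced parser reports; interleaving keeps every fragment on the tool call it actually belongs to.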

Bug 2: Server crash (IndexError) in autocomplete logic

The "unstreamed tool arg tokens" autocomplete logic at lines 1232-1281 accesses `tool_parser.prev_tool_call_arr[index]` and `tool_parser.streamed_args_for_tool[index]`. Harmony models use `extract_harmony_streaming_delta()` for tool call parsing, which never populates these arrays, so indexing into them raises an IndexError.

Fix: Add an `and not self.use_harmony` guard to skip the autocomplete block for harmony models.
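The effect of the guard can be sketched as below. The helper name and surrounding structure are hypothetical; only `use_harmony`, `prev_tool_call_arr`, and `streamed_args_for_tool` come from the PR description.

```python
class ToolParserState:
    # Minimal stand-in: for harmony models these arrays stay empty.
    def __init__(self):
        self.prev_tool_call_arr = []
        self.streamed_args_for_tool = []

def should_run_autocomplete(tool_parser, index, use_harmony):
    """Guard the 'unstreamed tool arg tokens' autocomplete block.
    Harmony models parse tool calls via extract_harmony_streaming_delta,
    which never fills these arrays, so indexing them would raise
    IndexError; skip the block entirely in that case."""
    if use_harmony:
        return False
    return index < len(tool_parser.prev_tool_call_arr)
```

Before the fix, indexing position 0 of the empty arrays raised the `list index out of range` error shown in the test results; with the guard, harmony requests never reach that code path.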

Not duplicating existing PRs: Checked open PRs #33520, #35449, #35907, #36011, #36445 — none address these specific bugs (argument splitting across indices or IndexError in autocomplete logic).

AI assistance: This PR was developed with AI assistance (Claude). The submitter has reviewed all changes and tested end-to-end.

Test Plan

```bash
# Start vLLM with a harmony model + Eagle speculative decoding
vllm serve <harmony-model-path> \
  --enable-auto-tool-choice --tool-call-parser openai \
  --reasoning-parser openai_gptoss \
  --speculative-config '{"method":"eagle3","model":"<eagle-head-path>","num_speculative_tokens":5}'

# Streaming tool call (was crashing / producing malformed args):
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model>",
    "stream": true,
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{"type": "function", "function": {"name": "get_weather", "description": "Get weather", "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}}]
  }'

# BFCL multi-turn function calling benchmark:
OPENAI_BASE_URL="http://localhost:8000/v1" \
OPENAI_API_KEY="fake" \
bfcl generate \
  --model <model> \
  --test-category multi_turn \
  --num-threads 8
```

Test Result

Before fix (Bug 1): Streaming tool call arguments split across indices:

Tool[0] get_weather: {"city": "Tokyo"     ← incomplete
Tool[1] None: }                            ← stray fragment

Before fix (Bug 2): Server returns 500:

{"error": {"message": "list index out of range", "type": "InternalServerError", "code": 500}}

After fix: Streaming tool calls produce correct, complete JSON:

Tool[0] get_weather: {"city": "Tokyo", "units": "celsius"}

BFCL multi-turn results (gpt-oss-120b with Eagle3 speculative decoding):

| Category | Accuracy |
| --- | --- |
| Multi Turn Overall | 54.75% |
| Base | 65.00% |
| Miss Func | 51.50% |
| Miss Param | 48.50% |
| Long Context | 54.00% |

All 800 test cases completed with 0 malformed JSON errors in tool call arguments.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR
  • The test plan
  • The test results
  • (Optional) Documentation update
  • (Optional) Release notes update


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes an IndexError crash that occurs during streaming tool calls with harmony models. The change adds a condition, and not self.use_harmony, to prevent access to unpopulated tool_parser arrays for these models. This approach correctly addresses the described bug. The provided test plan appears sufficient to validate the fix. I have reviewed the changes and have no further comments.

@Pradyun92 Pradyun92 force-pushed the fix/harmony-streaming-tool-call-crash branch from 70cc7ae to f3e7988 Compare March 14, 2026 21:58
@Pradyun92 Pradyun92 changed the title [Bugfix] Fix harmony model streaming tool call crash (IndexError) [Bugfix] Fix harmony streaming tool call crash and argument splitting Mar 14, 2026
Two fixes for Chat Completions streaming with harmony models (gpt_oss):

1. Tool call arguments split across indices: With speculative decoding
   (Eagle), multi-token batches can span channel boundaries.
   extract_harmony_streaming_delta reads harmony_parser.messages to
   compute tool call indices, so it must be called immediately after
   each token is processed — not after the entire batch. Otherwise the
   parser state is fully advanced and base_index is wrong for early
   tokens, causing argument fragments to land in separate tool calls.

   Fix: Interleave token processing with delta extraction — process one
   token through harmony_parser.process(), immediately call
   extract_harmony_streaming_delta for that token, then process the
   next token. Merge the per-token deltas into a single DeltaMessage.

2. IndexError crash in autocomplete logic: The unstreamed tool arg
   tokens autocomplete logic accesses tool_parser.prev_tool_call_arr
   and tool_parser.streamed_args_for_tool, but harmony models never
   populate these arrays. Skip this block for harmony models.

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
Co-authored-by: Claude

Labels

bug (Something isn't working) · frontend · gpt-oss (Related to GPT-OSS models)
