[Bugfix] Fix harmony streaming tool call crash and argument splitting#37070
Open
Pradyun92 wants to merge 1 commit intovllm-project:mainfrom
Open
[Bugfix] Fix harmony streaming tool call crash and argument splitting#37070Pradyun92 wants to merge 1 commit intovllm-project:mainfrom
Pradyun92 wants to merge 1 commit intovllm-project:mainfrom
Conversation
Contributor
There was a problem hiding this comment.
Code Review
This pull request fixes an IndexError crash that occurs during streaming tool calls with harmony models. The change adds a condition, and not self.use_harmony, to prevent access to unpopulated tool_parser arrays for these models. This approach correctly addresses the described bug. The provided test plan appears sufficient to validate the fix. I have reviewed the changes and have no further comments.
70cc7ae to
f3e7988
Compare
Two fixes for Chat Completions streaming with harmony models (gpt_oss): 1. Tool call arguments split across indices: With speculative decoding (Eagle), multi-token batches can span channel boundaries. extract_harmony_streaming_delta reads harmony_parser.messages to compute tool call indices, so it must be called immediately after each token is processed — not after the entire batch. Otherwise the parser state is fully advanced and base_index is wrong for early tokens, causing argument fragments to land in separate tool calls. Fix: Interleave token processing with delta extraction — process one token through harmony_parser.process(), immediately call extract_harmony_streaming_delta for that token, then process the next token. Merge the per-token deltas into a single DeltaMessage. 2. IndexError crash in autocomplete logic: The unstreamed tool arg tokens autocomplete logic accesses tool_parser.prev_tool_call_arr and tool_parser.streamed_args_for_tool, but harmony models never populate these arrays. Skip this block for harmony models. Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com> Co-authored-by: Claude
f3e7988 to
1a71d80
Compare
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
Two fixes for Chat Completions streaming with harmony models (e.g.,
gpt_oss):Bug 1: Tool call arguments split across indices
With speculative decoding (Eagle), multi-token batches can span channel boundaries.
extract_harmony_streaming_deltareadsharmony_parser.messagesto compute tool call indices, so it must be called immediately after each token is processed — not after the entire batch. Otherwise the parser state is fully advanced andbase_indexis wrong for early tokens, causing argument fragments to land in separate tool calls.For example, a single tool call produces two entries on the client:
Fix: Interleave token processing with delta extraction — process one token through
harmony_parser.process(), immediately callextract_harmony_streaming_deltafor that token, then process the next token. Merge the per-token deltas into a singleDeltaMessage.Bug 2: Server crash (IndexError) in autocomplete logic
The "unstreamed tool arg tokens" autocomplete logic at lines 1232-1281 accesses
tool_parser.prev_tool_call_arr[index]andtool_parser.streamed_args_for_tool[index]. Harmony models useextract_harmony_streaming_delta()for tool call parsing, which never populates these arrays. Indexing into empty arrays raisesIndexError.Fix: Add
and not self.use_harmonyguard to skip the autocomplete block for harmony models.Not duplicating existing PRs: Checked open PRs #33520, #35449, #35907, #36011, #36445 — none address these specific bugs (argument splitting across indices or IndexError in autocomplete logic).
AI assistance: This PR was developed with AI assistance (Claude). The submitter has reviewed all changes and tested end-to-end.
Test Plan
Test Result
Before fix (Bug 1): Streaming tool call arguments split across indices:
Before fix (Bug 2): Server returns 500:
{"error": {"message": "list index out of range", "type": "InternalServerError", "code": 500}}After fix: Streaming tool calls produce correct, complete JSON:
BFCL multi-turn results (gpt-oss-120b with Eagle3 speculative decoding):
All 800 test cases completed with 0 malformed JSON errors in tool call arguments.
Essential Elements of an Effective PR Description Checklist