Skip to content

[Fix] Gemma4 tool parser: support multiple tool calls in single delta and refactor streaming extraction#43037

Open
Alex-ai-future wants to merge 7 commits into
vllm-project:mainfrom
Alex-ai-future:gemma-parser-muti-dispatch
Open

[Fix] Gemma4 tool parser: support multiple tool calls in single delta and refactor streaming extraction#43037
Alex-ai-future wants to merge 7 commits into
vllm-project:mainfrom
Alex-ai-future:gemma-parser-muti-dispatch

Conversation

@Alex-ai-future

@Alex-ai-future Alex-ai-future commented May 19, 2026

Copy link
Copy Markdown

Summary

Fixes #42696 — Gemma4 tool parser fails in streaming mode when multiple tool calls arrive in a single delta.

Root Cause

The old streaming parser used a tag-counting state machine that assumed only one active tool call at a time. When MTP/speculative decoding produced multiple <tool_call|> delimiters in one delta, it would:

  1. Lose subsequent tool calls — only the first one was emitted
  2. Emit incomplete DeltaToolCall — follow-up chunks lacked id, type, function.name, causing @ai-sdk (OpenCode) validation failures

Solution

Replace the tag-counting state machine with Hermes-style accumulated-text scan-and-diff:

  1. regex.findall() enumerates all complete tool calls in current_text
  2. Per-index state tracking (prev_tool_call_arr[i]) instead of global current_tool_id
  3. Each DeltaToolCall emission includes id, type, and function.name on first chunk
  4. Content between tool calls is properly extracted and emitted

Code changes

File + -
vllm/tool_parsers/gemma4_tool_parser.py ~130 ~200
tests/tool_parsers/test_gemma4_tool_parser.py ~110 ~5

Key refactorings:

  • Split _extract_streaming into 7 focused functions
  • Removed ~150 lines of dead code (_handle_tool_call_middle, _handle_tool_call_end, etc.)
  • Removed unused buffered_delta_text and _buffer_delta_text()

Tests

.venv/bin/python -m pytest tests/tool_parsers/test_gemma4_tool_parser.py -v
# 57 passed

New test cases:

  • test_streaming_multiple_tool_calls_in_single_delta — 2 calls in one delta
  • test_streaming_four_tool_calls_with_interleaved_text_in_single_delta — 4 calls + text
  • test_streaming_text_between_tool_calls_in_single_delta — content extraction between calls
  • test_streaming_multiple_tool_calls_sequential — cross-chunk multi-call
  • test_streaming_filename_suffix_preserved_across_chunks — file extension split
  • test_streaming_string_prefix_preserved_across_chunks — string prefix split

Why this is not duplicating an existing PR

  • [Bugfix] Rewrite Gemma4 streaming tool parser #42237 (whym:codex/gemma4-hermes-style-parser) has the same high-level approach (Hermes-style scan-and-diff). Our PR differs in:
    • Better code structure: 7 focused functions vs fewer large functions
    • More comprehensive tests: 4-call + interleaved text + content extraction (57 vs ~55)
    • No external dependency: Doesn't introduce partial_tag_overlap utility
    • Both PRs address the same root cause; maintainers can choose either approach

Signed-off-by: Alex <alex.tech.lab@outlook.com>
@github-actions

Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the Gemma4 streaming tool parser to support multiple tool calls within a single delta by adopting a full-match enumeration approach, accompanied by new regression tests. Review feedback identifies several critical issues: the prev_end_count variable is now unused, the use of a global current_tool_name_sent flag fails to correctly track state for individual tool calls in a multi-call sequence, and the content extraction logic is prone to data loss for text appearing before or between tool call tags.

Comment thread vllm/tool_parsers/gemma4_tool_parser.py Outdated
# Find all complete tool calls in current text
all_matches = self.tool_call_regex.findall(current_text)
prev_start_count = previous_text.count(self.tool_call_start_token)
prev_end_count = previous_text.count(self.tool_call_end_token)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The variable prev_end_count is calculated but never used in the new implementation of _extract_streaming. It should be removed to avoid confusion and unnecessary computation.

Comment thread vllm/tool_parsers/gemma4_tool_parser.py Outdated

# Stream argument diff
if self.current_tool_name_sent and args_part:
partial_delta = self._emit_argument_diff_for_index(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The use of the global self.current_tool_name_sent flag is incorrect for multi-tool parsing. If a previous tool call in the same request has already finished, this flag will be True. When a subsequent tool call starts, line 622 may evaluate to True even if the current tool's name hasn't been sent yet (e.g., if func_name is still None at line 605). This could lead to streaming arguments for the wrong tool or before the tool identity is established. You should use the per-tool state stored in self.prev_tool_call_arr[partial_idx] to track if the name has been sent for the specific tool call being processed.

                if "name" in self.prev_tool_call_arr[partial_idx] and args_part:

Comment thread vllm/tool_parsers/gemma4_tool_parser.py Outdated
Comment on lines +629 to +651
# Case A: No tool call tokens at all — treat as pure content
if curr_start_count == 0 and prev_start_count == 0:
if delta_text:
return DeltaMessage(content=delta_text)
content_delta = delta_text
# Case B: All tool calls completed and text appears after them
elif curr_start_count > 0 and curr_start_count == curr_end_count:
# Find text after the last <tool_call|>
last_end = current_text.rfind(self.tool_call_end_token)
if last_end != -1:
text_after = current_text[last_end + len(self.tool_call_end_token):]
prev_last_end = previous_text.rfind(self.tool_call_end_token)
prev_text_after = ""
if prev_last_end != -1:
prev_text_after = previous_text[
prev_last_end + len(self.tool_call_end_token):
]
# New content is what appeared since last check
if text_after.startswith(prev_text_after):
new_content = text_after[len(prev_text_after):]
else:
new_content = text_after
if new_content:
content_delta = new_content

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The content extraction logic in Step 3 is incomplete and will lead to data loss in several scenarios:

  1. Preamble Loss: Any text appearing before the first tool call in the same delta (e.g., "Preamble <|tool_call>...") is ignored because curr_start_count > 0 and curr_start_count != curr_end_count.
  2. Inter-call Content Loss: Text between multiple tool calls (e.g., "<tool_call|> Inter <|tool_call>") is lost because the parser assumes content only exists when no tool calls are active.
  3. Delayed Emission: Content after a tool call is only emitted once that specific tool call is fully completed, which may delay the display of text to the user.

You should refactor this to identify all text segments outside of <|tool_call>...<tool_call|> blocks and emit the difference relative to what was previously sent. A more robust approach would be to calculate the total content in current_text (by removing all tool call blocks) and diffing it against the total content in previous_text.

Signed-off-by: Alex <alex.tech.lab@outlook.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
@Alex-ai-future Alex-ai-future changed the title fix muti function call for gemma Fix Gemma4 tool parser: support multiple tool calls in single delta and refactor streaming extraction May 19, 2026
@Alex-ai-future Alex-ai-future mentioned this pull request May 19, 2026
4 tasks
@Alex-ai-future Alex-ai-future changed the title Fix Gemma4 tool parser: support multiple tool calls in single delta and refactor streaming extraction [Fix] Gemma4 tool parser: support multiple tool calls in single delta and refactor streaming extraction May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[Bug]: Gemma4 tool parser is broken in the streaming mode (for OpenCode)

1 participant