UPSTREAM PR #19635: common : fix Step-3.5-Flash format detection and thinking support #1182

Open
loci-dev wants to merge 3 commits into main from loci/pr-19635-fix-step35-tool-call-detection

@loci-dev
Note

Source pull request: ggml-org/llama.cpp#19635

Summary

Step-3.5-Flash (196B MoE) uses the same XML-style tool call format as Qwen3-Coder (<tool_call><function=...><parameter=...>) but its Jinja template lacks the bare <function> and plural <parameters> markers that the detection logic previously required. This caused it to fall through to Hermes 2 Pro, which doesn't call func_args_not_string(), so arguments stayed as JSON strings and templates using arguments|items crashed.

Reported by multiple users in #19283.

Additionally, the Qwen3-Coder-XML format handler had no thinking support. Models like Step-3.5-Flash that unconditionally emit <think> in their generation prompt need the same thinking_forced_open handling that Nemotron v3 and Hermes 2 Pro already have; otherwise, reasoning_content is never separated from content in API responses.
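
To make the thinking_forced_open behavior concrete, here is a minimal, self-contained sketch of how output could be split into reasoning_content and content when the generation prompt has already emitted <think>. The `split_output` struct and `split_reasoning` function are hypothetical names used only for illustration; this is not the parser code in llama.cpp, which also handles streaming and partial tags.

```cpp
#include <string>

// Hypothetical result type: mirrors the two API fields named in the PR.
struct split_output {
    std::string reasoning_content;
    std::string content;
};

// Hypothetical helper (not llama.cpp's API). When thinking_forced_open is
// true, the prompt already ended with "<think>", so the model output starts
// inside the reasoning block and only the closing tag appears in it.
static split_output split_reasoning(const std::string & output, bool thinking_forced_open) {
    const std::string open  = "<think>";
    const std::string close = "</think>";

    std::string rest = output;
    bool in_think = thinking_forced_open;

    // Without a forced-open prompt, reasoning only counts if the output
    // itself begins with the opening tag.
    if (!in_think && rest.compare(0, open.size(), open) == 0) {
        rest.erase(0, open.size());
        in_think = true;
    }
    if (!in_think) {
        return { "", rest };
    }

    const std::string::size_type end = rest.find(close);
    if (end == std::string::npos) {
        // No closing tag yet (e.g. truncated output): everything is reasoning.
        return { rest, "" };
    }
    return { rest.substr(0, end), rest.substr(end + close.size()) };
}
```

Without the forced-open flag, the same output would be returned entirely as content, which matches the symptom described above: reasoning text leaking into content.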

Changes

  • Relax Qwen3-Coder XML detection to only require the 3 shared markers (<tool_call>, <function=, <parameter=)
  • Tighten Nemotron v3 branch to also require bare <function> and plural <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
  • Add thinking_forced_open support to Qwen3-Coder-XML init function (same pattern as Nemotron v3, Hermes 2 Pro, Granite)
  • Add <think>/</think> to preserved tokens
  • Fix build_grammar_xml_tool_call to handle thinking_forced_open in the grammar root rule, allowing </think> before tool calls when tool_choice=required (also fixes a pre-existing bug in MiniMax-M2 and GLM 4.5)
  • Add unmodified HuggingFace chat template and format detection test
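
The first two detection changes can be sketched as a marker check over the template source. This is a simplified illustration under stated assumptions: the `tool_format` enum and `detect_tool_format` function are hypothetical names, the real detection chain in llama.cpp covers many more formats, and treating the three shared markers as a precondition for the Nemotron v3 branch is an assumption of this sketch.

```cpp
#include <string>

// Hypothetical enum for illustration; llama.cpp uses COMMON_CHAT_FORMAT_* values.
enum class tool_format { nemotron_v3, qwen3_coder_xml, hermes_2_pro };

static bool has_marker(const std::string & tmpl, const char * marker) {
    return tmpl.find(marker) != std::string::npos;
}

// Sketch of the relaxed/tightened detection described in the list above.
static tool_format detect_tool_format(const std::string & tmpl) {
    // The three markers shared by Qwen3-Coder, Nemotron v3 and Step-3.5-Flash.
    const bool shared = has_marker(tmpl, "<tool_call>") &&
                        has_marker(tmpl, "<function=")  &&
                        has_marker(tmpl, "<parameter=");
    if (!shared) {
        // Stand-in for "fall through to the later handlers".
        return tool_format::hermes_2_pro;
    }
    // Tightened branch: <think> alone no longer selects Nemotron v3; the bare
    // <function> and plural <parameters> markers must also be present.
    if (has_marker(tmpl, "<think>") &&
        has_marker(tmpl, "<function>") &&
        has_marker(tmpl, "<parameters>")) {
        return tool_format::nemotron_v3;
    }
    // Relaxed branch: the three shared markers are sufficient.
    return tool_format::qwen3_coder_xml;
}
```

A template that contains <think> plus the three shared markers but lacks the bare <function> and plural <parameters> markers, as described for Step-3.5-Flash, resolves to `qwen3_coder_xml` in this sketch.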

Testing

  • test-chat passes (format detected as COMMON_CHAT_FORMAT_QWEN3_CODER_XML)
  • Validated tool calling via /v1/chat/completions: correct tool_calls with parsed arguments
  • Validated thinking via /v1/chat/completions: reasoning_content properly separated from content
  • Tested on Step-3.5-Flash IQ3_XXS with --reasoning-format auto (default)

Related

AI Disclosure

Claude was used for codebase exploration, pattern identification, and drafting. All changes follow established patterns from existing format handlers (Nemotron v3, Hermes 2 Pro, MiniMax-M2). Fully tested locally.

Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder
(<tool_call><function=...><parameter=...>) but its Jinja template lacks
the bare <function> and plural <parameters> markers that the detection
logic previously required. This caused it to fall through to Hermes 2
Pro, which doesn't call func_args_not_string(), so arguments stayed as
JSON strings and templates using arguments|items crashed.

Additionally, the Qwen3-Coder-XML format handler had no thinking support.
Models like Step-3.5-Flash that unconditionally emit <think> in their
generation prompt need the same thinking_forced_open handling that
Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content
is never separated from content in API responses.

Changes:
- Relax Qwen3-Coder XML detection to only require the 3 shared markers
- Tighten Nemotron v3 branch to also require bare <function> and plural
  <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to Qwen3-Coder-XML init function
- Add <think>/</think> to preserved tokens
- Fix build_grammar_xml_tool_call to handle thinking_forced_open in the
  grammar root rule, allowing </think> before tool calls
- Add Step-3.5-Flash chat template and format detection test

Builds on: ggml-org/llama.cpp#19283

Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and
Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with
unconditional <think> output. Route it to the Nemotron v3 PEG parser
for streaming and schema-aware parameter parsing.

Detection: templates with <think> + XML tool tags use Nemotron v3 PEG
parser; templates without <think> (Qwen3-Coder) use GBNF grammar.

Tests cover: basic messages, tool calls with/without thinking content,
parallel tool calls, code string parameters, optional </parameter>
closing tags, and JSON schema response format.

Remove thinking handling code that became unreachable after routing
Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no
<think> in its template, so the thinking_forced_open logic, preserved
tokens, and grammar prefix were dead paths.
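
The routing rule stated in the second commit message (templates with <think> plus the XML tool tags go to the Nemotron v3 PEG parser, templates without <think> go to the Qwen3-Coder GBNF grammar) could look roughly like the following. The `xml_parser` enum and `route_xml_template` function are hypothetical names for illustration and do not reflect the actual code in the PR.

```cpp
#include <string>

// Hypothetical enum for illustration only.
enum class xml_parser { nemotron_v3_peg, qwen3_coder_gbnf, other };

// Sketch of the routing rule: XML tool tags select the family, and the
// presence of <think> in the template selects the parser within it.
static xml_parser route_xml_template(const std::string & tmpl) {
    auto has = [&tmpl](const char * marker) {
        return tmpl.find(marker) != std::string::npos;
    };
    if (!(has("<tool_call>") && has("<function=") && has("<parameter="))) {
        return xml_parser::other;  // not an XML-style tool call template
    }
    return has("<think>") ? xml_parser::nemotron_v3_peg
                          : xml_parser::qwen3_coder_gbnf;
}
```

Under this rule the Qwen3-Coder path never sees <think>, which is consistent with the third commit removing its thinking handling as unreachable.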
@loci-dev loci-dev force-pushed the main branch 4 times, most recently from 073bd79 to 823244c Compare February 18, 2026 02:17
@loci-dev loci-dev force-pushed the loci/pr-19635-fix-step35-tool-call-detection branch from e26fa44 to bdc1dda Compare February 18, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 2cecc98 to a92fe2a Compare February 26, 2026 02:16
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 9f4f332 to 4298c74 Compare March 6, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 56aaa36 to 21147c2 Compare March 13, 2026 02:17