Use model-specific tool parsers by guoqingbao · Pull Request #210 · guoqingbao/xinfer

guoqingbao · 2026-01-30T11:46:47Z

Improve Tool Call Parsing

This PR uses model-specific parsers for tool call parsing in both streaming and non-streaming modes. The goal is consistent parsing across models while remaining robust to partial output and format differences.

Parser selection

Parsers are selected in the following order:

1. --enforce-parser (if provided and valid)
2. Model-based heuristics (model type + model ID)
3. Fallback to passthrough

Invalid --enforce-parser values result in an error listing valid parser names.

Available parsers:
passthrough, json, mistral, qwen, qwen_coder, pythonic, llama, deepseek, glm45_moe, glm47_moe, step3, kimik2, minimax_m2
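
The selection order above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `select_parser` and `guess_from_model_id` are hypothetical names, and the substring heuristics are assumptions (the description says the real heuristics also use the model type). Only the parser names and the order of precedence come from the description.

```rust
/// Parser names as listed in the PR description.
const PARSERS: [&str; 13] = [
    "passthrough", "json", "mistral", "qwen", "qwen_coder", "pythonic", "llama",
    "deepseek", "glm45_moe", "glm47_moe", "step3", "kimik2", "minimax_m2",
];

/// Hypothetical heuristic: guess a parser from substrings of the model ID.
/// The real rules are not shown in this thread, so these are assumptions.
fn guess_from_model_id(model_id: &str) -> Option<&'static str> {
    let id = model_id.to_lowercase();
    if id.contains("qwen3-coder") {
        Some("qwen_coder")
    } else if id.contains("qwen") {
        Some("qwen")
    } else if id.contains("mistral") {
        Some("mistral")
    } else {
        None
    }
}

/// Order of precedence: --enforce-parser, then heuristics, then passthrough.
/// An invalid enforced name is an error that lists the valid parser names.
fn select_parser(enforce: Option<&str>, model_id: &str) -> Result<&'static str, String> {
    if let Some(name) = enforce {
        return PARSERS
            .iter()
            .copied()
            .find(|p| *p == name)
            .ok_or_else(|| {
                format!("invalid parser '{}'; valid parsers: {}", name, PARSERS.join(", "))
            });
    }
    Ok(guess_from_model_id(model_id).unwrap_or("passthrough"))
}
```

With these assumed heuristics, `select_parser(None, "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8")` yields `qwen_coder`, and an unrecognized model ID falls through to `passthrough`.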

Streaming vs non-streaming

Streaming uses incremental parsing, accumulating tool call fragments and finalizing them when an end marker is detected. If parsing fails, content falls back to normal text to avoid output loss.

Non-streaming reuses the same logic via parse_complete_with_fallback, ensuring identical behavior across both paths.
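
As a rough illustration of the buffering behavior described above, here is a simplified sketch. It is not the PR's implementation: `StreamBuffer`, `Event`, and `looks_like_call` are hypothetical names, the `<tool_call>` markers are assumed to arrive as whole tokens, and the parse check is a crude stand-in for real JSON parsing.

```rust
const START: &str = "<tool_call>";
const END: &str = "</tool_call>";

/// Illustrative output events (not the project's types).
#[derive(Debug, PartialEq)]
enum Event {
    Text(String),
    ToolCall(String),
}

/// Stand-in for real parsing: accepts anything shaped like a JSON object with a "name" key.
fn looks_like_call(body: &str) -> bool {
    let b = body.trim();
    b.starts_with('{') && b.ends_with('}') && b.contains("\"name\"")
}

struct StreamBuffer {
    buf: String,
    in_call: bool,
}

impl StreamBuffer {
    fn new() -> Self {
        StreamBuffer { buf: String::new(), in_call: false }
    }

    /// Feeds one token. Text passes through; tool call fragments accumulate
    /// until the end marker, then are finalized or returned as plain text.
    fn push(&mut self, token: &str) -> Option<Event> {
        if !self.in_call {
            if token == START {
                self.in_call = true;
                return None;
            }
            return Some(Event::Text(token.to_string()));
        }
        if token != END {
            self.buf.push_str(token);
            return None;
        }
        self.in_call = false;
        let body = std::mem::take(&mut self.buf);
        if looks_like_call(&body) {
            Some(Event::ToolCall(body.trim().to_string()))
        } else {
            // Parse failure: fall back to normal text so no output is lost.
            Some(Event::Text(format!("{}{}{}", START, body, END)))
        }
    }

    /// Called at end of stream: flushes an unterminated buffer as text.
    fn finish(&mut self) -> Option<Event> {
        if !self.in_call {
            return None;
        }
        self.in_call = false;
        let body = std::mem::take(&mut self.buf);
        Some(Event::Text(format!("{}{}", START, body)))
    }
}
```

A non-streaming path could feed the complete token sequence through the same `push` and `finish` calls, which is one way to keep both paths behaving identically.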

Enforcing a parser

--enforce-parser qwen_coder

guoqingbao · 2026-01-30T11:47:04Z

@sempervictus Do you have time to test this?

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d7c37fa8fe

sempervictus · 2026-01-30T15:42:59Z

@guoqingbao - Gemma3-27b seems to have some problems:

vllm-rs-svc0  | 2026-01-30T15:34:22.944163Z  WARN vllm_rs::core::engine: [Stream] New request [Seq_id 1, 803 tokens] received! (session_id: Some("299a9266-91ea-4087-a1c9-b6f216a0032e"))
vllm-rs-svc0  | 
vllm-rs-svc0  | 2026-01-30T15:34:22.944214Z  INFO vllm_rs::core::block_manager: Prefix cache miss seq 1 (803 tokens)
vllm-rs-svc0  | 2026-01-30T15:34:23.956334Z  WARN vllm_rs::core::runner: User's thinking preference for reasoning models: None
vllm-rs-svc0  | 2026-01-30T15:34:23.956350Z  WARN vllm_rs::core::runner: Using user's sampling params: temp=Some(0.5), top_k=Some(64), top_p=Some(0.95), freq_penalty=None, pres_penalty=None
vllm-rs-svc0  | 2026-01-30T15:34:23.975102Z  INFO vllm_rs::core::engine: Prefilling [seq_id 1]: 804 tokens in 1.10s (730.91 tokens/s)
vllm-rs-svc0  | 2026-01-30T15:34:27.871399Z  INFO vllm_rs::server::parser: Tool call buffering end, reached > (236813)
vllm-rs-svc0  | 2026-01-30T15:34:27.903060Z  INFO vllm_rs::core::block_manager: Prefix cache insert seq 1 (927 tokens, 14 blocks)
vllm-rs-svc0  | 2026-01-30T15:34:27.903091Z  WARN vllm_rs::tools::helpers: Schema validation failed for tool 'fs_ls': Missing required field: path. Schema: Object {"type": String("object"), "properties": Object {"path": Object {"type": String("string"), "description": String("The path of the directory to list")}}, "required": Array [String("path")]}, Args: Object {}
vllm-rs-svc0  | 2026-01-30T15:34:27.903107Z  WARN vllm_rs::server::server: [Seq 1] Dropped 1 invalid tool call(s)
vllm-rs-svc0  | 2026-01-30T15:34:27.903113Z  INFO vllm_rs::tools::helpers: Invalid tool call(s): fs_ls(args={})

Similarly, the 4b tries to do:

coder>temp) .model fac:Gemma3-4b    1861(1.42%)

coder>temp) .empty session    1861(5.68%)

coder>temp) list the contents of . and read any docs present    0
<thinking>
The user wants to list the contents of the current directory (represented by ".") and read any documentation present. The most appropriate tool for this task is `fs_ls`.  The `fs_ls` tool will list the contents of the directory.  The current directory is ".", so no additional arguments are needed.
</thinking>

{"name": "fs_ls", "arguments": {}}
</end_function_call>
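
Both cases above end with an `fs_ls` call whose arguments are empty, which is what the "Missing required field: path" warning in the log rejects. A required-field check of that kind might look like the sketch below. It is an assumption-laden illustration, not the code in vllm_rs::tools::helpers: `ToolCall`, `validate_required`, and `partition_calls` are hypothetical names, and argument values are plain strings to keep the example dependency-free.

```rust
use std::collections::BTreeMap;

/// Illustrative tool call; real arguments would be JSON values.
struct ToolCall {
    name: String,
    args: BTreeMap<String, String>,
}

/// Checks that every field in `required` is present in the call's arguments.
/// The error text mirrors the wording of the warning in the log above.
fn validate_required(call: &ToolCall, required: &[&str]) -> Result<(), String> {
    match required.iter().find(|field| !call.args.contains_key(**field)) {
        Some(field) => Err(format!(
            "Schema validation failed for tool '{}': Missing required field: {}",
            call.name, field
        )),
        None => Ok(()),
    }
}

/// Splits calls into (valid, dropped) so invalid ones can be logged and discarded.
fn partition_calls(calls: Vec<ToolCall>, required: &[&str]) -> (Vec<ToolCall>, Vec<ToolCall>) {
    calls
        .into_iter()
        .partition(|call| validate_required(call, required).is_ok())
}
```

Under this sketch, `fs_ls(args={})` lands in the dropped list because `path` is required by the schema, while a call that supplies `path` passes.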

Qwen3 seems happy at 30B and 235B. Alibaba-NLP/Tongyi-DeepResearch-30B-A3B, unfortunately, has become a babbling mess since the candle updates last night, or somehow from this PR: yesterday's version figured out a course of action, while today the exact same params/model just keeps wondering to itself which flags it needs to use on ls, when the available tools are a constrained fs_ls and fs_cat with no params other than path. So something went south there (for all I know the GPUs need a reset; will look deeper into this).

Qwen3-Coder also happy at 30B (FP8 version on Spark)

sempervictus · 2026-01-30T16:09:27Z

So this used to break sessions and now just stalls for a second while buffering the decode stream but completes output permitting the session to continue (appears to fix #129):

coder>temp) explain your tool-use instructions

<tool-use-instructions>

...

Available tools and their optimal use cases:

fs_mkdir: Create new directories in the project structure.

fs_create: Generate new files with specified contents.

fs_patch: Examine and modify existing files.

fs_cat: View the contents of existing files without making changes.

fs_ls: Understand the current project structure or locate specific files.

...

Important Rules:

Wrap function name and arguments with <tool_call> and <tool_call> tags

Do NOT USE ANY code blocks

Tool-use must be placed at the end of your response (AFTER REASONING), top-level, and not nested within other tags.

Always adhere to this format for the tool use to ensure proper parsing and execution.

The "name" and "arguments" are necessary fields

DO NOT call ANY functions that DOES NOT defined between <tool> and </tool>

MUST FOLLOW the above instruction when using tool call!

</tool-use-instructions>

It "hung" on the "Wrap function name..." line while buffering the rest of the output until the end-of-message token, since that line is actually incorrect syntax (it should end with </tool_call>), and then it spat out the tags and the text after them. The server detected a partial capture:

vllm-rs-svc0  | 2026-01-30T16:01:19.313766Z  INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8, enforce_parser=none)
vllm-rs-svc0  | 2026-01-30T16:01:19.314393Z  INFO vllm_rs::core::block_manager: Prefix cache hit seq 3 (4800 cached tokens, 75 blocks)
vllm-rs-svc0  | 2026-01-30T16:01:19.443326Z  INFO vllm_rs::core::engine: Prefilling [seq_id 3]: 4835 tokens in 0.16s (30796.18 tokens/s, cache included)
vllm-rs-svc0  | 2026-01-30T16:01:24.445290Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [3]], avg. 29 tokens/s per request (total: 29 tokens/s)
vllm-rs-svc0  | 2026-01-30T16:01:29.472767Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [3]], avg. 29 tokens/s per request (total: 29 tokens/s)
vllm-rs-svc0  | 2026-01-30T16:01:29.472789Z  INFO vllm_rs::core::scheduler: GPU Kvcache: 4013 blocks (256832 tokens) free, used 2.0% (0.49GB/24.00GB); CPU swap used NaN% (NaNGB/0.00GB)
vllm-rs-svc0  | 2026-01-30T16:01:30.806269Z  INFO vllm_rs::server::parser: Tool call <tool_call> (151657) found, start buffering!
vllm-rs-svc0  | 2026-01-30T16:01:34.475671Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [3]], avg. 29 tokens/s per request (total: 29 tokens/s)
vllm-rs-svc0  | 2026-01-30T16:01:34.646129Z  INFO vllm_rs::core::block_manager: Prefix cache insert seq 3 (5278 tokens, 82 blocks)
vllm-rs-svc0  | 2026-01-30T16:01:34.646735Z  WARN vllm_rs::server::server: [Seq 3] Tool parse partial, flushing 490 chars

sempervictus · 2026-01-30T16:29:47Z

@guoqingbao we need to add a reasoning-level flag defaulting to "moderate" or something other than "philosophy student with access to recreational substances" - the new mechanism seems to be making "thinking" models think waaaay too much about which tool calls to make and how. I just got >2k thinking tokens from the 235B VL trying to figure out how to structure two searxng requests 🤦

sempervictus · 2026-01-30T18:16:42Z

Another apparent problem: the newly formed tool call chunks are not passing through intermediate gateways. Something structural (digging into it) is stripping them out in tensorzero or doubleword-control-layer, which was not happening prior to this PR.

guoqingbao · 2026-01-30T23:38:21Z

> @guoqingbao we need to add a reasoning-level flag defaulting to "moderate" or something other than "philosophy student with access to recreational substances" - the new mechanism seems to be making "thinking" models think waaaay too much about which tool calls to make and how. I just got >2k thinking tokens from the 235B VL trying to figure out how to structure two searxng requests 🤦

Weird, we didn't change the reasoning part. Not sure how it was affected.

guoqingbao · 2026-01-31T00:04:12Z

> So this used to break sessions and now just stalls for a second while buffering the decode stream but completes output permitting the session to continue (appears to fix #129):

I tested with Claude Code. Different agents may not produce consistent results; do you see these problems with Claude Code?

sempervictus · 2026-01-31T00:04:50Z

Candle update?

guoqingbao · 2026-01-31T00:15:44Z

> Candle update?

The candle update only changed cudaforge from a GitHub repository to crates.io (I published it there to support the candle maintainers, as they also want to use it within candle).

guoqingbao · 2026-01-31T00:16:31Z

> So this used to break sessions and now just stalls for a second while buffering the decode stream but completes output permitting the session to continue (appears to fix #129):

So, in general, is this PR worth merging?

sempervictus · 2026-01-31T03:26:35Z

@guoqingbao yes, i think so - the diffusion i'm seeing (esp in longer contexts on bigger models) is likely a sampling collapse accumulating over KV which seems more likely due to changes in cutlass, candle, or attn.rs (or even cudaforge).

guoqingbao · 2026-01-31T04:03:47Z

> @guoqingbao yes, i think so - the diffusion i'm seeing (esp in longer contexts on bigger models) is likely a sampling collapse accumulating over KV which seems more likely due to changes in cutlass, candle, or attn.rs (or even cudaforge).

Is that specific to FP8 models?

sempervictus · 2026-01-31T06:17:48Z

Oddly no: the V100s are doing it on FP16 DType Alibaba-NLP/Tongyi-DeepResearch-30B-A3B, with no FP8 involved, at q8_0 or q4k, which makes me think there's a sampling bug somewhere (unless this code somehow overrides the sampling inputs for the turn, I would expect it to be deeper in the bowels of candle/cutlass/cudaforge).

guoqingbao · 2026-01-31T09:35:57Z

> @guoqingbao - Gemma3-27b seems to have some problems:

That's a chat template problem: the official gemma3 repo does not contain a tool-calling template. Here is a resolution: https://www.reddit.com/r/LocalLLaMA/comments/1jauy8d/giving_native_tool_calling_to_gemma_3_or_really/

guoqingbao · 2026-01-31T10:52:04Z

Let me merge this first, we have another PR for the precision degradation issue. @sempervictus

* Use model-specific tool parsers
* Compatible with goose & optional tool call validation

Use model-specific tool parsers

d7c37fa

chatgpt-codex-connector Bot reviewed Jan 30, 2026

Comment thread src/server/server.rs

guoqingbao mentioned this pull request Jan 30, 2026

Enforce guided model output for tool calling #208

Closed

This was referenced Jan 30, 2026

Special Token Parsing Problematic #206

Closed

Allow Default Reasoning Strength to be Configured for Thinking Models #211

Closed

sempervictus mentioned this pull request Jan 31, 2026

Reasoning Models Looping and Confused #213

Closed

sempervictus mentioned this pull request Jan 31, 2026

miromind-ai/MiroThinker-v1.5-235B degenerating while decoding #169

Closed

Compatible with goose & optional tool call validation

e2c0ab1

guoqingbao merged commit 7a73878 into main Jan 31, 2026
1 check passed

guoqingbao added a commit that referenced this pull request May 21, 2026

Use model-specific tool parsers (#210)

96b0ab3

* Use model-specific tool parsers
* Compatible with goose & optional tool call validation
