Use model-specific tool parsers #210

Merged
guoqingbao merged 2 commits into main from tool_parser
Jan 31, 2026

@guoqingbao
Owner

@guoqingbao guoqingbao commented Jan 30, 2026

Improve Tool Call Parsing

This PR uses model-specific parsers for tool call parsing in both streaming and non-streaming modes. The goal is consistent parsing across models while remaining robust to partial output and format differences.

Parser selection

Parsers are selected in the following order:

  1. --enforce-parser (if provided and valid)
  2. Model-based heuristics (model type + model ID)
  3. Fallback to passthrough

Invalid --enforce-parser values result in an error listing valid parser names.

Available parsers:
passthrough, json, mistral, qwen, qwen_coder, pythonic, llama, deepseek, glm45_moe, glm47_moe, step3, kimik2, minimax_m2
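
The selection order above can be sketched as follows. This is a minimal illustration, not the PR's code: `select_parser` and `guess_from_model_id` are hypothetical names, and the substring heuristics are assumptions (the actual rules also use the model type and are not shown in this thread). Only the parser names come from the list above.

```rust
/// Parser names as listed in the PR description.
const PARSERS: &[&str] = &[
    "passthrough", "json", "mistral", "qwen", "qwen_coder", "pythonic", "llama",
    "deepseek", "glm45_moe", "glm47_moe", "step3", "kimik2", "minimax_m2",
];

/// Hypothetical heuristic: guess a parser from the model ID alone.
/// These substring rules are assumptions made for illustration.
fn guess_from_model_id(model_id: &str) -> Option<&'static str> {
    let id = model_id.to_lowercase();
    if id.contains("qwen") && id.contains("coder") {
        Some("qwen_coder")
    } else if id.contains("qwen") {
        Some("qwen")
    } else if id.contains("mistral") {
        Some("mistral")
    } else {
        None
    }
}

/// Selection order: enforced parser, then heuristics, then passthrough.
/// An unknown enforced name is an error that lists the valid names.
fn select_parser(enforce: Option<&str>, model_id: &str) -> Result<&'static str, String> {
    if let Some(name) = enforce {
        return PARSERS
            .iter()
            .copied()
            .find(|p| *p == name)
            .ok_or_else(|| {
                format!("invalid parser '{}'; valid parsers: {}", name, PARSERS.join(", "))
            });
    }
    Ok(guess_from_model_id(model_id).unwrap_or("passthrough"))
}
```

With these assumed rules, `select_parser(None, "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8")` yields `qwen_coder`, and an enforced name always takes precedence over the heuristic.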

Streaming vs non-streaming

Streaming uses incremental parsing, accumulating tool call fragments and finalizing them when an end marker is detected. If parsing fails, content falls back to normal text to avoid output loss.

Non-streaming reuses the same logic via parse_complete_with_fallback, ensuring identical behavior across both paths.
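
A rough sketch of the buffering-with-fallback idea, under stated assumptions: `StreamParser`, `Event`, and `parse_complete` are hypothetical names, the `<tool_call>`/`</tool_call>` markers are assumed to arrive whole within a chunk, and a brace check stands in for real JSON parsing. It is not the PR's implementation, nor the real `parse_complete_with_fallback`.

```rust
/// What the parser hands back to the caller.
#[derive(Debug, PartialEq)]
enum Event {
    Text(String),
    ToolCall(String),
}

const START: &str = "<tool_call>";
const END: &str = "</tool_call>";

/// Hypothetical incremental parser: buffers everything after a start
/// marker until the end marker shows up.
struct StreamParser {
    buffer: String,
    in_call: bool,
}

impl StreamParser {
    fn new() -> Self {
        StreamParser { buffer: String::new(), in_call: false }
    }

    /// Feed one decoded chunk; returns whatever can be emitted now.
    fn push(&mut self, chunk: &str) -> Vec<Event> {
        let mut events = Vec::new();
        if self.in_call {
            self.buffer.push_str(chunk);
        } else {
            match chunk.find(START) {
                Some(pos) => {
                    if pos > 0 {
                        events.push(Event::Text(chunk[..pos].to_string()));
                    }
                    self.in_call = true;
                    self.buffer.push_str(&chunk[pos + START.len()..]);
                }
                None => {
                    events.push(Event::Text(chunk.to_string()));
                    return events;
                }
            }
        }
        if let Some(end) = self.buffer.find(END) {
            let raw = self.buffer[..end].to_string();
            let rest = self.buffer[end + END.len()..].to_string();
            self.buffer.clear();
            self.in_call = false;
            let body = raw.trim();
            // Stand-in for real parsing: accept anything shaped like a JSON object.
            if body.starts_with('{') && body.ends_with('}') {
                events.push(Event::ToolCall(body.to_string()));
            } else {
                // Parse failure: re-emit the original text so nothing is lost.
                events.push(Event::Text(format!("{}{}{}", START, raw, END)));
            }
            if !rest.is_empty() {
                events.push(Event::Text(rest));
            }
        }
        events
    }

    /// End of stream: flush an unfinished tool call back out as plain text.
    fn finish(&mut self) -> Option<Event> {
        if !self.in_call {
            return None;
        }
        self.in_call = false;
        let partial = std::mem::take(&mut self.buffer);
        Some(Event::Text(format!("{}{}", START, partial)))
    }
}

/// Non-streaming path modeled as one push followed by a final flush,
/// so both paths share the same logic.
fn parse_complete(text: &str) -> Vec<Event> {
    let mut parser = StreamParser::new();
    let mut events = parser.push(text);
    events.extend(parser.finish());
    events
}
```

In this sketch, a start marker that is never closed is held until `finish` and then returned as text, which is the "fall back to normal text" behavior described above. For brevity, a second start marker in the text after an end marker is not re-scanned.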

Enforcing a parser

--enforce-parser qwen_coder

@guoqingbao
Owner Author

@sempervictus Do you have time to test this?


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d7c37fa8fe


Comment thread src/server/server.rs
@sempervictus
Contributor

sempervictus commented Jan 30, 2026

@guoqingbao - Gemma3-27b seems to have some problems:

vllm-rs-svc0  | 2026-01-30T15:34:22.944163Z  WARN vllm_rs::core::engine: [Stream] New request [Seq_id 1, 803 tokens] received! (session_id: Some("299a9266-91ea-4087-a1c9-b6f216a0032e"))
vllm-rs-svc0  | 
vllm-rs-svc0  | 2026-01-30T15:34:22.944214Z  INFO vllm_rs::core::block_manager: Prefix cache miss seq 1 (803 tokens)
vllm-rs-svc0  | 2026-01-30T15:34:23.956334Z  WARN vllm_rs::core::runner: User's thinking preference for reasoning models: None
vllm-rs-svc0  | 2026-01-30T15:34:23.956350Z  WARN vllm_rs::core::runner: Using user's sampling params: temp=Some(0.5), top_k=Some(64), top_p=Some(0.95), freq_penalty=None, pres_penalty=None
vllm-rs-svc0  | 2026-01-30T15:34:23.975102Z  INFO vllm_rs::core::engine: Prefilling [seq_id 1]: 804 tokens in 1.10s (730.91 tokens/s)
vllm-rs-svc0  | 2026-01-30T15:34:27.871399Z  INFO vllm_rs::server::parser: Tool call buffering end, reached > (236813)
vllm-rs-svc0  | 2026-01-30T15:34:27.903060Z  INFO vllm_rs::core::block_manager: Prefix cache insert seq 1 (927 tokens, 14 blocks)
vllm-rs-svc0  | 2026-01-30T15:34:27.903091Z  WARN vllm_rs::tools::helpers: Schema validation failed for tool 'fs_ls': Missing required field: path. Schema: Object {"type": String("object"), "properties": Object {"path": Object {"type": String("string"), "description": String("The path of the directory to list")}}, "required": Array [String("path")]}, Args: Object {}
vllm-rs-svc0  | 2026-01-30T15:34:27.903107Z  WARN vllm_rs::server::server: [Seq 1] Dropped 1 invalid tool call(s)
vllm-rs-svc0  | 2026-01-30T15:34:27.903113Z  INFO vllm_rs::tools::helpers: Invalid tool call(s): fs_ls(args={})

similarly the 4b tries to do:

coder>temp) .model fac:Gemma3-4b    1861(1.42%)

coder>temp) .empty session    1861(5.68%)

coder>temp) list the contents of . and read any docs present    0
<thinking>
The user wants to list the contents of the current directory (represented by ".") and read any documentation present. The most appropriate tool for this task is `fs_ls`.  The `fs_ls` tool will list the contents of the directory.  The current directory is ".", so no additional arguments are needed.
</thinking>

{"name": "fs_ls", "arguments": {}}
</end_function_call>

Qwen3 seems happy at 30B and 235B. Alibaba-NLP/Tongyi-DeepResearch-30B-A3B has unfortunately become a babbling mess since the candle updates last night, or somehow from this PR: yesterday's version figured out a course of action; today the exact same params/model just keeps wondering to itself which flags it needs to use on ls, when the available tools are a constrained fs_ls and fs_cat with no params other than path... so something went south there (for all I know the GPUs need a reset; will look deeper into this).

Qwen3-Coder also happy at 30B (FP8 version on Spark)
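
For readers following the Gemma3-27b log above: the `Schema validation failed ... Missing required field: path` warning and the subsequent `Dropped 1 invalid tool call(s)` line correspond to a required-field check on the parsed arguments. A simplified sketch of such a check is below. `ToolSchema`, `validate_args`, and `keep_valid` are hypothetical names, and the schema is reduced to its `required` list with string-only arguments (an assumption made to stay dependency-free), so this is not the project's actual validation code.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a tool's JSON schema: only the `required` list.
struct ToolSchema {
    name: &'static str,
    required: Vec<&'static str>,
}

/// Reports the first required field that is absent from `args`.
fn validate_args(schema: &ToolSchema, args: &HashMap<String, String>) -> Result<(), String> {
    for field in &schema.required {
        if !args.contains_key(*field) {
            return Err(format!(
                "tool '{}': missing required field: {}",
                schema.name, field
            ));
        }
    }
    Ok(())
}

/// Keeps the calls that pass validation and counts the ones dropped.
fn keep_valid(
    schema: &ToolSchema,
    calls: Vec<HashMap<String, String>>,
) -> (Vec<HashMap<String, String>>, usize) {
    let total = calls.len();
    let valid: Vec<_> = calls
        .into_iter()
        .filter(|args| validate_args(schema, args).is_ok())
        .collect();
    let dropped = total - valid.len();
    (valid, dropped)
}
```

Under these assumptions, the `fs_ls(args={})` call from the log fails because `path` is required but absent, so it is counted as dropped rather than forwarded to the client.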

@sempervictus
Contributor

So this used to break sessions and now just stalls for a second while buffering the decode stream but completes output permitting the session to continue (appears to fix #129):

coder>temp) explain your tool-use instructions

<tool-use-instructions>

...

Available tools and their optimal use cases:

  1. fs_mkdir: Create new directories in the project structure.
  2. fs_create: Generate new files with specified contents.
  3. fs_patch: Examine and modify existing files.
  4. fs_cat: View the contents of existing files without making changes.
  5. fs_ls: Understand the current project structure or locate specific files.

...

Important Rules:

  • Wrap function name and arguments with <tool_call> and <tool_call> tags
  • Do NOT USE ANY code blocks
  • Tool-use must be placed at the end of your response (AFTER REASONING), top-level, and not nested within other tags.
  • Always adhere to this format for the tool use to ensure proper parsing and execution.
  • The "name" and "arguments" are necessary fields
  • DO NOT call ANY functions that DOES NOT defined between <tool> and </tool>
  • MUST FOLLOW the above instruction when using tool call!

</tool-use-instructions>

It "hung" on the Wrap function name... line while buffering the rest of the output till the end of message token since that's actually incorrect syntax (should end with </tool_call>) and it spat out the tags and text subsequently. It detected a partial capture:

vllm-rs-svc0  | 2026-01-30T16:01:19.313766Z  INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8, enforce_parser=none)
vllm-rs-svc0  | 2026-01-30T16:01:19.314393Z  INFO vllm_rs::core::block_manager: Prefix cache hit seq 3 (4800 cached tokens, 75 blocks)
vllm-rs-svc0  | 2026-01-30T16:01:19.443326Z  INFO vllm_rs::core::engine: Prefilling [seq_id 3]: 4835 tokens in 0.16s (30796.18 tokens/s, cache included)
vllm-rs-svc0  | 2026-01-30T16:01:24.445290Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [3]], avg. 29 tokens/s per request (total: 29 tokens/s)
vllm-rs-svc0  | 2026-01-30T16:01:29.472767Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [3]], avg. 29 tokens/s per request (total: 29 tokens/s)
vllm-rs-svc0  | 2026-01-30T16:01:29.472789Z  INFO vllm_rs::core::scheduler: GPU Kvcache: 4013 blocks (256832 tokens) free, used 2.0% (0.49GB/24.00GB); CPU swap used NaN% (NaNGB/0.00GB)
vllm-rs-svc0  | 2026-01-30T16:01:30.806269Z  INFO vllm_rs::server::parser: Tool call <tool_call> (151657) found, start buffering!
vllm-rs-svc0  | 2026-01-30T16:01:34.475671Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [3]], avg. 29 tokens/s per request (total: 29 tokens/s)
vllm-rs-svc0  | 2026-01-30T16:01:34.646129Z  INFO vllm_rs::core::block_manager: Prefix cache insert seq 3 (5278 tokens, 82 blocks)
vllm-rs-svc0  | 2026-01-30T16:01:34.646735Z  WARN vllm_rs::server::server: [Seq 3] Tool parse partial, flushing 490 chars

@sempervictus
Contributor

@guoqingbao we need to add a reasoning-level flag defaulting to "moderate" or something other than "philosophy student with access to recreational substances" - the new mechanism seems to be making "thinking" models think waaaay too much about which tool calls to make and how. I just got >2k thinking tokens from the 235B VL trying to figure out how to structure two searxng requests 🤦

@sempervictus
Contributor

Another apparent problem: the newly formed tool call chunks are not passing through intermediate gateways. Something structural (digging into it) is stripping them out in tensorzero or doubleword-control-layer, which was not happening prior to this PR.

@guoqingbao
Owner Author

> @guoqingbao we need to add a reasoning-level flag defaulting to "moderate" or something other than "philosophy student with access to recreational substances" - the new mechanism seems to be making "thinking" models think waaaay too much about which tool calls to make and how. I just got >2k thinking tokens from the 235B VL trying to figure out how to structure two searxng requests 🤦

Weird, we didn't change the reasoning part. Not sure how it was affected.

@guoqingbao
Owner Author

> So this used to break sessions and now just stalls for a second while buffering the decode stream but completes output permitting the session to continue (appears to fix #129):

I tested with Claude Code. Different agents may not produce consistent results; do you see these problems with Claude Code?

@sempervictus
Contributor

Candle update?

@guoqingbao
Owner Author

> Candle update?

The candle update only changed cudaforge from a GitHub repository to crates.io (I published it there to support the candle maintainers, as they also want to use it within candle).

@guoqingbao
Owner Author

> So this used to break sessions and now just stalls for a second while buffering the decode stream but completes output permitting the session to continue (appears to fix #129):

So, in general, is this PR worth merging?

@sempervictus
Contributor

@guoqingbao yes, i think so - the diffusion i'm seeing (esp in longer contexts on bigger models) is likely a sampling collapse accumulating over KV which seems more likely due to changes in cutlass, candle, or attn.rs (or even cudaforge).

@guoqingbao
Owner Author

> @guoqingbao yes, i think so - the diffusion i'm seeing (esp in longer contexts on bigger models) is likely a sampling collapse accumulating over KV which seems more likely due to changes in cutlass, candle, or attn.rs (or even cudaforge).

Is that specific to FP8 models?

@sempervictus
Contributor

Oddly no, the V100s are doing it on FP16 DType Alibaba-NLP/Tongyi-DeepResearch-30B-A3B with no FP8 involved at q8_0 or q4k making me think there's a sampling bug somewhere (unless this code somehow overrides the sampling inputs for the turn, i would expect it to be deeper in the bowels of candle/cutlass/cudaforge)

@guoqingbao
Owner Author

> @guoqingbao - Gemma3-27b seems to have some problems:

That's a chat template problem: the official Gemma 3 repo does not contain a tool-calling template. Here is a workaround: https://www.reddit.com/r/LocalLLaMA/comments/1jauy8d/giving_native_tool_calling_to_gemma_3_or_really/

@guoqingbao
Owner Author

Let me merge this first; we have another PR for the precision degradation issue. @sempervictus

@guoqingbao guoqingbao merged commit 7a73878 into main Jan 31, 2026
1 check passed
guoqingbao added a commit that referenced this pull request May 21, 2026
* Use model-specific tool parsers

* Compatible with goose & optional tool call validation