Use model-specific tool parsers #210
|
@sempervictus Do you have time to test this? |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d7c37fa8fe
|
@guoqingbao - Gemma3-27b seems to have some problems:
vllm-rs-svc0 | 2026-01-30T15:34:22.944163Z WARN vllm_rs::core::engine: [Stream] New request [Seq_id 1, 803 tokens] received! (session_id: Some("299a9266-91ea-4087-a1c9-b6f216a0032e"))
vllm-rs-svc0 |
vllm-rs-svc0 | 2026-01-30T15:34:22.944214Z INFO vllm_rs::core::block_manager: Prefix cache miss seq 1 (803 tokens)
vllm-rs-svc0 | 2026-01-30T15:34:23.956334Z WARN vllm_rs::core::runner: User's thinking preference for reasoning models: None
vllm-rs-svc0 | 2026-01-30T15:34:23.956350Z WARN vllm_rs::core::runner: Using user's sampling params: temp=Some(0.5), top_k=Some(64), top_p=Some(0.95), freq_penalty=None, pres_penalty=None
vllm-rs-svc0 | 2026-01-30T15:34:23.975102Z INFO vllm_rs::core::engine: Prefilling [seq_id 1]: 804 tokens in 1.10s (730.91 tokens/s)
vllm-rs-svc0 | 2026-01-30T15:34:27.871399Z INFO vllm_rs::server::parser: Tool call buffering end, reached > (236813)
vllm-rs-svc0 | 2026-01-30T15:34:27.903060Z INFO vllm_rs::core::block_manager: Prefix cache insert seq 1 (927 tokens, 14 blocks)
vllm-rs-svc0 | 2026-01-30T15:34:27.903091Z WARN vllm_rs::tools::helpers: Schema validation failed for tool 'fs_ls': Missing required field: path. Schema: Object {"type": String("object"), "properties": Object {"path": Object {"type": String("string"), "description": String("The path of the directory to list")}}, "required": Array [String("path")]}, Args: Object {}
vllm-rs-svc0 | 2026-01-30T15:34:27.903107Z WARN vllm_rs::server::server: [Seq 1] Dropped 1 invalid tool call(s)
vllm-rs-svc0 | 2026-01-30T15:34:27.903113Z INFO vllm_rs::tools::helpers: Invalid tool call(s): fs_ls(args={})
Similarly, the 4b tries to do:
coder>temp) .model fac:Gemma3-4b 1861(1.42%)
coder>temp) .empty session 1861(5.68%)
coder>temp) list the contents of . and read any docs present 0
<thinking>
The user wants to list the contents of the current directory (represented by ".") and read any documentation present. The most appropriate tool for this task is `fs_ls`. The `fs_ls` tool will list the contents of the directory. The current directory is ".", so no additional arguments are needed.
</thinking>
{"name": "fs_ls", "arguments": {}}
</end_function_call>
Qwen3 seems happy at 30 and 235B, Qwen3-Coder also happy at 30B (FP8 version on Spark) |
|
So this used to break sessions and now just stalls for a second while buffering the decode stream but completes output permitting the session to continue (appears to fix #129):
...
...
It "hung" on the
vllm-rs-svc0 | 2026-01-30T16:01:19.313766Z INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8, enforce_parser=none)
vllm-rs-svc0 | 2026-01-30T16:01:19.314393Z INFO vllm_rs::core::block_manager: Prefix cache hit seq 3 (4800 cached tokens, 75 blocks)
vllm-rs-svc0 | 2026-01-30T16:01:19.443326Z INFO vllm_rs::core::engine: Prefilling [seq_id 3]: 4835 tokens in 0.16s (30796.18 tokens/s, cache included)
vllm-rs-svc0 | 2026-01-30T16:01:24.445290Z INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [3]], avg. 29 tokens/s per request (total: 29 tokens/s)
vllm-rs-svc0 | 2026-01-30T16:01:29.472767Z INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [3]], avg. 29 tokens/s per request (total: 29 tokens/s)
vllm-rs-svc0 | 2026-01-30T16:01:29.472789Z INFO vllm_rs::core::scheduler: GPU Kvcache: 4013 blocks (256832 tokens) free, used 2.0% (0.49GB/24.00GB); CPU swap used NaN% (NaNGB/0.00GB)
vllm-rs-svc0 | 2026-01-30T16:01:30.806269Z INFO vllm_rs::server::parser: Tool call <tool_call> (151657) found, start buffering!
vllm-rs-svc0 | 2026-01-30T16:01:34.475671Z INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [3]], avg. 29 tokens/s per request (total: 29 tokens/s)
vllm-rs-svc0 | 2026-01-30T16:01:34.646129Z INFO vllm_rs::core::block_manager: Prefix cache insert seq 3 (5278 tokens, 82 blocks)
vllm-rs-svc0 | 2026-01-30T16:01:34.646735Z WARN vllm_rs::server::server: [Seq 3] Tool parse partial, flushing 490 chars |
|
@guoqingbao we need to add a reasoning-level flag defaulting to "moderate" or something other than "philosophy student with access to recreational substances" - the new mechanism seems to be making "thinking" models think waaaay too much about which tool calls to make and how. I just got >2k thinking tokens from the 235B VL trying to figure out how to structure two searxng requests 🤦 |
|
Another apparent problem: the newly formed tool call chunks are not passing through intermediate gateways, there's some sort of structural concern (digging into it) stripping them out in |
Weird, we didn't change the reasoning part. Not sure how it was affected. |
I tested with Claude Code; different agents may not get consistent results. Do you have these problems with Claude Code? |
|
Candle update? |
The candle update only changed cudaforge from a GitHub repository to crates.io (I published it there to support the candle maintainers, as they also want to use it within candle). |
So, in general, is this PR worth merging? |
|
@guoqingbao yes, i think so - the diffusion i'm seeing (esp in longer contexts on bigger models) is likely a sampling collapse accumulating over KV which seems more likely due to changes in cutlass, candle, or attn.rs (or even cudaforge). |
Is that specific to FP8 models? |
|
Oddly no, the V100s are doing it on FP16 DType |
That's a chat template problem: the official Gemma3 repo does not contain a tool-calling template. Here is the resolution: https://www.reddit.com/r/LocalLLaMA/comments/1jauy8d/giving_native_tool_calling_to_gemma_3_or_really/ |
|
Let me merge this first, we have another PR for the precision degradation issue. @sempervictus |
* Use model-specific tool parsers
* Compatible with goose & optional tool call validation
Improve Tool Call Parsing
This PR uses model-specific parsers for tool call parsing in both streaming and non-streaming modes. The goal is consistent parsing across models while remaining robust to partial output and format differences.
Parser selection
Parsers are selected in the following order:
- --enforce-parser (if provided and valid)
- passthrough
Invalid --enforce-parser values result in an error listing valid parser names.
Available parsers: passthrough, json, mistral, qwen, qwen_coder, pythonic, llama, deepseek, glm45_moe, glm47_moe, step3, kimik2, minimax_m2
Streaming vs non-streaming
Streaming uses incremental parsing, accumulating tool call fragments and finalizing them when an end marker is detected. If parsing fails, content falls back to normal text to avoid output loss.
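The buffering behavior described above can be sketched roughly as follows. This is an illustration only, not the PR's code: `StreamBuffer`, `Event`, and `extract_name` are hypothetical names, the `</tool_call>` end marker is assumed to mirror the `<tool_call>` start marker seen in the logs, and the start marker is assumed to arrive whole within one chunk (as it would when it is a single special token). A real parser would use a proper JSON library rather than the string scan used here.

```rust
/// Hypothetical output of the streaming layer: plain text or a parsed call.
#[derive(Debug, PartialEq)]
enum Event {
    Text(String),
    ToolCall { name: String, raw: String },
}

// Start marker as seen in the logs; the end marker is an assumption.
const START: &str = "<tool_call>";
const END: &str = "</tool_call>";

/// Toy stand-in for a model-specific parser: pulls the "name" value out of
/// a JSON-looking object, or returns None if the body does not look like one.
fn extract_name(body: &str) -> Option<String> {
    let body = body.trim();
    if !(body.starts_with('{') && body.ends_with('}')) {
        return None;
    }
    let after_key = &body[body.find("\"name\"")? + "\"name\"".len()..];
    let after_colon = &after_key[after_key.find(':')? + 1..];
    let start = after_colon.find('"')? + 1;
    let len = after_colon[start..].find('"')?;
    Some(after_colon[start..start + len].to_string())
}

struct StreamBuffer {
    buf: String,
    in_call: bool,
}

impl StreamBuffer {
    fn new() -> Self {
        StreamBuffer { buf: String::new(), in_call: false }
    }

    /// Feed one decoded chunk; returns whatever can be emitted so far.
    fn push(&mut self, chunk: &str) -> Vec<Event> {
        self.buf.push_str(chunk);
        let mut out = Vec::new();
        loop {
            if !self.in_call {
                match self.buf.find(START) {
                    Some(i) => {
                        // Emit text before the marker, then start buffering.
                        if i > 0 {
                            out.push(Event::Text(self.buf[..i].to_string()));
                        }
                        self.buf.drain(..i + START.len());
                        self.in_call = true;
                    }
                    None => {
                        if !self.buf.is_empty() {
                            out.push(Event::Text(std::mem::take(&mut self.buf)));
                        }
                        break;
                    }
                }
            } else {
                match self.buf.find(END) {
                    Some(i) => {
                        let body = self.buf[..i].to_string();
                        self.buf.drain(..i + END.len());
                        self.in_call = false;
                        match extract_name(&body) {
                            Some(name) => out.push(Event::ToolCall {
                                name,
                                raw: body.trim().to_string(),
                            }),
                            // Parse failed: fall back to text so nothing is lost.
                            None => out.push(Event::Text(format!("{}{}{}", START, body, END))),
                        }
                    }
                    None => break, // end marker not seen yet, keep buffering
                }
            }
        }
        out
    }

    /// Stream ended: flush a partially buffered call as plain text.
    fn finish(&mut self) -> Option<Event> {
        if !self.in_call {
            return None;
        }
        self.in_call = false;
        Some(Event::Text(format!("{}{}", START, std::mem::take(&mut self.buf))))
    }
}
```

Calling `push` once per decoded chunk and `finish` at end of stream gives the pattern visible in the logs earlier in the thread: a pause while a call is buffered, then either a finalized call or a flush of the partial text.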
Non-streaming reuses the same logic via parse_complete_with_fallback, ensuring identical behavior across both paths.
Enforcing a parser
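A minimal sketch of the selection order and the invalid-value error described under "Parser selection", assuming a hypothetical `select_parser` function. The parser names come from the list in this description; the substring matching on the model id is an invented heuristic for illustration only (the thread's logs show a parser being reported alongside `model_id` and `enforce_parser`, but not how the match is made).

```rust
/// Parser names as listed in the PR description.
const PARSERS: &[&str] = &[
    "passthrough", "json", "mistral", "qwen", "qwen_coder", "pythonic", "llama",
    "deepseek", "glm45_moe", "glm47_moe", "step3", "kimik2", "minimax_m2",
];

/// Hypothetical selection function: an enforced parser wins if valid,
/// an invalid enforced value is an error, and passthrough is the fallback.
fn select_parser(enforce: Option<&str>, model_id: &str) -> Result<&'static str, String> {
    if let Some(name) = enforce {
        // An invalid value produces an error that lists the valid names.
        return PARSERS
            .iter()
            .copied()
            .find(|p| *p == name)
            .ok_or_else(|| {
                format!("invalid parser '{}'; valid parsers: {}", name, PARSERS.join(", "))
            });
    }

    // ASSUMPTION: illustrative model-id heuristic, not the project's real mapping.
    let id = model_id.to_lowercase();
    let guess = if id.contains("qwen3-coder") {
        "qwen_coder"
    } else if id.contains("qwen") {
        "qwen"
    } else if id.contains("mistral") {
        "mistral"
    } else {
        "passthrough"
    };
    Ok(guess)
}
```

Returning a `Result` keeps the invalid `--enforce-parser` case distinct from the silent fallback to passthrough, so a typo in the flag surfaces at startup instead of quietly disabling tool parsing.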