Fix reasoning marker issue #281
9db347e to 8fb76e5
|
Unfortunately this makes the model throw think tags back to clients on every response including tool calls:
I can update #265 to handle those correctly, but I'm wondering whether we need the injection at all, given our control of the generation space, and whether #282 has any merit. |
|
A ~3-line change to #265 handles it:
vllm-rs-svc2 | 2026-03-30T19:25:10.919096Z INFO vllm_rs::core::block_manager: Prefix cache insert seq 0 (1250 tokens, 19 blocks)
vllm-rs-svc2 | 2026-03-30T19:25:10.919197Z INFO vllm_rs::server::parser: Tool call buffering end, reached </tool_call> (248059)
vllm-rs-svc2 | 2026-03-30T19:25:10.919236Z INFO vllm_rs::server::parser: Building tool call: [StreamingToolCallState { name: Some("web_search_searxng"), arguments: "{\"query\": \"vllm.rs\"}" }]
vllm-rs-svc2 | 2026-03-30T19:25:10.919728Z INFO vllm_rs::tools::helpers: Valid tool call(s): web_search_searxng(args={"query":"vllm.rs"})
vllm-rs-svc2 | 2026-03-30T19:25:10.919738Z INFO vllm_rs::server::server: Final chunk emitted after tool-call delta chunk(s): ChatCompletionChunk { id: "seq-0", object: "chat.completion.chunk", created: 1774898710656, model: "default", choices: [ChatChoiceChunk { index: 0, delta: Delta { role: None, content: None, reasoning_content: None, tool_calls: None }, finish_reason: Some("tool_calls"), error: None }], usage: Some(Usage { prompt_tokens: 1213, completion_tokens: 37, total_tokens: 1250 }) }
vllm-rs-svc2 | 2026-03-30T19:25:10.919755Z WARN vllm_rs::server::server: --- Performance Metrics ---
vllm-rs-svc2 | 2026-03-30T19:25:10.919760Z INFO vllm_rs::server::server: [Seq 0] ⏱️ Prompt: 1213 tokens in 0.09s (13043.01 t/s)
vllm-rs-svc2 | 2026-03-30T19:25:10.919767Z INFO vllm_rs::server::server: [Seq 0] ⏱️ Decoded: 37 tokens in 0.17s (220.24 t/s)
vllm-rs-svc2 | 2026-03-30T19:25:10.940527Z INFO vllm_rs::core::scheduler: GPU Kvcache: 109 blocks (6976 tokens) free, used 14.8% (0.01GB/0.05GB); CPU swap used 0.0% (0.00GB/0.01GB)
vllm-rs-svc2 | 2026-03-30T19:25:10.940543Z INFO vllm_rs::core::scheduler: GPU MambaState: 1 / 30 slots used (3.3%), approx 0.01GB/0.28GB (slot 9.32MB)
vllm-rs-svc2 | 2026-03-30T19:25:11.996092Z WARN vllm_rs::server::parser: Tool start token IDs corrected from tokenizer for model Qwen3VL: {248058}
vllm-rs-svc2 | 2026-03-30T19:25:11.996128Z WARN vllm_rs::server::parser: Tool end token IDs corrected from tokenizer for model Qwen3VL: {248059}
vllm-rs-svc2 | 2026-03-30T19:25:11.996279Z WARN vllm_rs::server::server: Tools enabled for request
vllm-rs-svc2 | 2026-03-30T19:25:11.996922Z INFO vllm_rs::core::engine: [llg] Guidance enabled, trimming <think> from pre-generation. Generation starting at <|im_start|>assistant
vllm-rs-svc2 | 2026-03-30T19:25:12.001928Z INFO vllm_rs::core::prefix_cache: Prefix cache exact match: 18 blocks matched (tolerance: 0.05)
vllm-rs-svc2 | 2026-03-30T19:25:12.029106Z WARN vllm_rs::core::engine: [Stream] New request [Seq_id 1, 4070 tokens] received! (session_id: None)
vllm-rs-svc2 |
vllm-rs-svc2 | 2026-03-30T19:25:12.029145Z INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3.5-0.8B, enforce_parser=none)
vllm-rs-svc2 | 2026-03-30T19:25:12.029219Z INFO vllm_rs::core::prefix_cache: Prefix cache exact match: 18 blocks matched (tolerance: 0.05)
vllm-rs-svc2 | 2026-03-30T19:25:12.029536Z INFO vllm_rs::core::prefix_cache: Prefix cache exact match: 18 blocks matched (tolerance: 0.05)
vllm-rs-svc2 | 2026-03-30T19:25:12.029634Z INFO vllm_rs::core::block_manager: Prefix cache hit seq 1 (1152 cached tokens, 18 blocks)
vllm-rs-svc2 | 2026-03-30T19:25:12.030740Z INFO vllm_rs::core::runner: Restored mamba prefix state for seq 1 (cached 1152 tokens)
vllm-rs-svc2 | 2026-03-30T19:25:12.030940Z INFO vllm_rs::core::runner: Restored mamba prefix state for seq 1 (cached 1152 tokens)
vllm-rs-svc2 | 2026-03-30T19:25:12.040993Z INFO vllm_rs::utils::guidance: GRAMMAR:
vllm-rs-svc2 | start: ( text | tool_call )+ eos
vllm-rs-svc2 | tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
vllm-rs-svc2 | tool_0: "<function=fetch_url_via_curl>" "\n" param_0_0 ( "\n" param_0_1)? "</function>" "\n"
vllm-rs-svc2 | tool_content: tool_0 ? tool_1 ? tool_2 ? tool_3 ? tool_4
vllm-rs-svc2 | tool_1: "<function=fs_cat>" "\n" param_1_0 "</function>" "\n"
vllm-rs-svc2 | tool_2: "<function=fs_ls>" "\n" param_2_0 "</function>" "\n"
vllm-rs-svc2 | tool_3: "<function=get_current_time>" "</function>" "\n"
vllm-rs-svc2 | tool_4: "<function=web_search_searxng>" "\n" param_4_0 ( "\n" param_4_1)? "</function>" "\n"
vllm-rs-svc2 | param_0_0: "<parameter=url>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_0_1: "<parameter=proxy>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_1_0: "<parameter=path>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_2_0: "<parameter=path>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_4_0: "<parameter=query>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_4_1: "<parameter=searxng>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | text: /(?s:.+?)/
vllm-rs-svc2 | value_string[suffix="\n</parameter>\n"]: /(?s:.+?)/
vllm-rs-svc2 | eos: ( <[248046]> <[248044]> )The py |
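The grammar in the log above constrains each tool call to a `<function=NAME>` wrapper holding `<parameter=KEY>` entries. As a rough illustration of what that shape means for a parser, here is a minimal sketch that turns such text back into a name and arguments. `ParsedToolCall` and `parse_tool_call` are hypothetical names made up for this example; this is not the qwen_coder parser in vllm.rs, and it assumes values never contain the closing markers.

```rust
/// Hypothetical container for one parsed call; not a vllm.rs type.
#[derive(Debug, PartialEq)]
pub struct ParsedToolCall {
    pub name: String,
    pub arguments: Vec<(String, String)>,
}

/// Parses text shaped like the grammar's `tool_N` rules:
/// `<function=NAME>`, zero or more `<parameter=KEY>\nVALUE\n</parameter>`
/// entries, then `</function>`. Returns None when a marker is missing.
pub fn parse_tool_call(text: &str) -> Option<ParsedToolCall> {
    const FN_OPEN: &str = "<function=";
    const FN_CLOSE: &str = "</function>";
    const PARAM_OPEN: &str = "<parameter=";
    const PARAM_CLOSE: &str = "</parameter>";

    // Function name sits between "<function=" and the next '>'.
    let start = text.find(FN_OPEN)? + FN_OPEN.len();
    let name_end = start + text[start..].find('>')?;
    let name = text[start..name_end].to_string();

    // Everything up to "</function>" is the parameter body.
    let body_start = name_end + 1;
    let body_end = body_start + text[body_start..].find(FN_CLOSE)?;
    let mut body = &text[body_start..body_end];

    let mut arguments = Vec::new();
    while let Some(pos) = body.find(PARAM_OPEN) {
        let key_start = pos + PARAM_OPEN.len();
        let key_end = key_start + body[key_start..].find('>')?;
        let value_start = key_end + 1;
        let value_end = value_start + body[value_start..].find(PARAM_CLOSE)?;
        // The grammar wraps each value in single newlines; strip exactly those.
        let raw = &body[value_start..value_end];
        let raw = raw.strip_prefix('\n').unwrap_or(raw);
        let value = raw.strip_suffix('\n').unwrap_or(raw);
        arguments.push((body[key_start..key_end].to_string(), value.to_string()));
        body = &body[value_end + PARAM_CLOSE.len()..];
    }
    Some(ParsedToolCall { name, arguments })
}
```

Fed the text for the `web_search_searxng` call from the earlier log, this yields the name `web_search_searxng` and a single `query` argument with the value `vllm.rs`.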
|
So far so good - the 122B responds correctly with thinking, the 80B coder does not think (yet), tool calls are sharp, and prefix hits always show the number of blocks committed by the last turn minus 1, since the prefix cache discards partials.
#282 does not appear to be breaking anything, but we're not falling through to it on mismatches with this, and #265 is cleaning up the anomalous generation of those |
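The block counts in the logs can be checked by hand. The sketch below assumes 64-token blocks, a figure inferred from the scheduler line "109 blocks (6976 tokens) free" rather than read from the vllm.rs source, and the helper names are hypothetical.

```rust
/// Block size inferred from the scheduler log line
/// "109 blocks (6976 tokens) free": 6976 / 109 = 64. This is an assumption
/// drawn from the log, not a constant taken from vllm.rs.
pub const BLOCK_SIZE: usize = 64;

/// Number of completely filled blocks for a sequence; a trailing partial
/// block is not counted.
pub fn full_blocks(num_tokens: usize) -> usize {
    num_tokens / BLOCK_SIZE
}

/// Tokens covered by a given number of cached blocks.
pub fn cached_tokens(num_blocks: usize) -> usize {
    num_blocks * BLOCK_SIZE
}
```

`full_blocks(1250)` gives 19, matching "Prefix cache insert seq 0 (1250 tokens, 19 blocks)", and `cached_tokens(18)` gives 1152, matching "Prefix cache hit seq 1 (1152 cached tokens, 18 blocks)" on the next turn. The gap between 19 inserted and 18 matched is the "minus 1" noted above.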
It returns the reasoning parts via reasoning_content; popular AI agents such as opencode, Kilo Code, and Claude Code use that field. |
The current approach in this PR follows the standard and the specs. If a client is unable to handle that, it's a client issue, and no more changes are needed. The new approach uses reasoning_content, a specific chunk type for streaming the reasoning process. |
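To make the reasoning_content idea concrete, here is a minimal sketch of routing streamed text into reasoning and content deltas based on `<think>` / `</think>` markers, including the case where a marker is split across chunks. `Delta` and `ReasoningSplitter` are hypothetical names for illustration only; the PR's actual implementation may work at the token-ID level and differ substantially.

```rust
/// Hypothetical delta kinds, loosely mirroring the `reasoning_content`
/// and `content` fields seen in the streamed chunk log.
#[derive(Debug, PartialEq)]
pub enum Delta {
    Reasoning(String),
    Content(String),
}

/// Illustrative splitter; not the type used in vllm.rs.
#[derive(Default)]
pub struct ReasoningSplitter {
    in_think: bool,
    pending: String,
}

impl ReasoningSplitter {
    /// Feeds one streamed text chunk and returns the deltas that are safe
    /// to emit. Text that might be the start of a marker is held back
    /// until the next chunk resolves it.
    pub fn push(&mut self, chunk: &str) -> Vec<Delta> {
        self.pending.push_str(chunk);
        let mut out = Vec::new();
        loop {
            let tag = if self.in_think { "</think>" } else { "<think>" };
            if let Some(pos) = self.pending.find(tag) {
                // Full marker found: flush what precedes it, then switch mode.
                let before = self.pending[..pos].to_string();
                self.emit(before, &mut out);
                self.pending = self.pending[pos + tag.len()..].to_string();
                self.in_think = !self.in_think;
            } else {
                // Hold back the longest suffix that is a proper prefix of the marker.
                let hold = (1..tag.len())
                    .rev()
                    .find(|&k| self.pending.ends_with(&tag[..k]))
                    .unwrap_or(0);
                let cut = self.pending.len() - hold;
                let ready = self.pending[..cut].to_string();
                self.emit(ready, &mut out);
                self.pending = self.pending[cut..].to_string();
                return out;
            }
        }
    }

    fn emit(&self, text: String, out: &mut Vec<Delta>) {
        if text.is_empty() {
            return;
        }
        out.push(if self.in_think {
            Delta::Reasoning(text)
        } else {
            Delta::Content(text)
        });
    }
}
```

With this shape, a client that understands reasoning_content can render the reasoning deltas separately, while the markers themselves never reach the content stream.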
|
So it looks like images can still break prefix lookups :-\
vllm-rs-svc0 | 2026-03-31T00:52:32.475971Z WARN vllm_rs::server::server: Tools enabled for request
vllm-rs-svc0 | 2026-03-31T00:52:32.484124Z INFO vllm_rs::server: Chat image decoded: 1238 x 1222
vllm-rs-svc0 | 2026-03-31T00:52:32.565365Z INFO vllm_rs::server: Chat image decoded: 1568 x 980
vllm-rs-svc0 | 2026-03-31T00:52:32.672580Z INFO vllm_rs::server: 2 images detected in the chat message, combined image shape [2, 3024, 1536]
vllm-rs-svc0 | 2026-03-31T00:52:33.025641Z ERROR vllm_rs::server::server: Stream generation failed: Remaining 5120 kvcache tokens, but your prompt requires 182016 new tokens, please request later! |
Agreed, it's mostly the TUI-based Rust ones which seem to have issues, but the web-rendered ones simply present the chunk without "hiding" it behind a thinking modal in a slightly different font. Have been running this on some V100s all day and a merge with #265 on the 6KPros - aside from some clients "not liking being forced to see think tags" I think the operating logic is sound. I have still seen a few cache misses, but I think we won't be able to avoid some level of that until we figure out a way to build |
|
With just this PR, or with #265 aligned as it currently is, the clients which don't handle the think chunks correctly (or when no thinking params are sent) all do this: In a multi-turn session, does that help the match occur correctly or not, given the channelization? Prior to the last commit in #265 these were being removed at the last second from tokenization through the template to ensure alignment with the grammar, but I pushed the last commit to ensure alignment with this PR, so I figure it's worth asking: do we actually need those in the final content tokenized in the chat template, given that I saw no cache misses while they were being stripped (you'd mentioned there's no log of it, so maybe I missed it happening)? |
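For reference, the kind of stripping discussed above can be sketched as follows, assuming "these" refers to `<think>...</think>` spans in prior assistant content. `strip_think` is a hypothetical helper written for this thread, not the code from #265.

```rust
/// Removes every `<think>...</think>` span from `content`.
/// An unterminated `<think>` drops everything after it.
/// Hypothetical helper for illustration; not taken from vllm.rs.
pub fn strip_think(content: &str) -> String {
    const OPEN: &str = "<think>";
    const CLOSE: &str = "</think>";
    let mut out = String::with_capacity(content.len());
    let mut rest = content;
    while let Some(start) = rest.find(OPEN) {
        // Keep the text before the opening marker.
        out.push_str(&rest[..start]);
        match rest[start..].find(CLOSE) {
            // Skip past the closing marker and keep scanning.
            Some(end) => rest = &rest[start + end + CLOSE.len()..],
            // No closing marker: treat the remainder as reasoning.
            None => return out,
        }
    }
    out.push_str(rest);
    out
}
```

Whether applying something like this before the chat template helps or hurts prefix matching depends on whether the cached blocks were built from stripped or unstripped text, which is the open question here.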
That's a client issue. This means reasoning is off. |
It’s essentially an all-cache-hit vs. partial-hit trade-off. In this PR, I’m allowing slight partial cache misses (still maintaining >95% content cache hit rates for tool calls). The cache miss you observed is another symptom, due to Mamba state eviction (some states were evicted). |
* Working solution
* Thought process sent via reasoning_content chunk
* Improve claude path
* Minor fix
* ReadMe for Kilo Code

A clean implementation to replace #279