Fix reasoning marker issue by guoqingbao · Pull Request #281 · guoqingbao/xinfer

guoqingbao · 2026-03-30T07:03:06Z

A clean implementation to replace #279.

sempervictus · 2026-03-30T18:49:58Z

Hmm, seems to be minimal merge difference applying with #265 and #282 😄 - checking the openai clients to see if they're happy with this approach. Thanks as always

sempervictus · 2026-03-30T19:03:17Z

Unfortunately this makes the model throw think tags back to clients on every response including tool calls:

$ aichat -m g62 --role %research% web search for vllm.rs

<think>

</think>

Call web_search_searxng {"query":"vllm.rs"}
Processing 4096 tokens per chunk

<think>

</think>

vLLM.rs: A Minimalist vLLM in Rust

Core Architecture

vLLM.rs is a high-performance large language model inference engine written in Rust, implemented as a minimal alternative to the traditional PyTorch-based vLLM. It achieves state-of-the-art performance with no PyTorch dependency, leveraging Rust's type safety and performance characteristics.

I can update #265 to handle those correctly but i'm wondering if we need the injection at all given our control of the generation space and if #282 has any merit.

sempervictus · 2026-03-30T19:30:01Z

~3 line change to #265 handles it:

$ aichat -m g62 --role %research% web search for vllm.rs
Call web_search_searxng {"query":"vllm.rs"}
Processing 4096 tokens per chunk

vLLM.rs

Overview

vLLM.rs is a minimal, high-performance large language model (LLM) inference engine implemented in Rust. It provides fast, memory-efficient text generation with support for distributed GPU processing.

Architecture

...

vllm-rs-svc2  | 2026-03-30T19:25:10.919096Z  INFO vllm_rs::core::block_manager: Prefix cache insert seq 0 (1250 tokens, 19 blocks)
vllm-rs-svc2  | 2026-03-30T19:25:10.919197Z  INFO vllm_rs::server::parser: Tool call buffering end, reached </tool_call> (248059)
vllm-rs-svc2  | 2026-03-30T19:25:10.919236Z  INFO vllm_rs::server::parser: Building tool call: [StreamingToolCallState { name: Some("web_search_searxng"), arguments: "{\"query\": \"vllm.rs\"}" }]
vllm-rs-svc2  | 2026-03-30T19:25:10.919728Z  INFO vllm_rs::tools::helpers: Valid tool call(s): web_search_searxng(args={"query":"vllm.rs"})
vllm-rs-svc2  | 2026-03-30T19:25:10.919738Z  INFO vllm_rs::server::server: Final chunk emitted after tool-call delta chunk(s): ChatCompletionChunk { id: "seq-0", object: "chat.completion.chunk", created: 1774898710656, model: "default", choices: [ChatChoiceChunk { index: 0, delta: Delta { role: None, content: None, reasoning_content: None, tool_calls: None }, finish_reason: Some("tool_calls"), error: None }], usage: Some(Usage { prompt_tokens: 1213, completion_tokens: 37, total_tokens: 1250 }) }
vllm-rs-svc2  | 2026-03-30T19:25:10.919755Z  WARN vllm_rs::server::server: --- Performance Metrics ---
vllm-rs-svc2  | 2026-03-30T19:25:10.919760Z  INFO vllm_rs::server::server: [Seq 0] ⏱️ Prompt: 1213 tokens in 0.09s (13043.01 t/s)
vllm-rs-svc2  | 2026-03-30T19:25:10.919767Z  INFO vllm_rs::server::server: [Seq 0] ⏱️ Decoded: 37 tokens in 0.17s (220.24 t/s)
vllm-rs-svc2  | 2026-03-30T19:25:10.940527Z  INFO vllm_rs::core::scheduler: GPU Kvcache: 109 blocks (6976 tokens) free, used 14.8% (0.01GB/0.05GB); CPU swap used 0.0% (0.00GB/0.01GB)
vllm-rs-svc2  | 2026-03-30T19:25:10.940543Z  INFO vllm_rs::core::scheduler: GPU MambaState: 1 / 30 slots used (3.3%), approx 0.01GB/0.28GB (slot 9.32MB)
vllm-rs-svc2  | 2026-03-30T19:25:11.996092Z  WARN vllm_rs::server::parser: Tool start token IDs corrected from tokenizer for model Qwen3VL: {248058}
vllm-rs-svc2  | 2026-03-30T19:25:11.996128Z  WARN vllm_rs::server::parser: Tool end token IDs corrected from tokenizer for model Qwen3VL: {248059}
vllm-rs-svc2  | 2026-03-30T19:25:11.996279Z  WARN vllm_rs::server::server: Tools enabled for request
vllm-rs-svc2  | 2026-03-30T19:25:11.996922Z  INFO vllm_rs::core::engine: [llg] Guidance enabled, trimming <think> from pre-generation. Generation starting at <|im_start|>assistant
vllm-rs-svc2  | 2026-03-30T19:25:12.001928Z  INFO vllm_rs::core::prefix_cache: Prefix cache exact match: 18 blocks matched (tolerance: 0.05)
vllm-rs-svc2  | 2026-03-30T19:25:12.029106Z  WARN vllm_rs::core::engine: [Stream] New request [Seq_id 1, 4070 tokens] received! (session_id: None)
vllm-rs-svc2  | 
vllm-rs-svc2  | 2026-03-30T19:25:12.029145Z  INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3.5-0.8B, enforce_parser=none)
vllm-rs-svc2  | 2026-03-30T19:25:12.029219Z  INFO vllm_rs::core::prefix_cache: Prefix cache exact match: 18 blocks matched (tolerance: 0.05)
vllm-rs-svc2  | 2026-03-30T19:25:12.029536Z  INFO vllm_rs::core::prefix_cache: Prefix cache exact match: 18 blocks matched (tolerance: 0.05)
vllm-rs-svc2  | 2026-03-30T19:25:12.029634Z  INFO vllm_rs::core::block_manager: Prefix cache hit seq 1 (1152 cached tokens, 18 blocks)
vllm-rs-svc2  | 2026-03-30T19:25:12.030740Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 1 (cached 1152 tokens)
vllm-rs-svc2  | 2026-03-30T19:25:12.030940Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 1 (cached 1152 tokens)
vllm-rs-svc2  | 2026-03-30T19:25:12.040993Z  INFO vllm_rs::utils::guidance: GRAMMAR:
vllm-rs-svc2  | start: ( text | tool_call )+ eos
vllm-rs-svc2  | tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
vllm-rs-svc2  | tool_0: "<function=fetch_url_via_curl>"  "\n" param_0_0 ( "\n" param_0_1)? "</function>" "\n"
vllm-rs-svc2  | tool_content: tool_0 ? tool_1 ? tool_2 ? tool_3 ? tool_4
vllm-rs-svc2  | tool_1: "<function=fs_cat>"  "\n" param_1_0 "</function>" "\n"
vllm-rs-svc2  | tool_2: "<function=fs_ls>"  "\n" param_2_0 "</function>" "\n"
vllm-rs-svc2  | tool_3: "<function=get_current_time>"  "</function>" "\n"
vllm-rs-svc2  | tool_4: "<function=web_search_searxng>"  "\n" param_4_0 ( "\n" param_4_1)? "</function>" "\n"
vllm-rs-svc2  | param_0_0: "<parameter=url>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_0_1: "<parameter=proxy>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_1_0: "<parameter=path>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_2_0: "<parameter=path>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_0: "<parameter=query>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_1: "<parameter=searxng>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | text: /(?s:.+?)/
vllm-rs-svc2  | value_string[suffix="\n</parameter>\n"]: /(?s:.+?)/
vllm-rs-svc2  | eos:  ( <[248046]> <[248044]> )

The py vllm tracks KV blocks at the Request level, effectively tracking session_id like you did originally, which absolves us of the whole matching problem. I'll put this through its paces on the coder model and see how it does working out the reasoning fallback grammar.

sempervictus · 2026-03-30T22:25:52Z

So far so good: 122B responds correctly with thinking, 80B coder does not think (yet), tool calls are sharp, and prefix hits always show number-of-blocks-committed-by-last-turn minus 1 since the pfx cache discards partials.

#282 does not appear to be breaking anything, but we're not falling through to it on mismatches with this PR, and #265 is correctly cleaning up the anomalous generation of those <think> tags, although I haven't tested whether that still works with the guidance functions fully disabled.
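The block counts in the log above are consistent with whole-block accounting. A small Python sketch of that arithmetic, assuming a block size of 64 tokens (inferred from the log's 1152 cached tokens over 18 blocks, not confirmed against the source); the function name is hypothetical, and the "minus 1" is simply the behavior observed in the log, not something derived here:

```python
BLOCK_SIZE = 64  # assumption: 1152 cached tokens / 18 blocks in the log above


def full_blocks(num_tokens, block_size=BLOCK_SIZE):
    """Count completely filled KV blocks; a trailing partial block is ignored."""
    return num_tokens // block_size


# Turn 1: "Prefix cache insert seq 0 (1250 tokens, 19 blocks)"
inserted_blocks = full_blocks(1250)
leftover_tokens = 1250 - inserted_blocks * BLOCK_SIZE  # tokens in the partial block

# Turn 2: "Prefix cache hit seq 1 (1152 cached tokens, 18 blocks)"
matched_blocks = inserted_blocks - 1  # observed: one block fewer than inserted
cached_tokens = matched_blocks * BLOCK_SIZE

print(inserted_blocks, leftover_tokens, matched_blocks, cached_tokens)
```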

guoqingbao · 2026-03-31T00:16:21Z

> Unfortunately this makes the model throw think tags back to clients on every response including tool calls:

It returns the reasoning part via reasoning_content, which popular AI agents such as opencode, Kilo Code, and Claude Code use.

guoqingbao · 2026-03-31T00:20:57Z

> I can update #265 to handle those correctly but i'm wondering if we need the injection at all given our control of the generation space and if #282 has any merit.

The current approach in this PR follows the standard and specs; if a client is unable to handle that, it's a client issue and no more changes are needed. The new approach uses reasoning_content, a specific chunk type for streaming the reasoning process.
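For clients that want to consume the two channels separately, a minimal Python sketch follows. It assumes chunks shaped like the ChatCompletionChunk in the server log earlier in this thread (a `delta` carrying `content`, `reasoning_content`, and `tool_calls`); the function name `split_stream` and the sample chunks are hypothetical, and this is not code from this PR:

```python
def split_stream(chunks):
    """Accumulate reasoning text and answer text from streamed chunks."""
    reasoning, answer = [], []
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            # Reasoning arrives on its own field, not inside <think> tags.
            if delta.get("reasoning_content"):
                reasoning.append(delta["reasoning_content"])
            # Regular assistant text stays on the usual content field.
            if delta.get("content"):
                answer.append(delta["content"])
    return "".join(reasoning), "".join(answer)


# Hypothetical chunks, already parsed from the SSE stream into dicts.
sample_chunks = [
    {"choices": [{"index": 0, "delta": {"reasoning_content": "Need a web search. "}}]},
    {"choices": [{"index": 0, "delta": {"reasoning_content": "Then summarize."}}]},
    {"choices": [{"index": 0, "delta": {"content": "vLLM.rs is "}}]},
    {"choices": [{"index": 0, "delta": {"content": "an inference engine."}}]},
    # Final chunk: every delta field is null, only finish_reason is set.
    {"choices": [{"index": 0,
                  "delta": {"role": None, "content": None,
                            "reasoning_content": None, "tool_calls": None},
                  "finish_reason": "stop"}]},
]

reasoning_text, answer_text = split_stream(sample_chunks)
print(reasoning_text)
print(answer_text)
```

A client that ignores `reasoning_content` entirely would still get a clean answer from `content` alone under this shape.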

sempervictus · 2026-03-31T00:54:23Z

So it looks like images can still break prefix lookups :-\

vllm-rs-svc0  | 2026-03-31T00:52:32.475971Z  WARN vllm_rs::server::server: Tools enabled for request
vllm-rs-svc0  | 2026-03-31T00:52:32.484124Z  INFO vllm_rs::server: Chat image decoded: 1238 x 1222
vllm-rs-svc0  | 2026-03-31T00:52:32.565365Z  INFO vllm_rs::server: Chat image decoded: 1568 x 980
vllm-rs-svc0  | 2026-03-31T00:52:32.672580Z  INFO vllm_rs::server: 2 images detected in the chat message, combined image shape [2, 3024, 1536]
vllm-rs-svc0  | 2026-03-31T00:52:33.025641Z ERROR vllm_rs::server::server: Stream generation failed: Remaining 5120 kvcache tokens, but your prompt requires 182016 new tokens, please request later!

sempervictus · 2026-03-31T00:59:21Z

> The current approach in this PR follows the standard and specs, if the client unable to handle that, it's the client issue, no more changes needed. The new approach uses reasoning_content, a specific chunk type for streaming reasoning process.

Agreed, it's mostly the TUI-based Rust ones which seem to have issues, but the web-rendered ones simply present the chunk without "hiding" it behind a thinking modal in a slightly different font.

Have been running this on some V100s all day and a merge with 265 on the 6KPros. Aside from some clients "not liking being forced to see think tags", I think the operating logic is sound. I have still seen a few cache misses, but I think we won't be able to avoid some level of that until we figure out a way to build ConversationState or something along those lines, which has all content and generation parameters (sampling, grammars, etc.) affixed to each turn for perfect replay/regeneration.

sempervictus · 2026-03-31T05:28:22Z

With just this PR or with 265 aligned as it currently is, the clients which don't handle the think-chunks correctly (or when no thinking params are sent) all do this:

$ aichat -m g62 --role %research% web search for vllm.rs
<think>



</think>



Call web_search_searxng {"query":"vllm.rs"}
Processing 4096 tokens per chunk
<think>



</think>



## Analysis of vLLM.rs

**Core Identity:** vLLM.rs is a minimalist, high-performance large language model inference engine in Rust, implemented by the vLLM project.

In a multi-turn session, does that help the match occur correctly or not, given the channelization? Prior to the last commit in #265 these were being removed at the last second from tokenization through the template to ensure alignment with grammar, but I pushed the last commit to ensure alignment with this PR, so I figure it's worth asking: do we actually need those in the final content tokenized in the chat template, given that I saw no cache misses while they were being stripped (you'd mentioned there's no log of it, so maybe I missed it happening)?
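For illustration only, a minimal Python sketch of stripping empty think blocks from assistant content before it is rendered through a chat template again. The regex and the function name are hypothetical and this is not the code in #265; it removes only blocks that contain nothing but whitespace, so real reasoning text is left intact:

```python
import re

# Matches <think> ... </think> with only whitespace inside, plus trailing whitespace.
EMPTY_THINK = re.compile(r"<think>\s*</think>\s*")


def strip_empty_think(content):
    """Drop empty <think></think> blocks; keep blocks that carry reasoning."""
    return EMPTY_THINK.sub("", content)


raw = "<think>\n\n\n\n</think>\n\n\n\nCall web_search_searxng"
cleaned = strip_empty_think(raw)

with_reasoning = "<think>plan the search</think>\nAnswer"
kept = strip_empty_think(with_reasoning)

print(repr(cleaned))
print(repr(kept))
```

Whether stripping at this point keeps the tokenized history aligned with what was cached is exactly the open question above; the sketch only shows the string-level operation.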

guoqingbao · 2026-03-31T06:07:55Z

> With just this PR or with 265 aligned as it currently is, the clients which don't handle the think-chunks correctly (or when no thinking params are sent) all do this:

That's a client issue.

<think>



</think>

This means reasoning is off.

guoqingbao · 2026-03-31T06:13:41Z

> do we actually need those in the final content tokenized in the chat template given that i saw no cache misses while they were being stripped (youd mentioned there's no log of it so maybe i missed it happening)?

It’s essentially an all-cache-hit vs. partial-hit trade-off. In this PR, I’m allowing slight partial cache misses (still maintaining >95% content cache hit rates for tool calls). The cache miss you observed is another symptom, due to Mamba state eviction (some states were evicted).

* Working solution
* Thought process sent via reasoning_content chunk
* Improve claude path
* Minor fix
* ReadMe for Kilo Code

Working solution

8fb76e5

guoqingbao force-pushed the decoding_cache branch from 9db347e to 8fb76e5 Compare March 30, 2026 11:26

Thought process sent via reasoning_content chunk

bc1efaa

guoqingbao changed the title ~~Fix decoding cache mismatch~~ Fix reasoning marker issue Mar 30, 2026

guoqingbao mentioned this pull request Mar 30, 2026

Rewrite special markers in prompts to fix cache mismatch #279

Closed

Improve claude path

30a97a4

guoqingbao added 2 commits March 31, 2026 05:51

Minor fix

950e3e5

ReadMe for Kilo Code

0b73e89

guoqingbao merged commit 7f1b6ce into main Mar 31, 2026
1 check passed

sempervictus mentioned this pull request Mar 31, 2026

Thought Exercise - Chained Semantic Prefix Cache Matching #282

Closed

This was referenced Mar 31, 2026

Prefix Cache Misses #276

Closed

First think tag missing in output #221

Closed

Improve tool call handling for reasoning models EricLBuehler/candle-vllm#399

Merged

sempervictus mentioned this pull request Apr 4, 2026

Support mxfp4 and nvfp4 models #285

Merged

guoqingbao deleted the decoding_cache branch April 30, 2026 16:27

guoqingbao added a commit that referenced this pull request May 21, 2026

Fix reasoning marker issue (#281)

f827b8d

* Working solution
* Thought process sent via reasoning_content chunk
* Improve claude path
* Minor fix
* ReadMe for Kilo Code
