Fix reasoning marker issue #281

Merged
guoqingbao merged 5 commits into main from decoding_cache on Mar 31, 2026
@guoqingbao guoqingbao commented Mar 30, 2026

Owner

A clean implementation to replace #279.

@guoqingbao guoqingbao changed the title Fix decoding cache mismatch Fix reasoning marker issue Mar 30, 2026
@sempervictus

Contributor

Hmm, this seems to apply alongside #265 and #282 with minimal merge differences 😄. Checking the OpenAI clients to see if they're happy with this approach. Thanks as always.

@sempervictus

Contributor

Unfortunately this makes the model throw think tags back to clients on every response including tool calls:

$ aichat -m g62 --role %research% web search for vllm.rs

<think>

</think>

Call web_search_searxng {"query":"vllm.rs"}
Processing 4096 tokens per chunk

<think>

</think>

vLLM.rs: A Minimalist vLLM in Rust

Core Architecture

vLLM.rs is a high-performance large language model inference engine written in Rust, implemented as a minimal alternative to the traditional PyTorch-based vLLM. It achieves state-of-the-art performance with no PyTorch dependency, leveraging Rust's type safety and performance characteristics.

I can update #265 to handle those correctly, but I'm wondering whether we need the injection at all, given our control of the generation space, and whether #282 has any merit.

@sempervictus

Contributor

A ~3 line change to #265 handles it:

$ aichat -m g62 --role %research% web search for vllm.rs
Call web_search_searxng {"query":"vllm.rs"}
Processing 4096 tokens per chunk

vLLM.rs

Overview

vLLM.rs is a minimal, high-performance large language model (LLM) inference engine implemented in Rust. It provides fast, memory-efficient text generation with support for distributed GPU processing.

Architecture

...

vllm-rs-svc2  | 2026-03-30T19:25:10.919096Z  INFO vllm_rs::core::block_manager: Prefix cache insert seq 0 (1250 tokens, 19 blocks)
vllm-rs-svc2  | 2026-03-30T19:25:10.919197Z  INFO vllm_rs::server::parser: Tool call buffering end, reached </tool_call> (248059)
vllm-rs-svc2  | 2026-03-30T19:25:10.919236Z  INFO vllm_rs::server::parser: Building tool call: [StreamingToolCallState { name: Some("web_search_searxng"), arguments: "{\"query\": \"vllm.rs\"}" }]
vllm-rs-svc2  | 2026-03-30T19:25:10.919728Z  INFO vllm_rs::tools::helpers: Valid tool call(s): web_search_searxng(args={"query":"vllm.rs"})
vllm-rs-svc2  | 2026-03-30T19:25:10.919738Z  INFO vllm_rs::server::server: Final chunk emitted after tool-call delta chunk(s): ChatCompletionChunk { id: "seq-0", object: "chat.completion.chunk", created: 1774898710656, model: "default", choices: [ChatChoiceChunk { index: 0, delta: Delta { role: None, content: None, reasoning_content: None, tool_calls: None }, finish_reason: Some("tool_calls"), error: None }], usage: Some(Usage { prompt_tokens: 1213, completion_tokens: 37, total_tokens: 1250 }) }
vllm-rs-svc2  | 2026-03-30T19:25:10.919755Z  WARN vllm_rs::server::server: --- Performance Metrics ---
vllm-rs-svc2  | 2026-03-30T19:25:10.919760Z  INFO vllm_rs::server::server: [Seq 0] ⏱️ Prompt: 1213 tokens in 0.09s (13043.01 t/s)
vllm-rs-svc2  | 2026-03-30T19:25:10.919767Z  INFO vllm_rs::server::server: [Seq 0] ⏱️ Decoded: 37 tokens in 0.17s (220.24 t/s)
vllm-rs-svc2  | 2026-03-30T19:25:10.940527Z  INFO vllm_rs::core::scheduler: GPU Kvcache: 109 blocks (6976 tokens) free, used 14.8% (0.01GB/0.05GB); CPU swap used 0.0% (0.00GB/0.01GB)
vllm-rs-svc2  | 2026-03-30T19:25:10.940543Z  INFO vllm_rs::core::scheduler: GPU MambaState: 1 / 30 slots used (3.3%), approx 0.01GB/0.28GB (slot 9.32MB)
vllm-rs-svc2  | 2026-03-30T19:25:11.996092Z  WARN vllm_rs::server::parser: Tool start token IDs corrected from tokenizer for model Qwen3VL: {248058}
vllm-rs-svc2  | 2026-03-30T19:25:11.996128Z  WARN vllm_rs::server::parser: Tool end token IDs corrected from tokenizer for model Qwen3VL: {248059}
vllm-rs-svc2  | 2026-03-30T19:25:11.996279Z  WARN vllm_rs::server::server: Tools enabled for request
vllm-rs-svc2  | 2026-03-30T19:25:11.996922Z  INFO vllm_rs::core::engine: [llg] Guidance enabled, trimming <think> from pre-generation. Generation starting at <|im_start|>assistant
vllm-rs-svc2  | 2026-03-30T19:25:12.001928Z  INFO vllm_rs::core::prefix_cache: Prefix cache exact match: 18 blocks matched (tolerance: 0.05)
vllm-rs-svc2  | 2026-03-30T19:25:12.029106Z  WARN vllm_rs::core::engine: [Stream] New request [Seq_id 1, 4070 tokens] received! (session_id: None)
vllm-rs-svc2  | 
vllm-rs-svc2  | 2026-03-30T19:25:12.029145Z  INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3.5-0.8B, enforce_parser=none)
vllm-rs-svc2  | 2026-03-30T19:25:12.029219Z  INFO vllm_rs::core::prefix_cache: Prefix cache exact match: 18 blocks matched (tolerance: 0.05)
vllm-rs-svc2  | 2026-03-30T19:25:12.029536Z  INFO vllm_rs::core::prefix_cache: Prefix cache exact match: 18 blocks matched (tolerance: 0.05)
vllm-rs-svc2  | 2026-03-30T19:25:12.029634Z  INFO vllm_rs::core::block_manager: Prefix cache hit seq 1 (1152 cached tokens, 18 blocks)
vllm-rs-svc2  | 2026-03-30T19:25:12.030740Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 1 (cached 1152 tokens)
vllm-rs-svc2  | 2026-03-30T19:25:12.030940Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 1 (cached 1152 tokens)
vllm-rs-svc2  | 2026-03-30T19:25:12.040993Z  INFO vllm_rs::utils::guidance: GRAMMAR:
vllm-rs-svc2  | start: ( text | tool_call )+ eos
vllm-rs-svc2  | tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
vllm-rs-svc2  | tool_0: "<function=fetch_url_via_curl>"  "\n" param_0_0 ( "\n" param_0_1)? "</function>" "\n"
vllm-rs-svc2  | tool_content: tool_0 ? tool_1 ? tool_2 ? tool_3 ? tool_4
vllm-rs-svc2  | tool_1: "<function=fs_cat>"  "\n" param_1_0 "</function>" "\n"
vllm-rs-svc2  | tool_2: "<function=fs_ls>"  "\n" param_2_0 "</function>" "\n"
vllm-rs-svc2  | tool_3: "<function=get_current_time>"  "</function>" "\n"
vllm-rs-svc2  | tool_4: "<function=web_search_searxng>"  "\n" param_4_0 ( "\n" param_4_1)? "</function>" "\n"
vllm-rs-svc2  | param_0_0: "<parameter=url>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_0_1: "<parameter=proxy>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_1_0: "<parameter=path>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_2_0: "<parameter=path>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_0: "<parameter=query>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_1: "<parameter=searxng>" "\n" value_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | text: /(?s:.+?)/
vllm-rs-svc2  | value_string[suffix="\n</parameter>\n"]: /(?s:.+?)/
vllm-rs-svc2  | eos:  ( <[248046]> <[248044]> )

Python vLLM tracks KV blocks at the Request level, effectively tracking session_id like you did originally, which absolves us of the whole matching problem. I'll put this through its paces on the coder model and see how it does working out the reasoning fallback grammar.

@sempervictus

Contributor

So far so good: the 122B responds correctly with thinking, the 80B coder does not think (yet), tool calls are sharp, and prefix hits always show the number of blocks committed by the last turn minus 1, since the prefix cache discards partial blocks.
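The "minus 1" pattern can be checked against the log earlier in this thread. This is a minimal arithmetic sketch, not the project's code: the function names are hypothetical, and the block size of 64 tokens is inferred from the log lines (18 blocks for 1152 cached tokens, 109 blocks for 6976 free tokens) rather than read from the source.

```rust
/// Number of complete blocks covered by `tokens`.
/// Assumption: a trailing partial block is never counted as cacheable.
fn full_blocks(tokens: usize, block_size: usize) -> usize {
    tokens / block_size
}

/// Tokens that can be served from cache when `matched_blocks` blocks hit.
fn cached_tokens(matched_blocks: usize, block_size: usize) -> usize {
    matched_blocks * block_size
}
```

With these, `full_blocks(1250, 64)` gives 19, matching "Prefix cache insert seq 0 (1250 tokens, 19 blocks)", and `cached_tokens(18, 64)` gives 1152, matching the 18-block hit on the following turn.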

#282 does not appear to be breaking anything, but we're not falling through to it on mismatches with this PR. #265 is correctly cleaning up the anomalous generation of those <think> tags, although I haven't tested whether that still works with the guidance functions fully disabled.

@guoqingbao

Owner Author

> Unfortunately this makes the model throw think tags back to clients on every response including tool calls:

It returns the reasoning parts via reasoning_content; popular AI agents such as opencode, Kilo Code, and Claude Code use that field.

@guoqingbao

Owner Author

> I can update #265 to handle those correctly, but I'm wondering whether we need the injection at all, given our control of the generation space, and whether #282 has any merit.

The current approach in this PR follows the standard and specs. If a client is unable to handle that, it's a client issue, and no more changes are needed. The new approach uses reasoning_content, a specific chunk type for streaming the reasoning process.
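As a rough illustration of what handling reasoning_content on the client side could look like, here is a minimal sketch. The `Delta` struct is a hypothetical client-side mirror of only the `content` and `reasoning_content` fields seen in the `ChatCompletionChunk` log above; `Rendered` and `accumulate` are invented names, not part of vllm.rs or any client.

```rust
/// Hypothetical client-side view of the two text fields in a streamed delta.
#[derive(Default)]
struct Delta {
    content: Option<String>,
    reasoning_content: Option<String>,
}

/// Accumulated output, with reasoning kept apart from the visible answer.
#[derive(Default, Debug, PartialEq)]
struct Rendered {
    reasoning: String,
    answer: String,
}

/// Route each delta into the matching buffer instead of concatenating
/// everything into one string.
fn accumulate(deltas: &[Delta]) -> Rendered {
    let mut out = Rendered::default();
    for d in deltas {
        if let Some(r) = &d.reasoning_content {
            out.reasoning.push_str(r);
        }
        if let Some(c) = &d.content {
            out.answer.push_str(c);
        }
    }
    out
}
```

A client built this way can show `reasoning` in a collapsed or dimmed view and print only `answer` as the response body.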

@sempervictus

Contributor

So it looks like images can still break prefix lookups :-\

vllm-rs-svc0  | 2026-03-31T00:52:32.475971Z  WARN vllm_rs::server::server: Tools enabled for request
vllm-rs-svc0  | 2026-03-31T00:52:32.484124Z  INFO vllm_rs::server: Chat image decoded: 1238 x 1222
vllm-rs-svc0  | 2026-03-31T00:52:32.565365Z  INFO vllm_rs::server: Chat image decoded: 1568 x 980
vllm-rs-svc0  | 2026-03-31T00:52:32.672580Z  INFO vllm_rs::server: 2 images detected in the chat message, combined image shape [2, 3024, 1536]
vllm-rs-svc0  | 2026-03-31T00:52:33.025641Z ERROR vllm_rs::server::server: Stream generation failed: Remaining 5120 kvcache tokens, but your prompt requires 182016 new tokens, please request later!

@sempervictus

Contributor

> The current approach in this PR follows the standard and specs. If a client is unable to handle that, it's a client issue, and no more changes are needed. The new approach uses reasoning_content, a specific chunk type for streaming the reasoning process.

Agreed. It's mostly the TUI-based Rust clients that seem to have issues, but the web-rendered ones simply present the chunk without "hiding" it behind a thinking modal in a slightly different font.

Have been running this on some V100s all day, and a merge with #265 on the 6KPros. Aside from some clients "not liking being forced to see think tags", I think the operating logic is sound. I have still seen a few cache misses, but I think we won't be able to avoid some level of that until we figure out a way to build a ConversationState, or something along those lines, which has all content and generation parameters (sampling, grammars, etc.) affixed to each turn for perfect replay/regeneration.
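To make the ConversationState idea concrete, here is one possible shape, purely as a sketch. Nothing like this exists in the PR; every type, field, and method name below is hypothetical, and a single temperature value stands in for the full set of sampling parameters.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical per-turn record: content plus the parameters that shaped it.
#[derive(Hash)]
struct Turn {
    role: String,
    content: String,
    /// Sampling temperature stored as raw bits, since `f32` is not `Hash`.
    temperature_bits: u32,
    /// Grammar text used for guided decoding on this turn, if any.
    grammar: Option<String>,
}

#[derive(Default)]
struct ConversationState {
    turns: Vec<Turn>,
}

impl ConversationState {
    fn push(&mut self, role: &str, content: &str, temperature: f32, grammar: Option<&str>) {
        self.turns.push(Turn {
            role: role.to_string(),
            content: content.to_string(),
            temperature_bits: temperature.to_bits(),
            grammar: grammar.map(str::to_string),
        });
    }

    /// Fingerprint of the first `n` turns. Two conversations whose prefixes
    /// share content and parameters produce the same value.
    fn prefix_fingerprint(&self, n: usize) -> u64 {
        let mut h = DefaultHasher::new();
        self.turns[..n.min(self.turns.len())].hash(&mut h);
        h.finish()
    }
}
```

Keying cached state on such a fingerprint would tie a cache entry to how a turn was generated as well as to its text, which is the replay property described above.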

@sempervictus

sempervictus commented Mar 31, 2026

Contributor

With just this PR or with 265 aligned as it currently is, the clients which don't handle the think-chunks correctly (or when no thinking params are sent) all do this:

$ aichat -m g62 --role %research% web search for vllm.rs
<think>



</think>



Call web_search_searxng {"query":"vllm.rs"}
Processing 4096 tokens per chunk
<think>



</think>



## Analysis of vLLM.rs

**Core Identity:** vLLM.rs is a minimalist, high-performance large language model inference engine in Rust, implemented by the vLLM project.

In a multi-turn session, does that help the match occur correctly or not, given the channelization? Prior to the last commit in #265, these were being removed at the last second from tokenization through the template to ensure alignment with the grammar, but I pushed the last commit to ensure alignment with this PR, so I figure it's worth asking: do we actually need those in the final content tokenized in the chat template, given that I saw no cache misses while they were being stripped (you'd mentioned there's no log of it, so maybe I missed it happening)?
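For reference, stripping the empty blocks at the text level can be sketched as below. This is not what #265 does (that change works on the tokenization/template path); it is a standalone illustration with a hypothetical function name, and it only removes `<think>` blocks whose body is whitespace.

```rust
/// Remove `<think>...</think>` spans whose body is only whitespace, along
/// with any whitespace that follows them. Non-empty reasoning is kept.
fn strip_empty_think(text: &str) -> String {
    const OPEN: &str = "<think>";
    const CLOSE: &str = "</think>";
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(start) = rest.find(OPEN) {
        let body_start = start + OPEN.len();
        match rest[body_start..].find(CLOSE) {
            Some(rel_end) if rest[body_start..body_start + rel_end].trim().is_empty() => {
                // Empty block: keep what came before it, drop the block itself.
                out.push_str(&rest[..start]);
                rest = rest[body_start + rel_end + CLOSE.len()..].trim_start();
            }
            _ => {
                // Non-empty or unterminated block: copy through the opening
                // tag unchanged and keep scanning after it.
                out.push_str(&rest[..body_start]);
                rest = &rest[body_start..];
            }
        }
    }
    out.push_str(rest);
    out
}
```

Applied to the aichat output above, the empty block before `Call web_search_searxng` disappears, while a block that contains actual reasoning is passed through untouched.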

@guoqingbao

Owner Author

> With just this PR or with 265 aligned as it currently is, the clients which don't handle the think-chunks correctly (or when no thinking params are sent) all do this:

That's a client issue.

<think>



</think>

This means the reasoning is off.

@guoqingbao

Owner Author

> do we actually need those in the final content tokenized in the chat template, given that I saw no cache misses while they were being stripped (you'd mentioned there's no log of it, so maybe I missed it happening)?

It’s essentially an all-cache-hit vs. partial-hit trade-off. In this PR, I’m allowing slight partial cache misses (still maintaining >95% content cache hit rates for tool calls). The cache miss you observed is another symptom, due to Mamba state eviction (some states were evicted).

@guoqingbao guoqingbao merged commit 7f1b6ce into main Mar 31, 2026
1 check passed
@guoqingbao guoqingbao deleted the decoding_cache branch April 30, 2026 16:27
guoqingbao added a commit that referenced this pull request May 21, 2026
* Working solution

* Thought process sent via reasoning_content chunk

* Improve claude path

* Minor fix

* ReadMe for Kilo Code