Rewrite special markers in prompts to fix cache mismatch #279
guoqingbao wants to merge 4 commits into
Conversation
|
So ... this is interesting because with this grammar
vllm-rs-svc2 | 2026-03-28T00:50:53.154035Z INFO vllm_rs::utils::guidance: GRAMMAR:
vllm-rs-svc2 | start: ( text | tool_call )+ eos
vllm-rs-svc2 | tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
vllm-rs-svc2 | tool_0: "<function=fetch_url_via_curl>" "\n" param_0_0 ( "\n" param_0_1)? "</function>" "\n"
vllm-rs-svc2 | tool_content: tool_0 | tool_1 | tool_2 | tool_3 | tool_4
vllm-rs-svc2 | tool_1: "<function=fs_cat>" "\n" param_1_0 "</function>" "\n"
vllm-rs-svc2 | tool_2: "<function=fs_ls>" "\n" param_2_0 "</function>" "\n"
vllm-rs-svc2 | tool_3: "<function=get_current_time>" "</function>" "\n"
vllm-rs-svc2 | tool_4: "<function=web_search_searxng>" "\n" param_4_0 ( "\n" param_4_1)? "</function>" "\n"
vllm-rs-svc2 | param_0_0: "<parameter=url>" "\n" value_0_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_0_1: "<parameter=proxy>" "\n" value_0_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_1_0: "<parameter=path>" "\n" value_1_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_2_0: "<parameter=path>" "\n" value_2_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_4_0: "<parameter=query>" "\n" value_4_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_4_1: "<parameter=searxng>" "\n" value_4_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | text: /(?s:.+?)/
vllm-rs-svc2 | value_0_0_string: %json {"type":"string","description":"The URL to scrape."}
vllm-rs-svc2 | value_0_1_string: %json {"type":"string","description":"The proxy URL in the format protocol://host:port."}
vllm-rs-svc2 | value_1_0_string: %json {"type":"string","description":"The path of the file to read"}
vllm-rs-svc2 | value_2_0_string: %json {"type":"string","description":"The path of the directory to list"}
vllm-rs-svc2 | value_4_0_string: %json {"type":"string","description":"The query to search for."}
vllm-rs-svc2 | value_4_1_string: %json {"type":"string","description":"Optional searxng URL overriding the env var"}
vllm-rs-svc2 | eos: ( <[248046]> <[248044]> )
we should not have
$ aichat -m g62 --role %research% research vllm.rs on the web
<think>
The user is asking me to research vllm.rs on the web. This is a research request about a specific project named vllm.rs. Let me search for information about this project.
I should use the web_search_searxng function to research this project. This will help me find information about vllm.rs, its developers, purpose, and what it does.
I'll start by searching for information about vllm.rs to understand what this project is about.
I'll make the tool call to search for information about vllm.rs.
I'll search for information about vllm.rs to understand what this project is about.
```tool_call
<function=web_search_searxng>
<parameter=query>
"vllm.rs"
</parameter>
</function>
Call web_search_searxng {"query":"vllm.rs"}
Processing 4096 tokens per chunk
<think>
I found information about vLLM.rs. Let me summarize the key findings:
1. vLLM.rs is a minimal, high-performance LLM inference engine implemented in Rust
2. It's a lightweight implementation of vLLM
3. Published on PyPI as vllm-rs
4. The project is maintained by guoqingbao
5. It supports CUDA and is cross-platform
6. It's designed for high-throughput and memory-efficient inference
I should provide a comprehensive overview of this project.
<function=web_search_searxng>
<parameter=query>
"vLLM.rs"
</parameter>
</function>
Call web_search_searxng {"query":"vLLM.rs"}
Processing 4096 tokens per chunk
<think>
I need to gather more specific information about vLLM.rs. Let me search for details about the implementation, architecture, and key components.
<function=web_search_searxng>
<parameter=query>
"vLLM.rs"
</parameter>
</function>
Call web_search_searxng {"query":"vLLM.rs"}
This PR is very educational, thank you - starting to see an angle of this i missed before re alignment with |
|
BTW i'm trying to stuff all special token strings into SpecialTokens, partially to ease agent development. We've come a long way toward being able to emit special token strings but it's not perfect and it gets worse in tool calls. Idea here is to contain them all to one file - thoughts? |
I didn't escape the special tokens within the main content, so it might have problems dealing with tool-calling-related topics. |
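A rough sketch of what "contain them all to one file" could look like. Everything here is hypothetical: the `SpecialTokens` layout, the `qwen_example` constructor and the `find_in_content` helper are illustrations, not the actual vllm.rs types. The token IDs are copied from the grammars in this thread; the marker strings paired with them are assumptions.

```rust
/// One special marker: its literal text and its token ID.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SpecialToken {
    pub text: &'static str,
    pub id: u32,
}

/// Hypothetical single home for every special marker (not the real vllm.rs type).
#[derive(Debug, Clone, Copy)]
pub struct SpecialTokens {
    pub think_open: SpecialToken,
    pub think_close: SpecialToken,
    pub tool_open: SpecialToken,
    pub tool_close: SpecialToken,
}

impl SpecialTokens {
    /// IDs taken from the grammars in this thread; the strings are assumed.
    pub fn qwen_example() -> Self {
        SpecialTokens {
            think_open: SpecialToken { text: "<think>", id: 248068 },
            think_close: SpecialToken { text: "</think>", id: 248069 },
            tool_open: SpecialToken { text: "<tool_call>", id: 248058 },
            tool_close: SpecialToken { text: "</tool_call>", id: 248059 },
        }
    }

    /// Every marker, so callers never hard-code the strings themselves.
    pub fn all(&self) -> [SpecialToken; 4] {
        [self.think_open, self.think_close, self.tool_open, self.tool_close]
    }

    /// Markers that appear literally in ordinary content and would need
    /// escaping before the text is rendered into a prompt.
    pub fn find_in_content(&self, content: &str) -> Vec<SpecialToken> {
        self.all()
            .iter()
            .copied()
            .filter(|t| content.contains(t.text))
            .collect()
    }
}
```

With a registry like this, the escaping concern raised in the reply becomes a single lookup: any message whose content triggers `find_in_content` is a candidate for escaping before templating.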
Force-pushed from 5c68ae2 to efaca36
|
i've got q3n coder trying to adapt the grammars PR to this - never tried to have it keep up with an in-progress one before, effect is pure comedy. 😁 Looking to
start: reasoning_block? ( text | tool_call )+ eos
reasoning_block: <[248068]> think_text <[248069]> ("\n")?
think_text[suffix="<[248069]>"]: /(?s:.+?)/
tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
tool_0: "<function=get_weather>" "\n" param_0_0 "</function>" "\n"
tool_content: tool_0
param_0_0: "<parameter=city>" "\n" value_0_0_string "\n" "</parameter>" "\n"
text: /(?s:.+?)/
value_0_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
eos: ( <[248046]> <[248044]> )
is the current testing thread (will push as soon as i can determine if it breaks anything else) |
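For readers following along, the tool-call rules in these grammar dumps can be produced mechanically from the tool definitions. The sketch below is an assumption about how such a generator might look, not the code in either PR: `Tool`, `Param` and `tool_grammar` are made-up names, and only the emitted text format is taken from the logs in this thread.

```rust
/// Hypothetical tool description; only `name` and `required` matter here.
pub struct Param {
    pub name: &'static str,
    pub required: bool,
}

pub struct Tool {
    pub name: &'static str,
    pub params: Vec<Param>,
}

/// Emit the tool-call rules in the same textual shape as the logged grammars.
pub fn tool_grammar(tools: &[Tool], tool_open: u32, tool_close: u32) -> String {
    let mut lines = Vec::new();
    lines.push(format!(
        r#"tool_call: <[{tool_open}]> "\n" tool_content "\n" <[{tool_close}]> "\n""#
    ));
    let names: Vec<String> = (0..tools.len()).map(|i| format!("tool_{i}")).collect();
    lines.push(format!("tool_content: {}", names.join(" | ")));

    for (i, tool) in tools.iter().enumerate() {
        // Required parameters are listed directly, optional ones are wrapped in ( ... )?
        let mut body = String::new();
        for (j, p) in tool.params.iter().enumerate() {
            if p.required {
                body.push_str(&format!(" param_{i}_{j}"));
            } else {
                body.push_str(&format!(r#" ( "\n" param_{i}_{j})?"#));
            }
        }
        // Tools without parameters have no newline after the opening tag.
        let open_nl = if tool.params.is_empty() { "" } else { r#" "\n""# };
        lines.push(format!(
            r#"tool_{i}: "<function={}>"{open_nl}{body} "</function>" "\n""#,
            tool.name
        ));
        for (j, p) in tool.params.iter().enumerate() {
            lines.push(format!(
                r#"param_{i}_{j}: "<parameter={}>" "\n" value_{i}_{j}_string "\n" "</parameter>" "\n""#,
                p.name
            ));
            // The suffix attribute stops the lazy regex at the closing tag.
            lines.push(format!(
                r#"value_{i}_{j}_string[suffix="\n</parameter>"]: /(?s:.+?)/"#
            ));
        }
    }
    lines.join("\n")
}
```

Calling it with a single `get_weather` tool that has one required `city` parameter yields the `tool_call`, `tool_content`, `tool_0`, `param_0_0` and `value_0_0_string` rules shown in the comment above; the `start`, `text` and `eos` rules are left out of the sketch.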
|
@guoqingbao so we definitely need to stabilize the chat template generation for the grammar constraint application to work correctly, because grammar constrains everything which the model outputs from
start: reasoning_block? ( text | tool_call )+ eos
reasoning_block: <[248068]> think_text <[248069]> ("\n")?
think_text[suffix="<[248069]>"]: /(?s:.+?)/
tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
tool_0: "<function=get_weather>" "\n" param_0_0 "</function>" "\n"
tool_content: tool_0
param_0_0: "<parameter=city>" "\n" value_0_0_string "\n" "</parameter>" "\n"
text: /(?s:.+?)/
value_0_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
eos: ( <[248046]> <[248044]> )
means that if the chat template already has
When the template already provides think tags, the reasoning grammar can't work because those are already in-place. We will likely need dynamic chat template composition to really make that work, but in order to implement that we first need a "this works out of the box for everything" approach to the template and build the composition logic atop that solid core. Alternatively we scrap trying to make the chat template pre-render this and use LLG's fast-tokens from the |
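One way to picture the conflict: if the rendered prompt already ends with the think-open marker, a grammar whose `reasoning_block` also starts with that marker has nothing left to match. A minimal sketch of detecting and removing the pre-filled tag so the grammar can emit it instead is below. The function name and the `<think>` string are assumptions for illustration, not the logic in this PR or in #265.

```rust
/// If the rendered prompt ends with the think-open marker (ignoring trailing
/// whitespace), return the prompt without it plus `true`; otherwise return the
/// prompt unchanged plus `false`.
pub fn trim_prefilled_think<'a>(prompt: &'a str, think_open: &str) -> (&'a str, bool) {
    if think_open.is_empty() {
        return (prompt, false);
    }
    // Templates typically emit the marker followed by a newline.
    let trimmed = prompt.trim_end();
    match trimmed.strip_suffix(think_open) {
        Some(rest) => (rest, true),
        None => (prompt, false),
    }
}
```

The boolean lets the caller decide whether the grammar should make `reasoning_block` mandatory (the tag was removed and must be regenerated) or leave it optional.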
|
Pondering this a bit... i think i have a solution to address all code paths:
Somewhat separately: we need to address the tool-call instruction blocks and format shown to the model in its system prompt when we use
Ideally we make the system prompt separate from the prefix cache, because there are various bots which will change up the tools available per request, and while the grammar-gen handles that gracefully it causes a full refill of the KV. Ditto having an IDE change out the system prompt by changing the "role" of the agent running in it. |
Force-pushed from efaca36 to 52ed9d3
I updated using a different approach. |
|
This is the Heisenberg branch :-p. Reading but i actually think what i proposed above still flies (might take a bit more thinking to implement): if we control |
The difficulty lies in aligning with the decoding cache, even under reasoning. The latest approach seems to be working well. |
|
Will definitely try to get my approach aligned with the current one. BTW, before you force-pushed i had the coder implement a "comodal approach" between this branch and #265 which produces:
vllm-rs-svc2 | 2026-03-28T16:58:22.020348Z INFO vllm_rs::utils::guidance: GRAMMAR:
vllm-rs-svc2 | start: ( text | tool_call )+ eos
vllm-rs-svc2 | tool_content: tool_0 | tool_1 | tool_2 | tool_3 | tool_4
vllm-rs-svc2 | tool_0: "<function=fetch_url_via_curl>" "\n" param_0_0 ( "\n" param_0_1)? "</function>" "\n"
vllm-rs-svc2 | tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
vllm-rs-svc2 | tool_1: "<function=fs_cat>" "\n" param_1_0 "</function>" "\n"
vllm-rs-svc2 | tool_2: "<function=fs_ls>" "\n" param_2_0 "</function>" "\n"
vllm-rs-svc2 | tool_3: "<function=get_current_time>" "</function>" "\n"
vllm-rs-svc2 | tool_4: "<function=web_search_searxng>" "\n" param_4_0 ( "\n" param_4_1)? "</function>" "\n"
vllm-rs-svc2 | param_0_0: "<parameter=url>" "\n" value_0_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_0_1: "<parameter=proxy>" "\n" value_0_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_1_0: "<parameter=path>" "\n" value_1_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_2_0: "<parameter=path>" "\n" value_2_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_4_0: "<parameter=query>" "\n" value_4_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_4_1: "<parameter=searxng>" "\n" value_4_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | text: /(?s:.+?)/
vllm-rs-svc2 | value_0_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2 | value_0_1_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2 | value_1_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2 | value_2_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2 | value_4_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2 | value_4_1_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2 | eos: ( <[248046]> <[248044]> )
with an 0.8 doing this:
Details
LLM Constraint Mechanics: Technical Analysis
Executive Summary
The Large Language Model (LLM) constraint ecosystem operates under a complex triad of architectural, computational, and physical constraints that govern deployment feasibility, performance metrics, and security implications. This analysis examines the interplay between architectural design principles, computational resource limitations, and physical infrastructure constraints within the GPUaaS (General Purpose AI Service) context.
1. Architectural Constraint Framework
1.1. Computational Resource Constraints
1.2. Physical Infrastructure Constraints
2. Security Implications
2.1. Access Control Mechanisms
2.2. Threat Modeling Considerations
3. Operational Design Considerations
3.1. Cost Optimization Strategies
3.2. Disaster Recovery Considerations
4. Security Best Practices
4.1. Network Security
4.2. Data Protection
4.3. Vulnerability Management
5. Critical Infrastructure Considerations
5.1. Physical Datacenter Constraints
5.2. Energy Efficiency Goals
6. Security Architecture Recommendations
6.1. Containerization Approach
6.2. Network Security Layers
6.3. Threat Defense Strategies
7. Conclusion
The LLM constraint mechanics represent a sophisticated ecosystem where architectural decisions directly impact security posture, operational efficiency, and long-term viability. As cloud infrastructure continues to evolve with AI workloads, understanding these constraints is essential for architects and engineers designing secure, efficient, and scalable solutions.
Citations:
llguidance Rust Constraint Mechanics
Core Architecture
llguidance is a high-performance, Rust-based library for constrained decoding of Large Language Models (LLMs). It implements Earley parsers for efficient constrained generation.
Key Characteristics:
The Constraint API
The primary entry point is the
// Two-step sampling loop:
let allowed = constraint.compute_mask();
let consumed = constraint.commit_token();
Implementation Details
Rust Crate Structure
The main crate follows a clean, documented interface:
pub use guidance_ai::llguidance::{
    Constraint, Matcher, TokenParser, TopLevelGrammar, TopLevelTokenGenerator,
};
Token Generation Flow
Example Pattern
fn generate_with_constraints(input: &str) -> String {
    let grammar = parse_grammar(&input);
    let parser = ParserFactory::from_llguidance_json(grammar);
    let constraint = Constraint::new();
    let mut tokenizer = TokenGenerator::new(parser);
    let result = tokenizer.generate(input);
    constraint.commit_token();
    result
}
Architecture Layers
Performance Metrics
Key Design Decisions
Limitations and Considerations
Comparison with Alternatives
llguidance offers the best balance of performance, memory efficiency, and Rust compatibility for constrained decoding applications. For detailed documentation, consult the official Rust API at docs.rs/llguidance |
|
@guoqingbao - a possible solution for all of this:
vllm-rs-svc2 | start: bos ( text | tool_call )+ eos
vllm-rs-svc2 | tool_content: tool_0 ? tool_1 ? tool_2 ? tool_3 ? tool_4
vllm-rs-svc2 | tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
vllm-rs-svc2 | tool_0: "<function=fetch_url_via_curl>" "\n" param_0_0 ( "\n" param_0_1)? "</function>" "\n"
vllm-rs-svc2 | tool_1: "<function=fs_cat>" "\n" param_1_0 "</function>" "\n"
vllm-rs-svc2 | tool_2: "<function=fs_ls>" "\n" param_2_0 "</function>" "\n"
vllm-rs-svc2 | tool_3: "<function=get_current_time>" "</function>" "\n"
vllm-rs-svc2 | tool_4: "<function=web_search_searxng>" "\n" param_4_0 ( "\n" param_4_1)? "</function>" "\n"
vllm-rs-svc2 | param_0_0: "<parameter=url>" "\n" value_0_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_0_1: "<parameter=proxy>" "\n" value_0_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_1_0: "<parameter=path>" "\n" value_1_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_2_0: "<parameter=path>" "\n" value_2_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_4_0: "<parameter=query>" "\n" value_4_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_4_1: "<parameter=searxng>" "\n" value_4_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | text: /(?s:.+?)/
vllm-rs-svc2 | value_0_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2 | value_0_1_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2 | value_1_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2 | value_2_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2 | value_4_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2 | value_4_1_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2 | eos: ( <[248046]> <[248044]> )
vllm-rs-svc2 | bos: <[248045]> "assistant\n"
with the reasoning version being ~
wherein we always set
In order to fit this PR the approach needs to be a bit different - i've added a function to ChatTemplate to determine IF you are pre-filling the generation with |
|
this PR has an odd effect - it "bumps" initial generation by ~ |
|
@guoqingbao i think i've solved this in #265 with a slightly different direction for template adjustment without the offset problem |
Because it replaced reasoning headers with whitespace when sending to clients, given that the majority of AI agents won't strip out the reasoning markers in their outputs.
It doesn't do that replacement for non-tool-call requests.
If you strip out the whitespace, it will cause a cache miss, because the number of tokens the client returns from previous turns will not match the current cache. |
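To make the mismatch concrete: cache reuse only extends as far as the cached token sequence and the newly submitted prompt agree token for token. The sketch below is a simplification under that assumption, with a hypothetical helper name; the real matching in vllm.rs may work on hashed blocks rather than raw token slices.

```rust
/// Number of leading tokens shared by the cached sequence and the new prompt.
/// Anything past this point has to be recomputed.
pub fn common_prefix_len(cached: &[u32], incoming: &[u32]) -> usize {
    cached
        .iter()
        .zip(incoming.iter())
        .take_while(|(a, b)| a == b)
        .count()
}
```

If the server cached `[1, 9, 9, 5, 6, 7]` (with `9` standing in for the whitespace that replaced a reasoning header) and the client sends back `[1, 5, 6, 7]` after stripping it, the shared prefix is a single token even though most of the content is unchanged.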
|
Have been running that for a few hrs and no cache misses yet because there's no mismatch - same tokens ('\n|) are generated, just from guidance forcing it instead of the template. This is the guided grammar after stripping that tag, reproducing the same line such that prefix matches are aligned:
...
2026-03-29T15:10:02.272392Z INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3-Coder-Next-FP8, enforce_parser=none)
2026-03-29T15:10:02.273443Z INFO vllm_rs::core::block_manager: Prefix cache hit seq 95 (33280 cached tokens, 520 blocks)
2026-03-29T15:10:02.277654Z INFO vllm_rs::core::runner: Restored mamba prefix state for seq 95 (cached 33280 tokens)
2026-03-29T15:10:02.277669Z INFO vllm_rs::core::runner: Restored mamba prefix state for seq 95 (cached 33280 tokens)
2026-03-29T15:10:02.277803Z INFO vllm_rs::core::runner: Restored mamba prefix state for seq 95 (cached 33280 tokens)
2026-03-29T15:10:02.277812Z INFO vllm_rs::core::runner: Restored mamba prefix state for seq 95 (cached 33280 tokens)
2026-03-29T15:10:02.773425Z INFO vllm_rs::utils::guidance: GRAMMAR:
start: reasoning_block ( text | tool_call )+ eos
reasoning_block: <[151667]> "\n" think_text "\n" <[151668]> "\n"
think_text[suffix="\n"]: /[ -~]+/
tool_call: <[151657]> text <[151658]>
text: /(?s:.+?)/
eos: ( <[151645]> <[151643]> )
...
That said: Qwen3.5 is full of "magic" - remember those
Just to be sure though: could you possibly point me to the logic we are trying to ensure re aligning generation? I'm working off the premise that you mean "prefix cache alignment for matching", in which case everything works correctly because i'm stripping that start-reasoning line from the template at the last moment and generating the same tokens in Guidance, so all of the prefix block accounting should be accurate without any change (same number of tokens emitted in the same position as what was accounted for in the template before my excision). I have a branch of #265 and the current state of #279 together (it's a 1-line change in |
That's the decoding cache miss and it won't report that, meaning each request can only reuse the previous prompt cache. That's why I added this PR. |
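On the prefix-cache side, the hit counts in the logs above are consistent with 64-token blocks (33280 tokens in 520 blocks, and later 186624 tokens in 2916 blocks). The sketch below assumes the block manager reuses only whole blocks; both that behaviour and the function name are assumptions inferred from the log arithmetic, not taken from the vllm.rs source.

```rust
/// Round a matched prefix down to whole cache blocks.
/// Returns (number of reusable blocks, number of reusable tokens).
pub fn reusable_prefix(matched_tokens: usize, block_size: usize) -> (usize, usize) {
    assert!(block_size > 0, "block size must be non-zero");
    let blocks = matched_tokens / block_size;
    (blocks, blocks * block_size)
}
```

Under this model a prefix match of 33300 tokens would still report 33280 cached tokens, which is why a hit on the prompt prefix says nothing about whether the tokens decoded in the previous turn were reused.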
|
I see, so that is different from prefix cache? Will dig in on the merge branch for this and #265 since that now seems to be stable with and without reasoning levels set (testing some cleanup work to push presently) |
|
According to the 122B with thinking enabled via grammar:
2026-03-29T18:39:43.830675Z WARN vllm_rs::server::parser: Tool start token IDs corrected from tokenizer for model Qwen3VL: {248058}
2026-03-29T18:39:43.830708Z WARN vllm_rs::server::parser: Tool end token IDs corrected from tokenizer for model Qwen3VL: {248059}
2026-03-29T18:39:43.830936Z WARN vllm_rs::server::server: Tools enabled for request
2026-03-29T18:39:43.838353Z INFO vllm_rs::core::engine: [llg] Guidance enabled, trimming <think> from pre-generation. Generation starting at <|im_start|>assistant
2026-03-29T18:39:43.988069Z WARN vllm_rs::core::engine: [Stream] New request [Seq_id 27, 191202 tokens] received! (session_id: None)
2026-03-29T18:39:43.988159Z INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3.5-122B-A10B-FP8, enforce_parser=none)
2026-03-29T18:39:43.990732Z INFO vllm_rs::core::block_manager: Prefix cache hit seq 27 (186624 cached tokens, 2916 blocks)
...
2026-03-29T18:39:44.002845Z INFO vllm_rs::core::runner: Restored mamba prefix state for seq 27 (cached 186624 tokens)
2026-03-29T18:39:45.366038Z INFO vllm_rs::utils::guidance: GRAMMAR:
start: reasoning_block? ( text | tool_call )+ eos
reasoning_block: <[248068]> "\n" think_text "\n" <[248069]> ("\n")?
think_text[suffix="\n"]: /[ -~]+/
tool_call: <[248058]> text <[248059]>
text: /(?s:.+?)/
eos: ( <[248046]> <[248044]> )
the merged state between that PR and this one are complementary:
So... how do we want to handle merge ordering? Merge this one first and i realign #265 over it or should i push the merged branch into #265 for your review? |
|
BTW, we're basically 1 push away from having full BOS->EOS control over all rendering, which should give you explicit cache-state matching control since you can generate cache-state UUIDs forcibly at the start and omit them from transmission to the user (internal watermarking, effectively). That trick should also empower
The correct version of such an impl would require me to actually extract the string from the template right after the BOS to handle stuff like tool-response-role gracefully, but i can throw together a commit which we know for a fact will work on the current qwen models as a test case and work our way to idiomatic coverage for all models from there. |
|
Unfortunately this branch breaks expectations of OpenAI API clients which expect the think tags to actually be sent through from the server - reasoning sections come through formatted differently but not within reasoning blocks as expected by the client:
or the "more fun" version from the 122B (the runaway region is what would be
Since we can pin those tags now in the grammar, is there still a reason to do the masking/replacement piece for cache coherence? |
That's not a break; it simply emits reasoning content as normal output so all clients can render it normally. Otherwise, certain agents like opencode and claude code render the reasoning tags visually (which is annoying), and that will break the cache (the chat template can strip out reasoning parts).
It's unrelated to the grammar; it's server/client interaction behavior. If you send reasoning tags, the client will send them back, and the chat template will remove the previous reasoning parts, leaving a hole in the KV cache (a mismatch) so the decoding cache can't be reused. |
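A small illustration of the "hole" described here, assuming a chat template that drops reasoning spans from earlier assistant turns when it re-renders the history. The helper below is hypothetical and works on strings for readability; the actual stripping happens inside the model's chat template, and the cache compares tokens.

```rust
/// Remove every `open ... close` span, roughly the way a chat template drops
/// earlier reasoning when it re-renders the conversation history.
pub fn strip_reasoning(text: &str, open: &str, close: &str) -> String {
    if open.is_empty() || close.is_empty() {
        return text.to_string();
    }
    let mut out = String::new();
    let mut rest = text;
    while let Some(start) = rest.find(open) {
        out.push_str(&rest[..start]);
        match rest[start..].find(close) {
            // Skip past the closing marker and keep scanning.
            Some(end) => rest = &rest[start + end + close.len()..],
            // Unterminated reasoning: drop the remainder.
            None => rest = "",
        }
    }
    out.push_str(rest);
    out
}
```

The server decoded and cached the full turn including the reasoning span, but the re-rendered turn is shorter, so every token after the removed span sits at a different position than it does in the cache and the decoded portion cannot be reused.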
|
Related anomalyco/opencode#11439 |
|
It seems we need to keep the reasoning tags in their original format when sending to the client. It's an opencode bug: all of the popular inference frameworks, including vLLM and sglang, send raw reasoning markers to the client, and the same issue has been found with opencode. I may revert the reasoning tag replacement logic @sempervictus |
|
Finished in #281 |


No description provided.