Rewrite special markers in prompts to fix cache mismatch#279

Closed
guoqingbao wants to merge 4 commits into main from rewrite_chat_template

@guoqingbao

Owner

No description provided.

@sempervictus

Contributor

So ... this is interesting because with this grammar

vllm-rs-svc2  | 2026-03-28T00:50:53.154035Z  INFO vllm_rs::utils::guidance: GRAMMAR:
vllm-rs-svc2  | start: ( text | tool_call )+ eos
vllm-rs-svc2  | tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
vllm-rs-svc2  | tool_0: "<function=fetch_url_via_curl>"  "\n" param_0_0 ( "\n" param_0_1)? "</function>" "\n"
vllm-rs-svc2  | tool_content: tool_0 | tool_1 | tool_2 | tool_3 | tool_4
vllm-rs-svc2  | tool_1: "<function=fs_cat>"  "\n" param_1_0 "</function>" "\n"
vllm-rs-svc2  | tool_2: "<function=fs_ls>"  "\n" param_2_0 "</function>" "\n"
vllm-rs-svc2  | tool_3: "<function=get_current_time>"  "</function>" "\n"
vllm-rs-svc2  | tool_4: "<function=web_search_searxng>"  "\n" param_4_0 ( "\n" param_4_1)? "</function>" "\n"
vllm-rs-svc2  | param_0_0: "<parameter=url>" "\n" value_0_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_0_1: "<parameter=proxy>" "\n" value_0_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_1_0: "<parameter=path>" "\n" value_1_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_2_0: "<parameter=path>" "\n" value_2_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_0: "<parameter=query>" "\n" value_4_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_1: "<parameter=searxng>" "\n" value_4_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | text: /(?s:.+?)/
vllm-rs-svc2  | value_0_0_string: %json {"type":"string","description":"The URL to scrape."}
vllm-rs-svc2  | value_0_1_string: %json {"type":"string","description":"The proxy URL in the format protocol://host:port."}
vllm-rs-svc2  | value_1_0_string: %json {"type":"string","description":"The path of the file to read"}
vllm-rs-svc2  | value_2_0_string: %json {"type":"string","description":"The path of the directory to list"}
vllm-rs-svc2  | value_4_0_string: %json {"type":"string","description":"The query to search for."}
vllm-rs-svc2  | value_4_1_string: %json {"type":"string","description":"Optional searxng URL overriding the env var"}
vllm-rs-svc2  | eos:  ( <[248046]> <[248044]> )

we should not have <think> tags generated, so generation seems to start "offset" from the intended start:

$ aichat -m g62 --role %research% research vllm.rs on the web
<think>
The user is asking me to research vllm.rs on the web. This is a research request about a specific project named vllm.rs. Let me search for information about this project.

I should use the web_search_searxng function to research this project. This will help me find information about vllm.rs, its developers, purpose, and what it does.
 
I'll start by searching for information about vllm.rs to understand what this project is about.
 
I'll make the tool call to search for information about vllm.rs.
 
I'll search for information about vllm.rs to understand what this project is about.
```tool_call

<function=web_search_searxng>
<parameter=query>
"vllm.rs"
</parameter>
</function>

Call web_search_searxng {"query":"vllm.rs"}
Processing 4096 tokens per chunk
<think>
I found information about vLLM.rs. Let me summarize the key findings:

1. vLLM.rs is a minimal, high-performance LLM inference engine implemented in Rust
2. It's a lightweight implementation of vLLM
3. Published on PyPI as vllm-rs
4. The project is maintained by guoqingbao
5. It supports CUDA and is cross-platform
6. It's designed for high-throughput and memory-efficient inference

I should provide a comprehensive overview of this project.

<function=web_search_searxng>
<parameter=query>
"vLLM.rs"
</parameter>
</function>

Call web_search_searxng {"query":"vLLM.rs"}
Processing 4096 tokens per chunk
<think>
I need to gather more specific information about vLLM.rs. Let me search for details about the implementation, architecture, and key components.


<function=web_search_searxng>
<parameter=query>
"vLLM.rs"
</parameter>
</function>

Call web_search_searxng {"query":"vLLM.rs"}

This PR is very educational, thank you - starting to see an angle of this i missed before re alignment with PromptReplay

@sempervictus

Contributor

BTW i'm trying to stuff all special token strings into SpecialTokens, partially to ease agent development. We've come a long way toward being able to emit special token strings but it's not perfect and it gets worse in tool calls. Idea here is to contain them all to one file - thoughts?
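A minimal sketch of what "all special token strings in one file" could look like. The type name `SpecialTokenStrings`, its fields, and the Qwen-style literal values are assumptions for illustration; this is not the actual `SpecialTokens` type in vllm.rs, and the real strings should come from the tokenizer.

```rust
/// Hypothetical registry of special-token strings, kept in one place so that
/// parsers and agents do not hard-code marker literals.
#[derive(Debug, Clone, PartialEq)]
pub struct SpecialTokenStrings {
    pub reasoning_start: &'static str,
    pub reasoning_end: &'static str,
    pub tool_start: &'static str,
    pub tool_end: &'static str,
    pub eos: &'static str,
}

impl SpecialTokenStrings {
    /// Assumed values for a Qwen-style template; verify against the tokenizer.
    pub const fn qwen_style() -> Self {
        Self {
            reasoning_start: "<think>",
            reasoning_end: "</think>",
            tool_start: "<tool_call>",
            tool_end: "</tool_call>",
            eos: "<|im_end|>",
        }
    }

    /// Returns every marker string held by this registry.
    pub fn all(&self) -> [&'static str; 5] {
        [
            self.reasoning_start,
            self.reasoning_end,
            self.tool_start,
            self.tool_end,
            self.eos,
        ]
    }

    /// True if `text` contains any of the registered marker strings.
    pub fn contains_marker(&self, text: &str) -> bool {
        self.all().iter().any(|m| text.contains(m))
    }
}
```

A check such as `contains_marker` is one place where escaping of markers inside ordinary message content could be hooked in later.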

@guoqingbao

Owner Author

> BTW i'm trying to stuff all special token strings into SpecialTokens, partially to ease agent development. We've come a long way toward being able to emit special token strings but it's not perfect and it gets worse in tool calls. Idea here is to contain them all to one file - thoughts?

I didn't escape the special tokens within the main content, so it might have problems dealing with tool-calling-related topics.

@guoqingbao guoqingbao force-pushed the rewrite_chat_template branch 2 times, most recently from 5c68ae2 to efaca36 Compare March 28, 2026 13:53
@sempervictus

Contributor

i've got q3n coder trying to adapt the grammars PR to this - never tried to have it keep up with an in-progress one before, effect is pure comedy. 😁

Looking to suffix= notation in Lark to try and get these thinking run-aways under control

start: reasoning_block? ( text | tool_call )+ eos
reasoning_block: <[248068]> think_text <[248069]> ("\n")?
think_text[suffix="<[248069]>"]: /(?s:.+?)/
tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
tool_0: "<function=get_weather>"  "\n" param_0_0 "</function>" "\n"
tool_content: tool_0
param_0_0: "<parameter=city>" "\n" value_0_0_string "\n" "</parameter>" "\n"
text: /(?s:.+?)/
value_0_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
eos:  ( <[248046]> <[248044]> )

is the current testing thread (will push soon as i can determine if it breaks anything else)
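For readers following along, here is a rough sketch of how rules shaped like the ones above could be generated from a tool list. `ToolSpec` and `tool_rules` are hypothetical names, and the exact spacing differs slightly from the logged grammar; this is not the generator used in the grammars PR.

```rust
/// Hypothetical tool description: a name plus (parameter name, required?) pairs.
pub struct ToolSpec<'a> {
    pub name: &'a str,
    pub params: &'a [(&'a str, bool)],
}

/// Emits Lark-style `tool_content`, `tool_i`, `param_i_j` and `value_i_j_string`
/// rules, using the `suffix=` form for parameter values.
pub fn tool_rules(tools: &[ToolSpec<'_>]) -> String {
    let mut out = String::new();

    // tool_content: tool_0 | tool_1 | ...
    let names: Vec<String> = (0..tools.len()).map(|i| format!("tool_{i}")).collect();
    out.push_str(&format!("tool_content: {}\n", names.join(" | ")));

    for (i, tool) in tools.iter().enumerate() {
        // Raw strings keep `\n` as a literal backslash-n for the grammar text.
        let mut rule = format!(r#"tool_{i}: "<function={}>""#, tool.name);
        for (j, (_, required)) in tool.params.iter().enumerate() {
            if *required {
                rule.push_str(&format!(r#" "\n" param_{i}_{j}"#));
            } else {
                rule.push_str(&format!(r#" ( "\n" param_{i}_{j})?"#));
            }
        }
        rule.push_str(r#" "</function>" "\n""#);
        out.push_str(&rule);
        out.push('\n');

        for (j, (pname, _)) in tool.params.iter().enumerate() {
            out.push_str(&format!(
                r#"param_{i}_{j}: "<parameter={pname}>" "\n" value_{i}_{j}_string "\n" "</parameter>" "\n""#
            ));
            out.push('\n');
            // The suffix attribute stops the lazy regex at the closing tag.
            out.push_str(&format!(
                r#"value_{i}_{j}_string[suffix="\n</parameter>"]: /(?s:.+?)/"#
            ));
            out.push('\n');
        }
    }
    out
}
```

Calling it with a single `get_weather` tool that has one required `city` parameter yields `tool_0`, `param_0_0` and `value_0_0_string` rules equivalent to the ones in the grammar above.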

@sempervictus

sempervictus commented Mar 28, 2026

Contributor

@guoqingbao so we definitely need to stabilize the chat template generation for the grammar constraint application to work correctly: the grammar constrains everything the model outputs from start:, so where the model starts generating after the chat template matters a lot. The run-on effects i'm seeing seem to happen because of positional inconsistency relative to start: 😄.

start: reasoning_block? ( text | tool_call )+ eos
reasoning_block: <[248068]> think_text <[248069]> ("\n")?
think_text[suffix="<[248069]>"]: /(?s:.+?)/
tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
tool_0: "<function=get_weather>"  "\n" param_0_0 "</function>" "\n"
tool_content: tool_0
param_0_0: "<parameter=city>" "\n" value_0_0_string "\n" "</parameter>" "\n"
text: /(?s:.+?)/
value_0_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
eos:  ( <[248046]> <[248044]> )

means that if the chat template already has the reasoning_block equivalent, we end up double-generating a think block, presuming reasoning_block is not marked optional with ? (anything above ReasoningEffort::Low is non-optional right now)

When the template already provides think tags, the reasoning grammar can't work because those are already in-place. We will likely need dynamic chat template composition to really make that work but in order to implement that we first need a "this works out of the box for everything" approach to the template and build the composition logic atop that solid core.

Alternatively we scrap trying to make the chat template pre-render this and use LLG's fast-tokens from the <[bos_token_id]>assistant\n point all the way out to eos to enable/disable reasoning, permit tool call envelopes if tools are provided, and permit normal generation out to eos. I think using LLG for this is cleaner but then again i've been in that code for weeks and you're far more familiar with this part.

@sempervictus

Contributor

Pondering this a bit... i think i have a solution to address all code paths:

  1. What you're doing in this branch handles the non-grammar-driven generation (let's pretend --feature guidance is a thing and we turn it off)
  2. Grammar generation is location-specific and therefore should use what you are doing here to decompose the template for everything after the last EOS (<|im_end|> equiv).
  3. Using the decomposed sections it should then constrain everything from start to eos because of the positional accuracy problem: an offset write can spin the bitmask causing run-on generation, incorrect tool-call format, etc and i think this would have caused rollbacks in my original LLG code which was pedantically checking logit-per-token in the position of the bitmask 🤦
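The decomposition in step 2 can be sketched as a split at the last end-of-message marker. The function name `split_after_last_eos` is hypothetical and the sketch works on the rendered prompt string rather than on token IDs, which is an assumption made to keep it short.

```rust
/// Splits a rendered prompt just after the last occurrence of `eos`.
/// Returns (history, generation_prefix). If `eos` never occurs, the history
/// is empty and the whole prompt is treated as the generation prefix.
pub fn split_after_last_eos<'a>(prompt: &'a str, eos: &str) -> (&'a str, &'a str) {
    match prompt.rfind(eos) {
        // `idx + eos.len()` is a char boundary because `eos` matched there.
        Some(idx) => prompt.split_at(idx + eos.len()),
        None => ("", prompt),
    }
}
```

Everything in the second half (assistant header, any pre-filled reasoning marker) is what the grammar would need to account for so that its start position lines up with the first generated token.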

Somewhat separately: we need to address the tool-call instruction blocks and format shown to the model in its system prompt when we use --enforce-parser because we can run into really weird things around ~8X YARN scale wherein a model which has grammar enforcement to generate JSON-style but a template to generate XML style will terminate the JSON format with </function>\n</parameter>\n<[tool_end_token_id]> resulting in "spillage" of that XML into the chat.

Ideally we make system prompt separate from prefix cache because there are various bots which will change up the tools available per request and while the grammar-gen handles that gracefully it causes a full refill of the KV. Ditto having an IDE change out the system prompt by changing the "role" of the agent running in it.
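To illustrate why a changed system prompt forces a full KV refill, here is a toy model of a block-based prefix cache in which each block's hash is chained to the previous one. This is a generic scheme assumed for illustration, not vllm.rs's block manager; the function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hashes each full block together with the hash of the block before it, so a
/// change in an early block changes every later hash. `block_size` must be > 0.
pub fn chained_block_hashes(tokens: &[u32], block_size: usize) -> Vec<u64> {
    let mut hashes = Vec::new();
    let mut prev: u64 = 0;
    for block in tokens.chunks_exact(block_size) {
        let mut h = DefaultHasher::new();
        prev.hash(&mut h);
        block.hash(&mut h);
        prev = h.finish();
        hashes.push(prev);
    }
    hashes
}

/// Counts how many leading blocks two token sequences would share in the cache.
pub fn shared_blocks(a: &[u32], b: &[u32], block_size: usize) -> usize {
    let ha = chained_block_hashes(a, block_size);
    let hb = chained_block_hashes(b, block_size);
    ha.iter().zip(hb.iter()).take_while(|(x, y)| x == y).count()
}
```

With this model, changing a token in the first block (where the system prompt and tool list live) leaves zero reusable blocks, while a change further along keeps every block before it.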

@guoqingbao guoqingbao force-pushed the rewrite_chat_template branch from efaca36 to 52ed9d3 Compare March 28, 2026 16:49
@guoqingbao

Owner Author

> Ideally we make system prompt separate from prefix cache because there are various bots which will change up the tools available per request and while the grammar-gen handles that gracefully it causes a full refill of the KV. Ditto having an IDE change out the system prompt by changing the "role" of the agent running in it.

I updated using a different approach.

@sempervictus

Contributor

This is the Heisenberg branch :-p. Reading but i actually think what i proposed above still flies (might take a bit more thinking to implement): if we control bos->eos in our generation, there can be no offset problem because all output is encapsulated.

@guoqingbao

Owner Author

> This is the Heisenberg branch :-p. Reading but i actually think what i proposed above still flies (might take a bit more thinking to implement): if we control bos->eos in our generation, there can be no offset problem because all output is encapsulated.

The difficulty lies in aligning with the decoding cache, even under reasoning. The latest approach seems to be working well.

@sempervictus

Contributor

Will definitely try to get my approach aligned with the current one.

BTW, before you force-pushed i had the coder implement a "comodal approach" between this branch and #265 which produces:

vllm-rs-svc2  | 2026-03-28T16:58:22.020348Z  INFO vllm_rs::utils::guidance: GRAMMAR:
vllm-rs-svc2  | start: ( text | tool_call )+ eos
vllm-rs-svc2  | tool_content: tool_0 | tool_1 | tool_2 | tool_3 | tool_4
vllm-rs-svc2  | tool_0: "<function=fetch_url_via_curl>"  "\n" param_0_0 ( "\n" param_0_1)? "</function>" "\n"
vllm-rs-svc2  | tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
vllm-rs-svc2  | tool_1: "<function=fs_cat>"  "\n" param_1_0 "</function>" "\n"
vllm-rs-svc2  | tool_2: "<function=fs_ls>"  "\n" param_2_0 "</function>" "\n"
vllm-rs-svc2  | tool_3: "<function=get_current_time>"  "</function>" "\n"
vllm-rs-svc2  | tool_4: "<function=web_search_searxng>"  "\n" param_4_0 ( "\n" param_4_1)? "</function>" "\n"
vllm-rs-svc2  | param_0_0: "<parameter=url>" "\n" value_0_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_0_1: "<parameter=proxy>" "\n" value_0_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_1_0: "<parameter=path>" "\n" value_1_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_2_0: "<parameter=path>" "\n" value_2_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_0: "<parameter=query>" "\n" value_4_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_1: "<parameter=searxng>" "\n" value_4_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | text: /(?s:.+?)/
vllm-rs-svc2  | value_0_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_0_1_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_1_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_2_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_4_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_4_1_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | eos:  ( <[248046]> <[248044]> )

with an 0.8 doing this:

$ aichat -m g62 --role %research% research and report on llguidance rust constraint mechanics
Call web_search_searxng {"query":"llguidance rust constraint mechanics"}
Processing 4096 tokens per chunk

Details

LLM Constraint Mechanics: Technical Analysis

Executive Summary

The Large Language Model (LLM) constraint ecosystem operates under a complex triad of architectural, computational, and physical constraints that govern deployment feasibility, performance metrics, and security implications. This analysis examines the interplay between architectural design principles, computational resource limitations, and physical infrastructure constraints within the GPUaaS (General Purpose AI Service) context.

1. Architectural Constraint Framework

1.1. Computational Resource Constraints

  • Token Generation Limits: Models face hard limits on context window size, inference latency, and token generation throughput.
  • Memory Boundaries: Memory constraints prevent large-scale model training and inference at arbitrary scale.
  • Latency Requirements: Real-time applications require millisecond-level latency with tight time windows.

1.2. Physical Infrastructure Constraints

  • Power Consumption: GPUs consume significant power; thermal throttling affects operational limits.
  • Cooling Capacity: Air cooling requires adequate airflow, limiting concurrent processing capacity.
  • Hardware Availability: GPU availability varies by geographic region and network connectivity.

2. Security Implications

2.1. Access Control Mechanisms

  • Service Model Restrictions: IaaS, PaaS, and SaaS have distinct access control requirements based on service delivery models.
  • Zero Trust Architecture: Required for cloud security to prevent lateral movement within the AI infrastructure.
  • Least Privilege Principle: Application-level access controls minimize attack surface while maintaining service continuity.

2.2. Threat Modeling Considerations

  • Privileged Account Risks: High-risk accounts may enable privilege escalation.
  • Network Exposure: Exposure of internal networks increases attack vectors.
  • Data Integrity Protection: Encryption at rest and in transit is critical for sensitive data.

3. Operational Design Considerations

3.1. Cost Optimization Strategies

  • Hardware Selection: Balancing GPU type, size, and cooling requirements.
  • Power Management: Implementing power-saving modes to reduce energy consumption.
  • Resource Allocation: Dynamic resource allocation based on workload demands.

3.2. Disaster Recovery Considerations

  • Failover Mechanisms: Multi-region deployments with automatic failover capabilities.
  • Data Backup Strategies: Regular backups with automated restoration procedures.
  • Recovery Time Objective (RTO): Defined targets for recovery periods.

4. Security Best Practices

4.1. Network Security

  • Firewall Configurations: Strategic firewall rules to prevent unauthorized access.
  • Segmentation: Isolate critical infrastructure from public networks.
  • Microsegmentation: Limit lateral movement within the network structure.

4.2. Data Protection

  • Encryption Standards: Implementing encryption at rest and in transit.
  • Access Control: Role-based access control (RBAC) for sensitive data handling.
  • Audit Logging: Comprehensive logging of all security events for compliance.

4.3. Vulnerability Management

  • Patch Management: Regular patch deployments to maintain system integrity.
  • Code Review: Automated code review pipelines to detect vulnerabilities.
  • Penetration Testing: Continuous security assessments to identify gaps.

5. Critical Infrastructure Considerations

5.1. Physical Datacenter Constraints

  • Cooling Systems: Liquid cooling is a primary constraint requiring specialized equipment.
  • Power Distribution: Proper power management systems are essential for stable operation.
  • Environment Control: Temperature and humidity control within data centers.

5.2. Energy Efficiency Goals

  • Green Computing Initiatives: Reducing energy consumption through hardware and software optimization.
  • Carbon Footprint Reduction: Targeted initiatives to minimize environmental impact.

6. Security Architecture Recommendations

6.1. Containerization Approach

  • Docker/Kubernetes: Standardize infrastructure across deployments.
  • Service Discovery: Enable discovery of services across the cluster.
  • Isolation: Prevent cross-container security breaches.

6.2. Network Security Layers

  • Security Groups: Define precise allowed ports and IP addresses.
  • Network Segmentation: Create logical boundaries for different service domains.
  • Traffic Monitoring: Real-time monitoring of network traffic patterns.

6.3. Threat Defense Strategies

  • SIEM Integration: Centralized security information and event management.
  • Behavioral Analysis: Detecting anomalous activity patterns.
  • Automated Response: Rapid response mechanisms to detected threats.

7. Conclusion

The LLM constraint mechanics represent a sophisticated ecosystem where architectural decisions directly impact security posture, operational efficiency, and long-term viability. As cloud infrastructure continues to evolve with AI workloads, understanding these constraints is essential for architects and engineers designing secure, efficient, and scalable solutions.

Citations:

  • CSA Security Guidance for Cloud Computing (2024)
  • InterGlobix Magazine article on AI data center expansion
  • Google Cloud Infrastructure security design documentation
  • NIST SP 800-210 Access Control Guidance

rage@unknown: vllm.rs(grammars/pr)$ aichat -m g62 --role %research% research and report on llguidance rust constraint mechanics
Call web_search_searxng {"query":"llguidance rust constraint mechanics"}
Processing 4096 tokens per chunk

llguidance Rust Constraint Mechanics

Core Architecture

llguidance is a high-performance, Rust-based library for constrained decoding of Large Language Models (LLMs). It implements Earley parsers for efficient constrained generation.

Key Characteristics:

  • Token Masking: The library pre-computes token masks for all automaton states, significantly reducing sampling overhead
  • Startup Cost: Minimal - no upfront token mask computation
  • Performance: Approximately 50μs per token for 128k tokenizers with typical models

The Constraint API

The primary entry point is the Constraint struct:

// Two-step sampling loop:
let allowed = constraint.compute_mask();
let consumed = constraint.commit_token();

compute_mask() returns the set of allowed tokens at each state.
commit_token() advances the grammar state after sampling.

Implementation Details

Rust Crate Structure

The main crate follows a clean, documented interface:

pub use guidance_ai::llguidance::{
    Constraint, Matcher, TokenParser, TopLevelGrammar, TopLevelTokenGenerator,
};

Token Generation Flow

  1. ParserFactory constructs an Earley parser from the grammar
  2. Constraint computes token masks for all states
  3. TokenGenerator samples tokens using the mask
  4. Commit advances the parser state

Example Pattern

fn generate_with_constraints(input: &str) -> String {
    let grammar = parse_grammar(&input);
    let parser = ParserFactory::from_llguidance_json(grammar);
    let constraint = Constraint::new();
    
    let mut tokenizer = TokenGenerator::new(parser);
    let result = tokenizer.generate(input);
    constraint.commit_token();
    result
}

Architecture Layers

  1. Low-Level: llguidance crate (Rust parser implementation)
  2. Intermediate: guidance-ai/guidance crate (HTML/XML generation wrapper)
  3. Higher-Level: guidance-ai/llguidance crate (main constrained decoding engine)

Performance Metrics

Metric               Value
Token Masking Cost   ~50μs per token
Startup Cost         Minimal/no mask computation
Memory Overhead      Low - masks precomputed for all states
CPU Time             50μs × tokens

Key Design Decisions

  1. State-Sensitive Masking: Masks are computed per automaton state, allowing fine-grained control over which tokens are allowed at each point in the generation process.

  2. Deterministic Sampling: Uses Earley parser semantics to ensure deterministic sampling across different runs.

  3. Memory Efficiency: Pre-computes masks only when needed, avoiding unnecessary data structures during sampling.

  4. Efficient State Management: The parser maintains a compact state representation that minimizes memory usage while preserving parsing correctness.

Limitations and Considerations

  • Constraint Complexity: Complex constraints may impact sampling efficiency
  • Startup Overhead: While minimal, there's still some initial setup cost
  • Token Masking Cost: 50μs per token is significant for large models (>1M tokens)
  • Memory Requirements: Requires sufficient RAM for token masks

Comparison with Alternatives

Library Rust Support Token Masking Startup Cost Memory
llguidance Yes (~50μs/tokens) Minimal Low
LLInterpreter No High High
FFI No High High

llguidance offers the best balance of performance, memory efficiency, and Rust compatibility for constrained decoding applications.


For detailed documentation, consult the official Rust API at docs.rs/llguidance

@sempervictus

Contributor

@guoqingbao - a possible solution for all of this:

vllm-rs-svc2  | start: bos ( text | tool_call )+ eos
vllm-rs-svc2  | tool_content: tool_0 ? tool_1 ? tool_2 ? tool_3 ? tool_4
vllm-rs-svc2  | tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
vllm-rs-svc2  | tool_0: "<function=fetch_url_via_curl>"  "\n" param_0_0 ( "\n" param_0_1)? "</function>" "\n"
vllm-rs-svc2  | tool_1: "<function=fs_cat>"  "\n" param_1_0 "</function>" "\n"
vllm-rs-svc2  | tool_2: "<function=fs_ls>"  "\n" param_2_0 "</function>" "\n"
vllm-rs-svc2  | tool_3: "<function=get_current_time>"  "</function>" "\n"
vllm-rs-svc2  | tool_4: "<function=web_search_searxng>"  "\n" param_4_0 ( "\n" param_4_1)? "</function>" "\n"
vllm-rs-svc2  | param_0_0: "<parameter=url>" "\n" value_0_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_0_1: "<parameter=proxy>" "\n" value_0_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_1_0: "<parameter=path>" "\n" value_1_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_2_0: "<parameter=path>" "\n" value_2_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_0: "<parameter=query>" "\n" value_4_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_1: "<parameter=searxng>" "\n" value_4_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | text: /(?s:.+?)/
vllm-rs-svc2  | value_0_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_0_1_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_1_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_2_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_4_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_4_1_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | eos:  ( <[248046]> <[248044]> )
vllm-rs-svc2  | bos: <[248045]> "assistant\n" 

with the reasoning version being ~

vllm-rs-svc2  | start: bos reasoning_block? ( text | tool_call )+ eos

wherein we always set add_generation_prompt to false and use the <eos>\n of the last message as the generation boundary. If that bos rule is present in every grammar, the engine can strip it on emission so it never shows up in the user output, and we control everything between bos and eos. That means we should be able to both emit the reasoning token strings to the chat and anchor generation at the message iteration boundary. Does that work or am i missing some nuance around the generation cache alignment?

In order to fit this PR the approach needs to be a bit different - i've added a function to ChatTemplate to determine IF you are pre-filling the generation with bos <[reasoning_start_token_id]> in Lark terms, and i'm modifying the grammar to start inside of the reasoning block instead of starting a new one.
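A rough sketch of that decision, for illustration only. `Prefill`, `detect_prefill`, `start_rule`, and the `reasoning_tail` rule name are hypothetical and are not the function added to ChatTemplate; the token ID 248069 is taken from the grammars posted earlier in this thread.

```rust
/// How the rendered prompt ends, from the grammar's point of view.
#[derive(Debug, PartialEq)]
pub enum Prefill {
    /// The template stopped right after the assistant header.
    AssistantHeader,
    /// The template already emitted the reasoning-start marker.
    InsideReasoning,
}

/// Detects whether the template pre-filled the reasoning-start marker.
pub fn detect_prefill(prompt: &str, reasoning_start: &str) -> Prefill {
    if !reasoning_start.is_empty() && prompt.trim_end().ends_with(reasoning_start) {
        Prefill::InsideReasoning
    } else {
        Prefill::AssistantHeader
    }
}

/// Hypothetical rule for the remainder of a reasoning block whose start marker
/// already sits in the prompt.
pub const REASONING_TAIL: &str = r#"reasoning_tail: think_text <[248069]> ("\n")?"#;

/// Picks a `start` rule so the reasoning-start token is never demanded twice.
pub fn start_rule(prefill: &Prefill) -> &'static str {
    match prefill {
        Prefill::AssistantHeader => "start: reasoning_block? ( text | tool_call )+ eos",
        Prefill::InsideReasoning => "start: reasoning_tail ( text | tool_call )+ eos",
    }
}
```

The point of the second variant is that the grammar begins after the marker the template already wrote, which avoids the double think block described earlier in the thread.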

@sempervictus

Contributor

this PR has an odd effect - it "bumps" initial generation by ~\t to the right which causes all sorts of fun with grammar masking.

@sempervictus

Contributor

@guoqingbao i think i've solved this in #265 with a slightly different direction for template adjustment without the offset problem

@guoqingbao

Owner Author

> this PR has an odd effect - it "bumps" initial generation by ~\t to the right which causes all sorts of fun with grammar masking.

Because it replaces reasoning headers with whitespace when sending to clients, given that the majority of AI agents won't strip the reasoning markers from their outputs.

@guoqingbao

Owner Author

> this PR has an odd effect - it "bumps" initial generation by ~\t to the right which causes all sorts of fun with grammar masking.

> Because it replaces reasoning headers with whitespace when sending to clients, given that the majority of AI agents won't strip the reasoning markers from their outputs.

It doesn't do that replacement for non-tool-call requests.

@guoqingbao

Owner Author

> @guoqingbao i think i've solved this in #265 with a slightly different direction for template adjustment without the offset problem

If you strip out the whitespace, it will cause a cache miss, because the number of tokens the client returns from previous turns will not match the current cache.
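A toy model of the alignment argument, under stated assumptions: token IDs 100 and 101 stand in for reasoning markers, 32 stands in for a whitespace placeholder token, and the cache is compared position by position. None of these helpers are from the PR.

```rust
/// Length of the shared leading run of two token sequences.
pub fn common_prefix_len(cached: &[u32], incoming: &[u32]) -> usize {
    cached.iter().zip(incoming).take_while(|(a, b)| a == b).count()
}

/// Replaces marker tokens with `space`, keeping the sequence length unchanged.
pub fn replace_markers(tokens: &[u32], markers: &[u32], space: u32) -> Vec<u32> {
    tokens
        .iter()
        .map(|t| if markers.contains(t) { space } else { *t })
        .collect()
}

/// Drops marker tokens entirely, which shifts every later token to the left.
pub fn strip_markers(tokens: &[u32], markers: &[u32]) -> Vec<u32> {
    tokens.iter().copied().filter(|t| !markers.contains(t)).collect()
}
```

In this model a one-for-one replacement keeps every later token at the same position, so the whole sequence still matches, whereas stripping makes the sequences diverge at the first removed marker.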

@sempervictus

sempervictus commented Mar 29, 2026

Contributor

Have been running that for a few hrs and no cache misses yet because there's no mismatch - the same tokens ('\n|) are generated, just by guidance forcing them instead of by the template. This is the guided grammar after stripping that tag, reproducing the same line so that prefix matches stay aligned:

...
2026-03-29T15:10:02.272392Z  INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3-Coder-Next-FP8, enforce_parser=none)
2026-03-29T15:10:02.273443Z  INFO vllm_rs::core::block_manager: Prefix cache hit seq 95 (33280 cached tokens, 520 blocks)
2026-03-29T15:10:02.277654Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 95 (cached 33280 tokens)
2026-03-29T15:10:02.277669Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 95 (cached 33280 tokens)
2026-03-29T15:10:02.277803Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 95 (cached 33280 tokens)
2026-03-29T15:10:02.277812Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 95 (cached 33280 tokens)
2026-03-29T15:10:02.773425Z  INFO vllm_rs::utils::guidance: GRAMMAR:
start: reasoning_block ( text | tool_call )+ eos
reasoning_block: <[151667]> "\n" think_text "\n" <[151668]> "\n"
think_text[suffix="\n"]: /[ -~]+/
tool_call: <[151657]> text <[151658]>
text: /(?s:.+?)/
eos:  ( <[151645]> <[151643]> )
...

That said: Qwen3.5 is full of "magic" - remember those <thinking> tags we caught the 80B CoderNext using? They're not in the Tokenizer.added_vocabulary but in the regular one and despite the CoderNext ChatTemplate not having an actual reasoning section it does reason by itself somehow utilizing those non-special tokens like it would use <function=... (also in the regular vocab). The chat templates i've looked at for the 3.5s i have (0.8->122) do seem to have those sections but still i have caught them using <thinking> of their own volition from the regular vocab as well. Something more clever than the usual special-tokens-bounded reasoning block is going on here though the current approach in #265 does work over iterative conversation (prefix cache matching).

Just to be sure though: could you possibly point me to the logic we are trying to ensure re aligning generation? I'm working off the premise that you mean "prefix cache alignment for matching" in which case everything works correctly because i'm stripping that start-reasoning line from the template at the last moment and generating the same tokens in Guidance so all of the prefix block accounting should be accurate without any change (same number of tokens emitted in same position as what was accounted for in the template before my excision).
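The excision described here can be sketched as follows. `trim_reasoning_prefill` is a hypothetical name, and operating on the prompt string with a single `strip_suffix` is an assumption; the newline before the marker is deliberately left in place.

```rust
/// If the rendered prompt ends with the reasoning-start string (ignoring
/// trailing whitespace), returns the prompt without that one marker so the
/// grammar can force it instead. Returns None when there is nothing to trim.
pub fn trim_reasoning_prefill(prompt: &str, start_str: &str) -> Option<String> {
    if start_str.is_empty() {
        return None;
    }
    prompt
        .trim_end()
        // Removes exactly one occurrence; repeated markers are left alone.
        .strip_suffix(start_str)
        .map(|p| p.to_string())
}
```

Whether the regenerated marker and newline land on the same token positions as the trimmed ones depends on the tokenizer, which this string-level sketch does not check.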

I have a branch of #265 and the current state of #279 together (it's a 1-line change in engine.rs to merge them, but i can throw it in my GH if you'd like a precanned one to test), but whatever is happening in this branch to cause that \t-sized offset at the start of generation seems to offset the masking position for guidance as well.

@guoqingbao

Owner Author

> Have been running that for a few hrs and no cache misses yet because there's no mismatch - the same tokens ('\n|) are generated, just by guidance forcing them instead of by the template. This is the guided grammar after stripping that tag, reproducing the same line so that prefix matches stay aligned:

That's a decoding-cache miss, which isn't reported; it means each request can only reuse the previous prompt cache. That's why I added this PR.

@sempervictus

Contributor

I see, so that is different from prefix cache? Will dig in on the merge branch for this and #265 since that now seems to be stable with and without reasoning levels set (testing some cleanup work to push presently)

@sempervictus

Contributor

According to the 122B with thinking enabled via grammar:

2026-03-29T18:39:43.830675Z  WARN vllm_rs::server::parser: Tool start token IDs corrected from tokenizer for model Qwen3VL: {248058}
2026-03-29T18:39:43.830708Z  WARN vllm_rs::server::parser: Tool end token IDs corrected from tokenizer for model Qwen3VL: {248059}
2026-03-29T18:39:43.830936Z  WARN vllm_rs::server::server: Tools enabled for request
2026-03-29T18:39:43.838353Z  INFO vllm_rs::core::engine: [llg] Guidance enabled, trimming <think> from pre-generation. Generation starting at <|im_start|>assistant
2026-03-29T18:39:43.988069Z  WARN vllm_rs::core::engine: [Stream] New request [Seq_id 27, 191202 tokens] received! (session_id: None)
2026-03-29T18:39:43.988159Z  INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3.5-122B-A10B-FP8, enforce_parser=none)
2026-03-29T18:39:43.990732Z  INFO vllm_rs::core::block_manager: Prefix cache hit seq 27 (186624 cached tokens, 2916 blocks)
...
2026-03-29T18:39:44.002845Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 27 (cached 186624 tokens)
2026-03-29T18:39:45.366038Z  INFO vllm_rs::utils::guidance: GRAMMAR:
start: reasoning_block? ( text | tool_call )+ eos
reasoning_block: <[248068]> "\n" think_text "\n" <[248069]> ("\n")?
think_text[suffix="\n"]: /[ -~]+/
tool_call: <[248058]> text <[248059]>
text: /(?s:.+?)/
eos:  ( <[248046]> <[248044]> )

in the merged state, that PR and this one are complementary:

Analysis: Marker Revision Intention and LLG Grammar Coherency Alignment

1. Intention of Marker Revision (Without Emitting Marker Strings)

The marker revision mechanism in #279 serves a coherency preservation purpose:

Core Problem:

  • When caching prompts with reasoning markers (like <think>, </think>), the cache must maintain exact token sequences
  • However, if the same logical prompt is re-generated with slightly different whitespace or marker placement, the cache becomes ineffective
  • The revision system detects when a cached prompt's suffix matches the current request but differs only in reasoning marker representation

Solution Without Breaking Chat Stream:

  • The try_revise_reasoning_markers function performs token-level normalization, not string replacement
  • It identifies reasoning marker tokens and replaces them with a consistent "space token" placeholder
  • This ensures that logically equivalent prompts (same content, different marker formatting) map to the same cache entry
  • The actual marker strings are never emitted into tools, chat, or thinking blocks - only their token IDs are tracked and normalized

Key Mechanism:

Original: [prefix tokens] + [marker_start] + [content] + [marker_end] + [suffix]
Revised:  [prefix tokens] + [space_token] + [content] + [space_token] + [suffix]

The revision happens at the token ID level before caching, ensuring coherency without exposing marker strings to downstream parsers.
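A minimal sketch of the token-level replacement described above, not the actual try_revise_reasoning_markers implementation. The function name, signature, and the space-token ID used in the usage note are assumptions for illustration; the marker IDs 248068 and 248069 are the ones that appear in the reasoning_block rule of the grammar log.

```rust
/// Hypothetical stand-in for the marker revision step: every token ID listed in
/// `markers` is replaced by `space_token`; all other IDs are copied unchanged,
/// so the sequence length and the positions of the content tokens are preserved.
fn revise_reasoning_markers(tokens: &[u32], markers: &[u32], space_token: u32) -> Vec<u32> {
    tokens
        .iter()
        .map(|&t| if markers.contains(&t) { space_token } else { t })
        .collect()
}
```

For example, with an assumed space-token ID of 220, `revise_reasoning_markers(&[1, 248068, 7, 8, 248069, 2], &[248068, 248069], 220)` yields `[1, 220, 7, 8, 220, 2]`. Because the replacement is one-for-one, the prefix and suffix tokens keep their positions.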


2. Alignment with Coherency Concerns in src/core/engine.rs:1210-1219

The code section you referenced demonstrates LLG grammar-aware prompt trimming:

if params.grammar.is_some() {
    if let Some((start_str, _end_str)) = get_reasoning_token_strings(&self.guidance_tokens, &self.tokenizer) {
        if prompt.trim().ends_with(&start_str) {
            let prompt = prompt.trim().trim_end_matches(&start_str).to_string();
            log_info!("[llg] Guidance enabled, trimming {} from pre-generation...", &start_str);
            return (prompt, image_idx, replay)
        }
    }
}
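The string handling in that snippet can be tried in isolation. The following is a standalone sketch, assuming the start marker is the literal string <think> (as the "[llg] Guidance enabled" log line suggests); trim_reasoning_start is a hypothetical helper, not a function from the repository.

```rust
/// Returns the prompt with a trailing reasoning start marker removed, or None
/// when the (whitespace-trimmed) prompt does not end with the marker.
/// Mirrors the trim().ends_with() / trim_end_matches() calls quoted above.
fn trim_reasoning_start(prompt: &str, start_str: &str) -> Option<String> {
    let trimmed = prompt.trim();
    // Guard against an empty marker, which every string "ends with".
    if !start_str.is_empty() && trimmed.ends_with(start_str) {
        // trim_end_matches strips *repeated* trailing matches, not just one.
        Some(trimmed.trim_end_matches(start_str).to_string())
    } else {
        None
    }
}
```

Two details are worth noting. Whitespace that precedes the marker (such as the newline after `<|im_start|>assistant`) survives, because only the outer `trim()` removes whitespace. And `trim_end_matches` removes every consecutive trailing copy of the marker, whereas `str::strip_suffix` would remove exactly one.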

How This Aligns with Marker Revision Coherency:

| Concern | LLG Grammar Trimming (lines 1210-1219) | Marker Revision System |
| --- | --- | --- |
| Purpose | Remove reasoning start marker when grammar is active | Normalize reasoning markers across cache boundaries |
| Trigger | params.grammar.is_some() + prompt ends with marker | Cache hit with marker token mismatch |
| Action | Trim marker string from prompt before generation | Replace marker tokens with space tokens in cache |
| Coherency Impact | Ensures grammar-aligned prompts don't duplicate cache entries | Ensures logically equivalent prompts share cache entries |

Critical Alignment Point:

Both mechanisms address the same coherency problem from different angles:

  1. LLG Grammar Trimming (pre-generation):

    • Detects when a reasoning marker (<think>) appears at prompt end
    • Removes it because the guidance grammar will handle marker insertion
    • Prevents redundant marker storage in the generated prompt
  2. Marker Revision (post-caching):

    • Detects when cached prompts have marker token variations
    • Normalizes them to space tokens for consistent cache lookup
    • Enables cache reuse across logically equivalent prompts with different marker representations

Why Both Are Necessary:

Without LLG grammar trimming: Prompts would store marker strings that the grammar system would re-insert, causing cache pollution.

Without marker revision: Cache lookups would fail for logically identical prompts that happen to have different marker token sequences due to template variations.

The Coherency Guarantee:

Together, these mechanisms ensure:

  • prompt + <think> → trimmed to prompt before caching (LLG grammar)
  • prompt in cache → normalized marker tokens enable cross-request hits (marker revision)
  • Result: Maximum cache utilization while maintaining grammatical correctness

3. Data Flow: How Marker Revision Preserves Coherency

flowchart LR
    A[Incoming Request] --> B{Has Grammar?}
    B -->|Yes| C[Check for Reasoning Marker Suffix]
    C --> D{Marker Present?}
    D -->|Yes| E[Trim Marker - LLG Grammar Path]
    D -->|No| F[Normal Processing]
    E --> G[Cache Prompt Without Redundant Marker]
    F --> H{Cache Hit?}
    H -->|Yes| I[Apply Marker Revision Normalization]
    H -->|No| J[Store with Original Tokens]
    I --> K[Return Coherent Cache Entry]
    J --> K
    K --> L[Generation with Grammar-Aligned Markers]

Key Insight: The marker revision system doesn't just "replace" markers - it normalizes the token representation so that:

  • Cache lookups succeed for logically equivalent prompts
  • Grammar insertion happens at the correct token boundaries
  • No duplicate marker storage occurs

This is why the revision happens at the token ID level rather than string level - it preserves the exact boundary conditions that the guidance grammar expects.


Summary

The marker revision mechanism in #279 ensures cache coherency across reasoning marker variations by normalizing marker tokens to space tokens at the token-ID level. This aligns perfectly with the LLG grammar trimming logic in src/core/engine.rs:1210-1219, which removes redundant markers before caching. Together, they guarantee:

  1. No marker duplication - LLG grammar trimming prevents storing markers that will be re-inserted
  2. Maximum cache reuse - Marker revision enables cross-request hits despite template variations
  3. Grammar alignment - Both mechanisms preserve the exact token boundaries that guidance grammars expect

The system works because marker revision operates on token IDs, not strings, ensuring that the coherency guarantees hold even when different templates produce logically identical prompts with different marker representations.

So... how do we want to handle merge ordering? Merge this one first and I realign #265 over it, or should I push the merged branch into #265 for your review?

@sempervictus

Contributor

BTW, we're basically 1 push away from having full BOS->EOS control over all rendering, which should give you explicit cache-state matching control since you can generate cache-state UUIDs forcibly at the start and omit them from transmission to the user (internal watermarking, effectively). That trick should also empower the /responses API (#26).

The correct version of such an impl would require me to actually extract the string from the template right after the BOS to handle stuff like tool-response-role gracefully, but I can throw together a commit which we know for a fact will work on the current qwen models as a test case and work our way to idiomatic coverage for all models from there.

@sempervictus

sempervictus commented Mar 30, 2026

Contributor

Unfortunately this branch breaks expectations of OpenAI API clients which expect the think tags to actually be sent through from the server - reasoning sections come through formatted differently but not within reasoning blocks as expected by the client:

image

or the "more fun" version from the 122B (the runaway region is what would be text on the start: line in #265):
image

Since we can pin those tags now in the grammar, is there still a reason to do the masking/replacement piece for cache coherence?

@guoqingbao

Owner Author

Unfortunately this branch breaks expectations of OpenAI API clients which expect the think tags to actually be sent through from the server

That's not a break. It simply emits reasoning content as normal output so that all clients can render it normally. Otherwise, certain agents like opencode and claude code render the reasoning tags visually (which is annoying), and sending the tags also breaks the cache (the chat template can strip out reasoning parts).

@guoqingbao

Owner Author

Since we can pin those tags now in the grammar, is there still a reason to do the masking/replacement piece for cache coherence?

It's unrelated to the grammar; it's about server and client interaction. If you send reasoning tags, the client sends them back, and the chat template then removes the previous reasoning parts. That leaves a hole where the KV cache no longer matches, so the decoding cache cannot be reused.
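To illustrate the mismatch, here is a small sketch (hypothetical helper, made-up token IDs apart from the 248068/248069 reasoning markers seen in the grammar log) that measures how much of a cached token sequence a follow-up request can reuse. It assumes cache reuse is limited to the longest common token prefix.

```rust
/// Length of the longest common prefix of two token sequences, i.e. the part
/// of a cached sequence that a new request could reuse under this assumption.
fn common_prefix_len(cached: &[u32], incoming: &[u32]) -> usize {
    cached
        .iter()
        .zip(incoming.iter())
        .take_while(|(a, b)| a == b)
        .count()
}
```

Suppose the cached sequence is `[10, 11, 12, 248068, 50, 51, 248069, 60, 61]` (prompt, reasoning block, answer). If the next turn arrives with the reasoning stripped by the template, `[10, 11, 12, 60, 61, 70, 71]`, only the first 3 tokens match. If the reasoning had been kept verbatim, `[10, 11, 12, 248068, 50, 51, 248069, 60, 61, 70, 71]`, all 9 cached tokens would match.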

@guoqingbao

Owner Author

Related anomalyco/opencode#11439

@guoqingbao

Owner Author

It seems we need to keep the reasoning tags in their original format when sending to the client. It's an opencode bug: all of the popular inference frameworks, including vLLM and sglang, send raw reasoning markers to the client, and the same issue is found with opencode. I may revert the reasoning tag replacement logic @sempervictus

@guoqingbao

Owner Author

Finished in #281

@guoqingbao guoqingbao closed this Mar 30, 2026
@guoqingbao guoqingbao deleted the rewrite_chat_template branch April 18, 2026 06:34