Rewrite special markers in prompts to fix cache mismatch by guoqingbao · Pull Request #279 · guoqingbao/xinfer

guoqingbao · 2026-03-27T16:08:35Z

No description provided.

sempervictus · 2026-03-28T02:49:59Z

So ... this is interesting because with this grammar

vllm-rs-svc2  | 2026-03-28T00:50:53.154035Z  INFO vllm_rs::utils::guidance: GRAMMAR:
vllm-rs-svc2  | start: ( text | tool_call )+ eos
vllm-rs-svc2  | tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
vllm-rs-svc2  | tool_0: "<function=fetch_url_via_curl>"  "\n" param_0_0 ( "\n" param_0_1)? "</function>" "\n"
vllm-rs-svc2  | tool_content: tool_0 | tool_1 | tool_2 | tool_3 | tool_4
vllm-rs-svc2  | tool_1: "<function=fs_cat>"  "\n" param_1_0 "</function>" "\n"
vllm-rs-svc2  | tool_2: "<function=fs_ls>"  "\n" param_2_0 "</function>" "\n"
vllm-rs-svc2  | tool_3: "<function=get_current_time>"  "</function>" "\n"
vllm-rs-svc2  | tool_4: "<function=web_search_searxng>"  "\n" param_4_0 ( "\n" param_4_1)? "</function>" "\n"
vllm-rs-svc2  | param_0_0: "<parameter=url>" "\n" value_0_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_0_1: "<parameter=proxy>" "\n" value_0_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_1_0: "<parameter=path>" "\n" value_1_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_2_0: "<parameter=path>" "\n" value_2_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_0: "<parameter=query>" "\n" value_4_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_1: "<parameter=searxng>" "\n" value_4_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | text: /(?s:.+?)/
vllm-rs-svc2  | value_0_0_string: %json {"type":"string","description":"The URL to scrape."}
vllm-rs-svc2  | value_0_1_string: %json {"type":"string","description":"The proxy URL in the format protocol://host:port."}
vllm-rs-svc2  | value_1_0_string: %json {"type":"string","description":"The path of the file to read"}
vllm-rs-svc2  | value_2_0_string: %json {"type":"string","description":"The path of the directory to list"}
vllm-rs-svc2  | value_4_0_string: %json {"type":"string","description":"The query to search for."}
vllm-rs-svc2  | value_4_1_string: %json {"type":"string","description":"Optional searxng URL overriding the env var"}
vllm-rs-svc2  | eos:  ( <[248046]> <[248044]> )

we should not have <think> tags generated so it seems to generate "offset" from the intended start:

$ aichat -m g62 --role %research% research vllm.rs on the web
<think>
The user is asking me to research vllm.rs on the web. This is a research request about a specific project named vllm.rs. Let me search for information about this project.

I should use the web_search_searxng function to research this project. This will help me find information about vllm.rs, its developers, purpose, and what it does.
 
I'll start by searching for information about vllm.rs to understand what this project is about.
 
I'll make the tool call to search for information about vllm.rs.
 
I'll search for information about vllm.rs to understand what this project is about.
```tool_call

<function=web_search_searxng>
<parameter=query>
"vllm.rs"
</parameter>
</function>

Call web_search_searxng {"query":"vllm.rs"}
Processing 4096 tokens per chunk
<think>
I found information about vLLM.rs. Let me summarize the key findings:

1. vLLM.rs is a minimal, high-performance LLM inference engine implemented in Rust
2. It's a lightweight implementation of vLLM
3. Published on PyPI as vllm-rs
4. The project is maintained by guoqingbao
5. It supports CUDA and is cross-platform
6. It's designed for high-throughput and memory-efficient inference

I should provide a comprehensive overview of this project.

<function=web_search_searxng>
<parameter=query>
"vLLM.rs"
</parameter>
</function>

Call web_search_searxng {"query":"vLLM.rs"}
Processing 4096 tokens per chunk
<think>
I need to gather more specific information about vLLM.rs. Let me search for details about the implementation, architecture, and key components.


<function=web_search_searxng>
<parameter=query>
"vLLM.rs"
</parameter>
</function>

Call web_search_searxng {"query":"vLLM.rs"}

This PR is very educational, thank you - starting to see an angle of this i missed before re alignment with PromptReplay

sempervictus · 2026-03-28T03:24:09Z

BTW i'm trying to stuff all special token strings into SpecialTokens, partially to ease agent development. We've come a long way toward being able to emit special token strings but it's not perfect and it gets worse in tool calls. Idea here is to contain them all to one file - thoughts?

guoqingbao · 2026-03-28T04:03:21Z

BTW i'm trying to stuff all special token strings into SpecialTokens, partially to ease agent development. We've come a long way toward being able to emit special token strings but it's not perfect and it gets worse in tool calls. Idea here is to contain them all to one file - thoughts?

I didn't escape the special tokens within the main content, so it might have problems dealing with tool-calling-related topics.

sempervictus · 2026-03-28T14:53:12Z

i've got q3n coder trying to adapt the grammars PR to this - never tried to have it keep up with an in-progress one before, effect is pure comedy. 😁

Looking to suffix= notation in Lark to try and get these thinking run-aways under control

start: reasoning_block? ( text | tool_call )+ eos
reasoning_block: <[248068]> think_text <[248069]> ("\n")?
think_text[suffix="<[248069]>"]: /(?s:.+?)/
tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
tool_0: "<function=get_weather>"  "\n" param_0_0 "</function>" "\n"
tool_content: tool_0
param_0_0: "<parameter=city>" "\n" value_0_0_string "\n" "</parameter>" "\n"
text: /(?s:.+?)/
value_0_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
eos:  ( <[248046]> <[248044]> )

is the current testing thread (will push soon as i can determine if it breaks anything else)
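
A minimal sketch of how the per-parameter rules in the grammar above could be rendered, assuming a hypothetical helper `render_param_rules` (not a function from vllm.rs or llguidance). It only reproduces the two rule shapes shown in this comment: the `param_T_P` envelope and the `value_T_P_string` rule carrying the `suffix=` attribute.

```rust
/// Render the Lark rules for parameter `p` of tool `t`.
/// The rule shapes are copied from the grammar in this thread; the
/// function name and signature are illustrative assumptions.
pub fn render_param_rules(t: usize, p: usize, name: &str) -> String {
    let mut out = String::new();
    // Envelope: <parameter=NAME> newline VALUE newline </parameter> newline
    out.push_str(&format!(
        "param_{}_{}: \"<parameter={}>\" \"\\n\" value_{}_{}_string \"\\n\" \"</parameter>\" \"\\n\"\n",
        t, p, name, t, p
    ));
    // Free-text value, bounded by the closing tag via the suffix attribute.
    out.push_str(&format!(
        "value_{}_{}_string[suffix=\"\\n</parameter>\"]: /(?s:.+?)/\n",
        t, p
    ));
    out
}
```

Calling `render_param_rules(0, 0, "city")` yields the `param_0_0` and `value_0_0_string` lines of the grammar above, one per line.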

sempervictus · 2026-03-28T16:04:06Z

@guoqingbao so we definitely need to stabilize the chat template generation for the grammar constraint application to work correctly because grammar constrains everything which the model outputs from start: and where the model starts generating after the chat template matters a lot. The run-on effects i'm seeing seem to happen because of positional inconsistency relative to start: 😄.

start: reasoning_block? ( text | tool_call )+ eos
reasoning_block: <[248068]> think_text <[248069]> ("\n")?
think_text[suffix="<[248069]>"]: /(?s:.+?)/
tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
tool_0: "<function=get_weather>"  "\n" param_0_0 "</function>" "\n"
tool_content: tool_0
param_0_0: "<parameter=city>" "\n" value_0_0_string "\n" "</parameter>" "\n"
text: /(?s:.+?)/
value_0_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
eos:  ( <[248046]> <[248044]> )

means that if the chat template already has a reasoning_block equivalent then we end up double-generating a think-block, presuming it's not marked optional with ? (anything above ReasoningEffort::Low right now is non-optional)

When the template already provides think tags, the reasoning grammar can't work because those are already in-place. We will likely need dynamic chat template composition to really make that work but in order to implement that we first need a "this works out of the box for everything" approach to the template and build the composition logic atop that solid core.

Alternatively we scrap trying to make the chat template pre-render this and use LLG's fast-tokens from the <[bos_token_id]>assistant\n point all the way out to eos to enable/disable reasoning, permit tool call envelopes if tools are provided, and permit normal generation out to eos. I think using LLG for this is cleaner but then again i've been in that code for weeks and you're far more familiar with this part.

sempervictus · 2026-03-28T16:47:07Z

Pondering this a bit... i think i have a solution to address all code paths:

What you're doing in this branch handles the non-grammar-driven generation (lets pretend --feature guidance is a thing and we turn it off)
Grammar generation is location-specific and therefore should use what you are doing here to decompose the template for everything after the last EOS (<|im_end|> equiv).
Using the decomposed sections it should then constrain everything from start to eos because of the positional accuracy problem: an offset write can spin the bitmask causing run-on generation, incorrect tool-call format, etc and i think this would have caused rollbacks in my original LLG code which was pedantically checking logit-per-token in the position of the bitmask 🤦

Somewhat separately: we need to address the tool-call instruction blocks and format shown to the model in its system prompt when we use --enforce-parser because we can run into really weird things around ~8X YARN scale wherein a model which has grammar enforcement to generate JSON-style but a template to generate XML style will terminate the JSON format with </function>\n</parameter>\n<[tool_end_token_id]> resulting in "spillage" of that XML into the chat.

Ideally we make system prompt separate from prefix cache because there are various bots which will change up the tools available per request and while the grammar-gen handles that gracefully it causes a full refill of the KV. Ditto having an IDE change out the system prompt by changing the "role" of the agent running in it.
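
To illustrate why a changed system prompt forces a full KV refill, here is a minimal sketch of chained block hashing. It is an assumption about how a block-based prefix cache can work, not a description of vllm.rs's block manager. The names `chained_block_hashes` and `shared_prefix_blocks` are hypothetical, and the 64-token block size is chosen to be consistent with the "Prefix cache hit" log lines in this thread (33280 tokens in 520 blocks).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assumed block size (33280 cached tokens / 520 blocks in the logs).
pub const BLOCK_TOKENS: usize = 64;

/// Hash each full block together with the hash of the block before it,
/// so a block's identity depends on every token that precedes it.
pub fn chained_block_hashes(tokens: &[u32]) -> Vec<u64> {
    let mut hashes = Vec::new();
    let mut prev: u64 = 0;
    for block in tokens.chunks_exact(BLOCK_TOKENS) {
        let mut h = DefaultHasher::new();
        prev.hash(&mut h);
        block.hash(&mut h);
        prev = h.finish();
        hashes.push(prev);
    }
    hashes
}

/// Number of leading blocks that two token sequences share.
pub fn shared_prefix_blocks(cached: &[u32], incoming: &[u32]) -> usize {
    chained_block_hashes(cached)
        .iter()
        .zip(chained_block_hashes(incoming).iter())
        .take_while(|(a, b)| a == b)
        .count()
}
```

Under this scheme a single changed token in the first block (where the system prompt and tool list live) changes every later hash, so nothing after it can be reused; a change near the end only invalidates the last blocks.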

guoqingbao · 2026-03-28T16:52:00Z

Ideally we make system prompt separate from prefix cache because there are various bots which will change up the tools available per request and while the grammar-gen handles that gracefully it causes a full refill of the KV. Ditto having an IDE change out the system prompt by changing the "role" of the agent running in it.

I updated using a different approach.

sempervictus · 2026-03-28T16:55:15Z

This is the Heisenberg branch :-p. Reading but i actually think what i proposed above still flies (might take a bit more thinking to implement): if we control bos->eos in our generation, there can be no offset problem because all output is encapsulated.

guoqingbao · 2026-03-28T16:59:06Z

This is the Heisenberg branch :-p. Reading but i actually think what i proposed above still flies (might take a bit more thinking to implement): if we control bos->eos in our generation, there can be no offset problem because all output is encapsulated.

The difficulty lies in aligning with the decoding cache, even under reasoning. The latest approach seems to be working well.

sempervictus · 2026-03-28T17:01:30Z

Will definitely try to get my approach aligned with the current one.

BTW, before you force-pushed i had the coder implement a "comodal approach" between this branch and #265 which produces:

vllm-rs-svc2  | 2026-03-28T16:58:22.020348Z  INFO vllm_rs::utils::guidance: GRAMMAR:
vllm-rs-svc2  | start: ( text | tool_call )+ eos
vllm-rs-svc2  | tool_content: tool_0 | tool_1 | tool_2 | tool_3 | tool_4
vllm-rs-svc2  | tool_0: "<function=fetch_url_via_curl>"  "\n" param_0_0 ( "\n" param_0_1)? "</function>" "\n"
vllm-rs-svc2  | tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
vllm-rs-svc2  | tool_1: "<function=fs_cat>"  "\n" param_1_0 "</function>" "\n"
vllm-rs-svc2  | tool_2: "<function=fs_ls>"  "\n" param_2_0 "</function>" "\n"
vllm-rs-svc2  | tool_3: "<function=get_current_time>"  "</function>" "\n"
vllm-rs-svc2  | tool_4: "<function=web_search_searxng>"  "\n" param_4_0 ( "\n" param_4_1)? "</function>" "\n"
vllm-rs-svc2  | param_0_0: "<parameter=url>" "\n" value_0_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_0_1: "<parameter=proxy>" "\n" value_0_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_1_0: "<parameter=path>" "\n" value_1_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_2_0: "<parameter=path>" "\n" value_2_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_0: "<parameter=query>" "\n" value_4_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_1: "<parameter=searxng>" "\n" value_4_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | text: /(?s:.+?)/
vllm-rs-svc2  | value_0_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_0_1_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_1_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_2_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_4_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_4_1_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | eos:  ( <[248046]> <[248044]> )

with an 0.8 doing this:

$ aichat -m g62 --role %research% research and report on llguidance rust constraint mechanics
Call web_search_searxng {"query":"llguidance rust constraint mechanics"}
Processing 4096 tokens per chunk

LLM Constraint Mechanics: Technical Analysis

Executive Summary

The Large Language Model (LLM) constraint ecosystem operates under a complex triad of architectural, computational, and physical constraints that govern deployment feasibility, performance metrics, and security implications. This analysis examines the interplay between architectural design principles, computational resource limitations, and physical infrastructure constraints within the GPUaaS (General Purpose AI Service) context.

1. Architectural Constraint Framework

1.1. Computational Resource Constraints

Token Generation Limits: Models face hard limits on context window size, inference latency, and token generation throughput.
Memory Boundaries: Memory constraints prevent large-scale model training and inference at arbitrary scale.
Latency Requirements: Real-time applications require millisecond-level latency with tight time windows.

1.2. Physical Infrastructure Constraints

Power Consumption: GPUs consume significant power; thermal throttling affects operational limits.
Cooling Capacity: Air cooling requires adequate airflow, limiting concurrent processing capacity.
Hardware Availability: GPU availability varies by geographic region and network connectivity.

2. Security Implications

2.1. Access Control Mechanisms

Service Model Restrictions: IaaS, PaaS, and SaaS have distinct access control requirements based on service delivery models.
Zero Trust Architecture: Required for cloud security to prevent lateral movement within the AI infrastructure.
Least Privilege Principle: Application-level access controls minimize attack surface while maintaining service continuity.

2.2. Threat Modeling Considerations

Privileged Account Risks: High-risk accounts may enable privilege escalation.
Network Exposure: Exposure of internal networks increases attack vectors.
Data Integrity Protection: Encryption at rest and in transit is critical for sensitive data.

3. Operational Design Considerations

3.1. Cost Optimization Strategies

Hardware Selection: Balancing GPU type, size, and cooling requirements.
Power Management: Implementing power-saving modes to reduce energy consumption.
Resource Allocation: Dynamic resource allocation based on workload demands.

3.2. Disaster Recovery Considerations

Failover Mechanisms: Multi-region deployments with automatic failover capabilities.
Data Backup Strategies: Regular backups with automated restoration procedures.
Recovery Time Objective (RTO): Defined targets for recovery periods.

4. Security Best Practices

4.1. Network Security

Firewall Configurations: Strategic firewall rules to prevent unauthorized access.
Segmentation: Isolate critical infrastructure from public networks.
Microsegmentation: Limit lateral movement within the network structure.

4.2. Data Protection

Encryption Standards: Implementing encryption at rest and in transit.
Access Control: Role-based access control (RBAC) for sensitive data handling.
Audit Logging: Comprehensive logging of all security events for compliance.

4.3. Vulnerability Management

Patch Management: Regular patch deployments to maintain system integrity.
Code Review: Automated code review pipelines to detect vulnerabilities.
Penetration Testing: Continuous security assessments to identify gaps.

5. Critical Infrastructure Considerations

5.1. Physical Datacenter Constraints

Cooling Systems: Liquid cooling is a primary constraint requiring specialized equipment.
Power Distribution: Proper power management systems are essential for stable operation.
Environment Control: Temperature and humidity control within data centers.

5.2. Energy Efficiency Goals

Green Computing Initiatives: Reducing energy consumption through hardware and software optimization.
Carbon Footprint Reduction: Targeted initiatives to minimize environmental impact.

6. Security Architecture Recommendations

6.1. Containerization Approach

Docker/Kubernetes: Standardize infrastructure across deployments.
Service Discovery: Enable discovery of services across the cluster.
Isolation: Prevent cross-container security breaches.

6.2. Network Security Layers

Security Groups: Define precise allowed ports and IP addresses.
Network Segmentation: Create logical boundaries for different service domains.
Traffic Monitoring: Real-time monitoring of network traffic patterns.

6.3. Threat Defense Strategies

SIEM Integration: Centralized security information and event management.
Behavioral Analysis: Detecting anomalous activity patterns.
Automated Response: Rapid response mechanisms to detected threats.

7. Conclusion

The LLM constraint mechanics represent a sophisticated ecosystem where architectural decisions directly impact security posture, operational efficiency, and long-term viability. As cloud infrastructure continues to evolve with AI workloads, understanding these constraints is essential for architects and engineers designing secure, efficient, and scalable solutions.

Citations:

CSA Security Guidance for Cloud Computing (2024)
InterGlobix Magazine article on AI data center expansion
Google Cloud Infrastructure security design documentation
NIST SP 800-210 Access Control Guidance
rage@unknown: vllm.rs(grammars/pr)$ aichat -m g62 --role %research% research and report on llguidance rust constraint mechanics
Call web_search_searxng {"query":"llguidance rust constraint mechanics"}
Processing 4096 tokens per chunk

llguidance Rust Constraint Mechanics

Core Architecture

llguidance is a high-performance, Rust-based library for constrained decoding of Large Language Models (LLMs). It implements Earley parsers for efficient constrained generation.

Key Characteristics:

Token Masking: The library pre-computes token masks for all automaton states, significantly reducing sampling overhead
Startup Cost: Minimal - no upfront token mask computation
Performance: Approximately 50μs per token for 128k tokenizers with typical models

The Constraint API

The primary entry point is the Constraint struct:

// Two-step sampling loop:
let allowed = constraint.compute_mask();
let consumed = constraint.commit_token();

compute_mask() returns the set of allowed tokens at each state.
commit_token() advances the grammar state after sampling.

Implementation Details

Rust Crate Structure

The main crate follows a clean, documented interface:

pub use guidance_ai::llguidance::{
    Constraint, Matcher, TokenParser, TopLevelGrammar, TopLevelTokenGenerator,
};

Token Generation Flow

ParserFactory constructs an Earley parser from the grammar
Constraint computes token masks for all states
TokenGenerator samples tokens using the mask
Commit advances the parser state

Example Pattern

fn generate_with_constraints(input: &str) -> String {
    let grammar = parse_grammar(&input);
    let parser = ParserFactory::from_llguidance_json(grammar);
    let constraint = Constraint::new();
    
    let mut tokenizer = TokenGenerator::new(parser);
    let result = tokenizer.generate(input);
    constraint.commit_token();
    result
}

Architecture Layers

Low-Level: llguidance crate (Rust parser implementation)
Intermediate: guidance-ai/guidance crate (HTML/XML generation wrapper)
Higher-Level: guidance-ai/llguidance crate (main constrained decoding engine)

Performance Metrics

Metric	Value
Token Masking Cost	~50μs per token
Startup Cost	Minimal/no mask computation
Memory Overhead	Low - masks precomputed for all states
CPU Time	50μs × tokens

Key Design Decisions

State-Sensitive Masking: Masks are computed per automaton state, allowing fine-grained control over which tokens are allowed at each point in the generation process.
Deterministic Sampling: Uses Earley parser semantics to ensure deterministic sampling across different runs.
Memory Efficiency: Pre-computes masks only when needed, avoiding unnecessary data structures during sampling.
Efficient State Management: The parser maintains a compact state representation that minimizes memory usage while preserving parsing correctness.

Limitations and Considerations

Constraint Complexity: Complex constraints may impact sampling efficiency
Startup Overhead: While minimal, there's still some initial setup cost
Token Masking Cost: 50μs per token is significant for large models (>1M tokens)
Memory Requirements: Requires sufficient RAM for token masks

Comparison with Alternatives

Library	Rust Support	Token Masking	Startup Cost	Memory
llguidance	✅	Yes (~50μs/tokens)	Minimal	Low
LLInterpreter	❌	No	High	High
FFI	❌	No	High	High

llguidance offers the best balance of performance, memory efficiency, and Rust compatibility for constrained decoding applications.

For detailed documentation, consult the official Rust API at docs.rs/llguidance

sempervictus · 2026-03-29T04:05:18Z

@guoqingbao - a possible solution for all of this:

vllm-rs-svc2  | start: bos ( text | tool_call )+ eos
vllm-rs-svc2  | tool_content: tool_0 ? tool_1 ? tool_2 ? tool_3 ? tool_4
vllm-rs-svc2  | tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
vllm-rs-svc2  | tool_0: "<function=fetch_url_via_curl>"  "\n" param_0_0 ( "\n" param_0_1)? "</function>" "\n"
vllm-rs-svc2  | tool_1: "<function=fs_cat>"  "\n" param_1_0 "</function>" "\n"
vllm-rs-svc2  | tool_2: "<function=fs_ls>"  "\n" param_2_0 "</function>" "\n"
vllm-rs-svc2  | tool_3: "<function=get_current_time>"  "</function>" "\n"
vllm-rs-svc2  | tool_4: "<function=web_search_searxng>"  "\n" param_4_0 ( "\n" param_4_1)? "</function>" "\n"
vllm-rs-svc2  | param_0_0: "<parameter=url>" "\n" value_0_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_0_1: "<parameter=proxy>" "\n" value_0_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_1_0: "<parameter=path>" "\n" value_1_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_2_0: "<parameter=path>" "\n" value_2_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_0: "<parameter=query>" "\n" value_4_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_1: "<parameter=searxng>" "\n" value_4_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | text: /(?s:.+?)/
vllm-rs-svc2  | value_0_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_0_1_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_1_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_2_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_4_0_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | value_4_1_string[suffix="\n</parameter>"]: /(?s:.+?)/
vllm-rs-svc2  | eos:  ( <[248046]> <[248044]> )
vllm-rs-svc2  | bos: <[248045]> "assistant\n"

with the reasoning version being ~

vllm-rs-svc2  | start: bos reasoning_block? ( text | tool_call )+ eos

wherein we always set add_generation_prompt to false and use the <eos>\n of the last message as the generation boundary. If that is present in every grammar, the engine can strip it on emission from showing up in the user output and we can control everything between bos and eos meaning we should be able to both emit those token strings for reasoning to the chat and anchor generation by message iteration boundary. Does that work or am i missing some nuance around the generation cache alignment?

In order to fit this PR the approach needs to be a bit different - i've added a function to ChatTemplate to determine IF you are pre-filling the generation with bos <[reasoning_start_token_id]> in Lark terms and modifying the grammar to start inside of the reasoning block instead of starting a reasoning block.
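
A rough sketch of that decision, under stated assumptions: `Prefill`, `classify_prefill`, `start_rules`, and the `reasoning_tail` rule name are hypothetical and are not the actual ChatTemplate function. The sketch only emits the first two grammar rules; `think_text`, `text`, `tool_call`, and `eos` are assumed to be defined as in the grammars above.

```rust
/// How the rendered prompt ends, relative to the reasoning block.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Prefill {
    /// The template stops before the reasoning start marker.
    None,
    /// The template already emitted the reasoning start marker.
    ReasoningOpen,
}

/// Check whether the rendered prompt already ends with the reasoning
/// start marker (ignoring trailing whitespace).
pub fn classify_prefill(prompt: &str, reasoning_start: &str) -> Prefill {
    if prompt.trim_end().ends_with(reasoning_start) {
        Prefill::ReasoningOpen
    } else {
        Prefill::None
    }
}

/// Emit the `start` rule plus the reasoning rule that matches the prefill.
pub fn start_rules(prefill: Prefill, start_id: u32, end_id: u32) -> String {
    match prefill {
        // Grammar opens the reasoning block itself.
        Prefill::None => format!(
            "start: reasoning_block? ( text | tool_call )+ eos\n\
             reasoning_block: <[{}]> think_text <[{}]> (\"\\n\")?\n",
            start_id, end_id
        ),
        // Generation begins inside the block: no start token in the grammar.
        Prefill::ReasoningOpen => format!(
            "start: reasoning_tail ( text | tool_call )+ eos\n\
             reasoning_tail: think_text <[{}]> (\"\\n\")?\n",
            end_id
        ),
    }
}
```

The point of splitting the two cases is that the grammar never asks for a second reasoning start token when the template has already written one.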

sempervictus · 2026-03-29T07:55:24Z

this PR has an odd effect - it "bumps" initial generation by ~\t to the right which causes all sorts of fun with grammar masking.

sempervictus · 2026-03-29T08:06:50Z

@guoqingbao i think i've solved this in #265 with a slightly different direction for template adjustment without the offset problem

guoqingbao · 2026-03-29T09:09:12Z

this PR has an odd effect - it "bumps" initial generation by ~\t to the right which causes all sorts of fun with grammar masking.

Because it replaced reasoning headers with whitespace when sending to clients, given that the majority of AI agents won't strip out the reasoning markers in their outputs.

guoqingbao · 2026-03-29T09:09:56Z

this PR has an odd effect - it "bumps" initial generation by ~\t to the right which causes all sorts of fun with grammar masking.

Because it replaced reasoning headers with whitespace when sending to clients, given that the majority of AI agents won't strip out the reasoning markers in their outputs.

It doesn't do that replacement for non tool call requests.

guoqingbao · 2026-03-29T09:11:34Z

@guoqingbao i think i've solved this in #265 with a slightly different direction for template adjustment without the offset problem

If you strip out the whitespace, it will cause a cache miss because the number of tokens returned by the client from previous turns will not match the current cache.

sempervictus · 2026-03-29T14:39:39Z

Have been running that for a few hrs and no cache misses yet because there's no mismatch - same tokens ('\n|) are generated just from guidance forcing it instead of template. This is the guided grammar after stripping that tagged reproducing the same line such that prefix matches are aligned:

...
2026-03-29T15:10:02.272392Z  INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3-Coder-Next-FP8, enforce_parser=none)
2026-03-29T15:10:02.273443Z  INFO vllm_rs::core::block_manager: Prefix cache hit seq 95 (33280 cached tokens, 520 blocks)
2026-03-29T15:10:02.277654Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 95 (cached 33280 tokens)
2026-03-29T15:10:02.277669Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 95 (cached 33280 tokens)
2026-03-29T15:10:02.277803Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 95 (cached 33280 tokens)
2026-03-29T15:10:02.277812Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 95 (cached 33280 tokens)
2026-03-29T15:10:02.773425Z  INFO vllm_rs::utils::guidance: GRAMMAR:
start: reasoning_block ( text | tool_call )+ eos
reasoning_block: <[151667]> "\n" think_text "\n" <[151668]> "\n"
think_text[suffix="\n"]: /[ -~]+/
tool_call: <[151657]> text <[151658]>
text: /(?s:.+?)/
eos:  ( <[151645]> <[151643]> )
...

That said: Qwen3.5 is full of "magic" - remember those <thinking> tags we caught the 80B CoderNext using? They're not in the Tokenizer.added_vocabulary but in the regular one and despite the CoderNext ChatTemplate not having an actual reasoning section it does reason by itself somehow utilizing those non-special tokens like it would use <function=... (also in the regular vocab). The chat templates i've looked at for the 3.5s i have (0.8->122) do seem to have those sections but still i have caught them using <thinking> of their own volition from the regular vocab as well. Something more clever than the usual special-tokens-bounded reasoning block is going on here though the current approach in #265 does work over iterative conversation (prefix cache matching).

Just to be sure though: could you possibly point me to the logic we are trying to ensure re aligning generation? I'm working off the premise that you mean "prefix cache alignment for matching" in which case everything works correctly because i'm stripping that start-reasoning line from the template at the last moment and generating the same tokens in Guidance so all of the prefix block accounting should be accurate without any change (same number of tokens emitted in same position as what was accounted for in the template before my excision).

I have a branch of #265 and the current state of #279 together (it's a 1-line change in engine.rs to merge them but i can throw it in my GH if you'd like a precanned one to test) but whatever is happening in this branch to cause that \t-sized offset at the start of generation seems to offset the masking position for guidance as well.

guoqingbao · 2026-03-29T15:29:52Z

Have been running that for a few hrs and no cache misses yet because there's no mismatch - same tokens ('\n|) are generated just from guidance forcing it instead of template. This is the guided grammar after stripping that tagged reproducing the same line such that prefix matches are aligned:

That's a decoding cache miss, and it isn't reported; it means each request can only reuse the previous prompt cache. That's why I added this PR.

sempervictus · 2026-03-29T17:31:16Z

I see, so that is different from prefix cache? Will dig in on the merge branch for this and #265 since that now seems to be stable with and without reasoning levels set (testing some cleanup work to push presently)

sempervictus · 2026-03-29T18:47:41Z

According to the 122B with thinking enabled via grammar:

2026-03-29T18:39:43.830675Z  WARN vllm_rs::server::parser: Tool start token IDs corrected from tokenizer for model Qwen3VL: {248058}
2026-03-29T18:39:43.830708Z  WARN vllm_rs::server::parser: Tool end token IDs corrected from tokenizer for model Qwen3VL: {248059}
2026-03-29T18:39:43.830936Z  WARN vllm_rs::server::server: Tools enabled for request
2026-03-29T18:39:43.838353Z  INFO vllm_rs::core::engine: [llg] Guidance enabled, trimming <think> from pre-generation. Generation starting at <|im_start|>assistant
2026-03-29T18:39:43.988069Z  WARN vllm_rs::core::engine: [Stream] New request [Seq_id 27, 191202 tokens] received! (session_id: None)
2026-03-29T18:39:43.988159Z  INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3.5-122B-A10B-FP8, enforce_parser=none)
2026-03-29T18:39:43.990732Z  INFO vllm_rs::core::block_manager: Prefix cache hit seq 27 (186624 cached tokens, 2916 blocks)
...
2026-03-29T18:39:44.002845Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 27 (cached 186624 tokens)
2026-03-29T18:39:45.366038Z  INFO vllm_rs::utils::guidance: GRAMMAR:
start: reasoning_block? ( text | tool_call )+ eos
reasoning_block: <[248068]> "\n" think_text "\n" <[248069]> ("\n")?
think_text[suffix="\n"]: /[ -~]+/
tool_call: <[248058]> text <[248059]>
text: /(?s:.+?)/
eos:  ( <[248046]> <[248044]> )

the merged state between that PR and this one are complementary:

Analysis: Marker Revision Intention and LLG Grammar Coherency Alignment

1. Intention of Marker Revision (Without Emitting Marker Strings)

The marker revision mechanism in #279 serves a coherency preservation purpose:

Core Problem:

When caching prompts with reasoning markers (like ``, </thought>), the cache must maintain exact token sequences

However, if the same logical prompt is re-generated with slightly different whitespace or marker placement, the cache becomes ineffective

The revision system detects when a cached prompt's suffix matches the current request but differs only in reasoning marker representation

Solution Without Breaking Chat Stream:

The try_revise_reasoning_markers function performs token-level normalization, not string replacement

It identifies reasoning marker tokens and replaces them with a consistent "space token" placeholder

This ensures that logically equivalent prompts (same content, different marker formatting) map to the same cache entry

The actual marker strings are never emitted into tools, chat, or thinking blocks - only their token IDs are tracked and normalized

Key Mechanism:
Original: [prefix tokens] + [marker_start] + [content] + [marker_end] + [suffix]
Revised:  [prefix tokens] + [space_token] + [content] + [space_token] + [suffix]
The revision happens at the token ID level before caching, ensuring coherency without exposing marker strings to downstream parsers.
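As a rough illustration of the Original/Revised mapping above, here is a minimal sketch of token-level normalization. It is not the actual try_revise_reasoning_markers from this PR; the function name normalize_markers and the space token ID 220 are hypothetical, and the marker IDs 248068/248069 are only borrowed from the reasoning_block rule in the grammar log.

```rust
/// Replace reasoning marker token IDs with a placeholder token ID.
/// The sequence length is unchanged, so positions after the markers do not shift.
fn normalize_markers(tokens: &[u32], marker_start: u32, marker_end: u32, space_token: u32) -> Vec<u32> {
    tokens
        .iter()
        .map(|&t| if t == marker_start || t == marker_end { space_token } else { t })
        .collect()
}

fn main() {
    // 248068 / 248069 appear as the reasoning markers in the grammar log;
    // 220 as the "space token" ID is an assumption made up for this example.
    let (start, end, space): (u32, u32, u32) = (248068, 248069, 220);

    // [prefix tokens] + [marker_start] + [content] + [marker_end] + [suffix]
    let original: Vec<u32> = vec![1, 2, start, 10, 11, end, 3];
    let revised = normalize_markers(&original, start, end, space);

    // [prefix tokens] + [space_token] + [content] + [space_token] + [suffix]
    assert_eq!(revised, vec![1, 2, space, 10, 11, space, 3]);
    assert_eq!(revised.len(), original.len());
    println!("{:?}", revised);
}
```

Keeping the length fixed is the point of a one-for-one placeholder: dropping the markers instead would shift every later token position.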

2. Alignment with Coherency Concerns in src/core/engine.rs:1210-1219

The code section you referenced demonstrates LLG grammar-aware prompt trimming:
if params.grammar.is_some() {
    if let Some((start_str, _end_str)) = get_reasoning_token_strings(&self.guidance_tokens, &self.tokenizer) {
        if prompt.trim().ends_with(&start_str) {
            let prompt = prompt.trim().trim_end_matches(&start_str).to_string();
            log_info!("[llg] Guidance enabled, trimming {} from pre-generation...", &start_str);
            return (prompt, image_idx, replay)
        }
    }
}
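To try the trimming behaviour outside the engine, a standalone sketch could look like the following. The helper name trim_trailing_marker and the marker string "<think>" are assumptions for illustration; in the engine the string comes from get_reasoning_token_strings. This sketch uses strip_suffix, which removes a single trailing occurrence, whereas the quoted code uses trim_end_matches, which removes every repeated trailing occurrence.

```rust
/// Strip one trailing occurrence of `start_marker` from the whitespace-trimmed prompt.
/// Returns None when the prompt does not end with the marker, so the caller can
/// fall through to normal processing.
fn trim_trailing_marker(prompt: &str, start_marker: &str) -> Option<String> {
    if start_marker.is_empty() {
        // An empty marker would trivially match every prompt; treat it as "no marker".
        return None;
    }
    prompt
        .trim()
        .strip_suffix(start_marker)
        .map(|rest| rest.to_string())
}

fn main() {
    // "<think>" is an assumed marker string, used here only as an example.
    let marker = "<think>";

    // The trailing marker (and whitespace after it) is removed.
    assert_eq!(
        trim_trailing_marker("hello\n<think>\n", marker),
        Some("hello\n".to_string())
    );

    // No trailing marker: nothing to trim.
    assert_eq!(trim_trailing_marker("hello", marker), None);

    // Only one occurrence is stripped, unlike trim_end_matches.
    assert_eq!(
        trim_trailing_marker("a<think><think>", marker),
        Some("a<think>".to_string())
    );
}
```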
How This Aligns with Marker Revision Coherency:

| Concern | LLG Grammar Trimming (lines 1210-1219) | Marker Revision System |
|---|---|---|
| Purpose | Remove reasoning start marker when grammar is active | Normalize reasoning markers across cache boundaries |
| Trigger | params.grammar.is_some() + prompt ends with marker | Cache hit with marker token mismatch |
| Action | Trim marker string from prompt before generation | Replace marker tokens with space tokens in cache |
| Coherency Impact | Ensures grammar-aligned prompts don't duplicate cache entries | Ensures logically equivalent prompts share cache entries |

Critical Alignment Point:

Both mechanisms address the same coherency problem from different angles:

LLG Grammar Trimming (pre-generation):

Detects when the reasoning start marker appears at the end of the prompt

Removes it because the guidance grammar will handle marker insertion

Prevents redundant marker storage in the generated prompt

Marker Revision (post-caching):

Detects when cached prompts have marker token variations

Normalizes them to space tokens for consistent cache lookup

Enables cache reuse across logically equivalent prompts with different marker representations

Why Both Are Necessary:

Without LLG grammar trimming: Prompts would store marker strings that the grammar system would re-insert, causing cache pollution.

Without marker revision: Cache lookups would fail for logically identical prompts that happen to have different marker token sequences due to template variations.

The Coherency Guarantee:

Together, these mechanisms ensure:

prompt + reasoning start marker → trimmed to prompt before caching (LLG grammar)

prompt in cache → normalized marker tokens enable cross-request hits (marker revision)

Result: Maximum cache utilization while maintaining grammatical correctness

3. Data Flow: How Marker Revision Preserves Coherency
flowchart LR
    A[Incoming Request] --> B{Has Grammar?}
    B -->|Yes| C[Check for Reasoning Marker Suffix]
    C --> D{Marker Present?}
    D -->|Yes| E[Trim Marker - LLG Grammar Path]
    D -->|No| F[Normal Processing]
    E --> G[Cache Prompt Without Redundant Marker]
    F --> H{Cache Hit?}
    H -->|Yes| I[Apply Marker Revision Normalization]
    H -->|No| J[Store with Original Tokens]
    I --> K[Return Coherent Cache Entry]
    J --> K
    K --> L[Generation with Grammar-Aligned Markers]
Key Insight: The marker revision system doesn't just "replace" markers - it normalizes the token representation so that:

Cache lookups succeed for logically equivalent prompts

Grammar insertion happens at the correct token boundaries

No duplicate marker storage occurs

This is why the revision happens at the token ID level rather than string level - it preserves the exact boundary conditions that the guidance grammar expects.

Summary

The marker revision mechanism in #279 ensures cache coherency across reasoning marker variations by normalizing marker tokens to space tokens at the token-ID level. This aligns perfectly with the LLG grammar trimming logic in src/core/engine.rs:1210-1219, which removes redundant markers before caching. Together, they guarantee:

No marker duplication - LLG grammar trimming prevents storing markers that will be re-inserted

Maximum cache reuse - Marker revision enables cross-request hits despite template variations

Grammar alignment - Both mechanisms preserve the exact token boundaries that guidance grammars expect

The system works because marker revision operates on token IDs, not strings, ensuring that the coherency guarantees hold even when different templates produce logically identical prompts with different marker representations.

So... how do we want to handle merge ordering? Merge this one first and i realign #265 over it or should i push the merged branch into #265 for your review?

sempervictus · 2026-03-29T18:51:06Z

BTW, we're basically 1 push away from having full BOS->EOS control over all rendering which should give you explicit cache-state matching control since you can generate cache-state UUIDs forcibly at the start and omit them from transmission to the user (internal watermarking, effectively). That trick should also empower /responses API (#26).

The correct version of such an impl would require me to actually extract the string from the template right after the BOS to handle stuff like tool-response-role gracefully but i can throw together a commit which we know for a fact will work on the current qwen models as a test case and work our way to idiomatic coverage for all models from there.

sempervictus · 2026-03-30T00:38:06Z

Unfortunately this branch breaks expectations of OpenAI API clients which expect the think tags to actually be sent through from the server - reasoning sections come through formatted differently but not within reasoning blocks as expected by the client:

or the "more fun" version from the 122B (the runaway region is what would be text on the start: line in #265):

Since we can pin those tags now in the grammar, is there still a reason to do the masking/replacement piece for cache coherence?

guoqingbao · 2026-03-30T02:08:35Z

> Unfortunately this branch breaks expectations of OpenAI API clients which expect the think tags to actually be sent through from the server

That's not a break. It simply emits reasoning content as normal output so all clients can render it normally. Otherwise, certain agents like opencode and claude code render the reasoning tags visually (which is annoying), and that breaks the cache (the chat template can strip out reasoning parts).

guoqingbao · 2026-03-30T02:11:34Z

> Since we can pin those tags now in the grammar, is there still a reason to do the masking/replacement piece for cache coherence?

It's unrelated to the grammar; it's about how the server and client interact. If you send reasoning tags, clients send them back, and the chat template removes the previous reasoning parts, leaving a hole that causes a KV cache mismatch, so the decoding cache cannot be reused.
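A small sketch of why the stripped reasoning leaves a "hole": assuming the prefix cache only reuses whole blocks of the longest common token prefix, everything after the first differing token has to be recomputed. The function name reusable_prefix_tokens and the token values are made up for illustration; the block size of 64 is inferred from the log line above (186624 cached tokens / 2916 blocks) and may not hold for other configurations.

```rust
/// Assumed block size, inferred from "186624 cached tokens, 2916 blocks" in the log.
const BLOCK_SIZE: usize = 64;

/// Number of tokens reusable from the cache: the longest common prefix of the
/// cached and incoming sequences, rounded down to a whole number of blocks.
fn reusable_prefix_tokens(cached: &[u32], incoming: &[u32]) -> usize {
    let common = cached
        .iter()
        .zip(incoming.iter())
        .take_while(|(a, b)| a == b)
        .count();
    (common / BLOCK_SIZE) * BLOCK_SIZE
}

fn main() {
    // What the server cached after decoding: prompt + reasoning + answer.
    let prompt: Vec<u32> = (0..128).collect();
    let reasoning: Vec<u32> = (1000..1064).collect();
    let answer: Vec<u32> = (2000..2064).collect();
    let cached: Vec<u32> = [prompt.clone(), reasoning, answer.clone()].concat();

    // What the next request renders to once the chat template has dropped
    // the previous reasoning part: prompt + answer.
    let incoming: Vec<u32> = [prompt, answer].concat();

    // Only the prompt blocks match; the cached answer tokens sit behind the hole.
    assert_eq!(reusable_prefix_tokens(&cached, &incoming), 128);

    // If nothing had been stripped, the whole sequence would be reusable.
    assert_eq!(reusable_prefix_tokens(&cached, &cached), 256);
}
```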

guoqingbao · 2026-03-30T03:15:48Z

Related anomalyco/opencode#11439

guoqingbao · 2026-03-30T04:39:56Z

It seems we need to keep the reasoning tags in their original format when sending to the client. It's an opencode bug: all of the popular inference frameworks, including vLLM and sglang, send raw reasoning markers to the client, and the same issue shows up with opencode. I may revert the reasoning tag replacement logic @sempervictus

guoqingbao · 2026-03-30T11:50:08Z

Finished in #281

sempervictus mentioned this pull request Mar 27, 2026

LLG: Comprehensive Guided Decoding Infrastructure #265

Open

guoqingbao force-pushed the rewrite_chat_template branch 2 times, most recently from 5c68ae2 to efaca36 Compare March 28, 2026 13:53

Use placeholder approach

52ed9d3

guoqingbao force-pushed the rewrite_chat_template branch from efaca36 to 52ed9d3 Compare March 28, 2026 16:49

guoqingbao mentioned this pull request Mar 29, 2026

Output parser regression on reasoning tokens #280

Closed

Revise capture stride

693f95c

guoqingbao mentioned this pull request Mar 29, 2026

Fix decoding cache mismatch #277

Closed

Bump version to v0.9.16

dc6611e

Fix claude server path & update ReadMe

c5511e6

guoqingbao mentioned this pull request Mar 30, 2026

Fix reasoning marker issue #281

Merged

guoqingbao closed this Mar 30, 2026

guoqingbao deleted the rewrite_chat_template branch April 18, 2026 06:34
