Enable Reasoning via Guided Enforcement#262

Merged
guoqingbao merged 24 commits into
guoqingbao:mainfrom
sempervictus:reasoning/pr
Mar 16, 2026

@sempervictus
Contributor

@sempervictus sempervictus commented Mar 11, 2026

This should more or less complete the LLG work by including reasoning API support a la OpenAI's.

Overview

The guided inference system supports multiple reasoning effort levels through the ReasoningEffort enum. Each level implements a different reasoning strategy optimized for specific use cases:

| Level | Description | Use Case |
|---|---|---|
| None | No structured reasoning - direct output only | Fast generation, low latency |
| Low | Constrained single-paragraph reasoning (~150 chars max) | Fast Thinking, reduces hallucination |
| Medium | Standard multi-step Chain-of-Thought (CoT) | Balanced reasoning depth |
| High | Adversarial analysis with self-correction phases | Complex tasks requiring error checking |
| ChainOfThought | Best-of-breed CoVe + Self-Critique | Maximum accuracy for fact-sensitive tasks |

Reasoning Effort Levels

ReasoningEffort::Low - Fast Thinking

Implements "Fast Thinking" with tight length constraints (~150 chars max). This reduces hallucination risk by limiting the generation space.

start: reasoning_block
reasoning_block: <[START_ID]> "\n" thinkgram "\n" <[END_ID]> "\n"
thinkgram: /^[^\n]{1,150}/

ReasoningEffort::Medium - Standard CoT

Implements Wei et al. (2022) baseline with sentence-based termination. Allows multiple steps but enforces sentence boundaries.

start: reasoning_block
reasoning_block: <[START_ID]> "\n" thinkgram "\n" <[END_ID]> "\n"
thinkgram: /(?s:[^.!?]+[.!?])+/  # Multiple sentences

ReasoningEffort::High - Adversarial Analysis

Implements Cheng & Su (2025) adversarial critique pattern. Forces explicit error checking before final output.

start: reasoning_block* analysis_block*
reasoning_block: <[START_ID]> "\n" thinkgram "\n" <[END_ID]> "\n"
analysis_block: "<ANALYZE>" "\n" analysis_content "\n" "</ANALYZE>" "\n"
thinkgram: /(?s:[^.!?]+[.!?])+/
analysis_content: /(?s:.*)/

ReasoningEffort::ChainOfThought - CoVe Pattern

Combines Madaan et al. (2024) Chain-of-Verification with adversarial self-correction. Maximum accuracy for complex/fact-sensitive tasks.

start: cots*
cots: <[START_ID]> "\n" draft_phase verification_phase critique_phase final_phase "\n" <[END_ID]> "\n"
draft_phase: /(?s:[^.!?]+[.!?])+/
verification_phase: "<VERIFY>" "\n" verification_questions "\n" verification_answers "\n" "</VERIFY>" "\n"
verification_questions: /(?s:[^.!?]+[.!?])+/
verification_answers: /(?s:.*)/
critique_phase: "<CRITIQUE>" "\n" self_critique "\n" "</CRITIQUE>" "\n"
self_critique: /(?s:.*)/
final_phase: "<FINAL_ANSWER>" "\n" final_content
final_content: /(?s:.*)/
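
As a rough illustration of how the per-level grammars above could be produced from a single function, here is a minimal sketch. It is not the PR's `ThinkingGrammarBuilder`; the `Effort` enum, the `thinking_grammar` function, and the two regex constants are hypothetical names, and the grammar text is copied from the listings above with the token IDs substituted in.

```rust
/// Sketch-only mirror of the reasoning levels (hypothetical type, not the PR's enum).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Effort {
    None,
    Low,
    Medium,
    High,
    ChainOfThought,
}

// Regexes shared by several levels, copied from the listings above.
const SENTENCES: &str = "/(?s:[^.!?]+[.!?])+/";
const ANYTHING: &str = "/(?s:.*)/";

/// Returns Lark text for a level, or `None` when no reasoning block is wanted.
/// `start_id` / `end_id` replace the <[START_ID]> / <[END_ID]> placeholders.
fn thinking_grammar(effort: Effort, start_id: u32, end_id: u32) -> Option<String> {
    // Rule shared by Low, Medium and High: one delimited reasoning block.
    let block = format!(
        "reasoning_block: <[{start_id}]> \"\\n\" thinkgram \"\\n\" <[{end_id}]> \"\\n\"\n"
    );
    match effort {
        Effort::None => None,
        Effort::Low => Some(format!(
            "start: reasoning_block\n{block}thinkgram: /^[^\\n]{{1,150}}/\n"
        )),
        Effort::Medium => Some(format!(
            "start: reasoning_block\n{block}thinkgram: {SENTENCES}\n"
        )),
        Effort::High => Some(format!(
            "start: reasoning_block* analysis_block*\n{block}\
             analysis_block: \"<ANALYZE>\" \"\\n\" analysis_content \"\\n\" \"</ANALYZE>\" \"\\n\"\n\
             thinkgram: {SENTENCES}\n\
             analysis_content: {ANYTHING}\n"
        )),
        Effort::ChainOfThought => Some(format!(
            "start: cots*\n\
             cots: <[{start_id}]> \"\\n\" draft_phase verification_phase critique_phase final_phase \"\\n\" <[{end_id}]> \"\\n\"\n\
             draft_phase: {SENTENCES}\n\
             verification_phase: \"<VERIFY>\" \"\\n\" verification_questions \"\\n\" verification_answers \"\\n\" \"</VERIFY>\" \"\\n\"\n\
             verification_questions: {SENTENCES}\n\
             verification_answers: {ANYTHING}\n\
             critique_phase: \"<CRITIQUE>\" \"\\n\" self_critique \"\\n\" \"</CRITIQUE>\" \"\\n\"\n\
             self_critique: {ANYTHING}\n\
             final_phase: \"<FINAL_ANSWER>\" \"\\n\" final_content\n\
             final_content: {ANYTHING}\n"
        )),
    }
}
```

Returning `Option<String>` keeps the `None` level explicit: a caller can skip grammar composition entirely instead of composing an empty grammar.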

API Reference

ReasoningEffort Enum

Location: src/utils/reasoning.rs:16-39

#[derive(Clone, Debug, serde::Serialize, serde::Deserialize, PartialEq, Eq)]
pub enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
    ChainOfThought,
}
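
The commit notes in this PR describe `reasoning_effort` as accepting "none", "low", "medium", or "high" case-insensitively. A minimal sketch of that parsing via `std::str::FromStr` follows; the `String` error type is an assumption, and the enum is redeclared without the serde derives so the snippet stands alone. No string form for `ChainOfThought` is given in the PR, so the sketch deliberately leaves it unmapped.

```rust
use std::str::FromStr;

/// Local copy of the enum so the sketch compiles on its own.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
    ChainOfThought,
}

impl FromStr for ReasoningEffort {
    // Assumption: a plain String error; the real module may use its own error type.
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Lower-case first so "HIGH", "High" and "high" are treated alike.
        match s.to_ascii_lowercase().as_str() {
            "none" => Ok(Self::None),
            "low" => Ok(Self::Low),
            "medium" => Ok(Self::Medium),
            "high" => Ok(Self::High),
            // ChainOfThought has no documented request string, so it is not parsed here.
            other => Err(format!("unknown reasoning_effort: {other:?}")),
        }
    }
}
```

With this in place, `"HIGH".parse::<ReasoningEffort>()` yields `Ok(ReasoningEffort::High)`, and an unrecognized value is rejected rather than silently mapped to a default.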

ThinkingGrammarBuilder

Location: src/utils/reasoning.rs:143-178

pub struct ThinkingGrammarBuilder {
    start_id: u32,
    end_id: u32,
    effort: Option<ReasoningEffort>,
}

Methods:

  • ThinkingGrammarBuilder::new(start_id, end_id, effort) - Create new builder
  • ThinkingGrammarBuilder::from_string(start_id, end_id) - Create from string IDs
  • build() - Generate Lark grammar string
  • build_grammar() - Generate TopLevelGrammar

build_reasoning_grammar()

Location: src/utils/reasoning.rs:183-218

Wraps a base grammar with reasoning blocks when reasoning effort is enabled.

pub fn build_reasoning_grammar(
    base_grammar: TopLevelGrammar,
    reasoning_effort: ReasoningEffort,
    special_tokens: &SpecialTokens,
) -> TopLevelGrammar

Includes #232 and #260

sequenceDiagram
    participant User
    participant API
    participant Pipeline
    participant SpecialTokens
    participant LLGFactory
    participant Matcher
    participant TokenParser
    participant EarleyParser
    participant Lexer
    participant TokTrie
    participant Sampler
    participant LogitsProcessor
    participant Model

    User->>API: Request with constraint (regex/json_schema/lark/llguidance)

    Note over User,API: Phase 1: Request Setup and Grammar Building

    API->>SpecialTokens: SpecialTokens::new(&tokenizer)
    SpecialTokens-->>API: Return EOS, BOS, TOOL token IDs
    API->>Pipeline: build_llg_factory(tokenizer)
    Pipeline->>LLGFactory: toktrie_hf_tokenizers::ByteTokenizer::from_tokenizer(tokenizer)
    LLGFactory->>TokTrie: Create token trie from tokenizer vocabulary
    TokTrie-->>LLGFactory: Return TokEnv with trie
    LLGFactory->>LLGFactory: ParserFactory::new_simple(&env)
    LLGFactory-->>Pipeline: Return Arc<ParserFactory>

    Pipeline->>Pipeline: llg_grammar_from_constraint(&request.constraint)
    Pipeline->>Matcher: constraint_from_llg_grammar(&factory, grm)
    Matcher->>Matcher: factory.create_parser(grm)
    Matcher->>TokenParser: Create with grammar_init
    TokenParser->>EarleyParser: Build CGrammar from grammar
    TokenParser->>Lexer: Build LexerSpec from grammar
    Lexer->>TokTrie: Precompute large lexemes if needed
    TokTrie-->>Lexer: Return optimized lexeme sets

    Note over User,Matcher: Phase 2: Prompt Processing (if needed)

    User->>API: Optional: process_prompt(prompt_tokens)
    API->>TokenParser: process_prompt(prompt_tokens)
    TokenParser->>TokenParser: tokenize_bytes_marker(&prompt_bytes)
    TokenParser->>TokenParser: process_prompt() returns new prompt

    Note over User,Matcher: Phase 3: Inference Loop

    loop for each token generation

    Model->>Model: Forward pass on input tokens
    Model-->>Pipeline: Return logits tensor

    Pipeline->>Sampler: sample_sequence(logits, seq, ...)

    Note over Sampler: Two-stage sampling with llguidance

    Sampler->>LogitsProcessor: Apply llguidance constraint

    LogitsProcessor->>TokenParser: compute_mask()
    TokenParser->>TokenParser: compute_mask_inner()
    TokenParser->>EarleyParser: run_speculative("compute_mask")
    EarleyParser->>EarleyParser: trie_started("compute_mask")
    EarleyParser->>EarleyParser: compute_bias()
    EarleyParser->>Lexer: compute_bias() with token_prefix

    Note over Lexer,TokTrie: Lexical Scope Analysis

    Lexer->>TokTrie: Walk token trie for allowed lexemes
    TokTrie-->>Lexer: Return SimpleVob bit mask

    Lexer->>EarleyParser: Return mask to TokenParser
    TokenParser->>TokenParser: cache mask for fast-forward

    TokenParser-->>LogitsProcessor: Return SimpleVob mask

    LogitsProcessor->>LogitsProcessor: Check if sampled token is allowed
    LogitsProcessor->>Sampler: Apply logit biasing

    alt Token is allowed
        Sampler->>Sampler: No biasing needed
    else Token is not allowed
        Sampler->>Sampler: Set invalid tokens to -f32::INFINITY
        Sampler->>Sampler: Re-sample with biased logits
    end

    Sampler->>TokenParser: consume_token(sampled_token)
    TokenParser->>TokenParser: apply_token(sampled_token)
    TokenParser->>TokenParser: llm_tokens.push(sampled_token)
    TokenParser->>TokenParser: llm_bytes.extend(token_bytes)
    TokenParser->>EarleyParser: parser.apply_token(token_bytes, token_id)
    EarleyParser->>Lexer: advance lexer state
    Lexer->>Lexer: Update lexer_stack with new state
    Lexer->>EarleyParser: Return backtrack count

    alt Backtrack needed
        EarleyParser->>EarleyParser: rollback(backtrack_bytes)
        EarleyParser->>EarleyParser: Update llm_tokens and llm_bytes
    end

    TokenParser->>TokenParser: check_stop()
    TokenParser-->>Sampler: Return CommitResult

    Note over Sampler: Phase 4: Fast-Forward (if enabled)

    Sampler->>TokenParser: compute_ff_tokens()
    TokenParser->>TokenParser: ff_tokens()
    TokenParser->>TokTrie: Tokenize forced bytes
    TokTrie-->>TokenParser: Return fast-forward tokens

    alt Fast-forward tokens available
        TokenParser->>TokenParser: consume_ff_tokens()
        loop for each ff_token
            TokenParser->>TokenParser: consume_token(ff_token)
            TokenParser->>TokenParser: llm_tokens.push(ff_token)
            TokenParser->>TokenParser: llm_bytes.extend(ff_token_bytes)
        end
    end

    Note over Sampler: Phase 5: Speculative Decoding (if enabled)

    Model->>Model: Draft model forward pass
    Model-->>Pipeline: Return draft logits

    Pipeline->>Sampler: sample_target_sequence_speculative()
    Sampler->>TokenParser: rollback(n_toks)
    TokenParser->>EarleyParser: parser.rollback(bytes_to_drop)
    EarleyParser->>Lexer: pop lexer states
    Lexer-->>TokenParser: Return rollback result

    Sampler->>Sampler: Sample draft tokens
    Sampler->>TokenParser: validate_tokens(draft_tokens)
    TokenParser->>TokenParser: consume_token(draft_token)

    alt Draft token accepted
        TokenParser->>TokenParser: Continue with next draft
    else Draft token rejected
        TokenParser->>TokenParser: Accept partial draft
        TokenParser->>TokenParser: Rollback to last valid state
    end

    end

    Note over User,Matcher: Phase 6: Token Geometry and Binary Data State

    TokTrie->>TokTrie: Token encoding (8:24 bit split)
    TokTrie->>TokTrie: node.bits = (token_id << 8) | byte
    TokTrie->>TokTrie: node.bits2 = (subtree_size << 10) | num_parents

    TokTrie->>SimpleVob: Bit mask storage
    SimpleVob->>SimpleVob: data: Vec<u32> (32 tokens per word)
    SimpleVob->>SimpleVob: allow_token(tok): data[tok>>5] |= 1 << (tok&31)

    Note over User,Matcher: Phase 7: Rollback and Verification

    TokenParser->>TokenParser: validate_tokens(tokens)
    TokenParser->>EarleyParser: validate_tokens_raw(tokens)
    EarleyParser->>Lexer: Check if tokens match current lexer state
    Lexer-->>TokenParser: Return number of valid tokens

    TokenParser->>TokenParser: rollback(n_tokens)
    TokenParser->>EarleyParser: parser.rollback(bytes_to_drop)
    EarleyParser->>Lexer: pop lexer states
    TokenParser->>TokenParser: llm_tokens.truncate(new_len)
    TokenParser->>TokenParser: llm_bytes.truncate(new_len)

    Note over User,Matcher: Phase 8: Response Generation

    Pipeline->>API: Return completion with tokens
    API->>User: Stream or return final response
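
The bit mask step in Phase 6 (`data[tok>>5] |= 1 << (tok&31)`) and the `-f32::INFINITY` biasing in the inference loop can be sketched with a toy type. `TokenMask` below is a hypothetical stand-in written for illustration, not llguidance's `SimpleVob`; it only mirrors the 32-tokens-per-word layout the diagram describes.

```rust
/// Toy bit mask with one bit per token id, packed 32 tokens per u32 word.
struct TokenMask {
    data: Vec<u32>,
}

impl TokenMask {
    /// Allocate enough words to cover `vocab_size` tokens, all disallowed.
    fn new(vocab_size: usize) -> Self {
        Self { data: vec![0; (vocab_size + 31) / 32] }
    }

    /// Set the bit for `tok`: word index is tok >> 5, bit index is tok & 31.
    fn allow_token(&mut self, tok: u32) {
        self.data[(tok >> 5) as usize] |= 1u32 << (tok & 31);
    }

    /// Test the bit for `tok`; ids beyond the allocated words count as disallowed.
    fn is_allowed(&self, tok: u32) -> bool {
        match self.data.get((tok >> 5) as usize) {
            Some(word) => (*word & (1u32 << (tok & 31))) != 0,
            None => false,
        }
    }

    /// Push every disallowed token's logit to negative infinity.
    fn apply(&self, logits: &mut [f32]) {
        for (tok, logit) in logits.iter_mut().enumerate() {
            if !self.is_allowed(tok as u32) {
                *logit = f32::NEG_INFINITY;
            }
        }
    }
}
```

Packing the mask into words keeps it small (a 150k-token vocabulary fits in under 20 KB) and makes the per-token check a shift and an AND.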

RageLtMan and others added 11 commits March 9, 2026 10:12
This implements the full llguidance integration enabling
grammar-constrained inference for structured outputs, tool calling,
and custom constraints.

Architecture:
- TopLevelGrammar serialized via rmp_serde across RPC boundaries
- Grammar flows: Server → params.grammar → Runner → GuidanceState
 → Matcher
- Inline correction via logits masking during sampling
- Post-process correction via rollback on validation failure

Key components:
- params.grammar field in SamplingParams for RPC serialization
GuidanceState
- GuidanceState::new() with Matcher state management
- GuidanceState::reset() for proper state cleanup
- Rollback counter (MAX_ROLLBACK_ATTEMPTS=3) preventing infinite
loops
- guidance_failed/guidance_mismatch sets cleared on rollback
- Vocab size validation in build_llg_factory()
- Lark grammar generation from tools via
build_tool_call_lark_grammar()

CLI flags:
- --enable-tool-grammar: Auto-build LLG grammar from MCP tools
- --allow-constraint-api: Accept client-provided
structured_outputs/response_format
Expand SpecialTokens usage to cover EOS uses across the codebase
to include the chat template. This gates access to the EOS tokens
through a single common API providing an interdiction point to add
or remove them as needed per-model or family as required.
- Replace manual EOS token extraction logic with centralized
SpecialTokens::new() and idiomatic accessors
- Eliminate EosTokenId enum and related complex serialization logic
in favor of direct Vec<u32>
- Update all callers to use SpecialTokens for tool start/end token
IDs
- Remove stop_token_ids from SamplingParams and related logic
(now handled via SpecialTokens)
- Simplify tokenizer config by replacing EosTokenEntry with
Option<String>
- Add comprehensive SpecialTokens API with category-based
accessors, ID/string sets, and search methods
Improve the binary example to be a handy extractor for models which
developers can use to update special_tokens.rs quickly.

Add tags extracted from Qwen3.5 0.8B
Narrow the Common category search specifically to find string dups
of actually special tokens (handle "aftermarket" models/merges).

Add and test Llama4 and Qwen3.5 MoE
This PR introduces a new `reasoning_effort` parameter to control
reasoning block generation in the chat completion API, matching OpenAI's
reasoning API behavior.

- **API Extension**: Added `reasoning_effort` field to
`ChatCompletionRequest` accepting "none", "low", "medium", or "high"
values (case-insensitive)

- **New Module**: Created `src/utils/reasoning.rs` with:
  - `ReasoningEffort` enum with `from_str` deserialization
  - `ThinkingGrammarBuilder` for reasoning block grammar
construction
  - `thinking_grammar_with_reasoning_block()` generating
Lark grammar patterns
  - `build_reasoning_grammar()` for composing reasoning blocks
with base grammars

- **Integration**: Updated `compose_grammars()` in
`src/utils/guidance.rs` to accept and apply reasoning effort levels

- **Schema Sanitization**: Enhanced `sanitize_schema_for_llguidance()`
in `src/tools/schema.rs` to strip null from required field types,
ensuring grammars enforce field presence for tool parameters

- **Special Token Helpers**: Added `reasoning_start_ids()`,
`reasoning_end_ids()`, and `reasoning_tokens()` methods to
`SpecialTokens` for robust token detection

- **Comprehensive Tests**: Added 11 new tests covering:
  - Reasoning effort parsing and validation
  - Thinking grammar builder functionality
  - Schema null-stripping for required/optional fields
  - Grammar composition permutations with reasoning
This change updates the ToolGrammarBuilder to correctly use pad
token IDs for XML tool call termination when building Lark grammars
for models that use XML-style tool calling (e.g., Qwen3-Coder/3.5).

The XML format requires closing markers for </function> and
</parameter> tags. When the tokenizer lacks special closing
tags the model can run-on generating forever as XML is not a
finite stateless grammar; see guidance-ai/llguidance/issues/306.

Use pad tokens as "magic" terminating markers embedded into the
grammar and recognizable by the tokenizer/llg mask as output that
a model cannot normally emit in its text (normally masked to 0.0
logprob). Anchor the XML function/parameter generation the same
way we bound tool-call and text. This is easier on the model
than forcing JSON parsing (qwen3), especially in conjunction with
forcing it to `<think>` if it's not trained to do so.

Mechanically, we modify the chat template to emit special pad tags
after the closing tags for function and param and inject those
into the grammar template submitted to the model as tool-choice.

Call path:
  src/tools/schema.rs:361 build_xml_with_anchors(pad_ids)
    ├─ Uses pad_ids[0] as </function> anchor
    └─ Uses pad_ids[1] as </parameter> anchor

Grammar structure:
-  start: ( text | tool_call )+ eos?
-  tool_call: <[tool_start_id]> tool_content <[tool_end_id]>
-  tool_0: "<function=fetch_url_via_curl>" param_0_0 ...
"</function>" <[pad_id_0]>
-  param_0_0: "<parameter=url>" value_0_0 ... "</parameter>"
<[pad_id_1]>
- ...

The pad tokens serve as finite termination points for the XML
parser, allowing the Lark grammar to generate valid, parseable tool
calls without requiring explicit special closing tags in the
tokenizer vocabulary.
@sempervictus
Contributor Author

@guoqingbao
Owner

Here are some review comments:

SpecialTokens::Category::Eos is a heuristic bucket of “end-like” tokens. That is useful for discovery and tooling, but it is not the same thing as the model’s actual semantic eos_token_id. For example, <|im_end|>, <|eot_id|>, <|header_end|>, and <|eom_id|> may be structural delimiters, not global EOS, unless the model config explicitly defines them as EOS.

Guided decoding only needs the real EOS IDs when it adds optional EOS termination to a grammar. Those IDs should come from resolved model config / generation config, not from the heuristic list in special_tokens.rs.


The sample loop is not using the standard llguidance flow cleanly. Standard flow is: compute_mask from current matcher state, apply mask, sample once from masked logits, consume_token on the chosen token. Your code does that first, but then adds a second validation/resample loop afterward. That second pass is redundant if the first mask was correct, and it hides state/mask bugs instead of fixing them.

The fallback re-sample path is batch-wide, not per-row. When one token fails validation, it rebuilds a full-batch tensor and calls sample_with_strategy() across the whole batch, then only keeps re_sampled[seq_idx]. That is nonstandard, wastes work, and can change RNG advancement for unrelated rows in the batch. A standard fallback would re-sample only the affected row, or better, remove the fallback entirely.

The rollback/prefix-cache changes were unnecessary for guided decoding and unsafe for existing logic. RollbackSnapshot and the rollback path added state into scheduler/block_manager/prefix_cache, but the postprocess check was effectively dead and the rollback math/cache mutation could corrupt normal prefix-cache behavior. I removed that path entirely.

Existing non-guided behavior had regressed. OpenAI stop was not applied, stop_token_ids matching had been removed, and engine tool token handling had drifted from the model-specific ToolConfig path.

Grammar composition had standard-mismatch bugs: merged grammars used repeated alternation +, single-EOS formatting was wrong, and regex/literal helpers silently dropped non-ASCII input.

Guided decoding setup only lived in the OpenAI handler.

GuidanceState still carries unused speculative/rollback/cache machinery that the current sampling path does not use. That is not standard guided decoding; it is leftover complexity.

I'm refactoring to resolve these issues.
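
The standard flow named in the review (compute the mask from the current matcher state, apply it, sample once from the masked logits, then consume the chosen token) can be sketched as follows. `ToyMatcher` and `guided_step` are hypothetical stand-ins written for this illustration, not the llguidance API, and greedy argmax replaces the real sampling strategy to keep the example deterministic.

```rust
/// Hypothetical matcher: each step allows a fixed set of token ids.
struct ToyMatcher {
    steps: Vec<Vec<u32>>,
    pos: usize,
}

impl ToyMatcher {
    /// Token ids allowed in the current state (empty once the grammar is exhausted).
    fn compute_mask(&self) -> &[u32] {
        match self.steps.get(self.pos) {
            Some(step) => step.as_slice(),
            None => &[],
        }
    }

    /// Advance the state; fails if the token was not allowed by the current mask.
    fn consume_token(&mut self, tok: u32) -> Result<(), String> {
        if self.compute_mask().contains(&tok) {
            self.pos += 1;
            Ok(())
        } else {
            Err(format!("token {tok} not allowed at step {}", self.pos))
        }
    }
}

/// One decoding step: mask, sample once, commit. No second validation pass.
fn guided_step(logits: &[f32], matcher: &mut ToyMatcher) -> Result<u32, String> {
    let allowed = matcher.compute_mask().to_vec();
    // Apply the mask: every token outside the allowed set gets -inf.
    let masked: Vec<f32> = logits
        .iter()
        .enumerate()
        .map(|(i, &l)| if allowed.contains(&(i as u32)) { l } else { f32::NEG_INFINITY })
        .collect();
    // Sample once; greedy argmax stands in for the real sampler here.
    let (tok, best) = masked
        .iter()
        .enumerate()
        .fold((0usize, f32::NEG_INFINITY), |acc, (i, &l)| if l > acc.1 { (i, l) } else { acc });
    if best == f32::NEG_INFINITY {
        return Err("no token allowed by the mask".to_string());
    }
    // Commit the chosen token to the matcher state.
    matcher.consume_token(tok as u32)?;
    Ok(tok as u32)
}
```

Because the token is drawn from already-masked logits, `consume_token` failing here would indicate a mask or state bug, which is the reason given above for not hiding it behind a re-sample loop.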

@guoqingbao
Owner

Enable tool grammar is not working; this is the log before refactoring:

2026-03-12T10:44:51.848006Z  WARN vllm_rs::core::runner: [Seq 13] Failed to create guidance state: 3(50): Expected value (at ":" (':'))
   1 | start: ( text | tool_call )+ eos?
   2 | 
   3 | tool_call: <[151657]> tool_obj <[151658]>
...
   4 | eos:  ( <[151643]> | <[151645]> | <[151647]> | <[151649]> | <[151651]> | <[151653]> )
. Disabling constraints for this sequence.

After refactor:

It can create the guidance state but fails to generate a tool call; the client receives plain text.

I'll help you find bugs in guided decoding. Let me first search through the codebase to locate relevant files.
First, let me find files related to guided decoding:
 cramped
{"name": "glob", "arguments": {"pattern": "*/guideddecod"}}
 flattened
{"name": "glob", "arguments": {"pattern": "/decoder"}}
 flattened
{"name": "glob", "arguments": {"pattern": "/guided"}}
 flattened
{"name": "grep", "arguments": {"pattern": "guided.decoder", "include": ".rs"}}
 flattened
{"name": "grep", "arguments": {"pattern": "guided.decode", "include": ".rs"}}
 flattened
{"name": "grep", "arguments": {"pattern": "guided.decoding", "include": ".rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}

Model: Qwen3-30B-A3B-Instruct-2507

So we'd better remove guided decoding for the tool call grammar.

@guoqingbao
Owner

@sempervictus I have removed the tool grammar since it is not working. It passed all constraint cases except this one: "Emit a search tool call". Are you able to run a last test? As for guided decoding, I think it works and we're ready to merge it.

@guoqingbao
Owner

I think once this one is merged, you can submit another PR to address the tool grammar issue.

@sempervictus
Contributor Author

sempervictus commented Mar 12, 2026

I removed the XML binary anchors and am trying \n stubs, but worst case the jinja allows the actual arguments to parameters to be JSON. Testing presently w/ the updated precision PR.

Sorry, probably haven't read the giant doc, but --enforce-parser is respected by the grammar, so you can force the qwen3 JSON format on q3coder. It's not ideal, hence the attempts to fix the XML.

We also need to excise the stupid instruction from the jinja telling it to use the XML format before giving it the template

@guoqingbao
Owner

omg, you brought your old branch back. We can't merge it in its current form because of the above-mentioned issues; I fixed them in the previous commits and you simply reverted them. 😂

@guoqingbao
Owner

I suggest merging the refactored version from #263 with a note such as Co-authored-by: @sempervictus. Alternatively, you could force-push #263 here for a clean merge, and then continue developing the tool grammar in new PRs.

The refactored version resolves several issues, including:


SpecialTokens::Category::Eos is a heuristic bucket of “end-like” tokens. That is useful for discovery and tooling, but it is not the same thing as the model’s actual semantic eos_token_id. For example, <|im_end|>, <|eot_id|>, <|header_end|>, and <|eom_id|> may be structural delimiters, not global EOS, unless the model config explicitly defines them as EOS.

Guided decoding only needs the real EOS IDs when it adds optional EOS termination to a grammar. Those IDs should come from resolved model config / generation config, not from the heuristic list in special_tokens.rs.

The sample loop is not using the standard llguidance flow cleanly. Standard flow is: compute_mask from current matcher state, apply mask, sample once from masked logits, consume_token on the chosen token. Your code does that first, but then adds a second validation/resample loop afterward. That second pass is redundant if the first mask was correct, and it hides state/mask bugs instead of fixing them.

The fallback re-sample path is batch-wide, not per-row. When one token fails validation, it rebuilds a full-batch tensor and calls sample_with_strategy() across the whole batch, then only keeps re_sampled[seq_idx]. That is nonstandard, wastes work, and can change RNG advancement for unrelated rows in the batch. A standard fallback would re-sample only the affected row, or better, remove the fallback entirely.

The rollback/prefix-cache changes were unnecessary for guided decoding and unsafe for existing logic. RollbackSnapshot and the rollback path added state into scheduler/block_manager/prefix_cache, but the postprocess check was effectively dead and the rollback math/cache mutation could corrupt normal prefix-cache behavior. I removed that path entirely.

Existing non-guided behavior had regressed. OpenAI stop was not applied, stop_token_ids matching had been removed, and engine tool token handling had drifted from the model-specific ToolConfig path.

Grammar composition had standard-mismatch bugs: merged grammars used repeated alternation +, single-EOS formatting was wrong, and regex/literal helpers silently dropped non-ASCII input.

Guided decoding setup only lived in the OpenAI handler.

GuidanceState still carries unused speculative/rollback/cache machinery that the current sampling path does not use. That is not standard guided decoding; it is leftover complexity.


I also refactored your Markdown file since it was too heavy (~2,500 lines), which isn’t suitable for documentation and tends to cause problems for agents. Large files like this can overwhelm agents’ memory with hundreds of thousands of tokens.

@sempervictus
Contributor Author

Done sir done! 😄

I'll get the tools piece into another PR in the morning after i run some coders through it to see if they can pass something that breaks the XML (JSON has been solid since day 1)

@guoqingbao
Owner

Got an issue in the last audit:

Opencode with Qwen3.5-27B-FP8 reports this (prompts: /init and then "find bugs in the guided decoding"):

Unprocessable Entity: messages[77] role=tool requires non-empty content

The server log:

2026-03-14T08:21:28.664877Z  INFO vllm_rs::core::scheduler: GPU Kvcache: 3778 blocks (241792 tokens) free, used 32.6% (7.13GB/21.89GB); CPU swap used 0.0% (0.00GB/4.38GB)
2026-03-14T08:21:28.664897Z  INFO vllm_rs::core::scheduler: GPU MambaState: 1 / 16 slots used (6.2%), approx 0.14GB/2.43GB (slot 146.81MB)
2026-03-14T08:21:28.781768Z  WARN vllm_rs::server::server: Tools enabled for request

The current main, meanwhile, works well with it.

@sempervictus
Contributor Author

Right now we are at #263 level where I thought the tools path is inert. I'm afk for the evening but in the tools branch I do not see those with the flag enabled or disabled. Will push that commit in a few hours once at keyboard.

@guoqingbao
Owner

Another bug here:

let (guided_logits, guided_seq_ids) = if let Some(factory) = &self.llg_factory {
...
}

We should not apply that if a constraint grammar is not provided.

@guoqingbao
Owner

The revision of scheduler.rs for tool calling is also buggy; I reverted it together with the fix for the above. It works now. You need to force-push #263 here again.

@guoqingbao
Owner

Another bug: once a guided decoding request has been processed, it can affect normal requests, making them unable to handle tool calls properly. The state management for guided decoding needs a double check.

@guoqingbao
Owner

Root cause:

The old gate was effectively “if the runner has an llguidance factory, enter the guided path”, which is a batch-level condition, not a per-request one. So when a constrained request and a normal tool-call request were decoded in the same batch, the whole batch went through the llguidance masking/commit flow.

@guoqingbao
Owner

Fixed that in #263 by adding per-sequence guided decoding.
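
The difference between a batch-level gate and a per-sequence one can be sketched like this. It is only an illustration of the idea, not the code in #263: `SeqGuidance`, `mask_guided_rows`, the `u64` sequence ids, and the `Vec<Vec<f32>>` logits layout are all assumptions made so the example stands alone.

```rust
use std::collections::HashMap;

/// Hypothetical per-sequence guidance state: the token ids currently allowed.
struct SeqGuidance {
    allowed: Vec<u32>,
}

/// Mask only the rows whose sequence has guidance state; return the ids that were guided.
/// Rows belonging to unconstrained requests are left exactly as the model produced them.
fn mask_guided_rows(
    seq_ids: &[u64],
    logits: &mut [Vec<f32>],
    states: &HashMap<u64, SeqGuidance>,
) -> Vec<u64> {
    let mut guided = Vec::new();
    for (row, seq_id) in logits.iter_mut().zip(seq_ids) {
        // Per-request gate: the mere existence of a factory on the runner is not enough.
        let Some(state) = states.get(seq_id) else {
            continue;
        };
        for (tok, logit) in row.iter_mut().enumerate() {
            if !state.allowed.contains(&(tok as u32)) {
                *logit = f32::NEG_INFINITY;
            }
        }
        guided.push(*seq_id);
    }
    guided
}
```

Returning the list of guided ids lets the commit step touch only those sequences as well, so a normal tool-call request sharing a batch with a constrained one never enters the masking or commit flow.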

I think after the final alignment with #263, @sempervictus, you'd better do a last test, especially for tool calling, guided decoding, and mixed normal and guided decoding. If they all pass, we can merge it. I think I have fixed almost all the vulnerable points.

@sempervictus
Contributor Author

If you're still awake, I actually had a question about this piece (removed in your branch): if we don't have full seq rollback but the constraint+SamplingParams mask ends up producing something not viable for the constraint itself, this chunk was supposed to "fix" it token by token. I have a faster version of this which aborts early on the check, but I left this in specifically so we could verify whether a correction needed to happen, so we'd know that something bears looking into. Is the logic wrong, or did you cull this because it's a performance hog?

-        self.commit_guided_tokens(&seq_ids, &tokens, guided_seq_ids);
+        // Clone guided_seq_ids for commit_guided_tokens and the following block
+        let guided_seq_ids_for_commit = guided_seq_ids.clone();
+        self.commit_guided_tokens(&seq_ids, &tokens, guided_seq_ids_for_commit);
+
+        if let Some(ref guided_seq_ids) = guided_seq_ids {
+            let mut guidance_states = self.guidance_states.write();
+            let mut guidance_failed = self.guidance_failed.write();
+            for (seq_idx, seq_id) in seq_ids.iter().enumerate() {
+                if !guided_seq_ids.contains(seq_id) || guidance_failed.contains(seq_id) {
+                    continue;
+                }
+
+                if let Some(state) = guidance_states.get_mut(seq_id) {
+                    if state.is_finished() {
+                        continue;
+                    }
+
+                    let token = tokens[seq_idx];
+                    if let Err(err) = state.commit_token(token) {
+                        if guidance_failed.insert(*seq_id) {
+                            crate::log_warn!(
+                                "[Seq {}] Failed to commit guided token {}: {}. Disabling constraints for this sequence.",
+                                seq_id,
+                                token,
+                                err
+                            );
+                        }
+                        let _ = guidance_states.remove(seq_id);
+                    }
+                }
+            }
+        }
 

pulling it out of the commit tree but figure since i have it in front of me i would ask 😄

@guoqingbao
Owner

The post-sample validation was intentional, but I didn’t remove that behavior. I moved it into commit_guided_tokens(...), which is still called immediately after sampling.

@sempervictus
Contributor Author

ah! thanks

348 test cases - only the user-supplied regex failing. having that handled, out for a few hours.

@guoqingbao
Owner

Have you pushed the latest changes from #263 here? There are more bug fixes.

@sempervictus
Contributor Author

Pulled them in yesterday and will rebase the branch again now - am getting all the grammar tests to pass as you're way better at the engine stuff, so just using 263 state as a base and collapsing as much as I can into guidance.rs and schema to avoid pollution of server and such. Sorry it's a bit hectic this week w/ GTC - live in a rather remote area, it's a whole thing to get to civilization and conferences :).

The good news is all the lark logic works after moving the typed ones back to their types from llg, so we aren't having to process those inline, which simplified the string composition logic. The @ refs I haven't been able to get working, but all 11 permutation inputs and their outputs are being built and compiling to Matcher FSMs in the tests (takes a while, there are 348 variants in the current set but it can be thousands when you feed the tester lots of tool options and grammar

@guoqingbao
Owner

I've done the last audit in #263; simply force-push #263 here and I will merge this PR.

@sempervictus
Contributor Author

ah! sorry! pushed before i switched to this tab. 1s, will rebase on 263 only and separate this to another PR

@sempervictus
Contributor Author

All set sir, we're aligned. Will open another one for review with the various grammar options, and i'll include the example in that branch while it's in draft so we (and rather importantly the LLMs) can see all of the resulting grammars and compile them.

@guoqingbao
Owner

Thanks!

@guoqingbao guoqingbao merged commit 1661da8 into guoqingbao:main Mar 16, 2026
1 check passed
guoqingbao added a commit that referenced this pull request May 21, 2026
* Implement Constrained Generation via LLGuidance

This implements the full llguidance integration enabling
grammar-constrained inference for structured outputs, tool calling,
and custom constraints.

Architecture:
- TopLevelGrammar serialized via rmp_serde across RPC boundaries
- Grammar flows: Server → params.grammar → Runner → GuidanceState
 → Matcher
- Inline correction via logits masking during sampling
- Post-process correction via rollback on validation failure
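
As a rough illustration of the "inline correction via logits masking" step, here is a minimal sketch. It assumes the grammar matcher hands back a per-token boolean mask; the names `apply_token_mask` and `argmax` are hypothetical and are not the functions used in this codebase.

```rust
/// Set the logit of every token the grammar disallows to negative infinity,
/// so that sampling can only pick grammar-valid tokens.
/// `allowed[i] == true` means token `i` is permitted in the current state.
/// Assumption: the mask covers the whole vocabulary (same length as logits).
fn apply_token_mask(logits: &mut [f32], allowed: &[bool]) {
    assert_eq!(logits.len(), allowed.len(), "mask must cover the vocabulary");
    for (logit, &ok) in logits.iter_mut().zip(allowed) {
        if !ok {
            *logit = f32::NEG_INFINITY;
        }
    }
}

/// Greedy pick over the masked logits.
/// Returns `None` when the mask left no finite logit to choose from.
fn argmax(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .filter(|(_, l)| l.is_finite())
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

Masking before sampling means an invalid token is never proposed in the first place; the post-process path only matters when the mask and the sampling parameters together still yield something the constraint rejects.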

Key components:
- params.grammar field in SamplingParams for RPC serialization
GuidanceState
- GuidanceState::new() with Matcher state management
- GuidanceState::reset() for proper state cleanup
- Rollback counter (MAX_ROLLBACK_ATTEMPTS=3) preventing infinite
loops
- guidance_failed/guidance_mismatch sets cleared on rollback
- Vocab size validation in build_llg_factory()
- Lark grammar generation from tools via
build_tool_call_lark_grammar()
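
The bounded-rollback idea can be sketched as below. This is only an illustration under assumptions: the closure-based `generate_with_rollback` helper and the `Outcome` type are hypothetical, and the sketch reads `MAX_ROLLBACK_ATTEMPTS = 3` as "one initial attempt plus up to three retries", which may not match the real counter semantics.

```rust
/// Value taken from the commit message; its exact semantics are assumed here.
const MAX_ROLLBACK_ATTEMPTS: u32 = 3;

#[derive(Debug, PartialEq)]
enum Outcome {
    /// Validation passed; carries the accepted tokens.
    Accepted(Vec<u32>),
    /// The rollback budget was exhausted without a valid result.
    GaveUp,
}

/// Call `generate` until `validate` accepts its output or the rollback
/// budget runs out. `generate` receives the zero-based attempt number.
fn generate_with_rollback<G, V>(mut generate: G, validate: V) -> Outcome
where
    G: FnMut(u32) -> Vec<u32>,
    V: Fn(&[u32]) -> bool,
{
    for attempt in 0..=MAX_ROLLBACK_ATTEMPTS {
        let tokens = generate(attempt);
        if validate(&tokens) {
            return Outcome::Accepted(tokens);
        }
        // Validation failed: the tokens are dropped here (the "rollback")
        // and the loop retries while budget remains.
    }
    Outcome::GaveUp
}
```

A hard cap turns a persistent mismatch into a visible failure instead of an endless retry loop.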

CLI flags:
- --enable-tool-grammar: Auto-build LLG grammar from MCP tools
- --allow-constraint-api: Accept client-provided
structured_outputs/response_format

* Support Qwen3.5 Dense models on Metal (#258)

* Utilize SpecialTokens Idiomatic Accessor for EOS

Expand SpecialTokens usage to cover EOS uses across the codebase
to include the chat template. This gates access to the EOS tokens
through a single common API providing an interdiction point to add
or remove them as needed per-model or family as required.

* Idiomatic SpecialTokens Access Pattern

- Replace manual EOS token extraction logic with centralized
SpecialTokens::new() and idiomatic accessors
- Eliminate EosTokenId enum and related complex serialization logic
in favor of direct Vec<u32>
- Update all callers to use SpecialTokens for tool start/end token
IDs
- Remove stop_token_ids from SamplingParams and related logic
(now handled via SpecialTokens)
- Simplify tokenizer config by replacing EosTokenEntry with
Option<String>
- Add comprehensive SpecialTokens API with category-based
accessors, ID/string sets, and search methods

* More SpecialTokens, Improve Example/Binary

Improve the binary example to be a handy extractor for models which
developers can use to update special_tokens.rs quickly.

Add tags extracted from Qwen3.5 0.8B

* SpecialTokens Strings for Llama4 and Qwen3.5 MoE

Narrow the Common category search specifically to find string dups
of actually special tokens (handle "aftermarket" models/merges).

Add and test Llama4 and Qwen3.5 MoE

* ToolConfig Population w/ SpecialTokens

* Drop ToolFormat

* Lead The Horse to Water, Make Him <Think>

This PR introduces a new `reasoning_effort` parameter to control
reasoning block generation in the chat completion API, matching OpenAI's
reasoning API behavior.

- **API Extension**: Added `reasoning_effort` field to
`ChatCompletionRequest` accepting "none", "low", "medium", or "high"
values (case-insensitive)

- **New Module**: Created `src/utils/reasoning.rs` with:
  - `ReasoningEffort` enum with `from_str` deserialization
  - `ThinkingGrammarBuilder` for reasoning block grammar
construction
  - `thinking_grammar_with_reasoning_block()` generating
Lark grammar patterns
  - `build_reasoning_grammar()` for composing reasoning blocks
with base grammars

- **Integration**: Updated `compose_grammars()` in
`src/utils/guidance.rs` to accept and apply reasoning effort levels

- **Schema Sanitization**: Enhanced `sanitize_schema_for_llguidance()`
in `src/tools/schema.rs` to strip null from required field types,
ensuring grammars enforce field presence for tool parameters

- **Special Token Helpers**: Added `reasoning_start_ids()`,
`reasoning_end_ids()`, and `reasoning_tokens()` methods to
`SpecialTokens` for robust token detection

- **Comprehensive Tests**: Added 11 new tests covering:
  - Reasoning effort parsing and validation
  - Thinking grammar builder functionality
  - Schema null-stripping for required/optional fields
  - Grammar composition permutations with reasoning
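
For illustration, case-insensitive parsing of the four accepted `reasoning_effort` values could look like the sketch below. It assumes a plain `FromStr` implementation with a `String` error and covers only the values named in this commit message; the real `ReasoningEffort` in `src/utils/reasoning.rs` may carry more variants and a different error type.

```rust
use std::str::FromStr;

/// Sketch of the effort levels named above (assumed shape, not the real type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
}

impl FromStr for ReasoningEffort {
    type Err = String;

    /// Accepts "none", "low", "medium" or "high" in any letter case.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "none" => Ok(Self::None),
            "low" => Ok(Self::Low),
            "medium" => Ok(Self::Medium),
            "high" => Ok(Self::High),
            other => Err(format!("unknown reasoning_effort: {other:?}")),
        }
    }
}
```

With this in place, `"HIGH".parse::<ReasoningEffort>()` yields `Ok(ReasoningEffort::High)`, and an unrecognized value is rejected with an error rather than silently falling back to a default.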

* Tier Reasoning Effort

* Anchor XML Tool-Grammar With SpecialTokens Pads

This change updates the ToolGrammarBuilder to correctly use pad
token IDs for XML tool call termination when building Lark grammars
for models that use XML-style tool calling (e.g., Qwen3-Coder/3.5).

The XML format requires closing markers for </function> and
</parameter> tags. When the tokenizer lacks special closing
tags the model can run-on generating forever as XML is not a
finite stateless grammar; see guidance-ai/llguidance/issues/306.

Use pad tokens as "magic" terminating markers embedded into the
grammar. The tokenizer/llg mask recognizes them as tokens a model
cannot normally emit in its textual output (normally masked to 0.0
logprob). Anchor the XML function/parameter generation the same way
we bound tool-call and text. This is easier on the model than forcing
JSON parsing (qwen3), especially in conjunction with forcing it to
`<think>` if it's not trained to do so.

Mechanically, we modify the chat template to add special pad tags
after the closing tags for function and param, and inject those
into the grammar template submitted to the model as tool-choice.

Call path:
  src/tools/schema.rs:361 build_xml_with_anchors(pad_ids)
    ├─ Uses pad_ids[0] as </function> anchor
    └─ Uses pad_ids[1] as </parameter> anchor

Grammar structure:
-  start: ( text | tool_call )+ eos?
-  tool_call: <[tool_start_id]> tool_content <[tool_end_id]>
-  tool_0: "<function=fetch_url_via_curl>" param_0_0 ...
"</function>" <[pad_id_0]>
-  param_0_0: "<parameter=url>" value_0_0 ... "</parameter>"
<[pad_id_1]>
- ...

The pad tokens serve as finite termination points for the XML
parser, allowing the Lark grammar to generate valid, parseable tool
calls without requiring explicit special closing tags in the
tokenizer vocabulary.
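
To make the anchoring concrete, here is a small sketch that renders the `tool_N` / `param_N_M` rules in the shape shown above. The helper `xml_tool_rule` is hypothetical (it is not `build_xml_with_anchors`), the `value_N_M` rules are referenced but not defined, and the parts elided with "..." in the grammar structure are simply left out.

```rust
/// Render Lark rules for one XML-style tool, closing the function body with
/// `pad_ids[0]` and each parameter with `pad_ids[1]`, written in the
/// `<[token_id]>` form used in the grammar structure above.
fn xml_tool_rule(index: usize, name: &str, params: &[&str], pad_ids: [u32; 2]) -> String {
    // Names of the parameter rules this tool rule refers to, in order.
    let param_refs: Vec<String> = (0..params.len())
        .map(|j| format!("param_{index}_{j}"))
        .collect();

    let mut out = String::new();
    out.push_str(&format!(
        "tool_{index}: \"<function={name}>\" {} \"</function>\" <[{}]>\n",
        param_refs.join(" "),
        pad_ids[0]
    ));
    for (j, param) in params.iter().enumerate() {
        out.push_str(&format!(
            "param_{index}_{j}: \"<parameter={param}>\" value_{index}_{j} \"</parameter>\" <[{}]>\n",
            pad_ids[1]
        ));
    }
    out
}
```

Calling `xml_tool_rule(0, "fetch_url_via_curl", &["url"], [7, 8])` (the pad IDs 7 and 8 are made up) produces one `tool_0` line ending in `<[7]>` and one `param_0_0` line ending in `<[8]>`, so every closing tag is followed by a token the matcher can treat as a hard stop.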

* Cargo fmt

* Refactor guided decoding

* Update docs

* Typo fix

* Remove tool grammar & fix slow first token response for sync request

* Strip guided-decoding’s leftover tool grammar surface

* Fix incorrect guidance application

* Revert changes for scheduler.rs (tool call related)

* Apply per-sequence guided decoding

* Remove redundancy

* Fix corner case

* Permit empty tool call result

---------

Co-authored-by: RageLtMan <rageltman [at] sempervictus>
Co-authored-by: Guoqing Bao <topon@outlook.com>