Enable Reasoning via Guided Enforcement by sempervictus · Pull Request #262 · guoqingbao/xinfer

sempervictus · 2026-03-11T13:52:21Z

This should more or less complete the LLG work by including reasoning API support a la OpenAI's.

Overview

The guided inference system supports multiple reasoning effort levels through the ReasoningEffort enum. Each level implements a different reasoning strategy optimized for specific use cases:

| Level | Description | Use Case |
|---|---|---|
| `None` | No structured reasoning - direct output only | Fast generation, low latency |
| `Low` | Constrained single-paragraph reasoning (~150 chars max) | Fast Thinking, reduces hallucination |
| `Medium` | Standard multi-step Chain-of-Thought (CoT) | Balanced reasoning depth |
| `High` | Adversarial analysis with self-correction phases | Complex tasks requiring error checking |
| `ChainOfThought` | Best-of-breed CoVe + Self-Critique | Maximum accuracy for fact-sensitive tasks |

Reasoning Effort Levels

`ReasoningEffort::Low` - Fast Thinking

Implements "Fast Thinking" with tight length constraints (~150 chars max). This reduces hallucination risk by limiting the generation space.

start: reasoning_block
reasoning_block: <[START_ID]> "\n" thinkgram "\n" <[END_ID]> "\n"
thinkgram: /^[^\n]{1,150}/

`ReasoningEffort::Medium` - Standard CoT

Implements Wei et al. (2022) baseline with sentence-based termination. Allows multiple steps but enforces sentence boundaries.

start: reasoning_block
reasoning_block: <[START_ID]> "\n" thinkgram "\n" <[END_ID]> "\n"
thinkgram: /(?s:[^.!?]+[.!?])+/  # Multiple sentences

`ReasoningEffort::High` - Adversarial Analysis

Implements Cheng & Su (2025) adversarial critique pattern. Forces explicit error checking before final output.

start: reasoning_block* analysis_block*
reasoning_block: <[START_ID]> "\n" thinkgram "\n" <[END_ID]> "\n"
analysis_block: "<ANALYZE>" "\n" analysis_content "\n" "</ANALYZE>" "\n"
thinkgram: /(?s:[^.!?]+[.!?])+/
analysis_content: /(?s:.*)/

`ReasoningEffort::ChainOfThought` - CoVe Pattern

Combines Madaan et al. (2024) Chain-of-Verification with adversarial self-correction. Maximum accuracy for complex/fact-sensitive tasks.

start: cots*
cots: <[START_ID]> "\n" draft_phase verification_phase critique_phase final_phase "\n" <[END_ID]> "\n"
draft_phase: /(?s:[^.!?]+[.!?])+/
verification_phase: "<VERIFY>" "\n" verification_questions "\n" verification_answers "\n" "</VERIFY>" "\n"
verification_questions: /(?s:[^.!?]+[.!?])+/
verification_answers: /(?s:.*)/
critique_phase: "<CRITIQUE>" "\n" self_critique "\n" "</CRITIQUE>" "\n"
self_critique: /(?s:.*)/
final_phase: "<FINAL_ANSWER>" "\n" final_content
final_content: /(?s:.*)/
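As a rough illustration of how grammar text like the `Low` and `Medium` listings above could be assembled from the reasoning start/end token IDs, here is a minimal sketch. The `Effort` enum and `thinking_grammar` function are hypothetical names for this example only; this is not the PR's `ThinkingGrammarBuilder`, and the `High` and `ChainOfThought` grammars are left out.

```rust
/// Hypothetical, reduced effort enum used only by this sketch.
pub enum Effort {
    None,
    Low,
    Medium,
}

/// Returns Lark grammar text shaped like the `Low`/`Medium` listings,
/// or `None` when no reasoning block is requested.
pub fn thinking_grammar(effort: Effort, start_id: u32, end_id: u32) -> Option<String> {
    // The regex body is copied from the listings; only the token IDs vary.
    let thinkgram = match effort {
        Effort::None => return None,
        Effort::Low => r"/^[^\n]{1,150}/",
        Effort::Medium => r"/(?s:[^.!?]+[.!?])+/",
    };
    Some(format!(
        "start: reasoning_block\n\
         reasoning_block: <[{start_id}]> \"\\n\" thinkgram \"\\n\" <[{end_id}]> \"\\n\"\n\
         thinkgram: {thinkgram}\n"
    ))
}
```

Calling `thinking_grammar(Effort::Low, start, end)` yields the three-line grammar with `START_ID` and `END_ID` replaced by the given numeric IDs.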

API Reference

`ReasoningEffort` Enum

Location: src/utils/reasoning.rs:16-39

#[derive(Clone, Debug, serde::Serialize, serde::Deserialize, PartialEq, Eq)]
pub enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
    ChainOfThought,
}
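The PR description says the request field accepts "none", "low", "medium", or "high" case-insensitively. A minimal sketch of that parsing with `std::str::FromStr` might look like the following; the serde derives are dropped so it compiles with the standard library alone, and the error type and trimming behaviour are assumptions rather than the PR's actual code.

```rust
use std::str::FromStr;

/// Mirror of the enum above, without the serde derives.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
    ChainOfThought,
}

impl FromStr for ReasoningEffort {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Case-insensitive match on the trimmed request value.
        // Only the four values named in the PR description are mapped;
        // no string spelling for `ChainOfThought` is assumed here.
        match s.trim().to_ascii_lowercase().as_str() {
            "none" => Ok(Self::None),
            "low" => Ok(Self::Low),
            "medium" => Ok(Self::Medium),
            "high" => Ok(Self::High),
            other => Err(format!("unknown reasoning_effort: {other:?}")),
        }
    }
}
```

With this in place, `"LOW".parse::<ReasoningEffort>()` and `"low".parse::<ReasoningEffort>()` resolve to the same variant, and unknown values surface as an error instead of silently defaulting.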

`ThinkingGrammarBuilder`

Location: src/utils/reasoning.rs:143-178

pub struct ThinkingGrammarBuilder {
    start_id: u32,
    end_id: u32,
    effort: Option<ReasoningEffort>,
}

Methods:

ThinkingGrammarBuilder::new(start_id, end_id, effort) - Create new builder
ThinkingGrammarBuilder::from_string(start_id, end_id) - Create from string IDs
build() - Generate Lark grammar string
build_grammar() - Generate TopLevelGrammar

`build_reasoning_grammar()`

Location: src/utils/reasoning.rs:183-218

Wraps a base grammar with reasoning blocks when reasoning effort is enabled.

pub fn build_reasoning_grammar(
    base_grammar: TopLevelGrammar,
    reasoning_effort: ReasoningEffort,
    special_tokens: &SpecialTokens,
) -> TopLevelGrammar

Includes #232 and #260

sequenceDiagram
    participant User
    participant API
    participant Pipeline
    participant SpecialTokens
    participant LLGFactory
    participant Matcher
    participant TokenParser
    participant EarleyParser
    participant Lexer
    participant TokTrie
    participant Sampler
    participant LogitsProcessor
    participant Model

    User->>API: Request with constraint (regex/json_schema/lark/llguidance)

    Note over User,API: Phase 1: Request Setup and Grammar Building

    API->>SpecialTokens: SpecialTokens::new(&tokenizer)
    SpecialTokens-->>API: Return EOS, BOS, TOOL token IDs
    API->>Pipeline: build_llg_factory(tokenizer)
    Pipeline->>LLGFactory: toktrie_hf_tokenizers::ByteTokenizer::from_tokenizer(tokenizer)
    LLGFactory->>TokTrie: Create token trie from tokenizer vocabulary
    TokTrie-->>LLGFactory: Return TokEnv with trie
    LLGFactory->>LLGFactory: ParserFactory::new_simple(&env)
    LLGFactory-->>Pipeline: Return Arc<ParserFactory>

    Pipeline->>Pipeline: llg_grammar_from_constraint(&request.constraint)
    Pipeline->>Matcher: constraint_from_llg_grammar(&factory, grm)
    Matcher->>Matcher: factory.create_parser(grm)
    Matcher->>TokenParser: Create with grammar_init
    TokenParser->>EarleyParser: Build CGrammar from grammar
    TokenParser->>Lexer: Build LexerSpec from grammar
    Lexer->>TokTrie: Precompute large lexemes if needed
    TokTrie-->>Lexer: Return optimized lexeme sets

    Note over User,Matcher: Phase 2: Prompt Processing (if needed)

    User->>API: Optional: process_prompt(prompt_tokens)
    API->>TokenParser: process_prompt(prompt_tokens)
    TokenParser->>TokenParser: tokenize_bytes_marker(&prompt_bytes)
    TokenParser->>TokenParser: process_prompt() returns new prompt

    Note over User,Matcher: Phase 3: Inference Loop

    loop for each token generation

    Model->>Model: Forward pass on input tokens
    Model-->>Pipeline: Return logits tensor

    Pipeline->>Sampler: sample_sequence(logits, seq, ...)

    Note over Sampler: Two-stage sampling with llguidance

    Sampler->>LogitsProcessor: Apply llguidance constraint

    LogitsProcessor->>TokenParser: compute_mask()
    TokenParser->>TokenParser: compute_mask_inner()
    TokenParser->>EarleyParser: run_speculative("compute_mask")
    EarleyParser->>EarleyParser: trie_started("compute_mask")
    EarleyParser->>EarleyParser: compute_bias()
    EarleyParser->>Lexer: compute_bias() with token_prefix

    Note over Lexer,TokTrie: Lexical Scope Analysis

    Lexer->>TokTrie: Walk token trie for allowed lexemes
    TokTrie-->>Lexer: Return SimpleVob bit mask

    Lexer->>EarleyParser: Return mask to TokenParser
    TokenParser->>TokenParser: cache mask for fast-forward

    TokenParser-->>LogitsProcessor: Return SimpleVob mask

    LogitsProcessor->>LogitsProcessor: Check if sampled token is allowed
    LogitsProcessor->>Sampler: Apply logit biasing

    alt Token is allowed
        Sampler->>Sampler: No biasing needed
    else Token is not allowed
        Sampler->>Sampler: Set invalid tokens to -f32::INFINITY
        Sampler->>Sampler: Re-sample with biased logits
    end

    Sampler->>TokenParser: consume_token(sampled_token)
    TokenParser->>TokenParser: apply_token(sampled_token)
    TokenParser->>TokenParser: llm_tokens.push(sampled_token)
    TokenParser->>TokenParser: llm_bytes.extend(token_bytes)
    TokenParser->>EarleyParser: parser.apply_token(token_bytes, token_id)
    EarleyParser->>Lexer: advance lexer state
    Lexer->>Lexer: Update lexer_stack with new state
    Lexer->>EarleyParser: Return backtrack count

    alt Backtrack needed
        EarleyParser->>EarleyParser: rollback(backtrack_bytes)
        EarleyParser->>EarleyParser: Update llm_tokens and llm_bytes
    end

    TokenParser->>TokenParser: check_stop()
    TokenParser-->>Sampler: Return CommitResult

    Note over Sampler: Phase 4: Fast-Forward (if enabled)

    Sampler->>TokenParser: compute_ff_tokens()
    TokenParser->>TokenParser: ff_tokens()
    TokenParser->>TokTrie: Tokenize forced bytes
    TokTrie-->>TokenParser: Return fast-forward tokens

    alt Fast-forward tokens available
        TokenParser->>TokenParser: consume_ff_tokens()
        loop for each ff_token
            TokenParser->>TokenParser: consume_token(ff_token)
            TokenParser->>TokenParser: llm_tokens.push(ff_token)
            TokenParser->>TokenParser: llm_bytes.extend(ff_token_bytes)
        end
    end

    Note over Sampler: Phase 5: Speculative Decoding (if enabled)

    Model->>Model: Draft model forward pass
    Model-->>Pipeline: Return draft logits

    Pipeline->>Sampler: sample_target_sequence_speculative()
    Sampler->>TokenParser: rollback(n_toks)
    TokenParser->>EarleyParser: parser.rollback(bytes_to_drop)
    EarleyParser->>Lexer: pop lexer states
    Lexer-->>TokenParser: Return rollback result

    Sampler->>Sampler: Sample draft tokens
    Sampler->>TokenParser: validate_tokens(draft_tokens)
    TokenParser->>TokenParser: consume_token(draft_token)

    alt Draft token accepted
        TokenParser->>TokenParser: Continue with next draft
    else Draft token rejected
        TokenParser->>TokenParser: Accept partial draft
        TokenParser->>TokenParser: Rollback to last valid state
    end

    end

    Note over User,Matcher: Phase 6: Token Geometry and Binary Data State

    TokTrie->>TokTrie: Token encoding (8:24 bit split)
    TokTrie->>TokTrie: node.bits = (token_id << 8) | byte
    TokTrie->>TokTrie: node.bits2 = (subtree_size << 10) | num_parents

    TokTrie->>SimpleVob: Bit mask storage
    SimpleVob->>SimpleVob: data: Vec<u32> (32 tokens per word)
    SimpleVob->>SimpleVob: allow_token(tok): data[tok>>5] |= 1 << (tok&31)

    Note over User,Matcher: Phase 7: Rollback and Verification

    TokenParser->>TokenParser: validate_tokens(tokens)
    TokenParser->>EarleyParser: validate_tokens_raw(tokens)
    EarleyParser->>Lexer: Check if tokens match current lexer state
    Lexer-->>TokenParser: Return number of valid tokens

    TokenParser->>TokenParser: rollback(n_tokens)
    TokenParser->>EarleyParser: parser.rollback(bytes_to_drop)
    EarleyParser->>Lexer: pop lexer states
    TokenParser->>TokenParser: llm_tokens.truncate(new_len)
    TokenParser->>TokenParser: llm_bytes.truncate(new_len)

    Note over User,Matcher: Phase 8: Response Generation

    Pipeline->>API: Return completion with tokens
    API->>User: Stream or return final response
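The bit-mask storage in Phase 6 of the diagram (`data[tok>>5] |= 1 << (tok&31)`, 32 tokens per `u32` word) can be sketched as below. `TokenMask` is a toy stand-in written for this example, not llguidance's `SimpleVob` type, and `is_allowed` is an assumed companion to `allow_token`.

```rust
/// Toy token bit mask: one bit per token ID, 32 tokens per `u32` word.
pub struct TokenMask {
    data: Vec<u32>,
}

impl TokenMask {
    /// Allocates enough words to hold `vocab_size` bits, all cleared.
    pub fn new(vocab_size: usize) -> Self {
        Self { data: vec![0; (vocab_size + 31) / 32] }
    }

    /// Sets the bit for `tok`: word index is `tok >> 5`, bit index is `tok & 31`.
    pub fn allow_token(&mut self, tok: u32) {
        self.data[(tok >> 5) as usize] |= 1u32 << (tok & 31);
    }

    /// Reads the bit for `tok` back.
    pub fn is_allowed(&self, tok: u32) -> bool {
        (self.data[(tok >> 5) as usize] & (1u32 << (tok & 31))) != 0
    }
}
```

Packing the mask this way keeps a 150k-token vocabulary in under 5,000 words, so a whole mask can be computed and applied per step at low cost.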

This implements the full llguidance integration enabling grammar-constrained inference for structured outputs, tool calling, and custom constraints.

Architecture:
- TopLevelGrammar serialized via rmp_serde across RPC boundaries
- Grammar flows: Server → params.grammar → Runner → GuidanceState → Matcher
- Inline correction via logits masking during sampling
- Post-process correction via rollback on validation failure

Key components:
- params.grammar field in SamplingParams for RPC serialization GuidanceState
- GuidanceState::new() with Matcher state management
- GuidanceState::reset() for proper state cleanup
- Rollback counter (MAX_ROLLBACK_ATTEMPTS=3) preventing infinite loops
- guidance_failed/guidance_mismatch sets cleared on rollback
- Vocab size validation in build_llg_factory()
- Lark grammar generation from tools via build_tool_call_lark_grammar()

CLI flags:
- --enable-tool-grammar: Auto-build LLG grammar from MCP tools
- --allow-constraint-api: Accept client-provided structured_outputs/response_format

Expand SpecialTokens usage to cover EOS uses across the codebase to include the chat template. This gates access to the EOS tokens through a single common API providing an interdiction point to add or remove them as needed per-model or family as required.

- Replace manual EOS token extraction logic with centralized SpecialTokens::new() and idiomatic accessors
- Eliminate EosTokenId enum and related complex serialization logic in favor of direct Vec<u32>
- Update all callers to use SpecialTokens for tool start/end token IDs
- Remove stop_token_ids from SamplingParams and related logic (now handled via SpecialTokens)
- Simplify tokenizer config by replacing EosTokenEntry with Option<String>
- Add comprehensive SpecialTokens API with category-based accessors, ID/string sets, and search methods

Improve the binary example to be a handy extractor for models which developers can use to update special_tokens.rs quickly. Add tags extracted from Qwen3.5 0.8B

Narrow the Common category search specifically to find string dups of actually special tokens (handle "aftermarket" models/merges). Add and test Llama4 and Qwen3.5 MoE

This PR introduces a new `reasoning_effort` parameter to control reasoning block generation in the chat completion API, matching OpenAI's reasoning API behavior.

- **API Extension**: Added `reasoning_effort` field to `ChatCompletionRequest` accepting "none", "low", "medium", or "high" values (case-insensitive)
- **New Module**: Created `src/utils/reasoning.rs` with:
  - `ReasoningEffort` enum with `from_str` deserialization
  - `ThinkingGrammarBuilder` for reasoning block grammar construction
  - `thinking_grammar_with_reasoning_block()` generating Lark grammar patterns
  - `build_reasoning_grammar()` for composing reasoning blocks with base grammars
- **Integration**: Updated `compose_grammars()` in `src/utils/guidance.rs` to accept and apply reasoning effort levels
- **Schema Sanitization**: Enhanced `sanitize_schema_for_llguidance()` in `src/tools/schema.rs` to strip null from required field types, ensuring grammars enforce field presence for tool parameters
- **Special Token Helpers**: Added `reasoning_start_ids()`, `reasoning_end_ids()`, and `reasoning_tokens()` methods to `SpecialTokens` for robust token detection
- **Comprehensive Tests**: Added 11 new tests covering:
  - Reasoning effort parsing and validation
  - Thinking grammar builder functionality
  - Schema null-stripping for required/optional fields
  - Grammar composition permutations with reasoning

This change updates the ToolGrammarBuilder to correctly use pad token IDs for XML tool call termination when building Lark grammars for models that use XML-style tool calling (e.g., Qwen3-Coder/3.5).

The XML format requires closing markers for </function> and </parameter> tags. When the tokenizer lacks special closing tags, the model can run on generating forever, as XML is not a finite stateless grammar; see guidance-ai/llguidance/issues/306. Use pad tokens as "magic" terminating markers embedded into the grammar and recognizable by the tokenizer/llg mask as output a model cannot normally emit in its text (normally masked to zero probability). Anchor the XML function/parameter generation like we bound tool-call and text. This is easier on the model than forcing JSON parsing (qwen3), especially in conjunction with forcing it to `<think>` if it's not trained to do so.

Mechanically, we modify the chat template to add special pad tags after the closing tags for function and param, and inject those into the grammar template submitted to the model as tool-choice.

Call path:

src/tools/schema.rs:361 build_xml_with_anchors(pad_ids)
├─ Uses pad_ids[0] as </function> anchor
└─ Uses pad_ids[1] as </parameter> anchor

Grammar structure:
- start: ( text | tool_call )+ eos?
- tool_call: <[tool_start_id]> tool_content <[tool_end_id]>
- tool_0: "<function=fetch_url_via_curl>" param_0_0 ... "</function>" <[pad_id_0]>
- param_0_0: "<parameter=url>" value_0_0 ... "</parameter>" <[pad_id_1]>
- ...

The pad tokens serve as finite termination points for the XML parser, allowing the Lark grammar to generate valid, parseable tool calls without requiring explicit special closing tags in the tokenizer vocabulary.
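To make the anchoring idea concrete, here is a small sketch that emits `tool_N` and `param_N_M` rule lines in the shape listed under "Grammar structure". `xml_tool_rules` is a hypothetical helper, not the PR's `build_xml_with_anchors`; it ignores whatever the `...` in the listing elides, does not emit the `value_N_M` rules, and assumes tool and parameter names contain no characters that need escaping.

```rust
/// Emits one `tool_N` rule and one `param_N_M` rule per parameter.
/// `pad_ids[0]` anchors `</function>`, `pad_ids[1]` anchors `</parameter>`.
pub fn xml_tool_rules(
    tool_idx: usize,
    name: &str,
    params: &[&str],
    pad_ids: [u32; 2],
) -> Vec<String> {
    // References to the per-parameter rules, in declaration order.
    let param_refs: Vec<String> = (0..params.len())
        .map(|p| format!("param_{tool_idx}_{p}"))
        .collect();

    let mut rules = vec![format!(
        "tool_{tool_idx}: \"<function={name}>\" {} \"</function>\" <[{}]>",
        param_refs.join(" "),
        pad_ids[0]
    )];

    for (p, param) in params.iter().enumerate() {
        // `value_N_M` is only referenced; its definition is out of scope here.
        rules.push(format!(
            "param_{tool_idx}_{p}: \"<parameter={param}>\" value_{tool_idx}_{p} \"</parameter>\" <[{}]>",
            pad_ids[1]
        ));
    }
    rules
}
```

Because every closing tag is followed by a token the model cannot produce as ordinary text, the grammar has an unambiguous point at which a function or parameter body must end.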

sempervictus · 2026-03-11T21:42:57Z

Supersedes:

guoqingbao · 2026-03-12T07:37:54Z

Here are some review comments:

SpecialTokens::Category::Eos is a heuristic bucket of “end-like” tokens. That is useful for discovery and tooling, but it is not the same thing as the model’s actual semantic eos_token_id. For example, <|im_end|>, <|eot_id|>, <|header_end|>, and <|eom_id|> may be structural delimiters, not global EOS, unless the model config explicitly defines them as EOS.

Guided decoding only needs the real EOS IDs when it adds optional EOS termination to a grammar. Those IDs should come from resolved model config / generation config, not from the heuristic list in special_tokens.rs.


The sample loop is not using the standard llguidance flow cleanly. Standard flow is: compute_mask from current matcher state, apply mask, sample once from masked logits, consume_token on the chosen token. Your code does that first, but then adds a second validation/resample loop afterward. That second pass is redundant if the first mask was correct, and it hides state/mask bugs instead of fixing them.
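The standard flow described above (compute the mask, apply it, sample once, consume the chosen token) can be sketched as follows. The `Matcher` trait and `AllowList` type are toy stand-ins invented for this example and do not reflect llguidance's real API; sampling is plain greedy argmax to keep the sketch short.

```rust
/// Hypothetical, minimal view of a grammar matcher.
pub trait Matcher {
    fn compute_mask(&self, vocab_size: usize) -> Vec<bool>;
    fn consume_token(&mut self, token: u32) -> Result<(), String>;
}

/// Toy matcher that always allows the same fixed set of tokens.
pub struct AllowList {
    pub allowed: Vec<u32>,
    pub consumed: Vec<u32>,
}

impl Matcher for AllowList {
    fn compute_mask(&self, vocab_size: usize) -> Vec<bool> {
        (0..vocab_size as u32).map(|t| self.allowed.contains(&t)).collect()
    }

    fn consume_token(&mut self, token: u32) -> Result<(), String> {
        if self.allowed.contains(&token) {
            self.consumed.push(token);
            Ok(())
        } else {
            Err(format!("token {token} violates the constraint"))
        }
    }
}

/// One guided decoding step: mask, sample once, consume. No second pass.
pub fn guided_step<M: Matcher>(matcher: &mut M, logits: &mut [f32]) -> Result<u32, String> {
    // 1. Mask from the current matcher state.
    let mask = matcher.compute_mask(logits.len());
    // 2. Disallowed tokens can never win the sample.
    for (logit, allowed) in logits.iter_mut().zip(&mask) {
        if !*allowed {
            *logit = f32::NEG_INFINITY;
        }
    }
    // 3. Sample once (greedy argmax over the masked logits).
    let mut best: Option<(usize, f32)> = None;
    for (i, &l) in logits.iter().enumerate() {
        if l > f32::NEG_INFINITY && best.map_or(true, |(_, b)| l > b) {
            best = Some((i, l));
        }
    }
    let (token, _) = best.ok_or_else(|| "mask allows no token".to_string())?;
    // 4. Advance the matcher with the chosen token.
    matcher.consume_token(token as u32)?;
    Ok(token as u32)
}
```

If `consume_token` ever fails after a correctly applied mask, that points at a mask or state bug, which is the argument for surfacing the error rather than re-sampling around it.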

The fallback re-sample path is batch-wide, not per-row. When one token fails validation, it rebuilds a full-batch tensor and calls sample_with_strategy() across the whole batch, then only keeps re_sampled[seq_idx]. That is nonstandard, wastes work, and can change RNG advancement for unrelated rows in the batch. A standard fallback would re-sample only the affected row, or better, remove the fallback entirely.

The rollback/prefix-cache changes were unnecessary for guided decoding and unsafe for existing logic. RollbackSnapshot and the rollback path added state into scheduler/block_manager/prefix_cache, but the postprocess check was effectively dead and the rollback math/cache mutation could corrupt normal prefix-cache behavior. I removed that path entirely.

Existing non-guided behavior had regressed. OpenAI stop was not applied, stop_token_ids matching had been removed, and engine tool token handling had drifted from the model-specific ToolConfig path.

Grammar composition had standard-mismatch bugs: merged grammars used repeated alternation +, single-EOS formatting was wrong, and regex/literal helpers silently dropped non-ASCII input.
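For the EOS formatting point, a sketch of a rule builder that treats the empty, single-ID, and multi-ID cases separately is shown below. `eos_rule` is a hypothetical helper. The multi-ID shape follows the `eos:` line in the log further down the thread, while the bare `<[id]>` form for a single ID is an assumption, since the thread does not show the corrected output.

```rust
/// Builds the `eos` rule text from resolved EOS token IDs.
/// Returns `None` when there are no IDs, so callers can omit `eos?` entirely.
pub fn eos_rule(eos_ids: &[u32]) -> Option<String> {
    match eos_ids {
        [] => None,
        // Assumed form for a single ID: no grouping parentheses.
        [single] => Some(format!("eos: <[{single}]>")),
        many => {
            let alts: Vec<String> = many.iter().map(|id| format!("<[{id}]>")).collect();
            Some(format!("eos: ( {} )", alts.join(" | ")))
        }
    }
}
```

Keeping the three cases explicit avoids emitting an empty or malformed alternation when only one EOS ID is configured.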

Guided decoding setup only lived in the OpenAI handler.

GuidanceState still carries unused speculative/rollback/cache machinery that the current sampling path does not use. That is not standard guided decoding; it is leftover complexity.

I'm refactoring to resolve these issues.

guoqingbao · 2026-03-12T10:54:38Z

Enabling the tool grammar is not working. This is the log before refactoring:

2026-03-12T10:44:51.848006Z  WARN vllm_rs::core::runner: [Seq 13] Failed to create guidance state: 3(50): Expected value (at ":" (':'))
   1 | start: ( text | tool_call )+ eos?
   2 | 
   3 | tool_call: <[151657]> tool_obj <[151658]>
...
   4 | eos:  ( <[151643]> | <[151645]> | <[151647]> | <[151649]> | <[151651]> | <[151653]> )
. Disabling constraints for this sequence.

After refactor:

It can create the guidance state but fails to generate a tool call; the client receives plain text.

I'll help you find bugs in guided decoding. Let me first search through the codebase to locate relevant files.
First, let me find files related to guided decoding:
 cramped
{"name": "glob", "arguments": {"pattern": "*/guideddecod"}}
 flattened
{"name": "glob", "arguments": {"pattern": "/decoder"}}
 flattened
{"name": "glob", "arguments": {"pattern": "/guided"}}
 flattened
{"name": "grep", "arguments": {"pattern": "guided.decoder", "include": ".rs"}}
 flattened
{"name": "grep", "arguments": {"pattern": "guided.decode", "include": ".rs"}}
 flattened
{"name": "grep", "arguments": {"pattern": "guided.decoding", "include": ".rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}
 flattened
{"name": "read", "arguments": {"filePath": "/root/vllm.rs/src/core/decoder.rs"}}

Model: Qwen3-30B-A3B-Instruct-2507

So we'd better remove guided decoding for the tool-call grammar.

guoqingbao · 2026-03-12T11:35:32Z

@sempervictus I have removed the tool grammar since it is not working. It passed all constraint cases except this one: "Emit a search tool call". Are you able to run a last test? In terms of guided decoding, I think it works and we're ready to merge it.

guoqingbao · 2026-03-12T11:52:03Z

I think once this one is merged, you can submit another PR to address the tool grammar issue.

sempervictus · 2026-03-12T21:19:03Z

i removed the XML binary anchors - trying \n stubs, but worst case the jinja allows the actual arguments to parameters to be JSON. Testing presently w/ the updated precision PR.

Sorry, probably haven't read the giant doc, but --enforce-parser is respected by the grammar. So you can force qwen3 JSON format on q3coder. It's not ideal, hence attempts to fix the XML.

We also need to excise the stupid instruction from the jinja telling it to use the XML format before giving it the template

guoqingbao · 2026-03-13T00:36:46Z

> i removed the XML binary anchors - trying \n stubs, but worst case the jinja allows the actual arguments to parameters to be JSON. Testing presently w/ the updated precision PR.
>
> Sorry, probably haven't read the giant doc, but --enforce-parser is respected by the grammar. So you can force qwen3 JSON format on q3coder. It's not ideal, hence attempts to fix the XML.
>
> We also need to excise the stupid instruction from the jinja telling it to use the XML format before giving it the template

omg, you brought your old branch back. We can't merge it in its current form because of the above-mentioned issues; I fixed them in the previous commits and you are simply reverting them. 😂

guoqingbao · 2026-03-13T03:05:14Z

I suggest merging the refactored version from #263 with a note such as Co-authored-by: @sempervictus. Alternatively, you could force-push #263 here for a clean merge, and then continue developing the tool grammar in new PRs.

The refactored version resolves several issues, including:

Guided decoding only needs the real EOS IDs when it adds optional EOS termination to a grammar. Those IDs should come from resolved model config / generation config, not from the heuristic list in special_tokens.rs.

The sample loop is not using the standard llguidance flow cleanly. Standard flow is: compute_mask from current matcher state, apply mask, sample once from masked logits, consume_token on the chosen token. Your code does that first, but then adds a second validation/resample loop afterward. That second pass is redundant if the first mask was correct, and it hides state/mask bugs instead of fixing them.

The fallback re-sample path is batch-wide, not per-row. When one token fails validation, it rebuilds a full-batch tensor and calls sample_with_strategy() across the whole batch, then only keeps re_sampled[seq_idx]. That is nonstandard, wastes work, and can change RNG advancement for unrelated rows in the batch. A standard fallback would re-sample only the affected row, or better, remove the fallback entirely.

The rollback/prefix-cache changes were unnecessary for guided decoding and unsafe for existing logic. RollbackSnapshot and the rollback path added state into scheduler/block_manager/prefix_cache, but the postprocess check was effectively dead and the rollback math/cache mutation could corrupt normal prefix-cache behavior. I removed that path entirely.

Existing non-guided behavior had regressed. OpenAI stop was not applied, stop_token_ids matching had been removed, and engine tool token handling had drifted from the model-specific ToolConfig path.

Grammar composition had standard-mismatch bugs: merged grammars used repeated alternation +, single-EOS formatting was wrong, and regex/literal helpers silently dropped non-ASCII input.

Guided decoding setup only lived in the OpenAI handler.

GuidanceState still carries unused speculative/rollback/cache machinery that the current sampling path does not use. That is not standard guided decoding; it is leftover complexity.

I also refactored your Markdown file since it was too heavy (~2,500 lines), which isn’t suitable for documentation and tends to cause problems for agents. Large files like this can overwhelm agents’ memory with hundreds of thousands of tokens.

sempervictus · 2026-03-14T08:30:31Z

Done sir done! 😄

I'll get the tools piece into another PR in the morning after i run some coders through it to see if they can pass something that breaks the XML (JSON has been solid since day 1)

guoqingbao · 2026-03-14T08:32:33Z

Got an issue in the last audit:

Opencode with Qwen3.5-27B-FP8 reports this (prompts: /init and then "find bugs in the guided decoding"):

Unprocessable Entity: messages[77] role=tool requires non-empty content

The server log:

2026-03-14T08:21:28.664877Z  INFO vllm_rs::core::scheduler: GPU Kvcache: 3778 blocks (241792 tokens) free, used 32.6% (7.13GB/21.89GB); CPU swap used 0.0% (0.00GB/4.38GB)
2026-03-14T08:21:28.664897Z  INFO vllm_rs::core::scheduler: GPU MambaState: 1 / 16 slots used (6.2%), approx 0.14GB/2.43GB (slot 146.81MB)
2026-03-14T08:21:28.781768Z  WARN vllm_rs::server::server: Tools enabled for request

The current main, however, works well on it.

sempervictus · 2026-03-14T08:39:53Z

Right now we are at #263 level where I thought the tools path is inert. I'm afk for the evening but in the tools branch I do not see those with the flag enabled or disabled. Will push that commit in a few hours once at keyboard.

guoqingbao · 2026-03-14T08:45:58Z

> Right now we are at #263 level where I thought the tools path is inert. I'm afk for the evening but in the tools branch I do not see those with the flag enabled or disabled. Will push that commit in a few hours once at keyboard.

Another bug here:

let (guided_logits, guided_seq_ids) = if let Some(factory) = &self.llg_factory {
...
}

We should not apply that if no constraint grammar is provided.

guoqingbao · 2026-03-14T08:58:59Z

> Right now we are at #263 level where I thought the tools path is inert. I'm afk for the evening but in the tools branch I do not see those with the flag enabled or disabled. Will push that commit in a few hours once at keyboard.
>
> Another bug here:
> let (guided_logits, guided_seq_ids) = if let Some(factory) = &self.llg_factory {
> ...
> }
> We should not apply that if no constraint grammar is provided.

The revision of scheduler.rs for tool calling was also buggy; I reverted it together with the fix for the above. It works now. You need to force-push #263 here again.

guoqingbao · 2026-03-14T09:12:17Z

Another bug: once a guided decoding request has been processed, it can affect normal requests, leaving them unable to handle tool calls properly. The state management for guided decoding needs a double check.

guoqingbao · 2026-03-14T09:13:28Z

> Another bug: once a guided decoding request has been processed, it can affect normal requests, leaving them unable to handle tool calls properly. The state management for guided decoding needs a double check.

Root cause:

The old gate was effectively “if the runner has an llguidance factory, enter the guided path”, which is a batch-level condition, not a per-request one. So when a constrained request and a normal tool-call request were decoded in the same batch, the whole batch went through the llguidance masking/commit flow.

guoqingbao · 2026-03-14T09:28:03Z

> Another bug: once a guided decoding request has been processed, it can affect normal requests, leaving them unable to handle tool calls properly. The state management for guided decoding needs a double check.
>
> Root cause:
>
> The old gate was effectively “if the runner has an llguidance factory, enter the guided path”, which is a batch-level condition, not a per-request one. So when a constrained request and a normal tool-call request were decoded in the same batch, the whole batch went through the llguidance masking/commit flow.

Fixed that in #263 by adding per-sequence guided decoding.
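A per-sequence gate of the kind described here might look like the sketch below: each row of the batch is masked only if that sequence has its own guidance entry, and unconstrained rows pass through untouched. `mask_guided_rows` and the `HashMap` of precomputed masks are assumptions made for this example, not the code in #263.

```rust
use std::collections::HashMap;

/// Applies each sequence's mask to its own logits row only.
/// Rows whose `seq_id` has no entry in `masks` are left unchanged.
/// Returns the IDs of the sequences that were actually guided.
pub fn mask_guided_rows(
    seq_ids: &[u64],
    logits: &mut [Vec<f32>],
    masks: &HashMap<u64, Vec<bool>>,
) -> Vec<u64> {
    let mut guided = Vec::new();
    for (seq_id, row) in seq_ids.iter().zip(logits.iter_mut()) {
        // Per-request gate: no guidance state means no masking for this row.
        let Some(mask) = masks.get(seq_id) else { continue };
        for (logit, allowed) in row.iter_mut().zip(mask) {
            if !*allowed {
                *logit = f32::NEG_INFINITY;
            }
        }
        guided.push(*seq_id);
    }
    guided
}
```

The returned ID list can then drive the commit step, so only sequences that were masked have tokens fed back into a matcher.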

@sempervictus, after the final alignment with #263 you'd better run a last test, especially for tool calling, guided decoding, and mixed normal and guided decoding. If they all pass, we can merge it. I think I have fixed almost all the vulnerable points.

sempervictus · 2026-03-14T15:11:49Z

If you're still awake, i actually had a question about this piece (removed in your branch) - if we don't have full seq rollback but the constraint+SamplingParams mask end up producing something not viable for the constraint itself, this chunk was supposed to "fix" it token by token. I have a faster version of this which has early abort on the check but i left this in specifically so we could verify if a correction needed to happen so we'd know that something bears looking into. Is the logic wrong or did you cull this because it's a performance hog?

-        self.commit_guided_tokens(&seq_ids, &tokens, guided_seq_ids);
+        // Clone guided_seq_ids for commit_guided_tokens and the following block
+        let guided_seq_ids_for_commit = guided_seq_ids.clone();
+        self.commit_guided_tokens(&seq_ids, &tokens, guided_seq_ids_for_commit);
+
+        if let Some(ref guided_seq_ids) = guided_seq_ids {
+            let mut guidance_states = self.guidance_states.write();
+            let mut guidance_failed = self.guidance_failed.write();
+            for (seq_idx, seq_id) in seq_ids.iter().enumerate() {
+                if !guided_seq_ids.contains(seq_id) || guidance_failed.contains(seq_id) {
+                    continue;
+                }
+
+                if let Some(state) = guidance_states.get_mut(seq_id) {
+                    if state.is_finished() {
+                        continue;
+                    }
+
+                    let token = tokens[seq_idx];
+                    if let Err(err) = state.commit_token(token) {
+                        if guidance_failed.insert(*seq_id) {
+                            crate::log_warn!(
+                                "[Seq {}] Failed to commit guided token {}: {}. Disabling constraints for this sequence.",
+                                seq_id,
+                                token,
+                                err
+                            );
+                        }
+                        let _ = guidance_states.remove(seq_id);
+                    }
+                }
+            }
+        }

pulling it out of the commit tree but figure since i have it in front of me i would ask 😄

guoqingbao · 2026-03-14T15:27:25Z

> If you're still awake, i actually had a question about this piece (removed in your branch) - if we don't have full seq rollback but the constraint+SamplingParams mask end up producing something not viable for the constraint itself, this chunk was supposed to "fix" it token by token. I have a faster version of this which has early abort on the check but i left this in specifically so we could verify if a correction needed to happen so we'd know that something bears looking into. Is the logic wrong or did you cull this because it's a performance hog?

The post-sample validation was intentional, but I didn’t remove that behavior. I moved it into commit_guided_tokens(...), which is still called immediately after sampling.

sempervictus · 2026-03-14T17:11:02Z

ah! thanks

348 test cases - only the user-supplied regex failing. having that handled, out for a few hours.

guoqingbao · 2026-03-15T01:13:20Z

> ah! thanks
>
> 348 test cases - only the user-supplied regex failing. having that handled, out for a few hours.

Have you pushed the latest changes from #263 here? There are more bug fixes.

sempervictus · 2026-03-15T13:42:24Z

Pulled them in yesterday and will rebase the branch again now - am getting all the grammar tests to pass, as you're way better at the engine stuff, so just using 263 state as a base and collapsing as much as I can into guidance.rs and schema to avoid pollution of server and such. Sorry, it's a bit hectic this week w/ GTC - live in a rather remote area, it's a whole thing to get to civilization and conferences :).

The good news is all the lark logic works after moving the typed ones back to their types from llg, so we aren't having to process those inline, which simplified the string composition logic. The @ refs I haven't been able to get working, but all 11 permutation inputs and their outputs are being built and compiling to Matcher FSMs in the tests (takes a while; there are 348 variants in the current set, but it can be thousands when you feed the tester lots of tool options and grammar).

guoqingbao · 2026-03-15T14:20:28Z

> Pulled them in yesterday and will rebase the branch again now. I'm getting all the grammar tests to pass; you're way better at the engine stuff, so I'm just using the #263 state as a base and collapsing as much as I can into guidance.rs and schema to avoid polluting the server and such. Sorry, it's a bit hectic this week with GTC: I live in a rather remote area, and it's a whole thing to get to civilization and conferences :).

I've done the last audit in #263; simply force-push #263 here and I will merge this PR.

sempervictus · 2026-03-15T14:26:05Z

ah! sorry! pushed before i switched to this tab. 1s, will rebase on 263 only and separate this to another PR

sempervictus · 2026-03-15T14:29:22Z

All set sir, we're aligned. Will open another one for review with the various grammar options and i'll include the example in that branch while its in draft so we (and rather importantly the LLMs) can see all of the resulting grammars and compile them.

guoqingbao · 2026-03-16T00:31:58Z

Thanks!

* Implement Constrained Generation via LLGuidance

  This implements the full llguidance integration enabling grammar-constrained inference for structured outputs, tool calling, and custom constraints.

  Architecture:
  - TopLevelGrammar serialized via rmp_serde across RPC boundaries
  - Grammar flows: Server → params.grammar → Runner → GuidanceState → Matcher
  - Inline correction via logits masking during sampling
  - Post-process correction via rollback on validation failure

  Key components:
  - params.grammar field in SamplingParams for RPC serialization GuidanceState
  - GuidanceState::new() with Matcher state management
  - GuidanceState::reset() for proper state cleanup
  - Rollback counter (MAX_ROLLBACK_ATTEMPTS=3) preventing infinite loops
  - guidance_failed/guidance_mismatch sets cleared on rollback
  - Vocab size validation in build_llg_factory()
  - Lark grammar generation from tools via build_tool_call_lark_grammar()

  CLI flags:
  - --enable-tool-grammar: Auto-build LLG grammar from MCP tools
  - --allow-constraint-api: Accept client-provided structured_outputs/response_format

* Support Qwen3.5 Dense models on Metal (#258)

* Utilize SpecialTokens Idiomatic Accessor for EOS

  Expand SpecialTokens usage to cover EOS uses across the codebase, including the chat template. This gates access to the EOS tokens through a single common API, providing an interdiction point to add or remove them as needed per model or family.

* Idiomatic SpecialTokens Access Pattern
  - Replace manual EOS token extraction logic with centralized SpecialTokens::new() and idiomatic accessors
  - Eliminate EosTokenId enum and related complex serialization logic in favor of direct Vec<u32>
  - Update all callers to use SpecialTokens for tool start/end token IDs
  - Remove stop_token_ids from SamplingParams and related logic (now handled via SpecialTokens)
  - Simplify tokenizer config by replacing EosTokenEntry with Option<String>
  - Add comprehensive SpecialTokens API with category-based accessors, ID/string sets, and search methods

* More SpecialTokens, Improve Example/Binary

  Improve the binary example to be a handy extractor for models which developers can use to update special_tokens.rs quickly. Add tags extracted from Qwen3.5 0.8B.

* SpecialTokens Strings for Llama4 and Qwen3.5 MoE

  Narrow the Common category search specifically to find string dups of actually special tokens (handle "aftermarket" models/merges). Add and test Llama4 and Qwen3.5 MoE.

* ToolConfig Population w/ SpecialTokens

* Drop ToolFormat

* Lead The Horse to Water, Make Him <Think>

  This PR introduces a new `reasoning_effort` parameter to control reasoning block generation in the chat completion API, matching OpenAI's reasoning API behavior.
  - **API Extension**: Added `reasoning_effort` field to `ChatCompletionRequest` accepting "none", "low", "medium", or "high" values (case-insensitive)
  - **New Module**: Created `src/utils/reasoning.rs` with:
    - `ReasoningEffort` enum with `from_str` deserialization
    - `ThinkingGrammarBuilder` for reasoning block grammar construction
    - `thinking_grammar_with_reasoning_block()` generating Lark grammar patterns
    - `build_reasoning_grammar()` for composing reasoning blocks with base grammars
  - **Integration**: Updated `compose_grammars()` in `src/utils/guidance.rs` to accept and apply reasoning effort levels
  - **Schema Sanitization**: Enhanced `sanitize_schema_for_llguidance()` in `src/tools/schema.rs` to strip null from required field types, ensuring grammars enforce field presence for tool parameters
  - **Special Token Helpers**: Added `reasoning_start_ids()`, `reasoning_end_ids()`, and `reasoning_tokens()` methods to `SpecialTokens` for robust token detection
  - **Comprehensive Tests**: Added 11 new tests covering:
    - Reasoning effort parsing and validation
    - Thinking grammar builder functionality
    - Schema null-stripping for required/optional fields
    - Grammar composition permutations with reasoning

* Tier Reasoning Effort

* Anchor XML Tool-Grammar With SpecialTokens Pads

  This change updates the ToolGrammarBuilder to correctly use pad token IDs for XML tool call termination when building Lark grammars for models that use XML-style tool calling (e.g., Qwen3-Coder/3.5).

  The XML format requires closing markers for </function> and </parameter> tags. When the tokenizer lacks special closing tags, the model can run on generating forever, as XML is not a finite stateless grammar; see guidance-ai/llguidance/issues/306. Use pad tokens as "magic" terminating markers embedded into the grammar and recognizable by the tokenizer/llg mask as output a model cannot normally emit in its text (normally masked to 0.0 logprob).

  Anchor the XML function/parameter generation like we bound tool-call and text. This is easier on the model than forcing JSON parsing (qwen3), especially in conjunction with forcing it to `<think>` if it's not trained to do so. Mechanically, we modify the chat template to place special pad tags after the closing tags for function and param, and inject those into the grammar template submitted to the model as tool-choice.

  Call path:

    src/tools/schema.rs:361 build_xml_with_anchors(pad_ids)
    ├─ Uses pad_ids[0] as </function> anchor
    └─ Uses pad_ids[1] as </parameter> anchor

  Grammar structure:
  - start: ( text | tool_call )+ eos?
  - tool_call: <[tool_start_id]> tool_content <[tool_end_id]>
  - tool_0: "<function=fetch_url_via_curl>" param_0_0 ... "</function>" <[pad_id_0]>
  - param_0_0: "<parameter=url>" value_0_0 ... "</parameter>" <[pad_id_1]>
  - ...

  The pad tokens serve as finite termination points for the XML parser, allowing the Lark grammar to generate valid, parseable tool calls without requiring explicit special closing tags in the tokenizer vocabulary.

* Cargo fmt
* Refactor guided decoding
* Update docs
* Typo fix
* Remove tool grammar & fix slow first token response for sync request
* Strip guided-decoding’s leftover tool grammar surface
* Fix incorrect guidance application
* Revert changes for scheduler.rs (tool call related)
* Apply per-sequence guided decoding
* Remove redundancy
* Fix corner case
* Permit empty tool call result

---------

Co-authored-by: RageLtMan <rageltman [at] sempervictus>
Co-authored-by: Guoqing Bao <topon@outlook.com>
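The pad-token anchoring described in the commit message above can be illustrated with a small string builder. This is a sketch under assumptions, not the PR's `build_xml_with_anchors`: `ToolSpec` and `xml_tool_rules` are hypothetical names, the `value_{t}_{p}` rules are referenced but left undefined, and the parts the commit message elides with `...` are simply omitted. It only shows how each closing tag gets followed by a `<[pad_id]>` token reference so the rule has a definite end.

```rust
/// Hypothetical description of one tool; not a type from the PR.
struct ToolSpec<'a> {
    name: &'a str,
    params: &'a [&'a str],
}

/// Emit Lark rules for one XML-style tool call. Each closing tag is followed
/// by a pad-token reference, giving the grammar a finite termination point.
/// `pad_ids[0]` anchors </function>, `pad_ids[1]` anchors </parameter>.
/// The `value_{index}_{p}` rules are assumed to be defined by the caller.
fn xml_tool_rules(index: usize, tool: &ToolSpec, pad_ids: [u32; 2]) -> String {
    let [function_pad, parameter_pad] = pad_ids;
    let mut param_refs = Vec::new();
    let mut rules = String::new();
    for (p, param) in tool.params.iter().enumerate() {
        param_refs.push(format!("param_{index}_{p}"));
        rules.push_str(&format!(
            "param_{index}_{p}: \"<parameter={param}>\" value_{index}_{p} \"</parameter>\" <[{parameter_pad}]>\n"
        ));
    }
    // The tool rule comes first, followed by one rule per parameter.
    let head = format!(
        "tool_{index}: \"<function={name}>\" {refs} \"</function>\" <[{function_pad}]>\n",
        name = tool.name,
        refs = param_refs.join(" "),
    );
    head + &rules
}
```

For example, calling `xml_tool_rules(0, &ToolSpec { name: "fetch_url_via_curl", params: &["url"] }, [7, 8])` yields a `tool_0` rule ending in `<[7]>` and a `param_0_0` rule ending in `<[8]>`; the IDs 7 and 8 are placeholders, since real pad token IDs depend on the tokenizer.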

RageLtMan and others added 11 commits March 9, 2026 10:12

Support Qwen3.5 Dense models on Metal (guoqingbao#258)

9a1ab79

More SpecialTokens, Improve Example/Binary

2956d5c

Improve the binary example to be a handy extractor for models which developers can use to update special_tokens.rs quickly. Add tags extracted from Qwen3.5 0.8B

SpecialTokens Strings for Llama4 and Qwen3.5 MoE

0eda75b

Narrow the Common category search specifically to find string dups of actually special tokens (handle "aftermarket" models/merges). Add and test Llama4 and Qwen3.5 MoE

ToolConfig Population w/ SpecialTokens

dbb961d

Drop ToolFormat

621b3a6

Tier Reasoning Effort

f71d7fe

This was referenced Mar 11, 2026

Implement LLGuidance #232

Closed

Idiomatic Structural Handling for Special Tokens via Macro-Driven Dispatch in Token Library Struct #260

Closed

Idiomatic Special Tokenization #259

Open

guoqingbao added 2 commits March 12, 2026 03:51

Merge remote-tracking branch 'origin/main' into reasoning/pr

b3f72f2

Cargo fmt

fd53bc6

guoqingbao added 3 commits March 12, 2026 09:33

Refactor guided decoding

786fd24

Update docs

0612ddf

Typo fix

6c80f26

Remove tool grammar & fix slow first token response for sync request

2b02393

sempervictus force-pushed the reasoning/pr branch from 2b02393 to e95183b Compare March 12, 2026 21:13

guoqingbao mentioned this pull request Mar 13, 2026

Support guided decoding #263

Closed

sempervictus force-pushed the reasoning/pr branch from 22a2664 to 7f70675 Compare March 14, 2026 08:29

guoqingbao added 2 commits March 14, 2026 08:46

Fix incorrect guidance application

0157c41

Revert changes for scheduler.rs (tool call related)

89341b3

Apply per-sequence guided decoding

ba46d78

Remove redundancy

4997457

Fix corner case

60e88e0

Permit empty tool call result

6d8daae

sempervictus force-pushed the reasoning/pr branch from 6f95821 to 6d8daae Compare March 15, 2026 14:28

guoqingbao merged commit 1661da8 into guoqingbao:main Mar 16, 2026
1 check passed

This was referenced Mar 16, 2026

Enforce guided model output for tool calling #208

Closed

LLG: Comprehensive Guided Decoding Infrastructure #265

Open

Fix gguf loading issue caused by llguidance #266

Merged


Includes #232 and #260
