LLG: Comprehensive Guided Decoding Infrastructure by sempervictus · Pull Request #265 · guoqingbao/xinfer

sempervictus · 2026-03-15T16:02:50Z

Comprehensive Guided Decoding Infrastructure

1. Structured Output Constraints (Opt-In Security Model)

Client applications can now request structured outputs via the OpenAI-compatible API using the structured_outputs field. The system supports:

Choice constraints: Force responses to be selected from a predefined list of strings
Regex constraints: Enforce text patterns via Lark grammar definitions
JSON Schema constraints: Validate responses against JSON Schema definitions
Custom Lark grammars: Full grammar definition control for advanced use cases
Structural tag constraints: Enforce tool call formatting with custom start/end markers
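As a rough illustration of the simplest case above, a choice constraint can be lowered to a one-rule Lark grammar. This is only a sketch: `lark_literal` and `choice_grammar` are hypothetical helper names, not functions from this PR, and the escaping covers only quotes, backslashes and newlines.

```rust
/// Escape a string so it can appear as a double-quoted Lark literal.
/// (Hypothetical helper; only handles quotes, backslashes and newlines.)
fn lark_literal(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            _ => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Build a one-rule grammar that accepts exactly one of `choices`.
/// Returns None for an empty list, since nothing could ever match.
fn choice_grammar(choices: &[&str]) -> Option<String> {
    if choices.is_empty() {
        return None;
    }
    let alts: Vec<String> = choices.iter().map(|c| lark_literal(c)).collect();
    Some(format!("start: {}", alts.join(" | ")))
}
```

Calling `choice_grammar(&["yes", "no"])` yields the text `start: "yes" | "no"`, which would then be handed to the grammar compiler.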

Security Model: All client-provided constraints are BLOCKED by default. Operators must explicitly enable via --allow-constraint-api CLI flag. This prevents malicious grammar injection attacks that could:

Bypass role boundaries in chat templates
Inject system prompts or tool calls
Exploit ReDoS vulnerabilities through regex patterns
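The opt-in policy can be pictured as a small gate in front of grammar construction. The sketch below is an assumption about the shape of that check, not the PR's code: `ConstraintDecision` and `gate_constraint` are made-up names, and only the warning text is taken from the sequence diagram further down.

```rust
/// Outcome of applying the opt-in policy to a client-supplied constraint.
#[derive(Debug, PartialEq)]
enum ConstraintDecision<'a> {
    /// The request carried no constraint at all.
    Unconstrained,
    /// A constraint was sent, but the operator has not opted in.
    Ignored,
    /// The constraint may be compiled into a grammar.
    Apply(&'a str),
}

/// Blocked by default: a constraint is only honoured when the operator
/// started the server with the opt-in flag.
fn gate_constraint(requested: Option<&str>, allow_constraint_api: bool) -> ConstraintDecision<'_> {
    match requested {
        None => ConstraintDecision::Unconstrained,
        Some(_) if !allow_constraint_api => {
            // Degrade gracefully: warn and fall back to unconstrained output.
            eprintln!("constraint ignored: allow_constraint_api=false");
            ConstraintDecision::Ignored
        }
        Some(c) => ConstraintDecision::Apply(c),
    }
}
```

Keeping `Ignored` separate from `Unconstrained` lets the caller log or report that a constraint was dropped rather than silently treating the request as unconstrained.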

2. Automatic Tool Grammar Generation

When tools are defined in a request, the system can automatically generate grammars that constrain tool calls to valid JSON schemas:

XML-style tool calls: Full tool grammar with <tool_call> markers and schema validation
Fallback text-based tool envelope: Text-tagged tool calls when token-based markers are unavailable
Parser-specific tool formatting: Automatic grammar adaptation based on configured parser (json, regex, etc.)

Security Model: Tool grammar generation is BLOCKED by default. Enable via --enable-tool-grammar CLI flag. When enabled, tools are validated against the defined schema before grammar generation.

3. Reasoning Effort Control

The system supports configurable reasoning effort levels that generate model-specific thinking grammar:

None: No reasoning block, direct output
Low: Fast thinking with single paragraph constraint (~150 characters)
Medium: Multi-step reasoning with sentence boundary enforcement
High: Adversarial critique pattern (Cheng & Su 2025)
Chain of Thought: Multi-phase verification with draft, verify, critique, and structure phases

Reasoning effort integrates with constraint grammars to produce composed output that includes thinking blocks before final response.
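To make the effort levels a little more concrete, here is a minimal sketch of how an effort level might be turned into a reasoning rule. The enum variants mirror the list above; everything else (the function name, the `{1,150}` cap for Low, and using the same printable-ASCII body for all other levels) is an assumption for illustration only. The real grammars for Medium, High and Chain of Thought are more involved.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
    ChainOfThought,
}

/// Emit `reasoning_block` and `think_text` rules, or None when reasoning is
/// disabled or the model has no reasoning start/end tokens.
fn reasoning_rules(
    effort: ReasoningEffort,
    start_id: Option<u32>,
    end_id: Option<u32>,
) -> Option<String> {
    if effort == ReasoningEffort::None {
        return None;
    }
    // Guard: without both marker tokens the block cannot be bounded.
    let (start, end) = (start_id?, end_id?);
    let body = match effort {
        // Assumed cap for the "~150 characters" single-paragraph level.
        ReasoningEffort::Low => "/[ -~]{1,150}/",
        // Placeholder body for the remaining levels in this sketch.
        _ => "/[ -~]+/",
    };
    Some(format!(
        "reasoning_block: <[{start}]> \"\\n\" think_text \"\\n\" <[{end}]> \"\\n\"\nthink_text: {body}"
    ))
}
```

Returning `None` when either marker token is missing means an unbounded thinking block is never emitted.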

4. Grammar-Only Completion Endpoint

New /v1/grammar POST endpoint allows clients to submit Lark/Regex/JSON Schema/Choice grammars directly:

Grammar type specification: Explicit type declaration (lark, regex, json_schema, choice)
Content validation: Grammar content is parsed and validated before use
Model-specific token handling: Automatic adaptation to model's special token vocabulary
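A minimal sketch of the type and content checks such an endpoint needs before any grammar is compiled. The four type strings come from the list above; the function names, error messages and the trimming behaviour are assumptions, and the actual GrammarRequest fields may differ.

```rust
#[derive(Debug, PartialEq)]
enum GrammarType {
    Lark,
    Regex,
    JsonSchema,
    Choice,
}

/// Map the declared type string onto a known grammar type.
fn parse_grammar_type(s: &str) -> Result<GrammarType, String> {
    match s {
        "lark" => Ok(GrammarType::Lark),
        "regex" => Ok(GrammarType::Regex),
        "json_schema" => Ok(GrammarType::JsonSchema),
        "choice" => Ok(GrammarType::Choice),
        other => Err(format!("unsupported grammar type: {other}")),
    }
}

/// Reject unknown types and empty content with a descriptive error
/// before the content reaches the grammar compiler.
fn validate_grammar_request(kind: &str, content: &str) -> Result<(GrammarType, String), String> {
    let ty = parse_grammar_type(kind)?;
    let trimmed = content.trim();
    if trimmed.is_empty() {
        return Err("grammar content must not be empty".to_string());
    }
    Ok((ty, trimmed.to_string()))
}
```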

System-Level Changes

1. Enhanced Token Infrastructure

BOS token support: Added bos_token_ids to GuidanceTokens for proper beginning-of-sequence handling
Tool call token pairs: Recognition of multiple tool call marker formats across different model families
Reasoning token pairs: Multi-token support for reasoning markers (for example <think> and </think>)
BOS token collection: Automatic detection of BOS tokens from tokenizer vocabulary

2. Grammar Composition Engine

Multi-grammar composition: compose_grammars now handles constraint + tool + reasoning combinations
Explicit constraint-tool alternation: Grammar rules like start: (text | tool_call)+ for flexible output
Reasoning block wrapping: Proper sequencing of reasoning followed by constraint or tool output
EOS termination handling: Automatic EOS token addition to all generated grammars
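The composition cases can be sketched as a pure function over rule names. This is a simplification under stated assumptions: the real compose_grammars takes more parameters (tool choice, forced tool name, max tokens, guidance tokens), and `compose` here only assembles the `start` rule, assuming the named rules and an `eos` rule are defined elsewhere.

```rust
/// Build the `start` rule from optional constraint, tool and reasoning rule
/// names. Returns None when nothing constrains the output.
fn compose(constraint: Option<&str>, tool: Option<&str>, reasoning: Option<&str>) -> Option<String> {
    // Constraint and tool alternate freely when both are present.
    let inner = match (constraint, tool) {
        (None, None) => None,
        (Some(c), None) => Some(c.to_string()),
        (None, Some(t)) => Some(t.to_string()),
        (Some(c), Some(t)) => Some(format!("( {c} | {t} )+")),
    };
    // A reasoning block, if any, always comes before the inner grammar.
    let body = match (reasoning, inner) {
        (None, None) => return None,
        (None, Some(i)) => i,
        (Some(r), None) => r.to_string(),
        (Some(r), Some(i)) => format!("{r} {i}"),
    };
    // Every composed grammar is terminated by the EOS rule.
    Some(format!("start: {body} eos"))
}
```

For example, `compose(Some("text"), Some("tool_call"), Some("reasoning_block"))` produces `start: reasoning_block ( text | tool_call )+ eos`.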

3. Security Validation Layer

Per-request constraint validation: Each request validates allow_constraint_api flag before processing
Graceful degradation: When constraints are disabled, system logs warning and returns unconstrained output
ASCII sanitization: Non-ASCII characters filtered from tool call markers to prevent injection
Grammar content validation: Empty or invalid grammar content rejected with descriptive error messages
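The ASCII sanitization step could look roughly like the following. `sanitize_marker` is a hypothetical name, and treating an all-filtered marker as an error (None) is an assumption rather than documented behaviour.

```rust
/// Drop every non-ASCII character from a tool-call marker so that lookalike
/// or invisible Unicode characters cannot be smuggled into the grammar.
/// Returns None if nothing usable remains.
fn sanitize_marker(raw: &str) -> Option<String> {
    let cleaned: String = raw.chars().filter(|c| c.is_ascii()).collect();
    if cleaned.is_empty() {
        None
    } else {
        Some(cleaned)
    }
}
```

With this sketch, a marker such as `<tool_call>` followed by a zero-width space comes back as plain `<tool_call>`.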

4. Fallback Mechanisms

Token-based to text-based tool markers: When special token IDs are unavailable, falls back to string literal tags
Parser name detection: Automatic parser selection based on model type (StreamToolParser)
JSON schema sanitization: Schema values sanitized for llguidance compatibility before grammar generation
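The token-to-text fallback for tool markers can be illustrated with a small helper that emits either a token-ID terminal or a quoted string literal. The function names are invented for this sketch, and the `<tool_call>` / `</tool_call>` fallback strings are assumed defaults, not values confirmed by the PR.

```rust
/// Emit a grammar terminal for a marker: a special-token reference when the
/// tokenizer exposes an ID, otherwise a quoted string literal.
fn marker_terminal(token_id: Option<u32>, fallback_text: &str) -> String {
    match token_id {
        Some(id) => format!("<[{id}]>"),
        None => format!(
            "\"{}\"",
            fallback_text.replace('\\', "\\\\").replace('"', "\\\"")
        ),
    }
}

/// Build the tool envelope rule around a free-form `text` body.
fn tool_call_rule(start_id: Option<u32>, end_id: Option<u32>) -> String {
    format!(
        "tool_call: {} text {}",
        marker_terminal(start_id, "<tool_call>"),
        marker_terminal(end_id, "</tool_call>")
    )
}
```

With both IDs present this gives `tool_call: <[248058]> text <[248059]>`; with neither it falls back to the two string literals.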

5. API Contract Extensions

ChatCompletionRequest enhancements: Added structured_outputs, constraint, constraint_type, reasoning_effort, tool_choice, extra_body fields
GrammarRequest struct: Dedicated request type for /v1/grammar endpoint with explicit grammar type/content separation
GrammarResponse struct: Response type for grammar completions with metadata about applied constraints

sequenceDiagram
    participant Client
    participant Server as src/server/mod.rs
    participant API as src/api.rs
    participant Guidance as src/utils/guidance.rs
    participant Engine as src/core/engine.rs
    participant Runner as src/core/runner.rs

    Client->>Server: POST /v1/chat/completions
    Client->>Server: POST /v1/grammar (new)
    
    Note over Server: Parse request & extract flags
    Server->>Server: Check enable_tool_grammar CLI flag
    Server->>Server: Check allow_constraint_api CLI flag
    
    alt Constraint API Disabled
        Server->>Server: Log warning: "constraint ignored: allow_constraint_api=false"
        Server->>API: Parse request without constraints
    else Constraint API Enabled
        Server->>API: extract_guidance_tokens(tokenizer, eos_ids, bos_ids)
        API->>Guidance: parse_grammar_from_chat_request(request, allow_constraint_api)
        
        Note over Guidance: Validate constraint fields
        Guidance->>Guidance: Check structured_outputs.choice/regex/json/grammar/structural_tag
        Guidance->>Guidance: Check response_format.json_schema/json_object
        Guidance->>Guidance: Check legacy constraint field
        
        alt Valid Constraint Found
            Guidance->>Guidance: Build TopLevelGrammar from constraint
            Guidance-->>API: Some(grammar)
        else No Constraint
            Guidance-->>API: None
        end
    end
    
    API->>Guidance: generate_grammar_from_request(request, guidance_tokens, enable_tool_grammar, allow_constraint_api, model_type)
    
    Note over Guidance: Dual grammar generation path
    alt enable_tool_grammar=true
        Guidance->>Guidance: Build XML tool grammar (build_xml_tool_grammar_for_parser)
        Guidance->>Guidance: Generate schema-based tool grammar with %json
    else allow_constraint_api=true
        Guidance->>Guidance: Build fallback tool envelope (build_fallback_tool_envelope_grammar)
        Guidance->>Guidance: Generate text-tagged tool grammar with start/end tags
    else Neither enabled
        Guidance->>Guidance: Return None for tool grammar
    end
    
    alt Reasoning Effort Specified
        Guidance->>Guidance: Check reasoning tokens available (reasoning_start_ids, reasoning_end_ids)
        Guidance->>Guidance: Generate reasoning grammar based on effort level
        Note over Guidance: None/Low/Medium/High/ChainOfThought
    end
    
    Guidance->>Guidance: compose_grammars(constraint_grammars, tool_grammar, tool_choice_required, forced_tool_name, max_tokens, guidance_tokens, reasoning_effort)
    
    Note over Guidance: Grammar composition logic
    Guidance->>Guidance: GrammarComposerBuilder.build(guidance_tokens)
    
    alt No constraints, no tools, no reasoning
        Guidance-->>Engine: None (unconstrained generation)
    else Constraint only
        Guidance->>Guidance: GrammarComposers::Constraint(constraint_gram)
        Guidance-->>Engine: constraint grammar
    else Tools only
        Guidance->>Guidance: GrammarComposers::Tool(tool_gram)
        Guidance-->>Engine: tool grammar
    else Constraint + Tools
        Guidance->>Guidance: GrammarComposers::ConstraintOrTool(constraint, tool)
        Guidance-->>Engine: (constraint | tool)+
    else Reasoning + Constraint
        Guidance->>Guidance: GrammarComposers::WithReasoning(reasoning, inner)
        Guidance-->>Engine: reasoning_block -> inner grammar
    end
    
    Engine->>Engine: create_engine(config, bos_token_id, guidance_tokens)
    Engine->>Engine: Extract BOS tokens from tokenizer if not provided
    Engine->>Runner: SamplingParams { grammar: final_grammar, reasoning_effort, ... }
    
    Runner->>Runner: sample_with_grammar(prompt_tokens, grammar, eos_token_ids)
    Note over Runner: llguidance parser factory creates matcher
    
    loop Generation
        Runner->>Runner: Token generation with grammar constraints
        Runner->>Runner: EOS token detection
        alt Reasoning grammar detected
            Runner->>Runner: Extract reasoning_block content
            Runner->>Runner: Validate reasoning tokens match start/end markers
        end
        alt Tool call detected
            Runner->>Runner: Parse tool_call content via StreamToolParser
            Runner->>Runner: Extract function name and arguments
        end
        Runner->>Server: Stream token response
        Server->>Client: SSE stream chunk
    end
    
    Runner->>Runner: Prefix cache lookup (if --prefix-cache)
    Runner->>Runner: KV cache management (if --pd-server)

sempervictus · 2026-03-15T16:05:50Z

@guoqingbao - there's a grammar playground in examples while this is draft and it spits GRAMMAR lines into the log when they are generated for validation. Even if the rest is ready we probably want to remove those before merge (if you decide to merge while i'm in transit for GTC).

The current state works correctly even with XML generation for the 0.8B Qwen3.5, which makes it a tool-call monster :-).

vllm-rs-svc2  | 2026-03-15T15:53:52.644891Z  INFO vllm_rs::utils::guidance: GRAMMAR:
vllm-rs-svc2  | start: ( text | tool_call )+ eos
vllm-rs-svc2  | text: /(?s:.*)/
vllm-rs-svc2  | tool_call: <[151657]> "\n" tool_content <[151658]>
vllm-rs-svc2  | value_0_0: /[ -~]*?/
vllm-rs-svc2  | param_0_0: "<parameter=url>" "\n" value_0_0 "\n" "</parameter>" "\n" 
vllm-rs-svc2  | value_0_1: /[ -~]*?/
vllm-rs-svc2  | param_0_1: "<parameter=proxy>" "\n" value_0_1 "\n" "</parameter>" "\n" 
vllm-rs-svc2  | tool_0: "<function=fetch_url_via_curl>" "\n" param_0_0 (param_0_1)? "</function>" "\n" 
vllm-rs-svc2  | value_1_0: /[ -~]*?/
vllm-rs-svc2  | param_1_0: "<parameter=path>" "\n" value_1_0 "\n" "</parameter>" "\n" 
vllm-rs-svc2  | tool_1: "<function=fs_cat>" "\n" param_1_0 "</function>" "\n" 
vllm-rs-svc2  | value_2_0: /[ -~]*?/
vllm-rs-svc2  | param_2_0: "<parameter=path>" "\n" value_2_0 "\n" "</parameter>" "\n" 
vllm-rs-svc2  | tool_2: "<function=fs_ls>" "\n" param_2_0 "</function>" "\n" 
vllm-rs-svc2  | tool_3: "<function=get_current_time>" "\n"  "</function>" "\n" 
vllm-rs-svc2  | value_4_0: /[ -~]*?/
vllm-rs-svc2  | param_4_0: "<parameter=query>" "\n" value_4_0 "\n" "</parameter>" "\n" 
vllm-rs-svc2  | value_4_1: /[ -~]*?/
vllm-rs-svc2  | param_4_1: "<parameter=searxng>" "\n" value_4_1 "\n" "</parameter>" "\n" 
vllm-rs-svc2  | tool_4: "<function=web_search_searxng>" "\n" param_4_0 (param_4_1)? "</function>" "\n" 
vllm-rs-svc2  | tool_content: tool_0 | tool_1 | tool_2 | tool_3 | tool_4
vllm-rs-svc2  | eos: <[248046]>
vllm-rs-svc2  |

It turns out the EOS issue was killing us with multimodal models. Second, for XML we have to place the pedantic "\n" literals in the same places as the chat template has them. Ideally we would actually EXTRACT the chat template into the grammar builder so this never needs any touch-up in the future, but the minijinja AST is not available and nom parsing is a bit... crude. All ears for ideas on that one.
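For readers following along, the XML rules in the log above can be reproduced from a tool list with a generator along these lines. This is a reconstruction from the log output, not the PR's build_xml_tool_grammar_for_parser; the `Param` and `Tool` types are hypothetical, and names are assumed to be plain ASCII identifiers that need no escaping.

```rust
struct Param {
    name: &'static str,
    required: bool,
}

struct Tool {
    name: &'static str,
    params: Vec<Param>,
}

/// Emit value/param/tool rules per tool, then a `tool_content` alternation.
/// Every "\n" literal mirrors a newline in the chat template.
fn xml_tool_rules(tools: &[Tool]) -> Vec<String> {
    let mut rules = Vec::new();
    for (i, tool) in tools.iter().enumerate() {
        let mut body = format!("\"<function={}>\" \"\\n\"", tool.name);
        for (j, p) in tool.params.iter().enumerate() {
            // Printable-ASCII value, matched lazily up to the closing tag.
            rules.push(format!("value_{i}_{j}: /[ -~]*?/"));
            rules.push(format!(
                "param_{i}_{j}: \"<parameter={}>\" \"\\n\" value_{i}_{j} \"\\n\" \"</parameter>\" \"\\n\"",
                p.name
            ));
            if p.required {
                body.push_str(&format!(" param_{i}_{j}"));
            } else {
                // Optional parameters may be omitted entirely.
                body.push_str(&format!(" (param_{i}_{j})?"));
            }
        }
        rules.push(format!("tool_{i}: {body} \"</function>\" \"\\n\""));
    }
    let alts: Vec<String> = (0..tools.len()).map(|i| format!("tool_{i}")).collect();
    rules.push(format!("tool_content: {}", alts.join(" | ")));
    rules
}
```

Because each newline is a separate literal in the generated rule, aligning with a different chat template means touching only the format strings.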

Re: EOS - I would like to convert Config to have the multi-EOS variant for multimodals instead of overwriting the one from tokenizer_config with the string one entirely, but I don't want to break any fragile code. Any thoughts on this approach instead of what I'm doing here to populate GuidanceTokens with the correct EOS? (That ID is <|im_end|>, without which the model will generate forever as the chat template becomes unbounded.)

sempervictus · 2026-03-15T16:26:54Z

The last two commits are in testing now. Apologies for the branch depth; I will compress it all to one commit before merge. I'm on the road for a bit, so I'm working remotely and using GH as a pivot.

guoqingbao · 2026-03-16T02:29:16Z

Are you able to rebase on main? PR #262 has been merged.

sempervictus · 2026-03-16T10:28:16Z

Done sir. Will see how much I can get done on plane.
Any thoughts on the idea of adding the EOS for mm models to the EosTokenId type, changing it from singular to multiple, while allowing both to be used in grammar bounding and filtering both from emission back to the user? Technically I think the original EOS token being pulled from tokenizer_config is actually an end-of-message token (<|im_end|>), but since LLMs only produce text it seems a bit misnamed in the industry. The one you're getting out of the mm config is much more aptly an EOS, but unfortunately it doesn't cap grammar bounds because the model needs to emit the EOM whether it elects to emit a true EOS or not.

sempervictus · 2026-03-17T16:17:28Z

@guoqingbao if you have a couple of minutes, mind looking at the last commit? Trying to be graceful with this and not break any of your logic which might be opaque to me.

Review Summary

This change refactors EOS token ID handling in multimodal configurations. The modifications:

Simplify extract_guidance_tokens() call in LLMEngine::new() by removing redundant intermediate variables

Add logic in init_config_tokenizer() to merge tokenizer's EOS token with config's EOS token IDs using the existing EosTokenId::merge_dedup() method

Overall, this improves code consistency and ensures both tokenizer and config EOS tokens are properly handled for multimodal models.
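The merge described above, and its effect on the generated `eos` rule, can be sketched as follows. `merge_dedup_ids` is a standalone stand-in written for illustration; it is not the EosTokenId::merge_dedup() method itself, whose signature is not shown in this thread.

```rust
/// Merge the config's EOS IDs with the tokenizer's EOS ID, keeping the first
/// occurrence of each ID and preserving order.
fn merge_dedup_ids(config_ids: &[u32], tokenizer_id: Option<u32>) -> Vec<u32> {
    let mut out: Vec<u32> = Vec::new();
    for id in config_ids.iter().copied().chain(tokenizer_id) {
        if !out.contains(&id) {
            out.push(id);
        }
    }
    out
}

/// Render the `eos` rule: a single token, or an alternation when the model
/// has several stop tokens (for example both EOS and <|im_end|>).
fn eos_rule(ids: &[u32]) -> Option<String> {
    match ids {
        [] => None,
        [one] => Some(format!("eos: <[{one}]>")),
        many => {
            let alts: Vec<String> = many.iter().map(|id| format!("<[{id}]>")).collect();
            Some(format!("eos: ( {} )", alts.join(" | ")))
        }
    }
}
```

Joining the IDs with `|` makes either stop token sufficient to end generation.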

Issues Found

Severity | File:Line | Issue
SUGGESTION | vllm.rs/src/utils/mod.rs:683-704 | The comment "For multimodal models, merge tokenizer's eos_token string to token IDs" is misleading since this code runs for all model types when config_tokenizer.eos_token is present

Detailed Findings

File: vllm.rs/src/core/engine.rs:130-137

Confidence: 95%

Problem: The original code was more complex with unnecessary intermediate variables (guidance_eos_ids, guidance_tokens) that have been properly simplified

Suggestion: No changes needed - the refactoring improves readability

File: vllm.rs/src/utils/mod.rs:683-704

Confidence: 80%

Problem: The comment states "For multimodal models" but the code executes whenever config_tokenizer.eos_token is present, not just for multimodal models. This could confuse readers about when this logic runs

Suggestion: Update comment to accurately reflect the scope: "Merge tokenizer's eos_token string to token IDs to ensure EOSTOKENIDS includes tokens from both tokenizer and config"

Recommendation

APPROVE WITH SUGGESTIONS - The core logic changes are sound and improve code maintainability. Only the misleading comment needs clarification.

sempervictus · 2026-03-17T16:20:46Z

Apologies for any compile issues, I can only run build tests on cuda GPU systems now due to the attentionrs change and haven't had time to dig into fixing that while at gtc (the netbook I have w me is igpu only and we don't support mkl yet)

guoqingbao · 2026-03-18T02:20:02Z

> Apologies for any compile issues, I can only run build tests on cuda GPU systems now due to the attentionrs change and haven't had time to dig into fixing that while at gtc (the netbook I have w me is igpu only and we don't support mkl yet)

No hurry, this is an optional feature.

sempervictus · 2026-03-18T06:37:46Z

Agreed although having both openai and anthropic capabilities + user supplied chain of thought mechanisms opens the doors to some awesome appdev options both using vllmrs as a standalone bin and even moreso as a lib. One example being running a preprocessing micromodel to define grammars for the big one actually doing the work in a single runtime, another being long context w small models on vram constrained systems.

That said, it may be handy to cherry-pick the EOS commit while I'm at GTC and slower than usual to complete the PR. I'm betting that's why we see those run-on problems with newer qwens on older gear: they're grabbing EOS vs EOM, which is also why we occasionally see the EOM rendered in their output, as it's not being masked. Something about naive attention seems to allow them to omit like that, but this should make both EOS and EOM work. I need to understand the attention computation product better to prove that, but the behavior is 1:1 with what happened here before I switched from the multimodal EOS single type to the EOM ID or now the multi type.

sempervictus · 2026-03-22T05:36:01Z

@guoqingbao - some decent progress :-)

dug into the jinja templates for these coder-type models a bit and it turns out they can use %json constraint types in their value fields... finite stateless grammar win:

vllm-rs-svc2  | 2026-03-22T05:28:19.885268Z  INFO vllm_rs::utils::guidance: GRAMMAR:
vllm-rs-svc2  | start: ( text | tool_call )+ eos
vllm-rs-svc2  | tool_content: tool_0 ? tool_1 ? tool_2 ? tool_3 ? tool_4
vllm-rs-svc2  | tool_0: "<function=fetch_url_via_curl>" "\n" param_0_0 (param_0_1)? "</function>" "\n"
vllm-rs-svc2  | tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
vllm-rs-svc2  | tool_1: "<function=fs_cat>" "\n" param_1_0 "</function>" "\n"
vllm-rs-svc2  | tool_2: "<function=fs_ls>" "\n" param_2_0 "</function>" "\n"
vllm-rs-svc2  | tool_3: "<function=get_current_time>" "\n"  "</function>" "\n"
vllm-rs-svc2  | tool_4: "<function=web_search_searxng>" "\n" param_4_0 (param_4_1)? "</function>" "\n"
vllm-rs-svc2  | param_0_0: "<parameter=url>" "\n" value_0_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_0_1: "<parameter=proxy>" "\n" value_0_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_1_0: "<parameter=path>" "\n" value_1_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_2_0: "<parameter=path>" "\n" value_2_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_0: "<parameter=query>" "\n" value_4_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_1: "<parameter=searxng>" "\n" value_4_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | text: /(?s:.+?)/
vllm-rs-svc2  | value_0_0_string: %json {"type":"string","description":"The URL to scrape."}
vllm-rs-svc2  | value_0_1_string: %json {"type":"string","description":"The proxy URL in the format protocol://host:port."}
vllm-rs-svc2  | value_1_0_string: %json {"type":"string","description":"The path of the file to read"}
vllm-rs-svc2  | value_2_0_string: %json {"type":"string","description":"The path of the directory to list"}
vllm-rs-svc2  | value_4_0_string: %json {"type":"string","description":"The query to search for."}
vllm-rs-svc2  | value_4_1_string: %json {"type":"string","description":"Optional searxng URL overriding the env var"}
vllm-rs-svc2  | eos:  ( <[248044]> | <[248046]> )

LLG updated to help handle the new multi-EOS models
fallback grammar generation for partial enablement is in testing to include the custom types.
the LLG-native grammar and constraint types are properly isolated in generation
I ended up having to derive BOS for a tiny gemma, but I think that had more to do with quantizing it down to 4-bit, so while the functionality is there, it's currently "just in case" we find other models getting shaky at low bits and lots of early ff-tokens

sempervictus · 2026-03-23T01:30:42Z

All three variants of grammar-induced tool-calls are working for Q3.5/Next from 0.8->122, tested on the older Q3 0.6 too with excellent results (not the XML type obviously for that model):

xml-explicit - allows multi-tool calls via tool_content: tool_0 ? tool_1 ? tool_2 ? tool_3 ? tool_4

 start: ( text | tool_call )+ eos
 tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
 tool_content: tool_0 ? tool_1 ? tool_2 ? tool_3 ? tool_4
 tool_0: "<function=fetch_url_via_curl>"  "\n" param_0_0 ( "\n" param_0_1)? "</function>" "\n"
 tool_1: "<function=fs_cat>"  "\n" param_1_0 "</function>" "\n"
 tool_2: "<function=fs_ls>"  "\n" param_2_0 "</function>" "\n"
 tool_3: "<function=get_current_time>"  "</function>" "\n"
 tool_4: "<function=web_search_searxng>"  "\n" param_4_0 ( "\n" param_4_1)? "</function>" "\n"
 param_0_0: "<parameter=url>" "\n" value_0_0_string "\n" "</parameter>" "\n"
 param_0_1: "<parameter=proxy>" "\n" value_0_1_string "\n" "</parameter>" "\n"
 param_1_0: "<parameter=path>" "\n" value_1_0_string "\n" "</parameter>" "\n"
 param_2_0: "<parameter=path>" "\n" value_2_0_string "\n" "</parameter>" "\n"
 param_4_0: "<parameter=query>" "\n" value_4_0_string "\n" "</parameter>" "\n"
 param_4_1: "<parameter=searxng>" "\n" value_4_1_string "\n" "</parameter>" "\n"
 text: /(?s:.+?)/
 value_0_0_string: %json {"type":"string","description":"The URL to scrape."}
 value_0_1_string: %json {"type":"string","description":"The proxy URL in the format protocol://host:port."}
 value_1_0_string: %json {"type":"string","description":"The path of the file to read"}
 value_2_0_string: %json {"type":"string","description":"The path of the directory to list"}
 value_4_0_string: %json {"type":"string","description":"The query to search for."}
 value_4_1_string: %json {"type":"string","description":"Optional searxng URL overriding the env var"}
 eos:  ( <[248046]> <[248044]> )

json-explicit - allows multi-tool calls via anyOf

 start: ( text | tool_call )+ eos
 tool_call: <[248058]> tool_content <[248059]> ("\n")?
 tool_content: %json {"anyOf":[{"type":"object","properties":{"name":{"type":"string","enum":["fetch_url_via_curl"]},"arguments":{"type":"object","properties":{"url":{"type":"string","description":"The URL to scrape."},"proxy":{"type":"string","description":"The proxy URL in the format protocol://host:port."}},"required":["url"]}},"required":["name","arguments"],"additionalProperties":false},{"type":"object","properties":{"name":{"type":"string","enum":["fs_cat"]},"arguments":{"type":"object","properties":{"path":{"type":"string","description":"The path of the file to read"}},"required":["path"]}},"required":["name","arguments"],"additionalProperties":false},{"type":"object","properties":{"name":{"type":"string","enum":["fs_ls"]},"arguments":{"type":"object","properties":{"path":{"type":"string","description":"The path of the directory to list"}},"required":["path"]}},"required":["name","arguments"],"additionalProperties":false},{"type":"object","properties":{"name":{"type":"string","enum":["get_current_time"]},"arguments":{"type":"object","properties":{},"required":[]}},"required":["name","arguments"],"additionalProperties":false},{"type":"object","properties":{"name":{"type":"string","enum":["web_search_searxng"]},"arguments":{"type":"object","properties":{"query":{"type":"string","description":"The query to search for."},"searxng":{"type":"string","description":"Optional searxng URL overriding the env var"}},"required":["query"]}},"required":["name","arguments"],"additionalProperties":false}]}
 text: /(?s:.+?)/
 eos:  ( <[248046]> <[248044]> )

text-native - if explicit tool grammar generation is not enabled but other types are, we need this to ensure the model can call tools, as the text grammar doesn't include non-printable added-vocab tokens. Not as precise as explicit generation but avoids a potential injection route:

 start: ( text | tool_call )+ eos
 tool_call: <[248058]> text <[248059]>
 text: /(?s:.+?)/
 eos:  ( <[248044]> <[248046]> )

No grammar if none of the CLI flags are set.
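Put together, the choice between the variants comes down to the two CLI flags. A minimal sketch of that decision, using the flag names from the PR description (the enum and function names are hypothetical, and the choice between xml-explicit and json-explicit is assumed to be made later from the configured parser):

```rust
#[derive(Debug, PartialEq)]
enum ToolGrammarMode {
    /// Schema-driven tool grammar (xml-explicit or json-explicit).
    Explicit,
    /// Text-tagged envelope: start marker, free text, end marker.
    TextNative,
    /// No tool grammar at all.
    Disabled,
}

/// Pick the tool grammar variant from the operator's flags.
fn tool_grammar_mode(enable_tool_grammar: bool, allow_constraint_api: bool) -> ToolGrammarMode {
    if enable_tool_grammar {
        ToolGrammarMode::Explicit
    } else if allow_constraint_api {
        // Other grammars are active, so tools still need an envelope rule.
        ToolGrammarMode::TextNative
    } else {
        ToolGrammarMode::Disabled
    }
}
```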

Functionally-speaking this is a huge help for small models or non-flash attention models (V100s running the various 27/35B 3.5s or the 80B coder seem very happy about this). The LLG 1.7 bump helps but it still seems preferable to use the --enforce-parser qwen option when using this for the coder or other big models due to the complexity of their tool-call workloads and the evil which is XML. I'll go over the lark generation again with a fine-tooth comb but i think i'm missing a "\n" in one of the optional generation paths.

Will push cleanup code shortly and collapse the commit for review.

sempervictus · 2026-03-23T16:22:20Z

@guoqingbao apologies, I didn't take the draft tag off last night. I think this is at a good place for a review boundary. You use different clients, and my biggest question about the logic of this state is "how do they respond to the multi-tool-call logic?" This permits both JSON and XML models to put multiple calls in one <tool_call>...</tool_call> block for efficiency, but I'm not sure all clients handle that well. We might want to fall back to the per-call pattern (oneOf in JSON and a | join in XML) to simply allow multiple calls per turn as before, instead of multiple function calls per tool call.

Propose splitting-out the following task areas to future PRs:

1. Ingress "firewall" for special tokens and template-role-flipping injections - pretty much a "re-tokenizer" for inputs to catch <[token_id]> injections in grammars and <|im_end|><|im_start|>role\n style injections in prompts
2. Chat template control - all of this "works" because we force generation along certain patterns, but in doing so we may be masking away some valuable logic the model might want to emit at those offsets in its chat template. Deep reasoning blocks, re-entrant tool blocks (<tool_response> markup), and other useful faculties may be out of reach without idiomatic understanding of template sections and their iterable or conditional nature.
2a. The jinja2json library seems to provide us AST-style access, or at least idiomatic JSON access, to anchor blocks. minijinja itself keeps the AST private. The more idiomatic access we have for string and token anchor extraction, the more we can align grammar definition to what the model was trained to emit, or at least what we tell it that it should emit when we pass the per-sequence template in for generation.
2b. The code in this PR guards reasoning if no reasoning token strings can be found in the chat template. Models like Q3CoderNext, which can think but don't have reasoning in their chat template, can produce unbounded rambling in that thinking block, which runs the sequence forever... If we were to inject a reasoning block with the tags we see it using on occasion, then we may be able to enable reasoning for more models than is currently safe. Such injections could open the doors to other specialized function or reasoning blocks for enterprise workflows to use.
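As a starting point for the ingress "firewall" idea in item 1, a purely string-level check might look like the sketch below. It is deliberately naive and not a substitute for the proposed re-tokenizer: it only recognises the single-ID `<[digits]>` form seen in the grammars above and the two ChatML markers mentioned in this thread, and both function names are hypothetical.

```rust
/// Detect a raw token-ID reference such as `<[151657]>` inside
/// client-supplied grammar text.
fn contains_token_id_literal(grammar: &str) -> bool {
    let bytes = grammar.as_bytes();
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'<' && bytes[i + 1] == b'[' {
            // Consume the run of digits after "<[".
            let mut j = i + 2;
            while j < bytes.len() && bytes[j].is_ascii_digit() {
                j += 1;
            }
            // Require at least one digit followed by "]>".
            if j > i + 2 && j + 1 < bytes.len() && bytes[j] == b']' && bytes[j + 1] == b'>' {
                return true;
            }
        }
        i += 1;
    }
    false
}

/// Detect chat-template role markers in user-supplied prompt text.
fn contains_role_marker(prompt: &str) -> bool {
    ["<|im_start|>", "<|im_end|>"]
        .iter()
        .any(|m| prompt.contains(*m))
}
```

A request that trips either check could be rejected or escaped before it reaches the tokenizer; which of the two is the right policy is left open here.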

In terms of this work: the grammar control seems most useful on small models with any attention mechanism and on any models not using flash* attention, and is 50/50 on larger models with flash*. The larger ones don't make tool-call errors nearly as much, but they seem more prone to long tool calls this way, emitting all parameters, whereas the tiny ones seem to just want to get it right with minimal output and move on. Larger models appear to do better on the qwen-style parser; smaller ones are equally good but oddly faster on the XML one if that's their native format.

guoqingbao · 2026-03-27T16:22:57Z

> In terms of this work: the grammar control seems most useful on small models with any attention mechanism and on any models not using flash* attention, and is 50/50 on larger models with flash*. The larger ones don't make tool-call errors nearly as much, but they seem more prone to long tool calls this way, emitting all parameters, whereas the tiny ones seem to just want to get it right with minimal output and move on. Larger models appear to do better on the qwen-style parser; smaller ones are equally good but oddly faster on the XML one if that's their native format.

I’ll take this on once the decoding cache mismatch issue is resolved.

P.S. I’m also casually working on a bot project in Rust. I can loop you in for initial usage if you’re interested.

sempervictus · 2026-03-27T17:30:23Z

Thanks, this is a big one to QA/manage and we absolutely want to align caches with what we generate here because we have full output control and might actually be able to simplify caching strategies using this once they prove solid with all consumers.

sempervictus · 2026-03-27T17:34:33Z

> P.S. I’m also casually working on a bot project in Rust. I can loop you in for initial usage if you’re interested.

of course - i'm doing same for that coreui workbench, next will be wiring sandboxing and stakpak/n8n-rs/etc into it for our engineers. Will add you to the repo soon as i push it as it will come with a --feature vllm.rs option to build this library into it to run the utility/dispatch models, do local image parsing, etc or just run the full LLM-workbench stack on a single system from a single binary. The idea is that the utility model will induce grammar based generation from the big model (also part of why i want to get draft decoding working for throughput).

sempervictus · 2026-03-27T17:42:23Z

@guoqingbao you should see an invite to a debugging tool repo for this PR - it exercises the right portions of the internal pipeline but might be of help w #277 / #279 too

sempervictus · 2026-03-29T08:08:29Z

@guoqingbao - a little bit of "hackery" on what we send to the tokenizer seems to do the trick:

$ curl -s http://127.0.0.1:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Evaluate the purpose of your execution"}],
    "thinking": true,
    "reasoning_effort": "high",
    "max_tokens": 2000
  }'|jq -r '.choices[0].message.content'

<think>
<ANALYZE>
User wants me to evaluate the purpose of my execution. This is a meta-reflection question. I need to determine what I'm doing and what I'm learning.

</ANALYZE>

<CRITIQUE>
The user is asking about the purpose of my execution. This is a meta-reflection question. I need to determine what I'm doing and what I'm learning.

</CRITIQUE>

<STRUCTUREDANSWER>
I'll provide a structured response that covers:

</STRUCTUREDANSWER>
</think>

My execution is designed to be efficient, accurate, and beneficial for users. It focuses on clarity, accuracy, and helpfulness while avoiding unnecessary complexity or confusion. My goal is to offer high-quality assistance without overwhelming users with excessive detail or technical jargon.

Purpose of Execution:

Clarity & Simplicity – Delivering concise, understandable content that meets the user's needs efficiently.

Accuracy & Verification – Ensuring information is correct before sharing it.

Efficiency – Avoiding redundancy and focusing on core value propositions.

User Experience – Designing interfaces that are intuitive, easy to navigate, and aligned with user goals.

Engagement & Value – Providing meaningful insights while maintaining a balanced tone.

Strategic Alignment:

User-Centric: Every decision prioritizes the user experience.

Consistency: Maintains quality across all interactions.

Ethical Standards: Adheres to guidelines for helpfulness and transparency.

This approach ensures my execution remains both high-quality and user-friendly, enabling effective communication and engagement.

A bit more adjustment of the grammar probably needed to make that more useful but the building blocks all work correctly now.

sempervictus · 2026-03-29T15:20:50Z

@guoqingbao here is the current code, stripping start-think-tag, working correctly on a 122B q35 with prefix cache matching:

...
2026-03-29T15:18:49.491680Z  WARN vllm_rs::server::server: Tools enabled for request
2026-03-29T15:18:49.492814Z  INFO vllm_rs::core::engine: [llg] Guidance enabled, trimming <think> from pre-generation. Generation starting at <|im_start|>assistant
2026-03-29T15:18:49.543518Z  WARN vllm_rs::core::engine: [Stream] New request [Seq_id 2, 30149 tokens] received! (session_id: None)

2026-03-29T15:18:49.543594Z  INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3.5-122B-A10B-FP8, enforce_parser=none)
2026-03-29T15:18:49.544224Z  INFO vllm_rs::core::block_manager: Prefix cache hit seq 2 (26304 cached tokens, 411 blocks)
2026-03-29T15:18:49.546848Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 2 (cached 26304 tokens)
...
2026-03-29T15:18:50.360038Z  INFO vllm_rs::utils::guidance: GRAMMAR:
start: reasoning_block ( text | tool_call )+ eos
reasoning_block: <[248068]> "\n" think_text "\n" <[248069]> "\n"
think_text[suffix="\n"]: /[ -~]+/
tool_call: <[248058]> text <[248059]>
text: /(?s:.+?)/
eos:  ( <[248046]> <[248044]> )

...

2026-03-29T15:18:50.371090Z  WARN vllm_rs::core::scheduler: Seq 2 - chunk prefill finished (30149 tokens)
2026-03-29T15:18:50.371113Z  INFO vllm_rs::core::engine: Prefilling [seq_id 2]: 30150 tokens in 0.86s (35098.95 tokens/s, cache included)

prefix cache keeps working through dozens of thinking+output+tool-call iterations:

2026-03-29T16:23:14.890209Z  INFO vllm_rs::server::server: [Seq 43] ⏱️ Prompt: 202405 tokens in 1.09s (185522.47 t/s)
2026-03-29T16:23:14.890217Z  INFO vllm_rs::server::server: [Seq 43] ⏱️ Decoded: 375 tokens in 8.14s (46.09 t/s)

to include proper stream-parsing of tools and reasoning tags:

Details

2026-03-29T16:23:05.531501Z  INFO vllm_rs::core::engine: [llg] Guidance enabled, trimming <think> from pre-generation. Generation starting at <|im_start|>assistant
2026-03-29T16:23:05.692275Z  WARN vllm_rs::core::engine: [Stream] New request [Seq_id 43, 202405 tokens] received! (session_id: None)

2026-03-29T16:23:05.692366Z  INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3.5-122B-A10B-FP8, enforce_parser=none)
2026-03-29T16:23:05.694077Z  INFO vllm_rs::core::block_manager: Prefix cache hit seq 43 (199040 cached tokens, 3110 blocks)
2026-03-29T16:23:05.696627Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 43 (cached 199040 tokens)
...

2026-03-29T16:23:06.742515Z  INFO vllm_rs::utils::guidance: GRAMMAR:
start: reasoning_block ( text | tool_call )+ eos
reasoning_block: <[248068]> "\n" think_text "\n" <[248069]> "\n"
think_text[suffix="\n"]: /[ -~]+/
tool_call: <[248058]> text <[248059]>
text: /(?s:.+?)/
eos:  ( <[248046]> <[248044]> )

2026-03-29T16:23:06.753581Z  WARN vllm_rs::core::scheduler: Seq 43 - chunk prefill finished (202405 tokens)
2026-03-29T16:23:06.753687Z  INFO vllm_rs::core::engine: Prefilling [seq_id 43]: 202406 tokens in 1.09s (185523.38 tokens/s, cache included)
2026-03-29T16:23:07.068805Z  INFO vllm_rs::server::parser: Tool call <tool_call> (248058) found, start buffering!
2026-03-29T16:23:11.765408Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [43]], avg. 46 tokens/s per request (total: 46 tokens/s)
2026-03-29T16:23:14.887214Z  INFO vllm_rs::core::scheduler: [Seq 43] Detected </tool_call> token 248059, finishing for external handling
2026-03-29T16:23:14.888253Z  INFO vllm_rs::core::block_manager: Prefix cache insert seq 43 (202780 tokens, 3168 blocks)
2026-03-29T16:23:14.889646Z  INFO vllm_rs::server::parser: Tool call buffering end, reached </tool_call> (248059)
2026-03-29T16:23:14.889808Z  INFO vllm_rs::server::parser: Building tool call: [StreamingToolCallState { name: Some("attempt_completion"), arguments: "{\"result\": \"## Implementation Plan: Reasoning Grammar Detection and Chat Template Validation\\n\\n### Concern 1: Detect Reasoning Grammars in `engine.rs::apply_template()`\\n\\n**Current State**: The code at line 1073 in `engine.rs` already has the detection logic:\\n```rust\\nif let Some(grammar) = params.grammar && is_reasoning_grammar(&grammar) {\\n    prompt_template.set_enable_thinking(true);\\n}\\n```\\n\\n**Function Location**: [`is_reasoning_grammar()`](src/utils/reasoning.rs:262) in `src/utils/reasoning.rs`\\n\\n**Detection Logic**:\\n- Converts `TopLevelGrammar` to JSON string representation\\n- Checks for reasoning-specific patterns: `reasoning_block`, `think_text`, `analysis_block`, `draft_phase`, `verification_phase`\\n\\n**Action Required**: None - this functionality already exists and is integrated.\\n\\n---\\n\\n### Concern 2: Chat Template Reasoning Tag Validation\\n\\n**Objective**: Ensure reasoning grammars are only generated when the chat template contains matching reasoning tags.\\n\\n**Current Gap**: No validation exists to check if the chat template has reasoning capability before generating reasoning grammars.\\n\\n**Proposed Solution**: Add a validation function in `src/utils/chat_template.rs`:\\n\\n```rust\\n/// Check if chat template contains reasoning-related markers\\npub fn template_supports_reasoning(template_content: &str) -> bool {\\n    // Look for common reasoning tag patterns\\n    let reasoning_markers = [\\n        \\\"<｜begin▁of▁thought｜>\\\",\\n        \\\"<｜end▁of▁thought｜>\\\",\\n        \\\"\"}" }]
2026-03-29T16:23:14.890155Z  INFO vllm_rs::tools::helpers: Valid tool call(s): attempt_completion(args={"result":"## Implementation Plan: Reasoning Grammar Detection and Chat Template Validation\n\n### Concern 1: Detect Reasoning Grammars in `engine.rs::apply_tem...)
2026-03-29T16:23:14.890187Z  INFO vllm_rs::server::server: Final chunk emitted after tool-call delta chunk(s): ChatCompletionChunk { id: "seq-43", object: "chat.completion.chunk", created: 1774801385525, model: "g60", choices: [ChatChoiceChunk { index: 0, delta: Delta { role: None, content: None, tool_calls: None }, finish_reason: Some("tool_calls"), error: None }], usage: Some(Usage { prompt_tokens: 202405, completion_tokens: 375, total_tokens: 202780 }) }
2026-03-29T16:23:14.890205Z  WARN vllm_rs::server::server: --- Performance Metrics ---
2026-03-29T16:23:14.890209Z  INFO vllm_rs::server::server: [Seq 43] ⏱️ Prompt: 202405 tokens in 1.09s (185522.47 t/s)
2026-03-29T16:23:14.890217Z  INFO vllm_rs::server::server: [Seq 43] ⏱️ Decoded: 375 tokens in 8.14s (46.09 t/s)
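The buffering behaviour visible in the log above (start buffering at token 248058, stop at 248059, then build the tool call) can be sketched roughly as follows. This is a minimal illustration, not the parser in this PR: `ToolCallBuffer` and `StreamEvent` are hypothetical names, and only the two marker token IDs come from the log.

```rust
/// Events a streaming parser might hand to the HTTP layer (hypothetical type).
#[derive(Debug, PartialEq)]
enum StreamEvent {
    /// Ordinary token, forwarded to the client immediately.
    Text(u32),
    /// A complete tool call: every token between the start and end markers.
    ToolCall(Vec<u32>),
}

/// Hypothetical buffer keyed on the tool-call marker token IDs.
struct ToolCallBuffer {
    start_id: u32,
    end_id: u32,
    buffered: Option<Vec<u32>>,
}

impl ToolCallBuffer {
    fn new(start_id: u32, end_id: u32) -> Self {
        Self { start_id, end_id, buffered: None }
    }

    /// Feed one decoded token; returns an event when something is ready to emit.
    fn push(&mut self, token: u32) -> Option<StreamEvent> {
        // Start marker seen outside a tool call: begin buffering, emit nothing.
        if token == self.start_id && self.buffered.is_none() {
            self.buffered = Some(Vec::new());
            return None;
        }
        if self.buffered.is_some() {
            // End marker: hand the whole buffered call over in one piece.
            if token == self.end_id {
                return self.buffered.take().map(StreamEvent::ToolCall);
            }
            if let Some(buf) = self.buffered.as_mut() {
                buf.push(token);
            }
            return None;
        }
        Some(StreamEvent::Text(token))
    }
}

fn main() {
    // 248058 / 248059 are the <tool_call> / </tool_call> IDs from the log.
    let mut parser = ToolCallBuffer::new(248058, 248059);
    let stream = [11, 248058, 21, 22, 248059, 12];
    let events: Vec<StreamEvent> = stream.iter().filter_map(|&t| parser.push(t)).collect();
    println!("{events:?}");
}
```

Keying on token IDs rather than decoded text avoids having to match a marker that is split across detokenized chunks, which is presumably why the log reports the IDs alongside the tag names.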

The resulting IO renders correctly with clients which handle reasoning tags:

^^ is on the 122B - i need a better guard to ensure we cannot enable thinking on models like CoderNext as models without a reasoning block in their template appear much more prone to 'runaway generation' in those blocks than reasoning-capable ones.

This is where #279 and this PR might "collide" and why i'm trying to be extra careful in the parallel dev of this one - if i strip that start-think line but disable reasoning entirely then your code will have accounted for two more tokens in the prefix than are actually emitted. That said, same problem: if you emit a start-think-tag in the template but the model doesn't know how to end it, regardless of constraints being enabled or not it will emit nonsense after a little while because it doesn't have a lookup for the closing tag which would allow transition to normal generation through EOS.
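One possible shape for the guard described above, as a sketch only: refuse to emit a reasoning grammar unless the chat template contains both the opening and the closing reasoning tag. The function names and the hard-coded `<think>` / `</think>` strings are assumptions for illustration; other model families use different markers.

```rust
/// Hypothetical check: the template must contain both tags, so the model has
/// seen a way to close the block as well as a way to open it.
fn template_has_reasoning_block(template: &str, open_tag: &str, close_tag: &str) -> bool {
    template.contains(open_tag) && template.contains(close_tag)
}

/// Decide whether a reasoning_block rule should be generated for a request.
/// The tag strings are assumed here; a real guard would take them from the
/// model configuration.
fn should_enable_reasoning(template: &str, reasoning_requested: bool) -> bool {
    reasoning_requested && template_has_reasoning_block(template, "<think>", "</think>")
}

fn main() {
    let reasoning_template = "<|im_start|>assistant\n<think>\n...\n</think>\n";
    let plain_template = "<|im_start|>assistant\n";

    println!("reasoning template: {}", should_enable_reasoning(reasoning_template, true));
    println!("plain template:     {}", should_enable_reasoning(plain_template, true));
}
```

A substring check like this is deliberately conservative: it cannot tell whether the tags sit inside a conditional branch of the template, only that the template mentions them at all.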

sempervictus · 2026-05-12T04:43:25Z

That was a weird one but $defs and $ref are resolved now - turns out JSON schema can be referential 😮‍💨

guoqingbao · 2026-06-02T09:13:18Z

Found one regression: it seems the tool grammar is only suitable for Qwen models. Here is GLM4.7 Flash, which works without tool grammar.

When tool grammar is enabled, it reports:

│  Let me use the correct tool invocation syntax:                                          ▲
│                                                                                          │
│  ✓ ⋮ {"name": ─────────────────────────────────────────────────────────────────────      │
│    │ 🛠 {"name":                                                                          │
│    │ ✗ Error: Tool '{"name":' not found                                                  │
│    │                                                                                     █
│    │   [Analyze the error above and try a different approach.]                           │
│    └───────────────────────────────────────────────────────────────────────────────      │
│                                                                                          │
│  I apologize — I'm having a technical issue with the tool invocation. Let me try readin  │
│  g the SKILL.md file directly:                                                           │
│                                                                                          │
│  ✓ ⋮ {"name": ─────────────────────────────────────────────────────────────────────      │
│    │ 🛠 {"name":                                                                          │
│    │ ✗ Error: Tool '{"name":' not found                                                  │
│    │                                                                                     │
│    │   [Analyze the error above and try a different approach.]

This introduces a complete grammar composition system built on llguidance that handles multiple constraint types, tool calling styles, and reasoning effort levels through a single coherent architecture.

Key capabilities:

1. Multi-format constraint grammars via Lark, regex, JSON schema, choice lists, and structural tags - all normalized to TopLevelGrammar
2. Tool grammar generation for both Qwen-style JSON format and Qwen3-Coder XML format with proper parameter ordering and deduplication
3. Reasoning effort levels (none, low, medium, high, chain_of_thought) which wrap base grammars with reasoning block constraints
4. GrammarComposer.compose_all_grammars() that merges constraints, tools, and reasoning in the correct precedence order
5. Thinking fallback via VLLM_RS_PROVIDE_THINKING_FALLBACK that transforms <[token_id]> syntax to string literals for models not trained on reasoning tokens, detecting via chat_template.enable_thinking flag
6. Schema sanitization that strips unsupported format attributes from JSON schemas before passing to llguidance

Architecture:

- GrammarBuilder trait for composable grammar fragments
- GrammarRequestDispatcher for building grammars from request context
- GrammarComposer for merging constraint, tool, and reasoning grammars
- apply_thinking_fallback() for models without reasoning tokens in template

Examples:

```lark
start: "yes" | "no" eos
eos: ( <[248044]> | <[248046]> )
```

```lark
start: reasoning_block "positive" | "negative" | "neutral" eos
reasoning_block: <[248068]> "\n" think_text "\n" (think_text+ "\n")? "\n" <[248069]> "\n\n"
think_text[suffix="\n"]: /[ -~]+/
eos: ( <[248044]> | <[248046]> )
```

```lark
start: ( text | tool_call )+ eos
text: /(?s:.+?)/
tool_call: <[248058]> tool_content <[248059]>
param_0_0: "\n<parameter=url>\n" value_string
param_0_1: "\n<parameter=proxy>\n" value_string
tool_0: "\n<function=fetch_url_via_curl>" param_0_0 (param_0_1)? "</function>\n"
param_1_0: "\n<parameter=path>\n" value_string
tool_1: "\n<function=fs_cat>" param_1_0 "</function>\n"
param_2_0: "\n<parameter=path>\n" value_string
tool_2: "\n<function=fs_ls>" param_2_0 "</function>\n"
tool_3: "\n<function=get_current_time>\n" "</function>\n"
param_4_0: "\n<parameter=query>\n" value_string
param_4_1: "\n<parameter=searxng>\n" value_string
tool_4: "\n<function=web_search_searxng>" param_4_0 (param_4_1)? "</function>\n"
tool_content: tool_0 | tool_1 | tool_2 | tool_3 | tool_4
value_string[suffix="\n</parameter>\n"]: /[\x20-\x7E\x0A\x0D]+?/
eos: ( <[248044]> | <[248046]> )
```

```lark
start: reasoning_block text eos
reasoning_block: <[248068]> analysis_block critique_block structure_block "\n" <[248069]> "\n\n"
analysis_block: "\n<analysis>\n" analysis_text
analysis_text[suffix="\n</analysis>\n"]: /[\x20-\x7E\x0A\x0D]+?/
critique_block: "\n<critique>\n" critique_text
critique_text[suffix="\n</critique>\n"]: /[\x20-\x7E\x0A\x0D]+?/
structure_block: "\n<structure_response>\n" structure_text
structure_text[suffix="\n</structure_response>\n"]: /[\x20-\x7E\x0A\x0D]+?/
text: %json {"type":"object","properties":{"steps":{"type":"array","items":{"type":"string"}},"final_answer":{"type":"string"}},"required":["steps","final_answer"]}
eos: ( <[248044]> | <[248046]> )
```

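To make the Qwen3-Coder XML tool grammar in the commit message above more concrete, here is a rough sketch of how rules of that shape could be rendered from a tool list. It is not the generator in this PR: `ToolSpec` and `tool_rules` are hypothetical, the tool description is reduced to a name plus (parameter, required) pairs, and only the output format is taken from the commit message.

```rust
/// Hypothetical, simplified tool description: a name plus (parameter, required) pairs.
struct ToolSpec {
    name: &'static str,
    params: Vec<(&'static str, bool)>,
}

/// Render Lark rules for one tool: one `param_i_j` rule per parameter,
/// followed by the `tool_i` rule that references them in order.
fn tool_rules(index: usize, tool: &ToolSpec) -> Vec<String> {
    let mut rules = Vec::new();
    let mut body = Vec::new();
    for (j, (param, required)) in tool.params.iter().enumerate() {
        let rule = format!("param_{index}_{j}");
        rules.push(format!(r#"{rule}: "\n<parameter={param}>\n" value_string"#));
        // Optional parameters are wrapped as `(rule)?`.
        body.push(if *required { rule } else { format!("({rule})?") });
    }
    let name = tool.name;
    // A tool without parameters gets a trailing newline after the opening tag.
    let open = if tool.params.is_empty() {
        format!(r#""\n<function={name}>\n""#)
    } else {
        format!(r#""\n<function={name}>""#)
    };
    let mut parts = vec![open];
    parts.extend(body);
    parts.push(r#""</function>\n""#.to_string());
    rules.push(format!("tool_{index}: {}", parts.join(" ")));
    rules
}

fn main() {
    let tools = [
        ToolSpec { name: "fetch_url_via_curl", params: vec![("url", true), ("proxy", false)] },
        ToolSpec { name: "get_current_time", params: vec![] },
    ];
    let mut names = Vec::new();
    for (i, tool) in tools.iter().enumerate() {
        for rule in tool_rules(i, tool) {
            println!("{rule}");
        }
        names.push(format!("tool_{i}"));
    }
    println!("tool_content: {}", names.join(" | "));
}
```

Emitting parameters in a fixed order, with optional ones wrapped in `( ... )?`, keeps each `tool_i` rule a plain sequence, which matches the parameter ordering mentioned in the capability list.
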
@guoqingbao

Remove the template preamble when grammar generation is active so the grammar controls the entire emitted token space. Testing for guoqingbao#352 observed some NVFP4 models insisting on producing a BOS regardless of the template preamble; grammar-constrained generation prevents them from doing so, which reduces logprobs for the remainder of the sequence and drastically shortens output. Create ReasoningEffort::ModelDefault to replace @guoqingbao's way of preempting a model to reason by way of an open `<think>` tag, predicated on the same CLI parameter state as the original template-driven implementation.

Revise reasoning logic to limit max tokens at each lexeme and add a ModelDefault ReasoningLevel to handle the case where no actual reasoning level is provided but `--disable-reasoning` is not set. Make the reasoning text between the reasoning tokens optional to allow opt-out for models not trained to generate in the block. This allows `RedHatAI/Qwen3.6-35B-A3B-NVFP4` to actually produce output atop 4xV100 under guided constraints, as using:

```lark
start: text eos
text: /(?s:.+?)/
eos: ( <[248044]> | <[248046]> )
```

results in the model emitting an immediate EOS, because the first token it produces without constraints is always `<|im_start|>`, even with the `<think>` tag preamble provided by the template to prime the model for generating the content of the reasoning block. When it is only allowed to emit from the standard vocabulary and EOS, it either outputs a nonsense token and then EOS or just goes straight to EOS. With the expanded constraint, which includes reasoning tokens:

```lark
start: reasoning_block text eos
reasoning_block: <[248068]> "\n" think_text? "\n\n" <[248069]> "\n"
think_text[temperature=0, max_tokens=768]: /(?s:.+?)/
text: /(?s:.+?)/
eos: ( <[248044]> | <[248046]> )
```

the first token emitted is from the added vocabulary, which is at least proximate to the BOS token it's not being allowed to emit. Tokens subsequent to the added-vocabulary one appear to be stable for up to a few hundred reasoning tokens, then start to collapse and loop until max_tokens of the reasoning block is reached and another special token is emitted (which stabilizes output in the normal text/tool-call phase). Of note: the larger the input context, the more stable this specific model appears.

Moreover, when there are no constraints applied, the BOS tag it emits usually has the model role included (ignoring the template preamble), but in some cases, such as the open `<think>` tag being "noticed" in generation, it outputs absurdities such as `<|im_start|><think>` without the requisite role definition or newline in between.

Set max_tokens value in the default reasoning block via env var or fall-back to 512 default. Use XINFER_DEFAULT_REASONING_MAX_TOKENS to adjust at startup.
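A minimal sketch of the env-var fallback described in this commit message, assuming the value is read once at startup and spliced into the `think_text` rule. The helper names are hypothetical, and treating a zero or non-numeric value as "unset" is an assumption rather than something the commit message states.

```rust
use std::env;

/// Fallback named in the commit message.
const DEFAULT_REASONING_MAX_TOKENS: usize = 512;

/// Parse a raw setting; missing, non-numeric, or zero values fall back to
/// the default (the zero handling is an assumption of this sketch).
fn parse_max_tokens(raw: Option<&str>) -> usize {
    raw.and_then(|v| v.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_REASONING_MAX_TOKENS)
}

/// Read XINFER_DEFAULT_REASONING_MAX_TOKENS from the process environment.
fn reasoning_max_tokens() -> usize {
    parse_max_tokens(env::var("XINFER_DEFAULT_REASONING_MAX_TOKENS").ok().as_deref())
}

/// Render a think_text rule with the chosen token budget, in the same shape
/// as the reasoning grammar quoted earlier in this thread.
fn think_text_rule(max_tokens: usize) -> String {
    format!("think_text[temperature=0, max_tokens={max_tokens}]: /(?s:.+?)/")
}

fn main() {
    let budget = reasoning_max_tokens();
    println!("{}", think_text_rule(budget));
}
```

Keeping the parsing in a pure function that takes `Option<&str>` makes the fallback testable without mutating the process environment.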

sempervictus · 2026-06-02T22:51:00Z

Found one regression: it seems the tool grammar is only suitable for Qwen models. Here is GLM4.7 Flash, which works without tool grammar.

When tool grammar is enabled, it reports:

│  Let me use the correct tool invocation syntax:                                          ▲
│                                                                                          │
│  ✓ ⋮ {"name": ─────────────────────────────────────────────────────────────────────      │
│    │ 🛠 {"name":                                                                          │
│    │ ✗ Error: Tool '{"name":' not found                                                  │
│    │                                                                                     █
│    │   [Analyze the error above and try a different approach.]                           │
│    └───────────────────────────────────────────────────────────────────────────────      │
│                                                                                          │
│  I apologize — I'm having a technical issue with the tool invocation. Let me try readin  │
│  g the SKILL.md file directly:                                                           │
│                                                                                          │
│  ✓ ⋮ {"name": ─────────────────────────────────────────────────────────────────────      │
│    │ 🛠 {"name":                                                                          │
│    │ ✗ Error: Tool '{"name":' not found                                                  │
│    │                                                                                     │
│    │   [Analyze the error above and try a different approach.]

Thank you for checking in on this branch sir. I wrote grammar generators for Gemma4 and MiniMax but haven't run into GLM yet. Agree we need per-model appropriate grammars, for all models. That said... --enforce-parser qwen should address this since it makes the model generate JSON tool-calls and the tool-parser handle that format (you should be able to force any of the valid formats, to include GLM once we add it).

I'll get a hold of a chat template and add that as well. Should i be moving model-specific grammars into the existing model's file or starting a new subtree? Some of the files are getting a bit porcine.

Current state is:

Trying to get ff-tokens shoved into the Sequence from the Runner's sample() call (free sampling-ahead, which currently we're not really set up for in sample() and downstream)
"smooth" the logit mask a bit - instead of -inf, maybe apply something like a -1000 bias to make masked tokens really unlikely without completely removing them when they are required
get the masking directly into the sampling kernel since it may benefit from operational fusion and offer a smoother gradient of logit modification relative to the sampling strategy used (or maybe force an argmax when constrained, working that out).

The more i get into the maths of what's going on before and after this the more my head hurts. The actual guidance process is fairly simple but the downstream implications on generation from prior enforcement biasing subsequent token selection should result in all sorts of messed up output... but they don't. We're only really eliminating special tokens for most of the generation (.* only covers common vocab) and our tool grammars are 1:1 or very close to the trained chat template output for those models but i'm kind of surprised we haven't run into invisible padding token issues or other weird things like that.

sempervictus · 2026-06-02T22:54:00Z

BTW the gemma4 generator is for its native "JSON-like" format, and it works. I know you reverted to using JSON tool-parsing at some point and i realigned the grammargen for that change but the code is still there if you want to use the really odd but native syntax they wrote for tool calls.

sempervictus · 2026-06-03T00:04:00Z

vllm-rs-svc0  | 2026-06-02T23:56:25.637980Z  INFO xinfer::core::engine: Prefilling 1 seq(s) [0]: 1360187 total tokens in 451.52s (3012.48 tokens/s, cache included)
vllm-rs-svc0  | 2026-06-02T23:56:26.253979Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 198
vllm-rs-svc0  | 2026-06-02T23:56:26.253996Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 27
vllm-rs-svc0  | 2026-06-02T23:56:26.253999Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 1688
vllm-rs-svc0  | 2026-06-02T23:56:26.254097Z  INFO xinfer::server::parser: Tool call <tool_call> (151657) found, start buffering!
vllm-rs-svc0  | 2026-06-02T23:56:26.292107Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 3152
vllm-rs-svc0  | 2026-06-02T23:56:26.292118Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 10716
vllm-rs-svc0  | 2026-06-02T23:56:26.292122Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 397
vllm-rs-svc0  | 2026-06-02T23:56:26.292125Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 27
vllm-rs-svc0  | 2026-06-02T23:56:26.292127Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 16181
vllm-rs-svc0  | 2026-06-02T23:56:26.292130Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 81940
vllm-rs-svc0  | 2026-06-02T23:56:26.579813Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 198
vllm-rs-svc0  | 2026-06-02T23:56:26.579828Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 27
vllm-rs-svc0  | 2026-06-02T23:56:26.579831Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 16181
vllm-rs-svc0  | 2026-06-02T23:56:26.579834Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 79194
vllm-rs-svc0  | 2026-06-02T23:56:27.252387Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 522
vllm-rs-svc0  | 2026-06-02T23:56:27.252399Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 1688
vllm-rs-svc0  | 2026-06-02T23:56:27.252402Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 397
vllm-rs-svc0  | 2026-06-02T23:56:27.252405Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 151658

.... 🤞

guoqingbao · 2026-06-03T02:33:58Z

BTW the gemma4 generator is for its native "JSON-like" format, and it works. I know you reverted to using JSON tool-parsing at some point and i realigned the grammargen for that change but the code is still there if you want to use the really odd but native syntax they wrote for tool calls.

I made some changes to fix bugs that were already addressed in my previous commits, but your latest force push removed those changes. Please base your work on the most recent updates so we can get this PR merged, rather than overwriting or ignoring them.

The LLGuidance Matcher state machine can produce fast-forward (FF) tokens: from certain sequence positions, the grammar definition fixes the token IDs that must follow. This permits emitting tokens which must be correct while avoiding forward passes for them entirely, and also allows comparing the sampled token to the FF state as an additional validator of sampling correctness relative to the mask/FSM. The token used is committed to the FSM regardless of whether it was sampled or fast-forwarded, ensuring alignment of the FSM increment with the sequence position. TODO: currently only ff_tokens[0] is used, as a healing mechanism, but the function can return many tokens at once. These cannot be handled in the context of sample() as they must be appended to the sequence in the right order during `Scheduler::postprocess()`, and cached state along with seq pos and FSM increment must align with each other before the next forward pass. Draft infrastructure may be useful for handling the associated complexity when stabilized via EAGLE or another mechanism.
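The idea in this commit message can be illustrated with a toy matcher. This is explicitly not the llguidance API: `ToyMatcher`, `ff_tokens`, `consume`, and `step` are hypothetical, and the "grammar" is just one set of allowed token IDs per position. As in the commit message, only the first forced token is applied per step, and every token is committed to the state machine whether it was forced or sampled.

```rust
/// Toy stand-in for a grammar matcher: one set of allowed token IDs per position.
struct ToyMatcher {
    slots: Vec<Vec<u32>>,
    pos: usize,
}

impl ToyMatcher {
    /// Tokens that are the only legal choice from the current position onward.
    fn ff_tokens(&self) -> Vec<u32> {
        self.slots[self.pos..]
            .iter()
            .take_while(|allowed| allowed.len() == 1)
            .map(|allowed| allowed[0])
            .collect()
    }

    /// Commit a token to the state machine, whether sampled or fast-forwarded.
    fn consume(&mut self, token: u32) -> Result<(), String> {
        match self.slots.get(self.pos) {
            Some(allowed) if allowed.contains(&token) => {
                self.pos += 1;
                Ok(())
            }
            Some(_) => Err(format!("token {token} rejected at position {}", self.pos)),
            None => Err("grammar already finished".to_string()),
        }
    }

    fn is_done(&self) -> bool {
        self.pos >= self.slots.len()
    }
}

/// One decode step: take the first forced token if there is one, otherwise
/// "sample". Returns the token and whether a forward pass was needed.
fn step(matcher: &mut ToyMatcher, sample: impl Fn(&[u32]) -> u32) -> Result<(u32, bool), String> {
    if matcher.is_done() {
        return Err("grammar already finished".to_string());
    }
    let forced = matcher.ff_tokens();
    let (token, needed_forward) = match forced.first() {
        Some(&t) => (t, false),
        None => (sample(&matcher.slots[matcher.pos]), true),
    };
    matcher.consume(token)?;
    Ok((token, needed_forward))
}

fn main() {
    let mut matcher = ToyMatcher {
        slots: vec![vec![198], vec![27], vec![1688], vec![10, 11], vec![151658]],
        pos: 0,
    };
    // Stand-in for real sampling: pick the first allowed token.
    let pick_first = |allowed: &[u32]| allowed[0];
    let mut forward_passes = 0;
    while !matcher.is_done() {
        let (token, needed_forward) = step(&mut matcher, pick_first).expect("token accepted");
        if needed_forward {
            forward_passes += 1;
        }
        println!("token {token} (forward pass: {needed_forward})");
    }
    println!("forward passes used: {forward_passes}");
}
```

Applying one forced token per step keeps the sequence position, the matcher position, and any cached state advancing in lockstep, which is the alignment concern the TODO raises for the multi-token case.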

The original implementation used hard masking (f32::NEG_INFINITY) which completely eliminates probability mass from disallowed tokens. This creates a gradient cliff that can disrupt the inference loop, especially when grammar constraints appear in text sections that only allow common vocabulary.

Soft-masking with configurable parameters:
- SoftMaskConfig struct with mask_shift (-1000.0), min_logit (-1e9), and enabled flag
- Environment variable XINFER_SOFT_MASK_DISABLED controls soft-masking behavior

When enabled, masked tokens get logit = (original_logit - 1000.0).max(-1e9)

Mathematical rationale:
- F32 range: min = -3.4028235e+38, max = 3.4028235e+38
- Softmax probability for shifted logit: exp(-1000) / sum(exp(logits)) ~ 10^-435
- This is effectively zero for practical purposes while preserving gradient flow
- The gradient of softmax at -1000 is non-zero (unlike -inf which has zero gradient)
- min_logit = -1e9 is safe: well above f32::MIN, prevents underflow to -inf
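The shift formula above can be tried out with a small sketch. The struct fields follow the commit message, but the function names, the `allowed` bitmap representation, and the plain softmax are assumptions made for illustration. One caveat worth seeing in practice: in f32 arithmetic, exp(-1000) underflows to exactly 0.0, so when at least one token is allowed the soft and hard masks give the same probabilities. The visible difference is in the degenerate case where every token is masked, where the hard mask yields NaN and the soft mask keeps a usable ordering.

```rust
/// Mirrors the SoftMaskConfig described above; the layout is an assumption.
#[derive(Clone, Copy, Debug)]
struct SoftMaskConfig {
    mask_shift: f32,
    min_logit: f32,
    enabled: bool,
}

impl Default for SoftMaskConfig {
    fn default() -> Self {
        Self { mask_shift: -1000.0, min_logit: -1e9, enabled: true }
    }
}

/// `allowed[i]` is true when the grammar permits token `i`.
fn apply_mask(logits: &mut [f32], allowed: &[bool], cfg: SoftMaskConfig) {
    for (logit, &ok) in logits.iter_mut().zip(allowed) {
        if ok {
            continue;
        }
        *logit = if cfg.enabled {
            // Soft mask: shift down by 1000, clamped at min_logit.
            (*logit + cfg.mask_shift).max(cfg.min_logit)
        } else {
            // Hard mask: remove the token entirely.
            f32::NEG_INFINITY
        };
    }
}

/// Numerically stable softmax (subtracts the maximum before exponentiating).
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let soft_cfg = SoftMaskConfig::default();
    let hard_cfg = SoftMaskConfig { enabled: false, ..soft_cfg };

    let mut logits = [2.0_f32, 5.0, 1.0];
    apply_mask(&mut logits, &[true, false, true], soft_cfg);
    println!("soft-masked logits: {logits:?}");
    println!("probabilities:      {:?}", softmax(&logits));

    // Degenerate case: the grammar allows nothing at this position.
    let mut hard = [2.0_f32, 5.0, 1.0];
    apply_mask(&mut hard, &[false; 3], hard_cfg);
    println!("hard, all masked:   {:?}", softmax(&hard));

    let mut soft = [2.0_f32, 5.0, 1.0];
    apply_mask(&mut soft, &[false; 3], soft_cfg);
    println!("soft, all masked:   {:?}", softmax(&soft));
}
```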

sempervictus · 2026-06-03T16:28:18Z

I made some changes to fix bugs that were already addressed in my previous commits, but your latest force push removed those changes. Please base your work on the most recent updates so we can get this PR merged, rather than overwriting or ignoring them.

Apologies, didn't see the changes - i rebase the local branch off main whenever merge conflicts arise. ~~Will take a look above before the rebases and cherry-pick those commits~~

Looking through the history in this GH page it doesn't show commits added to the branch by anyone other than me - am i somehow deleting GH history too when i rebase? :-\ Happen to have a branch from which i can pick those correctly (or just re-add them here and i'll merge everything down to 1 commit)?

I'll push the latest state with soft-scaling and single-token FF (only a recovery function right now, the multi-token FF requires more work to align sequence states/FSM/cached_tokens/possibly other things) - i also found the FP4 models take better to having their repetition penalties applied before masking at longer (1m+) context but that might be workload specific because the agent re-reads the same files a ton which induces repetition from the chat history ... so curious if you see the same.

sempervictus · 2026-06-04T14:01:21Z

@guoqingbao if you have a chance before you sign off for the night, could you please push any changes you had to this branch? I didn't realize you were pushing here and rebased locally off of main (see above). I've added PR branches to my git config to have it pull all PRs when i fetch so i should see changes moving forward but unfortunately it looks like i lost yours pretty thoroughly in the rebase.

sempervictus marked this pull request as draft March 15, 2026 16:02

sempervictus force-pushed the grammars/pr branch from e789778 to 862b18b Compare March 16, 2026 03:02

sempervictus force-pushed the grammars/pr branch 2 times, most recently from 90e39de to 443f552 Compare March 22, 2026 05:20

sempervictus marked this pull request as ready for review March 23, 2026 16:06

sempervictus force-pushed the grammars/pr branch 2 times, most recently from 0b55df7 to 68554aa Compare March 26, 2026 05:24

sempervictus mentioned this pull request Mar 27, 2026

Fix decoding cache mismatch #277

Closed

sempervictus force-pushed the grammars/pr branch from 114445c to 5c51592 Compare March 27, 2026 22:49

sempervictus mentioned this pull request Mar 28, 2026

Rewrite special markers in prompts to fix cache mismatch #279

Closed

This was referenced Mar 29, 2026

FP8 CTX Cache #43

Closed

Output parser regression on reasoning tokens #280

Closed

sempervictus mentioned this pull request Mar 30, 2026

Fix reasoning marker issue #281

Merged

sempervictus mentioned this pull request May 15, 2026

Support FP8 KV Cache on all CUDA and Metal platforms #347

Merged

guoqingbao force-pushed the main branch from 4b7b57d to 7dc2fe6 Compare May 21, 2026 16:02

sempervictus force-pushed the grammars/pr branch 5 times, most recently from 1203887 to f5ccf3c Compare May 27, 2026 14:15

This was referenced May 28, 2026

Support multi-token prediction (Eagle MTP) #369

Open

FP4 models on SM70 not decoding #352

Closed

sempervictus force-pushed the grammars/pr branch from f5ccf3c to 13ef6a0 Compare May 31, 2026 06:28

This was referenced May 31, 2026

Use Ceiling Division for Scale Tensor #374

Closed

Nemotron Elastic Nested Model Support #375

Open

SM120 FP4 Behavior #317

Closed

Support more NVFP4 model formats (mixed & MLX) #373

Merged

RageLtMan added 5 commits June 2, 2026 16:11

LLG: Set Default Reasoning Length via ENV

f20f86f

Set max_tokens value in the default reasoning block via env var or fall-back to 512 default. Use XINFER_DEFAULT_REASONING_MAX_TOKENS to adjust at startup.

LLG: Guidance Mask After Penalties

c80e182

sempervictus force-pushed the grammars/pr branch from 00eb740 to f20f86f Compare June 2, 2026 20:23

RageLtMan added 2 commits June 2, 2026 22:40
