
LLG: Comprehensive Guided Decoding Infrastructure #265

Open
sempervictus wants to merge 7 commits into guoqingbao:main from sempervictus:grammars/pr


@sempervictus

@sempervictus sempervictus commented Mar 15, 2026

Contributor

Comprehensive Guided Decoding Infrastructure

1. Structured Output Constraints (Opt-In Security Model)

Client applications can now request structured outputs via the OpenAI-compatible API using the structured_outputs field. The system supports:

  • Choice constraints: Force responses to be selected from a predefined list of strings
  • Regex constraints: Enforce text patterns via Lark grammar definitions
  • JSON Schema constraints: Validate responses against JSON Schema definitions
  • Custom Lark grammars: Full grammar definition control for advanced use cases
  • Structural tag constraints: Enforce tool call formatting with custom start/end markers

Security Model: All client-provided constraints are BLOCKED by default. Operators must explicitly enable via --allow-constraint-api CLI flag. This prevents malicious grammar injection attacks that could:

  • Bypass role boundaries in chat templates
  • Inject system prompts or tool calls
  • Exploit ReDoS vulnerabilities through regex patterns
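The opt-in behaviour described above can be sketched as a small gate. This is a minimal illustration, not the PR's code: the `Constraint` and `GateDecision` types and the `gate_constraint` function are hypothetical names, and only the warning text is taken from the sequence diagram further down.

```rust
/// Client-supplied constraint kinds listed above (type and variant names are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub enum Constraint {
    Choice(Vec<String>),
    Regex(String),
    JsonSchema(String),
    Lark(String),
    StructuralTag { start: String, end: String },
}

/// Outcome of the opt-in check for a single request (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum GateDecision {
    /// No constraint was requested; generate normally.
    Unconstrained,
    /// A constraint was requested but the operator has not opted in, so it is dropped.
    Ignored { warning: String },
    /// A constraint was requested and the operator allows it.
    Apply(Constraint),
}

/// Blocks client constraints unless the operator passed --allow-constraint-api.
pub fn gate_constraint(allow_constraint_api: bool, requested: Option<Constraint>) -> GateDecision {
    match (requested, allow_constraint_api) {
        (None, _) => GateDecision::Unconstrained,
        (Some(_), false) => GateDecision::Ignored {
            warning: "constraint ignored: allow_constraint_api=false".to_string(),
        },
        (Some(c), true) => GateDecision::Apply(c),
    }
}
```

Returning an explicit `Ignored` value rather than an error mirrors the graceful-degradation behaviour: the request still completes, just without the constraint.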

2. Automatic Tool Grammar Generation

When tools are defined in a request, the system can automatically generate grammars that constrain tool calls to valid JSON schemas:

  • XML-style tool calls: Full tool grammar with <tool_call> markers and schema validation
  • Fallback text-based tool envelope: Text-tagged tool calls when token-based markers are unavailable
  • Parser-specific tool formatting: Automatic grammar adaptation based on configured parser (json, regex, etc.)

Security Model: Tool grammar generation is BLOCKED by default. Enable via --enable-tool-grammar CLI flag. When enabled, tools are validated against the defined schema before grammar generation.

3. Reasoning Effort Control

The system supports configurable reasoning effort levels that generate model-specific thinking grammar:

  • None: No reasoning block, direct output
  • Low: Fast thinking with single paragraph constraint (~150 characters)
  • Medium: Multi-step reasoning with sentence boundary enforcement
  • High: Adversarial critique pattern (Cheng & Su 2025)
  • Chain of Thought: Multi-phase verification with draft, verify, critique, and structure phases

Reasoning effort integrates with constraint grammars to produce composed output that includes thinking blocks before final response.
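As a rough sketch of how an effort level could map to a reasoning rule, assuming the rule shape seen in the GRAMMAR logs later in this thread: the accepted strings (other than "high"), the `reasoning_rules` helper, and the bounded repetition used for Low are assumptions for illustration, and the higher levels are collapsed to one generic body here.

```rust
use std::str::FromStr;

/// Effort levels named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
    ChainOfThought,
}

impl FromStr for ReasoningEffort {
    type Err = String;

    // The accepted spellings are assumptions; only "high" appears in this thread.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "none" => Ok(ReasoningEffort::None),
            "low" => Ok(ReasoningEffort::Low),
            "medium" => Ok(ReasoningEffort::Medium),
            "high" => Ok(ReasoningEffort::High),
            "chain_of_thought" => Ok(ReasoningEffort::ChainOfThought),
            other => Err(format!("unknown reasoning_effort: {other}")),
        }
    }
}

/// Renders Lark-style rules for the thinking block, or None when reasoning is off.
/// `start_id` / `end_id` stand in for the model's reasoning start/end token IDs.
pub fn reasoning_rules(effort: ReasoningEffort, start_id: u32, end_id: u32) -> Option<String> {
    let body = match effort {
        ReasoningEffort::None => return None,
        // "~150 characters" from the description, expressed as a bounded repetition.
        ReasoningEffort::Low => "/[ -~]{1,150}/",
        // Simplification: the richer Medium/High/ChainOfThought patterns are not modelled.
        _ => "/[ -~]+/",
    };
    Some(format!(
        "reasoning_block: <[{start_id}]> \"\\n\" think_text \"\\n\" <[{end_id}]> \"\\n\"\nthink_text: {body}\n"
    ))
}
```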

4. Grammar-Only Completion Endpoint

New /v1/grammar POST endpoint allows clients to submit Lark/Regex/JSON Schema/Choice grammars directly:

  • Grammar type specification: Explicit type declaration (lark, regex, json_schema, choice)
  • Content validation: Grammar content is parsed and validated before use
  • Model-specific token handling: Automatic adaptation to model's special token vocabulary
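A minimal sketch of the type and content checks for such a request, assuming a two-field shape. The four type strings come from the list above; the field names, the `parse_grammar_request` function, and the error messages are hypothetical and not the PR's actual GrammarRequest definition.

```rust
/// Grammar kinds accepted by the endpoint, per the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GrammarType {
    Lark,
    Regex,
    JsonSchema,
    Choice,
}

/// Hypothetical request shape: an explicit type plus the raw grammar content.
#[derive(Debug, PartialEq)]
pub struct GrammarRequest {
    pub grammar_type: GrammarType,
    pub content: String,
}

/// Validates the declared type and rejects empty content before any parsing happens.
pub fn parse_grammar_request(grammar_type: &str, content: &str) -> Result<GrammarRequest, String> {
    let grammar_type = match grammar_type {
        "lark" => GrammarType::Lark,
        "regex" => GrammarType::Regex,
        "json_schema" => GrammarType::JsonSchema,
        "choice" => GrammarType::Choice,
        other => return Err(format!("unsupported grammar type: {other:?}")),
    };
    if content.trim().is_empty() {
        return Err("grammar content must not be empty".to_string());
    }
    Ok(GrammarRequest {
        grammar_type,
        content: content.to_string(),
    })
}
```

A real implementation would go on to hand the content to the grammar parser; this sketch stops at the cheap checks that can reject a request early.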

System-Level Changes

1. Enhanced Token Infrastructure

  • BOS token support: Added bos_token_ids to GuidanceTokens for proper beginning-of-sequence handling
  • Tool call token pairs: Recognition of multiple tool call marker formats across different model families
  • Reasoning token pairs: Multi-token support for reasoning start/end markers (e.g. <think> and </think>)
  • BOS token collection: Automatic detection of BOS tokens from tokenizer vocabulary

2. Grammar Composition Engine

  • Multi-grammar composition: compose_grammars now handles constraint + tool + reasoning combinations
  • Explicit constraint-tool alternation: Grammar rules like start: (text | tool_call)+ for flexible output
  • Reasoning block wrapping: Proper sequencing of reasoning followed by constraint or tool output
  • EOS termination handling: Automatic EOS token addition to all generated grammars
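The composition cases can be sketched as a function that picks the start rule. This is an illustration under assumptions, not compose_grammars itself: the `start_rule` helper and the `constraint` rule name are hypothetical, while the `( text | tool_call )+ eos` and `reasoning_block` shapes follow the GRAMMAR logs posted later in this thread.

```rust
/// Chooses a Lark-style start rule from the pieces present in a request.
/// Returns None when nothing constrains the output (unconstrained generation).
pub fn start_rule(has_constraint: bool, has_tool: bool, has_reasoning: bool) -> Option<String> {
    let body = match (has_constraint, has_tool) {
        // Constraint and tools: alternate between constrained output and tool calls.
        (true, true) => "( constraint | tool_call )+",
        // Constraint only.
        (true, false) => "constraint",
        // Tools only: free text interleaved with tool calls.
        (false, true) => "( text | tool_call )+",
        // Reasoning only: thinking block followed by free text (an assumption).
        (false, false) if has_reasoning => "text",
        (false, false) => return None,
    };
    // Reasoning, when present, is sequenced before the inner grammar.
    let prefix = if has_reasoning { "reasoning_block " } else { "" };
    // Every generated grammar is terminated by the eos rule.
    Some(format!("start: {prefix}{body} eos"))
}
```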

3. Security Validation Layer

  • Per-request constraint validation: Each request validates allow_constraint_api flag before processing
  • Graceful degradation: When constraints are disabled, system logs warning and returns unconstrained output
  • ASCII sanitization: Non-ASCII characters filtered from tool call markers to prevent injection
  • Grammar content validation: Empty or invalid grammar content rejected with descriptive error messages
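The ASCII sanitization step might look like the following sketch. The `sanitize_marker` name and the choice to reject markers that end up empty are assumptions made for illustration.

```rust
/// Drops every non-ASCII character from a tool call marker.
/// Returns None if nothing is left, so an all-non-ASCII marker is rejected
/// instead of silently becoming an empty literal in the grammar.
pub fn sanitize_marker(marker: &str) -> Option<String> {
    let cleaned: String = marker.chars().filter(|c| c.is_ascii()).collect();
    if cleaned.is_empty() {
        None
    } else {
        Some(cleaned)
    }
}
```

For example, a marker carrying a zero-width space (U+200B) inside `<tool_call>` comes back as plain `<tool_call>`.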

4. Fallback Mechanisms

  • Token-based to text-based tool markers: When special token IDs are unavailable, falls back to string literal tags
  • Parser name detection: Automatic parser selection based on model type (StreamToolParser)
  • JSON schema sanitization: Schema values sanitized for llguidance compatibility before grammar generation
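A sketch of the token-to-text marker fallback, assuming markers are rendered either as `<[id]>` token references (as in the GRAMMAR logs in this thread) or as quoted string literals. The helper names `marker_term`, `escape_literal`, and `tool_call_rule` are hypothetical.

```rust
/// Escapes a string so it can sit inside a double-quoted grammar literal.
fn escape_literal(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            _ => out.push(c),
        }
    }
    out
}

/// Prefers the special token ID; falls back to the literal tag text when no ID is known.
pub fn marker_term(token_id: Option<u32>, text_tag: &str) -> String {
    match token_id {
        Some(id) => format!("<[{id}]>"),
        None => format!("\"{}\"", escape_literal(text_tag)),
    }
}

/// Builds the text-native tool_call rule from a (token id, text tag) pair per marker.
pub fn tool_call_rule(start: (Option<u32>, &str), end: (Option<u32>, &str)) -> String {
    format!(
        "tool_call: {} text {}",
        marker_term(start.0, start.1),
        marker_term(end.0, end.1)
    )
}
```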

5. API Contract Extensions

  • ChatCompletionRequest enhancements: Added structured_outputs, constraint, constraint_type, reasoning_effort, tool_choice, extra_body fields
  • GrammarRequest struct: Dedicated request type for /v1/grammar endpoint with explicit grammar type/content separation
  • GrammarResponse struct: Response type for grammar completions with metadata about applied constraints

sequenceDiagram
    participant Client
    participant Server as src/server/mod.rs
    participant API as src/api.rs
    participant Guidance as src/utils/guidance.rs
    participant Engine as src/core/engine.rs
    participant Runner as src/core/runner.rs

    Client->>Server: POST /v1/chat/completions
    Client->>Server: POST /v1/grammar (new)
    
    Note over Server: Parse request & extract flags
    Server->>Server: Check enable_tool_grammar CLI flag
    Server->>Server: Check allow_constraint_api CLI flag
    
    alt Constraint API Disabled
        Server->>Server: Log warning: "constraint ignored: allow_constraint_api=false"
        Server->>API: Parse request without constraints
    else Constraint API Enabled
        Server->>API: extract_guidance_tokens(tokenizer, eos_ids, bos_ids)
        API->>Guidance: parse_grammar_from_chat_request(request, allow_constraint_api)
        
        Note over Guidance: Validate constraint fields
        Guidance->>Guidance: Check structured_outputs.choice/regex/json/grammar/structural_tag
        Guidance->>Guidance: Check response_format.json_schema/json_object
        Guidance->>Guidance: Check legacy constraint field
        
        alt Valid Constraint Found
            Guidance->>Guidance: Build TopLevelGrammar from constraint
            Guidance-->>API: Some(grammar)
        else No Constraint
            Guidance-->>API: None
        end
    end
    
    API->>Guidance: generate_grammar_from_request(request, guidance_tokens, enable_tool_grammar, allow_constraint_api, model_type)
    
    Note over Guidance: Dual grammar generation path
    alt enable_tool_grammar=true
        Guidance->>Guidance: Build XML tool grammar (build_xml_tool_grammar_for_parser)
        Guidance->>Guidance: Generate schema-based tool grammar with %json
    else allow_constraint_api=true
        Guidance->>Guidance: Build fallback tool envelope (build_fallback_tool_envelope_grammar)
        Guidance->>Guidance: Generate text-tagged tool grammar with start/end tags
    else Neither enabled
        Guidance->>Guidance: Return None for tool grammar
    end
    
    alt Reasoning Effort Specified
        Guidance->>Guidance: Check reasoning tokens available (reasoning_start_ids, reasoning_end_ids)
        Guidance->>Guidance: Generate reasoning grammar based on effort level
        Note over Guidance: None/Low/Medium/High/ChainOfThought
    end
    
    Guidance->>Guidance: compose_grammars(constraint_grammars, tool_grammar, tool_choice_required, forced_tool_name, max_tokens, guidance_tokens, reasoning_effort)
    
    Note over Guidance: Grammar composition logic
    Guidance->>Guidance: GrammarComposerBuilder.build(guidance_tokens)
    
    alt No constraints, no tools, no reasoning
        Guidance-->>Engine: None (unconstrained generation)
    else Constraint only
        Guidance->>Guidance: GrammarComposers::Constraint(constraint_gram)
        Guidance-->>Engine: constraint grammar
    else Tools only
        Guidance->>Guidance: GrammarComposers::Tool(tool_gram)
        Guidance-->>Engine: tool grammar
    else Constraint + Tools
        Guidance->>Guidance: GrammarComposers::ConstraintOrTool(constraint, tool)
        Guidance-->>Engine: (constraint | tool)+
    else Reasoning + Constraint
        Guidance->>Guidance: GrammarComposers::WithReasoning(reasoning, inner)
        Guidance-->>Engine: reasoning_block -> inner grammar
    end
    
    Engine->>Engine: create_engine(config, bos_token_id, guidance_tokens)
    Engine->>Engine: Extract BOS tokens from tokenizer if not provided
    Engine->>Runner: SamplingParams { grammar: final_grammar, reasoning_effort, ... }
    
    Runner->>Runner: sample_with_grammar(prompt_tokens, grammar, eos_token_ids)
    Note over Runner: llguidance parser factory creates matcher
    
    loop Generation
        Runner->>Runner: Token generation with grammar constraints
        Runner->>Runner: EOS token detection
        alt Reasoning grammar detected
            Runner->>Runner: Extract reasoning_block content
            Runner->>Runner: Validate reasoning tokens match start/end markers
        end
        alt Tool call detected
            Runner->>Runner: Parse tool_call content via StreamToolParser
            Runner->>Runner: Extract function name and arguments
        end
        Runner->>Server: Stream token response
        Server->>Client: SSE stream chunk
    end
    
    Runner->>Runner: Prefix cache lookup (if --prefix-cache)
    Runner->>Runner: KV cache management (if --pd-server)

@sempervictus sempervictus marked this pull request as draft March 15, 2026 16:02
@sempervictus

sempervictus commented Mar 15, 2026

Contributor Author

@guoqingbao - there's a grammar playground in examples while this is draft and it spits GRAMMAR lines into the log when they are generated for validation. Even if the rest is ready we probably want to remove those before merge (if you decide to merge while i'm in transit for GTC).

Current state works correctly even with XML generation for the 0.8B Qwen3.5, which makes it a tool-call monster :-).

vllm-rs-svc2  | 2026-03-15T15:53:52.644891Z  INFO vllm_rs::utils::guidance: GRAMMAR:
vllm-rs-svc2  | start: ( text | tool_call )+ eos
vllm-rs-svc2  | text: /(?s:.*)/
vllm-rs-svc2  | tool_call: <[151657]> "\n" tool_content <[151658]>
vllm-rs-svc2  | value_0_0: /[ -~]*?/
vllm-rs-svc2  | param_0_0: "<parameter=url>" "\n" value_0_0 "\n" "</parameter>" "\n" 
vllm-rs-svc2  | value_0_1: /[ -~]*?/
vllm-rs-svc2  | param_0_1: "<parameter=proxy>" "\n" value_0_1 "\n" "</parameter>" "\n" 
vllm-rs-svc2  | tool_0: "<function=fetch_url_via_curl>" "\n" param_0_0 (param_0_1)? "</function>" "\n" 
vllm-rs-svc2  | value_1_0: /[ -~]*?/
vllm-rs-svc2  | param_1_0: "<parameter=path>" "\n" value_1_0 "\n" "</parameter>" "\n" 
vllm-rs-svc2  | tool_1: "<function=fs_cat>" "\n" param_1_0 "</function>" "\n" 
vllm-rs-svc2  | value_2_0: /[ -~]*?/
vllm-rs-svc2  | param_2_0: "<parameter=path>" "\n" value_2_0 "\n" "</parameter>" "\n" 
vllm-rs-svc2  | tool_2: "<function=fs_ls>" "\n" param_2_0 "</function>" "\n" 
vllm-rs-svc2  | tool_3: "<function=get_current_time>" "\n"  "</function>" "\n" 
vllm-rs-svc2  | value_4_0: /[ -~]*?/
vllm-rs-svc2  | param_4_0: "<parameter=query>" "\n" value_4_0 "\n" "</parameter>" "\n" 
vllm-rs-svc2  | value_4_1: /[ -~]*?/
vllm-rs-svc2  | param_4_1: "<parameter=searxng>" "\n" value_4_1 "\n" "</parameter>" "\n" 
vllm-rs-svc2  | tool_4: "<function=web_search_searxng>" "\n" param_4_0 (param_4_1)? "</function>" "\n" 
vllm-rs-svc2  | tool_content: tool_0 | tool_1 | tool_2 | tool_3 | tool_4
vllm-rs-svc2  | eos: <[248046]>
vllm-rs-svc2  | 

Turns out the EOS thing was killing us with multimodals, but secondly on XML we have to have the pedantic literals for "\n" in the same places as the chat template has them. Ideally we would actually EXTRACT the chat template into the grammar builder so this never needs any touch-up in the future, but the minijinja AST is not available and nom parsing is a bit... crude. All ears for ideas on that one.
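To make the "\n" placement concrete, here is a rough sketch of generating the per-tool rules in the shape of the log above. The `ToolSpec` type and `xml_tool_rules` function are hypothetical and are not the PR's build_xml_tool_grammar_for_parser; the literal positions are copied from the logged grammar.

```rust
/// Minimal tool description for this sketch (hypothetical type).
pub struct ToolSpec<'a> {
    pub name: &'a str,
    /// (parameter name, required?)
    pub params: &'a [(&'a str, bool)],
}

/// Emits value_*/param_*/tool_* rules plus the tool_content alternation.
/// Every "\n" literal sits where the logged grammar places it.
pub fn xml_tool_rules(tools: &[ToolSpec<'_>]) -> String {
    let mut out = String::new();
    for (t, tool) in tools.iter().enumerate() {
        let mut body = String::new();
        for (p, (pname, required)) in tool.params.iter().enumerate() {
            // Printable-ASCII value, lazily matched, as in the log.
            out.push_str(&format!("value_{t}_{p}: /[ -~]*?/\n"));
            out.push_str(&format!(
                "param_{t}_{p}: \"<parameter={pname}>\" \"\\n\" value_{t}_{p} \"\\n\" \"</parameter>\" \"\\n\"\n"
            ));
            // Optional parameters are wrapped in ( ... )?.
            if *required {
                body.push_str(&format!("param_{t}_{p} "));
            } else {
                body.push_str(&format!("(param_{t}_{p})? "));
            }
        }
        out.push_str(&format!(
            "tool_{t}: \"<function={}>\" \"\\n\" {body}\"</function>\" \"\\n\"\n",
            tool.name
        ));
    }
    let alts: Vec<String> = (0..tools.len()).map(|t| format!("tool_{t}")).collect();
    out.push_str(&format!("tool_content: {}\n", alts.join(" | ")));
    out
}
```

Keeping the newline literals in one format string per rule means a chat-template change only needs touching in one place, which is the maintenance concern raised above.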

Re: EOS - i would like to convert Config to have the multi-eos variant for multimodals instead of overwriting the one from tokenizer_config with the string one entirely, but i don't want to break any fragile code. Any thoughts on this approach instead of what i'm doing here to populate GuidanceTokens with the correct EOS (that ID is <|im_end|>, without which the model will generate forever as the chat template becomes unbounded)?

@sempervictus

Contributor Author

The last two commits are in testing now. Apologies for the branch depth, will compress it all to one commit before merge - on the road for a bit so working remote and using GH as a pivot.

@guoqingbao

Owner

Are you able to rebase on main? PR #262 has been merged.

@sempervictus

sempervictus commented Mar 16, 2026

Contributor Author

Done sir. Will see how much I can get done on plane.
Any thoughts on the idea of adding the eos for mm models to the EosTokenId type changing it from singular to multiple but allowing both to be used in grammar bounding and also filtering both from emission back to user? Technically I think the original eos token being pulled from tokenizerconfig is actually an end of message token (<|im_end|>) but since LLMs only produce text it seems a bit misnamed in the industry - the one you're getting out of the mm config is much more aptly an eos but unfortunately doesn't cap grammar bounds because the model needs to emit the EOM whether it elects to emit a true EOS or not.
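A sketch of what the single-or-multiple idea could look like. The type below only borrows the EosTokenId name and a merge_dedup-style method for illustration; the signatures, the `is_stop` filter, and the `eos_rule` renderer are assumptions and not the crate's actual definitions.

```rust
/// Illustrative stand-in for an EOS type that holds one or several stop IDs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum EosTokenId {
    Single(u32),
    Multiple(Vec<u32>),
}

impl EosTokenId {
    /// All stop IDs in order, without duplicates.
    pub fn ids(&self) -> Vec<u32> {
        let raw: Vec<u32> = match self {
            EosTokenId::Single(id) => vec![*id],
            EosTokenId::Multiple(ids) => ids.clone(),
        };
        let mut out = Vec::with_capacity(raw.len());
        for id in raw {
            if !out.contains(&id) {
                out.push(id);
            }
        }
        out
    }

    /// Merges extra IDs (e.g. the tokenizer's <|im_end|>) into the config's set.
    pub fn merge_dedup(&self, extra: &[u32]) -> EosTokenId {
        let mut ids = self.ids();
        for id in extra {
            if !ids.contains(id) {
                ids.push(*id);
            }
        }
        if ids.len() == 1 {
            EosTokenId::Single(ids[0])
        } else {
            EosTokenId::Multiple(ids)
        }
    }

    /// True if the token should end generation and be filtered from user output.
    pub fn is_stop(&self, token: u32) -> bool {
        self.ids().contains(&token)
    }

    /// Renders the grammar's eos rule as an alternation over all stop IDs.
    pub fn eos_rule(&self) -> String {
        let terms: Vec<String> = self.ids().iter().map(|id| format!("<[{id}]>")).collect();
        format!("eos: ( {} )", terms.join(" | "))
    }
}
```

With both IDs in one value, the same set can bound the grammar and drive the output filter, so the end-of-message and end-of-sequence tokens are treated consistently.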

@sempervictus

Contributor Author

@guoqingbao if you have a couple of minutes, mind looking at the last commit? Trying to be graceful with this and not break any of your logic which might be opaque to me.

Review Summary

This change refactors EOS token ID handling in multimodal configurations. The modifications:

  1. Simplify extract_guidance_tokens() call in LLMEngine::new() by removing redundant intermediate variables
  2. Add logic in init_config_tokenizer() to merge tokenizer's EOS token with config's EOS token IDs using the existing EosTokenId::merge_dedup() method

Overall, this improves code consistency and ensures both tokenizer and config EOS tokens are properly handled for multimodal models.

Issues Found

Severity | File:Line | Issue
--- | --- | ---
SUGGESTION | vllm.rs/src/utils/mod.rs:683-704 | The comment "For multimodal models, merge tokenizer's eos_token string to token IDs" is misleading since this code runs for all model types when config_tokenizer.eos_token is present

Detailed Findings

File: vllm.rs/src/core/engine.rs:130-137

  • Confidence: 95%
  • Problem: The original code was more complex with unnecessary intermediate variables (guidance_eos_ids, guidance_tokens) that have been properly simplified
  • Suggestion: No changes needed - the refactoring improves readability

File: vllm.rs/src/utils/mod.rs:683-704

  • Confidence: 80%
  • Problem: The comment states "For multimodal models" but the code executes whenever config_tokenizer.eos_token is present, not just for multimodal models. This could confuse readers about when this logic runs
  • Suggestion: Update comment to accurately reflect the scope: "Merge tokenizer's eos_token string to token IDs to ensure EOSTOKENIDS includes tokens from both tokenizer and config"

Recommendation

APPROVE WITH SUGGESTIONS - The core logic changes are sound and improve code maintainability. Only the misleading comment needs clarification.

@sempervictus

Contributor Author

Apologies for any compile issues, I can only run build tests on cuda GPU systems now due to the attentionrs change and haven't had time to dig into fixing that while at gtc (the netbook I have w me is igpu only and we don't support mkl yet)

@guoqingbao

Owner

Apologies for any compile issues, I can only run build tests on cuda GPU systems now due to the attentionrs change and haven't had time to dig into fixing that while at gtc (the netbook I have w me is igpu only and we don't support mkl yet)

No hurry, this is an optional feature.

@sempervictus

Contributor Author

Agreed although having both openai and anthropic capabilities + user supplied chain of thought mechanisms opens the doors to some awesome appdev options both using vllmrs as a standalone bin and even moreso as a lib. One example being running a preprocessing micromodel to define grammars for the big one actually doing the work in a single runtime, another being long context w small models on vram constrained systems.

That said, it may be handy to cherry pick the eos commit while I'm at gtc and slower than usual to complete the PR - I'm betting that's why we see those run-on problems w newer qwens on older gear: they're grabbing EOS vs EOM, which is also why we occasionally see the EOM rendered in their output as it's not being masked. Something about naive attention seems to allow them to omit like that, but this should make both EOS and EOM work. I need to understand attention computation product better to prove that, but the behavior is 1:1 with what happened here before I switched from the multimodal EOS single type to the EOM ID or now multi type.

@sempervictus sempervictus force-pushed the grammars/pr branch 2 times, most recently from 90e39de to 443f552 Compare March 22, 2026 05:20
@sempervictus

Contributor Author

@guoqingbao - some decent progress :-)

  1. dug into the jinja templates for these coder-type models a bit and it turns out they can use %json constraint types in their value fields... finite stateless grammar win:
vllm-rs-svc2  | 2026-03-22T05:28:19.885268Z  INFO vllm_rs::utils::guidance: GRAMMAR:
vllm-rs-svc2  | start: ( text | tool_call )+ eos
vllm-rs-svc2  | tool_content: tool_0 ? tool_1 ? tool_2 ? tool_3 ? tool_4
vllm-rs-svc2  | tool_0: "<function=fetch_url_via_curl>" "\n" param_0_0 (param_0_1)? "</function>" "\n"
vllm-rs-svc2  | tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
vllm-rs-svc2  | tool_1: "<function=fs_cat>" "\n" param_1_0 "</function>" "\n"
vllm-rs-svc2  | tool_2: "<function=fs_ls>" "\n" param_2_0 "</function>" "\n"
vllm-rs-svc2  | tool_3: "<function=get_current_time>" "\n"  "</function>" "\n"
vllm-rs-svc2  | tool_4: "<function=web_search_searxng>" "\n" param_4_0 (param_4_1)? "</function>" "\n"
vllm-rs-svc2  | param_0_0: "<parameter=url>" "\n" value_0_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_0_1: "<parameter=proxy>" "\n" value_0_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_1_0: "<parameter=path>" "\n" value_1_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_2_0: "<parameter=path>" "\n" value_2_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_0: "<parameter=query>" "\n" value_4_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | param_4_1: "<parameter=searxng>" "\n" value_4_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2  | text: /(?s:.+?)/
vllm-rs-svc2  | value_0_0_string: %json {"type":"string","description":"The URL to scrape."}
vllm-rs-svc2  | value_0_1_string: %json {"type":"string","description":"The proxy URL in the format protocol://host:port."}
vllm-rs-svc2  | value_1_0_string: %json {"type":"string","description":"The path of the file to read"}
vllm-rs-svc2  | value_2_0_string: %json {"type":"string","description":"The path of the directory to list"}
vllm-rs-svc2  | value_4_0_string: %json {"type":"string","description":"The query to search for."}
vllm-rs-svc2  | value_4_1_string: %json {"type":"string","description":"Optional searxng URL overriding the env var"}
vllm-rs-svc2  | eos:  ( <[248044]> | <[248046]> )
  2. LLG updated to help handle the new multi-EOS models
  3. fallback grammar generation for partial enablement is in testing to include the custom types.
  4. the LLG-native grammar and constraint types are properly isolated in generation
  5. i ended up having to derive BOS for a tiny gemma but i think that had more to do with quantizing it down to 4b so while the functionality is there, it's currently "just in case" we find other models getting shaky at low bits and lots of early ff-tokens

@sempervictus

sempervictus commented Mar 23, 2026

Contributor Author

All three variants of grammar-induced tool-calls are working for Q3.5/Next from 0.8->122, tested on the older Q3 0.6 too with excellent results (not the XML type obviously for that model):

  1. xml-explicit - allows multi-tool calls via tool_content: tool_0 ? tool_1 ? tool_2 ? tool_3 ? tool_4
 start: ( text | tool_call )+ eos
 tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
 tool_content: tool_0 ? tool_1 ? tool_2 ? tool_3 ? tool_4
 tool_0: "<function=fetch_url_via_curl>"  "\n" param_0_0 ( "\n" param_0_1)? "</function>" "\n"
 tool_1: "<function=fs_cat>"  "\n" param_1_0 "</function>" "\n"
 tool_2: "<function=fs_ls>"  "\n" param_2_0 "</function>" "\n"
 tool_3: "<function=get_current_time>"  "</function>" "\n"
 tool_4: "<function=web_search_searxng>"  "\n" param_4_0 ( "\n" param_4_1)? "</function>" "\n"
 param_0_0: "<parameter=url>" "\n" value_0_0_string "\n" "</parameter>" "\n"
 param_0_1: "<parameter=proxy>" "\n" value_0_1_string "\n" "</parameter>" "\n"
 param_1_0: "<parameter=path>" "\n" value_1_0_string "\n" "</parameter>" "\n"
 param_2_0: "<parameter=path>" "\n" value_2_0_string "\n" "</parameter>" "\n"
 param_4_0: "<parameter=query>" "\n" value_4_0_string "\n" "</parameter>" "\n"
 param_4_1: "<parameter=searxng>" "\n" value_4_1_string "\n" "</parameter>" "\n"
 text: /(?s:.+?)/
 value_0_0_string: %json {"type":"string","description":"The URL to scrape."}
 value_0_1_string: %json {"type":"string","description":"The proxy URL in the format protocol://host:port."}
 value_1_0_string: %json {"type":"string","description":"The path of the file to read"}
 value_2_0_string: %json {"type":"string","description":"The path of the directory to list"}
 value_4_0_string: %json {"type":"string","description":"The query to search for."}
 value_4_1_string: %json {"type":"string","description":"Optional searxng URL overriding the env var"}
 eos:  ( <[248046]> <[248044]> )
 
  2. json-explicit - allows multi-tool calls via anyOf
 start: ( text | tool_call )+ eos
 tool_call: <[248058]> tool_content <[248059]> ("\n")?
 tool_content: %json {"anyOf":[{"type":"object","properties":{"name":{"type":"string","enum":["fetch_url_via_curl"]},"arguments":{"type":"object","properties":{"url":{"type":"string","description":"The URL to scrape."},"proxy":{"type":"string","description":"The proxy URL in the format protocol://host:port."}},"required":["url"]}},"required":["name","arguments"],"additionalProperties":false},{"type":"object","properties":{"name":{"type":"string","enum":["fs_cat"]},"arguments":{"type":"object","properties":{"path":{"type":"string","description":"The path of the file to read"}},"required":["path"]}},"required":["name","arguments"],"additionalProperties":false},{"type":"object","properties":{"name":{"type":"string","enum":["fs_ls"]},"arguments":{"type":"object","properties":{"path":{"type":"string","description":"The path of the directory to list"}},"required":["path"]}},"required":["name","arguments"],"additionalProperties":false},{"type":"object","properties":{"name":{"type":"string","enum":["get_current_time"]},"arguments":{"type":"object","properties":{},"required":[]}},"required":["name","arguments"],"additionalProperties":false},{"type":"object","properties":{"name":{"type":"string","enum":["web_search_searxng"]},"arguments":{"type":"object","properties":{"query":{"type":"string","description":"The query to search for."},"searxng":{"type":"string","description":"Optional searxng URL overriding the env var"}},"required":["query"]}},"required":["name","arguments"],"additionalProperties":false}]}
 text: /(?s:.+?)/
 eos:  ( <[248046]> <[248044]> )
 
  3. text-native - if explicit tool grammar generation is not enabled but other types are, we need this to ensure the model can call tools, as the text grammar doesn't include non-printable added-vocab tokens. Not as precise as explicit generation but avoids a potential injection route:
 start: ( text | tool_call )+ eos
 tool_call: <[248058]> text <[248059]>
 text: /(?s:.+?)/
 eos:  ( <[248044]> <[248046]> )
  4. No grammar if none of the CLI flags are set.

Functionally-speaking this is a huge help for small models or non-flash attention models (V100s running the various 27/35B 3.5s or the 80B coder seem very happy about this). The LLG 1.7 bump helps but it still seems preferable to use the --enforce-parser qwen option when using this for the coder or other big models due to the complexity of their tool-call workloads and the evil which is XML. I'll go over the lark generation again with a fine-tooth comb but i think i'm missing a "\n" in one of the optional generation paths.

Will push cleanup code shortly and collapse the commit for review.

@sempervictus sempervictus marked this pull request as ready for review March 23, 2026 16:06
@sempervictus

Contributor Author

@guoqingbao apologies i didn't take the draft tag off last night. I think this is at a good place for a review boundary. You use different clients and my biggest question about the logic of this state is "how do they respond to the multi-tool-call logic?" since this permits both JSON and XML models to put multiple calls in one <tool_call>...</tool_call> block for efficiency, but i'm not sure all clients handle that well, so we might want to fall back to the per-call (oneOf in JSON and | join in XML) pattern to simply allow multiple calls per turn as before instead of multiple function calls per tool-call.

Propose splitting-out the following task areas to future PRs:

  1. Ingress "firewall" for special tokens and template-role-flipping injections - pretty much a "re-tokenizer" for inputs to catch <[token_id]> injections in grammars and <|im_end|><|im_start|>role\n style injections in prompts
  2. Chat template control - all of this "works" because we force generation along certain patterns but in doing so we may be masking-away some valuable logic the model might want to emit at those offsets in its chat template. Deep reasoning blocks, re-entrant tool blocks (<tool_response> markup), and other useful faculties may be out of reach without idiomatic understanding of template sections and their iterable or conditional nature.
    2a. jinja2json library seems to provide us AST-style access or at least idiomatic JSON access to anchor blocks. minijinja itself keeps the AST private. The more idiomatic access we have for string and token anchor extraction the more we can align grammar definition to what the model was trained to emit or at least what we tell it that it should emit when we pass the per-sequence template in for generation.
    2b. The code in this PR guards reasoning if no reasoning token strings can be found in the chat template. Models like Q3CoderNext which can think but don't have reasoning in their chat template can produce unbounded rambling in that thinking block which runs the sequence forever... if we were to inject a reasoning block with the tags we see it using on occasion then we may be able to enable reasoning for more models than is currently safe. Such injections could open the doors to other specialized function or reasoning blocks for enterprise workflows to use.
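The ingress "firewall" proposed in item 1 above could start from checks like these. This is a minimal sketch under assumptions: the function names are hypothetical, the `<[...]>` scan accepts digits, commas, hyphens, and spaces as a guess at what a token-ID reference may contain, and only the two ChatML-style markers quoted in item 1 are checked.

```rust
/// Detects a `<[...]>` token-ID reference inside client-supplied grammar text.
pub fn contains_token_id_literal(grammar: &str) -> bool {
    let mut rest = grammar;
    while let Some(pos) = rest.find("<[") {
        let after = &rest[pos + 2..];
        // Length of the leading run of characters that could form an ID list.
        let span = after
            .bytes()
            .take_while(|&b| b.is_ascii_digit() || b == b',' || b == b'-' || b == b' ')
            .count();
        let has_digit = after[..span].bytes().any(|b| b.is_ascii_digit());
        if has_digit && after[span..].starts_with("]>") {
            return true;
        }
        // Not a token reference; keep scanning after this "<[".
        rest = after;
    }
    false
}

/// Detects role-flipping markers inside user-supplied prompt text.
/// Only the ChatML-style markers mentioned above are covered; other templates differ.
pub fn contains_role_marker(prompt: &str) -> bool {
    ["<|im_start|>", "<|im_end|>"]
        .iter()
        .any(|marker| prompt.contains(*marker))
}
```

String scanning like this only catches literal spellings; the "re-tokenizer" idea above would additionally catch inputs that tokenize to the same special IDs by another route.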

In terms of this work: the grammar control seems most useful on small models with any attention mechanism, any models not using flash* attention, and 50/50 on larger models with flash* - they don't make tool call errors nearly as much but they seem more prone to long tool calls this way, emitting all parameters, whereas the tiny ones seem to just want to get it right with minimal output and move on. Larger models appear to do better on the qwen style parser, smaller ones equally good but oddly faster on the XML one if that's their native format.

@sempervictus sempervictus force-pushed the grammars/pr branch 2 times, most recently from 0b55df7 to 68554aa Compare March 26, 2026 05:24
@guoqingbao

Owner

In terms of this work: the grammar control seems most useful on small models with any attention mechanism, any models not using flash* attention, and 50/50 on larger models with flash* - they don't make tool call errors nearly as much but they seem more prone to long tool calls this way, emitting all parameters, whereas the tiny ones seem to just want to get it right with minimal output and move on. Larger models appear to do better on the qwen style parser, smaller ones equally good but oddly faster on the XML one if that's their native format.

I’ll take this on once the decoding cache mismatch issue is resolved.

P.S. I’m also casually working on a bot project in Rust. I can loop you in for initial usage if you’re interested.

@sempervictus

sempervictus commented Mar 27, 2026

Contributor Author

Thanks, this is a big one to QA/manage and we absolutely want to align caches with what we generate here because we have full output control and might actually be able to simplify caching strategies using this once they prove solid with all consumers.

@sempervictus

Contributor Author

P.S. I’m also casually working on a bot project in Rust. I can loop you in for initial usage if you’re interested.

of course - i'm doing same for that coreui workbench, next will be wiring sandboxing and stakpak/n8n-rs/etc into it for our engineers. Will add you to the repo soon as i push it as it will come with a --feature vllm.rs option to build this library into it to run the utility/dispatch models, do local image parsing, etc or just run the full LLM-workbench stack on a single system from a single binary. The idea is that the utility model will induce grammar based generation from the big model (also part of why i want to get draft decoding working for throughput).

@sempervictus

Contributor Author

@guoqingbao you should see an invite to a debugging tool repo for this PR - it exercises the right portions of the internal pipeline but might be of help w #277 / #279 too

@sempervictus

sempervictus commented Mar 29, 2026

Contributor Author

@guoqingbao - a little bit of "hackery" on what we send to the tokenizer seems to do the trick:

$ curl -s http://127.0.0.1:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Evaluate the purpose of your execution"}],
    "thinking": true,
    "reasoning_effort": "high",
    "max_tokens": 2000
  }'|jq -r '.choices[0].message.content'

<think>
<ANALYZE>
User wants me to evaluate the purpose of my execution. This is a meta-reflection question. I need to determine what I'm doing and what I'm learning.

</ANALYZE>

<CRITIQUE>
The user is asking about the purpose of my execution. This is a meta-reflection question. I need to determine what I'm doing and what I'm learning.

</CRITIQUE>

<STRUCTUREDANSWER>
I'll provide a structured response that covers:

</STRUCTUREDANSWER>
</think>

My execution is designed to be efficient, accurate, and beneficial for users. It focuses on clarity, accuracy, and helpfulness while avoiding unnecessary complexity or confusion. My goal is to offer high-quality assistance without overwhelming users with excessive detail or technical jargon.

Purpose of Execution:

  1. Clarity & Simplicity – Delivering concise, understandable content that meets the user's needs efficiently.
  2. Accuracy & Verification – Ensuring information is correct before sharing it.
  3. Efficiency – Avoiding redundancy and focusing on core value propositions.
  4. User Experience – Designing interfaces that are intuitive, easy to navigate, and aligned with user goals.
  5. Engagement & Value – Providing meaningful insights while maintaining a balanced tone.

Strategic Alignment:

  • User-Centric: Every decision prioritizes the user experience.
  • Consistency: Maintains quality across all interactions.
  • Ethical Standards: Adheres to guidelines for helpfulness and transparency.

This approach ensures my execution remains both high-quality and user-friendly, enabling effective communication and engagement.

The grammar probably needs a bit more adjustment to make that more useful, but the building blocks all work correctly now.

@sempervictus

sempervictus commented Mar 29, 2026

Contributor Author

@guoqingbao here is the current code, stripping start-think-tag, working correctly on a 122B q35 with prefix cache matching:

...
2026-03-29T15:18:49.491680Z  WARN vllm_rs::server::server: Tools enabled for request
2026-03-29T15:18:49.492814Z  INFO vllm_rs::core::engine: [llg] Guidance enabled, trimming <think> from pre-generation. Generation starting at <|im_start|>assistant
2026-03-29T15:18:49.543518Z  WARN vllm_rs::core::engine: [Stream] New request [Seq_id 2, 30149 tokens] received! (session_id: None)

2026-03-29T15:18:49.543594Z  INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3.5-122B-A10B-FP8, enforce_parser=none)
2026-03-29T15:18:49.544224Z  INFO vllm_rs::core::block_manager: Prefix cache hit seq 2 (26304 cached tokens, 411 blocks)
2026-03-29T15:18:49.546848Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 2 (cached 26304 tokens)
...
2026-03-29T15:18:50.360038Z  INFO vllm_rs::utils::guidance: GRAMMAR:
start: reasoning_block ( text | tool_call )+ eos
reasoning_block: <[248068]> "\n" think_text "\n" <[248069]> "\n"
think_text[suffix="\n"]: /[ -~]+/
tool_call: <[248058]> text <[248059]>
text: /(?s:.+?)/
eos:  ( <[248046]> <[248044]> )

...

2026-03-29T15:18:50.371090Z  WARN vllm_rs::core::scheduler: Seq 2 - chunk prefill finished (30149 tokens)
2026-03-29T15:18:50.371113Z  INFO vllm_rs::core::engine: Prefilling [seq_id 2]: 30150 tokens in 0.86s (35098.95 tokens/s, cache included)

prefix cache keeps working through dozens of thinking+output+tool-call iterations:

2026-03-29T16:23:14.890209Z  INFO vllm_rs::server::server: [Seq 43] ⏱️ Prompt: 202405 tokens in 1.09s (185522.47 t/s)
2026-03-29T16:23:14.890217Z  INFO vllm_rs::server::server: [Seq 43] ⏱️ Decoded: 375 tokens in 8.14s (46.09 t/s)

to include proper stream-parsing of tools and reasoning tags:

2026-03-29T16:23:05.531501Z  INFO vllm_rs::core::engine: [llg] Guidance enabled, trimming <think> from pre-generation. Generation starting at <|im_start|>assistant
2026-03-29T16:23:05.692275Z  WARN vllm_rs::core::engine: [Stream] New request [Seq_id 43, 202405 tokens] received! (session_id: None)

2026-03-29T16:23:05.692366Z  INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3.5-122B-A10B-FP8, enforce_parser=none)
2026-03-29T16:23:05.694077Z  INFO vllm_rs::core::block_manager: Prefix cache hit seq 43 (199040 cached tokens, 3110 blocks)
2026-03-29T16:23:05.696627Z  INFO vllm_rs::core::runner: Restored mamba prefix state for seq 43 (cached 199040 tokens)
...

2026-03-29T16:23:06.742515Z  INFO vllm_rs::utils::guidance: GRAMMAR:
start: reasoning_block ( text | tool_call )+ eos
reasoning_block: <[248068]> "\n" think_text "\n" <[248069]> "\n"
think_text[suffix="\n"]: /[ -~]+/
tool_call: <[248058]> text <[248059]>
text: /(?s:.+?)/
eos:  ( <[248046]> <[248044]> )

2026-03-29T16:23:06.753581Z  WARN vllm_rs::core::scheduler: Seq 43 - chunk prefill finished (202405 tokens)
2026-03-29T16:23:06.753687Z  INFO vllm_rs::core::engine: Prefilling [seq_id 43]: 202406 tokens in 1.09s (185523.38 tokens/s, cache included)
2026-03-29T16:23:07.068805Z  INFO vllm_rs::server::parser: Tool call <tool_call> (248058) found, start buffering!
2026-03-29T16:23:11.765408Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [43]], avg. 46 tokens/s per request (total: 46 tokens/s)
2026-03-29T16:23:14.887214Z  INFO vllm_rs::core::scheduler: [Seq 43] Detected </tool_call> token 248059, finishing for external handling
2026-03-29T16:23:14.888253Z  INFO vllm_rs::core::block_manager: Prefix cache insert seq 43 (202780 tokens, 3168 blocks)
2026-03-29T16:23:14.889646Z  INFO vllm_rs::server::parser: Tool call buffering end, reached </tool_call> (248059)
2026-03-29T16:23:14.889808Z  INFO vllm_rs::server::parser: Building tool call: [StreamingToolCallState { name: Some("attempt_completion"), arguments: "{\"result\": \"## Implementation Plan: Reasoning Grammar Detection and Chat Template Validation\\n\\n### Concern 1: Detect Reasoning Grammars in `engine.rs::apply_template()`\\n\\n**Current State**: The code at line 1073 in `engine.rs` already has the detection logic:\\n```rust\\nif let Some(grammar) = params.grammar && is_reasoning_grammar(&grammar) {\\n    prompt_template.set_enable_thinking(true);\\n}\\n```\\n\\n**Function Location**: [`is_reasoning_grammar()`](src/utils/reasoning.rs:262) in `src/utils/reasoning.rs`\\n\\n**Detection Logic**:\\n- Converts `TopLevelGrammar` to JSON string representation\\n- Checks for reasoning-specific patterns: `reasoning_block`, `think_text`, `analysis_block`, `draft_phase`, `verification_phase`\\n\\n**Action Required**: None - this functionality already exists and is integrated.\\n\\n---\\n\\n### Concern 2: Chat Template Reasoning Tag Validation\\n\\n**Objective**: Ensure reasoning grammars are only generated when the chat template contains matching reasoning tags.\\n\\n**Current Gap**: No validation exists to check if the chat template has reasoning capability before generating reasoning grammars.\\n\\n**Proposed Solution**: Add a validation function in `src/utils/chat_template.rs`:\\n\\n```rust\\n/// Check if chat template contains reasoning-related markers\\npub fn template_supports_reasoning(template_content: &str) -> bool {\\n    // Look for common reasoning tag patterns\\n    let reasoning_markers = [\\n        \\\"<|begin▁of▁thought|>\\\",\\n        \\\"<|end▁of▁thought|>\\\",\\n        \\\"\"}" }]
2026-03-29T16:23:14.890155Z  INFO vllm_rs::tools::helpers: Valid tool call(s): attempt_completion(args={"result":"## Implementation Plan: Reasoning Grammar Detection and Chat Template Validation\n\n### Concern 1: Detect Reasoning Grammars in `engine.rs::apply_tem...)
2026-03-29T16:23:14.890187Z  INFO vllm_rs::server::server: Final chunk emitted after tool-call delta chunk(s): ChatCompletionChunk { id: "seq-43", object: "chat.completion.chunk", created: 1774801385525, model: "g60", choices: [ChatChoiceChunk { index: 0, delta: Delta { role: None, content: None, tool_calls: None }, finish_reason: Some("tool_calls"), error: None }], usage: Some(Usage { prompt_tokens: 202405, completion_tokens: 375, total_tokens: 202780 }) }
2026-03-29T16:23:14.890205Z  WARN vllm_rs::server::server: --- Performance Metrics ---
2026-03-29T16:23:14.890209Z  INFO vllm_rs::server::server: [Seq 43] ⏱️ Prompt: 202405 tokens in 1.09s (185522.47 t/s)
2026-03-29T16:23:14.890217Z  INFO vllm_rs::server::server: [Seq 43] ⏱️ Decoded: 375 tokens in 8.14s (46.09 t/s)

The resulting IO renders correctly with clients which handle reasoning tags:
image

^^ is on the 122B - i need a better guard to ensure we cannot enable thinking on models like CoderNext as models without a reasoning block in their template appear much more prone to 'runaway generation' in those blocks than reasoning-capable ones.
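
One possible shape for such a guard, as a minimal sketch rather than the code in this branch: only allow a reasoning grammar when the chat template text contains both the opening and the closing reasoning tag. The function name `template_has_reasoning_tags` and the idea of passing the tags in as arguments are assumptions made for illustration.

```rust
/// Hypothetical guard (not part of this PR): a reasoning grammar is only
/// considered safe when the chat template contains BOTH reasoning tags,
/// so the model has a trained way to close the block it is asked to open.
fn template_has_reasoning_tags(template: &str, open_tag: &str, close_tag: &str) -> bool {
    template.contains(open_tag) && template.contains(close_tag)
}

/// Decide whether thinking may be enabled for a request.
/// `requested` is whatever the client or CLI asked for (assumed input).
fn may_enable_thinking(requested: bool, template: &str) -> bool {
    requested && template_has_reasoning_tags(template, "<think>", "</think>")
}
```

With a template that never mentions `</think>`, `may_enable_thinking(true, template)` returns `false`, so the request would fall back to an unwrapped grammar instead of opening a block the model cannot close.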

This is where #279 and this PR might "collide" and why i'm trying to be extra careful in the parallel dev of this one - if i strip that start-think line but disable reasoning entirely then your code will have accounted for two more tokens in the prefix than are actually emitted. That said, same problem: if you emit a start-think-tag in the template but the model doesn't know how to end it, regardless of constraints being enabled or not it will emit nonsense after a little while because it doesn't have a lookup for the closing tag which would allow transition to normal generation through EOS.

@sempervictus

Contributor Author

That was a weird one but $defs and $ref are resolved now - turns out JSON schema can be referential 😮‍💨

@guoqingbao

Owner

Found one regression: it seems the tool grammar is only suitable for Qwen models. Here is GLM4.7 Flash, which works without tool grammar.

When tool grammar is enabled, it reports:

│  Let me use the correct tool invocation syntax:                                          ▲
│                                                                                          │
│  ✓ ⋮ {"name": ─────────────────────────────────────────────────────────────────────      │
│    │ 🛠 {"name":                                                                          │
│    │ ✗ Error: Tool '{"name":' not found                                                  │
│    │                                                                                     █
│    │   [Analyze the error above and try a different approach.]                           │
│    └───────────────────────────────────────────────────────────────────────────────      │
│                                                                                          │
│  I apologize — I'm having a technical issue with the tool invocation. Let me try readin  │
│  g the SKILL.md file directly:                                                           │
│                                                                                          │
│  ✓ ⋮ {"name": ─────────────────────────────────────────────────────────────────────      │
│    │ 🛠 {"name":                                                                          │
│    │ ✗ Error: Tool '{"name":' not found                                                  │
│    │                                                                                     │
│    │   [Analyze the error above and try a different approach.]  

RageLtMan added 5 commits June 2, 2026 16:11
This introduces a complete grammar composition system built on
llguidance that handles multiple constraint types, tool calling
styles, and reasoning effort levels through a single coherent
architecture.

Key capabilities:

Multi-format constraint grammars via Lark, regex, JSON schema,
choice lists, and structural tags - all normalized to TopLevelGrammar

Tool grammar generation for both Qwen-style JSON format and
Qwen3-Coder XML format with proper parameter ordering and
deduplication

Reasoning effort levels (none, low, medium, high,
chain_of_thought) which wrap base grammars with reasoning block
constraints

GrammarComposer.compose_all_grammars() that merges constraints,
tools, and reasoning in the correct precedence order

Thinking fallback via VLLM_RS_PROVIDE_THINKING_FALLBACK that
transforms <[token_id]> syntax to string literals for models not
trained on reasoning tokens, detecting via
chat_template.enable_thinking flag

Schema sanitization that strips unsupported format attributes from
JSON schemas before passing to llguidance

Architecture:

- GrammarBuilder trait for composable grammar fragments
- GrammarRequestDispatcher for building grammars from request
context
- GrammarComposer for merging constraint, tool, and reasoning
grammars
- apply_thinking_fallback() for models without reasoning tokens in
template

Examples:

```lark
start: "yes" | "no" eos
eos: ( <[248044]> | <[248046]> )
```

```lark
start: reasoning_block "positive" | "negative" | "neutral" eos
reasoning_block: <[248068]> "\n" think_text "\n" (think_text+ "\n")? "\n" <[248069]> "\n\n"
think_text[suffix="\n"]: /[ -~]+/
eos: ( <[248044]> | <[248046]> )
```

```lark
start: ( text | tool_call )+ eos
text: /(?s:.+?)/
tool_call: <[248058]> tool_content <[248059]>
param_0_0: "\n<parameter=url>\n" value_string
param_0_1: "\n<parameter=proxy>\n" value_string
tool_0: "\n<function=fetch_url_via_curl>" param_0_0 (param_0_1)? "</function>\n"
param_1_0: "\n<parameter=path>\n" value_string
tool_1: "\n<function=fs_cat>" param_1_0 "</function>\n"
param_2_0: "\n<parameter=path>\n" value_string
tool_2: "\n<function=fs_ls>" param_2_0 "</function>\n"
tool_3: "\n<function=get_current_time>\n" "</function>\n"
param_4_0: "\n<parameter=query>\n" value_string
param_4_1: "\n<parameter=searxng>\n" value_string
tool_4: "\n<function=web_search_searxng>" param_4_0 (param_4_1)? "</function>\n"
tool_content: tool_0 | tool_1 | tool_2 | tool_3 | tool_4
value_string[suffix="\n</parameter>\n"]: /[\x20-\x7E\x0A\x0D]+?/
eos: ( <[248044]> | <[248046]> )
```

```lark
start: reasoning_block text eos
reasoning_block: <[248068]> analysis_block critique_block structure_block "\n" <[248069]> "\n\n"
analysis_block: "\n<analysis>\n" analysis_text
analysis_text[suffix="\n</analysis>\n"]: /[\x20-\x7E\x0A\x0D]+?/
critique_block: "\n<critique>\n" critique_text
critique_text[suffix="\n</critique>\n"]: /[\x20-\x7E\x0A\x0D]+?/
structure_block: "\n<structure_response>\n" structure_text
structure_text[suffix="\n</structure_response>\n"]: /[\x20-\x7E\x0A\x0D]+?/
text: %json {"type":"object","properties":{"steps":{"type":"array","items":{"type":"string"}},"final_answer":{"type":"string"}},"required":["steps","final_answer"]}
eos: ( <[248044]> | <[248046]> )
```
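
The thinking fallback described above (rewriting `<[token_id]>` references as string literals) can be sketched roughly as follows. This is an illustration under assumptions, not the `apply_thinking_fallback()` in this branch: the function name, the `HashMap` lookup, and the mapping of 248068/248069 to `<think>`/`</think>` are hypothetical, and no escaping of quotes inside the token text is attempted.

```rust
use std::collections::HashMap;

/// Sketch only: replace each `<[id]>` special-token reference in a Lark
/// grammar with a quoted string literal looked up in `token_text`.
/// References with unknown or non-numeric ids are copied through unchanged.
fn thinking_fallback_sketch(grammar: &str, token_text: &HashMap<u32, &str>) -> String {
    let mut out = String::with_capacity(grammar.len());
    let mut rest = grammar;
    while let Some(start) = rest.find("<[") {
        // Copy everything before the token reference verbatim.
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find("]>") {
            Some(end) => {
                let id_str = &after[..end];
                match id_str.parse::<u32>().ok().and_then(|id| token_text.get(&id)) {
                    Some(text) => {
                        out.push('"');
                        out.push_str(text);
                        out.push('"');
                    }
                    // Unknown id: keep the original `<[id]>` text.
                    None => out.push_str(&rest[start..start + 2 + end + 2]),
                }
                rest = &after[end + 2..];
            }
            None => {
                // Unterminated reference: copy the remainder and stop.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

Applied to the `reasoning_block` rule from the examples, the two special-token references would become `"<think>"` and `"</think>"` literals while the `eos` token references, which are not in the map, stay untouched.
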
Remove template preamble when grammar generation is active for
control of the entire token-space emitted. guoqingbao#352 testing observed
some NVFP4 models insisting on producing a BOS regardless of the
template preamble, and grammar-constrained generation prevents them
from doing so, which reduces logprobs for the remainder of the
sequence produced, drastically shortening output.

Create ReasoningEffort::ModelDefault to replace @guoqingbao's way
of preempting a model to reason by way of an open `<think>` tag
predicated on the same CLI parameter state as the original template
-driven implementation.
Revise reasoning logic to limit max tokens at each lexeme and add
a ModelDefault ReasoningLevel to handle the case where no actual
reasoning level is provided but `--disable-reasoning` is not set.
Make the reasoning text between the reasoning tokens optional to
allow opt-out for models not trained to generate in the block.

This allows `RedHatAI/Qwen3.6-35B-A3B-NVFP4` to actually produce
output atop 4xV100 under guided constraints as using:
```lark
start: text eos
text: /(?s:.+?)/
eos: ( <[248044]> | <[248046]> )
```
results in the model emitting an immediate EOS because the first
token it produces without constraints is always `<|im_start|>`
even with the `<think>` tag preamble provided by the template to
prime the model for generating the content of the reasoning block.
When it is only allowed to emit from the standard vocabulary and
EOS it either outputs a nonsense token and then EOS or just goes
straight to EOS.

With the expanded constraint which includes reasoning tokens:
```lark
start: reasoning_block text eos
reasoning_block: <[248068]> "\n" think_text? "\n\n" <[248069]> "\n"
think_text[temperature=0, max_tokens=768]: /(?s:.+?)/
text: /(?s:.+?)/
eos: ( <[248044]> | <[248046]> )
```
the first token emitted is from the added vocabulary which is at
least proximate to the BOS token it's not being allowed to emit.
Tokens subsequent to the added vocabulary one appear to be stable
for up to a few hundred reasoning tokens then start to collapse
and loop until max_tokens of the reasoning block is reached and
another special token emitted (which stabilizes output in normal
text/tool-call phase).

Of note: the larger the input context the more stable this specific
model appears. Moreover when there are no constraints applied, the
BOS tag it emits usually has the model role included (ignoring the
template preamble) but in some cases such as the open `<think>`
tag being "noticed" in generation it outputs absurdities such as
`<|im_start|><think>` without the requisite role definition or
newline in-between.
Set max_tokens value in the default reasoning block via env var or
fall-back to 512 default. Use XINFER_DEFAULT_REASONING_MAX_TOKENS
to adjust at startup.
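
A minimal sketch of the env-var lookup with a 512 fallback described above. The function names are hypothetical, and treating an unparsable or zero value as "use the default" is an assumption; the actual handling in the branch may differ.

```rust
use std::env;

/// Default cap for the reasoning block when the variable is unset.
const DEFAULT_REASONING_MAX_TOKENS: usize = 512;

/// Parse an optional raw value. Assumption: invalid or zero values fall
/// back to the default instead of being reported as an error.
fn parse_reasoning_max_tokens(raw: Option<&str>) -> usize {
    raw.and_then(|v| v.trim().parse::<usize>().ok())
        .filter(|n| *n > 0)
        .unwrap_or(DEFAULT_REASONING_MAX_TOKENS)
}

/// Read XINFER_DEFAULT_REASONING_MAX_TOKENS once at startup.
fn reasoning_max_tokens_from_env() -> usize {
    let raw = env::var("XINFER_DEFAULT_REASONING_MAX_TOKENS").ok();
    parse_reasoning_max_tokens(raw.as_deref())
}
```

Splitting the parsing from the environment read keeps the fallback logic testable without mutating process-wide state.
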
@sempervictus

Contributor Author

Found one regression: it seems the tool grammar is only suitable for Qwen models. Here is GLM4.7 Flash, which works without tool grammar.

When tool grammar is enabled, it reports:

│  Let me use the correct tool invocation syntax:                                          ▲
│                                                                                          │
│  ✓ ⋮ {"name": ─────────────────────────────────────────────────────────────────────      │
│    │ 🛠 {"name":                                                                          │
│    │ ✗ Error: Tool '{"name":' not found                                                  │
│    │                                                                                     █
│    │   [Analyze the error above and try a different approach.]                           │
│    └───────────────────────────────────────────────────────────────────────────────      │
│                                                                                          │
│  I apologize — I'm having a technical issue with the tool invocation. Let me try readin  │
│  g the SKILL.md file directly:                                                           │
│                                                                                          │
│  ✓ ⋮ {"name": ─────────────────────────────────────────────────────────────────────      │
│    │ 🛠 {"name":                                                                          │
│    │ ✗ Error: Tool '{"name":' not found                                                  │
│    │                                                                                     │
│    │   [Analyze the error above and try a different approach.]  

Thank you for checking in on this branch sir. I wrote grammar generators for Gemma4 and MiniMax but haven't run into GLM yet. Agree we need per-model appropriate grammars, for all models. That said... --enforce-parser qwen should address this since it makes the model generate JSON tool-calls and the tool-parser handle that format (you should be able to force any of the valid formats, to include GLM once we add it).

I'll get a hold of a chat template and add that as well. Should i be moving model-specific grammars into the existing model's file or starting a new subtree? Some of the files are getting a bit porcine.

Current state is:

  1. Trying to get ff-tokens shoved into the Sequence from the Runner's sample() call (free sampling-ahead, which currently we're not really set up for in sample() and downstream)
  2. "smooth" the logit mask a bit - instead of -inf maybe something like -1000 bias applied to make them really unlikely but if they are required dont completely remove them
  3. get the masking directly into the sampling kernel since it may benefit from operational fusion and offer a smoother gradient of logit modification relative to the sampling strategy used (or maybe force an argmax when constrained, working that out).

The more i get into the maths of what's going on before and after this the more my head hurts. The actual guidance process is fairly simple but the downstream implications on generation from prior enforcement biasing subsequent token selection should result in all sorts of messed up output... but they dont. We're only really eliminating special tokens for most of the generation (.* only covers common vocab) and our tool grammars are 1:1 or very close to the trained chat template output for those models but i'm kind of surprised we havent run into invisible padding token issues or other weird things like that.

@sempervictus

Contributor Author

BTW the gemma4 generator is for its native "JSON-like" format, and it works. I know you reverted to using JSON tool-parsing at some point and i realigned the grammargen for that change but the code is still there if you want to use the really odd but native syntax they wrote for tool calls.

@sempervictus

Contributor Author
vllm-rs-svc0  | 2026-06-02T23:56:25.637980Z  INFO xinfer::core::engine: Prefilling 1 seq(s) [0]: 1360187 total tokens in 451.52s (3012.48 tokens/s, cache included)
vllm-rs-svc0  | 2026-06-02T23:56:26.253979Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 198
vllm-rs-svc0  | 2026-06-02T23:56:26.253996Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 27
vllm-rs-svc0  | 2026-06-02T23:56:26.253999Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 1688
vllm-rs-svc0  | 2026-06-02T23:56:26.254097Z  INFO xinfer::server::parser: Tool call <tool_call> (151657) found, start buffering!
vllm-rs-svc0  | 2026-06-02T23:56:26.292107Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 3152
vllm-rs-svc0  | 2026-06-02T23:56:26.292118Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 10716
vllm-rs-svc0  | 2026-06-02T23:56:26.292122Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 397
vllm-rs-svc0  | 2026-06-02T23:56:26.292125Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 27
vllm-rs-svc0  | 2026-06-02T23:56:26.292127Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 16181
vllm-rs-svc0  | 2026-06-02T23:56:26.292130Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 81940
vllm-rs-svc0  | 2026-06-02T23:56:26.579813Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 198
vllm-rs-svc0  | 2026-06-02T23:56:26.579828Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 27
vllm-rs-svc0  | 2026-06-02T23:56:26.579831Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 16181
vllm-rs-svc0  | 2026-06-02T23:56:26.579834Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 79194
vllm-rs-svc0  | 2026-06-02T23:56:27.252387Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 522
vllm-rs-svc0  | 2026-06-02T23:56:27.252399Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 1688
vllm-rs-svc0  | 2026-06-02T23:56:27.252402Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 397
vllm-rs-svc0  | 2026-06-02T23:56:27.252405Z  INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 151658

.... 🤞

@guoqingbao

guoqingbao commented Jun 3, 2026

Owner

BTW the gemma4 generator is for its native "JSON-like" format, and it works. I know you reverted to using JSON tool-parsing at some point and i realigned the grammargen for that change but the code is still there if you want to use the really odd but native syntax they wrote for tool calls.

I made some changes to fix bugs that were already addressed in my previous commits, but your latest force push removed those changes. Please base your work on the most recent updates so we can get this PR merged, rather than overwriting or ignoring them.

RageLtMan added 2 commits June 2, 2026 22:40
LLGuidance Matcher state machine can produce fast-forward tokens
from certain sequence positions followed by known token IDs based
on the grammar definition. This permits emitting tokens which must
be correct and avoiding forward passes for them entirely and also
allows comparison of sampled token to the FF state as an additional
validator of sampling correctness relative to mask/FSM. Used token
is committed to the FSM regardless of whether sampled or FF, ensuring
alignment of FSM increment to sequence position.

TODO: currently only ff_tokens[0] is used as a healing mechanism
but the function can return many tokens at once. These cannot be
handled in the context of sample() as they must be appended to the
sequence in the right order during `Scheduler::postprocess()` and
cached state along with seq pos and FSM increment must align with
each-other before the next forward pass.
Draft infrastructure may be useful for handling the associated
complexity when stabilized via EAGLE or other mechanism.
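
The single-token behaviour described above can be sketched like this. The `Fsm` trait and `ForcedPrefix` type are hypothetical stand-ins written for this example and are not the llguidance Matcher API; the sketch only shows the control flow: prefer `ff_tokens[0]` when the grammar forces a token, compare it with the sampled token as a check, and commit whichever token is used to the state machine.

```rust
/// Hypothetical stand-in for the grammar matcher (NOT the llguidance API).
trait Fsm {
    /// Tokens the grammar forces next; empty when the next token is free.
    fn ff_tokens(&self) -> Vec<u32>;
    /// Advance the state machine by exactly one token.
    fn consume(&mut self, token: u32);
}

/// Toy matcher that forces a fixed token sequence, then allows anything.
struct ForcedPrefix {
    forced: Vec<u32>,
    pos: usize,
}

impl Fsm for ForcedPrefix {
    fn ff_tokens(&self) -> Vec<u32> {
        self.forced[self.pos.min(self.forced.len())..].to_vec()
    }
    fn consume(&mut self, _token: u32) {
        self.pos += 1;
    }
}

/// Choose the next token: use ff_tokens[0] if present, else the sampled
/// token. The chosen token is always committed to the FSM so that FSM
/// state and sequence position advance together.
/// Returns (token_used, sampled_token_agreed_with_fsm).
fn next_token(fsm: &mut impl Fsm, sampled: u32) -> (u32, bool) {
    let ff = fsm.ff_tokens();
    let (token, agreed) = match ff.first() {
        Some(&forced) => (forced, forced == sampled),
        None => (sampled, true),
    };
    fsm.consume(token);
    (token, agreed)
}
```

Only one token is consumed per call here, which mirrors the TODO: appending several fast-forward tokens at once would also need the sequence, cached state, and FSM position to be advanced together.
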
The original implementation used hard masking (f32::NEG_INFINITY)
which completely eliminates probability mass from disallowed
tokens. This creates a gradient cliff that can disrupt the
inference loop, especially when grammar constraints appear in
text sections that only allow common vocabulary.

Soft-masking with configurable parameters:

- SoftMaskConfig struct with mask_shift (-1000.0), min_logit
(-1e9), and enabled flag
- Environment variable XINFER_SOFT_MASK_DISABLED controls
soft-masking behavior
- When enabled, masked tokens get
logit = (original_logit - 1000.0).max(-1e9)

Mathematical rationale:

- F32 range: min = -3.4028235e+38, max = 3.4028235e+38
- Softmax probability for shifted logit:
exp(-1000) / sum(exp(logits)) ~ 10^-435
- This is effectively zero for practical purposes while
preserving gradient flow
- The gradient of softmax at -1000 is non-zero (unlike -inf
which has zero gradient)
- min_logit = -1e9 is safe: well above f32::MIN, prevents
underflow to -inf
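
As a rough illustration of the formula above, here is a minimal sketch of soft versus hard masking over a logit slice. The field names follow the commit message; the `apply_mask` function, the `allowed` slice, and the `Default` impl are assumptions for this example and not the code in the branch.

```rust
/// Field names follow the commit message; everything else is a sketch.
#[derive(Clone, Copy, Debug)]
struct SoftMaskConfig {
    mask_shift: f32,
    min_logit: f32,
    enabled: bool,
}

impl Default for SoftMaskConfig {
    fn default() -> Self {
        Self { mask_shift: -1000.0, min_logit: -1e9, enabled: true }
    }
}

/// Apply the grammar mask in place. `allowed[i]` is true when the grammar
/// permits token i. Soft mode shifts disallowed logits and clamps them at
/// `min_logit`; hard mode sets them to negative infinity.
fn apply_mask(logits: &mut [f32], allowed: &[bool], cfg: &SoftMaskConfig) {
    for (logit, &ok) in logits.iter_mut().zip(allowed) {
        if ok {
            continue;
        }
        *logit = if cfg.enabled {
            (*logit + cfg.mask_shift).max(cfg.min_logit)
        } else {
            f32::NEG_INFINITY
        };
    }
}
```

One caveat worth keeping in mind when tuning `mask_shift`: exp(-1000) is far below the smallest positive f32, so a softmax evaluated in single precision still assigns a shifted token a probability of exactly 0.0 unless the allowed logits are themselves extremely low. The practical difference from hard masking in that case is that the logits remain finite.
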
@sempervictus

sempervictus commented Jun 3, 2026

Contributor Author

I made some changes to fix bugs that were already addressed in my previous commits, but your latest force push removed those changes. Please base your work on the most recent updates so we can get this PR merged, rather than overwriting or ignoring them.

Apologies, didn't see the changes - i rebase the local branch off main whenever merge conflicts arise. Will take a look above before the rebases and cherry-pick those commits

Looking through the history in this GH page it doesn't show any commits added to the branch by anyone other than me - am i somehow deleting GH history too when i rebase? :-\ Happen to have a branch from which i can pick those correctly (or just re-add them here and i'll merge everything down to 1 commit)?

I'll push the latest state with softscaling and single-token FF (only a recovery function right now, the multi-token FF requires more work to align sequence states/FSM/cached_tokens/possibly other things) - i also found the FP4 models take better to having their repetition penalties applied before masking at longer (1m+) context but that might be workload specific because the agent re-reads the same files a ton which induces repetition from the chat history ... so curious if you see the same.

@sempervictus

Contributor Author

@guoqingbao if you have a chance before you sign off for the night, could you please push any changes you had to this branch? I didn't realize you were pushing here and rebased locally off of main (see above). I've added PR branches to my git config to have it pull all PRs when i fetch so i should see changes moving forward but unfortunately it looks like i lost yours pretty thoroughly in the rebase.
